Fixing Errant HTML Content

TiG
@tig
5 years ago
184 posts

We are coming across bad content from various sites. This is HTML that is technically wrong - in particular missing end tags. An ideal PHP solution would be to tap into Tidy functionality but that does not seem to be installed. The existing JR functions such as jrCore_closetags() are only good for small fragments where adding tags to the end is semantically correct.

Our problem deals with the raw content of an arbitrary web-page captured via cURL. We are making extensive use of HTMLPurifier but there seems to be no JR-resident function available to configure HTMLPurifier to deal with unclosed tags. DOMDocument, by the way, parses the incorrect HTML and produces an incorrect DOM; so it does not help here.

My question: is there a way within the Jamroom platform to correctly fix unclosed HTML tags in an arbitrary (and typically large) string of HTML?

Thanks

--
TiG
updated by @tig: 07/29/20 05:09:32PM

@michael
5 years ago
7,806 posts

The jrCore_strip_html() function makes some use of HTML Purifier, and there is a listener in there for 'html_purifier' that allows other modules to adjust the HTML Purifier settings without changing core code. The purifier is run on the HTML that would be stripped, combined with the quota-based settings found at:
ACP -> CORE -> SYSTEM CORE -> QUOTA CONFIG -> Allowed HTML Tags.

Other functions that may be useful if you're building something:
* jrCore_clean_html() may be what you need to close out any open dom tags.

It calls jrCore_balance_html_tags() and jrCore_closetags(), which may be of use in setting something up.

TiG
@tig
5 years ago
184 posts

Michael

I have already investigated that functionality. The HTML tag closing is what I described: it works only if the closing can be accomplished by simply adding closing tags to the end. In arbitrary HTML that is typically not sufficient, because the errant tags are within the document.

We might be unique in the Jamroom community in our use of HTML, but if you folks ever consider installing Tidy functionality we would make use of it.

Thanks!

--
TiG

@michael
5 years ago
7,806 posts

Is this coming in through the editor? I ask because TinyMCE has a tidy option built in, but it is disabled by default because it can cause conflicts with nl2br if the skin is not expecting to get pretty HTML.

My guess is it's not; I suspect you're just scraping and processing, then going straight to the database. Could "Tidy" be built as a module? If so, I'm sure it could be dropped in.

TiG
@tig
5 years ago
184 posts

Michael

You have guessed correctly. We are acquiring content from a URL provided by the user and populating articles (e.g. discussions and group discussions). The user then will do some editing, add their own commentary and publish to the NT site for comments.

Tidy is an installable PHP extension:

 https://www.php.net/manual/en/book.tidy.php

There is likely a reason it was not installed (HTML Purifier makes use of it when it is available). Maybe Brian will weigh in, since I think we are moving into his area of focus.

Thanks

--
TiG

@brian
5 years ago
10,149 posts

Yeah - just use jrCore_clean_html() - it will fix up errant HTML for the most part - i.e.

$corrected_html = jrCore_clean_html($malformed_html);

It uses DOMDocument:

https://www.php.net/DOMDocument

to basically "rewrite" the HTML which ensures everything gets closed.

If you are loading offsite URLs and need to get the HTML, I would highly recommend not using cURL directly and instead using the jrUrlScan_get_url_with_wget() function provided by the Media URL Scanner module - it does a lot of extra work to look like a "real" browser to the remote site. If you are just using cURL you will find a lot of sites that will not load properly and/or will just "hang" - these sites use 3rd party CDNs and accelerators like Cloudflare that see the cURL call as a "bot" and will reject it. The jrUrlScan_get_url_with_wget() function is set up to work like a "real" browser in that it fully accepts cookies and session cookies, uses a real user agent, accepts gzipped content, etc. - all the headers it sends out masquerade as a real browser and you will get the HTML. So I would do this:

if ($possibly_bad_html = jrUrlScan_get_url_with_wget('https://someurl.com')) {
    $cleaned_up_html = jrCore_clean_html($possibly_bad_html);
}

Let me know if that helps.

--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net

TiG
@tig
5 years ago
184 posts

Brian

jrCore_clean_html($malformed_html);

I did try this function before writing my original post and found that its algorithm for closing tags basically closes them at the end. In my case, I have an unclosed div right in the middle of the HTML. The semantic effect of this unclosed div is to cause the main section of the article to be incorrectly enclosed within an aside block. This is how DOMDocument perceives this flaw and it is also how the clean_html resolves the unclosed tag. (My logic already makes heavy use of DOMDocument.) Since I prune off all aside blocks, this malformed HTML causes the article content (in this case) to be tossed along with the asides.
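To make the failure mode concrete, here is a minimal sketch that loads a fragment with DOMDocument and checks whether an article element has ended up nested inside an aside. The helper name article_is_inside_aside() and the sample markup are made up for illustration; they are not Jamroom functions or content from the real page. How the malformed sample is nested depends on the libxml2 build behind DOMDocument, so no expected output is shown for it.

```php
<?php
// Hypothetical helper (not part of Jamroom): reports whether any <article>
// element is a descendant of an <aside> once DOMDocument has parsed the HTML.
function article_is_inside_aside(string $html): bool
{
    $dom = new DOMDocument();

    // Collect libxml warnings (malformed markup, unknown HTML5 tags)
    // internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    return $xpath->query('//aside//article')->length > 0;
}

// Made-up sample: the <div> inside the <aside> is never closed.
$sample = '<html><body><aside><div>Related links</aside>'
        . '<article><p>Main story text</p></article></body></html>';

// If this prints bool(true), pruning every <aside> would also discard the article.
var_dump(article_is_inside_aside($sample));
```

A check like this could run before the aside blocks are pruned, so that a suspicious nesting is flagged rather than silently discarding the article body.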

I can write a routine to fix the HTML as needed, but that seems like reinventing the wheel. Ergo my question about being able to install the Tidy extension to PHP. Tidy claims to have this problem solved.
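For reference, a minimal sketch of what the Tidy route could look like, assuming the extension were available on the server. The wrapper name repair_html_with_tidy() and the option values are illustrative assumptions; tidy_repair_string() is the PHP Tidy extension's own function. Tidy's repair is heuristic too, so whether it places a missing closing tag where the page author intended would still need to be checked against real pages.

```php
<?php
// Hypothetical wrapper (not part of Jamroom): repairs markup with Tidy when
// the extension is loaded; returns null otherwise so the caller can fall
// back to another cleaner such as jrCore_clean_html().
function repair_html_with_tidy(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // Tidy is not installed on this server
    }

    // Illustrative options: emit HTML rather than XHTML, and do not re-wrap lines.
    $config = [
        'output-html' => true,
        'wrap'        => 0,
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    // tidy_repair_string() returns false on failure.
    return is_string($repaired) ? $repaired : null;
}

$result = repair_html_with_tidy('<div><p>Unclosed paragraph<div>Another block</div>');
echo $result === null ? "Tidy not available\n" : $result;
```

Returning null instead of raising an error keeps the calling code usable on servers without the extension.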

On another note, I will absolutely switch to

jrUrlScan_get_url_with_wget()

since I have experienced the ill effects of being viewed as a mere bot. So that promises to help substantially.

Thanks yet again to you and Michael for the great info.

--
TiG

TiG
@tig
5 years ago
184 posts

Brian

I have done extensive testing using

jrUrlScan_get_url_with_wget()

vs. cURL and it seems to be working perfectly. I am now able to access content from Forbes, et al. rather than dealing with a 403 error. Bloomberg remains inaccessible but 9x% of the sites NT uses seem to be within reach.

Nicely done!

Thanks again,

--
TiG

solved Fixing Errant HTML Content
