Fixing Errant HTML Content
Jamroom Developers
Brian
jrCore_clean_html($malformed_html);
I did try this function before writing my original post, and found that it handles an unclosed tag by simply adding the closing tag at the end. In my case, the unclosed div sits right in the middle of the HTML. The semantic effect is that the main section of the article ends up incorrectly enclosed within an aside block. That is how DOMDocument interprets the flaw, and it is also how jrCore_clean_html resolves the unclosed tag. (My logic already makes heavy use of DOMDocument.) Since I prune off all aside blocks, this malformed HTML causes the article content (in this case) to be tossed out along with the asides.
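To make the failure easier to spot before pruning, here is a minimal diagnostic sketch. It assumes the main content lives in an article element, which may not match the real markup, and both function names are hypothetical helpers rather than part of Jamroom. It only reports whether the parser nested the article under an aside; it does not repair anything.

```php
<?php
// Hypothetical helper: true if $node sits inside an element named $tag.
function has_ancestor(DOMNode $node, string $tag): bool
{
    for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p instanceof DOMElement && strtolower($p->tagName) === $tag) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: parse $html and report whether the first
// <article> (an assumed tag name) ended up nested under an <aside>.
function article_inside_aside(string $html): bool
{
    // Collect parser warnings internally instead of emitting them;
    // libxml complains about HTML5 tags such as <aside>.
    $prev = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($prev);

    $article = $doc->getElementsByTagName('article')->item(0);
    return $article !== null && has_ancestor($article, 'aside');
}
```

If this returns true for a page, pruning the asides would take the article with them, so that page could be skipped or flagged instead of pruned.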
I can write a routine to fix the HTML as needed, but that seems like reinventing the wheel. Hence my question about whether the Tidy extension for PHP can be installed. Tidy claims to have this problem solved.
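If the extension does become available, the call would look roughly like the sketch below. The wrapper name repair_with_tidy is hypothetical, and the configuration values are only a guess at a reasonable starting point. Whether Tidy closes the div where it was intended, rather than at the end, is something I would still have to verify against the real article.

```php
<?php
// Hypothetical wrapper around the tidy extension.
// Returns the repaired markup, or null if tidy is unavailable or fails.
function repair_with_tidy(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // extension not installed on this server
    }
    $config = [
        'show-body-only' => true, // return only the body content
        'wrap' => 0,              // do not re-wrap long lines
        // Assumption: older Tidy builds need HTML5 block tags declared.
        'new-blocklevel-tags' => 'article, aside, section',
    ];
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return $fixed === false ? null : $fixed;
}
```

The null return lets the caller fall back to jrCore_clean_html when Tidy is not installed.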
On another note, I will absolutely switch to jrUrlScan_get_url_with_wget().
Thanks yet again to you and Michael for the great info.