Sitemap crawls: never-ending weird URL generation

soaringeagle
@soaringeagle
9 years ago
3,304 posts
I don't have specific URLs to hand, since I restarted after the last crawl.
However, on gallery image pages there were thousands of pages with a repeating /topic/topic/topic/topic/topic/ tacked onto the URL. Why "topic" is in a gallery image URL is beyond me, but this has been a huge hassle.
Other repeating URL sequences (I can't remember which site sections) were /%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/

I have not gotten a clean crawl lately. I know I should have just under 1.4 million pages, so when the count gets closer to 1.5 million I look at the end of the list and find all these weird repeating URLs.
So I have to exclude the repeating sequences and start over.
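
For anyone wanting to filter a crawl list for this kind of URL, here is a minimal Python sketch. It assumes the crawl results are available as plain URL paths; the function name `has_repeating_segments` and the `max_period` parameter are made up for illustration and are not part of any sitemap tool.

```python
def has_repeating_segments(path, max_period=2):
    """Return True if a block of 1..max_period path segments repeats back to back.

    Hypothetical helper: period 1 catches /topic/topic/, period 2 catches
    alternating pairs such as /%3Cbr%20/topic/%3Cbr%20/topic/.
    """
    segments = [s for s in path.split("/") if s]
    for period in range(1, max_period + 1):
        # Compare each block of `period` segments with the block right after it.
        for start in range(len(segments) - 2 * period + 1):
            first = segments[start:start + period]
            second = segments[start + period:start + 2 * period]
            if first == second:
                return True
    return False


if __name__ == "__main__":
    samples = [
        "/gallery/image/123/topic/topic/topic/",
        "/forum/%3Cbr%20/topic/%3Cbr%20/topic/",
        "/forum/some-normal-discussion",
    ]
    for sample in samples:
        print(sample, has_repeating_segments(sample))
```

A legitimate URL that happens to repeat a segment would also be flagged, so the matches are worth a quick look before excluding them.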




--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities

updated by @soaringeagle: 10/29/15 07:35:46PM
michael
@michael
9 years ago
7,715 posts
That sequence, when decoded back to normal HTML, looks like this:
/<br /topic/<br /topic/<br /topic/<br /topic/<br /topic/

Looks to me like there is an unclosed BR inside a loop somewhere.

If it was me, I'd run a search on my code base for '/topic/', find where that BR is missing its closing bracket, blame that for the issue, and see what's needed to fix it.
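
The decoding step can be reproduced with Python's standard library. This is just a sketch of the percent-decoding itself (`%3C` is `<` and `%20` is a space); the variable names are illustrative.

```python
from urllib.parse import unquote

# The repeating sequence as it appeared in the crawl list.
encoded = "/%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/%3Cbr%20/topic/"

# unquote() turns percent-escapes back into the characters they stand for.
decoded = unquote(encoded)
print(decoded)
```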
soaringeagle
@soaringeagle
9 years ago
3,304 posts
By "search of my code base" you mean what, exactly?
The repeating topics are somewhere in the gallery module. It could possibly be in my custom gallery pages, but I don't believe so. I believe it was in the gallery itself, and why there'd be a topic in the URL at all in the gallery confuses me.


michael
@michael
9 years ago
7,715 posts
Sorry. I use PhpStorm for my development. It contains all the code that exists on my server.

So I have the code on my server that runs the site; it connects to the database.

By "search my codebase" I mean use the key command Ctrl+Shift+F (in PhpStorm) to search over all the text contained in the files to locate that pattern.

That should help locate where the br that is missing its closing bracket is.

It should be:
<br>
but it probably is
<br
or
<br /

It's probably in a .tpl file in your skin (unless you're using the TEMPLATE tab for alterations; then it will be harder to locate, because you don't have the site-wide search function that an IDE would give you).
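
If no IDE is available, the same search can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `find_unclosed_br` and `scan_tree` are invented here, the `.tpl` and `.php` suffixes are a guess at which files matter, and the check works line by line, so a tag that is legitimately split across two lines would be flagged as well.

```python
import re
from pathlib import Path

# "<br" that is not followed by a ">" before the next "<" or the end of the line.
UNCLOSED_BR = re.compile(r"<br\b(?![^<>]*>)", re.IGNORECASE)


def find_unclosed_br(text):
    """Return (line_number, stripped_line) pairs where a <br looks unclosed."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), 1)
        if UNCLOSED_BR.search(line)
    ]


def scan_tree(root, suffixes=(".tpl", ".php")):
    """Walk a directory and collect hits per file (suffix list is an assumption)."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            found = find_unclosed_br(path.read_text(errors="replace"))
            if found:
                hits[str(path)] = found
    return hits
```

Calling `scan_tree("/path/to/skins")` (a placeholder path) would return a dictionary of file names mapped to the suspicious lines. Templates edited through the TEMPLATE tab would not be covered, since this only reads files on disk.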
soaringeagle
@soaringeagle
9 years ago
3,304 posts
I use both the template tabs and custom skins.
OK, I'll try to find PhpStorm and try it out.


michael
@michael
9 years ago
7,715 posts
It doesn't need to be PhpStorm; that's just the IDE I like. Any text search function would do. You just need to be able to find where you made that error to be able to fix it.

I think even the normal system search has the ability to limit itself to a specific folder and search within text files.

What I like about PhpStorm is that it keeps itself in sync with the filesystem on your server, so you know you have an exact copy of the files on your server. I find that handy and comforting.
soaringeagle
@soaringeagle
9 years ago
3,304 posts
I tried using it. I got the trial version because the full version was way more than I can afford. I couldn't figure it out, and it gave me memory errors and then crashed.


soaringeagle
@soaringeagle
9 years ago
3,304 posts
I'm using grep through SSH. I had to start over and send the results to an output file, but so far it looks like most of them are being found in cache and in one particular discussion. I'm sure there will be way more once it gets farther through.


soaringeagle
@soaringeagle
9 years ago
3,304 posts
Yikes the file with just the repeating topics is 162 megs, so there's a hell of a lot of those weird URLs
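
A file that size does not have to be opened in an editor. A rough Python sketch like the one below reads it one line at a time and counts which base URLs the junk hangs off. It assumes the URL is the last tab-separated field on each line and that the junk starts at the first `<br`, `%3Cbr` or `/topic/`; both are guesses about the grep output, and the function names are illustrative.

```python
from collections import Counter

# Markers assumed to indicate where the garbage part of the URL begins.
JUNK_MARKERS = ("<br", "%3Cbr", "/topic/")


def base_url(line):
    """Return the URL with everything from the first junk marker cut off."""
    url = line.rstrip("\n").split("\t")[-1]
    cut = len(url)
    for marker in JUNK_MARKERS:
        position = url.find(marker)
        if position != -1:
            cut = min(cut, position)
    return url[:cut].rstrip("/")


def count_bases(lines):
    """Count how many lines share each base URL; works on any iterable of lines."""
    return Counter(base_url(line) for line in lines if line.strip())


# Usage with a real file (streams line by line, so size is not a problem):
# with open("topic.txt", errors="replace") as handle:
#     for url, hits in count_bases(handle).most_common(10):
#         print(hits, url)
```

A short list of base URLs with their hit counts is usually easier to reason about than the raw matches.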


michael
@michael
9 years ago
7,715 posts
The goal is to figure out what code is generating that error and fix it. Memory is also a tool: can you remember changing any templates lately? Take a walk through those changes and see if you mistyped a character or two.
soaringeagle
@soaringeagle
9 years ago
3,304 posts
I could see the br being a mistype, but the repeating topics are not anything I would have done.
The file was taking forever to open, so I'll try again.
A lot of the ones I saw were from cache, though.
I'll try sifting through it today.


soaringeagle
@soaringeagle
9 years ago
3,304 posts
Only in unmodified 404. I'll go through all my modified templates in the template tabs.


soaringeagle
@soaringeagle
9 years ago
3,304 posts
It seems to be only 2 discussions generating it, though I know I saw it in the gallery too,
and each has something about cache attached.
Here's an example; it's poorly formatted:

/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/<br /<br /<br /<br /<br /topic/topic/<br /<br /topic/topic/topic/<br /<br /<br /
/home/greentechnologyf/topic.txt:78594:/home/greentechnologyf/tmp/analog/cache:5900718:1	1	0	0	0	24036136	0	0	24036136	0	0	8108	/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/topic/<br /<br /<br /<br /<br /<br /<br /topic/<br /<br /<br /<br /topic/topic/<br /p=8
/home/greentechnologyf/topic.txt:78595:/home/greentechnologyf/tmp/analog/cache:5900719:1	1	0	0	0	24035918	0	0	24035918	0	0	7484	/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/topic/<br /<br /<br /<br /<br /<br /<br /topic/<br /<br /<br /<br /topic/topic/<br /p=7
/home/greentechnologyf/topic.txt:78596:/home/greentechnologyf/tmp/analog/cache:5900723:1	1	0	0	0	24035758	0	0	24035758	0	0	8756	/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/topic/<br /<br /<br /<br /<br /<br /<br /topic/<br /<br /<br /<br /topic/topic/<br /p=6
/home/greentechnologyf/topic.txt:78597:/home/greentechnologyf/tmp/analog/cache:5900724:1	1	0	0	0	24035620	0	0	24035620	0	0	7895	/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/topic/<br /<br /<br /<br /<br /<br /<br /topic/<br /<br /<br /<br /topic/topic/<br /p=5




updated by @soaringeagle: 09/23/15 10:31:40AM
soaringeagle
@soaringeagle
9 years ago
3,304 posts
No errors in any templates.
I can't find the repeating topic in the gallery, although it was there.
It's a very odd issue.
Here are the entire grep results from searching for /topic/topic/:
zip
topic.zip  •  8.1MB





updated by @soaringeagle: 09/23/15 11:27:52AM
michael
@michael
9 years ago
7,715 posts
Does your sitemap software tell you which URL has that link on it?

There doesn't look to be anything wrong on the destination page:
http://dreadlockssite.com/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads

So I'm wondering if it's on the page that contains the link.
soaringeagle
@soaringeagle
9 years ago
3,304 posts
No, it does not. I know both of those posts lost most of their images when I ran Paul's script that was supposed to remove duplicate images from the gallery but accidentally ran amok and started deleting every image everywhere. It deleted 800,000 images.

There were a lot of quoted posts with images linking back and forth between each other. Maybe the screwup is somewhere in there. I did block them from being crawled in the sitemap crawler, but maybe I should also block them in robots.txt so web crawlers don't get lost too.
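
As a rough illustration of the robots.txt idea, the sketch below checks a candidate rule with Python's `urllib.robotparser` before it goes live. The rule is an assumption, not something taken from the site: it disallows everything underneath one discussion's URL (note the trailing slash) while leaving the discussion page itself crawlable. It would also block any legitimate URLs below that path, such as paginated pages if they live there, so it needs checking against the real URL scheme.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: block everything *below* the discussion URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/
"""

BASE = (
    "http://dreadlockssite.com/dreadlocks-forums/forum/dreadlock-picture-gallery/"
    "19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads"
)


def is_allowed(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    print(is_allowed(BASE))                       # the discussion page itself
    print(is_allowed(BASE + "/topic/topic/"))     # one of the runaway URLs
    print(is_allowed(BASE + "/%3Cbr%20/topic/"))  # the encoded variant
```

The standard-library parser only does simple prefix matching, which is enough for a rule like this one; real crawlers may interpret wildcards differently.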


michael
@michael
9 years ago
7,715 posts
I think it's going to be code that looks something like this:
{foreach $_items as $item}
<a href="http://site.com/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/
<br /
topic/ 
{/foreach}
There should be a "> after the href, but there isn't.
soaringeagle
@soaringeagle
9 years ago
3,304 posts
CONFUSED. Why would there be such code, and where?
I mean, aren't those URLs built dynamically with Smarty codes, so it should affect all forum URLs? Like:
{jamroomurl}/{userid}/{$murl}/{categoryurl}/{itemid}/{topicurl}



michael
@michael
9 years ago
7,715 posts
This is what should be built:
/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads
and this is what is actually being built:
/dreadlocks-forums/forum/dreadlock-picture-gallery/19407/before-and-after-dreadlocks-pictures-comparing-dreads-no-dreads/<br /<br /<br /<br /<br /topic/topic/<br /<br /topic/topic/topic/<br /<br /<br /
I'm pretty sure that's not intentional, so it's a bug.

Whether it's a bug coming from stuff that the-jamroom-network put out or whether it's something you've added in is an important question.

I think it's something that you've added in when doing a template alteration (but I might be wrong), so I'm suggesting ways to locate where that might be.
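
One possible way a URL like that can keep growing, offered as an assumption rather than a diagnosis: if a page contains a relative link such as `topic/`, and the server answers the longer URL with the same page, a crawler resolves the same relative link against an ever longer address on every hop. The sketch below only demonstrates the URL resolution rule with `urllib.parse.urljoin`; the `example.com` address and the function name are made up.

```python
from urllib.parse import urljoin


def follow_relative_link(start_url, href, hops):
    """Resolve the same relative href repeatedly, as a crawler would on each hop."""
    urls = [start_url]
    for _ in range(hops):
        # A relative href is resolved against the URL of the page it was found on.
        urls.append(urljoin(urls[-1], href))
    return urls


chain = follow_relative_link("http://example.com/forum/19407/some-discussion/", "topic/", 3)
for url in chain:
    print(url)
```

If that is what is happening, the crawl never runs out of "new" pages, which would match a list that keeps growing past the expected page count.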
soaringeagle
@soaringeagle
9 years ago
3,304 posts
It's absolutely nowhere in any templates.
I believe it must be either from an early build of the Ning importer, or, when all the images were wiped out, some corrupted link looping started going on,
because it's only in 2 picture-heavy discussions, nowhere else.
But I checked every possible template thoroughly.

I do have insite, which might find the weird links, but it takes forever to run (I never even got 10% of the site checked), and even with only 10% of the site checked, generating the reports took, I think, over a day and came to something like 4-6 gigs.
If I shut off everything but link checking, the reports are still huge.


michael
@michael
9 years ago
7,715 posts
You need to figure out what page that link is being found on, I reckon. Once you know where it's being found, you can see what is there and trace that back to the template it's coming from.
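
A small script can do that check on one page at a time instead of running a full site report. The sketch below takes a page's HTML source as a string and lists the link targets that look suspicious. The marker list and the names `HrefCollector` and `suspect_links` are assumptions made for this example; fetching the page (for instance with `urllib.request.urlopen`) is left out so the snippet stays self-contained.

```python
from html.parser import HTMLParser

# Substrings assumed to identify the runaway links (compared in lowercase).
SUSPECT_MARKERS = ("<br", "%3cbr", "/topic/topic/")


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def suspect_links(html):
    """Return the hrefs in `html` that contain any of the suspect markers."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [
        href for href in collector.hrefs
        if any(marker in href.lower() for marker in SUSPECT_MARKERS)
    ]
```

Running this over the source of the pages in the two affected discussions would show whether the bad link is actually present in the served HTML, and on which page.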
soaringeagle
@soaringeagle
9 years ago
3,304 posts
I checked every template.
I wonder if it's corrupted in cache or something.


michael
@michael
9 years ago
7,715 posts
You need to find a page where that URL comes out in the source code. Without knowing where your sitemap system is finding the link, it's hard to correct it.
soaringeagle
@soaringeagle
9 years ago
3,304 posts
I'd imagine, since the loop seems never-ending, that it's coming from every page in those 2 discussions,
but I also suspect it might be a corruption in the files in the jr cache.


