more repeating weirdness added to urls

soaringeagle
@soaringeagle
10 years ago
3,304 posts
it seems every time i run a sitemap crawl i find more of these, and i can't finish one because they create never-ending loops of repeating weirdness added to the urls
where it's coming from is confusing: it affects random pages and seems to come from the pagination in the forums
the odd part is the repeating text strings are rather random
here's the latest example

http://www.dreadlockssite.com/dreadlocks-forums/forum/dread-products/69706/best-dreadlocks-shampoo-weekly-giveaway-new-updates/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a
_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_player/a_p



see, it keeps repeating till it reaches the max length for the url
but not only does it keep adding the text string a_player in this case,
each instance of a_player also has a run-on loop of p=#
because no real page ends in a_player, it goes to the directory /best-dreadlocks-shampoo-weekly-giveaway-new-updates/ and opens that page, and even though that topic is only 2 pages long, each a_player i think will paginate indefinitely
so not only do the urls add a_player indefinitely (till max url length), it adds p=1212 to each a_player, which all redirect to page 1
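
Until the source of the bad links is found, a crawler can at least be kept from looping by refusing URLs whose path repeats the same segment over and over. A minimal sketch in Python, assuming a hypothetical helper `has_repeated_segments` and an arbitrary threshold of two repeats; it is not a feature of any particular crawler:

```python
from urllib.parse import urlsplit


def has_repeated_segments(url, max_repeats=2):
    """Return True if any path segment occurs more than max_repeats times in a row."""
    # Split the path into non-empty segments, e.g. ["forum", "a_player", "a_player"].
    segments = [s for s in urlsplit(url).path.split("/") if s]
    run = 1
    for prev, cur in zip(segments, segments[1:]):
        # Extend the current run if the segment repeats, otherwise start a new run.
        run = run + 1 if cur == prev else 1
        if run > max_repeats:
            return True
    return False
```

A crawl queue could call this before adding a URL and drop anything it flags, which stops the chain long before the max url length is reached.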



--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities

updated by @soaringeagle: 01/14/16 11:22:36PM
brian
@brian
10 years ago
10,149 posts
This is an issue with your crawler - I don't think this is a Jamroom issue. The string "a_player" is not found in Jamroom, so your crawler is messing up and generating bad URLs that it then tries to follow.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
10 years ago
3,304 posts
view the source code on the page and you'll see the string in the page's source code


brian
@brian
10 years ago
10,149 posts
soaringeagle:
view source code on the page and you see the string in the pages source code

Then it sounds like your crawler is picking it up and messing it up ... ?


soaringeagle
@soaringeagle
10 years ago
3,304 posts
could be, i'll contact them once again


soaringeagle
@soaringeagle
10 years ago
3,304 posts
response from inspyder
note: it seems to be an issue on both sides. jr leaves "code snippets" behind when the url scanner detects a url and converts it,
then the crawler interprets that as a relative url and adds it onto the url

I noticed that there are some links in the forum that are causing issues. We are looking into adding a feature to ignore processing blocks of content based on class type.

In your case:

http://www.dreadlockssite.com/dreadlocks-forums/forum/dread-products/69706/best-dreadlocks-shampoo-weekly-giveaway-new-updates/a_player/p=2

There is a link "..." following "Sam Newton said:"

The href is:
<a href="a_player" target="_blank">

This is a relative URL, so it is added to the base URL. This leads to page one of the topic, then it crawls to page 2 finding the same link again.

We will try to have a test build ready for you to try later today that will hopefully prevent this by not processing forum post text links.
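
The relative-URL behaviour described in the response can be reproduced with Python's standard `urllib.parse.urljoin`. This is only a sketch: the function name `follow_bad_link` is made up, and the assumption that every resolved URL offers a pagination link ending in `/p=2` is taken from the example URL quoted above, not from the crawler's or Jamroom's code.

```python
from urllib.parse import urljoin

TOPIC = ("http://www.dreadlockssite.com/dreadlocks-forums/forum/dread-products/"
         "69706/best-dreadlocks-shampoo-weekly-giveaway-new-updates/")
BAD_HREF = "a_player"  # the relative href quoted in the response


def follow_bad_link(page_url, rounds):
    """Resolve BAD_HREF against the current page, then step to page 2, `rounds` times."""
    resolved_urls = []
    url = page_url
    for _ in range(rounds):
        # A relative href replaces everything after the last "/" of the page URL.
        resolved = urljoin(url, BAD_HREF)
        resolved_urls.append(resolved)
        # Assumption: the site serves the topic here and links to ".../p=2".
        url = resolved + "/p=2"
    return resolved_urls


for u in follow_bad_link(TOPIC, 3):
    print(u)
```

Each round adds one more `a_player` segment, which matches the ever-growing URLs in the first post.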



updated by @soaringeagle: 10/14/15 08:29:32AM
michael
@michael
10 years ago
7,800 posts
They've found the source in that code:
<a href="a_player" target="_blank">
But they have not indicated which URL that is coming out on. It doesn't look like a default skin thing; is it something you've added?

---
guess:
It looks like this url is what @sam-newton posted:
http://www.youtube.com/watch?v=Ij19UWYHUvM&feature=youtube_gdata_player
That looks like it's being cut off, perhaps by URL Scanner, to
http://www.youtube.com/watch?v=Ij19UWYHUvM&feature=youtube_gdat
and leaving the remainder of that
a_player
being set into its own new URL. I'll check that out. Thanks.
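
A quick way to sanity-check this guess: if the posted link were cut just before its last eight characters, the two pieces would be exactly the truncated URL and the stray a_player. The helper name `split_off_tail` is made up for this check; it says nothing about how URL Scanner actually works.

```python
def split_off_tail(url, tail):
    """Return (head, tail) if url ends with tail, otherwise (url, "")."""
    if tail and url.endswith(tail):
        return url[: -len(tail)], tail
    return url, ""


posted = "http://www.youtube.com/watch?v=Ij19UWYHUvM&feature=youtube_gdata_player"
head, leftover = split_off_tail(posted, "a_player")
print(head)      # the part that stayed a link
print(leftover)  # the part that ended up as its own href
```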
michael
@michael
10 years ago
7,800 posts
The problem looks to be those three dots there (screenshot).

Just need to figure out what is getting them there.
screenshot_dots.png  •  354KB

michael
@michael
10 years ago
7,800 posts
ok, found something irregular:
* When I click on the UPDATE TOPIC for the post with the three dots in it "..." what I see is this:

See the 2 screenshots. What I expected to see when I opened the edit view for that post was the [ quote ] parts; what is there instead seems to be directly input HTML. Not sure how that got there, but that is the root cause of the wrong URL.

On my site, the Media URL Scanner does not convert the urls that are inside the quote.
