solved: Weird sitemap crawl results after URL Scanner update

soaringeagle
@soaringeagle
7 years ago
3,304 posts
For the last week (it may have been fixed in the latest update; a sitemap crawl takes 4-5 days, so it'll be a while before I know) I was getting a lot of runaway scans and weird URLs like www.mysite.com/someforumpost> http://www.somesitelinkedinpost.com> http://www.anotherlink.com> http://www.3rdlink.com/ etc.

I had to add *> http* to my exclusion list, but now I'm afraid it will also exclude the posts these scanned URLs appear on.
Was this fixed? (I won't be able to tell now that I've excluded them.)
On both sites, though, I had seen visitors on these weird URLs and bots crawling them.



--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities

updated by @soaringeagle: 04/28/17 10:39:08AM
michael
@michael
7 years ago
7,715 posts
You're using some non-Jamroom software to create the sitemap?
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Yes, because I don't believe the JR sitemap creator will list all pages correctly.
Plus it doesn't allow you to set priorities, change frequencies, or custom exclusions.
I use Inspyder Sitemap Creator.
But there was no Inspyder update to explain the change in behavior, only a JR update.
Inspyder's been amazing (like you guys) at fixing bugs and adding custom features; I usually get a beta version within hours when I make a suggestion or report a bug.

There have been no Inspyder upgrades in a fairly long time, so the JR update is certainly the cause.
The latest URL Scanner changelog did say something about URLs being handled oddly in some cases, so maybe it's fixed.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Ugh, just found this too:
1/20/2017 8:06:12 PM - Warning: Request Timeout (try reducing "Number of Crawlers" in Advanced Settings): http://www.freedomswings.org/soaringeagle/youtube/83/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.youtube.com/!http:/www.facebook.com/!http:/www.facebook.com/!http:/www.facebook.com/!http:/www.facebook.com/!http:/www.facebook.com/!http:/www.facebook.com/!http:/www.facebook.com/Majsternmajster

Last week freedomswings had 23,000 URLs in the sitemap;
now it's at 805,000 and still crawling.
It gets stuck in these loops, I think, appending more and more URLs onto the intended URL.

Verified:
I checked, and there was an issue in the YouTube description that included a ! before the http.
I had already excluded /http; now I had to exclude /!http too.
But it definitely loops, tacking the URL on over and over and over.
I will file a report with Inspyder as well.
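
For what it's worth, here's a minimal sketch of how that snowballing could happen if the crawler resolves the malformed "!http:/www.youtube.com/" href as a relative path (standard URL joining rules; illustrative only, not Inspyder's actual code):

    from urllib.parse import urljoin

    page = "http://www.freedomswings.org/soaringeagle/youtube/83/"
    bad_href = "!http:/www.youtube.com/"  # "!" glued on, so "!http" is not a valid scheme

    url = page
    for _ in range(3):
        # with no valid scheme, the href parses as a relative path and gets
        # appended to the current URL; each print shows the URL growing
        url = urljoin(url, bad_href)
        print(url)

If the site serves the same page no matter what trails the /83/, every pass looks like a brand-new URL to the crawler, which would explain the 23,000 to 805,000 blow-up.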


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities

updated by @soaringeagle: 01/22/17 04:59:45PM
michael
@michael
7 years ago
7,715 posts
No problem if you are using Inspyder; it's just good to know, to help understand the problem.

It looks like the URLs it is finding are the ones in the breadcrumbs. Does it perhaps have the ability to exclude sections by class name, so it could skip any links found in the .breadcrumbs div? (Guessing, I haven't used the software.)

I can't see how the Media URL Scanner would be working on the sitemap pages created by the software, as those pages would likely be outside of Jamroom's control, so I wouldn't expect the Media URL Scanner to be affecting the links.

Yes, there were changes to the Media URL Scanner lately, and to the jrMeta module too. A lot of the detail pages got og: tags added to them so they can be shared via Facebook and Twitter. The Media URL Scanner module tries to use those og: tags to build cards for any links.

It should only be doing that on pages inside the Jamroom system, though. I wouldn't expect it to affect any external pages like the ones built with Inspyder.
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Yeah, it's odd behavior and it just started with one of the recent updates.
It seems like it just appends any external links to the end of the internal link it's crawling; that comes up as the same page itself, and it appends the external links on again, thereby getting stuck in a loop crawling the same page over and over, just adding to the URL. (If you click that link, even though it has many URLs appended, it still goes to the original page.)


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
michael
@michael
7 years ago
7,715 posts
Since Jamroom doesn't care about the human-readable part of the URL, adding anything extra to it would probably make Inspyder think it's a different URL, while Jamroom thinks it's the same URL.

e.g.:
http://www.freedomswings.org/soaringeagle/youtube/83/
http://www.freedomswings.org/soaringeagle/youtube/83/no-name
http://www.freedomswings.org/soaringeagle/youtube/83/some-other-name
http://www.freedomswings.org/soaringeagle/youtube/83/this-part-is-just-for-human-users-to-read
http://www.freedomswings.org/soaringeagle/youtube/83/it-has-no-effect-on-what-content-loads
http://www.freedomswings.org/soaringeagle/youtube/83/happy-monday
http://www.freedomswings.org/soaringeagle/youtube/83/top-gun

Those links are all the same piece of content, so if your sitemap creator software adds bits and pieces to the URL expecting to get a different page or a 404, that's not going to happen.
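
To put it concretely, here's a rough sketch of deduplicating on that assumption, i.e. that only the profile/module/ID part selects the content (a hypothetical helper, not actual Jamroom or Inspyder code):

    from urllib.parse import urlparse

    def canonical_key(url):
        # keep only profile/module/id; drop the human-readable slug
        parts = urlparse(url).path.strip("/").split("/")
        return "/".join(parts[:3])

    a = "http://www.freedomswings.org/soaringeagle/youtube/83/no-name"
    b = "http://www.freedomswings.org/soaringeagle/youtube/83/top-gun"
    print(canonical_key(a) == canonical_key(b))  # True -- same content

A crawler comparing full URLs instead of a key like that will count every slug variation as a new page.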

Seems like that area is where the hiccup is occurring.
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Gotcha. Well, the first one I found was after the human-readable part.
I think.
It was after a username, so it might have been on her profile itself, but I thought it was a blog post.
No, actually, I remember it was in an activity feed or timeline.
I did contact Inspyder, but it might take a couple of days to get a reply.
Then they will likely need to run a crawl and see the results I'm seeing.
It did seem to start with one of the recent JR updates, though, so you could probably try the trial version on a smaller site and see.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Inspyder identified an issue with the YouTube module very often including the last character before URLs in descriptions.
This seems to be the cause of the issue.
For example, if someone put ( http://www.somesite.com/) the URL code is
<a href="(http://www.somesite.com/">
I think spaces are handled OK, but with something like
visit our site!
http://whatever
the ! is included even though it's on a different line. (Unverified, but because there seemed to be a whole lot of them, I can't imagine they were all due to there being no space before the http.)
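
Something like this would reproduce it (purely a guess at the kind of bug, since I haven't seen the module's code):

    import re

    desc = "visit our site!\nhttp://www.somesite.com/"
    flat = desc.replace("\n", "")  # if newlines get stripped before linkifying...

    buggy = re.compile(r'\S?https?://\S+')  # lets one preceding character into the match
    fixed = re.compile(r'https?://\S+')

    print(buggy.search(flat).group())  # !http://www.somesite.com/
    print(fixed.search(flat).group())  # http://www.somesite.com/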


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
derrickhand300
@derrickhand300
7 years ago
1,353 posts
I had the same issue, and I use Inspyder... The sitemap I created after the update had around 800,000 pages; previous to that it had 5,200. I ran it again the following day and the sitemap it created had only 2,700 pages... I haven't looked into it any further though.
soaringeagle
@soaringeagle
7 years ago
3,304 posts
I don't think it would have ever stopped... it seemed to be stuck in an endless loop.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
Reading through this, I don't see how this is a Jamroom issue. Maybe you should open a ticket with the Inspyder developers and have them make sure their software can't get stuck in a loop?


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
I already talked to them, and they did find an issue with at least the URLs causing the loop: they were YouTube descriptions that all seem to include the character before the URL in the description.
Somehow the URLs are not being handled correctly.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
Sounds like they need to get their crawler handling URLs correctly...


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
brian
@brian
7 years ago
10,148 posts
Just re-reading this thread and I see you say:

Quote:
Yes, because I don't believe the JR sitemap creator will list all pages correctly.

What pages do you feel it is not listing? Thanks!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
If you look at the affected pages, the YouTube descriptions have malformed URLs, with the last letter of the text added before the http in the URL.
The crawler is handling what's on the page; the only way to handle it differently is through exclusions.
The problem is that there are malformed URLs on the page; that's what's causing the issue.
And this only began after an update.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
Can you point me to a JR page that has a malformed URL for YouTube?

Thanks!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
brian:
Just re-reading this thread and I see you say:

Quote:
Yes, because I don't believe the JR sitemap creator will list all pages correctly.

What pages do you feel it is not listing? Thanks!

Hard to say, but I remember the last time I used it (it's been a long time) it listed a few thousand pages while Inspyder listed over a million.
I can activate it, run it, and see how many URLs it picks up, since I removed the rewrite condition from .htaccess.
Where are the JR-created sitemaps stored on the server?


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
soaringeagle:
Hard to say, but I remember the last time I used it (it's been a long time) it listed a few thousand pages while Inspyder listed over a million.

But which one is accurate? If you actually DO have a million pages on your site then that's fine, but there's no reason to list them all in your sitemap. The Sitemap file is used by crawlers to identify the starting point for a crawl. There is no need to list ALL PAGES in a sitemap file, since Google will automatically follow any URLs that are found on a page.

So for example - in your sitemap file you only need the URL to a profile. You do not need to list every single page in the profile. Google/Bing/etc. will easily find all those subpages without being "told" they are there.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Actually, that goes completely against what Google says and the purpose of sitemaps.
Sitemaps are meant to list all pages on your site, since Google may not find links, or might get lost in link structures. Example: a site containing a calendar. You add /calendar to the sitemap, and the bot gets lost crawling 30 years, 12 months a year, 30-31 days a month, all empty pages.
Sitemaps guide the bots to where pages are.
They tell the bots: these pages are important; these are worthless, ignore them; these change constantly; this never changes, so only visit it once every couple of years.
The entire reason for sitemaps is to tell the bots THE COMPLETE LINK STRUCTURE. After all, the domain is a starting point from which the bots should find all the links, but that was not the case, so sitemaps were created to tell the bots what all the pages are. From there a bot might find a few new links that were not included in the sitemap.

This was my biggest issue with Ning's sitemap:
it included /profiles /forums /blogs /video /photos,
only the main features, 100-14 URLs for a site that at that point contained 4 million URLs.

PS: at that time Google indexed about 2.5 million URLs.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
soaringeagle:
since Google may not find links, or might get lost in link structures

Unless your URLs are hidden, this just doesn't happen.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Benefits of using an XML sitemap
The first set of benefits revolves around being able to pass extra information to the search engines.

Your sitemap can list all URLs from your site. This could include pages that aren't otherwise discoverable by the search engines.
Giving the search engines priority information. There is an optional tag in the sitemap for the priority of the page. This is an indication of how important a given page is relative to all the others on your site. This allows the search engines to order the crawling of your website based on priority information.
Passing temporal information. Two other optional tags (lastmod and changefreq) pass more information to the search engines that should help them crawl your site in a more optimal way. "lastmod" tells them when a page last changed, and "changefreq" indicates how often the page is likely to change.
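
For reference, a minimal sitemap entry using those optional tags looks like this (the URL and values are made up for illustration; the format is from sitemaps.org):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/some-page/</loc>
        <lastmod>2017-01-20</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>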

Being able to pass extra information to the search engines *should* result in them crawling your site in a more optimal way. Google itself points out that the information you pass is treated as hints, though it would appear to benefit both webmasters and the search engines if they were to use this data to crawl the pages of your site according to the pages you think have a high priority. There is a further benefit, which is that you get information back.

Google Webmaster Central gives some useful information when you have a sitemap: for example, a graph of Googlebot activity over the last 90 days. (The article's graph is taken from a friend of ours in our building who offers market research reports.)


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
soaringeagle:
Your sitemap can list all URLs from your site. This could include pages that aren't otherwise discoverable by the search engines.

Again - it's to help search engines find pages that are NOT discoverable - i.e. they are not linked anywhere from your site. I suspect you have very few URLs that fall into this category.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
And another very detailed article on sitemaps and their importance in SEO: https://searchenginewatch.com/sew/how-to/2048706/the-site-map-gateway-optimization


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
You're correct - they help with SEO. But none of it refutes what I have posted. Your line of reasoning is that Googlebot is flawed and can't figure out how to follow a URL. With 20+ years of experience crawling the web, that's not an issue Google is still dealing with :)


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
It's also to guide them through your site, assigning importance to some pages, recommending others be ignored, etc.
But yes, it does happen.
While on Ning I wrote a post on sitemap use and why Ning's sitemaps were useless.
I had about 30 people implement proper sitemaps, and in 2 weeks they saw a 50% increase in pages indexed and traffic.
After a couple of months most were up by 200%.

A proper sitemap including all URLs is the second thing SEO experts check for.
In SEO scoring on automated SEO checks, it ranks up there in importance, almost as high as proper and unique titles.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
You're continuing to post here without reading my follow-ups.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Google can follow a URL, of course, but can Google tell what URL it should be following, and how often?
Example:
Ning had a "share this" page for every page of content on the site.
Without a sitemap, these share pages often ranked higher than the pages they were meant to share.
Google got lost crawling share pages and indexed a million of them.
Changing their priority to 0 and their change frequency to never drastically decreased how often they were crawled, and whether they were indexed or ranked above the actual pages.

And like I said, in the case of a calendar of events, Googlebot can crawl millions of empty, useless pages, looking for events from 1900 to 2525.
Googlebot is doing its job following links, but it doesn't do it efficiently: it gets lost in useless pages and doesn't get to the ones that are important.

Case in point: I have dating.dreadlockssite.com, and the old version had a calendar.
For weeks and weeks I watched the bot activity. Every last bot was constantly lost in the calendar; only once every few days did I see any other page on the site being crawled.
Consequently, it took months before any pages of importance were indexed.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Nope, I was reading and responding to them.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
OK - I won't comment anymore on it, but let's just say we "agree to disagree" :)

My focus is on JR's sitemap module. I'll check it out and see if there are some things we can do to improve it - just note that it's not meant to compete with standalone sitemap builders that give you a hundred knobs to turn. Our goal is to have a module that works just fine for 95% of our users.

Thanks!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net

updated by @brian: 01/27/17 09:36:27AM
soaringeagle
@soaringeagle
7 years ago
3,304 posts
I can get you those malformed URLs, but it will take time.
The easiest way would be to run a crawl without the exclusions and find them that way.
I was looking around randomly but got distracted.
Give me a little time to find a few.
I'll also compare them to what's on YouTube itself to verify the URLs aren't already corrupted when imported.

I'll try to do as much of the debugging as I can before sending you the info.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
7 years ago
3,304 posts
And I appreciate that; your efforts were far and beyond what Ning offered, and I have high praise for it.
It lacked a little of the fine-tuning I got used to;
otherwise it was a great improvement.
I am going to test it again, because an on-server solution is by far superior to crawling over the net.

What I will do is activate it on freedomswings; then I can email you both sets of sitemaps as a comparison.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
soaringeagle
@soaringeagle
7 years ago
3,304 posts
One last thing: what's the .htaccess change I need to make to use the JR sitemap instead of my uploaded one?
I think I just removed xml from the files that use the router.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
Just make sure your DirectoryIndex line is like:

DirectoryIndex index.html index.php sitemap.xml modules/jrCore/router.php

Let me know if that helps.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
That's what I thought, but then I have to delete the uploaded ones in the root, or the root displays the sitemap instead of the index page.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
Correct - if you have an actual sitemap.xml file (Jamroom does not) then that will happen.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
brian
@brian
7 years ago
10,148 posts
You know - that may not even need to be there anymore. I will check it out.


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
brian
@brian
7 years ago
10,148 posts
Yep - not needed. So leave that out. However, if you want Jamroom's sitemap.xml to show, you have to remove any existing sitemap.xml file from your root directory (or the request doesn't make it into JR's router).
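
That's the usual front-controller pattern at work: Apache serves a real file if one exists, and only otherwise hands the request to the router. Roughly like this (illustrative; your actual .htaccess may differ, so check it rather than copying this):

    RewriteEngine On
    # only rewrite when no real file on disk matches the request
    RewriteCond %{REQUEST_FILENAME} !-f
    # everything else is handed to JR's router
    RewriteRule ^ modules/jrCore/router.php [L]

A physical sitemap.xml in the root satisfies the file check, so it wins over JR's generated one.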


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
OK, after looking at the generated JR sitemap I see a few issues (I believe we discussed some things while developing the module).
During development we discussed it adding pages from DB URLs from each module, and adding new pages "on the fly" using the queue whenever pages were added too fast to write to the XML.
I see two files: one contains profiles, one contains modules.
One module listed in the sitemap is Seamless, with this URL: http://www.freedomswings.org/seamless
and page content that obviously doesn't need to be crawled.

Instead, the sitemap module should harvest URLs that use the Seamless lists, like http://www.freedomswings.org/soaring-videos-by-category
Additionally, every single page has priority 1.

Now, does the sitemap module know that if you get over 50,000 profiles it needs to start a new file? Sitemaps have size and URL limits (50k URLs), but sites with long URLs might need to cap it at 40k so the long URLs don't push it past the size limit.
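
(For reference, the sitemaps.org way to split past those limits is a sitemap index that points at the individual files; the file names here are just illustrative:)

    <?xml version="1.0" encoding="UTF-8"?>
    <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <sitemap><loc>http://www.freedomswings.org/sitemap-profiles-1.xml</loc></sitemap>
      <sitemap><loc>http://www.freedomswings.org/sitemap-profiles-2.xml</loc></sitemap>
    </sitemapindex>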

An updated-on-the-fly sitemap, completely server-side, is a dream come true, especially if it allows management of priorities and change frequencies, at least at a module level and independently on SiteBuilder-created pages.
This is a good effort, but I could suggest some great improvements that would make it worth paying for.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
brian
@brian
7 years ago
10,148 posts
soaringeagle:
Now, does the sitemap module know that if you get over 50,000 profiles it needs to start a new file? Sitemaps have size and URL limits (50k URLs), but sites with long URLs might need to cap it at 40k so the long URLs don't push it past the size limit.

Yes - it adheres to all the sitemap limits and specifications.

Quote:
Instead, the sitemap module should harvest URLs that use the Seamless lists, like http://www.freedomswings.org/soaring-videos-by-category

Right now the sitemap has no idea you created this custom page. What I think needs to be done here is that it needs to be updated to work with SiteBuilder for sure, as well as look for custom skin templates. That should handle most pages - any remaining ones will be found by the crawler if linked to.

Quote:
Additionally, every single page has priority 1.

I'm not sure what to do here. For profiles we can certainly change it based on updated time, but for the rest of the site I'd prefer to NOT make the site owner have to decide. So we need a good default, and daily works. I'll think about it.

Thanks!


--
Brian Johnson
Founder and Lead Developer - Jamroom
https://www.jamroom.net
soaringeagle
@soaringeagle
7 years ago
3,304 posts
Add updating it to your to-do list, and I can offer some things to consider at that time, but make it a lower priority.
The e-commerce FoxyCart module is most desperately in need of an upgrade.


--
soaringeagle
head dreadhead at dreadlocks site
glider pilot student and member/volunteer coordinator with freedoms wings international soaring for people with disabilities
