Google Sitemaps

Google have recently introduced XML-based sitemaps to their range of webmaster tools. It’s an exciting initiative and not before time; webmasters should be hoping the format gets adopted generically by the other search engines too.

So why is it so good?

  • Google doesn’t need to crawl an entire site to find new content, saving your bandwidth and theirs.
  • Google can find new content even if it’s buried deep within the site.
  • You can have a navigation system designed for humans and a sitemap for the bots.

So how did it work before?

  • In the really early days Google relied on the date a page was last modified – that’s a server setting. That was fine when all pages were static HTML, but webmasters learnt how to spoof the date, and dynamic sites reported the date of the main script file, not the date of the data it was presenting.
  • Google would follow each link it found on a page and search through. If it hit stop words, bad HTML, bad links or size limits, then your page wouldn’t be indexed as well as it might have been.
  • Webmasters learnt to build sitemap pages to help both humans and bots, and there were a lot of compromises.

Just what is a Google sitemap?

A Google sitemap is an XML document (or series of documents) which lists every page on your site that you want Google to know about. You generate it, so you only include the pages you want indexed. It doesn’t replace the robots exclusion protocol and the robots.txt document – you still need that.

The XML document lists each page’s full URL, its last modification date, its priority and how frequently it changes. In just a few bytes Google can get an accurate picture of your site and set some priorities.
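
To make that concrete, here’s a minimal sketch of a one-page sitemap, assuming the 0.84 schema Google documents (the URL and values are hypothetical):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
      <url>
        <loc>http://www.mysite.com/forum/index.php</loc>
        <lastmod>2005-06-20</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>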

There is no naming convention for the files yet (that should come), so you need to log in to Google at https://www.google.com/webmasters/sitemaps/ and register your sitemap.
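
Once registered, you can also ping Google whenever the sitemap changes. A sketch in PHP, assuming the ping address from Google’s sitemap documentation (check your account pages for the current URL) and a hypothetical sitemap location:

    <?php
    // Tell Google the sitemap has been updated. The ping endpoint is
    // taken from Google's sitemap documentation; the sitemap URL below
    // is hypothetical.
    $sitemap = urlencode('http://www.mysite.com/sitemap.xml');
    file_get_contents('http://www.google.com/webmasters/sitemaps/ping?sitemap=' . $sitemap);
    ?>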

Will Google rank me higher if I provide a sitemap?

No, it would be unrealistic to expect Google to give preferential treatment to webmasters and site owners who provide sitemaps. What will work in your favour, though, is that Google will have a better idea of your site, will index new content more quickly and will have a better idea of when to return. Every time I log in to check my sitemaps, Google has fetched them within the last 24 hours. That doesn’t necessarily equate to actual indexing, but it does mean your site stands a better chance of more accurate indexing, more often.

How do I get a Google sitemap?

Google have produced their own tool, written in Python. What’s that? Good question. Your web host probably doesn’t have Python installed and isn’t likely to either. Within the PHP world the simple answer is to write a generator yourself, or work from an existing script.
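
If you go the write-it-yourself route, the core of a generator is just a loop over your pages. A minimal sketch, assuming a hypothetical database (the credentials, the pages table and the url and updated columns are all placeholders for your own schema):

    <?php
    // Minimal sitemap generator sketch. Everything database-related
    // here is hypothetical - swap in your own connection and schema.
    $db = mysql_connect('localhost', 'user', 'password');
    mysql_select_db('mysite', $db);

    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">' . "\n";

    $result = mysql_query('SELECT url, updated FROM pages');
    while ($row = mysql_fetch_assoc($result)) {
        $xml .= "  <url>\n";
        $xml .= '    <loc>http://www.mysite.com/' . htmlspecialchars($row['url']) . "</loc>\n";
        $xml .= '    <lastmod>' . date('Y-m-d', strtotime($row['updated'])) . "</lastmod>\n";
        $xml .= "    <changefreq>daily</changefreq>\n";
        $xml .= "    <priority>0.5</priority>\n";
        $xml .= "  </url>\n";
    }
    $xml .= "</urlset>\n";

    // Write the finished sitemap where the bots can fetch it.
    file_put_contents('sitemap.xml', $xml);
    ?>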

One of the reasons webmasters like using the big systems like vBulletin is that there is a strong community working together to provide the tools to make our sites better.

Michael Brandon, a fellow Kiwi, has written a great vBulletin tool which seems to do it all. You can pick the script up from http://forum.time2dine.co.nz/showthread.php?t=3976 and it’s truly plug and play.

I’ve installed it on v3.0.1 and v3.0.7 sites and after some tiny changes (which Michael has incorporated into the script) I ran the script and out popped the sitemap. He’s created a handy-dandy user interface with all the links you need to submit the feed to Google.

He’s taken a conservative line and splits the sitemap files well under Google’s 10 MB limit, all gzipped up and sorted nicely. Private forums are excluded and permissions are respected. I’d intended to write installation instructions, but there’s really no need.
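
If you’re rolling your own and want the same gzip treatment, PHP’s zlib functions make it trivial. A sketch, assuming a finished sitemap.xml sits alongside the script:

    <?php
    // Compress a finished sitemap for faster fetching; 9 is maximum
    // compression. Assumes sitemap.xml already exists in this directory.
    file_put_contents('sitemap.xml.gz', gzencode(file_get_contents('sitemap.xml'), 9));
    ?>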

All that’s left to do is remember to run it regularly – or set up a cron job to run it for you daily.
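
On a typical Linux host, a crontab entry along these lines would handle it; the PHP path and script name are hypothetical, so point them at wherever your generator actually lives:

    # Regenerate the sitemap every day at 3am (paths are hypothetical)
    0 3 * * * /usr/bin/php /home/mysite/public_html/vbsitemap.php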

But I can’t do Cron Jobs!

There are scripts out there to emulate standard cron jobs such as pseudo-cron and PHP Cron – but that’s another article for another day.

What if my site isn’t 100% vBulletin?

Don’t worry, the vBulletin script will allow you to add other information via its vbsitemap-xtrafile.php script, so you can incorporate your entire site.

And what if I miss a page or 100?

Google tell us that they will still do a full index of our sites, and pages which are excluded won’t be penalised.

Other types of sites

  • If you are using a CMS or another standard system, go to its website, look for the forum or community links, and see if someone has already submitted a Google sitemap generator.
  • Or you can use Xenu to generate its own sitemap and then search on Google for a converter. There are a few built around Excel.
  • And finally, sit down and write one which matches your site’s requirements.

And Finally, Ethics

  • Webmasters with generic sites are going to learn how to create enormous sitemaps pretty quickly, full of dinky names like http://www.mysite.com/find/something-about-nothing.html, and then parse the file name, call a search engine for results and pretend the page actually existed. You know the type; you find them all the time. This won’t be news to Google, and I’m confident (fingers crossed) that they have a plan for dealing with it. After all, using this philosophy you could have a handful of scripts, an enormous sitemap and nothing else!
  • Modified dates will remain open to abuse by some webmasters. The script mentioned above uses the database setting for when a post was actually saved. It doesn’t adjust the date when a signature on the page is updated, when a displayed RSS feed refreshes or when a blog entry is edited. That’s good, because those are artificial page fresheners which have been used in the past to fool Googlebot into thinking there was fresh content. Unscrupulous webmasters will find ways to manipulate Google, but don’t think for a minute that the geeks at Google aren’t thinking just as hard.
