
check if your site validates using your google sitemap

i was playing with the google sitemap (that is, i was activating the google sitemap support in django :)

and then i had an idea… this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.

but this requires getting some kind of machine-readable output from the w3 validator.

as it turns out, there are 2 ways:

  • the normal validation url looks like: http://validator.w3.org/check?uri=[escaped url]. if you add “&output=soap12” to it, it returns an xml file that describes the validation results.
  • regardless of the output method (human-readable html or machine-readable xml), the validator adds several custom headers to the http response. one of them is “x-w3c-validator-errors”, which holds the number of validation errors. if there are no errors, it’s zero.
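
for comparison, here is a minimal sketch of what the first approach could look like, in python 3 (the script in this post is python 2). the sample response is hand-written for illustration and is not captured validator output; the namespace and the element names in it are assumptions about the soap12 format. to depend on as little of that as possible, the sketch only looks for an element whose local name is "errorcount", whatever its namespace:

```python
import xml.etree.ElementTree as ET

# hand-written sample, NOT real validator output; the namespace and the
# element names are assumptions about what the soap12 response looks like
SAMPLE_RESPONSE = """<env:Envelope xmlns:env="http://www.w3.org/2003/05/soap-envelope">
  <env:Body>
    <m:markupvalidationresponse xmlns:m="http://www.w3.org/2005/10/markup-validator">
      <m:validity>false</m:validity>
      <m:errors>
        <m:errorcount>3</m:errorcount>
      </m:errors>
    </m:markupvalidationresponse>
  </env:Body>
</env:Envelope>"""


def local_name(tag):
    # ElementTree spells namespaced tags as '{namespace}name'
    return tag.rsplit('}', 1)[-1]


def error_count_from_soap(xml_text):
    """return the first errorcount found in the document, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) == 'errorcount' and element.text:
            return int(element.text.strip())
    return None


print(error_count_from_soap(SAMPLE_RESPONSE))
```

this needs a full GET and an xml parse for every page, which is why reading a single response header is the simpler route.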

i decided to use the second approach, mostly because it’s simpler.

so i wrote a simple python script, that:

  1. fetches the sitemap file
  2. submits every url in it to the validator, and extracts the validation-error-count.

the script turned out to be very simple:

from urllib2 import urlopen
from httplib import HTTPConnection
from urllib import quote
from elementtree import ElementTree
from time import sleep
 
SITEMAP_URL = 'http://nekomancer.net/sitemap.xml'
VALIDATOR_URL = 'validator.w3.org'
 
LOC_LOCATION = '{http://www.google.com/schemas/sitemap/0.84}url/{http://www.google.com/schemas/sitemap/0.84}loc'
 
def get_urls(sitemap_url):
    c = urlopen(sitemap_url)
    data = c.read()
    c.close()
    tree = ElementTree.fromstring(data)
    return [loc.text.strip() for loc in tree.findall(LOC_LOCATION)]
 
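def get_error_count(url):
    # a HEAD request is enough, only the response headers are needed
    c = HTTPConnection(VALIDATOR_URL)
    # safe='' also escapes the slashes, since the url is used as a query value
    c.request('HEAD','/check?uri=%s' % quote(url, safe=''))
    r = c.getresponse()
    c.close()
    errors = r.getheader('x-w3c-validator-errors')
    # guard against a missing header instead of crashing on int(None)
    if errors is None:
        return None
    return int(errors)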
 
urls = get_urls(SITEMAP_URL)
 
for url in urls:
    print get_error_count(url),url
    sleep(2)
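
the script above is python 2 (urllib2, httplib, the print statement). a rough python 3 equivalent might look like the sketch below. the split into parse_sitemap and validator_path, the VALIDATOR_HOST name and the main function are choices made for this sketch, not part of the original script. it also escapes the whole url with safe='' and returns None when the header is missing. whether the validator still answers plain http HEAD requests this way has not been checked here.

```python
from http.client import HTTPConnection
from time import sleep
from urllib.parse import quote
from urllib.request import urlopen
import xml.etree.ElementTree as ET

SITEMAP_URL = 'http://nekomancer.net/sitemap.xml'
VALIDATOR_HOST = 'validator.w3.org'

# same namespace as in the original script
NS = '{http://www.google.com/schemas/sitemap/0.84}'
LOC_PATH = NS + 'url/' + NS + 'loc'


def parse_sitemap(data):
    """return the text of every <url><loc> entry in the sitemap."""
    tree = ET.fromstring(data)
    return [loc.text.strip() for loc in tree.findall(LOC_PATH)]


def validator_path(url):
    # safe='' escapes ':' and '/' too, since the url travels as a query value
    return '/check?uri=' + quote(url, safe='')


def get_urls(sitemap_url):
    with urlopen(sitemap_url) as c:
        return parse_sitemap(c.read())


def get_error_count(url):
    c = HTTPConnection(VALIDATOR_HOST)
    try:
        # HEAD is enough, only the response headers are needed
        c.request('HEAD', validator_path(url))
        errors = c.getresponse().getheader('x-w3c-validator-errors')
    finally:
        c.close()
    # the header may be absent; report None rather than fail on int(None)
    return None if errors is None else int(errors)


def main():
    for url in get_urls(SITEMAP_URL):
        print(get_error_count(url), url)
        sleep(2)  # be polite to the shared validator service
```

calling main() runs the whole check; nothing touches the network until then.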

to parse the sitemap.xml i chose to use ElementTree, but i could also have used regular expressions to match the <loc>URL</loc> parts.
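
a minimal sketch of that regular-expression alternative, in python 3. the function name and the example.org urls are made up for illustration. one difference from a real xml parser: a regex does not decode entities such as &amp;, so the sketch runs the matches through xml.sax.saxutils.unescape.

```python
import re
from xml.sax.saxutils import unescape

# matches the text between <loc> and </loc>, ignoring surrounding whitespace
LOC_RE = re.compile(r'<loc>\s*(.*?)\s*</loc>', re.DOTALL)

# hypothetical sitemap content, for illustration only
SAMPLE_SITEMAP = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://example.org/</loc></url>
  <url><loc>
    http://example.org/search?a=1&amp;b=2
  </loc></url>
</urlset>"""


def get_urls_with_regex(sitemap_text):
    # unescape turns &amp;, &lt; and &gt; back into plain characters
    return [unescape(match) for match in LOC_RE.findall(sitemap_text)]


for url in get_urls_with_regex(SAMPLE_SITEMAP):
    print(url)
```

this avoids the namespace-qualified path that ElementTree needs, but it would also match <loc> text inside comments or cdata sections, so it is only reasonable for a sitemap you generate yourself.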
