i was playing with the google sitemap (meaning i was activating the google sitemap support in django :)
and then i had an idea… this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.
but this requires getting some kind of machine-readable output from the w3 validator.
as it turns out, there are 2 ways:
i decided to use the second approach, reading the error count from the response headers of a HEAD request, mostly because it’s simpler.
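a minimal sketch of the header-based check, written for python 3 rather than the python 2 of the script further down. the host, the /check?uri= path and the x-w3c-validator-errors header are taken from that script; the helper names build_check_path and parse_error_count are made up for illustration, and the sketch does not verify how the validator answers today.

```python
from http.client import HTTPConnection
from urllib.parse import quote

VALIDATOR_HOST = 'validator.w3.org'        # same host as in the script
ERRORS_HEADER = 'x-w3c-validator-errors'   # same header as in the script


def build_check_path(url):
    # hypothetical helper: the request path the script uses, /check?uri=<quoted url>
    return '/check?uri=%s' % quote(url)


def parse_error_count(header_value):
    # hypothetical helper: turn the header value into an int,
    # returning None when the header is missing or not a number
    if header_value is None:
        return None
    try:
        return int(header_value)
    except ValueError:
        return None


def get_error_count(url, host=VALIDATOR_HOST):
    # performs a real HEAD request against the validator
    conn = HTTPConnection(host, timeout=30)
    try:
        conn.request('HEAD', build_check_path(url))
        response = conn.getresponse()
        return parse_error_count(response.getheader(ERRORS_HEADER))
    finally:
        conn.close()
```

splitting the parsing out of the network call means a missing header gives None instead of an exception from int(None), and the two small helpers can be tried without touching the network.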
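so i wrote a simple python script that reads the sitemap, collects every url listed in it, asks the validator for the error count of each url, and prints the count next to the url.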
the script turned out to be very simple:
from urllib2 import urlopen
from httplib import HTTPConnection
from urllib import quote
from elementtree import ElementTree
from time import sleep

SITEMAP_URL = 'http://nekomancer.net/sitemap.xml'
VALIDATOR_URL = 'validator.w3.org'
LOC_LOCATION = '{http://www.google.com/schemas/sitemap/0.84}url/{http://www.google.com/schemas/sitemap/0.84}loc'

def get_urls(sitemap_url):
    c = urlopen(sitemap_url)
    data = c.read()
    c.close()
    tree = ElementTree.fromstring(data)
    return [loc.text.strip() for loc in tree.findall(LOC_LOCATION)]

def get_error_count(url):
    c = HTTPConnection(VALIDATOR_URL)
    c.request('HEAD','/check?uri=%s' % quote(url))
    r = c.getresponse()
    c.close()
    return int(r.getheader('x-w3c-validator-errors'))

urls = get_urls(SITEMAP_URL)
for url in urls:
    print get_error_count(url),url
    sleep(2)
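the script above is python 2 (urllib2, httplib, the standalone elementtree package). as a rough sketch only, the sitemap half could look like this on python 3 with the standard library's xml.etree; the namespace is the one from the script, while extract_urls and the sample sitemap are invented for the example.

```python
from xml.etree import ElementTree

# namespace taken from the script's LOC_LOCATION constant
SITEMAP_NS = 'http://www.google.com/schemas/sitemap/0.84'
LOC_PATH = '{%s}url/{%s}loc' % (SITEMAP_NS, SITEMAP_NS)


def extract_urls(sitemap_xml):
    # hypothetical helper: parse the sitemap text and return every <loc> value
    tree = ElementTree.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in tree.findall(LOC_PATH) if loc.text]


# made-up sample data, only here so the sketch runs without a network
SAMPLE = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://example.com/</loc></url>
  <url><loc> http://example.com/blog/ </loc></url>
</urlset>"""

print(extract_urls(SAMPLE))
```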
to parse the sitemap.xml i chose to use ElementTree, but i could also have used regular expressions to match the <loc>URL</loc> parts.
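for comparison, a sketch of what the regular expression variant might look like (python 3; the pattern, the function name and the sample data are assumptions, not something from the original script). it ignores namespaces entirely and only undoes the basic xml escapes such as &amp;amp;, so it is cruder than the ElementTree version.

```python
import re
from xml.sax.saxutils import unescape

# match the text between <loc> and </loc>, ignoring surrounding whitespace
LOC_RE = re.compile(r'<loc>\s*(.*?)\s*</loc>', re.DOTALL)


def extract_urls_with_regex(sitemap_xml):
    # hypothetical helper: pull out every <loc> value and
    # turn &amp; / &lt; / &gt; back into plain characters
    return [unescape(match) for match in LOC_RE.findall(sitemap_xml)]


# made-up sample data for illustration
REGEX_SAMPLE = """<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url><loc>http://example.com/</loc></url>
  <url><loc> http://example.com/?a=1&amp;b=2 </loc></url>
</urlset>"""

print(extract_urls_with_regex(REGEX_SAMPLE))
```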