<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Gábor's blog</title>
  <link rel="alternate" type="text/html" href="http://www.nekomancer.net/blog/archives/validate-using-sitemap"/>
  <link rel="self" type="application/atom+xml" href="http://www.nekomancer.net/node/145/atom/feed"/>
  <id>http://www.nekomancer.net/node/145/atom/feed</id>
  <updated>2008-03-28T19:34:54-05:00</updated>
  <entry>
    <title>check if your site validates using your google sitemap</title>
    <link rel="alternate" type="text/html" href="http://www.nekomancer.net/blog/archives/validate-using-sitemap" />
    <id>http://www.nekomancer.net/blog/archives/validate-using-sitemap</id>
    <published>2006-11-26T15:27:42-06:00</published>
    <updated>2008-03-28T19:34:54-05:00</updated>
    <author>
      <name>gabor</name>
    </author>
    <category term="django" />
    <category term="42" />
    <category term="computers" />
    <summary type="html"><![CDATA[<p>i was playing with the google sitemap (meaning i was activating the google sitemap support in <a href="http://www.djangoproject.com">django</a>) :)</p>

<p>and then i had an idea&#8230; this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.</p>

<p>but this requires some kind of machine-readable output from the w3 validator.</p>

<p>as it turns out, there are 2 ways:</p>

<ul>
<li>the normal validating url looks like: http://validator.w3.org/check?uri=[escaped url].</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>i was playing with the google sitemap (meaning i was activating the google sitemap support in <a href="http://www.djangoproject.com">django</a>) :)</p>

<p>and then i had an idea&#8230; this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.</p>

<p>but this requires some kind of machine-readable output from the w3 validator.</p>

<p>as it turns out, there are 2 ways:</p>

<ul>
<li>the normal validating url looks like: http://validator.w3.org/check?uri=[escaped url]. if you add &#8220;&amp;output=soap12&#8221; to it, it returns an xml document that describes the validation results.</li>
<li>regardless of the output method (human-readable html or machine-readable xml), the validator adds several custom headers to the http response. one of them is &#8220;x-w3c-validator-errors&#8221;, which holds the number of validation errors. if there are no errors, it&#8217;s zero.</li>
</ul>
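
<p>a minimal python 3 sketch of the two options above, not the script from this post. the helper names <code>build_check_url</code> and <code>error_count_from_headers</code> are made up for illustration, and it assumes the validator still accepts the <code>uri</code> and <code>output</code> parameters and still sends the header described above. nothing here touches the network; it only builds the url and reads a header mapping you already have.</p>

```python
from urllib.parse import urlencode

VALIDATOR_CHECK_URL = "http://validator.w3.org/check"


def build_check_url(page_url, soap=False):
    """Build the validator url (hypothetical helper, not from the post)."""
    # urlencode() takes care of escaping page_url for the query string
    params = {"uri": page_url}
    if soap:
        # option 1: ask for the machine-readable xml report
        params["output"] = "soap12"
    return VALIDATOR_CHECK_URL + "?" + urlencode(params)


def error_count_from_headers(headers):
    """Option 2: read x-w3c-validator-errors; None if the header is missing."""
    # header names are case-insensitive, so normalise before the lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("x-w3c-validator-errors")
    return int(value) if value is not None else None


print(build_check_url("http://example.com/", soap=True))
print(error_count_from_headers({"X-W3C-Validator-Errors": "0"}))
```

<p>returning <code>None</code> for a missing header is a design choice of this sketch: it keeps &#8220;zero errors&#8221; and &#8220;the validator did not answer as expected&#8221; apart.</p>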

<p>i decided to use the second approach, mostly because it&#8217;s simpler.</p>

<p>so i wrote a simple python script that:</p>

<ol>
<li>fetches the sitemap file</li>
<li>submits every url in it to the validator, and extracts the validation-error-count.</li>
</ol>

<p>the script turned out to be very simple:</p>

<div class="geshifilter"><pre class="geshifilter-python"><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">urllib2</span> <span style="color: #ff7700;font-weight:bold;">import</span> urlopen
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">httplib</span> <span style="color: #ff7700;font-weight:bold;">import</span> HTTPConnection
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">urllib</span> <span style="color: #ff7700;font-weight:bold;">import</span> quote
<span style="color: #ff7700;font-weight:bold;">from</span> elementtree <span style="color: #ff7700;font-weight:bold;">import</span> ElementTree
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">time</span> <span style="color: #ff7700;font-weight:bold;">import</span> sleep
&nbsp;
SITEMAP_URL = <span style="color: #483d8b;">'http://nekomancer.net/sitemap.xml'</span>
VALIDATOR_URL = <span style="color: #483d8b;">'validator.w3.org'</span>
&nbsp;
LOC_LOCATION = <span style="color: #483d8b;">'{http://www.google.com/schemas/sitemap/0.84}url/{http://www.google.com/schemas/sitemap/0.84}loc'</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> get_urls<span style="color: black;">&#40;</span>sitemap_url<span style="color: black;">&#41;</span>:
    c = urlopen<span style="color: black;">&#40;</span>sitemap_url<span style="color: black;">&#41;</span>
    data = c.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    c.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    tree = ElementTree.<span style="color: black;">fromstring</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>loc.<span style="color: black;">text</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> loc <span style="color: #ff7700;font-weight:bold;">in</span> tree.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>LOC_LOCATION<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> get_error_count<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>:
    c = HTTPConnection<span style="color: black;">&#40;</span>VALIDATOR_URL<span style="color: black;">&#41;</span>
    c.<span style="color: black;">request</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'HEAD'</span>,<span style="color: #483d8b;">'/check?uri=%s'</span> <span style="color: #66cc66;">%</span> quote<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    r = c.<span style="color: black;">getresponse</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    c.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>r.<span style="color: black;">getheader</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'x-w3c-validator-errors'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
urls = get_urls<span style="color: black;">&#40;</span>SITEMAP_URL<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">for</span> url <span style="color: #ff7700;font-weight:bold;">in</span> urls:
    <span style="color: #ff7700;font-weight:bold;">print</span> get_error_count<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>,url
    sleep<span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span></pre></div>
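
<p>the script above is python 2 (urllib2, httplib, the print statement, the standalone elementtree package). as a rough sketch only, the sitemap-parsing half could look like this on python 3, using the standard library&#8217;s <code>xml.etree.ElementTree</code>. two things differ from the original and are assumptions of this sketch: <code>get_urls</code> takes the already-downloaded xml instead of a url, and it matches <code>loc</code> elements by local name, so it does not depend on the 0.84 namespace.</p>

```python
import xml.etree.ElementTree as ET


def get_urls(sitemap_xml):
    """Return the text of every loc element, ignoring the xml namespace."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # namespaced tags look like "{namespace-uri}loc"; keep the local name
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            urls.append(element.text.strip())
    return urls
```

<p>the xml itself could be fetched with <code>urllib.request.urlopen(SITEMAP_URL).read()</code> and passed in as bytes.</p>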

<p>to parse the sitemap.xml i chose ElementTree, but i could also have used regular expressions to match the <code>&lt;loc&gt;URL&lt;/loc&gt;</code> parts.</p>
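
<p>for comparison, a sketch of what the regular-expression route might look like in python 3. the name <code>get_urls_with_regex</code> is made up for this example, and it assumes the sitemap uses plain, unprefixed <code>loc</code> tags.</p>

```python
import re

# non-greedy match of whatever sits between the loc tags,
# with surrounding whitespace left out of the captured group
LOC_RE = re.compile(r"<loc>\s*(.*?)\s*</loc>", re.DOTALL)


def get_urls_with_regex(sitemap_text):
    """Pull the loc values out of sitemap text without parsing the xml."""
    return LOC_RE.findall(sitemap_text)
```

<p>unlike an xml parser, this does not decode escaped characters, so a url containing <code>&amp;amp;</code> comes back still escaped.</p>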
    ]]></content>
  </entry>
</feed>
