<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>django</title>
  <link rel="alternate" type="text/html" href="http://www.nekomancer.net/taxonomy/term/10"/>
  <link rel="self" type="application/atom+xml" href="http://www.nekomancer.net/taxonomy/term/10/atom/feed"/>
  <id>http://www.nekomancer.net/taxonomy/term/10/atom/feed</id>
  <updated>2008-03-28T18:46:28-05:00</updated>
  <entry>
    <title>check if your site validates using your google sitemap</title>
    <link rel="alternate" type="text/html" href="http://www.nekomancer.net/blog/archives/validate-using-sitemap" />
    <id>http://www.nekomancer.net/blog/archives/validate-using-sitemap</id>
    <published>2006-11-26T15:27:42-06:00</published>
    <updated>2008-03-28T19:34:54-05:00</updated>
    <author>
      <name>gabor</name>
    </author>
    <category term="django" />
    <category term="42" />
    <category term="computers" />
    <summary type="html"><![CDATA[<p>i was playing with the google sitemap (means i was activating the google sitemap support in <a href="http://www.djangoproject.com">django</a> :)</p>

<p>and then i had an idea&#8230; this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.</p>

<p>but this requires getting some kind of machine-readable output from the w3 validator.</p>

<p>as it turns out, there are 2 ways:</p>

<ul>
<li>the normal validating url looks like : http://validator.w3.org/check?uri=[escaped url] .</li>
</ul>
    ]]></summary>
    <content type="html"><![CDATA[<p>i was playing with the google sitemap (means i was activating the google sitemap support in <a href="http://www.djangoproject.com">django</a> :)</p>

<p>and then i had an idea&#8230; this sitemap basically describes all the urls of a site. with this, i could simply check all my urls on the w3 validator. automatically. every night, for example.</p>

<p>but this requires getting some kind of machine-readable output from the w3 validator.</p>

<p>as it turns out, there are 2 ways:</p>

<ul>
<li>the normal validating url looks like : http://validator.w3.org/check?uri=[escaped url] . if you add &#8220;&amp;output=soap12&#8221; to it, then it returns an xml file, which describes the validation results</li>
<li>regardless of the output-method (human-readable html, or machine-readable xml), the validator adds several custom headers to the http response. one of them is &#8220;x-w3c-validator-errors&#8221;, which returns the number of validation errors. if there are no errors, it&#8217;s zero.</li>
</ul>

<p>i decided to use the second approach, mostly because it&#8217;s simpler.</p>

<p>so i wrote a simple python script, that:</p>

<ol>
<li>fetches the sitemap file</li>
<li>submits every url in it to the validator, and extracts the validation-error-count.</li>
</ol>

<p>the script turned out to be very simple:</p>

<div class="geshifilter"><pre class="geshifilter-python"><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">urllib2</span> <span style="color: #ff7700;font-weight:bold;">import</span> urlopen
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">httplib</span> <span style="color: #ff7700;font-weight:bold;">import</span> HTTPConnection
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">urllib</span> <span style="color: #ff7700;font-weight:bold;">import</span> quote
<span style="color: #ff7700;font-weight:bold;">from</span> elementtree <span style="color: #ff7700;font-weight:bold;">import</span> ElementTree
<span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">time</span> <span style="color: #ff7700;font-weight:bold;">import</span> sleep
&nbsp;
SITEMAP_URL = <span style="color: #483d8b;">'http://nekomancer.net/sitemap.xml'</span>
VALIDATOR_URL = <span style="color: #483d8b;">'validator.w3.org'</span>
&nbsp;
LOC_LOCATION = <span style="color: #483d8b;">'{http://www.google.com/schemas/sitemap/0.84}url/{http://www.google.com/schemas/sitemap/0.84}loc'</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> get_urls<span style="color: black;">&#40;</span>sitemap_url<span style="color: black;">&#41;</span>:
    c = urlopen<span style="color: black;">&#40;</span>sitemap_url<span style="color: black;">&#41;</span>
    data = c.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    c.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    tree = ElementTree.<span style="color: black;">fromstring</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: black;">&#91;</span>loc.<span style="color: black;">text</span>.<span style="color: black;">strip</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #ff7700;font-weight:bold;">for</span> loc <span style="color: #ff7700;font-weight:bold;">in</span> tree.<span style="color: black;">findall</span><span style="color: black;">&#40;</span>LOC_LOCATION<span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> get_error_count<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>:
    c = HTTPConnection<span style="color: black;">&#40;</span>VALIDATOR_URL<span style="color: black;">&#41;</span>
    c.<span style="color: black;">request</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'HEAD'</span>,<span style="color: #483d8b;">'/check?uri=%s'</span> <span style="color: #66cc66;">%</span> quote<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
    r = c.<span style="color: black;">getresponse</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    c.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>r.<span style="color: black;">getheader</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'x-w3c-validator-errors'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
urls = get_urls<span style="color: black;">&#40;</span>SITEMAP_URL<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">for</span> url <span style="color: #ff7700;font-weight:bold;">in</span> urls:
    <span style="color: #ff7700;font-weight:bold;">print</span> get_error_count<span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>,url
    sleep<span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span></pre></div>

<p>to parse the sitemap.xml i chose to use ElementTree, but i could also have used regular expressions to match the <code>&lt;loc&gt;URL&lt;/loc&gt;</code> parts.</p>
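
<p>for anyone on python 3, here is a rough sketch of the two pure pieces of the script above: pulling the urls out of the sitemap and building the validator request path. the names extract_urls, validator_path and parse_error_count are made up for this sketch, and escaping the whole url with safe='' is an assumption about what the validator accepts, not something taken from its documentation. the network calls are left out on purpose.</p>

```python
from urllib.parse import quote
from xml.etree import ElementTree

# Namespace used by the sitemap in the original script (Google sitemap 0.84).
SITEMAP_NS = 'http://www.google.com/schemas/sitemap/0.84'


def extract_urls(sitemap_xml, namespace=SITEMAP_NS):
    """Return the stripped text of every url/loc element in a sitemap."""
    root = ElementTree.fromstring(sitemap_xml)
    path = '{%s}url/{%s}loc' % (namespace, namespace)
    return [loc.text.strip() for loc in root.findall(path) if loc.text]


def validator_path(url):
    """Build the request path for the validator (hypothetical helper)."""
    # safe='' also escapes ':' and '/', so the url travels as one query value.
    return '/check?uri=%s' % quote(url, safe='')


def parse_error_count(header_value):
    """Convert the x-w3c-validator-errors header to an int, or None if absent."""
    if header_value is None:
        return None
    return int(header_value)
```

<p>the value of the x-w3c-validator-errors header would then go through parse_error_count, which returns None instead of raising an exception when the header is missing.</p>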
    ]]></content>
  </entry>
  <entry>
    <title>feeds and time</title>
    <link rel="alternate" type="text/html" href="http://www.nekomancer.net/blog/archives/feeds-and-time" />
    <id>http://www.nekomancer.net/blog/archives/feeds-and-time</id>
    <published>2006-11-01T17:50:08-06:00</published>
    <updated>2008-03-28T19:31:30-05:00</updated>
    <author>
      <name>gabor</name>
    </author>
    <category term="django" />
    <category term="computers" />
    <summary type="html"><![CDATA[<p>i use a lot of rss/atom feeds. by use i mean that i am reading many news-sites, blogs, etc. using their feeds.</p>

<p>for some reason those feeds always contain the last <em>n</em> entries, where <em>n</em> is a fixed number.</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>i use a lot of rss/atom feeds. by use i mean that i am reading many news-sites, blogs, etc. using their feeds.</p>

<p>for some reason those feeds always contain the last <em>n</em> entries, where <em>n</em> is a fixed number. so a feed contains let&#8217;s say the last 10 entries.</p>

<p>this is fine as long as the page (which is represented in the feed) does not get updated too often.</p>

<p>otherwise the following scenario might happen:</p>

<ul>
<li>let&#8217;s say the feed contains the last 10 entries</li>
<li>and you fetch the feed once a day using your feed-reader program</li>
<li>now, for some reason, one day there will be 11 new entries</li>
<li>when you fetch the feed next day, only the last 10 entries will be in the feed</li>
<li>so you missed one article on that website</li>
</ul>
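
<p>the scenario above can be replayed in a few lines of python. this is only a toy model: the entry names and the helper functions feed_snapshot and missed_entries are invented for the illustration.</p>

```python
FEED_SIZE = 10  # the feed only ever holds the newest 10 entries


def feed_snapshot(all_entries, size=FEED_SIZE):
    """Return what a fixed-size feed would show: the newest `size` entries."""
    return all_entries[-size:]


def missed_entries(published_since_last_fetch, size=FEED_SIZE):
    """Entries published since the last fetch that already fell out of the feed."""
    visible = set(feed_snapshot(published_since_last_fetch, size))
    return [e for e in published_since_last_fetch if e not in visible]


# 11 new entries arrive in one day, oldest first.
new_today = ['entry-%d' % i for i in range(1, 12)]
print(missed_entries(new_today))  # prints ['entry-1']
```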

<p>the problem is, that these feeds do not provide any guarantees about their &#8220;completeness&#8221;.</p>

<p>(this is not an &#8220;in theory only&#8221; problem. once i had this exact problem. a website&#8217;s feed only contained the last x (50, iirc) entries, and many more arrived daily. so i kept missing them, and if i wanted to be sure that i saw all of them, i had to manually go through the entries on the site. kind of defeats the purpose of the feed. nowadays i&#8217;m using <a href="http://www.bloglines.com">bloglines</a>, which is fetching the feeds a lot more often than i did, so i do not have the problem with that site anymore)</p>

<p>what i would like to see is feeds that contain all the entries for a specified timeframe. for example for the last 48 hours. they can of course contain more entries, but they would have to guarantee that they contain at least all the entries for the last 48 hours. which would mean that if you fetch the feed daily, you will not miss any entries.</p>

<p>for example, let&#8217;s do it in <a href="http://www.djangoproject.com">django</a>:</p>

<p>django contains a complete feed-framework, with documentation, so first go and <a href="http://www.djangoproject.com/documentation/syndication/">read the documentation</a></p>

<p>as you see, the whole issue of &#8220;showing the last n entries&#8221; is being handled in the <a href="http://www.djangoproject.com/documentation/syndication/#feed-classes">items() method</a>:</p>

<p>(the example from the documentation)</p>

<div class="geshifilter"><pre class="geshifilter-python"><span style="color: #ff7700;font-weight:bold;">def</span> items<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
    <span style="color: #ff7700;font-weight:bold;">return</span> NewsItem.<span style="color: black;">objects</span>.<span style="color: black;">order_by</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'-pub_date'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span>:<span style="color: #ff4500;">5</span><span style="color: black;">&#93;</span></pre></div>

<p>we want to return at least 10 entries, and we want to make sure that we return all the entries for the last 48 hours.</p>

<p>so we could use something like this:</p>

<div class="geshifilter"><pre class="geshifilter-python"><span style="color: #ff7700;font-weight:bold;">from</span> <span style="color: #dc143c;">datetime</span> <span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">datetime</span>, timedelta
&nbsp;
<span style="color: #ff7700;font-weight:bold;">def</span> items<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
    entries = NewsItem.<span style="color: black;">objects</span>.<span style="color: black;">order_by</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'-pub_date'</span><span style="color: black;">&#41;</span>
    two_days_ago = <span style="color: #dc143c;">datetime</span>.<span style="color: black;">now</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> - timedelta<span style="color: black;">&#40;</span>days=<span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
    last_entries = entries.<span style="color: #008000;">filter</span><span style="color: black;">&#40;</span>pub_date__gte = two_days_ago<span style="color: black;">&#41;</span>
    <span style="color: #ff7700;font-weight:bold;">if</span> last_entries.<span style="color: black;">count</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span> <span style="color: #66cc66;">&gt;=</span> <span style="color: #ff4500;">10</span>:
        <span style="color: #ff7700;font-weight:bold;">return</span> last_entries
    <span style="color: #ff7700;font-weight:bold;">else</span>:
        <span style="color: #ff7700;font-weight:bold;">return</span> entries<span style="color: black;">&#91;</span>:<span style="color: #ff4500;">10</span><span style="color: black;">&#93;</span></pre></div>

<p>and we&#8217;re done.</p>

<p>now if only i could find out how to do it in wordpress (this blog is running on it)</p>

<p>P.S: the code i showed here was not tested extensively, so it might contain bugs. but the basic idea should work.</p>
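
<p>since the django code was not tested, here is a sketch of the same selection rule in plain python, so it can be checked without a database. it assumes the entries are (pub_date, title) tuples; the function name select_items and its parameters are made up for this sketch and are not part of django.</p>

```python
from datetime import datetime, timedelta


def select_items(entries, now, minimum=10, window=timedelta(days=2)):
    """Pick feed entries from a list of (pub_date, title) tuples.

    Returns every entry inside the time window if there are at least
    `minimum` of them, otherwise the newest `minimum` entries.
    """
    newest_first = sorted(entries, key=lambda e: e[0], reverse=True)
    cutoff = now - window
    recent = [e for e in newest_first if e[0] >= cutoff]
    if len(recent) >= minimum:
        return recent
    return newest_first[:minimum]


now = datetime(2006, 11, 1, 12, 0)
# A busy site: one entry per hour for the last 60 hours.
busy = [(now - timedelta(hours=i), 'e%d' % i) for i in range(60)]
# A quiet site: one entry per day for the last 30 days.
quiet = [(now - timedelta(days=i), 'e%d' % i) for i in range(30)]
print(len(select_items(busy, now)), len(select_items(quiet, now)))
```

<p>the busy site gets every entry from the last 48 hours, while the quiet site falls back to its newest 10.</p>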
    ]]></content>
  </entry>
  <entry>
    <title>Manipulators and raw_id_admin</title>
    <link rel="alternate" type="text/html" href="http://www.nekomancer.net/blog/archives/manipulators-and-raw_id_admin" />
    <id>http://www.nekomancer.net/blog/archives/manipulators-and-raw_id_admin</id>
    <published>2006-09-03T17:38:33-05:00</published>
    <updated>2008-03-28T18:46:28-05:00</updated>
    <author>
      <name>gabor</name>
    </author>
    <category term="django" />
    <category term="computers" />
    <summary type="html"><![CDATA[<p>for all Django users&#8230;</p>

<p>please, please, pretty please use <em>raw&#95;id&#95;admin</em> &#40;<a href="http://www.djangoproject.com/documentation/model_api/">django model docs</a>&#41;.</p>

<p>because, when you have a model that contains a ForeignKey (and people usually have ForeignKeys),
then when you use an automatic manipulator for the given model,
it will load in ALL THE DATA FROM ALL THE RELATED MODELS. for example,
if you have 20.000 entries in the related table, then it will load in all those 20.000 entries.</p>

<p>except, if you use <em>raw_id_admin</em>. it&#8217;s an attribute of the ForeignKey, and <a href="http://www.djangoproject.com/documentation/model_api/">contrary to the documentation</a>, its effect is not restricted to the admin-framework.</p>

<p>Joseph Heck made <a href="http://www.rhonabwy.com/wp/2006/08/31/a-subtle-django-performance-hit/">some speed-tests</a> regarding this issue</p>
    ]]></summary>
    <content type="html"><![CDATA[<p>for all Django users&#8230;</p>

<p>please, please, pretty please use <em>raw&#95;id&#95;admin</em> &#40;<a href="http://www.djangoproject.com/documentation/model_api/">django model docs</a>&#41;.</p>

<p>because, when you have a model that contains a ForeignKey (and people usually have ForeignKeys),
then when you use an automatic manipulator for the given model,
it will load in ALL THE DATA FROM ALL THE RELATED MODELS. for example,
if you have 20.000 entries in the related table, then it will load in all those 20.000 entries.</p>

<p>except, if you use <em>raw_id_admin</em>. it&#8217;s an attribute of the ForeignKey, and <a href="http://www.djangoproject.com/documentation/model_api/">contrary to the documentation</a>, its effect is not restricted to the admin-framework.</p>

<p>Joseph Heck made <a href="http://www.rhonabwy.com/wp/2006/08/31/a-subtle-django-performance-hit/">some speed-tests</a> regarding this issue</p>
    ]]></content>
  </entry>
</feed>
