More RSS Client Optimizations: Preventing Re-Fetch

Background: Has the Feed Changed?

I previously mentioned some work I did to cut down processing and IO on an RSS client. Yesterday, I was able to continue this effort with some more enhancements geared around checking whether the feed has changed. These changes matter not just for my server’s performance, but also for being a good internet citizen and not hammering other people’s machines with gratuitous requests. Note that everything in this article will be basic hygiene for anyone who’s written any kind of high-scale bot, but I’m documenting it here as it was useful learning for me.

Normally, a fetch requires the client to compare the incoming feed against what has been stored. That means a database lookup and a comparison process. It’s read-only, so not hugely expensive, but it does require reading a lot — all items in the feed — and at frequent intervals.
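
To make the cost concrete, here’s a rough sketch of the kind of per-item comparison I mean. fetch_feed and feed_record appear in the real snippets below, but the items collection, the guid/updated_at/title fields, and the build_item/update_item helpers are hypothetical stand-ins for whatever your feed model actually provides:

  # Baseline comparison: walk every incoming item and check it against storage.
  incoming_feed = fetch_feed(feed_record.url)
  incoming_feed.items.each do |incoming_item|
    stored_item = feed_record.items.find { |item| item.guid == incoming_item.guid }
    if stored_item.nil?
      feed_record.items << build_item(incoming_item)     # a brand new item: store it
    elsif stored_item.updated_at < incoming_item.updated_at
      update_item(stored_item, incoming_item)             # an existing item that changed
    end
  end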

All this comparison effort would be unnecessary if we could guarantee the feed hasn’t changed since the last fetch. And of course, most of the time, it won’t have changed. If we’re fetching feeds hourly and a feed changes on average once a week, only about one fetch in 168 will find anything new, so we can theoretically skip the whole comparison the other 167 times, i.e. around 99.4% of the time!

So how can we check if the feed has changed?

Feed Hash

The brute-force way to check whether the feed has changed is to compare the incoming content against what we received last time. We could store each incoming feed in a file, and if the feed we just sucked down is the same as the stored one, we can safely next it (i.e. skip to the next feed).

Storing a jillion feed files is expensive and unnecessary. (Though some people might temporarily store them if they’ve separated the fetching from the comparison, to prevent blockages, which I haven’t done here.) If all we need the files for is a comparison, we can instead store a hash. With a decent hash, the chance of a false positive is extremely low, and in this context the consequences of one would also be trivial.

So the feed now has a new hash field.

  require 'digest/md5'

  incoming_feed = fetch_feed(feed_record.url)
  incoming_hash = Digest::MD5.hexdigest(incoming_feed.body)
  return if incoming_hash == feed_record.hash # Files match, no comparison necessary

  feed_record.title = incoming_feed.title
  feed_record.hash = incoming_hash # Save the new hash for next time
  # ... Keep processing the feed. Compare each item, etc.

HTTP If-Modified-Since

The HTTP protocol provides its own support for this kind of thing, via the If-Modified-Since request header. So we should send this header, and we can then expect a 304 response in the likely event that nothing has changed. This saves transferring the actual file, as well as bypassing the hash check above. (However, since this is by no means supported everywhere, we still need the hash check as an extra precaution.)

  require 'net/http'
  require 'time' # for Time#rfc2822

  uri = URI(feed_record.url)
  req = Net::HTTP::Get.new(uri)
  req.add_field("If-Modified-Since", last_fetched_at.rfc2822) if last_fetched_at
  # ...
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') { |http| http.request(req) }
  return if res.code == '304' # Not modified, so we don't even need to compare hashes

ETag

Another HTTPism is the ETag, a value that, like our hash, should change whenever the feed content changes. So to be extra-sure we’re not re-processing the same feed, and hopefully not even fetching the whole feed, we can save the ETag and include it in each request via the If-None-Match header. It works like If-Modified-Since: if the server is still serving the same ETag, it will respond with an empty 304.

  req.add_field("If-None-Match", etag) if etag
  # ...
  # Again, we return if res.code == '304'
  feed_record.etag = incoming_feed.etag # Save it for next time

For the record, about half of the feeds I’ve tested — mostly from fairly popular sources, many of them commercial — include ETags. And of those, at least some change the ETag unnecessarily often, which renders it useless in those cases (actually worse than useless, since it consumes unnecessary resources). Given that level of support, I’m not actually convinced it adds much value over just using If-Modified-Since, but I’ll leave it in for now. I’m sure administrators of the servers that do support it would prefer it be used.
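
Putting it all together, a single fetch can layer the three checks: conditional GET first, then the hash comparison as a fallback. Here’s a rough sketch along those lines; it reuses the hash, etag, and last_fetched_at fields from the snippets above, but the exact arrangement is just one possibility rather than the precise code running on my server:

  require 'net/http'
  require 'digest/md5'
  require 'time'

  # Conditional GET: send If-Modified-Since and If-None-Match when we have them
  uri = URI(feed_record.url)
  req = Net::HTTP::Get.new(uri)
  req.add_field("If-Modified-Since", feed_record.last_fetched_at.rfc2822) if feed_record.last_fetched_at
  req.add_field("If-None-Match", feed_record.etag) if feed_record.etag
  res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') { |http| http.request(req) }
  return if res.code == '304' # Server says nothing has changed; skip everything

  # Hash check: even a 200 response might be byte-for-byte identical to last time
  incoming_hash = Digest::MD5.hexdigest(res.body)
  return if incoming_hash == feed_record.hash

  # The feed really has changed: remember the new hash and ETag, then do the full comparison
  feed_record.hash = incoming_hash
  feed_record.etag = res['etag'] if res['etag']
  # ... parse res.body and compare each item, as before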

Integrated Google Plus on the Homepage

I’m getting more convinced Plus is the new Twitter, and also the new Posterous. I’ve been posting things on there I previously would have stuck on the Twitter or the Posterous, and so it was time to integrate Plus on my homepage alongside the existing Twitter and Posterous links.

Latest Post

It was pretty easy to integrate my latest Google Plus post (we don’t really have a name for a Plus post yet; a plust?), as I already have a framework in place for showing the last post from an Atom or RSS feed.

First, I found my Plus feed URL thanks to Russell Beattie’s unofficial Plus Atom Feed service:

http://plusfeed.appspot.com/106413090159067280619

Using MagpieRSS, you can easily get the last post.

  define('MAGPIE_CACHE_ON', false);
  require_once('magpierss/rss_fetch.inc');
  $feed = "http://plusfeed.appspot.com/106413090159067280619";
  try {
    $rss = fetch_rss($feed);
    $recent_post = $rss->items[0];
    $title = $recent_post['title'] . " ...";
    $link = "http://mahemoff.com/+"; // or $recent_post['link'] to point at the post itself
    $timeAgo = timeAgo(strtotime($recent_post['updated']));
    // show the post
  } catch (Exception $ex) {
    // log exception
  }

Me

Inside the CSS3-rendered vcard, there’s a link to my Plus profile alongside Twitter etc.:

  <a rel="me" class="url" href="https://plus.google.com/106413090159067280619">plus</a>

/+ … redirect to Plus

Following Tim Bray’s suggestion, I redirected http://mahemoff.com/+ to the plus page. It’s nice to have a memorable URL.

Meta-Search with Ajax

I just discovered a new feed meta-search: TalkDigger (via Data Mining). It’s Ajax search all the way (buzzword overload!).

The site shows how ideal Ajax is for meta-search. Each time you enter a query, the browser fires off multiple queries – one for each engine it searches. That means the results all come back in parallel – no bottlenecks.

Back in the day, MetaCrawler and others were smart enough to start writing out the page straightaway, so users would start seeing some results while others were still pending. The Ajax meta-search improves on the situation by directly morphing the result panels, so the page structure remains fixed even as the results are populated. Each panel gets its own Progress Indicator.

This is an example of Multi-Stage Download – set up a few empty blocks and populate them with separate queries. When I initially created the pattern, it was pure speculation, but TalkDigger now makes the third real example I know of. I recently created a Multi-Stage Download Demo.

Another nice feature of TalkDigger, which fits well with meta-search, is the use of Microlinks: You can click on the results to immediately expand out a summary.

There are some more features I’m hoping to see:

  • The results page definitely needs work – it’s nice seeing a brief summary of all results and having them expandable, but it’s difficult to get an overall feel. An “Expand All” would help, or showing at least one posting for each search engine.
  • The results are broken up by an ad. To me, that’s counter-productive as they look like two separate panels. I think most users will mentally filter out the ad anyway and just see the results as broken into two.
  • Sortable columns (http://ajaxpatterns.org/Query-Report_Table) – so I could sort by engine name or feed count.
  • Unique URLs – these are critical for a search engine. Jon Udell mentioned the issue recently, regarding MSN Virtual Earth, Google Maps, and others’ lack thereof. My Unique URL Demo, based on Mike Stenhouse’s ideas, shows it’s actually fairly straightforward to emulate standard URLs.