Backstory: I started writing it on Thursday night after seeing the whole Reader tweetstorm, and figured it was probably of more general interest, so I submitted it there. The original draft was ~1400 words and I wasn’t sure how seriously they take their guideline of ~800, so I just left it at 1400; turns out they are, in fact, serious. So we edited it down.
For the record (since some people asked), I used Bloglines for as long as I could cope with its downtime, as I always found Google Reader too magic (unpredictable) with its use of Ajax. Eventually Bloglines was down for hours and, IIRC, whole days at a time, so I made the switch to Reader, but could never get into the web app – too much Ajax magic – and instead used Reeder, synced to Reader, when it came along. When I switched to Android as my primary device, I couldn’t find a satisfactory app, so I just used Reeder on the iPad occasionally.
Meanwhile, with podcasts, I preferred the cloud approach of Odeo and Podnova, but both sadly died. I tried podcasts with Reader, but it just wasn’t the right experience, so I mostly used iTunes, and then on Android mixed it up between several apps (DoggCatcher, BeyondPod, PocketCasts, etc…the usual suspects) until eventually creating my own (still in beta). I really had problems with Listen though, so again, I didn’t do the Reader sync.
So bottom line is I did use Reader “somewhat”, but mostly as an API; and it’s no great loss to me like I appreciate it is to others. The responses to this article certainly demonstrate how passionate people are about a product they get to know and love, and use on a daily basis. It’s never easy giving up on muscle memory. The bright side of the equation is exactly what people like about it: RSS and OPML are open, so at least people can move on to Feedly, Newsblur, and so on. And I truly believe this decision ultimately liberates the standard and allows it to thrive among smaller players.
I previously mentioned some work I did to cut down processing and IO on an RSS client. Yesterday, I was able to continue this effort with some more enhancements geared around checking whether the feed has changed. These changes are not just important for my server’s performance, but also for being a good internet citizen and not hammering others’ machines with gratuitous requests. Note: everything in this article will be basic hygiene for anyone who’s written any kind of high-scale bot, but I’m documenting it here as it was useful learning for me.
Normally, a fetch requires the client to compare the incoming feed against what has been stored. This requires a lookup on the database and a comparison process. It’s read-only, so not hugely expensive, but it does require reading a lot – all items in the feed – and at frequent intervals.
All this comparison effort would be unnecessary if we could guarantee the feed hasn’t changed since the last fetch. And of course, most of the time, it won’t have changed. If we’re fetching feeds hourly, and the feed changes on average once a week, then we can theoretically skip the whole comparison for 167 of every 168 fetches – about 99.4% of the time!
So how can we check if the feed has changed?
The brute-force way to check if the feed has changed is to compare the feed content with what we received last time. We could store the incoming feed in a file, and if it’s the same as the one we just sucked down, we can safely skip it.
Storing a jillion feed files is expensive and unnecessary. (Though some people might temporarily store them if they’ve separated the fetching from the comparison, to prevent blockages, which I haven’t done here.) If all we need the files for is a comparison, we can instead store a hash. With a decent hash, the chance of a false positive is extremely low, and the severity in this context is also extremely low.
require 'digest'
incoming_hash = Digest::SHA1.hexdigest(raw_body) # raw_body: the fetched XML string, assumed defined earlier
return if incoming_hash == feed_record.hash # Files match, no comparison necessary
feed_record.title = incoming_feed.title
feed_record.hash = incoming_hash # Save the new hash for next time
# ... Keep processing the feed. Compare each item, etc.
The HTTP protocol provides its own support for this kind of thing, via the If-Modified-Since request header. So we should send this header, and we can then expect a 304 response in the likely event no change has happened. This will save transferring the actual file as well as bypassing the hash check above. (However, since this is by no means supported everywhere, we still need the above check as an extra precaution.)
req.add_field("If-Modified-Since", last_fetched_at.rfc2822) if last_fetched_at
res = Net::HTTP.new(...)
return if res.code=='304' # We don't even need to compare hashes
Another HTTPism is ETag, a value that, like our hash, is guaranteed to change if the feed content changes. So to be extra-sure we’re not re-processing the same feed, and hopefully not even fetching the whole feed, we can save the ETag and include it in each subsequent request via the If-None-Match header. It works like If-Modified-Since: if the server is still serving the same ETag, it will respond with an empty 304.
feed_record.etag = incoming_feed.etag # Save it for next time
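To round out the sketch – variable names here are illustrative, reusing the Net::HTTP setup from above rather than my actual code – the saved ETag goes back out on the next request via the If-None-Match header, and any new value comes from the response headers:

req.add_field("If-None-Match", feed_record.etag) if feed_record.etag
res = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") { |http| http.request(req) }
return if res.code == '304' # Same ETag on the server, nothing to process
feed_record.etag = res['ETag'] if res['ETag'] # Remember it for the next poll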
For the record, about half of the feeds I’ve tested — mostly from fairly popular sources, many of them commercial — include ETags. And of those, at least some change the ETag unnecessarily often, which renders it useless in those cases (actually worse than useless, since it consumes unnecessary resources). Given that level of support, I’m not actually convinced it adds much value over just using If-Modified-Since, but I’ll leave it in for now. I’m sure the managers of those servers which do support it would prefer it be used.
That’s a before-and-after shot of the database server’s CPU! I was watching it slowly creep up, planning to inspect it after some other work, when I started receiving mails from Linode that the virtual server was running at over 102% capacity, then 110, 120, …
Three things made the difference in fixing this:
Feed Item Keys Must Be Unique
The most important thing was to nail down keys, which I noticed from looking at the logs and the oddly cyclic nature of the graph above. I later ran a query to see how many items were being stored for each feed, and sure enough, certain feeds had thousands of items and counting.
The RSS 2.0 spec (as official a spec as there is) says of individual items: “All elements of an item are optional, however at least one of title or description must be present.” What’s missing there is a primary key! Fortunately, most feeds do have a unique <link>, <guid>, or both. But if you’re trying to be robust and handle unanticipated feeds, it gets tricky. There were also some boundary cases involving feeds which had changed their strategy at some point (fortunately, improved it by adding guids) but never updated the old items, so the feed was a hybrid.
The net effect was a gigantic number of items accumulating for certain feeds. Every hour, when the server checked for updates, it decided that yes, these key-less feeds had totally changed, so it pulled all the posts in again and saved new records for them. That’s why you see the hourly cycles in the “before” picture. I still need to go and cleanse the database of those duplicate items.
By taking a step back and looking at what makes the items truly unique, and with the help of Ruby and Rails’ handy collection methods, it was possible to make feed items unique again and smooth out the crawling.
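As a rough sketch of the idea – with illustrative field names rather than my actual schema – the key falls back from guid to link to a digest of the content, and Ruby’s collection methods then make the de-duplication straightforward:

require 'digest'
require 'set'

# Derive a stable key for an incoming item (field names are hypothetical)
def item_key(item)
  item.guid || item.link || Digest::SHA1.hexdigest("#{item.title}#{item.description}")
end

# Drop duplicates within the incoming feed, then skip anything already stored
incoming_items = incoming_feed.items.uniq { |item| item_key(item) }
known_keys     = feed_record.items.map(&:key).to_set
new_items      = incoming_items.reject { |item| known_keys.include?(item_key(item)) }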
Inspecting a handful of anomalous feeds once an hour, due to the problem mentioned above, is not the worst thing in the world. What made the server veer towards FUBAR was a certain query that was being performed on each check, in the absence of indexes. I was able to see the heaviest queries in the Rails log using the grep/sed command posted here yesterday. I added the indexes and those queries went from ~1200ms to ~20ms, with the overall processing time for a feed dropping to about 20% of what it was.
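For reference, adding the missing indexes is just a small migration; the table and column names below are made up for illustration, not my actual schema:

class AddFeedItemIndexes < ActiveRecord::Migration
  def self.up
    add_index :feed_items, :feed_id                # "items for this feed" lookups
    add_index :feed_items, [:feed_id, :item_key]   # per-item existence checks
  end

  def self.down
    remove_index :feed_items, :column => :feed_id
    remove_index :feed_items, :column => [:feed_id, :item_key]
  end
end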
A third issue was keeping the database spinning its wheels all the time. This wasn’t major hourly thrashing like the issue above, but a few feeds were being polled every few minutes.
I got a sniff of this problem when I noticed the same set of feeds would keep appearing when I looked at the logs. After grepping, I realised they were not obeying the rule of waiting an hour to re-check, but were in fact taking their turn to poll the external feed, then jumping right back in line for another go.
This really wasn’t having much performance impact, because these feeds weren’t adding new items with each check (as the item keys were sound). But with more feeds like this, it could have an impact, and more to the point, being polled every few minutes is not good for my bandwidth or the people on the receiving end!
The cause turned out to be some trivial problems with the feed items, which were being blocked by Rails’ validation when the items were saved. Because scheduling info is – for convenience – tied to the items’ records, the scheduling info was being lost too. It would have been overkill to isolate the scheduling info into its own model at this stage, so I switched the validation to a before_save callback which does some cleansing to ensure the format is right.
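The shape of that change looks roughly like this – the model, columns, and cleansing rules are hypothetical, just to show the validation-to-callback switch:

class FeedItem < ActiveRecord::Base
  # Previously a strict validation here rejected the whole record (and lost the scheduling info)
  before_save :cleanse_fields

  private

  def cleanse_fields
    self.title        = title.to_s.strip[0, 255]        # assumed column limit
    self.published_at = Time.now if published_at.nil?   # fall back rather than fail the save
  end
end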
Update: IO Rate
Another issue I still had to fix was the IO rate. You can see it above, not in the spikes – which reflect me making the fixes above – but in the small wave pattern on the left. That’s actually very high in absolute terms, at around 1K blocks per second being transferred between disk and memory. It was due to swap thrashing and required updates to my.cnf: in particular, decreasing key_buffer, and also decreasing max_connections so that (with the key_buffer change) MySQLTuner (https://github.com/rackerhacker/MySQLTuner-perl) was content with the memory required, as well as increasing innodb_buffer_pool_size. I haven’t measured the effect of that last change yet; I need to let it run for a while.
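For the record, the relevant my.cnf section ended up looking something like this – the numbers are purely illustrative, since the right values depend on the box’s RAM (and MySQLTuner’s advice):

[mysqld]
key_buffer              = 16M     # decreased: MyISAM index cache was starving everything else
max_connections         = 50      # decreased: less per-connection memory reserved
innodb_buffer_pool_size = 256M    # increased: give InnoDB more room to avoid swapping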
I’m sure plenty of other optimisations are possible, but the good news is that IO Rate has gone right down to near-zero and swap rate likewise. So no more thrashing.
I’m getting more convinced Plus is the new Twitter, and also the new Posterous. I’ve been posting things on there I previously would have stuck on the Twitter or the Posterous, and so it was time to integrate Plus on my homepage alongside the existing Twitter and Posterous links.
It was pretty easy to integrate my latest Google Plus post (we don’t really have a name for a Plus post yet; a plust?), as I already have a framework in place for showing the last post from an Atom or RSS feed.
I just discovered a new feed meta-search: TalkDigger (via Data Mining).
It’s Ajax search all the way (buzzword overload!).
The site shows how ideal Ajax is for meta-search. Each time you enter a query, the browser fires off multiple queries – one for each engine it searches. That means the results all come back in parallel – no bottlenecks.
Back in the day, MetaCrawler and others were smart enough to start writing out the page straight away, so users would start seeing some results while others were still pending. The Ajax meta-search improves on the situation by directly morphing the result panels, so the page structure remains fixed even as the results are populated. Each panel gets its own Progress Indicator.
This is an example of Multi-Stage Download – set up a few empty blocks and populate them with separate queries. When I initially created the pattern, it was pure speculation, but TalkDigger now makes the third real example I know of. I recently created a Multi-Stage Download Demo.
Another nice feature of TalkDigger, which fits well with meta-search, is the use of Microlinks: You can click on the results to immediately expand out a summary.
There are some more features I’m hoping to see:
The results page definitely needs work – it’s nice seeing a brief summary of all results and having them expandable, but it’s difficult to get an overall feel. An “Expand All” would help, or showing at least one posting for each search engine.
The results are broken up by an ad. To me, that’s counter-productive as they look like two separate panels. I think most users will mentally filter out the ad anyway and just see the results as broken into two.
[Sortable columns](http://ajaxpatterns.org/Query-Report_Table) – so I could sort by engine name or feed count.
Unique URLs – these are critical for a search engine (see the Unique URL Demo). Jon Udell mentioned the issue recently, regarding MSN Virtual Earth, Google Maps, and others’ lack thereof. The demo, based on Mike Stenhouse’s ideas, shows it’s actually fairly straightforward to emulate standard URLs.