This blog is now hosted on WPEngine. I had been having trouble managing it on the Linode VPS for some time. It seemed to cause DB issues for some reason, which would in turn lock up my other sites (WebWait, AjaxPatterns etc). So I had to isolate it on a separate Linode, and decided if I’m doing that, I may as well just go for a dedicated WordPress host. So here we are at WPEngine. I also took the opportunity to cut the clutter and go for a minimalist theme. So thanks Sayanee for this here IceCap theme. Update: Or not. Having some scrolling issues, oddly enough, so reverting to the old theme for now. Update again: Fixed. It was a conflict with the MailChimp Social plugin. Luckily, Social has an option to disable its own comment view, so I can keep both plugins active.
I explained in the previous post that we need to canonicalise Feedburner URLs. Since about a quarter of the feeds are Feedburner, this is worth special-casing when it comes to parsing rules. Lots of different Feedburner URLs end up floating around for what is actually the same feed. This is partly because Feedburner doesn’t redirect funny variants. That actually makes sense, since a lot of RSS clients apparently aren’t smart enough to handle redirects, believe it or not. Though it wouldn’t be altogether insane if they sent the canonical URL in a header. But anyway. It just serves them as happy 200 responses, so the variants keep circulating and I need to canonicalise them to the same URL. Here’s what I’ve learned.
Update: Show me the code
Full path matters
The canonical URL is http://feeds.feedburner.com/
For example, a typical name is http://feeds.feedburner.com/TheNerdpocalypse
Sometimes the path is longer, and you can’t ignore that full path (i.e., /name2). For example, http://feeds.feedburner.com/linuxbasix/mp3 is a podcast feed; http://feeds.feedburner.com/linuxbasix is not. http://feeds.feedburner.com/lifestyle-business-podcast doesn’t exist, but http://feeds.feedburner.com/lifestylelife/LwyM is hours of listening pleasure.
Looking at a big sample, it seems that you’re allowed letters, numbers, dash, and underscore. That’s 38 significant characters, given that …
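Here's a minimal Python sketch of that character check. The allowed set comes only from eyeballing a sample, so treat it as an assumption rather than a documented Feedburner rule. The function name is made up for illustration. Note that a dot isn't in the set, so a path with an .xml or .rss suffix fails the check.

```python
import re

# Assumption: each path segment uses only letters, digits, dash and underscore,
# as observed in a sample of feeds (not a documented Feedburner rule).
SEGMENT = re.compile(r"[A-Za-z0-9_-]+")


def is_plausible_feed_path(path: str) -> bool:
    """Return True if the path has at least one segment and every
    non-empty segment uses only the observed characters."""
    segments = [s for s in path.split("/") if s]
    return bool(segments) and all(SEGMENT.fullmatch(s) for s in segments)


if __name__ == "__main__":
    print(is_plausible_feed_path("/linuxbasix/mp3"))
    print(is_plausible_feed_path("/bad name"))
```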
Case doesn’t matter
http://feeds.feedburner.com/TheNerdpocalypse and http://feeds.feedburner.com/TheNerdPocalypSe and http://feeds.feedburner.com/thenerdpocalypse are all the same thing. For this reason, I will downcase everything to canonicalise the URL, and consider a second “pretty_url” column if we want to display the feed nicely and in line with the publisher’s proclivities.
Domain doesn’t matter
Sometimes I see the http://feeds2.feedburner.com domain. I don’t know why this leaks out past Feedburner’s farm (or at least, why we don’t see hundreds of other domains like that if it does leak out), but it doesn’t matter. http://feeds2.feedburner.com/TheNerdpocalypse is the same as http://feeds.feedburner.com/TheNerdpocalypse. So whenever I see feeds2, it becomes feeds.
Slashes don’t matter
http://feeds.feedburner.com/TheNerdpocalypse/ is the same as http://feeds.feedburner.com/TheNerdpocalypse.
Suffixes don’t matter
I also see the dapper .xml or .rss variant, e.g. http://feeds.feedburner.com/TheNerdpocalypse.xml. REST envy perhaps…but the suffix does nothing useful. Ignore.
CGI doesn’t matter
?format=xml smells like even more REST envy, but afaict it doesn’t do anything either. Honeybadger don’t care.
Bonus factoid: feedproxy will redirect
A URL like http://feedproxy.google.com/TheNerdpocalypse will redirect to http://feeds.feedburner.com/TheNerdpocalypse. And this keeps the full path intact, i.e. if there’s an xml on the end, it will stay on the end after the redirect.
I’ll use a URL parsing library for this. Lowercase the URL, extract the path without CGI parameters, strip the trailing slash. That will about do it.
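Putting the rules above together, a rough Python sketch using the standard library's urllib.parse might look like this. It isn't the code behind the link above. The function and constant names are made up for illustration, and mapping feedproxy.google.com straight to the canonical host is my own shortcut based on the redirect behaviour described earlier.

```python
from urllib.parse import urlsplit

CANONICAL_HOST = "feeds.feedburner.com"
# Hosts treated as equivalent. Including feedproxy is an assumption based on
# the observation that it redirects to feeds.feedburner.com with the path intact.
EQUIVALENT_HOSTS = {
    "feeds.feedburner.com",
    "feeds2.feedburner.com",
    "feedproxy.google.com",
}
IGNORED_SUFFIXES = (".xml", ".rss")


def canonicalise_feedburner(url: str) -> str:
    """Return the canonical form of a Feedburner URL, or the input
    unchanged if it is not on a recognised Feedburner host."""
    parts = urlsplit(url.strip().lower())  # case doesn't matter
    if parts.hostname not in EQUIVALENT_HOSTS:
        return url
    # Use only the path: this drops the query string (?format=xml) and fragment.
    path = parts.path.rstrip("/")  # trailing slashes don't matter
    for suffix in IGNORED_SUFFIXES:  # .xml / .rss suffixes don't matter
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break
    return f"http://{CANONICAL_HOST}{path}"


if __name__ == "__main__":
    print(canonicalise_feedburner(
        "http://feeds2.feedburner.com/TheNerdPocalypSe.xml?format=xml"))
```

The original mixed-case URL would be kept separately (the “pretty_url” idea) if it's needed for display.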
I’m currently building some features to help manage series “aliases” on Player FM. The idea is to canonicalise podcasts, so there’s only one true “TWIT” record in the system, for example. This is important for efficiency – it means the server’s not parsing 12 different variants of the same feed. Moreover, it’s important for user experience. It means we recognise when two users are subscribed to the same feed, so we can show them recommendations, they can share with each other, etc.
So we need aliases.
By aliases, I mean two things. Firstly, something like Wikipedia’s redirects, e.g. a feed changes title from “foo” to “bar”, now /series/foo will redirect to /series/bar. That’s basically a slug alias. Secondly, there are “feed aliases”. This is where a publisher asks me to update the show’s URL, perhaps because they’ve moved host. It may also be where a user or I notice that a feed has changed, e.g. various people have recently been moving away from Feedburner, including 5×5 network, so I noticed their feeds are now marked “obsolete”. So we’d like to alias the old feed to the new one, so anyone in the future who imports their XML containing the old feed will automatically be subscribed to the new one.
The aliases above are all database-driven. For example, a user subscribes to feed1.xml, we look it up in the aliases table and find it’s aliased to feed2.xml, and boom…they’re subscribed to feed2.xml. But a different kind of feed alias is an “automatic” one. By which I mean no DB…just some basic parsing rules. A user subscribes to feedburner.com/AbC, we transform it to feedburner.com/abc, and boom, they’re subscribed to the same “abc” feed everyone else is subscribed to.
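To make the two kinds of feed alias concrete, here's a small Python sketch. A plain dict stands in for the aliases table, and the example.com URLs, function names and the lowercase-only rule are all hypothetical placeholders, not how Player FM actually does it.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for the aliases table: old feed URL -> new feed URL.
FEED_ALIASES = {
    "http://example.com/feed1.xml": "http://example.com/feed2.xml",
}


def automatic_alias(url: str) -> str:
    """Rule-based alias, no DB: lowercase anything on a feedburner.com host."""
    host = urlsplit(url).hostname or ""
    if host.endswith("feedburner.com"):
        return url.lower()
    return url


def resolve_feed(url: str, aliases: dict = FEED_ALIASES) -> str:
    """Apply the automatic rule, then follow database-style aliases.
    The `seen` set guards against alias loops."""
    url = automatic_alias(url)
    seen = set()
    while url in aliases and url not in seen:
        seen.add(url)
        url = automatic_alias(aliases[url])
    return url


if __name__ == "__main__":
    print(resolve_feed("http://example.com/feed1.xml"))
    print(resolve_feed("http://feedburner.com/AbC"))
```

Running the automatic rule first means the table only ever needs to store canonical URLs as keys.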
In the next post, I’ll explain how Feedburner URLs work, since that’s a big part of those automatic translations.