Key-based cache expiry: A developer’s primer

Key-based cache expiry is a powerful pattern for efficient and reliable caching. I’ve been using it for some time now, after reading DHH’s original post, and it’s worked well. This post walks through the more conventional approaches to show how key-based caching arises as a fruitful alternative.

So then, how does caching normally work, and what’s so bad about that anyway?

Clear after some time. Sure, you can say “this stock price table expires in 5 minutes” and then re-render it every 5 minutes (or longer if no-one immediately requests it). The problem is, you’re often making a big compromise on both ends. On the one hand, you’ll end up with stale results when the stock price changes during this cache window, e.g. if it changes 2 minutes after you serve it, you’re sending wrong data for another 3 minutes. And on the other hand, what if it only changes once a day? Most of the time you’re needlessly re-retrieving data, re-calculating and re-rendering it. Wasteful.
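In Rails terms, this is the familiar expires_in style of caching. A minimal sketch (the key name, timing, and helper method are illustrative, not from the original post):

Rails.cache.fetch('stock-prices', expires_in: 5.minutes) do
  build_stock_price_table  # hypothetical expensive query + render
end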

Clear manually. Seeing the problems of time-based expiry, you could be tempted to just keep the cache up to date manually. So when you’re notified of a price change, you explicitly update the value in the cache. This will certainly be more efficient, but the logic gets pretty complex as you scale up. You end up with N×M code complexity, as all N bits of code need to be aware of which M cache items could be affected by changes.
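Sticking with the Rails sketch, manual expiry means every change handler has to know about every affected entry (handler and key names are hypothetical):

def on_price_change(symbol, price)
  # Each of the N change handlers like this must know about all M affected entries.
  Rails.cache.write("stock-price-#{symbol}", price)
  Rails.cache.delete('stock-prices')  # the aggregate table must be cleared too
end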

So one technique is easy but inefficient and inaccurate; the other is efficient and accurate, but hard. Let’s see if we can find a sweet spot which is easy AND efficient AND accurate.

Don’t clear. With key-based cache expiry, everything’s put there forever and never cleared. How is that possible? Because it takes advantage of the cache’s built-in automatic expiry mechanism. We must use a cache, such as Memcached or Redis, which supports some kind of expiry based on least-recently-used (LRU) or similar selection. In that sense, we have reached our application’s sweet spot by offloading complexity to the cache framework.

How this works: the keys must reflect a version or timestamp of the object being cached, e.g. a key might be "article-123-201404070123401", generalised as "type-id-timestamp". Normally clients won’t request the object by version, so you’ll need to do a quick lookup to find the object’s latest version or timestamp [1]. Then you retrieve it from the cache, or write through to the cache if it’s not already present. And the important thing is you write to it with infinite expiry.
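Here’s a minimal sketch of that flow in Rails-flavoured Ruby (method names are hypothetical; Rails 4’s cache helpers do essentially this for you):

def cached_article_html(article)
  # The version is baked into the key, so a stale entry is simply never
  # requested again; the cache's LRU eviction reclaims it eventually.
  key = "article-#{article.id}-#{article.updated_at.to_i}"
  Rails.cache.fetch(key) do  # no expires_in, so written without expiry (assuming no default TTL)
    render_article(article)  # hypothetical expensive render
  end
end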

The technique can be used at many levels – HTTP caching, memcached, persistent databases. I first asked about it here and I’ve since used it effectively in production on Player FM’s website and API. It’s certainly how various frameworks handle asset serving (ie compiling CSS with a timestamp and so on), it’s an official part of Rails 4, and I expect other frameworks will follow. So it’s a pattern programmers should be familiar with in an era where performance is held in high esteem, and rightly so.

  1. Looking up the timestamp or version is work you don’t have to do with manual expiry, so it’s again a trade-off that makes this slightly less efficient, but a lot easier. Furthermore, if you arrange things right, you can have clients request the latest version/timestamp for all but the original resource (when they are requesting several resources in succession).


Load-balancing Rails with Nginx

Well this was some fine undocumented black magic. I’ve got Player FM behind a load balancer now, using the following Nginx config. I’ll explain some more about the overall upgrade later.

# App load balancer

upstream playerhost {
  server 192.168.1.1;
  server 192.168.1.2;
}

server {

  server_name playerhost;

  location / {

    proxy_set_header Authorization "Basic blahblahblah==";
    proxy_next_upstream http_500 http_502 http_503 http_504 timeout error;

    # http://stackoverflow.com/questions/16159998/omniauth-nginx-unicorn-callback-to-wrong-host-url
    proxy_set_header Host $http_host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Client-IP $remote_addr;
    proxy_set_header X-Forwarded-For $remote_addr;

    proxy_redirect http://playerhost http://player.fm;
    proxy_redirect https://playerhost https://player.fm;
    proxy_pass http://playerhost;

  }
}

Notes

  • I recommend using a distinct name for the backend (I’ve used “playerhost”). Most tutorials use the nondescript “backend”, but a distinct name is a useful indicator if something’s going wrong, as you’ll see this ID pop up in URLs and HTML content.
  • You don’t have to use basic auth for the backends. You could just firewall them off the public internet, or live with them being public. Being public is not clean and will cause the site to be hit by search bots and so on unnecessarily. But closing them off altogether is not ideal either, because it’s useful for diagnostics to go into the backend servers directly. So I expose them via basic auth. The “blahblahblah” is base64 of your basic auth username:password (see the snippet after these notes).
  • The site mostly worked without the proxy_set_header lines, but I had some weird OAuth redirect problems and occasional HTML problems. In both cases, I’d be seeing that “playerhost” URL instead of the actual domain. This fixed it.
  • The proxy_redirect commands were an earlier attempt to fix https redirections. They worked, but still left the problem mentioned in the previous point. They may not be necessary at all after adding the proxy_set_header lines; I haven’t tested that yet.
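For reference, a quick way to generate that basic auth string (shown in Ruby; the credentials here are placeholders):

require 'base64'

# "Basic " + this value is what goes in the Authorization header.
Base64.strict_encode64('admin:s3cret')  # => "YWRtaW46czNjcmV0"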


Why my Nexus is fantastic and why my next phone won’t be a Nexus

My Nexus 5 is great. It runs pure Android, it’s super-fast with KitKat, the screen is great, and it was great value as a SIM-only purchase. I can be confident it will always be running the latest Android too, which means not just more toys but improved security. And owning a Nexus device is almost a necessity for an Android app publisher, for testing purposes (pure Android is the starting point for all other devices/OSs to deviate from, so a Nexus with stock Android is the least deviation from the sum total of all things Android).

So why will my next phone not be a Nexus?

One word: tethering. Many people claim they can make it through a day without charging their phone. I can too, with the right settings. But not if I want to tether. Tethering drains the battery hard; no surprise, given that it turns your phone into a modem/router. Not that the Nexus battery is bad at all; it’s probably about average for a high-end phone. But forget about lasting a day when tethering.

The thing is, you see these products like “Kindle+3G” and “iPad with data plan”, and think: why bother? I have truly unlimited 4G for ~£20/month (thanks Three) and a phone capable of sharing it with any device I damn please. As well as the Kindle e-reader and tablets, I’m sometimes testing other phones and devices which either aren’t phones (e.g. iPod touch) or are cheapo PAYG phones without a data plan. Sometimes others need to grab a connection, or I need to work on a PC too. All of these things become full-fledged smartphones through the magic of tethering.

Similarly, if I go abroad and get a local SIM, that’s another time I really want to tether. I can bypass silly hotel internet altogether by getting a local SIM and sharing the connection.

Bottom line: I want to tether without having to worry my phone won’t last the morning. So a phone without a replaceable battery doesn’t cut it. I seriously miss being able to carry a battery in my pocket and another in my bag, pretty much guaranteeing there will always be charge. Sure, there are various portable ways to charge on the go; I know them well and use them all the time. But it’s not the same as having portable batteries: Sod’s law ensures you won’t have the charger when you need it.

I only wish the manufacturers would embrace it and provide front-loading slots instead of forcing me to rip off a fragile plastic lid every day. And support hot-swapping (which IIRC Nexus S did, but nothing since).

So my next phone is likely to be a Samsung or HTC, one with a removable battery and plenty of charge. Or a Nexus, if it does indeed come to support battery swapping. But it seems the priority is understandably on keeping the product simple and as cheap as possible. That means a single battery for life.

Linode API – Reference Data

Reference data in Linode API (and things built on top of it like Ansible’s module) isn’t really documented anywhere, so I’m dumping some of it here using the Ruby Linode gem. I assume this stuff is the same for all users (theoretically some plans or data centres might vary, but I doubt it). Of course it will go out of date as Linode adds new options, so feel free to fork it and re-run using the setup code provided there.

It’s easiest to read the raw Gist here.

Ansible and Linode: What I learned about controlling Linodes from Ansible

Background: Learning Ansible

I decided it’s time to bite the bullet and commit to a configuration management platform. Ansible keeps coming up as the present pinnacle. There were a few hurdles before I could get to the point of successfully rebooting a Linode, so I figured I’d save you some of the learning curve.

I’ve not used Puppet or Chef or Salt in production, so I’m not qualified to compare. I can only go by what others say, and apparently Ansible’s main strength stems from the fact it ships “with the kitchen sink” instead of forcing administrators to trawl through GitHub to find the appropriate third-party module, which might then be buggy or out of date. And also because it’s relatively simple, having a “push” model which simply uses ssh to update remote servers instead of running some agent on them to “pull” changes from the config server. It also supports deploying new apps (like Capistrano) as well as the main purpose of configuring servers.

If you’re learning Ansible, I recommend setting up a dedicated VPS to run Ansible and another couple of dummy VPSs to act as the remotes. You could do this cheaply with DigitalOcean for example. And then following the official tutorials.

Installing Linode-Python

The first thing to know is that Ansible ships with a vast array of built-in modules, but not their dependencies. So running an Ansible Linode command (ansible remote-host -m linode somearg=someval), I got a dependency failure: “linode-python required for this module”.

Easy enough fix, right? Well, I went through various steps to install linode-python and verified it worked by opening the Python REPL (ie typing python on the command-line) and entering “import linode”. Worked fine.

But same error with Ansible. “linode-python required for this module”. I thought my PYTHONPATH must be messed up or … who knows? Read on …

Prepare “localhost” host so you can run local actions

Looking more closely at the Linode module doc, I realised this is a local action. That is, it doesn’t actually run on a remote machine. Which makes sense, because we’re trying to run general Linode commands. So it’s like a global command, and Ansible’s way of dealing with that is to treat it as running on localhost.

So how do you run a local command on the command-line (since I was in experimental mode)?

First, open up the inventory file (/etc/ansible/hosts) and add a new group:

[local]
localhost

Second, make sure you can ssh to localhost. Which might need you to do this:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Anyway, just check ssh localhost actually works.

Figure out Linode IDs

At this point, I could use the Ansible Linode module, but I still had a problem changing Linode state. I found the library needed a Linode ID argument, which had to be numeric. This is painful as the Linode web console only shows you the string “nickname” and not its underlying ID.

So I tried to use the API to list IDs, and discovered the Linode Ruby library doesn’t actually expose those IDs. But it turns out the Python Linode library does produce the IDs! I wrote a script to show the ID for each Linode (though see the update below).

UPDATE Feb 2014: Actually this isn’t necessary as Ansible’s dynamic inventory script can do it. Download and run this script to get your list of Linodes: https://github.com/ansible/ansible/blob/devel/library/cloud/linode. It’s probably a good idea to just clone that whole ansible project so you can easily access parts of it.

Bouncy Bounce! Restart the Linode …

Okay, finally I can run this. Another little quirk I found is that the module requires a name argument, even though it should be able to work it out from the ID. (It seems to be a bug, as it should only need the name if it’s creating a new node, but what do I know.)

ansible localhost -m linode -a "linode_id=123456 name=funkyserv state=restarted wait=true api_key=$LINODE_API_KEY"

This now works and responds with a “restarted” status. I’ve verified with the Linode web console the node is actually rebooting.

Appendix: Bonus tips

I found it useful to refer to the source of the Ansible module. It was easy to find using locate linode (on my system, /usr/share/ansible/cloud/linode) and easy enough to follow without having seen a module’s internals before.

I got a similar error with the mysql module (“MySQLdb required for this module”). A bit confusing, as this turned out to have a different cause. In this case, it really was a remote command, meaning the Python module (and Python itself) would have to be installed on the remote server (ie the MySQL server).

Migrating user accounts from Google OpenID to Google OAuth to Google Plus

Background

Over here, Ade has asked me to permalink a comment I made about moving Player FM accounts from Google OpenID to OAuth. I did it because Android sign-in is really based on OAuth; what I didn’t know was that Google was at the time preparing to launch “Sign in with Google Plus”, also based on OAuth. Bottom line: Google OpenID, afaict, is going the way of the dodo, so any services using it probably want to migrate their users to G+. (Googlers, please correct me if I’m wrong about OpenID’s demise.)

There were many permutations to consider for this kind of migration, each with trade-offs in developer complexity and user experience. What follows was the optimum balance for me, after a series of thought experiments, in between wishing I had a pony (ie that I’d opted for OAuth in the first place; the only reason I didn’t was the availability of an OpenID Rails plugin). This is all web-based, as we (thankfully) hadn’t launched on Android yet.

The problem

The first concern here is largely about user perceptions. To us developers, Google OAuth and Google OpenID are distinct login mechanisms, as similar to each other as they are to Facebook or Twitter. But to the user, they’re the same thing – Googles all the way down. So you can’t present the user with separate login buttons for Google OAuth and Google OpenID…you only get to present one Big G button.

The other concern is that sites which present “Existing users – log in here” versus “New users – sign up here” buttons are doing it wrong. A major benefit of third-party sign-in is that you can just present a “Connect with X” button, and the user doesn’t have to care or remember whether they previously connected with X or not. Don’t make me think!

Put concern A together with concern B and, Houston, we have a problem. You present that one big G button backed by Google OAuth, and what happens if the user is unrecognised? Is this a new user, or someone who had previously logged in using OpenID? (It’s fine if the user is recognised; that means they’re one of the post-OAuth-era people.)

The solution

The solution depends on whether you’re willing to ask for email permissions in the new OAuth flow.

If you are willing to ask for email, linking the two accounts is easy, because OpenID already gives you the user’s email. You can just switch right over to OAuth and, once the user authenticates, link the new OAuth identity to the existing OpenID account with the same email. (Note: this scenario is purely speculative and not based on my experience.)

Since I chose not to ask for email, I had to do the uncool thing and temporarily divide Login from Signup.

In the Login area, the app prompted users for their email or login, and immediately made an XHR call to detect whether that account uses o8 or OAuth, then showed the corresponding button (the button is just a Google button and looks the same either way, but the link differs for o8 vs OAuth). (In addition, the Twitter and classic login forms were shown.) A sketch of the idea follows.
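A minimal sketch of the endpoint behind that XHR call (all names here are hypothetical; the real implementation details aren’t in this post):

class LoginTypesController < ApplicationController
  # Given an email or login, report which Google flow this account uses,
  # so the client can render the right link behind the Google button.
  def show
    user = User.where('email = :q OR login = :q', q: params[:q]).first
    provider = user && user.google_openid? ? 'o8' : 'oauth'
    render json: { google: provider }
  end
end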

For people who logged in with OpenID, I built a big red !IMPORTANT! notification, shown when the user logged in via OpenID, telling them we’d updated the Google login procedure; when they clicked to set it up, it took them through the OAuth flow with a special callback for the migration. At this point, the server knows the two accounts are linked (they’d already logged in with OpenID, and now they’ve just logged in with OAuth), so we can save the user’s OAuth credentials. This user now has two third-party accounts, Google OpenID and Google OAuth, just as they might also have a Facebook account and a Twitter account.

The Signup area of course contained only an OAuth button (plus Twitter, which was exactly the same as in the login area, and the classic signup form).

I published advance notice on the blog that I would be shutting down the old Google IDs, kept the migration alive for two months, and then deleted all the OpenID accounts. Anyone who didn’t log in during that window lost their account; of course, a few people mailed me about it (mainly when we launched the Android app and they tried to log in again), so I helped them migrate manually.

Once o8 was deprecated, I returned to the nicer single-Google-button setup, and just manually merged accounts when a few users asked why the Google button wasn’t getting them back to their old account.

Epilogue

It was well worth the pain, as the vast majority of Android users now choose to log in with Google, even though we also support classic login and guest mode. The G+ API was a big bonus to come out of it. I’ve done some experiments on the social side and expect to do much more with G+ accounts in the future, to help people discover new shows to listen to.

Defer and Recur with Rails, Redis, and Resque

I’ve put off some scaling-related issues about as long as possible, and am now introducing a deferred-job stack. I’ll explain what I’ve learned so far, with the caveat: this isn’t in production yet. I’m still learning.

What it’s all about

Tools like Resque let you perform work asynchronously. That way, you can turn requests around quickly, so the user gets something back immediately, even if it’s just “Thanks, we got your request”, which is nicer than the user waiting around 5 minutes, and ensures your server doesn’t lock up in the process. Typical example being sending an email – you don’t want the user’s browser to wait while your server connects elsewhere and fires off the email. Other examples would be fetching a user’s profile or avatar after they provide their social profile info; or generating a report they asked for.

So you set up an async job and respond, telling the user their message is on the way. If you need to show the user the result of the delayed job, have the client poll the server and render the result when it’s ready. More power to XHR!
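A minimal sketch of that shape (controller, model, and job names are hypothetical):

class ReportsController < ApplicationController
  def create
    report = Report.create!(user: current_user, status: 'pending')
    Resque.enqueue(GenerateReport, report.id)  # Resque is covered below
    render json: { id: report.id, status: report.status }, status: :accepted
  end

  # The client polls this via XHR until status flips to 'ready'.
  def show
    report = Report.find(params[:id])
    render json: { id: report.id, status: report.status }
  end
end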

The simple way to do this

The simple way, which worked just fine for me for a long time and I’d recommend for anyone starting, is a simple daemon process. Basically:

loop do
  do_something if check_database_for_condition
  sleep 10
end

The fancy way

The problem with the simple way is it can be hard to parallelise and monitor; you’ll end up reinventing the wheel. So to stand on the shoulders of giants, go install Redis, Resque, and Resque-Scheduler. I’ll explain each.

Redis

Redis, as you probably know, is a NoSQL database. It’s been described as a “data structure server”, as it stores lists, trees, and hashes; and assuming Knuth is your homeboy, that’s a mighty fine concept. And it’s super-fast, because everything is kept in memory, with (depending on config) frequent persistence to disk for durability.
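A quick taste of the data-structure flavour via the redis-rb gem (keys and values are illustrative):

require 'redis'

redis = Redis.new
redis.rpush('recent_errors', 'timeout')  # lists
redis.hset('user:1', 'name', 'Alice')    # hashes
redis.hget('user:1', 'name')             # => "Alice"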

Resque

Resque is nothing to sneeze at either, being a tool made and used by GitHub, no less.

Resque uses Redis to store the actual jobs. It’s worth explaining the main components of Resque, because I’ve found they’re often not defined very clearly and if you don’t understand this, everything else will trip you up.

Job. A job is a task you want to perform. For example Job 1234 might be “Send welcome email to [email protected]”. In Resque, a job is defined as a simple class having a “perform” method, which is what does the work [1].

Queue. Jobs live in queues. There’s a one-liner config item in the Job class to say which queue it belongs to. In a simple app, you could just push all jobs to a single queue, whereas in a bigger app, you might want separate queues for each job type. e.g. you’d end up with separate queues for “Send welcome email”, “Fetch user’s avatar”, and “Generate Report”. The main advantage of separate queues is you can give certain queues priority. In addition to these queues, you also have a special “failed” queue. Tasks that throw exceptions are moved to “failed”; otherwise the task disappears.
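To make that concrete, a minimal sketch (class, queue, and mailer names are mine, not from any particular app):

class SendWelcomeEmail
  @queue = :emails  # the one-liner that assigns this job's queue

  # Resque calls this with the args passed at enqueue time.
  def self.perform(address)
    UserMailer.welcome(address).deliver
  end
end

# Anywhere in the app: push a job onto the :emails queue.
Resque.enqueue(SendWelcomeEmail, 'user@example.com')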

Worker. A worker is a process that runs the jobs. So a worker polls the queues, picks the oldest jobs off them, and runs them. You start workers via Resque’s Rake task, and in doing so, you tell it which queues to run. There’s a wildcard option to run all queues, but for fine-grained optimisations, you could set up more workers to run higher-priority queues and so on.

An important note about the environment. Rails’ environment can take a long time to start, e.g. 30 seconds. You clearly don’t want a 30-second delay just to send an email. So workers will fork themselves before starting the job. This way, each job gets a fresh environment to run off, but you don’t have the overhead of starting up each time. (This is the same principle as Unicorn’s management of Rails’ servers.) So starting the worker does incur the initial Rails startup overhead, but starting each job doesn’t. In practice, jobs can begin in a fraction of a second. You can further optimise this by making a custom environment for the workers, e.g. don’t use all of Rails, but just use ActiveRecord, and so on. But it’s probably not worth the effort initially as the fork() mechanism gets you 80% there.
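One practical consequence of the forking model: per-process resources, like database connections, may need re-establishing in each fork. A common sketch using Resque’s after_fork hook (assuming ActiveRecord):

Resque.after_fork = proc do
  ActiveRecord::Base.establish_connection
end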

Resque-Scheduler

For many people, Resque alone will fit the bill. But certain situations also call for an explicit delay, e.g. “send this email reminder in 5 days”; or repeat a task, e.g. “generate a fresh report at 8am each day”. That’s where Resque-Scheduler comes in [2].

Resque-Scheduler was originally part of Resque, so it basically extends the Resque API. The schedule, i.e. the set of repeated tasks, is represented as a cron-like hash structure and can be conveniently kept in a YAML file.

Delayed jobs are created by your application code. It’s basically the same call as when you add the job directly to Resque, but you need to specify an additional delay or time argument.
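For instance, both styles side by side (job classes are hypothetical; 5.days needs ActiveSupport):

# One-off delayed jobs: same shape as Resque.enqueue, plus a time argument.
Resque.enqueue_in(5.days, SendReminder, user.id)
Resque.enqueue_at(Time.local(2014, 4, 8, 8, 0), GenerateReport)

# Repeated jobs: a cron-like hash, typically loaded from the YAML schedule file.
Resque.schedule = {
  'generate_report' => {
    'cron'  => '0 8 * * *',  # 8am each day
    'class' => 'GenerateReport'
  }
}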

The cool thing is jobs are persisted into Redis, so they will survive if the system – or any component (Redis/Resque/Resque-Scheduler) – goes down. I was confused at first, as I thought they were added to some special Resque queue. But no, they are actually in the Redis database. I found this by entering keys * into Redis’s command-line tool (redis-cli), which yielded some structures including “resque:delayed:1372936216”. When I then entered dump resque:delayed:1372936216, I got back a data structure which was basically my job spec, ie. {class: 'FeedHandler', args: ['http://example.com']}.

So Resque-Scheduler basically wakes up every second or so, and does two things: (a) polls Redis to see if any delayed jobs should now be executed; (b) inspects its “schedule” data structure to see if any repeated jobs should now be executed. If any jobs should now be executed, it pushes them to the appropriate Resque queue.

Notes

  1. Conceptually a job definition is little more than a function definition, rather than a full-blown class. But being a class is the more Rubyesque way to do it and also makes it easy to perform complex tasks as you can use attributes to hold intermediate results, since each job execution will instantiate a new job object.

  2. I evaluated other tools, e.g. Rufus and Clockwork, but what appeals about Resque-Scheduler is it persists delayed jobs and handles both one-off and repeated jobs.

What Everyone Should Know About REST: Talk at Web Directions Code

Here are the slides:

Slides: What Everyone Should Know About REST

Sketchnotes

Thanks UX Mastery for the sketchnotes, they are awesome! (Seriously, I would be much more swayed to speak at any conference with sketchnotes because it’s a straightforward permanent memento, a better snapshot than slides or video.)

Overall, it was great to be associated with another fine Web Directions conference, and the Melbourne Town Hall venue was amazing. I only regret that we were so busy scrambling on the Android app, having launched just a few days earlier, that I couldn’t be around the whole time. But this being my hometown — I’ll be back!

Talk Structure

I spoke at Web Directions Code on Friday, a talk on REST. I’ve been putting a lot of this into practice lately, and the talk was really an attempt to convey the main practical things every developer should know. The structure was:

  • Everyone should know about REST because it’s not just about websites anymore. Devices – whether computers, fridges, or wearable glasses – are connected, and device-to-device communication happens with web standards, i.e. HTTP. The talk covered three things about REST: Simplicity+Consistency; Security; Performance+Scalability.
  • Simplicity+Consistency: Emphasising Developer Experience (#devexp) was a way to frame the general concepts, ie URLs, HTTP methods, response types.
  • Security: How the web is becoming SSL-only, and various authentication schemes. I referenced the latest Traffic and Weather, which has a good discussion on this.
  • Performance+Scalability: Mostly about caching. I’ve been musing on REST caching quite a bit for Player FM’s API (most recently thinking about a kind of reverse patch protocol, where the server can send out diffs that get cached), and explained some of the standards and tricks for squeezing efficiency out of the network.

What Wasn’t Covered

  • I didn’t go into the REST acronym or the general theory of REST as an architectural pattern arising from specific forces.
  • SSL and caching. Good Twitter conversation afterwards about this point, that you can’t cache in the middle of an SSL connection. The answer is to split the connection in the middle and run SSL on either side, with a trusted cache seeing plain-text in the middle. This is how Cloudflare works, and the CEO Matthew Prince chimed in to say it will be free soon. (At least, SSL from client to Cloudflare.) So that means the SSL-protected web could triple overnight.

JavaScript swims downstream with the web

Roy Fielding’s original REST dissertation (published in 2000) has an interesting section on Java versus JavaScript which I hadn’t come across before and which has certainly stood the test of time. In particular, the biggest benefit is explained as nothing more complicated than performance, obviously a huge deal these days.

Extract below; emphasis mine.

[Background: Speaking about REST at Web Directions Code soon and finding myself on a long day of interstate flights, I bit the bullet and finally read Roy Fielding's original REST thesis cover-to-cover. And if anyone has a good way to export highlights from a personal document in Kindle, please let me know, as it's apparently unsupported.]

6.5.4.3 Java versus JavaScript

REST can also be used to gain insight into why some media types have had greater adoption within the Web architecture than others, even when the balance of developer opinion is not in their favor. The case of Java applets versus JavaScript is one example.

The question is: why is JavaScript more successful on the Web than Java? It certainly isn’t because of its technical quality as a language, since both its syntax and execution environment are considered poor when compared to Java. It also isn’t because of marketing: Sun far outspent Netscape in that regard, and continues to do so. It isn’t because of any intrinsic characteristics of the languages either, since Java has been more successful than JavaScript within all other programming areas (stand-alone applications, servlets, etc.). In order to better understand the reasons for this discrepancy, we need to evaluate Java in terms of its characteristics as a representation media type within REST.

JavaScript better fits the deployment model of Web technology. It has a much lower entry-barrier, both in terms of its overall complexity as a language and the amount of initial effort required by a novice programmer to put together their first piece of working code. JavaScript also has less impact on the visibility of interactions. Independent organizations can read, verify, and copy the JavaScript source code in the same way that they could copy HTML. Java, in contrast, is downloaded as binary packaged archives — the user is therefore left to trust the security restrictions within the Java execution environment. Likewise, Java has many more features that are considered questionable to allow within a secure environment, including the ability to send RMI requests back to the origin server. RMI does not support visibility for intermediaries.

Perhaps the most important distinction between the two, however, is that JavaScript causes less user-perceived latency. JavaScript is usually downloaded as part of the primary representation, whereas Java applets require a separate request. Java code, once converted to the byte code format, is much larger than typical JavaScript. Finally, whereas JavaScript can be executed while the rest of the HTML page is downloading, Java requires that the complete package of class files be downloaded and installed before the application can begin. Java, therefore, does not support incremental rendering.

Roy Fielding