Bloom Filters and Recommendation Engines

I’ll explain here what Bloom filters are and how you might find them useful for recommendation engines. I haven’t used them yet in production — just sharing what I’ve been learning.

Why Bloom Filters?

I was thinking about recommendation system algorithm. Not the main algorithm: how do you generate good recommendations. But instead, an important “side algorithm”: how do you keep track of recommendations users have previously dismissed? All the genius recommendations in the world aren’t going to matter if you keep showing the same results.

The most obvious solution here would be to track everything. Simply store a new record for every “dismissal” the user makes. But that’s a lot to store in a high-scale system, e.g. if 10 million users dismissed 20 items each, you have 200 million records to store and index.

So this is where Bloom filters come in as a highly compressed way to store a set of values. The catch is: it’s fuzzy. It’s not really storing the set; instead, it’s letting you ask the question you want to ask, which is: “Is X in this set?” and coming back with a probabilistic answer. But that’s okay for something like a recommendation system.

Here’s an example. A user Jane has dismissed three articles, identified with their IDs: 123, 456, 789.

Under the traditional model, we perform a standard set inclusion check (e.g. check if a database row exists) and come out with a definite answer:


Q: Is article 888 in the "Jane" set? Algorithm: Check if 888 is in [123, 456, 789] A: No. I'm sure about that.

Under the fuzzy Bloom filter model, we end up with some funny value as a fuzzy representation of the whole set, and then we can get a probabilistic answer about set inclusion.

  01101001 (this is the Bloom filter)

Q: Is article 888 in the "Jane" set? Algorithm: Check against the Bloom filter (details below) A: Probably not. But maybe. About 5% likelihood it's in the set.

Deriving the Bloom filter

So in the previous example, how did we end up with that representation of the set (what I playfully refer to as 01101001). And what did we do with it?

It’s fairly simple. Remember, this is the only thing we store and the set builds up over time. So the Bloom filter starts out as empty and each new set member adds something to it.

The real representation is a bitwise vector, let’s go with 8 bits: 00000000

So when user Jane is created, her Bloom filter is 0000000.

Jane dismisses article 123. Now what we do is, we compute some hashes of 123 using different algorithms. Since we have decided to make our Bloom filter 8 bits, each hash algorithm should give a number between 0 and 256, so we can store the result. Let’s assume we use two hash algorithms to hash 123. One ends up with 64 (01000000) and the other with 33 (00100001). So now our Bloom filter is:


When we get a 1, we set the bit to 1. When we get a zero, we do nothing. So yes, over time, this will fill with 1s. That’s why we have to choose a big enough bloom filter size.

Going on, the next dismissal is 456. And maybe we end up with hash values 01001001 and 0110000. So the first of these has added a new “1” to our previous value of 011000001:


And finally, we might end up with 01001000 and 00100000 for ID 789, neither of which light up any new bits. So we still have the same Bloom filter as before.


Is X in the set?

Now we have Jane’s Bloom filter, 01101001. This is a fuzzy representation of [123, 456, 789]. We can then ask, for any given value, is it in the set?

e.g. if our recommendation algorithm comes up with 888, should we show it to Jane. We don’t want to show it if it is in that set of previous dismissals. We compute using the same hash algorithms as before and perhaps we end up with 00101100. It lit up a different bit (the 6th one), so we can say categorically, it’s not in the set. We know that for sure because if it was in the set, all those bits would be on. We know for sure it’s not in the set of dismissals, so we can confidently recommend it to Jane.

Take another recommendation we might end up with – 456. Do we show it to Jane? Well, is it in the set of previous dismissals? We compute and get 01101001. It fits within our Bloom filter, so there’s a good chance it was in the list of values that was used to build up the filter. But no guarantee. We might end up with a value of 00001000 for another ID, e.g. 555. This would also fit the Bloom filter and we can be no more certain that it was in the original set than we can be for the 456 value. So, it’s probabilistic. You can be certain some things aren’t in the set, but you can’t be certain something was in the set. For a recommendation of 456 or 555, we can’t be sure. So in either case, we will not show Jane the recommendation and look deeper for more certain values.

Fine tuning

The example above just magically decided to use a Bloom filter of 8 and hand-waved around the algorithms. In practice, you will need to decide on those things and in practice it will probably be hundreds or thousands of bits; otherwise, every bit will quickly fill up to become 1. A cool thing is that there are precise calculations that can help you estimate exactly how big the Bloom filter should be, based on the expected number of items in it, combined with your tolerance for error. (If your algorithm can easily generate lots of good recommendations, you could have quite a high tolerance because it would be easy to skip over any potential matches.)


Considering the recommendation problem made me recall this article about how Medium uses Bloom filters and also led me to a useful tutorial on the topic.

Sidekiq 4’s performance boost

Mike Perham managed to turbo-boost Sidekiq for v4, making it six times faster. This in itself is good news for those of us who use it and his write-up is also of interest. #perfmatters

The perf tricks that made this possible:

  1. Redis -> worker communication (dispatching new jobs to work on): Instead of a single, global, thread on the client taking requests from Redis and locally dispatching them, every worker now gets its own direct line to Redis.
  2. Worker -> Redis communication (reporting when a job is complete): Instead of workers constantly updating the server, there’s now a client-side proxy that updates it in batches every few seconds, ie it buffers up the pending updates and periodically sends them in a multiplexed message.
  3. Refactored to do direct thread manipulation instead of relying on Celluloid.

Very interesting that (1) and (2) are almost the inverse of each other. Redis → worker job assignment has been switched from a global model to a per-worker model, while Worker → Redis job completion reporting has been switched from a per-worker model to a global model. So that’s the time-honoured pendulum swing between centralisation and decentralisation, in a nutshell.

Also, as a commenter notes, it’s not obvious how much has been gained by the withdrawal of Celluloid. Removing a library can not only increase complexity, but can be counter-productive to performance if the library captures years of performance boosts you’ll otherwise have to learn yourself. Nevertheless, in the case of Celluloid, it was really there to simplify the multithreading programming effort, and given how important this is to Sidekiq, it’s the kind of thing that often makes sense to take full control of. (The dubious refactorings are those where some peripheral feature like logging just had to be home-made. In the case of mission-critical functionality, there’s often a lot to be said for DYI.)

When your app composes tweets: Dealing with metadata

For those who don’t know, Twitter converts every URL to its own “” shortener URL. So no matter how short or long your original URL is, the URL will end up as a fixed character length, and that character length does count towards the 140 limit.

Any sane Twitter client will hide this complexity from end-users. The word count algorithm will be smart enough to take this into account show the remaining characters.

But as a coder, you need to incorporate that logic yourself.

You should also know that Twitter’s API won’t automatically truncate a tweet, so if your app tries to send a long one, Twitter will return an error. So your tweet-posting app will need to truncate the tweet to 140 characters.

So I was coding up an auto-tweet setting, which requires you to estimate the length of a tweet. The code looks like:

  1. TWEET_LENGTH = 140
  2. TWITTER_URL_LENGTH = 19 // !!Danger - read on!!
  4. def compose_message(episode)
  5.    hashtag = '#nowplaying'
  6.    url_and_hashtag_suffix = " #{episode.url} #nowplaying"
  7.    max_title_length = TWEET_LENGTH - (1 + TWITTER_URL_LENGTH + 1 + hashtag.length)
  8.    "#{truncate episode.title, max_title_length}#{url_and_hashtag_suffix}"
  9. end

And then, with a long title, it failed. Can you guess why?

The answer is because I apparently went to sleep for three years, and when I woke up, the world had composed hundreds of billions of tweets. Many of them include URLs, which means the length has crept up to 22 characters – 23 for SSL URLs – rising at about 1 character a year. Yes, if your tweet has a link in it, you now have to be 2.5% more concise in describing the link (that’s 3/(140 – 19)).

Thankfully, there’s an API for this:

So your code could periodically crawl the config API and aggressively-cache the result. Or alternatively, have your build script download it to your code base at compile-time, if it hasn’t seen an update for a while.

I haven’t checked in detail, but there are probably some open-source Twitter packages (gems, NPM modules etc) that include this config data and keep it up to date.

Note this also affects images and video – the above config URL also provides the length of a media item.

A simple way to speed up Vim Ctrl-P plugin: Delegate to Ag

Ctrl-p is “Intellisense for Vim”, allowing you to quickly jump to a file by searching for a few letters or even fancy camel-case type searches. (e.g. find article_editor.rb by searching for “ae”).

However, doing all this requires it to maintain a search index, aka cache, to be maintained. That can be very frustrating with a big project as it takes 5-10 seconds to update, which is not a good thing when you’re desperately trying to jump around files. This delay would be fine if Ctrl-P worked in the background, but due to Vim limitations, it can’t, so you have to frequently run it on the command-line and wait for the update.

Or do you?

No you don’t. Here is a trick that lets you never wait for ctrl-p again! Just add this to your vimrc:

let g:ctrlp_user_command = 'ag %s -i --nocolor --nogroup --hidden
      \ --ignore .git
      \ --ignore .svn
      \ --ignore .hg
      \ --ignore .DS_Store
      \ --ignore "**/*.pyc"
      \ -g ""'

It’s taken straight from here. The cool thing about this trick is it doesn’t just speed up indexing, it completely removes the need for it. This is achieved by relying on the command-line tool Ag, aka Silver Searcher. It’s a brilliant grep replacement I would recommend to anyone, being exponentially faster than grep (as in, you can happily search a whole hard drive in real time).

I’ve used Ag for years but never realised it could be piped into Ctrl-P!

That page also includes some matching optimisation, but seriously the Ag trick was all I needed. Searching is now completely instantaneous and I never need to worry about the index going stale again.

The update has been pushed to my dotfiles.

Speeding up Rails tests with Spring

I found Rails tests were running slow, these things helped.

Instrumenting application.rb

First, I added some logging to application.rb

  1. def logg(m)
  2.   puts "#{} #{m}"
  3. end
  5. logg 'require boot'
  6. require File.expand_path('../boot', <strong>FILE</strong>)
  7. logg 'done'</code>

The main bottleneck was bundling (Bundler.require(:default, Rails.env)), which took around 12 seconds on OSX.

Goodbye Bundle?

10+ seconds delay for running a test is unacceptable, leading me to contemplate ditching Bundler. However, that’s a lot of manual overhead, requiring us to manually import stuff, and we’re using Ruby here because it’s supposed to cut out tedium like that.

Spring forward

The real answer is to avoid running application.rb every time using the (defaultly-installed) Spring. Spring should keep a master process around after the app has booted, so we don’t need to run application.rb every time. I thought Spring was already running, but with this kind of delay, I had to check it.

Opened a new terminal, started tailing /tmp/spring.log, and then made spring start using it in the main Ruby test running terminal:

  1. spring stop
  2. export SPRING_LOG=/tmp/spring.log
  3. tail -f /tmp/spring.log

This is perfect for diagnosis: I have the aforementioned application.rb logs in the main terminal and the other window shows me what Spring is up to.

Ran a test:

  1. truby test/models/user_test -n /username/ # truby is my alias for RAILS_ENV=test bundle exec ruby ...

And yes, I saw application.rb running through all the motions, 10 second bundle, etc. And no spring action.

Ran it again.

Same thing. We’re booting the whole app for every one of these tiny tests!

The thing is Spring only works for a select few commands – any rake command and just a few Rails commands (rails console, rails generate, rails runner). This caught me out as I was trying to run rails server and Spring again wasn’t running. I guess the thinking is you shouldn’t need to restart web server often in dev anyway – stuff is already auto-reloaded according to default config.

So the answer was to run tests through Spring. I’d previously stopped doing that due to difficulty running a single test method, but now it’s easy.

  1. rake test test/models/user_test /username/

In the above case, the second argument is an optional test method or regular expression. Unlike the direct invocation, no “-n” is needed if a regex is provided.

And, problem solved. I see Spring is cloning the booted process and the test runs in a fraction of a second.

Chrome’s new user switcher is a major design regression

I rely on profile switching completely for day-to-day use. I have my personal Google account, Google apps account, a sandbox account for testing in clean-slate, a company account, support account, account for Google Play, account with read-only access to some servers which can be authorised to analytics tools, etc. Consultants often have one or more accounts per client.

The old “switching between different browsers” doesn’t cut it, Chrome knew it when they introduced this feature, and with the latest update, it’s become barely usable.

The new user navigation switcher:

  • Shows text labels instead of the instantly recognisable, customisable, graphical ones. Welcome to 1990, have a nice day.
  • Requires an extra click to switch (“Switch profile”) after hitting on the current profile name.
  • Takes several seconds to open the switcher menu in its own dialog.
  • After clicking twice to open the switcher and waiting around for it to paint, it only shows 8 profiles (what?!), only the first 8 in alphabetical order, and won’t scroll for more. Turns out you have to manually expand it to see all profiles! Hidden core features ftw!
  • The big compromise was to show the old menu if you click with the second mouse button. Well guess what, many of us don’t have a second mouse button!
  • The old menu can also be launched with … a 2-finger tap on the trackpad. Yes, the only way to access this functionality is with an awkward 2-finger tap. It works 35% of the time for me, I’m trying to get some practice in so my average can hopefully get up to 50% (dream big). Keep in mind, this 2-finger tap is *the only way* I can access some of my profiles (those beyond the first 8).

Bottom line: Chrome had a great feature for years, which no other browser supports to this day. The architects of this feature have dropped the ball completely, given no explanation for these counter-intuitive design decisions, and hopefully Firefox or Edge will be there to pick up the slack.

All of the above issues are articulated in the bug tracker in very much the same way as the new design has no-one explaining the benefits. It’s hard to see anything good about the new design.

It’s not a case of wanting to stick to the old thing for the sake of not learning a new interface. It’s a case of a serious design regression in every dimension, one that hasn’t seen any rational justification from the designers.

Sometimes you can just tell when the designer doesn’t use their own feature.

Time spent in mobile apps considered harmful

TechCrunch reports on time spent in mobile apps. Saying users spend X time on these kinds of apps sounds like the best way to gauge engagement, but caution advised. This is a useful figure for apps like Facebook, whose revenue is closely tied to how many eyeball-seconds are spent in the app. But as others have pointed out, it’s flawed.

Consider the following app categories:

  • E-Commerce – as the above tweet mentions, a staggering amount can be spent with a few taps. Uber’s whole magic is precisely how effectively it gets out of the way.
  • Notifiers – how about getting recommendations from FourSquare when you walk past it, or seeing when your friend is nearby. Apps can be valuable even if they are mostly about poking you when something happens.
  • Audio – Few people outside of stock photo models specifically sit down to listen to music and podcasts. Audio apps are fundamentally about multitasking, which means users will do some basic management, hit play, and get out of the way. Whether they leave the app in the foreground or background is fairly arbitrary, so as long as these metrics are measuring eyes and not ears, it’s skewed.

“But we have ping pong tables”

As competition for developers heats up, companies are falling over themselves to attract the best talent. A common tactic is to offer perks, probably because someone heard Google offers free food and maybe because it sounds appealing to a manager who’s trying hard to channel their inner programmer.

Perks aren’t bad, but in my experience, the best workers don’t pay much attention to them. The best workers want to actually get work done. They care about the people they work with, their work environment, the technology and tools they can use, and the product they work on.

Of course, developers are motivated by extrinsic factors such as salary, health care, vacation time, etc. But if they are a sufficiently talented engineer to be worth hiring, they’ll also be smart enough to know the true value of a free gym membership.

Perks can make a difference, but they should be considered in the context of things developers actually care about.

Ping pong, for example, might add to the work environment for some developers. So it’s a small subset of a small subset of the overall things devs care about.

Free lunch, the real benefit is not money saving. Developers earning six figures are smart enough to know that. The benefit is saving them the hassle of arranging or queuing for lunch and being able to socialise with their team mates. They will probably end up working more instead of standing in a line somewhere, but that’s just fine for both parties.

For talented developers, it’s far more important to consider work-related perks like big displays and flexible operating system choice. Those things help developers kick ass all day long. They are more attractive to the kind of people who are able and motivated to get the job done.

(On ping pong tables specifically, they will backfire if they cause even the slightest noise disruption. Programmers need to work in quiet environments. Also, what makes you think programmers will like them more than other people you are trying to hire? It can come across as condescending to think it’s something programmers will value and others won’t.)

Google’s Many Marketplaces

Google’s Many Marketplaces

While Play is an “all-in-one” store for Google, there are numerous marketplaces or catalogues from Google, defined as resources where third parties may share content for other third parties to consume.

Here’s a crowdsourced list, courtesy Twitter. I’ve wikified it here; please add any others there.

The list could be useful for developers considering how to distribute their app or extend its reach.

For consumers

For businesses

For developers and online service providers

Android M Highlights

Google IO has now wrapped up. The conference went smoothly and overall there has been a decisive swing back in focus towards developers. Android remains the centerpiece, in terms of the keynote, schedule, and displays, and Android M is now here and available to developers.

Here are some highlights of Android M following IO. I haven’t yet tried it, though I’ll be flashing the giveaway N9 before long.

Mmmm … that name

If two makes a pattern, Google has now established the single-letter name convention for developer preview editions. I suppose this goes along with IO’s renewed focus on developers, in contrast to the actual OS release (which normally coincides with the new Nexus line, about 6 months later). Google gets to save the name reveal for the fanfare it really wants around that time. Any problems that happen along the way will be associated with M, not the eventual name. And finally, it also lets them defer a decision, which is usually a Good Thing, and especially when it involves other companies (Kit Kat).

Anyway … if it’s not marshmallowMarshMallow, I’ll be disappointed.

On-demand permissions! Finally

Android M joins iOS, Cyanogenmod, and the web in supporting on-demand permissions, a feature that had been experimented with previously but didn’t see the light of day until now. This is great news for both users and developers. Improved security and control for users, and meanwhile developers don’t have to get caned for providing rich, optional, features.

One recent example was Player FM supporting audio ducking, ie the ability to keep playing when a transient sound plays. Because phone calls are also transient sounds, we had to declare a permission to detect when the phone was ringing, which happens to include the ability to see the user’s phone number and those they call. All this for an optional setting. Not surprisingly, there were many queries about the new permission and I’m very pleased it can be optional with M.

App Linking

A developer can now tie its app and website together by, in crude terms, pointing each to the other. The net effect is users can click on a link without being presented with the tedious “Open with …” menu. It will open straight in the app if it’s installed. I believe these menus are one of the biggest UX problems for Android, as they appear far too often (basically, anytime an installed application has declared a regexp matching the link you just click). This doesn’t solve the problem completely, but is a major step in that direction.

It will also be a controversial feature, as it means any site can now block third-party apps from “presenting” or “hijacking” (depending which side you’re on) its links. By declaring its the one and only app to handle those URLs, other handlers are shut out. It’s true, at least in this initial M build, that users can modify those settings, but few will go to those lengths.

Custom Chrome Tabs

App linking is about jumping from one app to another. Custom tabs is about jumping from an app to the web. It’s actually a Chrome update, so once it lands in Chrome, it should work on older Android too.

Consider what happens when a link needs to open a web page? One solution is for the app to host its own webview, thus keeping the user from leaving their app and retaining their controls and branding. This unfortunately leads to a cottage industry in “optimising your website for Twitter/Pinterest/Facebook webviews”. The webview, being a sandboxed environment, is slow to load as caching isn’t possible, and nor does it inherit cookies or the user’s browser settings. So, not ideal.

Custom tabs aim to get the best of both worlds: custom controls and branding, but inside “actual Chrome”. Even the transition between browser and app can be configured, and pre-caching is also possible, ie telling Chrome to start loading something while the user is inside the app.

Direct Share

Direct share is a new standard for the popular pattern of sharing something with N friends. The technology is similar to the file picker introduced in Kitkat – where the file picker could let you choose a file from any of Dropbox, Google Drive, etc, the new share feature is effectively a “people picker” that lets you share with contacts or any participating social network. This is also about Google recognising Google Plus is not the only answer to social on Android.

Doze State

Doze is the apt name for a new state that kicks in when the phone is still for a long time. Apps go into background but notifications and alarms still work. It makes sense as a trade-off and will be brilliant if it even gets close to the “up to two times longer” standby lifetime.

Play Store:

Technically Play!=Android, but hey it’s the main distribution point, and there were some important updates. A-B testing of the description will be possible with the new experiments feature. Analytics on ad revenues and conversions will be available directly in Play. And there will be a new family-friendly Play store with some apps there being verified as family-friendly.

Image credit