Bloom Filters and Recommendation Engines

I’ll explain here what Bloom filters are and how you might find them useful for recommendation engines. I haven’t used them yet in production — just sharing what I’ve been learning.

Why Bloom Filters?

I was thinking about recommendation system algorithm. Not the main algorithm: how do you generate good recommendations. But instead, an important “side algorithm”: how do you keep track of recommendations users have previously dismissed? All the genius recommendations in the world aren’t going to matter if you keep showing the same results.

The most obvious solution here would be to track everything. Simply store a new record for every “dismissal” the user makes. But that’s a lot to store in a high-scale system, e.g. if 10 million users dismissed 20 items each, you have 200 million records to store and index.

So this is where Bloom filters come in as a highly compressed way to store a set of values. The catch is: it’s fuzzy. It’s not really storing the set; instead, it’s letting you ask the question you want to ask, which is: “Is X in this set?” and coming back with a probabilistic answer. But that’s okay for something like a recommendation system.

Here’s an example. A user Jane has dismissed three articles, identified with their IDs: 123, 456, 789.

Under the traditional model, we perform a standard set inclusion check (e.g. check if a database row exists) and come out with a definite answer:


Q: Is article 888 in the "Jane" set? Algorithm: Check if 888 is in [123, 456, 789] A: No. I'm sure about that.

Under the fuzzy Bloom filter model, we end up with some funny value as a fuzzy representation of the whole set, and then we can get a probabilistic answer about set inclusion.

  01101001 (this is the Bloom filter)

Q: Is article 888 in the "Jane" set? Algorithm: Check against the Bloom filter (details below) A: Probably not. But maybe. About 5% likelihood it's in the set.

Deriving the Bloom filter

So in the previous example, how did we end up with that representation of the set (what I playfully refer to as 01101001). And what did we do with it?

It’s fairly simple. Remember, this is the only thing we store and the set builds up over time. So the Bloom filter starts out as empty and each new set member adds something to it.

The real representation is a bitwise vector, let’s go with 8 bits: 00000000

So when user Jane is created, her Bloom filter is 0000000.

Jane dismisses article 123. Now what we do is, we compute some hashes of 123 using different algorithms. Since we have decided to make our Bloom filter 8 bits, each hash algorithm should give a number between 0 and 256, so we can store the result. Let’s assume we use two hash algorithms to hash 123. One ends up with 64 (01000000) and the other with 33 (00100001). So now our Bloom filter is:


When we get a 1, we set the bit to 1. When we get a zero, we do nothing. So yes, over time, this will fill with 1s. That’s why we have to choose a big enough bloom filter size.

Going on, the next dismissal is 456. And maybe we end up with hash values 01001001 and 0110000. So the first of these has added a new “1” to our previous value of 011000001:


And finally, we might end up with 01001000 and 00100000 for ID 789, neither of which light up any new bits. So we still have the same Bloom filter as before.


Is X in the set?

Now we have Jane’s Bloom filter, 01101001. This is a fuzzy representation of [123, 456, 789]. We can then ask, for any given value, is it in the set?

e.g. if our recommendation algorithm comes up with 888, should we show it to Jane. We don’t want to show it if it is in that set of previous dismissals. We compute using the same hash algorithms as before and perhaps we end up with 00101100. It lit up a different bit (the 6th one), so we can say categorically, it’s not in the set. We know that for sure because if it was in the set, all those bits would be on. We know for sure it’s not in the set of dismissals, so we can confidently recommend it to Jane.

Take another recommendation we might end up with – 456. Do we show it to Jane? Well, is it in the set of previous dismissals? We compute and get 01101001. It fits within our Bloom filter, so there’s a good chance it was in the list of values that was used to build up the filter. But no guarantee. We might end up with a value of 00001000 for another ID, e.g. 555. This would also fit the Bloom filter and we can be no more certain that it was in the original set than we can be for the 456 value. So, it’s probabilistic. You can be certain some things aren’t in the set, but you can’t be certain something was in the set. For a recommendation of 456 or 555, we can’t be sure. So in either case, we will not show Jane the recommendation and look deeper for more certain values.

Fine tuning

The example above just magically decided to use a Bloom filter of 8 and hand-waved around the algorithms. In practice, you will need to decide on those things and in practice it will probably be hundreds or thousands of bits; otherwise, every bit will quickly fill up to become 1. A cool thing is that there are precise calculations that can help you estimate exactly how big the Bloom filter should be, based on the expected number of items in it, combined with your tolerance for error. (If your algorithm can easily generate lots of good recommendations, you could have quite a high tolerance because it would be easy to skip over any potential matches.)


Considering the recommendation problem made me recall this article about how Medium uses Bloom filters and also led me to a useful tutorial on the topic.

Sidekiq 4’s performance boost

Mike Perham managed to turbo-boost Sidekiq for v4, making it six times faster. This in itself is good news for those of us who use it and his write-up is also of interest. #perfmatters

The perf tricks that made this possible:

  1. Redis -> worker communication (dispatching new jobs to work on): Instead of a single, global, thread on the client taking requests from Redis and locally dispatching them, every worker now gets its own direct line to Redis.
  2. Worker -> Redis communication (reporting when a job is complete): Instead of workers constantly updating the server, there’s now a client-side proxy that updates it in batches every few seconds, ie it buffers up the pending updates and periodically sends them in a multiplexed message.
  3. Refactored to do direct thread manipulation instead of relying on Celluloid.

Very interesting that (1) and (2) are almost the inverse of each other. Redis → worker job assignment has been switched from a global model to a per-worker model, while Worker → Redis job completion reporting has been switched from a per-worker model to a global model. So that’s the time-honoured pendulum swing between centralisation and decentralisation, in a nutshell.

Also, as a commenter notes, it’s not obvious how much has been gained by the withdrawal of Celluloid. Removing a library can not only increase complexity, but can be counter-productive to performance if the library captures years of performance boosts you’ll otherwise have to learn yourself. Nevertheless, in the case of Celluloid, it was really there to simplify the multithreading programming effort, and given how important this is to Sidekiq, it’s the kind of thing that often makes sense to take full control of. (The dubious refactorings are those where some peripheral feature like logging just had to be home-made. In the case of mission-critical functionality, there’s often a lot to be said for DYI.)

When your app composes tweets: Dealing with metadata

For those who don’t know, Twitter converts every URL to its own “” shortener URL. So no matter how short or long your original URL is, the URL will end up as a fixed character length, and that character length does count towards the 140 limit.

Any sane Twitter client will hide this complexity from end-users. The word count algorithm will be smart enough to take this into account show the remaining characters.

But as a coder, you need to incorporate that logic yourself.

You should also know that Twitter’s API won’t automatically truncate a tweet, so if your app tries to send a long one, Twitter will return an error. So your tweet-posting app will need to truncate the tweet to 140 characters.

So I was coding up an auto-tweet setting, which requires you to estimate the length of a tweet. The code looks like:

  1. TWEET_LENGTH = 140
  2. TWITTER_URL_LENGTH = 19 // !!Danger - read on!!
  4. def compose_message(episode)
  5.    hashtag = '#nowplaying'
  6.    url_and_hashtag_suffix = " #{episode.url} #nowplaying"
  7.    max_title_length = TWEET_LENGTH - (1 + TWITTER_URL_LENGTH + 1 + hashtag.length)
  8.    "#{truncate episode.title, max_title_length}#{url_and_hashtag_suffix}"
  9. end

And then, with a long title, it failed. Can you guess why?

The answer is because I apparently went to sleep for three years, and when I woke up, the world had composed hundreds of billions of tweets. Many of them include URLs, which means the length has crept up to 22 characters – 23 for SSL URLs – rising at about 1 character a year. Yes, if your tweet has a link in it, you now have to be 2.5% more concise in describing the link (that’s 3/(140 – 19)).

Thankfully, there’s an API for this:

So your code could periodically crawl the config API and aggressively-cache the result. Or alternatively, have your build script download it to your code base at compile-time, if it hasn’t seen an update for a while.

I haven’t checked in detail, but there are probably some open-source Twitter packages (gems, NPM modules etc) that include this config data and keep it up to date.

Note this also affects images and video – the above config URL also provides the length of a media item.

The Class1-Class2 Naming Antipattern for Associations

Join classes (aka join tables) are entities whose main purpose is to associate one object (aka record) with another in a NxN relationship.

A common and decades-old pattern, which is almost always wrong, is to name these classes after both association classes.

Examples of NxN relationships:

  • PersonStock maps owners to their stock.
  • UserFeed maps users to feeds they are subscribed to.
  • StudentCourse maps students to their courses.

What’s wrong with these names? First, they are awkward to say and cumbersome to deal with in code (as a general rule, multi-word entities are best avoided, because it becomes confusing and ambiguous when they are combined with other words). Second, they are redundant to anyone who is looking at the class’s foreign keys (admittedly, some redundancy is okay if it makes the code more understandable, but one should always be weary of a naming scheme which could be auto-generated by a trivial script). Third, and the biggest complaint: it’s wholly unnatural to anyone versed in the domain, therefore not a good model of reality. Only programmers and DBA use terminology like this; domain specialists do not.

The fundamental problem is it frames these associations as being entirely about the things they associate, instead of treating the association as a first-class citizen, which is inevitably how they are treated by a practitioner in the field you’re modelling. Once you start seeing the association as a model in its own right, you can start to enrich it with meaningful properties and behaviours. And this is typically true in the real world – associations are more than just dumb pairings of item A and item B.

More than just modelling these associations and finding an appropriate name, it can also prompt you to talk with domain specialists about what actually are the NxN join concepts in this domain.

Revisiting these examples:

  • PersonStock is better modelled as Ownership. Now that we have a concept of “ownership”, we can think about things like when was it created (ownership.created_at) and what kinds of conditions must be required to create an “ownership”. You could do this kind of reasoning with a “OwnerStock” thingy, but it’s more mental gymnastics and takes you a step away from domain specialists.

  • UserFeed is better modelled as Subscription. Now we can attach properties of the subscription, e.g. a ranking/rating indicating how much the user loves any particular feed. This data may then be used to determine how the user is notified of updates and perhaps how the “river of news” is sorted. Or maybe a visibility attribute indicating who can see the subscription, ie is it public that a given user is subscribed to a given feed.

  • StudentCourse is better modelled as Enrolment. Now we can record a “passed” or “grade” attribute against the enrolment and consider pre-conditions for creating an Enrolment, such as looking at the user’s past Enrolments.


p>Not all associations have a natural word to describe them, but even when they don’t, it’s worth thinking really hard about coming up with a new term. The Class1-Class2 name is almost always the road to pain.

A simple way to speed up Vim Ctrl-P plugin: Delegate to Ag

Ctrl-p is “Intellisense for Vim”, allowing you to quickly jump to a file by searching for a few letters or even fancy camel-case type searches. (e.g. find article_editor.rb by searching for “ae”).

However, doing all this requires it to maintain a search index, aka cache, to be maintained. That can be very frustrating with a big project as it takes 5-10 seconds to update, which is not a good thing when you’re desperately trying to jump around files. This delay would be fine if Ctrl-P worked in the background, but due to Vim limitations, it can’t, so you have to frequently run it on the command-line and wait for the update.

Or do you?

No you don’t. Here is a trick that lets you never wait for ctrl-p again! Just add this to your vimrc:

let g:ctrlp_user_command = 'ag %s -i --nocolor --nogroup --hidden
      \ --ignore .git
      \ --ignore .svn
      \ --ignore .hg
      \ --ignore .DS_Store
      \ --ignore "**/*.pyc"
      \ -g ""'

It’s taken straight from here. The cool thing about this trick is it doesn’t just speed up indexing, it completely removes the need for it. This is achieved by relying on the command-line tool Ag, aka Silver Searcher. It’s a brilliant grep replacement I would recommend to anyone, being exponentially faster than grep (as in, you can happily search a whole hard drive in real time).

I’ve used Ag for years but never realised it could be piped into Ctrl-P!

That page also includes some matching optimisation, but seriously the Ag trick was all I needed. Searching is now completely instantaneous and I never need to worry about the index going stale again.

The update has been pushed to my dotfiles.

Installing Android Marshmallow – quick gotcha notes

What the various guides often omit.

  • Run android-sdk/tools/android and update platform-tools
  • The oem unlock setting on phone kept reverting. But “fastboot oem unlock” fixed it.
  • Ensure “adb devices” shows the device
  • Ensure “Device State” on phone shows as UNLOCKED
  • “cannot load ‘system.img'” may happen because it requires several GB of free memory. So try chunking it: android-sdk/platform-tools/fastboot flash -S 512M system system.img

Speeding up Rails tests with Spring

I found Rails tests were running slow, these things helped.

Instrumenting application.rb

First, I added some logging to application.rb

  1. def logg(m)
  2.   puts "#{} #{m}"
  3. end
  5. logg 'require boot'
  6. require File.expand_path('../boot', <strong>FILE</strong>)
  7. logg 'done'</code>

The main bottleneck was bundling (Bundler.require(:default, Rails.env)), which took around 12 seconds on OSX.

Goodbye Bundle?

10+ seconds delay for running a test is unacceptable, leading me to contemplate ditching Bundler. However, that’s a lot of manual overhead, requiring us to manually import stuff, and we’re using Ruby here because it’s supposed to cut out tedium like that.

Spring forward

The real answer is to avoid running application.rb every time using the (defaultly-installed) Spring. Spring should keep a master process around after the app has booted, so we don’t need to run application.rb every time. I thought Spring was already running, but with this kind of delay, I had to check it.

Opened a new terminal, started tailing /tmp/spring.log, and then made spring start using it in the main Ruby test running terminal:

  1. spring stop
  2. export SPRING_LOG=/tmp/spring.log
  3. tail -f /tmp/spring.log

This is perfect for diagnosis: I have the aforementioned application.rb logs in the main terminal and the other window shows me what Spring is up to.

Ran a test:

  1. truby test/models/user_test -n /username/ # truby is my alias for RAILS_ENV=test bundle exec ruby ...

And yes, I saw application.rb running through all the motions, 10 second bundle, etc. And no spring action.

Ran it again.

Same thing. We’re booting the whole app for every one of these tiny tests!

The thing is Spring only works for a select few commands – any rake command and just a few Rails commands (rails console, rails generate, rails runner). This caught me out as I was trying to run rails server and Spring again wasn’t running. I guess the thinking is you shouldn’t need to restart web server often in dev anyway – stuff is already auto-reloaded according to default config.

So the answer was to run tests through Spring. I’d previously stopped doing that due to difficulty running a single test method, but now it’s easy.

  1. rake test test/models/user_test /username/

In the above case, the second argument is an optional test method or regular expression. Unlike the direct invocation, no “-n” is needed if a regex is provided.

And, problem solved. I see Spring is cloning the booted process and the test runs in a fraction of a second.

Chrome’s new user switcher is a major design regression

I rely on profile switching completely for day-to-day use. I have my personal Google account, Google apps account, a sandbox account for testing in clean-slate, a company account, support account, account for Google Play, account with read-only access to some servers which can be authorised to analytics tools, etc. Consultants often have one or more accounts per client.

The old “switching between different browsers” doesn’t cut it, Chrome knew it when they introduced this feature, and with the latest update, it’s become barely usable.

The new user navigation switcher:

  • Shows text labels instead of the instantly recognisable, customisable, graphical ones. Welcome to 1990, have a nice day.
  • Requires an extra click to switch (“Switch profile”) after hitting on the current profile name.
  • Takes several seconds to open the switcher menu in its own dialog.
  • After clicking twice to open the switcher and waiting around for it to paint, it only shows 8 profiles (what?!), only the first 8 in alphabetical order, and won’t scroll for more. Turns out you have to manually expand it to see all profiles! Hidden core features ftw!
  • The big compromise was to show the old menu if you click with the second mouse button. Well guess what, many of us don’t have a second mouse button!
  • The old menu can also be launched with … a 2-finger tap on the trackpad. Yes, the only way to access this functionality is with an awkward 2-finger tap. It works 35% of the time for me, I’m trying to get some practice in so my average can hopefully get up to 50% (dream big). Keep in mind, this 2-finger tap is *the only way* I can access some of my profiles (those beyond the first 8).

Bottom line: Chrome had a great feature for years, which no other browser supports to this day. The architects of this feature have dropped the ball completely, given no explanation for these counter-intuitive design decisions, and hopefully Firefox or Edge will be there to pick up the slack.

All of the above issues are articulated in the bug tracker in very much the same way as the new design has no-one explaining the benefits. It’s hard to see anything good about the new design.

It’s not a case of wanting to stick to the old thing for the sake of not learning a new interface. It’s a case of a serious design regression in every dimension, one that hasn’t seen any rational justification from the designers.

Sometimes you can just tell when the designer doesn’t use their own feature.

Time spent in mobile apps considered harmful

TechCrunch reports on time spent in mobile apps. Saying users spend X time on these kinds of apps sounds like the best way to gauge engagement, but caution advised. This is a useful figure for apps like Facebook, whose revenue is closely tied to how many eyeball-seconds are spent in the app. But as others have pointed out, it’s flawed.

Consider the following app categories:

  • E-Commerce – as the above tweet mentions, a staggering amount can be spent with a few taps. Uber’s whole magic is precisely how effectively it gets out of the way.
  • Notifiers – how about getting recommendations from FourSquare when you walk past it, or seeing when your friend is nearby. Apps can be valuable even if they are mostly about poking you when something happens.
  • Audio – Few people outside of stock photo models specifically sit down to listen to music and podcasts. Audio apps are fundamentally about multitasking, which means users will do some basic management, hit play, and get out of the way. Whether they leave the app in the foreground or background is fairly arbitrary, so as long as these metrics are measuring eyes and not ears, it’s skewed.

“But we have ping pong tables”

As competition for developers heats up, companies are falling over themselves to attract the best talent. A common tactic is to offer perks, probably because someone heard Google offers free food and maybe because it sounds appealing to a manager who’s trying hard to channel their inner programmer.

Perks aren’t bad, but in my experience, the best workers don’t pay much attention to them. The best workers want to actually get work done. They care about the people they work with, their work environment, the technology and tools they can use, and the product they work on.

Of course, developers are motivated by extrinsic factors such as salary, health care, vacation time, etc. But if they are a sufficiently talented engineer to be worth hiring, they’ll also be smart enough to know the true value of a free gym membership.

Perks can make a difference, but they should be considered in the context of things developers actually care about.

Ping pong, for example, might add to the work environment for some developers. So it’s a small subset of a small subset of the overall things devs care about.

Free lunch, the real benefit is not money saving. Developers earning six figures are smart enough to know that. The benefit is saving them the hassle of arranging or queuing for lunch and being able to socialise with their team mates. They will probably end up working more instead of standing in a line somewhere, but that’s just fine for both parties.

For talented developers, it’s far more important to consider work-related perks like big displays and flexible operating system choice. Those things help developers kick ass all day long. They are more attractive to the kind of people who are able and motivated to get the job done.

(On ping pong tables specifically, they will backfire if they cause even the slightest noise disruption. Programmers need to work in quiet environments. Also, what makes you think programmers will like them more than other people you are trying to hire? It can come across as condescending to think it’s something programmers will value and others won’t.)