Logging the caller

It’s so useful to automagically log the caller’s line number and so on. Most languages make it possible via a hack involving throwing an exception to yield a stack trace; some languages explicitly provide this info. In Ruby, it’s possible with the caller array.

Here’s how I used it just now:

  def logg
    # caller[0] looks like "path/to/series_feed.rb:123:in `fetch'";
    # reduce it to file:method:line_number.
    caller_info = caller[0].gsub(/^.+\/(.*)\.rb:(.*):in `(.*)'/, '\1:\3:\2')
    Rails.logger.debug "[#{caller_info}] - #{id}. Thread#{Thread.current.object_id.to_s(36)}"
  end

This will output caller_info in the format [series_feed:fetch:123], which is file:method:line_number. It’s derived, via the regex, from the slightly less log-friendly caller string, path/to/series_feed.rb:123:in `fetch'.
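To make the transformation concrete, here’s a quick illustration using the example string from above (the path and line number are just placeholders):

  caller[0]
  # => "path/to/series_feed.rb:123:in `fetch'"

  caller[0].gsub(/^.+\/(.*)\.rb:(.*):in `(.*)'/, '\1:\3:\2')
  # => "series_feed:fetch:123"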

Rails quirk: Adding to DateTime

Just came across a weird one with DateTime. Adding an integer value will increment it by a day, not a second as you might expect:

  $ x=1
  1
  $ x.class
  Fixnum
  $ DateTime.new(2000, 1, 1, 0, 0, 0)+x
  Sun, 02 Jan 2000 00:00:00 +0000

But adding a second is possible by explicitly using 1.second. The strange thing is that both report their class as Fixnum and essentially act as the same number. So if you want the value to mean seconds, one way to achieve it is to use “x.seconds”.

  $ x=1.second
  1 second
  $ x.class
  Fixnum
  $ DateTime.new(2000, 1, 1, 0, 0, 0)+x
  Sat, 01 Jan 2000 00:00:01 +0000
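So, as a minimal sketch: if x arrives as a plain integer that is meant to be seconds, convert it explicitly before adding:

  x = 1                                          # plain integer
  DateTime.new(2000, 1, 1, 0, 0, 0) + x.seconds  # now it means seconds
  # => Sat, 01 Jan 2000 00:00:01 +0000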

Shorthand Parameters

Here is a weird abuse of default parameter values to support shorthand variable names. It’s valid Ruby.

  def area(r = radius)
    Math::PI * r * r
  end

Simple example, but you get the point: the default expression tells the external world what the parameter is all about, while the implementation keeps the shorthand name. Obviously it’s just a simple example here; real parameter names can be much more verbose and functions can be longer, so you don’t want to keep repeating a long name. For example:

  def damage_level(force = force_exerted_by_car)
    force = 0 if force < 0
    acceleration = force / mass
    # ...
  end

Now you might say “just declare it in the first line”, but I prefer small code and there could be several such lines.

You might say “mention it in a comment”, but I prefer self-documenting code. Comments go out of date and clutter up code. (Strictly speaking, the long name here is a comment, but it’s more likely to be maintained.)
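For completeness, here’s a sketch of the first example in context, assuming a hypothetical Circle class whose radius attribute supplies the default:

  class Circle
    attr_reader :radius

    def initialize(radius)
      @radius = radius
    end

    # The default documents what r means; the body stays terse.
    def area(r = radius)
      Math::PI * r * r
    end
  end

  Circle.new(2).area      # uses the circle's own radius => ~12.57
  Circle.new(2).area(3)   # caller supplies an explicit radius => ~28.27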

[Update: I don't often mention Pi, but when I do, it's on March 14: Pi Day. Thanks to the reader who pointed it out!]

More RSS Client Optimizations: Preventing Re-Fetch

Background: Has the Feed Changed?

I previously mentioned some work I did to cut down processing and IO on an RSS client. Yesterday, I was able to continue this effort with some more enhancements geared around checking if the feed has changed. These changes are not just important for my server’s performance, but also for being a good internet citizen and not hammering others’ machines with gratuitous requests. Note that everything in this article is basic hygiene for anyone who’s written any kind of high-scale bot, but I’m documenting it here because it was useful learning for me.

Normally, a fetch requires the client to compare the incoming feed against what has been stored. This requires a lookup on the database and a comparison process. It’s read-only, so not hugely expensive, but it does require reading a lot (all items in the feed) at frequent intervals.

All this comparison effort would be unnecessary if we could guarantee the feed hasn’t changed since the last fetch. And of course, most of the time, it won’t have changed. If we’re fetching feeds hourly, and the feed changes on average once a week, then we can theoretically skip the whole comparison 99.4% of the time (167 of the 168 hourly fetches in a week find nothing new, and 167/168 ≈ 99.4%)!

So how can we check if the feed has changed?

Feed Hash

The brute-force way to check if the feed has changed is to compare the feed content with the one we received last time. We could store the incoming feed in a file, and if it’s the same as the one we just sucked down, we can safely next it.

Storing a jillion feed files is expensive and unnecessary. (Though some people might temporarily store them if they’ve separated the fetching from the comparison, to prevent blockages, which I haven’t done here.) If all we need the files for is a comparison, we can instead store a hash. With a decent hash, the chance of a false positive is extremely low, and the severity in this context is also extremely low.

So the feed now has a new hash field.

  incoming_feed = fetch_feed(feed_record.url)
  incoming_hash = Digest::MD5.hexdigest(incoming_feed.body)
  return if incoming_hash == feed_record.hash # Files match, no comparison necessary

  feed_record.title = incoming_feed.title
  feed_record.hash = incoming_hash # Save the new hash for next time
  # ... Keep processing the feed. Compare each item, etc.

HTTP If-Modified-Since

The HTTP protocol provides its own support for this kind of thing, via the If-Modified-Since request header. So we should send this header, and we can then expect a 304 Not Modified response in the likely event no change has happened. This saves transferring the actual file as well as bypassing the hash check above. (However, since this isn’t supported everywhere, we still need the above check as an extra precaution.)

  req = Net::HTTP::Get.new(feed_record.url)
  req.add_field("If-Modified-Since", last_fetched_at.rfc2822) if last_fetched_at
  ...
  res = http.request(req) # http being the usual Net::HTTP connection set up above
  return if res.code=='304' # We don't even need to compare hashes

ETag

Another HTTPism is the ETag, a value that, like our hash, should change whenever the feed content changes. So to be extra sure we’re not re-processing the same feed, and hopefully not even fetching the whole feed, we can save the ETag and include it in each request. It works like If-Modified-Since: if the server is still serving the same ETag, it will respond with an empty 304.

  req.add_field("If-None-Match", etag) if etag
  ...
  # Again, we return if res.code=='304'
  feed_record.etag = incoming_feed.etag # Save it for next time

For the record, about half of the feeds I’ve tested (mostly from fairly popular sources, many of them commercial) include ETags. And of those, at least some change the ETag unnecessarily often, which renders it useless in those cases (actually worse than useless, since it consumes unnecessary resources). Given that level of support, I’m not convinced it adds much value over just using If-Modified-Since, but I’ll leave it in for now. I’m sure the managers of those servers which do support it would prefer it be used.
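Putting the pieces together, here’s a minimal, self-contained sketch of the whole conditional fetch. The feed_record fields (url, etag, hash, last_fetched_at) are assumptions standing in for whatever your model actually stores:

  require 'net/http'
  require 'uri'
  require 'time'
  require 'digest/md5'

  # Returns the new body if the feed changed, or nil if we can skip processing.
  def fetch_if_changed(feed_record)
    uri = URI.parse(feed_record.url)
    req = Net::HTTP::Get.new(uri.request_uri)
    req['If-Modified-Since'] = feed_record.last_fetched_at.httpdate if feed_record.last_fetched_at
    req['If-None-Match']     = feed_record.etag if feed_record.etag

    res = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == 'https') do |http|
      http.request(req)
    end

    return nil if res.code == '304'                   # Server says: not modified

    incoming_hash = Digest::MD5.hexdigest(res.body)
    return nil if incoming_hash == feed_record.hash   # Body is byte-identical to last time

    feed_record.etag            = res['ETag']         # Save for the next request
    feed_record.hash            = incoming_hash
    feed_record.last_fetched_at = Time.now
    res.body                                          # Changed: go on to compare items, etc.
  end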

Website Migrations: CodePlane, Nginx, Passenger

I’m just about done (famous last words) with a long, and more than a bit tedious, migration from Slicehost to Linode. Both are “cloud-style” VPSs, where you can do immediate one-click backups, cloning, upgrading, etc. As VPSs, you get to choose a raw initial Linux stack, and you’re then on your own to add on web servers, command-line tools, etc.

While Slicehost has worked out fine, notwithstanding [the 2008 exit of its founders](http://37signals.com/founderstories/slicehost), Linode has much better rates these days and also seems to be gaining traction with the NodeJS community. Maybe it’s the fortunate name affinity. There’s also the distinct possibility that Slicehost’s parent Rackspace will close down Slicehost itself and move things over to Rackspace. Linode also has great support docs. In any event, I’ve really been wanting to migrate from lighty to Nginx and start using CodePlane for a code repo, so I was going to be stuck doing some work anyway.

A few random notes …

Setup

I don’t intend this to be a review of Linode. Suffice to say, it was a super-easy process, and the Getting Started guide made everything nice and simple. Anyone coming from a similar environment like Slicehost would have no trouble. A nice bonus is the built-in monitoring charts, basically the kind of thing you’d get from a monit install, but right there on the admin panel, with no need to install anything. They’re powered by RRD.

Another nice bonus is ipv6 support. It’s been a long time coming, but I’m finally ready for the internet of IP-labelled things and beyond! 173.255.208.243? That’s not cool. You know what’s cool? 2600:3c01::f03c:91ff:fe93:5a3e/64.

Passenger on Nginx

I’d previously used Thin on Nginx, but Passenger promised to be a lot easier. Turned out maybe not: Passenger needs a custom Nginx install. I’d already set up Nginx with apt-get, and in retrospect I should have tried to roll it back. So there were some challenges configuring things (the default Passenger-powered Nginx goes into /opt/nginx, whereas the Ubuntu one is customised for the usual suspects on Ubuntu: binary in /usr/sbin, conf in /etc/, and so on).

With the custom Passenger-powered Nginx, the core Nginx config needs no Passenger or Ruby reference. (i.e. you don’t do the Apache/lighttpd equivalent of declaring a dynamic module; the lack of such a mechanism is why the custom Nginx install is necessary.) You only need to enable Passenger in the site-specific config. For a Sinatra app, with config.ru at /path/to, you do this:

server {
  passenger_enabled on;
  location /static {
    root /path/to/public;
    index index.html;
  }
}

(Yes, you never need to point directly to the config.ru or its directory.)

PHTML Suffix

I still have some weird bug preventing *.phtml files from running. If no one knows the answer, I’ll just rename them.

Mediawiki

After much messing around, I ended up with something that works for the (unendorsed, I think) “cool URI” scheme, i.e. site.com/EntryName (versus site.com/index.php?title=EntryName).

  location / {
    rewrite ^/([a-zA-Z0-9_:]+)$ /wiki/index.php?title=$1 last;
    rewrite ^/$ /wiki/index.php?title=Main_Page last;
    root  /path/to/public;
  }

Self-explanatory, I think. Along the way, I learned about Nginx’s try_files mechanism, which fits nicely with many PHP CMSs like WordPress and MediaWiki, where a single front controller is the gateway to the entire app. You can do stuff like try_files $uri $uri/ /index.php?page=$request_uri, though I didn’t need it here.

WordPress

WordPress was similarly simple:

    if (!-e $request_filename) {
      rewrite ^/(.+)$ /index.php?p=$1 last;
    }

One quirk I discovered was that I couldn’t make online changes using WordPress’s SFTP facility. This happened at Slicehost too. I eventually discovered the cause: the directory needs to be owned by the web server user. The confusing thing is that it looks like SFTP isn’t set up right, or you have the password wrong, or something. Glad to finally know this.

CodePlane

On a slightly separate note, I started using CodePlane as a code repo. It’s similar to GitHub’s private repos, but allows unlimited projects. While I’d like to support GitHub, given all the value it provides, it’s not feasible to store dozens of small throwaway projects there, so CodePlane is really neat so far. And the developer is highly responsive: I noted on GetSatisfaction that the homepage doesn’t personalise if you’re logged in, and he fixed it in a matter of hours. He’s also been open to engaging about some other suggestions I had. So it’s working out nicely so far.

Ruby Script to Localise Images for Offline HTML

I’m maintaining a custom HTML5 slide framework. A little similar to the canonical slides.html5rocks.com insofar as, well, it’s HTML5 and slides horizontally! But the key difference is that it’s very markup-based – What You See Is What You Need (WYSIWYN) – so creating new slides is easy.

Anyway, I did something similar with TiddlySlides a little while ago, and @FND created a nice Python script to inline external resources: http://mini.softwareas.com/using-fnds-excellent-spapy-to-make-a-single-p. I wanted something slightly different; since this is markup, I can’t rely on an <img src… pattern. I could possibly have incorporated changes into Fred’s SPA, but given that I need it for a GDC presentation tomorrow (and I said the same thing to myself the last time I did these slides, and it didn’t happen), I opted to make a Ruby script which is general-purpose but meets the specific needs of my slides. See the gist.

  #!/usr/bin/env ruby
  # This script will download all images referenced in a file (URLs ending in
  # jpg/gif/png), stick them in an images/ directory if they're not already there,
  # and make a new file referencing the local directory.
  #
  # The script depends on the http://github.com/nahi/httpclient library.
  #
  # USAGE
  # localiseImages.rb index.html
  # ... will create images/ containing images and local.index.html pointing to them.
  #
  # The point is to cache images so your HTML works offline. See also spa.py
  # http://mini.softwareas.com/using-fnds-excellent-spapy-to-make-a-single-p

  require 'httpclient'

  ### UTILITIES
  IMAGES_DIR = 'images'
  Dir.mkdir(IMAGES_DIR) unless File.directory?(IMAGES_DIR)

  def filenameize(url)
    IMAGES_DIR + '/' + url.sub('http://', '').gsub('/', '__')
  end

  def save(filename, contents)
    file = File.new(filename, "wb") # binary mode, since we're mostly writing images
    file.write(contents)
    file.close
  end

  ### CORE
  def saveImage(url)
    save(filenameize(url), HTTPClient.new.get_content(url))
  end

  def extractImages(filename)
    contents = File.open(filename, "rb").read
    localContents = String.new(contents)
    contents.scan(/http:\/\/\S+?\.(?:jpg|gif|png)/im) { |url|
      puts url
      saveImage(url) unless File.file?(filenameize(url))
      localContents.gsub!(url, filenameize(url))
    }
    save("local." + filename, localContents)
  end

  ### COMMAND-LINE
  extractImages(ARGV[0])

Aside: this is also related to “offline” web technologies… my article on “Offline” recently went live at HTML5Rocks: “Offline”: What does it mean and why should I care?

Preventing a Rails (Mongrel App) from Crashing

I’ve had a Rails website which works fine for about 12-18 hours, then starts giving out intermittent 500 errors because the mongrels die.

After searching around, I ended up fixing it on two levels.

(a) Direct solution: fix the MySQL config. One reason Mongrels die is stale MySQL connections (the server times out idle connections and Rails keeps trying to use them), leading to starvation. Apparently there’s a bug here which means you have to set “ActiveRecord::Base.verification_timeout = 14400” in environment.rb; the figure must be less than MySQL’s interactive timeout, which is 8 hours (28800 secs, so this is half of that). But as this thread points out, it doesn’t seem like that will achieve a whole lot on its own, so there’s also a tiny Strac hack you can include in the Rails code. Basically the hack is to set a reconnect flag when establishing a connection. The code is shown in the aforementioned thread.
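As a minimal sketch, the environment.rb side of it looks like this (the reconnect hack itself lives in the thread linked above, so it isn’t reproduced here):

  # config/environment.rb
  # Verify connections that have been idle longer than this (in seconds) before
  # using them. 14400s = 4 hours, i.e. half of MySQL's default 8-hour timeout.
  ActiveRecord::Base.verification_timeout = 14400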

(b) Risk mitigation: automate monitoring and automatically redeploy when it fails. I’ve always done silly things with cronjobs to automate redeployment; that got the job done okay, but it’s definitely an admin smell. Nagios seemed too complicated. I just noticed this Monit tool, which seems to be gaining traction in the Rails community, and it turned out to be pretty easy to set up. It wakes up every three minutes (by default) and runs a specified command if specified conditions are(n’t) met. I hope Cap or Deprec will introduce support for Monit in the future.

Rails JSON, JSON Rails

Man, JSON has come out of nowhere! I first came across it amid the Javascript hype in early ’05; now it’s everywhere. Not only in your Ajax apps, but in the On-Demand Javascript APIs of the world.

Since JSON shares an ethic of simplicity with Rails, it’s not surprising these technologies have come closely together, with JSON support now baked into the Rails core. (They would have been even closer together, if only anyone had noticed JSON==YAML early on and merged the two completely.) And yet, the current implementation misses a couple of tricks.

  • Use of “attributes”. According to Rails, a JSON object generally has just a single key, “attributes”. Instead of the simple { :name=>"Pac-Man", :creator=>"Namco" }, we get (and have to give) { :attributes => { :name=>"Pac-Man", :creator=>"Namco" }}. That’s not DRY. Not that it matters, but the reason is that ActiveRecord internally stores persistable data fields in a hashmap called “attributes”. The JSON serialization is therefore faithful to Rails’ internal implementation, while being unfaithful to intuition. Maybe you could argue for “attributes” on the basis that you might one day have a transient attribute with the same name as a persistent attribute, in which case you’d get a clash. But Convention Over Configuration rules this argument out; you can always override JSON serialization behavior in boundary situations.
  • Child attributes. This is probably an issue with XML as well. Basically, there is no attention paid to child attributes, i.e. JSON/XML serialization of a HABTM or has_many relationship. You want something like { :name=>"Pac-Man", :creator=>"Namco", :high_scores=>[{ :player_id=>1, :score=>99999 }, {:player_id=>2, :score=>88888 }] }, but all you get is the top-level attributes :name and :creator. (Both issues are sketched below.)
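Here’s that sketch, using a hypothetical Game model and the behaviour described above (output trimmed and hand-formatted for readability):

  game = Game.new(:name => "Pac-Man", :creator => "Namco")

  game.to_json
  # What Rails gives you: everything wrapped in "attributes", children ignored
  # => '{"attributes": {"name": "Pac-Man", "creator": "Namco"}}'

  # What you generally want: flat attributes, plus the has_many children inline
  # => '{"name": "Pac-Man", "creator": "Namco",
  #      "high_scores": [{"player_id": 1, "score": 99999},
  #                      {"player_id": 2, "score": 88888}]}'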

What to do?

First, I just discovered an extremely helpful library: Jsonifier, which also ships as a Rails plugin. It deals with both of the above problems. There’s no “attributes” chaff, and it handles associated records, even going recursive and letting you navigate from one end of a many-to-many to the other. (In my example, you could show the names and ages of Pac-Man high scorers.) You also get to whitelist or blacklist attributes with syntax that matches Rails’ own serialization options, i.e. :only and :except. Highly recommended, check it out.

Jsonifier, however, is pure to_json: it doesn’t (yet) handle deserialization. To my knowledge, there’s no Rails library that persists both parent and children in one go. There are lots of forum posts about handling multiple checkboxes and the like, and the answer usually involves overriding “children=” or “child_ids=”. I want this to happen automatically!

I’ve been working out how to extend ActiveRecord to persist automagically. I have the following very rough code working in basic situations, although it would need more validation to work in production, among other things. It works on data structures such as the recursive one above. You must first use a before_filter to convert the incoming JSON string into a Rails structure.

  class << ActiveRecord::Base

    # Creates the parent record and any child collections found in params,
    # all within one transaction.
    def recursive_create(params)
      transaction do
        collections_by_children = pluck_hash_collections_by_children params
        new_model = self.create(params)
        populate_children new_model, collections_by_children
        new_model
      end
    end

    # Removes array-valued entries (the child collections) from params and
    # returns them keyed by association name.
    def pluck_hash_collections_by_children(params)
      hash_collections_by_children = Hash.new
      params.each_pair { |name, value|
        if value.class==Array
          hash_collections_by_children[name] = value
          params.delete(name)
        end
      }
      hash_collections_by_children
    end

    # Replaces each child collection wholesale with the records described
    # by the incoming hashes.
    def populate_children(new_model, hash_collections_by_children)
      hash_collections_by_children.each_pair { |children_name, hash_collection|
        new_model.send("#{children_name}").destroy_all
        child_class_name = children_name.singularize.camelize
        child_class = Object.const_get(child_class_name)
        hash_collection.each { |child_hash|
          child = child_class.create(child_hash)
          new_model.send("#{children_name}") << child
        }
      }
    end

  end

  module ActiveRecord
    class Base

      # Instance-level counterpart: update the parent's attributes and replace
      # its child collections.
      def recursive_update_attributes(params)
        transaction do
          collections_by_children = self.class.pluck_hash_collections_by_children params
          self.update_attributes(params)
          self.class.populate_children self, collections_by_children
          save
        end
      end

    end
  end
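A hedged usage sketch, assuming a hypothetical Game model with a high_scores association, and params already parsed from the incoming JSON (e.g. in a before_filter):

  game = Game.recursive_create(
    "name"        => "Pac-Man",
    "creator"     => "Namco",
    "high_scores" => [
      { "player_id" => 1, "score" => 99999 },
      { "player_id" => 2, "score" => 88888 }
    ]
  )

  # Later, replace the children along with the parent's attributes:
  game.recursive_update_attributes(
    "name"        => "Pac-Man CE",
    "high_scores" => [{ "player_id" => 3, "score" => 123456 }]
  )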

Ruby/Rails: Overriding NilClass

Rails uses “whiny nil”, which means if you call a method on an object that happens to be nil (null), you get an exception. This is good. But with strings in a web app (in any language), you often don’t know if an empty string will be nil or simply zero-length (""). That’s because some code will see a form submitted with a whitespace string and make it nil, while other code will make it "". This might be my libraries, the web framework’s libraries, or plugin libraries. This leads to ugly code like this:

  @message = "Oi! Enter your name fella" if name.nil? or name.empty?

When we really just want

  @message = "Oi! Enter your name fella" if name.empty?

So I derive great pleasure from overriding NilClass like so:

  class NilClass
    def empty?
      true
    end
  end
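With that in place, a quick sanity check in the console:

  nil.empty?     # => true, thanks to the override
  "".empty?      # => true, as always
  "fella".empty? # => false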