I have a Rails website which works fine for about 12-18 hours, then starts giving out intermittent 500 errors because the Mongrels die.
After searching around, I ended up fixing it on two levels.
(a) Direct solution - Fix MySQL config
One reason Mongrels die is that MySQL connections don’t time out, leading to starvation. Apparently there’s a bug here which means you have to set “ActiveRecord::Base.verification_timeout = 14400” in environment.rb. The figure must be less than MySQL’s interactive timeout, which is 8 hours (28800 secs, so this is half of that). But as this thread points out, it doesn’t seem like that will achieve a whole lot on its own, so there’s also a tiny Strac hack you can include in the Rails code. Basically, the hack is to set a reconnect flag when establishing a connection. The code is shown in the aforementioned thread.
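For reference, here is a rough sketch of the timeout setting, with the arithmetic spelled out. The constant names are my own, and the guard is only there so the snippet also runs outside a Rails app; it assumes a Rails version that still exposes verification_timeout on ActiveRecord::Base. It does not include the reconnect hack, which lives in the thread.

```ruby
# MySQL's interactive timeout in seconds (8 hours), as quoted above.
# Constant names are illustrative, not part of Rails or MySQL.
MYSQL_INTERACTIVE_TIMEOUT = 28_800

# Half of the MySQL figure, so the value stays safely below the server-side limit.
VERIFICATION_TIMEOUT = MYSQL_INTERACTIVE_TIMEOUT / 2

unless VERIFICATION_TIMEOUT < MYSQL_INTERACTIVE_TIMEOUT
  raise "verification timeout must be less than MySQL's interactive timeout"
end

# In a Rails app this assignment would go in environment.rb.
# The guard skips it when ActiveRecord is not loaded or the setter is absent.
if defined?(ActiveRecord::Base) && ActiveRecord::Base.respond_to?(:verification_timeout=)
  ActiveRecord::Base.verification_timeout = VERIFICATION_TIMEOUT
end
```

If you have changed MySQL's timeout on your server, adjust the first constant to match rather than hard-coding 14400.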
(b) Risk mitigation - Automate monitoring and automatically redeploy when it fails
I’ve always done silly things with cronjobs to automate redeployment. That gets the job done okay, but it’s definitely an admin smell. Nagios seemed too complicated. I just noticed that this Monit tool seems to be gaining traction in the Rails community, and it turned out to be pretty easy to set up. It wakes up every three minutes (by default) and runs a specified command if specified conditions are(n’t) met. I hope Cap or Deprec will introduce support for Monit in the future.
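To make the "check a condition, run a command if it fails" idea concrete, here is a minimal Ruby sketch of a single monitoring pass. This is not Monit's configuration or its implementation, just the general pattern; monitor_once and the lambdas passed to it are hypothetical names of my own.

```ruby
# One monitoring pass: run the health check and, only if it fails,
# run the restart action. A check that raises counts as a failure.
# Returns :ok or :restarted so the caller can log what happened.
def monitor_once(check:, restart:)
  healthy = begin
    check.call
  rescue StandardError
    false
  end
  return :ok if healthy

  restart.call
  :restarted
end

# Example with stand-in lambdas: a failing check triggers exactly one restart.
restart_count = 0
failing_status = monitor_once(check: -> { false }, restart: -> { restart_count += 1 })
healthy_status = monitor_once(check: -> { true }, restart: -> { restart_count += 1 })

# A real loop would wrap this in something like:
#   loop { monitor_once(check: ..., restart: ...); sleep 180 }
# with the check hitting the site and the restart calling your deploy script.
```

A dedicated tool is still the better choice over a hand-rolled loop like this, since it also has to survive its own crashes and reboots.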