About WebWait and Caching

A question I’m often asked about WebWait came in today, so I’ll answer it here for reference.

WebWait User asks:

I have been using webwait for a while and have a quick question for you. When running multiple calls on the same website, is each call downloading the entire page again, or is the information being loaded from the browser cache?

My answer:

It will do whatever the browser would do if the page was loaded normally, so that usually means the 2nd through Nth time it will load from the cache. To counteract that, you can simply disable your browser cache while performing your tests. Or, if you do want to test cache performance, just open your site once (either in the browser or WebWait) and then start the WebWait tests, obviously keeping the cache enabled throughout.

More RSS Client Optimizations: Preventing Re-Fetch

Background: Has the Feed Changed?

I previously mentioned some work I did to cut down processing and IO on an RSS client. Yesterday, I was able to continue this effort with some more enhancements geared around checking if the feed has changed. These changes are not just important for my server’s performance, but also for being a good internet citizen and not hammering other people’s machines with gratuitous requests. Note that everything in this article is basic hygiene for anyone who’s written any kind of high-scale bot, but I’m documenting it here as it was useful learning for me.

Normally, a fetch requires the client to compare the incoming feed against what has been stored. This requires a lookup on the database and a comparison process. It’s read-only, so not hugely expensive, but it does require reading a lot (all items in the feed) at frequent intervals.

All this comparison effort would be unnecessary if we could guarantee the feed hasn’t changed since the last fetch. And of course, most of the time, it won’t have changed. If we’re fetching feeds hourly, and the feed changes on average once a week, then we can theoretically skip the whole comparison 99.4% of the time!
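As a sanity check on that number, here’s the back-of-the-envelope arithmetic (my own working, not from the original post):

```ruby
# Hourly fetches, one change per week on average: only 1 fetch in 168
# finds anything new, so nearly all comparisons are skippable.
fetches_per_week = 24 * 7                  # 168
skip_ratio = 1 - (1.0 / fetches_per_week)
puts (skip_ratio * 100).round(1)           # prints 99.4
```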

So how can we check if the feed has changed?

Feed Hash

The brute-force way to check if the feed has changed is to compare the feed content with what we received last time. We could store each incoming feed in a file and, if the next one we suck down is identical to the stored copy, safely skip it.

Storing a jillion feed files is expensive and unnecessary. (Though some people might temporarily store them, if they’ve separated the fetching from the comparison, to prevent blockages; I haven’t done that here.) If all we need the files for is a comparison, we can instead store a hash. With a decent hash, the chance of a false positive is extremely low, and the severity in this context is also extremely low.

So the feed now has a new hash field.

    require 'digest'

    incoming_feed = fetch_feed(feed_record.url)
    incoming_hash = Digest::MD5.hexdigest(incoming_feed.body)
    return if incoming_hash == feed_record.hash # Files match, no comparison necessary

    feed_record.title = incoming_feed.title
    feed_record.hash = incoming_hash # Save the new hash for next time
    # ... Keep processing the feed. Compare each item, etc.

HTTP If-Modified-Since

The HTTP protocol provides its own support for this kind of thing, via the If-Modified-Since request header. So we should send this header, and we can then expect a 304 Not Modified response in the likely event that no change has happened. This will save transferring the actual file, as well as bypassing the hash check above. (However, since this is by no means universally supported, we still need the above check as an extra precaution.)

    require 'net/http'
    require 'time' # for Time#httpdate, the HTTP-date format

    uri = URI(feed_record.url)
    req = Net::HTTP::Get.new(uri)
    req.add_field("If-Modified-Since", last_fetched_at.httpdate) if last_fetched_at
    # ...
    res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    return if res.code == '304' # We don't even need to compare hashes

ETag

Another HTTPism is the ETag, a value that, like our hash, should change whenever the feed content changes. So to be extra sure we’re not re-processing the same feed, and hopefully not even fetching the whole feed, we can save the ETag and include it in each request, via the If-None-Match header. It works like If-Modified-Since: if the server is still serving the same ETag, it will respond with an empty 304.

    req.add_field("If-None-Match", etag) if etag
    # ...
    # Again, we return if res.code == '304'
    feed_record.etag = incoming_feed.etag # Save it for next time

For the record, about half of the feeds I’ve tested (mostly from fairly popular sources, many of them commercial) include ETags. And of those, at least some change the ETag unnecessarily often, which renders it useless in those cases (actually worse than useless, since it consumes unnecessary resources). Given that level of support, I’m not convinced it adds much value over just using If-Modified-Since, but I’ll leave it in for now. I’m sure the managers of those servers which do support it would prefer it be used.
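To tie these checks together, here’s a sketch of the decision logic as a pure function, easy to test without a network. The name `feed_changed?` and its arguments are mine, not from the client’s actual code:

```ruby
require 'digest'

# Decide whether the feed needs full processing. status_code and body come
# from the HTTP response; stored_hash is what the feed record saved last time.
def feed_changed?(status_code, body, stored_hash)
  return false if status_code == 304            # conditional GET: unchanged
  Digest::MD5.hexdigest(body) != stored_hash    # hash fallback, for servers
end                                             # that ignore the headers

body = "<rss>same old items</rss>"
stored = Digest::MD5.hexdigest(body)
feed_changed?(304, "", stored)                     # => false
feed_changed?(200, body, stored)                   # => false
feed_changed?(200, "<rss>new item!</rss>", stored) # => true
```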

The Weirdness of Ajax Redirects: Some Workarounds

I’ve been dealing with the situation where your server redirects after a DELETE, which is actually a fairly common use case: the user deletes a blog post, and you redirect them to the index of all posts. The article below has some Rails workarounds, but it’s relevant to anyone using XHR and perhaps HTML5 History, i.e. many single-page web apps.

It’s important stuff to know if you’re HTML5ing in earnest, because a recent change to Chrome makes it standards-compliant, but fundamentally different from other browsers, and makes it handle DELETE and other verbs differently from how it handles POST and GET.

DELETE redirection

This redirection becomes troublesome if you want to do this in a way which will seamlessly support Ajax calls, as well as standard web requests. It’s troublesome because of a lethal combination of two browser facts.

Firstly, XHR automatically follows redirects. Your JavaScript has no way to jump in and veto or alter any redirect that comes from the server. XHR call fires off to server, server responds with 302 status code and a location header, the browser automatically issues a new request to the specified location.

Secondly, a DELETE request, when it’s redirected, will actually cause another DELETE request to the new location. Yes, if you’re redirecting to the index, congratulations … your server now receives a request to DELETE the fraking index. You lose!!! So the very normal thing to do for a non-Ajax app suddenly becomes an epic tragedy. It’s completely unintuitive, but it’s actually the standard! I learned as much by filing an erroneous Chrome bug. Turns out Chrome’s unintuitive behaviour is actually correct and Firefox and Opera were wrong. Even more confusing, I was seeing POST requests being converted to GETs, and it turns out this is also part of the 302 standard – POSTs can be converted to GETs, but not DELETEs (or PUTs etc I assume).

Just now, I made a little workaround which seems to be working nicely in my case. In my ApplicationController, my app’s own subclass of ActionController::Base, I simply convert any 302 response (the default) to a 303, for redirect_to calls. Maybe I should do this only for XHR calls, to be “pure” when a normal call comes in, but it makes no real difference anyway.

    class ApplicationController < ActionController::Base

      # see src at https://github.com/rails/rails/blob/master/actionpack/lib/action_controller/metal/redirecting.rb
      def redirect_to(options = {}, response_status = {})
        super(options, response_status)
        self.status = 303 if self.status == 302
      end

    end

Finding the new location

While I’m blogging this, let me tell you about another related thing that happens with redirects (this section applies to any type of request – GET, POST, DELETE, whatever). As I said above, XHR will automatically follow redirects. The problem is, how do you know where the ultimate response actually came from? XHR will only tell you where the initial request went; it refuses to reveal what it did afterwards. It won’t even tell you whether there was a redirect at all. Knowing the ultimate location is important for HTML5 apps using the History API, because you may want to pushState() to that location. The solution is to have the web service output its own location in a response header. Weird that you have to, but it gets the job done.

    class ApplicationController < ActionController::Base

      # ...
      before_filter proc { |controller|
        controller.response.headers['x-url'] = controller.request.fullpath
      }

    end

Two-Way Web: Can You Stream In Both Directions?

Update (couple of hrs later): I mailed Alex Russell (the guy who named Comet and knows plenty about it), it sounds like he’s been investigating this whole area and he’s sent me his views.

We know about Comet (AKA Push, HTTP Streaming) and its ability to keep streaming info from server to browser. How about streaming upwards, from browser to server, and preferably in the same connection? A reader mailed me this query:

I’m missing one demo: would it be possible to reuse the same stream in the streaming demos to send a msg to the server? I’ve been digging through your examples, but they all seem to create a new connection to the server when posting. It would be very interesting to see a demo that does this within the same stream, and of course the server code would be as interesting as the client.

Here’s my thinking, I’m sure a lot of smart readers will know more about this and I’ll be interested in your views – is it feasible? Any online demos?

Unfortunately, I’ve not seen anyone pull this off – it’s always assumed you need a “back channel”. It’s the kind of hack someone like Google or 37S would turn around and pull off even though it’s “obviously impossible” 😉 .

There are two key issues:

(1) Server needs to start outputting before incoming request is finished. With a specialised server, this problem could be overcome.

(2) (More serious as we can’t control the browser) The browser would need to upload data in a continuous stream. You can do it with Flash/Java, but I can’t see how to do this with standard JS/HTML. If you use XHR, you’re going to call send() and wave goodbye to the entire request…there’s no support for sequencing it. Same if you submit a regular form, change IFrame’s source etc. Even if you could somehow delay reading of content so it’s not immediately uploaded, the browser would probably end up not sending anything at all as it would be waiting to fill up a packet.

I think the solution lies in the Keep-Alive extension to HTTP 1.1:

What is Keep-Alive?

The Keep-Alive extension to HTTP, as defined by the HTTP/1.1 draft, allows persistent connections. These long-lived HTTP sessions allow multiple requests to be sent over the same TCP connection, and in some cases have been shown to result in an almost 50% speedup in latency times for HTML documents with lots of images.

If you google for “xmlhttprequest keep-alive” or “ajax keep-alive”, you’ll see people talking about the idea a bit, but there’s not much info on how to script it for continuous connections and no demos to be found. It would make a great experiment if someone did a proof-of-concept!
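For context, here’s roughly what keep-alive looks like on the wire (a sketch: in HTTP/1.1 persistent connections are the default, and HTTP/1.0 clients opt in via the Connection header; both requests below travel over the one TCP socket):

```
GET /page1.html HTTP/1.1
Host: example.com
Connection: keep-alive

GET /page2.html HTTP/1.1
Host: example.com
Connection: keep-alive
```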

As an alternative, you could consider a thin, invisible Flash layer to handle transport, and degrade to frequent Submission Throttling where Flash isn’t an option.

BTW I have a post and podcast planned about the whole two-way web thing, which will be profound (the two-way web thing, not the podcast :-)). The web is entering a new era of Real-Time Collaboration and Communication, post-Ajax (and of course building on Ajax, just as Ajax builds on the technologies of the previous era: CGI, DHTML, CSS, etc).

Update: As mentioned above, Alex Russell mailed me his views. In particular, it’s interesting to consider the possibility that browsers might transparently exploit keep-alive if you hit the server frequently enough.

So I’ve spent some time investigating this (as you might expect), and at the end of the day there’s not much to be done aside from using Flash and their XMLSocket interface. That’s an obvious possibility given the high-performance Flash communication infrastructure we have in Dojo. Doing bi-directional HTTP probably won’t happen, though I don’t think that’s cause for despair. In my tests, we can get really good (relative) performance out of distinct HTTP requests so long as the content of the request is kept to a minimum and the server can process the connection fast enough. HTTP keepalive exists at a level somewhat below what’s currently exposed to browsers, so if the client and server support it, frequent requests through stock XHR objects may very well be using it anyway. We’ll have to do some significant testing to determine what conjunctions of servers/clients might do this, however.

There are even more exotic approaches available from Flash peering that I’ve been investigating as well, but they would require infrastructure so different from what we already deploy that I think they’re still in the land of “hrm…someday”.

First we have to solve the *regular* Comet scalability problems for existing servers and app containers.

Regards

PS: we haven’t been making much noise about it, but serious work has started on an Open Source Comet protocol with initial implementations in both Perl and Python over at http://cometd.org. The initial client library is Dojo-based, but we’ll be publishing the protocol so that anyone can “play” with it.

Comet: It’s Ajax for “Push” (Podcast)

Here’s a podcast about Comet – exploring the two-way web with Ajax. From my Ajaxian post earlier today:

Alex Russell has coined a term for a flavour of Ajax that’s been getting more attention of late. Comet describes applications where the server keeps pushing – or streaming – data to the client, instead of having the browser keep polling the server for fresh content. Alex identifies several buzzworthy examples:

This is an important article because it captures a growing trend in Ajax, a trend I had in mind when I said we expect to hear more about “Push and the Two-Way Web” in the next twelve months, on the occasion of Ajax’s birthday. There will, of course, be people saying “there’s nothing new here”, and that’s presumably all too obvious to Alex himself, who has worked with these ideas for a long time. But as with Ajax, it’s the power of a name. I don’t think these ideas can adequately be described as Ajax, because Ajax changes a lot about the browser whereas Comet fundamentally changes the nature of browser-server communication. I see Comet as part of the overall Ajax trend, complementary to the UI aspects of Ajax.

People may also say there are existing names like “Push”. True, but they have baggage – I think it’s useful to have a name for this architectural pattern in light of the relationship to Ajax.

Anyways, I wanted to expand on some of the thoughts in the article and after the recent Basics of Ajax Podcast, I’m in the mood for more audio rambling. So here’s a 56-minute discussion about Comet and the general trend of push and streaming within Ajax.

Click to download the Podcast. You can also subscribe to the
feed if you want future podcasts automatically downloaded - check out the
podcast FAQ at http://podca.st.

Shownotes…

It's the Duplex, Stupid! Push or Pull - it doesn't matter so much. What's critical here is the focus on the two-way web.

Applications:

  • Chat
  • Wiki
  • News
  • Current events, sport, financials, etc.
  • Trading and Auctions
  • Real-time control and logistics
  • File transfer (combine with local storage)
  • Any other genre you'd care to name

Vanilla Ajax: Await the User

Comet Ajax: Keep Pushing

Polling Ajax: Keep Pulling

Benefits of Comet:

  • Responsive: data pumped out immediately
  • More stable profile
  • Less overhead of establishing connections

Benefits of Polling:

  • Browser memory
  • Can run on any standard server; Comet requires a suitable server
  • Can upload at the same time
  • Can run on any browser - with Comet, XHR and IFrame won't always reflect changes while the connection's open
  • Being more standard, works with existing infrastructure; Comet is vulnerable to middle-men dropping the connection
  • Simpler architecture - only the browser's in control
  • Easier to test
  • More familiar architecture
  • Less programming effort - with Comet, you must watch for changes on the stream
  • More efficient for infrequently accessed data
  • Leverages caching

Maybe Comet causes more pain, but if it keeps the user happy ...

Questions and Trends:

  • Which to use - variables include: frequency of updates, importance of updates, server capabilities, target browsers
  • Dealing with incoming messages, e.g. Distributed Events pattern, Event bus (browser or server?), etc.
  • Workarounds for throbber, status bar, clicking sound, etc.
  • How often to drop connections
  • How browsers can accommodate it

Proof-Of-Concept Demos:

  • Wiki using Periodic Refresh/Polling
  • Wiki using HTTP Streaming/Comet (actually, this is only a very basic implementation - there's no use of events, just custom handling of HTTP)

Related Patterns:

  • HTTP Streaming
  • Periodic Refresh (aka Polling)
  • Distributed Events

As always, feedback is welcome – [email protected]

HTTP Streaming: An Alternative to Polling the Server

If Ajax apps are to be rich, there must be a way for the server to pass new information to the browser. For example, new stock quotes or an instant message someone else just sent you. But the browser’s not a server, so the server can’t initiate an HTTP connection to alert the browser. The standard way to deal with this dilemma is Periodic Refresh, i.e. having the browser poll the server every few seconds. But that’s not the only solution.

The recent podcast on Web Remoting includes a discussion of the HTTP Streaming pattern. By continuing to stream information from the server, without closing the connection, you can keep the browser content fresh. I wasn’t aware that it was being used much on the public web, since it can be costly, but I recently discovered JotLive (which is only semi-public since it requires registration) is indeed using it. Do you know any other examples?

Ajaxian.com’s interview with Abe Fettig of JotLive:

How do you handle the “live” part? Polling?

We’re using a (very slightly modified) version of LivePage, which Donovan Preston wrote as part of Nevow, a Python library for building web applications using the Twisted networking framework (which I just wrote a book on: Twisted Network Programming Essentials). LivePage doesn’t use polling. Instead, it uses a clever technique where each browser keeps an open XMLHTTP request to the server at all times, opening a new connection each time the old one closes. That way every client viewing the page is constantly waiting for a response from the server. When the server wants to send a message to a client, it uses the currently open request. So there’s no waiting.

A few (edited) extracts from the HTTP Streaming pattern:

Alternative Pattern: Periodic Refresh is an obvious alternative to HTTP Streaming. It fakes a long-lived connection by frequently polling the server. Generally, Periodic Refresh is more scalable and easier to implement in a portable, robust manner. However, HTTP Streaming can deliver more timely data, so consider it for systems, such as intranets, where there are fewer simultaneous users, you have some control over the infrastructure, and each connection carries a relatively high value.

Refactoring Illustration: The Basic Wiki Demo, which uses Periodic Refresh, has been refactored to use [HTTP Streaming](http://ajaxify.com/run/wiki/streaming).

Solution:
Stream server data in the response of a long-lived HTTP connection. Most web services do some processing, send back a response, and immediately exit. But in this pattern, they keep the connection open by running a long loop. The server script uses event registration or some other technique to detect any state changes. As soon as a state change occurs, it pushes new data to the outgoing stream and flushes it, but doesn’t actually close it. Meanwhile, the browser must ensure the user-interface reflects the new data. This pattern discusses a couple of techniques for Streaming HTTP, which I refer to as “Page Streaming” and “Service Streaming”.
“Page Streaming” involves streaming the original page response. Here, the server immediately outputs an initial page and flushes the stream, but keeps it open. It then proceeds to alter it over time by outputting embedded scripts that manipulate the DOM. The browser’s still officially writing the initial page out, so when it encounters a complete <script> tag, it will execute the script immediately. A simple demo is available at http://ajaxify.com/run/streaming/.
…(illustration and problems)…
“Service Streaming” is a step towards solving these problems, though it doesn’t work on all browsers. The technique relies on XMLHttpRequest Call (or a similar remoting technology like IFrame_Call). This time, it’s an XMLHttpRequest connection that’s long-lived, instead of the initial page load. There’s more flexibility regarding length and frequency of connections. You could load the page normally, then start streaming for thirty seconds when the user clicks a button. Or you could start streaming once the page is loaded, and keep resetting the connection every thirty seconds. Having a range of options helps immeasurably, given that HTTP Streaming is constrained by the capabilities of the server, the browsers, and the network.
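As a rough illustration of the “Page Streaming” technique described in the extract, here’s a sketch of the chunks a server might push down the wire. This is an illustrative fragment of the idea only, not the actual ajaxify.com demo code; in a real server each chunk would be flushed down the still-open socket as events occur:

```ruby
# Build the chunks of a "Page Streaming" response: an initial page, then a
# series of complete <script> tags, each of which the browser executes as
# soon as it arrives, followed by the page footer.
def page_stream_chunks(updates)
  chunks = ["<html><body><div id='status'>loading...</div>\n"]
  updates.each do |msg|
    chunks << "<script>document.getElementById('status').innerHTML = '#{msg}';</script>\n"
  end
  chunks << "</body></html>"
end

page_stream_chunks(["price: 101", "price: 102"]).each { |c| print c }
```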

Experiments suggest that the Page Streaming technique does work on both IE and Firefox ([1]), but Service Streaming only works on Firefox, whether XMLHttpRequest ([2]) or IFrame ([3]) is used. In both cases, IE suppresses the response until it's complete. You could claim that's either a bug or a feature; but either way, it works against HTTP Streaming.

Donovan Preston explains the technique he uses in Nevow, which overcomes this problem:

When the main page loads, an XHR (XMLHttpRequest) makes an “output conduit” request. If the server has collected any events between the main page rendering and the output conduit request rendering, it sends them immediately. If it has not, it waits until an event arrives and sends it over the output conduit. Any event from the server to the client causes the server to close the output conduit request. Any time the server closes the output conduit request, the client immediately reopens a new one. If the server hasn’t received an event for the client in 30 seconds, it sends a noop (the javascript “null”) and closes the request.
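Donovan's conduit cycle can be sketched server-side in a few lines. This is a simulation with an in-process queue; `output_conduit` is my illustrative name, not Nevow's API:

```ruby
require 'timeout'

# Hold the "output conduit" request open until an event arrives; if nothing
# comes within the timeout, send a noop so the client closes and reconnects.
def output_conduit(event_queue, timeout_secs: 30)
  Timeout.timeout(timeout_secs) { event_queue.pop }  # block until an event
rescue Timeout::Error
  'null'                                             # the noop Donovan mentions
end

events = Queue.new
events << '{"chat": "hello"}'
output_conduit(events)                    # => '{"chat": "hello"}' immediately
output_conduit(events, timeout_secs: 0.2) # => 'null' once the timeout fires
```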

Basics of Ajax 2 of 3: Web Remoting (XMLHttpRequest etc) (Podcast)

Ajax Basics 2 of 3

This is the second of three podcasts on the basic Ajax patterns.

  • Podcast 1: Display Patterns and the DOM.
  • Podcast 2: Web Remoting – XMLHttpRequest, IFrame Call, HTTP Streaming.
  • Podcast 3: Dynamic Behaviour – Events and Timing.

Podcast 2: Web Remoting (XMLHttpRequest, IFrame, HTTP Streaming)

Click to download the Podcast. You can also subscribe to the
feed if you want future podcasts automatically downloaded - check out the
podcast FAQ at http://podca.st.

This 75-minute podcast covers web remoting concepts and the following specific patterns:

  • XMLHttpRequest Call Use XMLHttpRequest objects for browser-server communication. (05:00)
  • IFrame Call Use IFrames for browser-server communication. (31:45)
  • HTTP Streaming Stream server data in the response of a long-lived HTTP connection. (47:00)

