CORS, Scraping, and Microformats

Jump straight to the demo.

Cross-Origin Resource Sharing makes it possible to do arbitrary calls from a web page to any server, if the server consents. It’s a typical HTML5 play: We could do similar things before, but they were with hacks like JSONP. Cross-Origin Resource Sharing lets us can achieve more and do it cleanly. (The same could be said of Canvas/SVG vs drawing with CSS; WebSocket vs XHR-powered Comet; WebWorker vs yielding with setTimeout; Round corners vs 27 different workarounds; and we could go on.)

This has been available for a couple of years now, but I don’t see people using it. Well, I haven’t checked, but I don’t get the impression many sites are offering their content to external websites, despite social media consultants urging them to be “part of the conversation”. It’s like when people make a gorgeous iPhone app, but their website doesn’t work at all in the same phone (cough fashionhouse) . Likewise, if you’ve got a public API, but not providing JSONP/callback support, it’s not very useful either…making developers host their own cross-domain proxy is tedious. It’s cool there are services like YQL and Embed.ly for some cases, but wouldn’t it be better if web pages could just pull in all that external content directly?

Except in this case, it’s just not happening. Everyone’s offering APIs, but no-ones sharing their content through the web itself. At this point, I should remind you I haven’t actually tested my assumption and maybe everyone is serving their public content with “Access-Control-Allow-Origin: *” … but based on the lack of conversation, I am guessing in the negative. The state of the universe does need further investigation.

Anyway, what’s cool about this is you can treat the web as an API. The Web is my API. “Scraping a web page” may sound dirtier than “consuming a web service”, but it’s the cleaner approach in principle. A website sitting in your browser is a perfectly human-readable depiction of a resource your program can get hold of, so it’s an API that’s self-documenting. The best kind of API. But a whole HTML document is a lot to chew on, so we need to make sure it’s structured nicely, and that’s where microformats come in, gloriously defining lightweight standards for declaring info in your web page. There’s another HTML5 tie-in here, because we now have a similar concept in the standard, microdata.

So here’s my demo.

I went to my homepage at mahemoff.com, which is spewed out by a PHP script. I added the following line to the top of the PHP file:

  1. <?
  2.   header("Access-Control-Allow-Origin: *");
  3.   ... // the rest of my script
  4. ?>

Now any web page can pull down “http://mahemoff.com/” with a cross-domain XMLHttpRequest. This is fine for a public web page, but something you should be very careful about if the content is (a) not public; or (b) public but dependent on who’s viewing it, because XHR now has a “withCredentials” field that will cause cookies to be passed if it’s on. A malicious third-party script could create XHR, set withCredentials to true, and access your site with the user’s full credentials. Same situation as we’ve always had with JSONP, which should also only be used for public data, but now we can be more nuanced (e.g. you can allow trusted sites to do this kind of thing).

On to the client …

I started out doing a standard XHR, for sanity’s sake.

javascript

  1. var xhr = new XMLHttpRequest();
  2.   xhr.open("get", "message.html", true);
  3.   xhr.onload = function() { //instead of onreadystatechange
  4.     if (xhr.readyState==4 && xhr.status==200)
  5.     document.querySelector("#sameDomain").innerHTML = xhr.responseText;
  6.   };
  7.   xhr.send(null);

Then it gets interesting. The web app makes a cross-domain call using the following facade, which I adapted from a snippet in the veritable Nick Zakas’s CORS article:

javascript

  1. function get(url, onload) {
  2.     var xhr = new XMLHttpRequest();
  3.     if ("withCredentials" in xhr){
  4.       xhr.open("get", url, true);
  5.     } else if (typeof XDomainRequest != "undefined"){
  6.       xhr = new XDomainRequest();
  7.       xhr.open("get", url);
  8.     } else {
  9.       xhr = null;
  10.     }
  11.     if (xhr) {
  12.       xhr.onload = function() { onload(xhr); }
  13.       xhr.send();
  14.     }
  15.     return xhr;
  16. }

This gives us a cross-domain XHR, for any browser that supports the concept, and it makes a request the usual way, and the request works against my site, but not yours, because of the header I set earlier on my site. Now I can dump that external content in a div:

javascript

  1. get("http://mahemoff.com/", function(xhr) {
  2.     document.querySelector("#crossDomain").innerHTML = xhr.responseText;
  3.     ...

(This would be a monumentally thick thing to do if you didn’t trust the source, as it could contain script tags with malicious content, or a phishing form. Normally, you’d want to sanitise or parse the content first. In any event, I’m only showing the whole thing here for demo purposes.)

Now comes the fun part: Parsing the content that came back from an external domain. It so happens that I have embedded hCard microformat content at http://mahemoff.com. It’s in the expandable business card you see on the top-left:

And the hCard content looks like this, based on :

  1. <div id="card" class="vcard">
  2.   <div class="fn">Michael&nbsp;Mahemoff</div>
  3.   <img class="photo" src="http://mahemoff.com/speak2.jpg"></img>
  4.   <div class="role">"I like to make the web better and sushi"</div>
  5.   <div class="adr">London, UK</div>
  6.   <div class="geo">
  7.     <abbr class="latitude" title="51.32">51&deg;32'N</abbr>,     <abbr class="longitude" title="0">0&deg;</abbr>
  8.   </div>  <div class="email">[email protected]</div>
  9.   <div class="vcardlinks">    <a rel="me" class="url" href="http://mahemoff.com">homepage</a>
  10.     <a rel="me" class="url" href="http://twitter.com/mahmoff">twitter</a>    <a rel="me" class="url" href="http://plancast.com/mahemoff">plancast</a>
  11.   </div>
  12. </div>

It’s based on the hCard microformat, which really just tells you what to call your CSS classes…I told you microformats were lightweight! The whole idea of the card comes from Paul Downey’s genius Hardboiled hCards project.

Anyway, bottom line is we’ve just extracted some content with hCard data in it, so it should be easy to parse it in a standard way and make sense of the content. So I start looking for a hCard Javascript library and find one, that’s the beauty of standards. Even better, it’s called Sumo and it comes from Dan Webb.

The hCard library expects a DOM element containing the hCard(s), so I pluck that from the content I’ve just inserted on the page, and pass that to the library. Then it’s a matter of using the “hCard” object to render a custom UI:

javascript

  1. var hcard = HCard.discover(document.querySelector("#crossDomain"))[0];
  2.  var latlong = new google.maps.LatLng(parseInt(hcard.geo.latitude), parseInt(hcard.geo.longitude));
  3.   var markerImage = new google.maps.MarkerImage(hcard.photoList[0], null, null, null, new google.maps.Size(40, 40));
  4.  var infoWindow = new google.maps.InfoWindow({content: "<a href='"+hcard.urlList[0]+"'>"+hcard.fn+"</a>", pixelOffset: new google.maps.Size(0,-20)});
  5.    ...

And I also dump the entire hCard for demo purposes, using James Padolsey’s PrettyPrint.

javascript

  1. document.querySelector("#hcardInfo").appendChild(prettyPrint(hcard));

There’s lots more fun to be had with the Web as an API. According to the microformats blog, 2 million web pages now have embedded hCards. Offer that content to the HTML5 mashers of the world and they will make beautiful things.

7 thoughts on CORS, Scraping, and Microformats

  1. Hey Michael,

    great Article. You should probably mention that putting * in the CORS is extremely dangerous with respect to cross site request forgery attacks and should basically only be done on pages that return absolutely, purely public info. If e.g. Facebook would do this, all accounts would be hacked in a second.

    Still it is great to have this and another feature killed off the “need Flash for that”-list

    Malte

  2. Part of the hassle of CORS (sic), is that MSFT, in their wisdom, make you use a different object for CORS than standard XHR var xdr = new XDomainRequest(); …

    Still, I suppose this will normalize, and get baked into libraries (eg) http://plugins.jquery.com/files/jquery.cors.js.txt

    @Malte – yes their is a CSRF problem here. No different really than JSONP (in fact better), or a cross domain proxy on a server, or a wide open crossdomain.xml file.. Guess this will shake out, but sensible authentication tokens should probably be used for all cross origin requests that access protected data.

    To some large extent, same origin policy implementations that we have been living with since Netscape Navigator 2.0 (http://en.wikipedia.org/wiki/Same_origin_policy#History) are a body bandage for a tick bite – the result of poorly thought through server side security. And never part of any spec that I am aware of.

    Cheers Tim

  3. @Malte Thanks mate. I tend to agree with @san1t1, the situation is better than JSONP because it’s more explicit and fine-grained. But yeah it’s dangerous and there will no doubt be the story of the bank that got it wrong in a few years. Emphasised the risk a bit more.

  4. Pingback: Ajaxian » The Web as an API

  5. Pingback: CORS, Scraping, and Microformats

  6. Pingback: OpenWeb-Notizen: PseudoID, Contacts API, Microformats - notizBlog

  7. Pingback: Including HTML Files from Other HTML Files, on the File:// System

Leave a Reply