CORS, Scraping, and Microformats

Jump straight to the demo.

Cross-Origin Resource Sharing makes it possible to do arbitrary calls from a web page to any server, if the server consents. It’s a typical HTML5 play: We could do similar things before, but only with hacks like JSONP. Cross-Origin Resource Sharing lets us achieve more and do it cleanly. (The same could be said of Canvas/SVG vs drawing with CSS; WebSocket vs XHR-powered Comet; WebWorker vs yielding with setTimeout; rounded corners vs 27 different workarounds; and we could go on.)

This has been available for a couple of years now, but I don’t see people using it. Well, I haven’t checked, but I don’t get the impression many sites are offering their content to external websites, despite social media consultants urging them to be “part of the conversation”. It’s like when people make a gorgeous iPhone app, but their website doesn’t work at all on the same phone (cough fashionhouse). Likewise, if you’ve got a public API but aren’t providing JSONP/callback support, it’s not very useful either … making developers host their own cross-domain proxy is tedious. It’s cool there are services like YQL and Embed.ly for some cases, but wouldn’t it be better if web pages could just pull in all that external content directly?

Except in this case, it’s just not happening. Everyone’s offering APIs, but no-one’s sharing their content through the web itself. At this point, I should remind you I haven’t actually tested my assumption and maybe everyone is serving their public content with “Access-Control-Allow-Origin: *” … but based on the lack of conversation, I am guessing in the negative. The state of the universe does need further investigation.

Anyway, what’s cool about this is you can treat the web as an API. The Web is my API. “Scraping a web page” may sound dirtier than “consuming a web service”, but it’s the cleaner approach in principle. A website sitting in your browser is a perfectly human-readable depiction of a resource your program can get hold of, so it’s an API that’s self-documenting. The best kind of API. But a whole HTML document is a lot to chew on, so we need to make sure it’s structured nicely, and that’s where microformats come in, gloriously defining lightweight standards for declaring info in your web page. There’s another HTML5 tie-in here, because we now have a similar concept in the standard, microdata.

So here’s my demo.

I went to my homepage at mahemoff.com, which is spewed out by a PHP script. I added the following line to the top of the PHP file:

  <?
    header("Access-Control-Allow-Origin: *");
    ... // the rest of my script
  ?>

Now any web page can pull down “http://mahemoff.com/” with a cross-domain XMLHttpRequest. This is fine for a public web page, but something you should be very careful about if the content is (a) not public; or (b) public but dependent on who’s viewing it, because XHR now has a “withCredentials” field that will cause cookies to be passed if it’s on. A malicious third-party script could create an XHR, set withCredentials to true, and try to access your site with the user’s full credentials. (Browsers only hand back a credentialed response when the server names the requesting origin explicitly and also sends “Access-Control-Allow-Credentials: true”; the wildcard alone isn’t honoured for credentialed requests, so the real danger is a server that echoes back any origin indiscriminately.) Same situation as we’ve always had with JSONP, which should also only be used for public data, but now we can be more nuanced (e.g. you can allow trusted sites to do this kind of thing).
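To make the “more nuanced” option concrete, here is a minimal sketch of how a server might choose its CORS headers per request. It isn’t what my PHP script does (that just sends the wildcard); the function name, the allowlist, and the example origins are all made up for illustration. The header names themselves are the standard ones.

```javascript
// Hypothetical allowlist of sites we trust with per-user content.
// These origins are placeholders, not real partners.
const TRUSTED_ORIGINS = new Set([
  "https://app.example.com",
  "https://partner.example.org",
]);

// Decide which CORS response headers to send for one request.
// requestOrigin is the value of the incoming "Origin" request header (or undefined).
function corsHeadersFor(requestOrigin, { publicResource = false } = {}) {
  if (publicResource) {
    // Truly public content: anyone may read it, and no credentials are involved.
    return { "Access-Control-Allow-Origin": "*" };
  }
  if (requestOrigin && TRUSTED_ORIGINS.has(requestOrigin)) {
    // Credentialed access: name the origin explicitly (the wildcard is not
    // accepted for credentialed requests) and tell caches the response varies.
    return {
      "Access-Control-Allow-Origin": requestOrigin,
      "Access-Control-Allow-Credentials": "true",
      "Vary": "Origin",
    };
  }
  // Unknown origin: send no CORS headers, so the browser blocks the read.
  return {};
}
```

The point of the design is that the default is “no headers at all”: a new page or a new origin gets nothing until someone decides it should.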

On to the client …

I started out doing a standard XHR, for sanity’s sake.

javascript

  var xhr = new XMLHttpRequest();
  xhr.open("get", "message.html", true);
  xhr.onload = function() { // instead of onreadystatechange
    if (xhr.readyState == 4 && xhr.status == 200) {
      document.querySelector("#sameDomain").innerHTML = xhr.responseText;
    }
  };
  xhr.send(null);

Then it gets interesting. The web app makes a cross-domain call using the following facade, which I adapted from a snippet in the veritable Nick Zakas’s CORS article:

javascript

  function get(url, onload) {
    var xhr = new XMLHttpRequest();
    if ("withCredentials" in xhr) {
      // XHR level 2: the regular object can go cross-domain
      xhr.open("get", url, true);
    } else if (typeof XDomainRequest != "undefined") {
      // IE8/9 use a separate object for cross-domain requests
      xhr = new XDomainRequest();
      xhr.open("get", url);
    } else {
      // no CORS support at all
      xhr = null;
    }
    if (xhr) {
      xhr.onload = function() { onload(xhr); };
      xhr.send();
    }
    return xhr;
  }

This gives us a cross-domain XHR in any browser that supports the concept, and it makes the request the usual way. The request works against my site, but not yours, because of the header I set earlier on my site. Now I can dump that external content in a div:

javascript

  get("http://mahemoff.com/", function(xhr) {
    document.querySelector("#crossDomain").innerHTML = xhr.responseText;
    ...

(This would be a monumentally thick thing to do if you didn’t trust the source, as it could contain script tags with malicious content, or a phishing form. Normally, you’d want to sanitise or parse the content first. In any event, I’m only showing the whole thing here for demo purposes.)

Now comes the fun part: Parsing the content that came back from an external domain. It so happens that I have embedded hCard microformat content at http://mahemoff.com. It’s in the expandable business card you see on the top-left.

And the hCard content looks like this:

  <div id="card" class="vcard">
    <div class="fn">Michael&nbsp;Mahemoff</div>
    <img class="photo" src="http://mahemoff.com/speak2.jpg"></img>
    <div class="role">"I like to make the web better and sushi"</div>
    <div class="adr">London, UK</div>
    <div class="geo">
      <abbr class="latitude" title="51.32">51&deg;32'N</abbr>,
      <abbr class="longitude" title="0">0&deg;</abbr>
    </div>
    <div class="email">[email protected]</div>
    <div class="vcardlinks">
      <a rel="me" class="url" href="http://mahemoff.com">homepage</a>
      <a rel="me" class="url" href="http://twitter.com/mahmoff">twitter</a>
      <a rel="me" class="url" href="http://plancast.com/mahemoff">plancast</a>
    </div>
  </div>

It’s based on the hCard microformat, which really just tells you what to call your CSS classes…I told you microformats were lightweight! The whole idea of the card comes from Paul Downey’s genius Hardboiled hCards project.
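Since the format really is just agreed class names, a reader for it can be tiny. The following is a hand-rolled sketch, not the Sumo library’s API and nowhere near a full hCard parser: the function name readHCard and the shape of the object it returns are my own assumptions, and it only picks out the handful of properties used in the card above.

```javascript
// Minimal sketch: read a few hCard properties by class name.
// `root` is anything with querySelector (an Element or a Document) that
// CONTAINS the card; the .vcard element itself is looked up inside it.
function readHCard(root) {
  const card = root.querySelector(".vcard");
  if (!card) return null;

  // Text content of the first matching descendant, or null if absent.
  const text = (selector) => {
    const el = card.querySelector(selector);
    return el ? el.textContent.trim() : null;
  };
  // Attribute value of the first matching descendant, or null if absent.
  const attr = (selector, name) => {
    const el = card.querySelector(selector);
    return el ? el.getAttribute(name) : null;
  };

  return {
    fn: text(".fn"),
    role: text(".role"),
    photo: attr(".photo", "src"),
    // The machine-readable coordinates live in the abbr title attributes.
    // parseFloat yields NaN when the attribute is missing.
    latitude: parseFloat(attr(".geo .latitude", "title")),
    longitude: parseFloat(attr(".geo .longitude", "title")),
    urls: Array.from(card.querySelectorAll("a.url"), (a) => a.getAttribute("href")),
  };
}
```

In the browser you would call it as readHCard(document.querySelector("#crossDomain")). A real library handles much more (multiple cards, nested properties, the value-class pattern), which is why I went looking for one.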

Anyway, bottom line is we’ve just extracted some content with hCard data in it, so it should be easy to parse it in a standard way and make sense of the content. So I start looking for an hCard Javascript library and find one; that’s the beauty of standards. Even better, it’s called Sumo and it comes from Dan Webb.

The hCard library expects a DOM element containing the hCard(s), so I pluck that from the content I’ve just inserted on the page, and pass that to the library. Then it’s a matter of using the “hCard” object to render a custom UI:

javascript

  var hcard = HCard.discover(document.querySelector("#crossDomain"))[0];
  // parseFloat rather than parseInt, so fractional degrees (e.g. 51.32) aren't truncated
  var latlong = new google.maps.LatLng(parseFloat(hcard.geo.latitude), parseFloat(hcard.geo.longitude));
  var markerImage = new google.maps.MarkerImage(hcard.photoList[0], null, null, null, new google.maps.Size(40, 40));
  var infoWindow = new google.maps.InfoWindow({content: "<a href='"+hcard.urlList[0]+"'>"+hcard.fn+"</a>", pixelOffset: new google.maps.Size(0,-20)});
  ...

And I also dump the entire hCard for demo purposes, using James Padolsey’s PrettyPrint.

javascript

  document.querySelector("#hcardInfo").appendChild(prettyPrint(hcard));

There’s lots more fun to be had with the Web as an API. According to the microformats blog, 2 million web pages now have embedded hCards. Offer that content to the HTML5 mashers of the world and they will make beautiful things.

Did you hear the one about enterprise reuse?

Confirming that enterprise reuse can be a bit of a joke at times, Jason Gorman shares this fable on enterprise reuse (via another inspired Jason). Short summary: Two ladies could save 8 cents by boiling tea in the same kettle. But the cunning analyst forgets that, since they live 20 miles from each other, there will be overheads to the tune of a $20 cab ride and the travelling time.

Viewed from a high level, enterprise reuse is a noble goal; what’s the point of being a single company if everyone writes their own code? In practice, it can be fraught. Ironically, it’s usually easier to reuse publicly-available libraries (e.g. open-source libs on sourceforge) and public web services than those in the same company. The following things make reuse more digestible in an enterprise setting:

  • Language-agnostic, industry-standard technologies: Using obscure or proprietary technologies can work in an individual team, but rarely in a large enterprise; in most cases, there are simply too many factions with different skill sets and legacy code bases. There are companies that describe themselves as a “pure Java shop”, for example, but you will invariably find pockets working in Python, .Net, and so on. Getting an enterprise to truly standardise (not just lip service) on something non-industry-standard is futile. It takes several months for people to get really competent in a new language; in an environment full of contractors and staff turning over every few years, and full of legacy systems, you can count on the fact that there will be disparate technologies at play. It’s a good thing, too; no one language (or paradigm, for that matter), not even Java *gasp*, is the right solution to all problems.
  • Service-oriented: SOA as in “built on a needs-driven basis”. The stuff that’s available for reuse is stuff that’s been abstracted from real-world projects, where at least one project already built it and at least one other project actually needs it. (Rails is successful because 37Signals uses it; there aren’t dozens of 24-month working groups involved.) Perhaps the biggest mistake enterprises make in this whole area is pushing out functionality no-one else actually wants to reuse.
  • Support trumps standardisation: The best way IMO to encourage a certain technology or library is the carrot, not the stick. Make people actually want to reuse what you have to offer, rather than forcing them to do so. I am very sceptical about any situation where architects have to act as the reuse police; if the component or service was well designed, documented, easily located, and served a genuine need, wouldn’t the developer be drawn towards it? Wouldn’t they actually want to use it, and maybe even give something back to it? In an ecosystem where components and services are high-quality and easily-accessed, you can forget about mandating reuse because it will happen anyway. See Web API Patterns and Documentation as Conversation for the kinds of things that will make this happen.
  • Online: As a rule of thumb, offering a centralised web service is better than offering a reusable code component. The web service can (should) be easier to use and is language-agnostic. Obviously, there are sometimes situations where code components make more sense, especially from a performance perspective. I wouldn’t use an online service to create a polygon every millisecond, for example.
  • Easy to use: As with any API, it should be easy to learn and make calls. For this reason, online services should be RESTful, not SOAP or CORBA or whatever MQ if you can help it.
  • Iterative progress: Don’t try to bite off more than you can chew; if you start pretending *everything* can be reused, you’ll soon find that nothing gets reused.
  • Simple and parsimonious: Factor out the trivial factors that relate only to one particular client. In enterprise reuse, this can be a big problem, where client projects may be the budget holder for anything reusable. It’s difficult, but someone needs to stand up and say “no, we’re not going to include feature X because no-one else would actually need it”. In software, deciding what to leave out is usually a greater challenge than coming up with new things to put in. Any feature that won’t be used by a significant proportion of client apps is going to create more clutter than it’s worth. In a broad-scale service, I’d say this minimum proportion should be something like 5-10% (e.g. each method should be exercised by 5-10% of clients who use the class). In an enterprise context, where there may only be a few clients, I’d say the criterion should be “at least 2 clients”. (There was a podcast interview a while ago, with PragDave I think, where he was asked what he would include in Rails 2.0. He essentially replied that he’s more worried about taking things out – push them out of the core distro and into plugins.)
  • Automated: Sometimes, people think “it’s all under the same roof”, so getting access and learning about an API requires a call or a meeting with the owner of the reusable service/component. Whereas, if Google offers the same thing, it will provide online doco and a means of accessing it automatically, without any human intervention. An agile enterprise should aspire to do the same thing; it doesn’t have to be as polished as a public offering, but the spirit should be the same. Otherwise, it won’t scale, and the owner will soon become fed up doing the same thing over and over.

Ajax Programming Patterns – Podcast 1 of 4: Web Service Patterns

Whereupon a new podcast series begins …

As promised, a new series of Ajax pattern podcasts. This is the first of four podcasts on the Ajax programming patterns.

In this 73 minute podcast, we look at the seven patterns of web services as they relate to Ajax clients.

Click to download the Podcast. You can also subscribe to the feed if you want future podcasts automatically downloaded - check out the podcast FAQ at http://podca.st.

  • RPC Service: Expose web services as Remote Procedure Calls (RPCs). (Note: In the book and wiki, REST appears before RPC.) (6:55)
  • RESTful Service: Expose web services according to RESTful principles. (13:25)
  • HTML Response: Have the server generate HTML snippets to be displayed in the browser. (44:45)
  • Semantic Response: Have the server respond with abstract, semantic data. (49:00)
  • Plain-Text Message: Pass simple messages between server and browser in plain-text format. (56:05)
  • XML Message: Pass messages between server and browser in XML format. (57:20)
  • JSON Message: Pass messages between server and browser in Javascript Object Notation (JSON) format. (59:55)

Thanks for your feedback since last time. Good, bad, or ugly, it’s all welcome – in the comments for this podcast or [email protected]

SAG Ajax Patterns Review 2 – User Action, Scheduling, Web Service, REST, RPC

Following from the previous post, here are my notes from the second SAG workshop/discussion on the Ajax Patterns. See the earliest post for a background on this series.

Feb-2-2006 Second Ajax Patterns Discussion

“We should set up wizlite “sag” group for annotations” [MM For the benefit of others, the online book draft version uses Alex Kirk’s Wizlite.] Wizlite – is it Ajax? [MM Yes, Alex Kirk’s a prolific Ajax developer and writer]

Wizlite – can you trust the data on all these things to stay online? [MM Backup solutions like openonomy, part of the argument with these Web 2.0 things is they host it, but you own the data, so you can in theory point a backup service to it and at least you’ll always have the raw data. As for privacy, not much you can say about that (although, in theory, Ajax lets you host encrypted content - see the Host-Proof Hosting pattern). Also, some free content is available, then removed.]

Maybe AjaxPatterns.org will go down when the book is published [MM I know it was a joke, but just to let you know, won’t happen. It’s a CC license, and a big factor in going with O’Reilly was being able to keep the wiki online before, during, and after. Interestingly, O’Reilly didn’t have the Rough Cuts program going at the time, but they presumably had been planning it.]

Author had some auditory issues, need to fix that. [MM Thanks, this mp3 worked out better. The first one was fine too once I listened to it with better earphones. Given the quality of my own podcasts, I certainly can’t be a critic here. BTW I think it would be great if the SAG talks were actually linked from a feed. To me, this is the best kind of podcast you can get – it’s zero effort – a meeting that would have happened anyway. (Unlike, say, me rambling about Ajax, which is not something I normally do spontaneously)]

=======================================================

10:50 User Action

So what is the User Action pattern? – What’s the proper way of handling user actions? – “User Action” is more of a problem than a solution. More Event Handler or Callback. Names are typically solutions. Or does a JS programmer call it that? [MM No]. – Rather call it Listener than Event Handler if that has a narrower connotation. – Depends on what people really call it. – But better names come along too, e.g. “Ajax”. – Listener is kind of Java-ish, and JS and Java don’t have much in common other than the name. – The “Java” in “Javascript” name was marketing-driven, a misnomer. [TODO Mention ECMAScript/Javascript distinction in conventions section] – => Suggestion to author for rename

Implies you’re not going to see that event loop, it’s invisible, it’s probably more in the virtual machine. – If doing distr’d programming, end up with lots of loops. Difficult to integrate different packages. If all buried inside the language, could be better because no incompatibilities, but could be worse because it’s invisible. – So events are part of this language. [MM Yes, part of the browser’s DOM implementation]

OK, issues – Not specific to this pattern, but author discusses incompat’s between IE and Firefox. Should be some mentioning about this. [TODO Check pattern mentions libs and deals with this in general terms. Note that intro will discuss how incompats are handled. Mostly, the answer is “Use a library” and that it’s beyond the scope of this book – refer to a JS/DOM book instead. Also, note that I only address IE and Firefox directly.] – Prototype library deals with this. – IE7 uses XHR natively. [MM True, though not a huge deal, before you just needed a factory function. The bigger question is whether they behave the same] – Code – redefine the event handler – is that common? Wouldn’t normally do that. [TODO Indicate this is rare, possibly scrap the Decision. Note: This is indeed somewhat rare]

Could people write a User Action – Just showed onMessageMouseOut or whatever, needs to at least show the outline of the function [TODO I think the request here is to show the signature of onclick, onkeypress, etc. Except they’re all the same – onclick(ev), onkeypress(ev), etc. Need to clarify this in the solution] – Interesting that not compatible across all platform – Couldn’t you abstract out differences between events [TODO I thought that was mentioned, including the quirksblog et al contest, check it] – Interesting discussion on p5 about dynamically .. – Upon seeing that, and if he feels that way, scoring this as a variant of Observer would be an interesting way to go. – That section was a bit confusing. Third parameter was confusing. – How often do you need to have >1 event handler? Not very often. And can always have a composite handler. [TODO Mention that.] – JS event mechanism involving strings e.g. “onclick” for the event handler. Neat but refactoring side of me starts to worry. [MM Could say the same about onclick … in JS x.onclick is equivalent to x[“onclick”].]
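[MM For the record, here is roughly what the “composite handler” suggestion could look like. This is a sketch for illustration only; composeHandlers is a name I’ve made up, not something from the book or from any library. It also shows the point about x.onclick and x["onclick"] being the same property.]

```javascript
// Combine several event handlers into one, so a single "onclick" slot
// can serve more than one listener. Handlers run in the order given and
// each receives the same event object and the same `this` (the element).
function composeHandlers(...handlers) {
  return function (ev) {
    for (const handler of handlers) {
      handler.call(this, ev);
    }
  };
}

// Browser usage (logClick, highlightButton and the element id are hypothetical):
//   var button = document.querySelector("#save");
//   button.onclick = composeHandlers(logClick, highlightButton);
//   button["onclick"] === button.onclick; // true, same property either way
```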

=====================================================

29:00 Scheduling

What is it? – As the author puts it, running something at a time in the future or repeatedly. – Another use of event handling. And that’s one reason why “User Action” is better than “Event Handling” [MM That’s basically the rationale I used, could still be “UI Events”] – The vocab is tricky because the overall JS vocab and ideas are still emerging. – Maybe some background on JS event handling is available in the earlier section. [TODO No it’s not. A little in the tutorial but no. Needs sidebar perhaps.] – Is the author assuming JS? [MM Assuming basic knowledge. I suspect most readers have done a little hacking with it, the kind that was used in pre-Ajax style web development.] – Seems like it’s assumed. – But sometimes seems to assume reader doesn’t know anything about JS. [MM Unfortunately, Ajax is so all-encompassing that it’s a tricky issue, and the group of people who could benefit from these patterns is also quite wide. So it’s been a juggling act, but I’m really assuming some basic web development, and not necessarily any general GUI development (ie there are plenty of people who’ve programmed PHP for 5+ yrs, maybe since high school, and have never done anything like Swing).] – Only has example of one-off. Not for repeated event. [TODO Check that. (Solution itself includes that, but maybe reference it from examples, or reference other examples that use it eg Live Search).] – I thought it (JS) was supposed to be object-oriented – As much as it’s like “Java”. – I saw that JS is a prototype-like language in the Self tradition. [MM True. Each object has a mutable “prototype” property. We still need to work out how to fully harness it] – Could also have network events [TODO Sidebar: Also point out other events: XHR, onload/onunload] – Worth mentioning that the timers are not really timers, I’m not going to run my pace-maker using these timers [TODO Precision of timer]
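[MM Since the group asked for a repeated example as well as a one-off, here is a small sketch of one. The helper name repeat and its optional timers parameter are my own inventions for illustration; setTimeout, setInterval and clearInterval are the standard browser functions. Bear in mind the pace-maker comment: the interval is a minimum delay, not a guarantee, so callbacks can arrive late when the browser is busy.]

```javascript
// Run `task` every `intervalMs` milliseconds, `times` times in total, then stop.
// `timers` defaults to the global object, which provides setInterval/clearInterval;
// it is a parameter only so the timing functions can be swapped out (e.g. in tests).
function repeat(task, intervalMs, times, timers = globalThis) {
  let runs = 0;
  const id = timers.setInterval(() => {
    runs += 1;
    task(runs);                 // pass the run number, starting at 1
    if (runs >= times) {
      timers.clearInterval(id); // stop after the last run
    }
  }, intervalMs);
  return id;
}

// One-off, for comparison (showMessage is hypothetical):
//   setTimeout(showMessage, 2000);
// Repeated, five times at (roughly) one-second intervals (refreshResults is hypothetical):
//   repeat(function (n) { refreshResults(n); }, 1000, 5);
```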

=====================================================

41:00 Web Service

Author provided it even though not on the wiki (as with Ajax App)

Quite short and for being an important pattern, it didn’t seem to be very detailed. So what is Web Service? – Just seems real simple, what the Ajax .. is going to call on the server. – Uses HTTP protocol to pass parameters and bring back data. Could use SOAP, whatever. – It’s a Context, not a pattern. I’m not sure how serious I am about that, but … – Problem: What will the Ajax App call on the Server. – Important to have a web service, just not sure (??if need it as a pattern). – Name is ambiguous (because of SOAP etc) – Well that’s what there are other patterns for eg REST and RPC. – Part of the problem is there isn’t much content to a web service. The hard part is working out what all the messages are, what to expose, etc. – Even when CORBA came out, we felt like, this is old-hat, just trying to standardise what we’ve been doing, etc. – Want to have layers – CORBA, COM, components, Web Service is just the latest in a long line. SUN-RPC (early 80s) was already a standard. – This not only doesn’t tell you how to do these hard issues, but doesn’t…. I would like to see something like: Just because decided to do web service, problems aren’t over. Only just begun. Have to decide on (etc). People want freedom from choice. One reason people like SOAP, REST, etc. [TODO Mention that] – Also, why did web services arise that way? Why not a custom service on port 1065? Why HTTP? [TODO Sidebar? Forces?]

=====================================================

53:10 RESTful Service

  • Main distinguishing factor between REST and RPC: REST uses standard keywords for communicating to services. He has interesting anecdote on Google (Accelerator).
  • The idea is if everyone starts following REST, documentation is minimal.
  • “Motivating REST” – He does say important things like there are going to be web crawlers, caches, proxies, etc. So how can you make sure people’s assumptions are met … [TODO Check this is emphasised enough] And I think that’s a big part of REST. [TODO Use the browser as an example, ie type URL makes it perform a GET request]

What do you wish you had in here? – No (I didn’t understand REST from this). – REST = Non-SOAP procedure call. – I’ve heard people say you can use SOAP in a REST-like way [MM Yes, mentioned in “RPC Call” pattern and also in the RPC discussion in the REST solution] – How is this different from a standard form submission? [TODO Sidebar or explanation comparing to std web call? e.g. REST usually returns XML] – Normal web form submission is REST automatically, although one of the things is you want POST to be idempotent. So when you POST, you’ve made a new object. ie Put an ID on the form, so when you POST it says “I’ve already seen this. You’ve sent me this order already.” So won’t put in two orders. – Okay, I was going to ask this … using the URL as a piece of the environment? – No, not the URL. Cannot do it with the URL. The URL tells you the object that you’re posting. – Points to examples in Solution. But these aren’t RESTful [TODO *** Make it a lot clearer that these match examples aren’t necessarily RESTful!!! I think these might have set the stage for general REST confusion!] – Football service [MM Incidentally it’s Aussie rules :->] – Matches==Games [TODO Change to “game”, good enough for everyone] – I think REST would be good for CRUD, but not sure for transactions. – Well, TX’s are more difficult. Have to make an object that represents the TX, and that’s how businesses work. “I sent you the purchase order, etc”. [TODO Be more clear on this] – Author should describe something outside CRUD. [TODO It’s already there in later sections, but maybe discuss it upfront?] – Was looking forward to learning about REST, but still left wanting more. – The important people here are those who don’t know REST, hence probably needs more content even though it’s already long. [TODO Consider what to add/change] – TODO A REST and RPC FAQ Sidebar – e.g. Can RPC be RESTful?
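[MM The “put an ID on the form” idea is easier to see in code. Below is a minimal in-memory sketch of the server side; createOrderService, the requestId parameter and the order fields are all invented for this illustration, and a real service would keep the IDs in a database rather than a Map.]

```javascript
// An order service that treats a repeated POST as "I've already seen this".
// The form carries a unique requestId; submitting the same id twice
// returns the original order instead of creating a second one.
function createOrderService() {
  const ordersByRequestId = new Map();
  let nextOrderNumber = 1;

  return {
    submit(requestId, order) {
      if (ordersByRequestId.has(requestId)) {
        // Duplicate submission (double click, retry, back button): no new order.
        return { created: false, order: ordersByRequestId.get(requestId) };
      }
      const stored = { orderNumber: nextOrderNumber++, ...order };
      ordersByRequestId.set(requestId, stored);
      return { created: true, order: stored };
    },
    count() {
      return ordersByRequestId.size;
    },
  };
}

// Usage: the second call with the same id is harmless.
//   const service = createOrderService();
//   service.submit("form-42", { item: "kettle", quantity: 1 }); // created: true
//   service.submit("form-42", { item: "kettle", quantity: 1 }); // created: false
```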

=====================================================

1:13:55 RPC Service

  • Package up params, unwrap etc. There’s a bunch of ways to do it and he talks about them.
  • Call-By-Reference. URLs can be your references [TODO Mention that]
  • With REST. A lot of things you could say. e.g. Make up all the arguments in the URL. But if you have side effects, that doesn’t work. The side effect operations that aren’t idempotent, e.g if you’re doing SOAP, args will be passed in the POST message, so the URL isn’t really a resource.
  • Pretending you can treat the world as a bundle of RPCs, have to cope with lack of availability, errors, etc, which make RPC unreliable.
  • Which is why REST is successful. What if someone moves a page?
  • Would be nice if RPC was possible, but it’s just not.

1:22 Next time: Patterns after RPC Service?