VJ’s Warez

YQL – converting the web to JSON with mock SQL

Posted on: January 3, 2009

I like getting data from the web and I love JSON – as it is easy to use. The issue is that not many things on the web come as JSON from the get-go. Hence we need converters. You can use cURL and beautiful soup or roll your own hell of regular expressions. Alternatively you can use Yahoo Pipes to build your converter. Pipes is the bomb but a lot of people complained that there is no version control and that you need to use the very graphical interface to get to your data (which was the point of Pipes but let’s not go there).

AlasRejoice for there is a solution now available and it is called YQL. YQL is a SQL-style language to get information from all kind of web services, and – using oAuth – even Yahoo’s social graph. There is a test console available for you to get to grips with all the information it gives you access to (which is a lot!):

The YQL console

Here comes the kicker though: for all the open services that don’t need authentication you can use these YQL statements as a REST API with JSON output and an optional callback function for JSON-P by adding it to http://query.yahooapis.com/v1/public/yql?. For example to get the latest three headlines from Ajaxian’s RSS feed as JSON and wrap it in a function called leechajaxian do the following:

http://query.yahooapis.com/v1/public/yql?select title from rss where url=”http://feeds.feedburner.com/ajaxian” limit 3

The result is:

  1. leechajaxian({
  2. “query”: {
  3. “count”: “3”,
  4. “created”: “2008-12-12T09:01:13Z”,
  5. “lang”: “en-US”,
  6. “updated”: “2008-12-12T09:01:13Z”,
  7. “diagnostics”: {
  8. “url”: {
  9. “execution-time”: “17”,
  10. },
  11. “user-time”: “22”,
  12. “service-time”: “17”,
  13. “build-version”: “2008.12.03.14:01”
  14. },
  15. “results”: {
  16. “item”: [
  17. {
  18. “title”: “Travians: Sims meets Cultures, with Ajax”
  19. },
  20. {
  21. “title”: “Cappuccino 0.6 Released”
  22. },
  23. {
  24. “title”: “The fundamental problems with CSS3”
  25. }
  26. ]
  27. }
  28. }
  1. leechajaxian({
  2. “query”: {
  3. “count”: “3”,
  4. “created”: “2008-12-12T09:06:41Z”,
  5. “lang”: “en-US”,
  6. “updated”: “2008-12-12T09:06:41Z”,
  7. “diagnostics”: {
  8. “url”: {
  9. “execution-time”: “43”,
  10. },
  11. “user-time”: “45”,
  12. “service-time”: “43”,
  13. “build-version”: “2008.12.03.14:01”
  14. },
  15. “results”: {
  16. “result”: [
  17. {
  18. “abstract”: “Introducing <b>JSON</b> <b>…</b> <b>JSON</b> (JavaScript Object Notation) is a lightweight data-interchange format. <b>…</b> <b>JSON</b> is a text format that is completely language <b>…</b>”,
  19. “title”: “<b>JSON</b>”,
  20. },
  21. {
  22. “abstract”: “The <b>JSON</b> format is specified in RFC 4627 by Douglas Crockford. <b>…</b> Although <b>JSON</b> was based on a subset of the JavaScript programming language <b>…</b>”,
  23. “title”: “<b>JSON</b> – Wikipedia, the free encyclopedia”,
  24. },
  25. {
  26. “abstract”: “Matthew Morley has posted his PHP library for <b>JSON</b>-RPC 2.0. <b>…</b> I am think it is great to see the expanding <b>JSON</b> toolset available in JavaScript libraries. <b>…</b>”,
  27. “title”: “<b>JSON</b>”,
  28. }
  29. ]
  30. }
  31. }
  32. });

What about screenscraping? You can get data from any valid HTML document using XPATH with select * from html. For example to get the first 3 tag links on my blog you can do the following:

http://query.yahooapis.com/v1/public/yql?q=select * from html where url=”http://wait-till-i.com” and xpath=’//a[@rel=”tag”]’ limit 3&format=json&callback=leechajaxian

The team is working on making this easier – while we run every page that is indexed through tidy there is still a lot of choking going on (if people wrote valid HTML that wouldn’t happen).

YQL is a pretty easy but also versatile language. You can even use complex aggregation and filtering by for example hosting a lot of URLs in a spreadsheet and loading them one by one before aggregating. The example given in the console is “select * from rss where url in (select title from atom where url=”http://spreadsheets.google.com/feeds/list/pg_T0Mv3iBwIJoc82J1G8aQ/od6/public/basic&#8221;) and description like “Wall Street” LIMIT 10 | unique(field=”title”)”

Happy converting!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

VJ’s Warez

%d bloggers like this: