A Data Liberation WalkthroughShare This

Last time I wrote about how to build a US-wide farmers’ market API in under an hour. Today, I want to walk through a larger challenge to highlight some of the tools & techniques we are using at Code for America to create APIs for public data (and an example of a web app that uses the API). For this example we’re going to use the City of Boston’s Public Art website. I really like this site; they have a great map interface and a bunch of good content. The one missing piece is an API so that other developers can build on this data. That’s what we’ll build today (and just to be extra productive, we’ll setup a mobile web app that runs off the data). Also, Boston is a 2011 Code for America city and they are already taking steps to provide an API for this data, so I figure they won’t mind if we do this :) (tl;dr version)

UPDATE The API I just mentioned is now live and can be accessed via DataCouch.com

Before we dive into this specific case, I want to point out some of the tools that all data liberators should be familiar with, along with how we are going to use them.

  • Google Refine - we only use it as a scraper, but if you’re not familiar with Refine, follow the link and watch the video.
  • Recline - as a data manager for couch documents.
  • GeoCouch-Utils - adds geospatial api, default map.
  • viewer - example of drop in CouchApp…with a poor name.

1. Getting the data

In many cases, this can be the hardest step. Fortunately, the City of Boston has a website with geolocated data for over 250 pieces of public art. We are going to look at the source code for their map and find all the juicy data. If you want to skip straight to the data (or if the website changes formatting) I’ve posted a copy of the file I used on GitHub. Be aware that this method of getting data is not a best practice and should be used only when needed (and legal).

Go to the Public Art Boston map and view the source code for the page. Copy the “markers” array from the javascript source into a file called boston_public_art.json.

source code with all the data

Tidy the data in Refine

Open boston_public_art.json in Google Refine.

Refine Screenshot

Remove the ‘text’ column – we don’t need it

Deleting a column in Google Refine

Clean up the column names (rmt, latitude, longitude, markername, title)

column titles: before

Column headings: before

column titles: after

Column headers: after

Transform the rmt column

I’m not sure what ‘rmt’ stands for, but it looks like an ID to me (except for all the ‘/0′s). Let’s go on that hunch for now and remove all of the ‘/0′s so we are left with just the IDs. We’ll do that using the Transform functionality in Refine. Transforming a cell involves running a small piece of javascript-esque code on each cell in the column. Transforms use the Google Refine Expression Language (GREL). Transforms are a powerful way to programmatically clean poorly formatted data. Our current situation, however, only requires a slight string manipulation using the replace method.

google refine transform menu

Transform using GREL

Scrape more data

I want to pause for a minute and do a little recon. Let’s use one of our IDs and see where it leads us. We’ll take the first ID (276) and build a standard, out of the box, Drupal url: http://www.publicartboston.com/node/276 (it was pretty trivial to figure out this was a Drupal site). This looks like a page full of awesome data! Taking a look at the source code, we see that the data is in some sort of structured form – good news. Extracting the data from the page looks like a simple problem for a javascript framework like jQuery.

Screenshot of the structure of the page with all the data

We now find ourselves in a position where we have a list of IDs which lead to webpages containing great public art data. The next question is, how do we mash all this together? Luckily, Refine has a function to help us out. Its called “Add column by fetching URLs…”. In many cases of data clean-up, this function is used to geocode addresses, but today we will be using it to pull the HTML from a webpage into our DB.

Fetch url on the menu

Fetch from URL dialog

Data fetched from the url

Great, its done! Now we have a column called ‘full_html’ with an entire webpage jammed into it. It may look awful, but it is exactly what we want. At this point, I would say that we’ve successfully acquired the data. Now let’s clean it up and make it more useful.

Note Google Refine is a powerful tool and we’ve barely scratched the surface of what it can do. I would definitely recommend exploring it more. If you’ve got an old CSV of poorly structured data, dump it into Refine and see how easily you can clean/normalize it.

2. Clean the data

Now that we have our raw data, we can put it into our database. We are going to use CouchDB as our database (and later as our application server). I chose CouchDB to leverage its geo-spatial component (GeoCouch), to use it’s out-of-the-box REST API, and because of its schema-less document structure, it will handle data of different shapes. Another benefit of using CouchDB is that we can use CouchApps to easily add frontend interfaces to the data (i.e. Recline, Geojson-utils & the Viewer app used below). For development, I would highly recommend using the Couchbase Single Server. Just download, and install it. You will know it worked when you can go to http://localhost:5984/_utils and see Futon, CouchDB’s web-based administration console.

Setup the DB

From the Futon homepage, create a new database and call it “boston_public_art”.

Creating a DB in Couch

Install Recline

Recline is a cleverly named CouchApp built by Max Ogden. It provides a Refine-ish interface to data in a couch. My favorite part is that transformation functions are written in javascript (as opposed to GREL in Refine). Also, we can access the jQuery object in our transformations, which will come in handy for us. The easiest way to install Recline is via the command line.

curl -X POST http://localhost:5984/_replicate -d '{"source":"http://max.iriscouch.com/apps","target":"boston_public_art", "doc_ids":["_design/recline"]}' -H "Content-type: application/json"

Alternately, if you are familiar with CouchApps, you can download the code from GitHub and install it yourself. Once installed, go to http://127.0.0.1:5984/boston_public_art/_design/recline/_rewrite/. You will see it in an interface which resembles a minimalist version of Refine.

Screenshot of a fresh Recline install

Fresh install of Recline

Lets put some data in it.

Upload the data to a Couch[DB]

Now that the couch is ready, we need to move our data into it from Refine. We’ll use the Refine export templating engine to create a JSON object that can be accepted by the native CouchDB bulk documents API as illustrated below.

screenshot of refine export templating engine

Template for exporting the data in a format _bulk_docs can receive

Clicking “Export” will download the JSON object as a file named boston-public-art.txt. We switch to our terminal and navigate to the folder where the file was downloaded (I use ~/Downloads/ on my machine). The following command will insert all of our data into the couch (each object in our “docs” array will become one document in the couch).

curl -X POST http://127.0.0.1:5984/boston_public_art/_bulk_docs -d@boston-public-art.txt -H "Content-type: application/json"
Screenshot of recline after a the import

Viewing out imported data via Recline

Note There is another way to upload the data from Refine to CouchDB. You will need to download and install Max Ogden’s Google Refine Uploader Extension. It allows you to export data from Refine directly into a couch – basically automating the steps above.

Screenshot of Refine Uploader

After installing Refine Uploader, this option will appear in the "Export" menu

Screenshot of Refine Uploader dialog box

Enter the path to your database, followed by "_bulk_docs"

Use jQuery to extract the data

I may be getting ahead of myself.  Before we can use jQuery to pull the data out of the scraped page and into the correct fields of our DB, we need to know what pieces of information we are looking for.  I know that in order to use the Viewer CouchApp we need to have the following pieces of data:

  • _id – a unique ID for each document. The auto-generated CouchDB IDs will be fine for this.
  • title – the title of the piece of art
  • artist – name of the artist(s)
  • description – description of the work, artist statement, notes, etc.
  • discipline - preferably one of the following: sculpture, painting, photography, ceramics, fiber, architectural integration, mural, fountain, other
  • location_description – a human readable location for the piece
  • full_address - full street address, w/ city, state, zip if possible
  • geometry – longitude/latitude in geojson point format
  • image_urls - a comma delimited list (array) of urls to remote images.
  • data_source - the source of the data. (i.e. ‘Boston Art Commission’)
  • doc_type - this field is used by the Viewer CouchApp and should always be set to ‘artwork’

All of these bits of data are present in our scraped HTML, as well as a few others. The Viewer CouchApp we are going to setup only requires the fields above, but in the interest of building a richer API we should include all of the data available. Our additional fields will include:

  • audio_description
  • collection
  • funders
  • medium
  • neighborhood
  • year

Now that we’ve planned out what information we want, it is time to go about collecting it. A simple way to do this is to run a transform function that uses jQuery to pull the data out of our full_html field. Select “Transform” from the menu at the top of any column. Screenshot of transform link on ReclineThis should all look very similar to Refine so you won’t be surprised by the dialog box that opens next.  The main difference from Refine is that now we can use any javascript functionality within our transform. The default transform function (for the ‘_id’ column) looks like this:

function(doc) {             // doc is an object representing the document (or row)
  doc['_id'] = doc['_id'];  // Do any transformations you'd like
  return doc;               // Return the transformed document
}

This process makes it easy to add new fields to your documents; just assign a value to doc['newFieldName'] and it will be added to the document. We are going to get crazy and add all of our fields at once. Here is our finished transformation function:

function(doc) {
  doc['artist'] = $(doc['full_html']).find('h3:contains(Artist)').next('div').text();
  doc['description'] = $(doc['full_html']).find('h3:contains(Description:)').filter(':first').next('div').text();
  doc['discipline'] = $(doc['full_html']).find('h3:contains(Type:)').next('div').text();
  doc['location_description'] = $(doc['full_html']).find('span.fn').text();
  doc['full_address'] = $(doc['full_html']).find('span.fn').text() +', Boston, MA';
  doc['geometry'] = { type: "Point", coordinates: [parseFloat(doc['longitude'], 10), parseFloat(doc['latitude'], 10)] };
  var temp = [];
  $(doc['full_html']).find('.node_images img').each(function(idx, el){ temp.push(el.src) });
  doc['image_urls'] = temp;
  doc['data_source'] = 'Boston Art Commission';
  doc['doc_type'] = 'artwork';
  doc['audio_description'] = $(doc['full_html']).find('h3:contains(Audio Description:)').next('div').find('a').attr('href');
  doc['collection'] = $(doc['full_html']).find('h3:contains(Collection:)').next('div').text();
  doc['funders'] = $(doc['full_html']).find('h3:contains(Funders:)').next('div').text();
  doc['medium'] = $(doc['full_html']).find('h3:contains(Medium:)').next('div').text();
  doc['neighborhood'] = $(doc['full_html']).find('h3:contains(Neighborhood:)').next('div').text();
  doc['year'] = $(doc['full_html']).find('h3:contains(Year:)').next('div').text();
  return doc;
}

Yes, a bit of refactoring would make this function a bit more efficient, but I want to be clear on where each new field value is coming from. Run the transform. Once it is finished, our data will be clean (almost). One last piece of housekeeping we should do is delete the full_html column. To do that we simply use the “Delete this column” command on the menu for the full_html column. delete this column screenshot

3. Add Geo-spatial support

Earlier, we made sure that each document in our Couch had a geojson-formatted ‘geometry’ field. In order to utilize this field for geospatial queries we are going to install another CouchApp, GeoCouch-Utils. Just like the Recline CouchApp, you can either download the source from GitHub, and install it via the CouchApp command line interface, or replicate it from someone else’s couch via cURL.

curl -X POST http://localhost:5984/_replicate -d '{"source":"http://max.iriscouch.com/apps","target":"boston_public_art", "doc_ids":["_design/geo"]}' -H "Content-type: application/json"

To make sure that everything went well, point your browser at the map that comes with GeoCouch-Utils which is located at http://127.0.0.1:5984/boston_public_art/_design/geo/_rewrite/.  You should see a bunch of dots on a map of Boston like this: GeoCouch-utils out of the box mapWe can now give the API a bounding box of longitude/latitude coordinates and we will get a list of all the pieces of art within that box.  For example, if we wanted all of the art in/around Boston Common, we would hit the following URL:

http://localhost:5984/boston_public_art/_design/geo/_spatial/_list/geojson/full?bbox=-71.07114,42.3519,-71.0631,42.3577

At this point we have a Geospatially-enabled REST API for all city-sanctioned public art in Boston, huzzah!

4. Setup the mobile app

I am a huge advocate for organizations (cities included) to open up their data via APIs, but I’ve found that many times the decision makers need an example of what can be built on the API. This is why I built a CouchApp with a crappy name: viewer. It is basically a CouchApp version of the MuralApp mobile website that Aaron Ogle& I built for the Open Data Philly Hackathon. Once again, you can install it manually by downloading the code from GitHub and pushing it to your couch with the CouchApp CLI, or you can just enter this command in your terminal:

curl -X POST http://user:pass@YOURCOUCH/_replicate -d '{"source":"http://mertonium.iriscouch.com/apps","target":"YOURDB", "doc_ids":["_design/viewer"]}' -H "Content-type: application/json"

Now that the viewer CouchApp is installed, you should be able to see the app live at http://localhost:5984/boston_public_art/_design/viewer/index.html and it should look something like this:

Screenshot of Public Art mobile website

Hey, look at us, we just built a mobile website for exploring and discovering public art in Boston! Awesome.

5. Make it public

Developing locally is great, but we need to close the circle and share what we’ve made with the world. Luckily, we can make use of CouchDB’s replication feature to easily move our app & data to a [free] host.

Two of the big names in free CouchDb hosting are Iriscouch (which I use regularly) and Cloudant.

Once you’ve set up an account, create a database called boston_public_art. Then, all you need to do it replicate your local boston_public_art database to the one you just created via the replicator in Futon (you could also do it from the command line, but beginners usually find Futon more approachable).

Screenshot of Futon replicator

 

After the replication you should be able to access the mobile app with the same url as before, but with your hosted domain instead of localhost:5984. For example, I’ve got the app setup at http://mertonium.iriscouch.com/boston_public_art/_design/viewer/index.html (Be sure to check it out on a mobile device too).

What’s next?

This is the point where I challenge you to go forth and liberate data with the tools and techniques listed in this post, but I do want to add one key point: data liberation is not a smash & grab process.  If the data is not kept up to date, then it will lose its value and all of your hard work will be for naught.  To guard against this, work with the people and organizations who hold the data.  You may need to scrape data at first, but your end-game should be to come up with a system where updates are pulled into the couch automatically (or help the data-holders build an API directly into their current system).

This is what is now happening in Boston, hence my earlier disclaimer that the API we built is not obsolete.

More information

tl;dr version

  • http://twitter.com/jahendler jahendler

    yes, but imagine how much more powerful this would be if it assigned permanent,dereferencible URIs to the data elements that are common between datasets — would make the data much more “webized” (REST friendly) – cf. Berners-Lee: http://www.w3.org/DesignIssues/Webize.html

  • http://www.webeconoscenza.net/2011/10/05/liberateli/ Liberateli | Webeconoscenza

    [...] Per i più frettolosi consiglio uno splendido post sul blog di Code for America: http://codeforamerica.org/2011/09/27/a-data-liberation-walkthrough/ [...]

  • http://6zu8zrxh.com adac

    nach…

    advertising a service or product, the use of storytelling can be totally effective. Lead your viewers by way of instance and affect…

  • http://www.one24secrets.com/health-tips/budget-trip-basics control de flotas

    Most automobile rental corporations won’t accept a money deposit…

    Typically, discounted admission charges on area attractions are the most typical and easiest to search out, since many theme parks, museums and other points of interest run specials to draw tourists….

  • http://www.businesscardmakeronline.com/ business card maker

    Be consistent with your style…

    People like to know what to expect, once you have settled on a style for your audience stick to it….

  • http://www.youtube.com/user/carstereosandiego car stereo san diego

    car stereo installation san diego…

    At any time you articulate every little thing there exists to speak about for a chosen issue you’re unlikely to make people adding their own ideas for the reason that you’ll have taken care of the things they might have included….

  • http://www.youtube.com/watch?v=Qh8h4JxqdWk Free money From the Government

    Promoting your website…

    A lot of organizations do not ever pay for web marketing along with advertisement in order to repair open condemnation and also lies from various persons….

  • http://www.propertynow.com.au/home-loan-comparison cheapest home loan

    These accounts help maintain your credit…

    standing, and closing them would leave the rest of your accounts to dictate which way your credit standing goes….

  • http://www.acnefree911.com/acne-free-medicines-and-creams/ Acnezine

    Managing your online reputation…

    If you’re considering serps, we have understood it is imperative to create a sparkling public on-line persona in the world of the web….

  • http://www.scribd.com/doc/72809740/Acnezine acnezine blackhead removal

    Vitamin E is always essential…

    for the skin since it guards the skin from acne. Keeping your mind calm without stress can also help you to get rid of acne….

  • http://www.scribd.com/acnezine where to buy acnezine

    Acne can occur due to…

    several factors which include heredity, hormonal changes related to pregnancy. Dermatologists will help you in the treatment for curing acne….

  • http://www.linkdirectoryonline.net/business/control-de-flotas-for-a-tension-free-time/ gestion de vehiculos software

    Of course, the church has also…

    at times corrupted along similar lines. One need look no further than the very analogous child abuse scandals of the Roman Catholic church in recent years. However, it is also worthy of note that the Roman Catholic scandals were worsened by the perceiv…

  • http://www.articleblast.com/Business_and_Management/General/Control_de_flotas_for_a_tension_free_time/ software control de flota

    No small amount of blame…

    for the “protect the program” mentality that has grown in college sports lays at the feet of the NCAA. There is no better analog for Big Government than the NCAA. Obsessed with its own rules on recruiting and amateur status, and eager to enforce even t…

  • http://venturepointir.com/ stock promotion

    We specialize in penny stocks…

    Our team of stock promotion specialists enable us to fulfill any budget and any time frame needed….

  • http://ashishpandey.articlesnare.com/business-articles/control-de-flotas-for-a-tension-free-time.htm software gestion de flota

    Additionally examine the types of cars…

    Most AAA offices promote discounted tickets for theme parks, museums and other points of interest inside driving distance, and most AAA workplaces sell discounted tickets to fashionable locations just like the Walt Disney World Resort, too….

  • http://www.yelp.com/biz/rc-audio-car-stereo-installation-san-diego-la-jolla-2 Car Stereo San Diego

    car stereo san diego…

    If you tell anything at all there’s to say over a subject matter you’re unlikely getting others supplying their own viewpoints because you’ll have taken care of what they could possibly have contributed….

  • http://www.ezine4i.com/Business-articles/article-51211/Control-de-flotas-for-a-tension-free-time control de flotas de vehiculos

    The first thing you have to to know is…

    After getting a good idea of what you want to spend on each part of your journey, then it is advisable start buying round for the perfect deals….

  • http://fiverr.com/cipango/give-you-12000-live-scrapebox-links-on-super-low-obl-blog-pages SEO service for webmasters

    Greetings Admin,…

    You really need backlinks to your site, and here is a low cost and powerful solution – http://fiverr.com/cipango/give-you-12000-live-scrapebox-links-on-super-low-obl-blog-pages ….

  • http://florida.arrests.org/Arrests/Gary_Moore_4539109/ Garry Moore Venice

    As you will traverse his posts…

    in reverse chronological order, you will notice the refinement in his writing skills. No matter how expert you are in your niche, putting it creatively on blog needs some special treatment….

  • http://pink-eye.net/ pink eye symptoms

    Bacterial conjunctivitis often spreads…

    to both eyes and causes a heavy discharge, sometimes greenish. Crusting may appear on eyelids….

  • http://tummytuckcosts.org/ Tummy Tuck

    Does a tummy tuck remove the fat from inside the abdominal cavity?…

    If you are ready to explore your options for a new look, then it is time to schedule your consultation with Board Certified Plastic Surgeon….

  • http://olive-gardencoupons.org/ olive garden printable coupons

    Giving Olive Garden gift cards…

    as a Christmas or birthday present is fairly popular. It’s a unique gift in that you’re giving that special someone a fun, relaxing, and not to mention delicious experience, all in one evening….

  • http://burlingtoncoatfactory-coupons.net/ burlington coat factory coupons

    He gave us some helpful hints :…

    Thanks for the guidelines you have discussed here. One more thing I would like to state is that computer memory demands generally rise along with other breakthroughs in the technology. For instance, whenever new generations of processor chips are intro…

  • http://www.gspmonline.com/public-relations/program-overview.asp check it out here

    evident that more students are taking the…

    distance education route to earn their bachelor’s degree, but the question remains are online degrees accepted by potential employers? One of the largest online career management services, Vault Inc., conducted an online survey in which 85% of employe…

  • http://opthamologists-online.com/optometrists-eye-doctors optometrist

    the soft ones are when inserted. Most…

    soft contacts are designed to be disposable. The ability of having a fresh lens does help with reducing infection, less cleaning and of course, more comfort. With rigid lenses, debris and oil may cause the lens to become cloudy and no…

  • http://www.bulkping.com/rss-feed-generator-creator/feed/6822615b2ba3a1ccbbd8821d01d69102.xml free government grant money

    still remember all new assignments that mom…

    gave, One day to eat, next day to speak, And to sit in my cradle, With no freedom to freak! LIFE AS A CHILD… I get nostalgic, and revere about the past… I dream of days when I was a child,…

  • Victoria secret promo codes

    the company ever since, until 1993 when…

    a new marketing campaign was launched targeting a more exclusive group of consumers with seductive and chic French lingerie. The exclusive and adventurous Valisere Tabu line was introduced in 2005, with the company and it’s products finally going glob…

  • http://www.nationalrelief.com credit debt

    Websites we think you should visit…

    [...]although websites we backlink to below are considerably not related to ours, we feel they are actually worth a go through, so have a look[...]……

  • http://www.nationalrelief.com/DebtResources/DebtReduction.aspx debt reduction store

    Online Article……

    [...]The information mentioned in the article are some of the best available [...]……

  • http://www.iaseminars.com/en/event/?Code=1295&VenueID=31 IFRS Capitalization

    Hawt Post…

    Very cool ideas to peruse….

  • https://appworld.blackberry.com/webstore/content/31381 gps localizador

    consider GPS Tracking devices for: Children GPS…

    Tracking devices are perfect for keeping track of people, especially young children who have a tendency to explore! With some service providers you will be able to setup something called a Geo Fence which will trigger alerts if the device goes…

  • http://gasfireplacelogs.org/gas-fireplaces-vented.html gas fireplaces vented

    Interesting fact…

    [...]The information mentioned in the article are some of the best available [...]…

  • http://www.doorsuppliesonline.co.uk/interior-doors/interior-door-by-type/walnut-doors/cat_158.html Walnut Doors

    Informative and precise…

    Its hard to find informative and accurate information but here I found…

  • http://www.iaseminars.com/en/elearning.html IFRS Online

    Got a kick or two…

    I have enjoyed your other items, but this one isn’t your finest….

  • http://www.amours.tw 婚禮錄影

    Creative Wedding videos that you must check out !…

    [...]we like to honor other sites on the web, even if they aren’t related to us, by linking to them. Below are some sites worth checking out[...]…

  • http://www.diigo.com/user/tanks1swtor6?domain=cookeatshare.com swtor Bounty Hunter tanking builds

    Awesome website recommended your blog…

    [...]the time tо read or proceed to thе content or web-sites we’ve linked wіth bеlоw the[...]…

  • http://www.mycoolcaravans8.com/ cool caravans

    Lavoro per il domani – uno sguardo di Yesturdays ad alcuni esempi…

    È stato indicato questo esempio, via Megan Messer sopra Facebook e immaginilo per essere estremamente informativo ed ugualmente il punto…

  • http://CHEAPHOTELSCANADA.ORG cheap hotels canada

    travel…

    [...]the time to read or visit the content or sites we have linked to below the[...]…

  • http://www.nationalrelief.com/DebtResources/MedicalBillsRelief.aspx get help paying medical bills

    Sources…

    [...]check below, are some totally unrelated websites to ours, however, they are most trustworthy sources that we use[...]……

  • http://blog.yam.com/lunarhuang 二手筆電

    New ideas ? Great Ideas ? Best Ideas ?…

    [...]below you’ll find the link to some sites that we think you should visit[...]…

  • http://cheapnbajerseyspro.com/2880/nba-women-jerseys/ Customized Nba Jerseys

    Cheap NBA Jerseys…

    [...]listed below are a few references to webpages which I link to as we think they are seriously worth browsing[...]…

  • http://ipaddanmark.com/ipad-sverige-ipad-priser ipad sverige

    In spite of that, it needs a…

    SIM-card to enable the user to constantly have access to the internet. Og it is this SIM-card that is smaller than known from other phones. “It is only Apple’s own iPad that supports the new micro SIM-card. It gives a telco…

  • http://www.usais.org/articles/154120/1/Credit-score-Repair-Made-Effortless-With-The-following-tips/Page1.html/print/154120 bad credit mortgage repair

    broker is beneficial because the data that…

    they can provide can be of great help in financing your properties. Their main duty is to provide you the best lender they can find to make your home investment. Brokers also are willing to negotiate interest rates with you especially…

  • http://phpnmore.com/entry.php?1733-Determine-A-Bright-Future-Profession-For-Your-self-With-On-line-Nursing-Applications best nursing schools

    a list and formulate your own title.2)…

    opening line. your opening line needs to speak directly to the reader. what i like to do is jot down my keywords and write down “what kind of person would type in these keywords.”for example: the keywords for this article are…

John Mertens 2011 Fellow

John is a full stack developer with an emphasis on rapid prototyping and data wrangling.
@mertonium | Bio & Posts