Build a “Google Dataset”-friendly Catalog in Minutes With IBM Cloud Object Storage

Easily upload datasets and publish metadata (part 1)

--

IBM Cloud Object Storage is a fast, secure, inexpensive way to store all the data that doesn’t make sense to put in a database. (Think CSV files, spreadsheets, PDFs, images, video, and scientific formats like HDF5 and NetCDF.) Here’s a simple way to upload that data and start exposing it through an API.

IBM Cloud Object Storage (COS) is similar to Amazon S3 in many ways, including broad compatibility with the S3 API. That compatibility matters here, because we’re going to look at two useful features of COS that come via the S3 API:

  • making files publicly readable
  • using it as a web server

With these features, I’ll explain how to make an open data catalog that has a web interface and a REST API, and that can be optimized for search. (I’ll show you how in part two of this article.)

Setting up COS

The first step is to instantiate a COS service, create a “bucket” in which to store your data, and create credentials that allow you to upload data into it. Follow the instructions to sign up for IBM Cloud and create a Cloud Object Storage service instance. Create a new bucket with a useful name like myopendata.
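
If you’d rather work from a terminal, the IBM Cloud CLI can create the service instance as well. This is only a sketch under my own assumptions: the instance name below is a placeholder, and you should confirm the plan name against the current CLI documentation.

    ibmcloud login
    ibmcloud resource service-instance-create my-cos-instance cloud-object-storage lite global

Once s3cmd is configured (see below), you should also be able to create the bucket from the command line with s3cmd mb s3://myopendata.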

Now let’s talk about access credentials. For most uses of COS, you should stick to user-based access via the IBM Cloud Identity and Access Management (IAM) dashboard. However, we’re going to use a tool designed for S3, so we’ll need to set up our credentials in an S3-compatible way.

Start by creating a “service credential” as described here. Make sure to complete step 3, specifying {"HMAC":true} in the Add Inline Configuration Parameters (Optional) field. This parameter makes COS compatible with the S3 authorization method we’ll need for the s3cmd tool. Look at the JSON credential you just created and note the cos_hmac_keys object, where you’ll find the access_key_id and secret_access_key fields.
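
For reference, the credential JSON you get back should contain a section that looks roughly like this. The values here are placeholders, and I’ve trimmed the other fields (apikey, resource_instance_id, and so on) that we won’t need:

    {
      "cos_hmac_keys": {
        "access_key_id": "<your access key id>",
        "secret_access_key": "<your secret access key>"
      },
      ...
    }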

S3-style access

s3cmd is designed for working with Amazon S3, but since IBM COS supports the S3 API, we can use it to upload to COS. And since the COS user interface doesn’t support marking files as publicly readable over the web, this tool is a lifesaver.

Get s3cmd from GitHub or just install it from PyPI with pip install s3cmd. Now you just have to configure it to work with your COS credentials by running s3cmd --configure. Enter the access_key_id and secret_access_key at the appropriate prompts to build the configuration file, which gets written to $HOME/.s3cfg. Also note the proper host_base string. I used the one for us-geo below; if you’re unsure which endpoint you need, try us-geo (more on why in the regions and endpoints documentation for IBM Cloud Object Storage). Be sure to format your host_bucket value properly as well, again as in the example below. You can accept the rest of the default values at the prompts.
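
For example, the relevant lines of the resulting configuration file end up looking something like this. Treat the hostnames as an assumption: the us-geo endpoint below matches the sample URL later in this article, but you should substitute the endpoint for your own region from the COS documentation.

    [default]
    access_key = <access_key_id from your credential>
    secret_key = <secret_access_key from your credential>
    host_base = s3-api.us-geo.objectstorage.softlayer.net
    host_bucket = %(bucket)s.s3-api.us-geo.objectstorage.softlayer.net
    use_https = True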

Note: If you’re not comfortable with or able to use Python, you can follow these instructions to use cURL, but I haven’t tested that method.

Finally, upload some data to the bucket you created earlier. It’s just a matter of running s3cmd with the put command and the --acl-public flag, specifying an input file and a remote location for it (which can be a new directory-like path). Here’s an example:
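
A command along these lines does the job. The local filename and the opendata bucket name here mirror the sample link below; substitute your own file and bucket.

    s3cmd put --acl-public salesjan2009.csv s3://opendata/retail/salesjan2009.csv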

Using s3cmd to upload a file for public web access. It’s only sample data, but feel free to check it out: http://s3-api.us-geo.objectstorage.softlayer.net/opendata/retail/salesjan2009.csv

Create the REST API

A REST API allows a client — usually another application — to discover all the resources available on a server via its interface. The simplest REST API is the web itself: when you look at a web page, you can navigate to other content via the “interface” of hyperlinks. But a web page is designed for people. In this first step we’ll design for machines and serve up a JSON document that describes the datasets available on the server and provides hyperlinks to download them.

We’ll provide a single endpoint that points to this JSON document, and we’ll write the document by hand. Most REST APIs are more sophisticated than this: they update their resource descriptions automatically and do more than just serve files. But that’s overkill for most use cases, and hand-crafting the JSON document and keeping it updated manually takes far less time than setting up and maintaining a more complex system.

The hard part of this step is to pick a good metadata standard to adopt. And by good, I mean that it strikes a nice balance between ease of development, ease of use by your audience, and relative completeness in articulating the information you need to communicate.

What makes for good metadata?

Having worked for a geospatial standards organization for a decade, I know how difficult this can be. So I’ve done that hard work for you and chosen a simple JSON-formatted standard called the “Project Open Data Metadata Schema v1.1”. It’s based on the granddaddy of metadata standards, the Dublin Core Metadata Initiative, and it’s the recommended standard for the biggest open data publisher in the world — the U.S. Federal Government. Their site does a good job of describing the format, or you can use my sample JSON below as a model for how to implement the standard.

Let’s go over a few key properties. First, we have a JSON file that represents the entire data catalog. Within that, we have an array of datasets. Each dataset object has an identifier that’s unique within the catalog, plus a title, description, publisher, and contactPoint. We also have two properties that aid in search: an array of keywords and a spatial value that defines the geographic area the dataset covers (more on this in my next article).
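
Here’s a minimal sketch of what such a catalog file might look like. The dataset entry is illustrative: it points at the sample sales file uploaded earlier, and the publisher, contact, and dates are made up. Check the Project Open Data site for the full list of required fields and valid values.

    {
      "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
      "@type": "dcat:Catalog",
      "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
      "dataset": [
        {
          "@type": "dcat:Dataset",
          "identifier": "retail-sales-jan-2009",
          "title": "Retail Sales, January 2009",
          "description": "Sample retail transaction data for January 2009.",
          "keyword": ["retail", "sales", "sample"],
          "modified": "2009-01-31",
          "publisher": {
            "@type": "org:Organization",
            "name": "Example Publisher"
          },
          "contactPoint": {
            "@type": "vcard:Contact",
            "fn": "Data Curator",
            "hasEmail": "mailto:curator@example.com"
          },
          "accessLevel": "public",
          "spatial": "United States",
          "distribution": [
            {
              "@type": "dcat:Distribution",
              "downloadURL": "http://s3-api.us-geo.objectstorage.softlayer.net/opendata/retail/salesjan2009.csv",
              "mediaType": "text/csv",
              "title": "Sales January 2009 (CSV)"
            }
          ]
        }
      ]
    }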

Sample Project Open Data Metadata.

Just craft your JSON metadata file, inserting the appropriate links to your files on COS, and upload it to COS the same way you did your data files. Then share the link to your JSON file with the world, and you’ve got a pretty nice RESTful open data service for pennies.
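
Concretely, if you name the catalog file data.json (the convention Project Open Data uses), the upload looks like this, again assuming the opendata bucket and us-geo endpoint from earlier:

    s3cmd put --acl-public data.json s3://opendata/data.json

The catalog would then be reachable at a URL like http://s3-api.us-geo.objectstorage.softlayer.net/opendata/data.json, and that’s the link you hand to clients.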

See you next time, when I’ll look at using some simple JavaScript code to optimize your cloud datasets for search.
