NOARCHIVE: http://sfbay.craigslist.org/sfc/apa/4139480380.html

A Short Proposal for Robots.json

Machine-readable terms of service for APIs

By now, most developers and startup folk have observed the journey of 3taps and craigslist, by way of PadMapper. What follows is a high-level walkthrough of the case, data licensing and product needs, and a proposal for a new data rights management system (not to be, or maybe to be, confused with DRM!).

A Version of the Story So Far

Disclaimer: I am not a lawyer. I’m not even an armchair lawyer. I’m not connected with 3taps, Craigslist, or PadMapper, and the facts here are greatly simplified, and, at times, probably incorrect. It’s an illustration, not a legal quote.

Craigslist provides a classified ads marketplace. There is no API. By choice.

3taps is a provider of publicly accessible data. They built an API to make programmatic access to Craigslist’s data possible, and they accomplish this through a technique called scraping. Technically, 3taps became a data distributor.

PadMapper provides a visualized map of available rental properties on Craigslist (and other services), first through scraping themselves, then by leveraging 3taps’ API.

Craigslist instituted a terms of service policy that essentially disallowed the collection, republishing, and sale of their content. So does nearly every content provider, collector, aggregator, and creator that has ever existed, at least outside of their own defined and allowed channels.

Scraping is the practice of traversing the HTML of a site and storing, categorizing, and processing the data inside, typically by relying on the page’s built-in HTML tags and metadata. There are legal rationales for doing this, as well as technological ones, but generally speaking most corporate terms of use forbid the practice, either explicitly or implicitly.
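
To make the mechanics concrete, here is a minimal sketch of tag-based scraping using only Python’s standard library. The markup, the `listing` class, and the `data-price` attribute are invented for illustration; they are not taken from Craigslist or any real site, and a real scraper would have far messier input to deal with.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, made up for this example.
SAMPLE_HTML = """
<ul>
  <li class="listing" data-price="2400"><a href="/apa/1.html">Sunny 1BR in the Mission</a></li>
  <li class="listing" data-price="3100"><a href="/apa/2.html">2BR near the park</a></li>
</ul>
"""


class ListingParser(HTMLParser):
    """Collects one record per <li class="listing"> element."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._current = None   # record being built, if inside a listing
        self._in_link = False  # True while inside the listing's <a> tag

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "listing":
            self._current = {
                "price": int(attrs["data-price"]),
                "title": "",
                "url": None,
            }
        elif tag == "a" and self._current is not None:
            self._current["url"] = attrs.get("href")
            self._in_link = True

    def handle_data(self, data):
        # Only text inside a listing's link counts as the title.
        if self._in_link and self._current is not None:
            self._current["title"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "li" and self._current is not None:
            self.listings.append(self._current)
            self._current = None


def scrape(html):
    """Return the listings found in an HTML string."""
    parser = ListingParser()
    parser.feed(html)
    return parser.listings


if __name__ == "__main__":
    for listing in scrape(SAMPLE_HTML):
        print(listing)
```

Nothing in this code asks whether the site wants to be read this way, which is exactly the gap the rest of this piece is about.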

Long story short, Craigslist asked PadMapper to stop. PadMapper moved to 3taps, thinking it protected them. Craigslist asked them to stop again, and served 3taps with a lawsuit. [Now, the reasons they asked them to stop, as well as the actions they took, are important, but too complex for this.]


Why scrape?

Proponents of scraping cite a number of reasons for the practice, chiefly that, in the absence of an API, it is the only way to gather valuable data. Another argument calls back to web standards and open data initiatives: essentially, if a web browser can read this information, and you’ve made it available to be downloaded by a human using a browser, how is it any different when the download is automated?


Why not?

The arguments get more complicated when you consider licensing rights to the content on the sites. Is it user generated? Does the user own that content, or was it assigned to the site? Is the information public, governmental data, or is it proprietary? Did the site license it from some other content holder? What’s the difference between ‘facts’ and ‘creative content’? These are all philosophical and legal issues to consider, but are well outside what I can cover here.

Many software developers and product people, both independent and institutional, generally (and rightfully, I suppose) put their own projects, products (and yes, customers) ahead of others. This argument goes something like “I want this product, and I need this data to have this product, therefore I can and should have it.” I strongly feel this is a selfish position, but I understand where it’s coming from: a desire to innovate; a desire to serve customers. But there ought to be more responsible paths to take.

Robots.txt

The architects of the modern internet foresaw some of these problems, especially with regard to search engines. There had to be a way to tell programmatic collectors, hey, this part isn’t for you. Ignore it. Don’t index it, don’t cache it, don’t analyze it. This is what’s known as Robots.txt, a file present in your web root describing URIs to be ignored. Interestingly, the concept itself was much broader than that, as it should be, but Robots.txt is really a gentleperson’s agreement: I tell you what I don’t want you to do, and you respect it. Really, this is what most any terms of service is; Robots.txt just has some teeth in that it can be programmatically obeyed.
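
As a small illustration of “programmatically obeyed,” the sketch below uses Python’s standard `urllib.robotparser` module to check URLs against a robots.txt file before fetching them. The rules, the `example.com` host, and the `ExampleBot` and `BrokerBot` user agents are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone stays out of /private/,
# and one hypothetical crawler is shut out of the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BrokerBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent, url):
    """Ask the parsed rules whether this agent may fetch this URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/listings/1.html"),
        ("ExampleBot", "https://example.com/private/notes.html"),
        ("BrokerBot", "https://example.com/listings/1.html"),
    ]:
        print(agent, url, may_fetch(agent, url))
```

The check is entirely voluntary: a crawler that never calls `may_fetch` is not stopped by anything, which is the gentleperson’s agreement in code form.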


Legalese is just a programming language without an execution environment until a judge is interpreting it.

Robots.json

In an ideal world, though, Robots.txt would be expanded to include, really, all of the terms of service for a product that could (and should) be respected programmatically. Right now we do this with cache headers, nofollow links, robots.txt, and other such methods. But what if we had a place where both web applications and APIs could express their data license in a form that is easy to consume and to execute on?

Hypermedia-based APIs may hold some of the answers here, but it’s all still predicated on the basic decency of content consumers to follow them. It would, however, bring ‘the rules’ closer to an executable and accessible instruction set easily built into libraries and software at large.

A Robots.json file, or something like it, would contain all desired rules for retention, caching, access control, license rights, chain of custody, business models and other use cases allowed and encouraged for developers.
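
To give a feel for what such a file might look like, here is a sketch of one possible shape and a tiny loader for it. No Robots.json standard exists, so every field name below (`retention_days`, `allowed_uses`, `chain_of_custody`, and so on) is an assumption made up for illustration.

```python
import json
from dataclasses import dataclass

# A hypothetical Robots.json document. The schema is invented here.
ROBOTS_JSON = """
{
  "version": "0.1",
  "license": "proprietary",
  "cache_max_age_seconds": 3600,
  "retention_days": 7,
  "allowed_uses": ["search-indexing", "personal"],
  "forbidden_uses": ["redistribution", "resale"],
  "chain_of_custody": {"pass_through_required": true}
}
"""


@dataclass(frozen=True)
class Terms:
    license: str
    cache_max_age_seconds: int
    retention_days: int
    allowed_uses: frozenset
    forbidden_uses: frozenset
    pass_through_required: bool


def load_terms(text):
    """Parse a Robots.json string into a Terms record."""
    raw = json.loads(text)
    return Terms(
        license=raw["license"],
        cache_max_age_seconds=raw["cache_max_age_seconds"],
        retention_days=raw["retention_days"],
        allowed_uses=frozenset(raw["allowed_uses"]),
        forbidden_uses=frozenset(raw["forbidden_uses"]),
        pass_through_required=raw["chain_of_custody"]["pass_through_required"],
    )


def is_use_allowed(terms, use):
    """Deny by default: a use must be listed as allowed and not forbidden."""
    return use in terms.allowed_uses and use not in terms.forbidden_uses


if __name__ == "__main__":
    terms = load_terms(ROBOTS_JSON)
    for use in ("search-indexing", "redistribution", "analytics"):
        print(use, is_use_allowed(terms, use))
```

The deny-by-default choice in `is_use_allowed` is one possible design among several; a real proposal would have to decide how unlisted uses are treated.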

Back to Craigslist

If Craigslist had such a file, they could have easily expressed that, while their data is available to search engines for indexing, it is not available to be redistributed by a data broker. It would clearly lay out the chain of custody for the content, so that PadMapper might not have needed 3taps at all. Or perhaps 3taps could have simply passed through Craigslist’s terms of service by requiring their customers to respect the same terms.
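
The pass-through idea can be sketched as a merge that is never allowed to loosen the upstream terms. The dictionaries below are hypothetical stand-ins for a source’s terms and a distributor’s own; they do not describe Craigslist’s or 3taps’ actual policies, and the field names are assumptions.

```python
def pass_through(upstream, own):
    """Combine a source's terms with a distributor's own terms.

    The result can only be as permissive as the stricter of the two:
    allowed uses are intersected, forbidden uses are unioned, and the
    shorter retention period wins.
    """
    return {
        "allowed_uses": set(upstream["allowed_uses"]) & set(own["allowed_uses"]),
        "forbidden_uses": set(upstream["forbidden_uses"]) | set(own["forbidden_uses"]),
        "retention_days": min(upstream["retention_days"], own["retention_days"]),
    }


def may_use(terms, use):
    """A use is permitted only if explicitly allowed and not forbidden."""
    return use in terms["allowed_uses"] and use not in terms["forbidden_uses"]


# Hypothetical terms for a source and for a distributor sitting in between.
source_terms = {
    "allowed_uses": ["search-indexing"],
    "forbidden_uses": ["redistribution"],
    "retention_days": 7,
}
distributor_terms = {
    "allowed_uses": ["search-indexing", "analytics"],
    "forbidden_uses": ["resale"],
    "retention_days": 30,
}

# What the distributor's customers would have to respect.
customer_terms = pass_through(source_terms, distributor_terms)

if __name__ == "__main__":
    print(customer_terms)
    print("search-indexing:", may_use(customer_terms, "search-indexing"))
    print("analytics:", may_use(customer_terms, "analytics"))
```

Under these made-up terms the customer inherits the source’s ban on redistribution and its seven-day retention limit, even though the distributor’s own terms were looser.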

Then again, maybe the scene would have played out exactly the same. And that’s within Craigslist’s rights. Thankfully, those rights would have been codified not only in their terms of service, but also in their programmatic terms of service, their Robots.json, which developers like PadMapper and 3taps would have respected. We might lose out on some innovation this way, but constraints produce creativity. They simply would have had to find another way.

Earlier I said that the reasons Craigslist took action against both parties were complicated. They were, at least initially, related to whether or not PadMapper could co-mingle Craigslist’s data with listings from other services, or with advertising, and whether PadMapper could profit from that aggregation. Would we have a way to express constraints like these outside of a Terms of Service? Can we programmatically classify and ‘know’ that a service is co-mingling data, or has advertising, and enforce those rules automatically? It requires the next evolution of applications, one that we’re starting to see with wearables and systems that are contextually aware.

How else are we going to manage this problem, which really is only going to get more complicated?