Some people fear the singularity. I have a more mundane concern: are we approaching the end of intellectual property?
I’m not talking about software patents, whose disappearance I would welcome. I mean intellectual property in the broadest sense — the protectable economic value of our intellectual output.
Stealing Machine Learning Models
Last month, Florian Tramèr, a PhD student at Stanford, presented joint work with Fan Zhang, Ari Juels, Michael Reiter, and Thomas Ristenpart on “Stealing Machine Learning Models via Prediction APIs” at the 25th USENIX Security Symposium in Austin. I highly recommend watching the talk.
Why do I believe this work — or, more broadly, work on model extraction attacks — is so significant? Because machine-learned models are rapidly becoming our most valuable storehouses of intellectual property.
It’s a cliché that software is eating the world, and that data is eating software. But much of the power of data comes from using that data to train machine-learned models. If it becomes easy to reverse engineer (that is, steal) those models simply by consuming their output, then possession of the original model — and of the data used to build that model — will no longer be a proprietary advantage.
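To make that concrete, here is a minimal sketch, roughly in the spirit of the equation-solving attacks the paper describes for models that return confidence scores. Everything here is hypothetical: the “API” is a stand-in for a real prediction service, and the secret parameters exist only so the example runs end to end.

```python
# Hypothetical prediction service: internally it's a linear model with
# proprietary parameters, but an attacker sees only the returned score.
SECRET_WEIGHTS = [0.7, -1.2, 3.4]
SECRET_BIAS = 0.5

def predict_api(x):
    """Black-box prediction endpoint: input features in, score out."""
    return sum(w * xi for w, xi in zip(SECRET_WEIGHTS, x)) + SECRET_BIAS

# Extraction: a d-dimensional linear model is fully determined by d + 1
# well-chosen queries. Query the origin to recover the bias, then each
# standard basis vector to recover each weight.
d = 3
stolen_bias = predict_api([0.0] * d)
stolen_weights = []
for i in range(d):
    e_i = [1.0 if j == i else 0.0 for j in range(d)]
    stolen_weights.append(predict_api(e_i) - stolen_bias)

# stolen_weights and stolen_bias now match the secret parameters
# (up to floating-point error).
print(stolen_weights, stolen_bias)
```

Real models are rarely this simple — nonlinear models, or APIs that return only class labels, require many more queries — but the paper shows that extraction remains surprisingly practical even then.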
Uncharted Legal Territory
In general, reverse engineering is legal in the United States as a fair use exception to copyright infringement. But, as decided in Bowers v. Baystate Technologies, people are allowed to make more restrictive contractual arrangements, and those contracts are enforceable.
So, if you are using a machine-learned model to provide a commercial service, you probably want to follow the standard practice of prohibiting reverse engineering in your contractual terms of service.
It’s unclear, however, whether a model extraction attack is legally considered to be reverse engineering. Machine learning platform provider BigML wrote the following in its response to the USENIX paper:
To our knowledge, there has been no major IP litigation to date involving compromise of machine-learned models, but as machine learning grows in popularity the applicable laws will almost certainly mature and offer some recourse against the exploits that the authors describe.
Does Size Matter?
What will happen as our intellectual property becomes increasingly concentrated in proprietary machine-learned models?
Eventually, we’ll see a test case, where businesses try to establish legal protection for their models and the data used to train them. And we’ll almost certainly see unauthorized copying of models, analogous to what we’ve seen for all other digital goods. Perhaps we’ll even see machine-learned models distributed using BitTorrent.
But I’m curious about the slippery slope from copying of models to learning from them. In US copyright law, we have a fair use exemption that permits limited use of copyrighted material without acquiring permission from the rights holders. There’s a four-factor test for fair use that includes:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
the nature of the copyrighted work;
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
the effect of the use upon the potential market for or value of the copyrighted work.
The third factor interests me most. A model extraction attack requires you to obtain some amount of labeled data from the model. Under what conditions can you assert fair use on the grounds that you’re only using a small amount of data compared to the model as a whole? How do we even make such a judgment? Do we create a legal definition of model complexity? Or a legal threshold for the similarity between a derived model and the original?
In plain English, where do we draw the line between sampling and copying?
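One way to imagine operationalizing such a threshold — purely as a toy illustration, not a legal proposal — is to measure the agreement rate between a derived model and the original over a reference set of inputs. Both models below are hypothetical stand-ins, and the choice of reference inputs is an assumption of the sketch.

```python
# Toy "similarity" measure: on what fraction of reference inputs do the
# original model and a derived model make the same prediction?

def original_model(x):
    # Hypothetical proprietary classifier (a simple threshold rule).
    return 1 if 2.0 * x - 1.0 > 0 else 0

def derived_model(x):
    # Hypothetical model learned by sampling the original's outputs;
    # its parameters are close to, but not identical with, the original.
    return 1 if 1.9 * x - 0.9 > 0 else 0

# Reference inputs: an evenly spaced grid over [0, 1].
inputs = [i / 100.0 for i in range(101)]
matches = sum(original_model(x) == derived_model(x) for x in inputs)
agreement = matches / len(inputs)

print(f"agreement: {agreement:.2%}")
```

Even this toy raises the hard questions: which inputs count, how high an agreement rate constitutes copying, and whether near-identical behavior near the decision boundary matters more than agreement elsewhere.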
Will AI Revolutionize IP?
I’ve focused on a fairly theoretical discussion of model extraction as an attack on machine learning systems. But we’re seeing machine learning become the foundation of software applications. The vulnerability of machine learning to model extraction is not just theory; it’s a practical concern.
Machine learning is already ubiquitous, and it’s the foundation of a new generation of AI products and services. How will we protect the intellectual property embodied in those products and services, if anyone can reverse engineer their core IP simply by using them and feeding their output into commodity machine learning systems like TensorFlow? Will we need to overhaul our framework for IP? Or perhaps abandon the concept altogether?
I recognize I’m oversimplifying this issue by reducing all reverse engineering of AI systems to model extraction, and that there are still limits to what model extraction can do today. But I believe it’s a useful oversimplification.
We’re entering a brave new world. We’ll have to figure out the rules for it.