Private Data and the Public Good

May 17, 2016

These are the prepared remarks given on the occasion of the Robert Kahn distinguished lecture at The City College of New York on 5/22/16.

I feel very honored to be able to talk to you this morning as part of the Kahn Lecture Series, following Robert Kahn, Dominic Orr, and Alfred Spector. My invitation is evidence of either a dramatic change in standards — or the timeliness of a topic of deep personal interest — the relationship of industry and academia.

For me, I work and think about a specific aspect of this relationship, the broader need for computer science to engage with the real world. Right now, a key aspect of this relationship is being built around the risks and opportunities of the emerging role of data.

Ultimately, I believe that these relationships, between computer science and the real world, between data science and real problems, hold the promise to vastly increase our public welfare. And today, we, the people in this room, have a unique opportunity to debate and define a more moral data economy.

Bloomberg: NYC’s tech startup

But before I start, please let me explain a little bit about Bloomberg as a company. As you may know, Bloomberg is many things, and one of them is a tech company, in particular a data and analytics company. Our core product is the Bloomberg Professional Service, also known affectionately as the Terminal. The company was formed in 1981, by four men: Michael Bloomberg, Tom Secunda, Charles Zegar, and Duncan Macmillan. The company built a simple function that enabled buy-side traders to calculate fair intraday municipal bond prices, bringing transparency to a typically opaque market. Nearly 35 years later, the company has a vibrant social network that connects the global financial world, a news organization that breaks huge stories, and a software platform that financial professionals use to power their entire work day.

When the company started, most of our clients, financial professionals, didn’t have a keyboard on their desk, so Bloomberg built a keyboard and monitor. The keyboard even renamed the enter key to the “go” key, which seems better to me: snappier, more exciting. Since then, a lot has changed. Now you’d be hard pressed to find someone without a computer on their desk and a smartphone constantly in their pocket.

The nature of how software is built has changed as well. In particular, the emergence of open source software as a force has changed the way our business interacts with software development. Even as late as the beginning of the century, it was feasible to develop in a vacuum, but now, with amazing open source software available, Linux and Hadoop being only the biggest examples, it simply isn’t viable anymore to try to build a software company without open source.

Similarly with academia and industrial research. Since 1981, the nature of industrial research has changed dramatically, and I’m grateful for the opportunity in this talk to consider these changes and highlight some of the challenges ahead.

Industrial Research in the 2010s

Before I do, let me tell you a little about my background. I went to graduate school at Hopkins, where I was part of the Center for Language and Speech Processing and worked on various issues in natural language processing. After finishing, I went to the University of Massachusetts at Amherst, where I worked on weakly-supervised learning. In 2007 I joined Google as a research scientist, and in 2014 I started at Bloomberg as head of data science.

While at Google, I was lucky to have worked with Alfred Spector as a research scientist and research manager for nearly 6 years, and was deeply touched by his generosity of spirit. He, along with my manager, was supportive of the work of my team, and I’m grateful for the work we were able to do there, including building the core small-to-midsize machine learning libraries, building one of the first machine learning as a service platforms (the Google Prediction API), and building a real-time collaborative data science platform on top of the iPython notebook called “coLaboratory”. In addition to this work, we were also able to do some academic research.

During my time there, I was also able to see first hand his views of the “hybrid research model”. I have some very strong views about it, and though Alfred is obviously not here to respond, I hope to explain it adequately and explain my concerns about it.

In a 2014 paper, Alfred Spector and colleagues explained their thoughts on the hybrid research model at Google. The core idea of the hybrid research model is that there should be no division between “research” and “practice”, and that research activities are organized “pragmatically” which means, I’ll posit, organized in deference to short-term, 1–3 year, business priorities, not to research which could be 10 or more years out.

The hybrid research model is positioned in contrast to an older style of industrial research, exemplified by Bell Labs. Bell Labs was, for me, the dream of industrial research. Bell Labs ushered in the digital age with the invention of the transistor, and incubated information theory, speech recognition, and many, many other key inventions. What made Bell Labs so inspirational for me was its focus on basic research: research with a decade-long horizon, and the freedom for its researchers to explore. It married this with a rigorous selection of the best scientists and vigorous internal debate and competition. The work these scientists performed, and published or patented, significantly enabled our national prominence in the computer industry.

No doubt, there was work at AT&T that was practical and immediate, notably the improvement of the telephony system, and then the development of the transistor. But the organization of the research group was not designed to build product directly. Bell Labs had an engineering division, whose job was to take the inventions and make them practical. For example, once the junction transistor was invented, engineering had to figure out how to mass produce it. That was certainly an important part of research, but it was also very practical. But within Bell Labs, there was a significant number of people who were doing basic research without an immediate company need, and they were embedded not with engineering groups but with each other, seated close together to encourage serendipitous discoveries.

The hybrid research model proposes something different. The hybrid research model embeds, as it were, researchers as practitioners. The thought was always that you would be going about your regular run of business, would face a need to innovate to solve a crucial problem, and would do something novel. At that point, you might choose to work some extra time and publish a paper explaining your innovation. In practice, this model rarely works as expected. Tight deadlines mean the innovation that people do in their normal course of business is incremental.

When I was a research scientist, I always placed my academic work apart from the development work I did. I built the core machine learning libraries and separately looked at distributed optimization. Sometimes the two intersected, but often they were entirely separate. This model worked for me, but it was difficult to support research activities of this kind that were off the beaten path, and my colleagues all struggled to organize their work similarly. This was the only workable system, since the incentive structure doesn’t explicitly reward research publications, and since publication isn’t explicitly rewarded, it truly isn’t paid for. The only way to do research within the hybrid model is as an uncompensated effort.

This model separates research from scientific publication, and shortens the time window of research to what can be realized within a few years. For me, this always felt like a tremendous loss with respect to the older so-called “ivory tower” research model. It didn’t seem at all clear how this kind of model would produce the sea change of thought engendered by Shannon’s work, nor did it seem that Claude Shannon would ever want to work there. This kind of environment would never support a freestanding wonder like the robot mouse that Shannon worked on. Moreover, I always believed that publication and participation in the scientific community are crucial to research. Without this engagement, it feels like something different: innovation, perhaps.

It is clear that the monopolistic environment that enabled AT&T to support this ivory tower research doesn’t exist anymore. Bell Labs doesn’t really exist in its old sense anymore, and neither do Yahoo Research, eBay Research, Intel Research, or IBM Research. Microsoft Research is still around, with much the same model, but the Silicon Valley branch closed around a year ago, and the current incarnation seems subject to repeated attempts to turn it into an applied research lab. Facebook Research may be similarly constituted. Nonetheless, researchers are moving into these hybrid research roles so rapidly that there is even concern about a brain drain from academia, where there will not be enough faculty members to train new PhDs.

Now, the hybrid research model was one model of research at Google, but there is another model as well: the moonshot model, as exemplified by Google X. Google X brought together focused research teams to drive research and development around a particular project, Google Glass and the self-driving car being two notable examples. Here the focus isn’t research, but building a new product, with research as potentially a crucial blocking issue. Since the goal of Google X is directly to develop a new product, by definition they don’t publish papers along the way, but they’re not as tied to short-term deliverables as the rest of Google is. However, they are again decidedly un-Bell-Labs-like: a secretive, tightly focused, non-publishing group. DeepMind is a similarly constituted initiative, working, for example, on a best-in-the-world Go playing algorithm, with publications happening sparingly.

Unfortunately, both of these approaches, the hybrid research model and the moonshot model, stack the deck towards a particular kind of research: research that leads to relatively short-term products that generate corporate revenue. While this kind of research is good for society, it isn’t the only kind of research that we need. We urgently need research that is long term, and that is undertaken even without a clear local financial impact. In some sense this is a “tragedy of the commons”, where a shared public good (the commons) is not supported because everyone can benefit from it without giving back. Academic research is a non-rival, non-excludable good, and thus can reasonably be expected to be underfunded. In certain cases, this takes on an ethical dimension, particularly in health care, where the choice of which diseases to study and address has tremendous potential to affect human life. Should we research heart disease or malaria? This decision has a huge impact on global human health, but is heavily informed by the potential profit from the medicines for each.

There was one more model that happens somewhat infrequently at Google and other places, but it’s one that I think holds enormous promise. It’s a model of sabbaticals, where academics spend a long time inside industry, doing academic research. It happens infrequently enough that it’s not a major research method, but I’ll return to this topic later.

Private Data means research is out of reach

The larger point that I want to make is that in the absence of places where long-term research can be done in industry, academia has a tremendous potential opportunity. Unfortunately, it is actually quite difficult to do the work that needs to be done in academia, since many of the resources needed to push the state of the art are only found in industry, in particular data.

Of course, academia also lacks machine resources, but this is a simpler problem to fix. It’s a matter of money: resources from the government could go to enabling research groups to build their own data centers or to acquire computational resources from the market, e.g. Amazon. This is aided by the compute philanthropy that Google and Microsoft practice, granting compute cycles to academic organizations.

But the data problem is much harder to address. The data being collected and generated at private companies could enable amazing discoveries and research, but is impossible for academics to access. The lack of access to private data from companies actually has much more significant effects than inhibiting research. In particular, the consumer-level data collected by social networks and internet companies could do much more than ad targeting.

Just for public health (suicide prevention, addiction counseling, mental health monitoring) there is enormous potential in the use of our online behavior to aid the most needy, and academia and non-profits are set up to enable this work, while companies are not.

To give one example, anorexia and other eating disorders are vicious killers. 20 million women and 10 million men suffer from a clinically significant eating disorder at some time in their life, and eating disorders have the highest mortality rate of any mental health disorder, with a jaw-dropping estimated mortality rate of 10%, both directly from injuries caused by the disorder and from suicide resulting from it.

Eating disorders are particular in that sufferers often seek out confirmatory information: blogs, images and pictures that glorify and validate what sufferers see as “lifestyle” choices. Browsing behavior that seeks out images and guidance on how to starve yourself is a key indicator that someone is suffering. Tumblr, Pinterest, and Instagram are places where people host and seek out this information. Tumblr has tried to help address this severe mental health issue by banning blogs that advocate for self-harm and by adding public service announcements to searches for terms related to anorexia. But clearly, this is not the be all and end all of the work that could be done to detect and assist people at risk of dying from eating disorders. Moreover, this data could also help us understand the nature of those disorders themselves.

Another kind of research that happens less than it should is research into the role that algorithms have in decisions about our lives. From credit checks to resume screening, algorithmic decisions are making an increasing difference in our lives, and have substantial influence on our ability to make a living. As such, it is vital to ask: are these decisions fair and legal? Do they uphold the standards of fairness required by law, not prejudicing decisions against a particular protected class? When these modeling decisions are made behind closed doors, by a private institution, it is nearly impossible to check whether or not a decision is non-discriminatory. The data necessary to test whether this is happening, since it is held by a private organization, is not accessible to academics and researchers.

Beyond bias, there are even more subtle ways in which algorithms affect our lives. As an example, a PNAS paper from Facebook showed that the selection of updates shown on Facebook was able to affect people’s day-to-day happiness. Another study has shown that during a political campaign, the order of the results, and whether they represent a particular candidate negatively or positively, can affect the outcome of the race. Now, we collectively believe and trust that these companies are not deliberately adjusting rankings to favor a particular politician, but it would be good to either change our thinking around this ranking, and collectively understand that it is not “impartial” or “neutral” even for a search engine, or else have a way to verify the bias that is implicitly encoded into these models and ensure that it conforms to our collective standards.

Finally, another example comes in addressing online harassment, where the ability for academics to look at and begin to understand online harassment would have general social utility, but is being hampered by the private nature of the data at these companies. While tamping down harassment and online threats may not be the highest priority from a business perspective, it causes real harm to those affected, and arguably is an area that the government should be more assistive in addressing. Does freedom of speech obligate us to protect people online from threats of violence and ensure their physical safety, or at the very least give more assistance to private corporations to ensure that public online discourse is conducted in a way similar to the way discourse is conducted in the physical world?

While companies will try to share data, there’s a limit to the amount of data a company can share, and key to this limit is the issue of privacy and anonymity. It is not clear how to enable access to this data in a way that protects the users of a data source from being deprived of their human right to privacy. This right, of course, is not inviolate, and as a society we enable police and anti-terrorism forces to look at phone conversations and emails in order to prevent crimes. But we traditionally assume that there are only two choices, complete privacy or complete disclosure, when there are more possibilities.

In particular, work has come out of academia on differential privacy, and on flexible schemes for adjusting how much of the data is disclosed and how much inferential power it provides. As an example, you can enable collection of aggregate statistics over a collection without revealing the particular composition of any data point. As another example, there is new work on architectures for issuing cryptographically secure data queries, which again limit the ability of the issuer to gain detailed views of the underlying data. This kind of work is crucial in enabling work on top of private data.
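
To make the aggregate-statistics idea concrete, here is a minimal sketch of one standard technique from the differential privacy literature, the Laplace mechanism, applied to a simple count. It is only an illustration under stated assumptions: the record layout, the function names, and the choice of `epsilon` are all hypothetical, and this is not a description of how any particular company releases data.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace distribution centered at 0."""
    # The difference of two independent exponential draws, each with
    # mean `scale`, follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Smaller `epsilon` means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person's record changes a count by at most 1,
    # so the query's sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: (user_id, searched_for_a_flagged_term)
records = [(user_id, user_id % 4 == 0) for user_id in range(1000)]

noisy = private_count(records, lambda r: r[1], epsilon=0.5,
                      rng=random.Random(0))
print(f"noisy count of flagged users: {noisy:.1f}")
```

A researcher who sees only the noisy count learns roughly how many users match, but cannot tell with confidence whether any one person's record is in the data. The trade-off is controlled by `epsilon`, which is exactly the kind of adjustable middle ground between complete privacy and complete disclosure discussed above.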

Ignoring the issue of whether the data we freely give these companies truly belongs to those companies, or whether it really belongs more to the public as originators of the data, there is tremendous good that private companies could be doing with their data. It’s fair that these companies aren’t able to do this work themselves, but I would hope that companies would make it more possible for interested parties from the public sector to work on their data.

Bloomberg’s Data for Good Initiatives

Bloomberg has always had a deep philanthropic bent, of course stemming from Michael Bloomberg himself. As a company, we’ve been wrestling with these issues and have tried a number of things to push the conversation along more vigorously. One piece of this has been to try to foster a community focused on the application of data science for social good. It’s been amazing to watch the growth of data science for social good work over the past few years, and NYC has been the locus of many of these efforts. As an example, the group DataKind matches public sector organizations with private sector data scientists, on volunteer engagements. In this way, these organizations, which often lack the structure to operate on the data sequestered within them, are able to build out machine learning systems to improve their functioning.

We’ve been aiding this community over the past few years, beginning in 2014 with the KDD conference. That year, we had a full day workshop at 731 Lexington, our global headquarters, on the applications of data science to public good, in an event that is now called “the Data for Good Exchange”. The goal of these meetings has always been the same: to connect groups that don’t typically talk to each other, public sector thought leaders and private sector data scientists and machine learning experts.

In order to bring together these two groups of very different kinds of people, we constructed the event with showcases for the very different kind of work they do. We have a paper section, which is geared for academic papers, the kind that would appear in a conference like ICML’s Machine Learning for Social Good. We also have a number of panels, where we encourage frank and vigorous discussions around key topics.

As an example, we had an academic paper by David Klein last year that showed how to use deep learning for biodiversity monitoring in forests: it modeled separate bird calls, and built a model to detect and count the number of distinct bird calls it heard. Overall, the work is incredibly intriguing in the way it suggests a natural world panopticon, where the wild only appears wild, but in reality there is a vast web of sensor networks that monitor the natural world and keep track of every animal, in the hope of preserving more biodiversity.

As an example of non-profits, we had a keynote by Oliver Wise, the chief analytics officer of New Orleans, who talked about the way New Orleans has used machine learning to improve the safety of its citizens. The fire department had gotten a huge grant of smoke detectors but didn’t know how to distribute them. Working with Enigma.io, they used a model inspired by the work that happened in NYC under Mayor Bloomberg’s MODA group to predict which houses were least likely to have a smoke detector, and then sent targeted outreach groups to those neighborhoods to offer the smoke detectors and install them, which reduced wasted visits. This kind of conversation helps show data scientists and machine learning people the actual problems faced by cities, and illustrates how they can help, and how to engage.

But the conference is only one piece of bridging these communities. At the conference last year, we also announced a researcher in residence position that we formed together with UNICEF. The position established funding for a researcher to be embedded into UNICEF to work inside the organization on bringing data driven methods to it. As a consequence of this researcher in residence position, we’ve been able to work very closely with UNICEF, and dovetail this work with another vehicle we’ve been using at Bloomberg, the visiting researcher model.

Finally, returning to the idea of sabbaticals, this year we have Mark Dredze from Hopkins on sabbatical at Bloomberg. His work at university has been focused on understanding public health impacts from social media, and has relied heavily on Twitter. Since being here, he’s been able to use our extensive Twitter license and cluster to perform experiments beyond the scope of what he’d been able to do in academia.

As an example, we’re working with UNICEF on their Zika response work. Along with other members of our NLP team, he’s built geolocation models and global mobility models that can help predict areas most under threat of Zika exposure.

Another example of the kind of work that he’s done while here is his work in understanding the effect of celebrity announcements of health crises. It’s been a widely held belief that celebrity discussions of their own struggles with health issues can drive public attention, and this is currently a burgeoning area of research.

When Charlie Sheen publicly announced that he was HIV positive, Mark looked at the coverage of this announcement in the media, and in search queries. He was able to use our news archives and Google search queries to show that there is a significant effect on public attention. In this case, there were nearly 3 million more searches for HIV related information because of Sheen’s announcement. Our news archives are unique resources, and very difficult to obtain anywhere else, and what was difficult to do in academia, was significantly easier at Bloomberg.

At this year’s conference we’re going to be trying something new, a research immersion day. While I was at Google, the sales org ran something called the ‘Sales Immersion’ day, where an engineer would follow someone from the sales org all day. The goal was for the engineers to gain some empathy for the sales org and to think about ways they could help. This year, we’re doing something similar, pairing up data scientists and public sector organizations, with the goal that the data scientists embed into the organization, help it understand which of its problems can be addressed by data science, and aid it in building out a data science team.

I think the point of building empathy and connections isn’t a trivial one here. If we’re going to have these separate organizations housing different kinds of work, it will take personal connections across the divide in order to get meaningful work to happen, and it is vital for the public good that this happen. Regardless, I think this model of a porous industry / academia relationship is a great model, with significant benefits to each side, and I expect us to continue this practice and grow it as possible.

Ecosystem Opportunities

Three last thoughts before I wrap. There is probably a role for a data ombudsman within private organizations: someone to protect the interests of the public’s data inside of an organization, like a ‘public editor’ at a newspaper, depending on how you’ve set it up. This person would be there to protect and articulate the interests of the public, which probably means both sides: making sure a company’s data is used for public good where appropriate, and making sure the public’s ‘right’ to privacy is appropriately safeguarded (and probably making sure the public is informed when their data is compromised).

Next, we need a platform to enable collaboration around social good between companies, and between companies and academics. This platform would enable trusted users to have access to a wide variety of data, and speed the process of research.

Finally, I wonder if there is a way that government could support research sabbaticals inside of companies. Clearly, the opportunities for this research far outstrip what is currently being done.

Thank you for giving me a chance to talk with you. I talked about the erosion of industrial pure research and its replacement with application focused short-term work. While academia is geared to approach these problems, I’d argue that it leaves work important for the public good undone, since the data crucial for doing this research remains sequestered in private companies.

In order to overcome these challenges, I argued for a more porous exchange between academia and industry. Along these lines, I talked about the Data for Good Exchange, the researcher-in-residence collaborative project with UNICEF, and our visiting researcher Mark Dredze and his work. I’ve highlighted our upcoming conference, and the public-sector immersion day we are currently running, and hope that some of you may take part.

Data has a tremendous capacity to transform our society, and make it more just and fair, but this won’t happen without care in how we govern that data and who has access to it. I hope for your help in thinking through and helping push the boundaries of a more moral data economy.

Thank you to Ana-Maria Popescu, Mark Dredze, Arnaud Sahuguet, Gwendolyn Litvak, and Susan Kish for their insightful feedback.