Machines That Tell Stories with Data — An Interview with Amy Heineike from Primer AI
In this episode of the Masters of Data podcast, I speak to a fascinating guest who is on the cutting edge of AI and machine learning. One of the most interesting areas of innovation in machine learning and artificial intelligence is Natural Language Processing, or NLP. Basically, it is the idea of teaching machines to understand human language and then produce convincingly human language in return. The guest on this episode and her company build what they call “machines that can read and write”. Amy Heineike is the VP of Engineering at Primer AI. Amy and I sit down to talk about how she and her team are using applied machine learning techniques to build software that tells stories from data. In particular, we discuss “Quicksilver”, a Primer project aimed at filling gaps in Wikipedia by producing new content for people who do not have pages.
To get things started, Amy gives me a little bit of background on her entry into data science and how following her curiosity ultimately led her to Primer. “I’ve been very driven by curiosity and had a little bit of a strange winding path to get here. I started out…as a mathematician and got super interested in complexity theory and ideas for how you could tell stories about what’s going on in the world…I had spent a while doing transportation, economics, modeling cities, and then wondered where you keep getting more data from, and then when I moved to the US, got the opportunity to work with a startup company, and then got into NLP and modeling data to try and figure out the stories from it.” But truth be told, the path has not always been smooth. As Amy notes, “When I first started working in Silicon Valley…it was weird to have a mathematician on the team…[it] was like, why are you here?” Having a mathematician on a team is more normal now. “There are a bunch of different paths people have taken. [Like] on our team, we’ve got people from a computational science background, computational chemistry, biology, astrophysics, and we’ve got people who actually had humanities backgrounds, and then switched into science.” And the result of having such diversity on a team is powerful.
Amy and I also dedicate time to discuss the main function and focus of Primer as Amy unpacks what exactly they are doing with NLP. She notes, “Natural Language Processing [is] the idea of having algorithms that can extract information from text…that people write. What we’re seeing is that there [is] a lot that [is] possible with those algorithms, lots that we could actually learn from text and we wanted to build tools that would let people understand vastly more content [more] quickly.” She continues, “I think what we’ve also been very aware of is that there are just a lot of different kinds of problems you have to solve to build data-driven products. We had this idea that it would be cool to generate Wikipedia pages automatically. [But] what’s the input data to that? Well, we started thinking about scientists, to start with. There’s a lot of scientists out there. There’s a lot who are doing really interesting, impactful science. And of those, only a portion of them have Wikipedia pages. What we’re wondering is, could we make Wikipedia pages for the ones who didn’t have them?” And this was the genesis of Quicksilver and the strategy behind it.
While the goal of getting more interesting and impactful people represented on Wikipedia is commendable, it raises the question of why. Why is this so important and worth Primer’s focus and effort? “Wikipedia is really interesting, right? It’s this huge resource that’s really wonderful when you interact with it. There’s loads we can learn about. We all go and look up stuff all the time, right? But there’s actually big recall problems. There’s big holes in it. There’s actually a lot of missing content. And so one example…[was that] there’s a woman called Donna Strickland, who won the Nobel Prize in physics [as the first woman to do so]…in [more than] 50 years. The morning she got the Nobel Prize in physics, she didn’t have a Wikipedia page.” So even those who are changing the world and impacting society in amazing ways are still somewhat unknown to the general public. At the end of the day, the goal and the mission of the Quicksilver project is simple: “There’s some people who are super interesting who don’t have those pages.” And that’s where Primer comes in.
But understanding the why is not enough. Amy and I also review exactly how the Primer team is undertaking this project by using NLP and AI. As Amy explains, “We have a really fun model that can take the sentences from news articles and figure out whether they look like sentences from Wikipedia pages. We also have models that can structure content out of those sentences, figure out…fields of research, institutions, awards they’ve won and that kind of stuff. So we’re able to build a model of that mapping.” She continues, “The machine can go through all of this hard work of scanning through lots of content, finding the information that you probably want [to] put in a page and [bring] it all together and also actually get all the references together, so [it] links back to your original content, puts that in the Wikipedia format and assembles that all.” And the solution they bring to all of this is helpful and effective. As Amy explains, getting a Wikipedia page posted in the first place is a somewhat complicated process. When an individual compiles information and makes a page, it’s then forwarded to an editor who verifies and processes the content. “There’s a whole world that they’ve built around those pages. It’s not just you write a page and it’s there. There’s actually a whole process. Those processes are what make the quality of the content what it is.” Simply put, it’s a large undertaking that people often don’t understand. But through the power of NLP, AI and a firm like Primer, the process can become simpler and more efficient, and it can help bridge the gap between humans and data in a whole new way for the good of everyone.
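To make the steps Amy describes a little more concrete, here is a minimal Python sketch of that kind of pipeline: filter news sentences that "look like" Wikipedia sentences, pull out a few structured fields, and assemble a draft with references. It is not Primer's implementation. Primer uses trained models, while this toy version swaps in a keyword score and regular expressions. The names `Sentence`, `BIO_CUES`, `wiki_likeness`, `extract_fields` and `build_draft`, the threshold value, and the sample person and URLs are all hypothetical and made up for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Sentence:
    text: str
    source_url: str  # kept so the draft can link back to the original content


# Hypothetical cue words standing in for a trained sentence classifier.
BIO_CUES = ("professor", "university", "research", "award", "prize", "institute")


def wiki_likeness(sentence):
    """Toy score in [0, 1]: the fraction of cue words found in the sentence."""
    lowered = sentence.text.lower()
    return sum(cue in lowered for cue in BIO_CUES) / len(BIO_CUES)


def extract_fields(sentences):
    """Toy structured extraction with regular expressions (assumed patterns)."""
    institutions, awards = set(), set()
    for s in sentences:
        institutions.update(re.findall(r"University of [A-Z][a-z]+", s.text))
        awards.update(re.findall(r"(?:[A-Z][a-z]+ )+(?:Prize|Award|Medal)", s.text))
    return {"institutions": sorted(institutions), "awards": sorted(awards)}


def build_draft(name, sentences, threshold=0.3):
    """Keep high-scoring sentences and assemble a rough wikitext draft."""
    kept = [s for s in sentences if wiki_likeness(s) >= threshold]
    fields = extract_fields(kept)
    # Each kept sentence carries a <ref> pointing at the article it came from.
    body = " ".join(f"{s.text}<ref>{s.source_url}</ref>" for s in kept)
    lines = [f"'''{name}'''", "", body, ""]
    for label, values in fields.items():
        if values:
            lines.append(f"* {label.capitalize()}: {', '.join(values)}")
    lines += ["", "== References ==", "{{reflist}}"]
    return "\n".join(lines)


# Made-up sample input; the person, places and URLs are fictional.
SAMPLE = [
    Sentence("Jane Doe is a professor of physics at the University of Exampleton.",
             "https://example.com/a"),
    Sentence("She received the Example Prize for her research on lasers.",
             "https://example.com/b"),
    Sentence("Traffic was heavy downtown on Tuesday.", "https://example.com/c"),
]

draft = build_draft("Jane Doe", SAMPLE)
print(draft)
```

In this sketch the traffic sentence scores zero and is dropped, while the two biographical sentences are kept along with their source links. A draft like this would still go through Wikipedia's human editing process, which is the point Amy makes about those processes being what gives the content its quality.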
Outbound Links & Resources Mentioned
Learn more about Amy:
Connect with Amy on LinkedIn:
Follow Amy on Twitter: @aheineike
Learn more about Primer AI:
Follow Primer AI on Twitter: @primer_ai
Follow Primer AI on LinkedIn:
- Natural Language Processing (NLP) is the idea of having algorithms that can extract information from text people write.
- Primer AI saw that there was a lot that was possible with algorithms, lots that could actually be learned from text and they wanted to build tools that would let people understand vastly more content quickly.
- At one point it was considered weird to have a mathematician on a data science team. Once the phrase “data science” was coined, it became a bit easier to talk about.
- With Primer’s Quicksilver project they had the idea that it would be helpful to generate Wikipedia pages automatically.
- The Primer team saw that there are a lot of scientists doing really interesting, impactful science, and only a portion of them have Wikipedia pages. The thought was, “could we make Wikipedia pages for the ones who didn’t have them?”
- Wikipedia is a huge resource that’s really wonderful when you interact with it. There’s loads we can learn about. But there are big recall problems, big holes in it and a lot of missing content.
- Women and underrepresented minorities may get less coverage on Wikipedia than other groups, so there are some people who are super interesting who don’t have pages.
- For people who have both Wikipedia pages and news coverage, Primer’s AI can look at which content from the news makes it into their Wikipedia pages.
- They have a model that can take the sentences from news articles and figure out whether they look like sentences from Wikipedia pages. They also have models that can structure content out of those sentences and figure out fields of research, institutions, awards they’ve won, that kind of stuff. So they build a model of that mapping.
- The machine can go through all of this hard work of scanning through lots of content, finding the information that you probably want to put in a page, bring it all together and actually get all the references together, so it links back to your original content and puts that in the Wikipedia format.
- With Wikipedia, an individual compiles content and makes a page, then it’s forwarded to an editor for processing. There’s a whole world that they’ve built around those pages. It’s not just you write a page and it’s there. There’s actually a whole process. Those processes are what make the quality of the content what it is.
- The challenge is that we live in a world that’s changing very rapidly. New things are happening that don’t fit in the models of the past, and some of the things that happened in the past are not things we want to keep having happen in the future.
- Firms like Primer are seeing that having people in the loop might be useful. If algorithms can tell you about the data they are flagging, then we can be thoughtful about building tools that are empowering and informative.
- We’re already in a situation where there’s so much content in the world that we can’t interact with all of it. We have ways of filtering that down.