Is AI Even a Big Data Problem?

It’s not about the size of the database, it’s how you use it — for now.


There’s been a big fuss about AI lately. It’s said that AI is the next big frontier. From personal assistants to self-driving cars, AI is set to be an integral part of our lives in the medium- to long-term future. The largest companies in the world are throwing big money at AI development, each hoping to be the first to crack the equation. It’s said that the companies that get left behind at this stage of AI development will not be able to catch up in the future.

I am a PhD student of psychology (publications pending) with an interest in technology. My training, however, is in basic learning theory, more commonly referred to as conditioning theory (both Pavlovian and operant). It’s similar to reinforcement learning in the computing world: the idea of a system that takes information from the environment and processes it, then monitors the output and adjusts its future behaviour based on the feedback. When I hear about AI and teaching computers to learn, my ears prick up because I’m very interested in this area. I have recently noticed that some of the discussion seems a bit misguided, especially from an experimental point of view. I’m assuming that the commentators know this and simply don’t bother fleshing it out before making their larger points. I don’t pretend to know a lot about programming or computers (although I use both in my research), but I know a lot about designing experiments to assess how things learn, and how things should learn.

There’s been a strong assertion that “big data” is required in order to develop AI. By this logic, companies such as Apple that do not collect enough data from their users cannot compete with Google or Facebook, companies that make their living by mining user data. As a researcher, while I agree that data is very important for developing new systems and finding out about the way the world works, I disagree with the notion that information mined from private user data is required to develop strong AI. This disagreement has been (not so) gently put forward by Apple over the last year or so:

“Well, turns out, if you want to get pictures of mountains, you don’t need to get it out of people’s personal photo libraries.” — Craig Federighi, speaking to John Gruber after WWDC

Federighi repeated this point in a recent article about Apple’s AI efforts versus the idea that user data mining is required to develop AI technology:

This view is hotly contested by Apple’s executives, who say that it’s possible to get all the data you need for robust machine learning without keeping profiles of users in the cloud or even storing instances of their behavior to train neural nets. “There has been a false narrative, a false trade-off out there,” says Federighi. “It’s great that we would be known as uniquely respecting user’s privacy. But for the sake of users everywhere, we’d like to show the way for the rest of the industry to get on board here.” — Steven Levy for Backchannel

This piece discusses the idea of what big data even is, and how a company like Apple can develop AI while apparently ceding this ground to fellow tech companies.

Big data: What is the big deal?

When we hear the words “big data”, the name Google pops up in our minds right away. It’s basically where everything lives. Everyone knows of Google combing the web and documenting everything that’s on it. If something isn’t on Google, it may as well not be on the internet (ironically, something like this article — I’m not expecting a lot of reads).

Big data typically refers not only to the data that is being collected, but also to the process of analysing that data in order to make sense of it. Look at Google search, for instance. It’s not just a database that you have to scroll through to find what you need. It takes in your query (and any variation of that query, correcting for grammatical and spelling errors), and spits out the links that it thinks are the most relevant for you. For example, I was trying to find some images of different parts of the brain the other day. I put in the search term for the first part I wanted to look at, and then a few more. By the third or fourth search, I barely had to enter a full word before the Google search bar would autocomplete to exactly what I was looking for. In other words, Google search knew that I was on a quest for brain parts, and started prioritising brain images over anything else that could also match the search terms I was using (many parts of the brain are named after physical objects).

Google’s products can be seen as a giant suite of surveying tools that gather user data. For example, Google combs through every video on YouTube and uses that information to (amongst other things) learn about how people use language. This information gets fed to the giant Google overlord in the sky, and manifests itself in other places, for example Google Now’s excellent speech recognition. A similar thing can be said for Facebook. Facebook, however, has access to more personal information about users than Google. While Google collects search terms and clicks on links (this is an oversimplification; Google has many properties, e.g. Maps, Android, Play Store), Facebook has access to people communicating with other people, and (to some extent) their likes and dislikes (and even more emotions now, with the new reaction options), and perhaps even what users are hovering their mouse over when they browse Facebook. Facebook has access to carefully curated photo albums that people show each other, and access to how people communicate with their favourite celebrities. Soon, Facebook will have access to how people behave in virtual environments. And I haven’t even mentioned Amazon and AWS. All of this is often stated prior to making this point: how can Apple, let alone smaller projects like Viv, compete with these companies whose entire purpose is to collect user data?

Mo’ data, same problems

Having a lot of data doesn’t automatically mean that you’ve got all the ingredients you need to develop workable AI. There are several pitfalls to avoid. For example, the biggest sin in psychology, and in the social sciences in general, is to take a giant database and run correlations on it until something significant pops up. Not only are you in danger of making a Type I error, that is, a false positive (which a good statistical program should, with a large sample, be able to correct for), you’re also not necessarily asking the right questions about the data. We have to remember that correlation does not imply causation. That’s a silly catchphrase that is drummed into every first-year statistics student, but it remains relevant no matter what level of research you are at. For example, a useful piece of correlational data, one that plausibly reflects a causal link, is a user entering a search term and then clicking on a link. If the user then clicks on another link, the first link may not have been relevant, or may not have contained all of the information that the user required. A less useful piece of correlational data would be, for example, observing someone buying a large item such as a TV on Amazon, and concluding that this person is interested in seeing more TVs in the future (even though they have already bought one).
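To make the correlation-dredging problem concrete, here is a rough Python sketch. It generates columns of pure random noise, correlates every pair, and counts how many pairs come out “significant”. Everything in it is my own illustrative assumption rather than anyone’s real analysis: the function names, the 40 variables, the 200 samples, and the use of the Fisher z-transform as a large-sample approximation for the p-value. The Bonferroni correction shown is just one common way to account for running many tests.

```python
import math
import random
from itertools import combinations
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z-transform (a large-sample approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_variables=40, n_samples=200, alpha=0.05, seed=0):
    """Correlate every pair of pure-noise variables and count 'significant' hits.

    The sizes and the seed are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_variables)
    ]
    p_values = [
        approx_p_value(pearson_r(a, b), n_samples)
        for a, b in combinations(columns, 2)
    ]
    n_tests = len(p_values)
    # No real relationship exists, so every hit below is a Type I error.
    uncorrected = sum(p < alpha for p in p_values)
    # Bonferroni: divide the threshold by the number of tests performed.
    bonferroni = sum(p < alpha / n_tests for p in p_values)
    return n_tests, uncorrected, bonferroni


if __name__ == "__main__":
    n_tests, uncorrected, bonferroni = dredge()
    print(f"{n_tests} tests on pure noise")
    print(f"'significant' at p < 0.05 without correction: {uncorrected}")
    print(f"'significant' after Bonferroni correction: {bonferroni}")
```

With 40 variables there are 780 pairs to test, so at a 0.05 threshold you would expect roughly 39 spurious “findings” even though the data contains no relationships at all. The correction removes most of them, but it cannot tell you whether you were asking a sensible question in the first place.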

At this point in time, machine learning still requires that a human specify the rules and point the AI in the right direction by telling it to ask the right questions, which means that good questions have to be asked. The famous AlphaGo AI, developed by Google’s DeepMind, that defeated the Go champion Lee Se-dol was developed in a way that was much more guided than one might imagine. Without much context, a layperson reading about AlphaGo might believe that programmers programmed a computer to place pieces on a board and then let it play against itself until it learned how to play the game. In fact, the development of AlphaGo, while impressive, required much human intervention. Right now, talented humans are still very much required to think about what to tell the programs to look for, how to ask the right questions of the programs, and how to set up the conditions of reinforcement and punishment (known as an operant in psychology) in order to help the machines learn what they need to and, more importantly, to purge things that they don’t need. This may be partly why Siri, Apple’s personal assistant, continues to match the capabilities of Google Now even though the data-mining capabilities of Apple apparently pale in comparison to those of Google. The data is only as good as what they can make of it, and while having the data is obviously better than not having it, the data itself is not the most important commodity, but a part (albeit a very necessary one) of what you need to create strong AI.
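As a toy illustration of what “setting up the conditions of reinforcement” means, here is a minimal Python sketch of a learner choosing between three actions. It is an epsilon-greedy bandit, a standard textbook technique, and it is not how AlphaGo or any real assistant was built. The function names, the payoff probabilities, and the exploration rate are all made up for the example. The point is that the `reward` function, the part written by a human, decides what the machine ends up learning.

```python
import random


def reward(action, rng):
    """Human-designed feedback. The payoff probabilities are invented:
    action 2 is reinforced most often, action 0 least often."""
    payoff_probability = [0.2, 0.5, 0.8]
    return 1.0 if rng.random() < payoff_probability[action] else 0.0


def train(n_trials=5000, epsilon=0.1, seed=0):
    """Learn the value of each action from feedback alone (epsilon-greedy)."""
    rng = random.Random(seed)
    values = [0.0, 0.0, 0.0]  # running estimate of each action's payoff
    counts = [0, 0, 0]        # how often each action has been tried
    for _ in range(n_trials):
        if rng.random() < epsilon:
            # Occasionally explore: try an action at random.
            action = rng.randrange(3)
        else:
            # Otherwise exploit: pick the action that has paid off best so far.
            action = max(range(3), key=lambda a: values[a])
        r = reward(action, rng)
        counts[action] += 1
        # Incremental average: nudge the estimate towards the new outcome.
        values[action] += (r - values[action]) / counts[action]
    return values, counts


if __name__ == "__main__":
    values, counts = train()
    for action in range(3):
        print(f"action {action}: tried {counts[action]} times, "
              f"estimated value {values[action]:.2f}")
```

Change the numbers inside `reward` and the learner will happily settle on a different action. The learning rule stays the same; what the system becomes good at is determined by the contingencies a person chose to set up.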

The empire strikes back

All of this is not meant to imply that Apple does not need data to get into the AI game. That would be a foolish thing to claim. But we should take some time to think about what it means to be intelligent. Knowing a lot of things doesn’t necessarily make one intelligent. We spend years at school and university not learning a bunch of facts, but learning how to do things. Information is less valuable than the ability to decipher and use it. In fact, so much of human intelligence is based around purging information that isn’t useful and linking important concepts together in order to behave more flexibly. In other words, machines shouldn’t require perfectly clean data to learn. Apple is betting on this belief by investing in differential privacy: collecting data from users with noise added, so that personal information cannot be identified. With a strong, flexible AI, that shouldn’t be a problem. If the point of an AI assistant, for example, is to learn from you specifically so that it knows what you want, then it only really needs fuzzy data to start with. It’s just like meeting a new friend: you start with basic assumptions about what a person is like and how they behave, but you learn more about this individual as you interact with them. It’s the learning that is important, not the knowledge.
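The idea of learning from deliberately noisy data can be shown with a short Python sketch of randomized response, a classic survey technique that is often used to introduce differential privacy. To be clear, this is not Apple’s actual mechanism, which I have no inside knowledge of. The function names, the 30% true rate, and the 50% truth-telling probability are assumptions chosen purely for illustration.

```python
import random


def randomized_response(true_answer, p_truth, rng):
    """Report the truth with probability p_truth, otherwise a fair coin flip.

    Any single report is deniable: it may just be the coin talking.
    """
    if rng.random() < p_truth:
        return true_answer
    return rng.random() < 0.5


def estimate_true_rate(reports, p_truth):
    """Undo the noise in aggregate.

    Expected observed rate = p_truth * true_rate + (1 - p_truth) * 0.5,
    so we solve that equation for true_rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth


def simulate(n_users=100_000, true_rate=0.3, p_truth=0.5, seed=0):
    """Simulate a population of users and recover the rate from noisy reports."""
    rng = random.Random(seed)
    truths = [rng.random() < true_rate for _ in range(n_users)]
    reports = [randomized_response(t, p_truth, rng) for t in truths]
    return estimate_true_rate(reports, p_truth)


if __name__ == "__main__":
    print(f"estimated rate from noisy reports: {simulate():.3f}")
```

No individual report can be trusted, yet the estimate across many users lands close to the true rate. That is the trade being described: the collector gives up certainty about any one person and keeps the population-level pattern.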

In addition, Craig Federighi’s quotes above pretty much confirm that Apple is conducting web crawling and data mining as part of its AI efforts. It would not surprise me at all if Apple were hijacking Google’s (and others’) work by pointing its learning machines at YouTube to learn about natural language, getting them to do Google image searches for mountains and lakes to learn what they look like, and deploying its own bots on social networks to learn how humans interact with one another (à la Microsoft). Its continued purchases of AI technologies indicate that Apple believes talent and software are (at least for the moment) still king in the pursuit of AI.


All of this means that big data does not necessarily mean an unassailable lead. We’ll see how the companies with all the big data leverage their advantage in the coming years, but for the moment it is clear that the AI industry is still a tiny seedling that requires a strong guiding hand. The winners will not be the people with the most data, but the people who figure out the best way to use the (available) data. Many people already recognise this, and AI-focused startups continue to spring up everywhere. Certainly, the smaller players in the industry have no intention of yielding to the big fish just yet. This race is still worth keeping an eye on for the foreseeable future.