How do you explain Machine Learning and Data Mining to non-Computer Science people?

I’ll talk about one aspect/technique of machine learning/data mining.

Let me begin with an admittedly contrived situation. Suppose there are a bunch of tiny balls that are magically floating in a room. (Bear with me…) We’d like to know whether there’s any particular structure to the positions of the balls. For example, do the balls tend to cluster together in certain areas? Do the balls avoid certain spots? Are they evenly distributed everywhere?

However, the room is completely dark, so we can’t see anything. But we do have a flash camera that allows us to take pictures of the floating balls in the room.

So we take a photo, and it looks like this:

From this photo we’re not able to discern much structure, if there even is any, in the positions of the balls. The balls look more or less evenly distributed from this perspective. So we try moving laterally and taking another photo from that new vantage point.

The balls still look pretty much randomly distributed, with no particular patterns. Let’s try taking a photo from a higher angle.

Hmm, still nothing notable here. Okay, let’s try it one last time, lowering our perspective.

Ah-ha! We’ve just discovered something interesting: it looks like the balls are either located near the ground or near the ceiling of the room, and there are no balls that are located in between those two clusters. In order to discover this structure, we needed to take a photo of the room from a “good” angle. The structure could not have been discovered from the previous “bad” angles.
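For readers who like to see things in code, here is a minimal sketch of the dark-room experiment. It assumes a 10 x 10 x 10 room with 200 balls (numbers made up purely for illustration) and treats a "photo" as a flat projection that keeps only two directions of the room. The helpers `take_photo` and `largest_gap` are hypothetical names invented for this sketch, not functions from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Assumed room: 10 x 10 x 10. Balls are spread evenly across the floor plan
# (x, y) but hover either near the ground or near the ceiling (z).
x = rng.uniform(0, 10, n)
y = rng.uniform(0, 10, n)
z = np.where(rng.random(n) < 0.5, rng.uniform(0, 1, n), rng.uniform(9, 10, n))
balls = np.column_stack([x, y, z])


def take_photo(points, right, up):
    """Flatten 3-D points onto a 2-D image spanned by the 'right' and 'up' axes."""
    axes = np.column_stack([right, up])  # shape (3, 2)
    return points @ axes                 # shape (n, 2)


def largest_gap(values):
    """Width of the widest empty stretch along one image axis."""
    return np.diff(np.sort(values)).max()


# "Bad" angle: looking straight down, the image shows only x and y.
top_down = take_photo(balls, right=[1, 0, 0], up=[0, 1, 0])

# "Good" angle: camera level with the floor, the image shows x and z.
side_on = take_photo(balls, right=[1, 0, 0], up=[0, 0, 1])

print("widest empty band, top-down photo:", largest_gap(top_down[:, 1]))
print("widest empty band, side-on photo: ", largest_gap(side_on[:, 1]))
```

In the side-on photo the widest empty band spans most of the room's height, while in the top-down photo the balls fill the frame with no comparable gap. The data is the same in both cases; only the angle differs.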

In the situation I’ve just described, we are looking at 3-dimensional data points: the position of each floating ball is described by a collection of 3 numbers (x coordinate, y coordinate, and z coordinate). But there are problems in which our data points are described by much larger collections of numbers. For example, a medical record for a hospital patient may consist of 500 numbers: date of birth, height, weight, blood pressure, date of last hospital visit, cholesterol, and so on. We may be interested in figuring out whether these data points have any structure. For example, are heart attack sufferers’ data points clustered together in any way? If so, and if we later find that a new patient’s data point lies close to that cluster, we may label that patient as being at risk for a heart attack. (NB: In reality it probably wouldn’t be so simple, of course.)

The data in this case is difficult or impossible for a human to visualize. How can we possibly visualize 500 dimensions? Just as we could not see anything in the contrived “dark room” example above, we cannot “see” data points in 500 dimensions. In my previous example, we were taking 2-dimensional photographs of 3-dimensional data points, and we can take lower-dimensional “photographs” of 500-dimensional data points in an analogous way.

So by taking these “photographs” from appropriate “angles”, we can find structures and patterns in the data that would be difficult to find otherwise. This is one example of what people mean when they talk about “finding insights” in “big data”.

For the experts: I’ve attempted to describe/motivate principal component analysis for laypeople. The graphics above were made using matplotlib.
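Also for the experts, here is a rough sketch of how principal component analysis picks a “good angle” automatically instead of by trial and error. It is not the code behind the graphics above; the room size, the number of balls, and the helper name `pca` are assumptions made for illustration. The sketch uses NumPy’s singular value decomposition, which is one common way to compute principal components. Note that PCA looks for the directions along which the data varies the most. In this toy example that happens to be the floor-to-ceiling direction, but in general the direction of greatest spread need not be the one that separates clusters.

```python
import numpy as np


def pca(points, n_components=2):
    """Return the top principal directions and the data projected onto them."""
    centered = points - points.mean(axis=0)
    # Rows of vt are unit-length directions, ordered from most to least variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return components, centered @ components.T


rng = np.random.default_rng(0)
n = 500

# Assumed room: 10 x 10 x 10, balls near the ground or near the ceiling.
balls = np.column_stack([
    rng.uniform(0, 10, n),                                   # x
    rng.uniform(0, 10, n),                                   # y
    np.where(rng.random(n) < 0.5,
             rng.uniform(0, 1, n), rng.uniform(9, 10, n)),   # z
])

components, photo = pca(balls)

# The first direction should point almost straight up or down (the sign is arbitrary).
print("first principal direction:", components[0])
print("shape of the 2-D 'photo':", photo.shape)
```

The same `pca` function works unchanged on a table with 500 columns instead of 3, and the resulting two columns of `photo` can be handed to a scatter plot (for instance matplotlib’s `plt.scatter`) to look at the “photograph”.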