So, you’ve been working on machine learning in academia for a while, perhaps predicting early cases of Alzheimer’s from EEG signals for the good of humanity? Then you read an article reporting (or hyping) how machine learning powers a certain e-commerce website’s recommendation engine and brings in millions of dollars in revenue.
Now you decide to venture into industry, leaving your mark and creating billions of dollars’ worth of business value by capitalizing on your knowledge of decision trees and information theory. It shouldn’t be too hard, right? After all, you have random forests in your arsenal.
Well, not so fast.
There are some inherent differences in how machine learning is practiced in industry and in academia. Here, I would like to share some differences I have observed, based on my short time and measly experience working on both sides of the fence. Of course, your mileage may vary, but I hope it can shed some light for aspiring data scientists and machine learning engineers.
- Data Availability
Data, data, data. Machine learning can’t work without data; that much is widely known. And industry should be literally drowning in data, right? Customer tracking data, personal information, and whatnot.
Unfortunately, that isn’t the case.
In academia, you usually work on a standard benchmark dataset. It usually comes as a nicely formatted tabular CSV, ready to be loaded into a Python dataframe and fed to several of your favorite classification algorithms. What you can do with it is limited only by your imagination and your literature review.
There is a conscious effort from the academic community to create this kind of dataset. It makes it easy to compare new methods against the previous state-of-the-art results. It also encourages reproducible research because everyone is working on the same dataset (well, as long as the researchers also open their code).
In industry, it’s usually not that straightforward.
There’s a famous saying: “Data science is 80% cleaning data and 20% complaining about cleaning data.”
Do you need this data? It’s only available as images inside a PDF, ouch.
Do you need that data? Well, too bad, since the data is recorded manually in Excel files in some esoteric format that makes it hard to parse.
Do you need this info? Oops, sorry, it isn’t being tracked yet. Please wait until next quarter, when the tracking will be implemented in production code.
Do you need this information? It turns out you need to join several mega tables containing billions of rows, which can take several days to run. Better get back to browsing Reddit in the meantime.
Sometimes, the data you need is not readily available at all. For example, in supervised learning you need labels, and without them your hands are basically tied. There are, of course, several methods you can use to enrich your dataset with the much-needed labels.
1. Put your own users to work
If you have thousands of innocent, energetic users on your platform, why not take advantage of them? One example is Facebook, which in the past asked its own users to tag uploaded photos with the correct profile for each detected face. Now it can automatically recognize faces in newly uploaded photos based on this past data. (Do you know who else is interested in such data? The NSA.)
Another case is what Google did with the reCAPTCHA project. You get to protect your site from bots, and Google gets precious labels for its data, which can later be used to build the next billion-dollar system, like the Cloud Vision API. Talk about crafty.
2. Hire an AI-trainer
This article by Bloomberg reports on so-called AI services with humans in the loop. Imagine this: people with college degrees, living at the bottom of the food chain, working to correct inappropriate responses from bots in order to fulfill the overblown promises of such services. Even if the company eventually goes under, the data it has collected is very valuable and can be sold to another interested party.
If you can’t persuade your HR department to hire people for a permanent AI-trainer role as a strategic key position, you can consider outsourcing the work. Amazon Mechanical Turk and CrowdFlower are examples of marketplaces where you can pay humans to do tasks that are (for now) too hard for computers but a cinch for humans: labeling images or sentences, answering survey questions, etc. Even though one task usually costs only around one or two cents, it can snowball into a very large sum of money, especially if you have a very large amount of data.
The good thing is that, unlike in academia, where money is a big constraint even for well-funded fields like computer science (second only to medicine in the US!), in industry money is no object as long as the spending is aligned with the company’s business objectives.
Well, as long as you can convince your finance department, that is.
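To see how those one or two cents snowball before you walk into the finance department, a rough estimate helps. The sketch below is only arithmetic under assumptions of my own: the function name `labeling_cost` is made up, the item count is illustrative, and asking for three labels per item (so that you can take a majority vote) is a common practice but not something any particular marketplace requires. Platform fees are ignored.

```python
def labeling_cost(n_items, cents_per_label, labels_per_item=1):
    """Estimated crowdsourcing bill in dollars (platform fees not included)."""
    total_cents = n_items * labels_per_item * cents_per_label
    return total_cents / 100


# Illustrative numbers: one million images, two cents per label,
# and three independent labels per image for a majority vote.
estimate = labeling_cost(1_000_000, 2, labels_per_item=3)
print(f"Estimated labeling cost: ${estimate:,.0f}")
```

With these made-up numbers, two cents per task already turns into a five-figure bill.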
- Model interpretability
Through some unspeakable means, you at last have the data in your hands. Now it’s training time! You head to /r/ml to find the latest published algorithm for your problem and then look for an open-source implementation because you are too busy to implement it from scratch. The results look promising! You then get to present them in front of your stakeholders.
Sadly, when you are asked why the model behaves the way it does, you cannot answer. Not because you lack the knowledge, but because basically nobody on the entire planet knows how or why the algorithm works (*cough* deep learning *cough*). Unfortunately, your current case needs at least some semblance of a human-understandable explanation to give to the user. You can’t tell a person they have cancer based on a machine learning diagnosis without giving them reasons. Or, in a different scenario, reject someone’s transaction due to suspected fraud without a clearly identifiable cause.
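For contrast, here is a minimal sketch of what a "clearly identifiable cause" can look like when the model is simple enough. It is a hand-written logistic model for the fraud scenario, not a real system: the feature names, weights, and bias are all hypothetical values I picked for illustration. Because the score is a weighted sum, each feature's contribution can be read off directly and reported as a reason.

```python
import math

# Hypothetical weights for a toy fraud model (illustrative values only).
WEIGHTS = {
    "amount_vs_usual": 1.8,
    "new_device": 1.2,
    "foreign_ip": 0.9,
    "account_age_years": -0.6,
}
BIAS = -3.0


def score(features):
    """Return the fraud probability and each feature's contribution to the logit."""
    contributions = {name: WEIGHTS[name] * features[name] for name in WEIGHTS}
    logit = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    return probability, contributions


def top_reasons(contributions, k=2):
    """Names of the k features that pushed the score up the most."""
    positive = [(name, c) for name, c in contributions.items() if c > 0]
    positive.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in positive[:k]]


# A made-up transaction: twice the usual amount, new device, foreign IP.
transaction = {
    "amount_vs_usual": 2.0,
    "new_device": 1.0,
    "foreign_ip": 1.0,
    "account_age_years": 0.5,
}
probability, contributions = score(transaction)
reasons = top_reasons(contributions)
print(f"fraud probability: {probability:.2f}")
print("main reasons:", reasons)
```

A deep network may well score better, but it will not hand you that list of reasons for free.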
- Engineering constraint
On a different occasion, you manage to create a new recommendation model with a better RMSE score (it is 0.4% lower!) than the currently deployed model. Sadly, it is rejected again, this time at the hands of the engineering department. Why, you ask? It turns out that, due to the ungodly number of components in your ensemble model, the recommendations cannot be computed in a reasonable amount of time, or if they can, they require computing power that exceeds the available budget.
No matter how good it is, it won’t ever go into production, at least until the advent of some brand-new revolutionary technology that makes it possible. Until then, don’t get your hopes up. That is also the reason why the winning solution for the Netflix challenge never saw the light of day.
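A back-of-the-envelope check can flag this kind of rejection before you ever present the model. The sketch below is a rough estimate under assumptions of my own: every component is assumed to take the same time, components run in batches of `parallelism`, and the numbers (100 components, 5 ms each, a 100 ms budget) are invented. The helper name `fits_latency_budget` is hypothetical too.

```python
def fits_latency_budget(per_model_ms, n_models, budget_ms, parallelism=1):
    """Estimate total ensemble latency and whether it fits the budget.

    Assumes every component takes per_model_ms and that components run
    in batches of `parallelism` models at a time.
    """
    batches = -(-n_models // parallelism)  # ceiling division
    total_ms = batches * per_model_ms
    return total_ms, total_ms <= budget_ms


# Invented numbers: 100 components at 5 ms each against a 100 ms budget.
print(fits_latency_budget(5, 100, 100))                 # run one after another
print(fits_latency_budget(5, 100, 100, parallelism=8))  # eight at a time
```

Running eight components at a time brings the estimate under the budget, but that is exactly the extra computing power someone has to pay for.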
So those are pretty much the main differences between machine learning in industry and in academia that I have observed. If you have encountered a different situation in your workplace, please hit me up in the comment section, as I would love to learn from others’ experiences. Happy learning, machine!