Brewing the “perfect” coffee with AI and big data?! What data science is really about
Nowadays, artificial intelligence (AI), machine learning (ML), data science, big data, etc. have become trendy, “sexy” buzzwords in almost every industry, especially since DeepMind developed AlphaGo, which beat the world’s top professional Go players and is far ahead of any human player in attaining the Kami no Itte (神の一手, “Divine Move”)*.
* Meaning “a godly/perfect move”, a term from the Japanese anime Hikaru no Go (ヒカルの碁)
This somewhat reminds me of the dot-com bubble during my adolescence, or the quantum/nanotechnology boom of recent years, in which lots of people try to use the words “quantum” or “nano” to market their products even though they are mostly irrelevant to quantum or nano physics. Could data science/AI be just another bubble about to burst? And what do data science and big data actually mean?
For real or a scam? Brewing coffee with AI & big data?
Before I explain what data science really is, let me tell you a story:
Last year, a local news story in Hong Kong “advertised”(?) a new coffeehouse created by a coding school [1], which claimed to have developed an AI system that analyses big data, including the weather, the day’s temperature, the time of day and even the “level of affection” towards current news headlines, to brew the most suitable coffee for its customers.
Just a disclaimer: I haven’t visited the coffeehouse, so I cannot comment on whether this is for real or just a scam like the Arist Coffee Brother [2], though to me it sounds more like a gimmick.
If you ask me, the most important factors in brewing a good coffee are always the skill of the barista and the quality of the coffee beans. Other factors like the weather, the day’s temperature and news headlines are mostly irrelevant, or at best secondary with minor impact. What’s more, taste and “level of affection” are subjective. For instance, Amy may like cold weather and bitter coffee, whereas Bob may prefer hot weather and coffee with a sour flavour, not to mention their different reactions to news headlines, especially political news. So do customers need to fill in a personal questionnaire before buying a coffee there? How can one be sure there is a clear, universally accepted “level of affection” towards news headlines, let alone a recipe to brew the most suitable coffee for a given customer?
Instead of investing money and time in developing a complex machine learning model that involves Natural Language Processing (NLP)^ and analyses minor or mostly irrelevant datasets, things like improving the baristas’ skills, sourcing a greater variety of top-quality ingredients, or even just improving the staff’s working environment would definitely be much more cost-effective. Even if you do want to apply AI or big data to brew the best coffee, these tools should be applied to analysing the baristas and the raw ingredients like the coffee beans, rather than those secondary or irrelevant data.
^ In simple terms, NLP can be understood as translation between human and computer language; in particular, how to program computers to turn natural language data into a bunch of ‘0’s and ‘1’s that are meaningful to machines and can be used for further analysis. Sentiment analysis in most machine learning models is done using NLP.
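To make that footnote a little more concrete, here is a minimal sketch of what NLP-based sentiment analysis can look like, assuming scikit-learn is available; the headlines, labels and choice of classifier are all made up purely for illustration:

```python
# Minimal sentiment-analysis sketch: turn headlines into word counts
# (the "bunch of numbers" a computer can work with), then fit a classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy headlines and sentiment labels, made up for illustration
headlines = [
    "Local economy grows faster than expected",
    "Storm causes severe flooding across the city",
    "New park opens to delighted residents",
    "Major factory closure puts thousands out of work",
]
labels = [1, 0, 1, 0]  # 1 = positive sentiment, 0 = negative

# Map each headline to a vector of word counts
vectoriser = CountVectorizer()
X = vectoriser.fit_transform(headlines)

# Fit a simple classifier on the numeric representation
model = LogisticRegression().fit(X, labels)

# Score an unseen headline; words like "economy" and "grows" only appear
# in a positive training example, so this should come out positive (1)
print(model.predict(vectoriser.transform(["The economy grows faster again"])))
```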
Difference between data scientists and data analysts
Back to the topic: so what are data science and big data? To be honest, unlike topics in mathematics, data science and big data are rather “vague” concepts; to some extent, there isn’t a clear, universally accepted definition. However, I will use my experience working as a data scientist to explain how I, and the industry, usually think about what “data science” really is.
First, and most obviously, data science must involve analysing data. Here, data can be anything you can think of: a word you have said, a place or website you have visited. “Everything is data”, especially in the current digital era.
However, beyond the obvious, what matters more is understanding the data, extracting the underlying insights, and applying those insights to build mathematical and/or machine learning models that simulate scenarios and/or make objective predictions. The main goal of a typical data science project in industry is usually to assist humans in some decision-making process, or even to automate it entirely. For example, your browsing data may be used to predict what you would probably like and then advertise related content to you (or even to predict your moral values or political views$, just like Big Brother watching you in one big “powerful” country in the East).
$ There are data ethics issues in doing so. Data ethics is another important and interesting topic in data science, but I will leave that for a future blog post.
Therefore, knowing how to write computer programs, or even full software, is very much a must for data scientists. Apart from that, understanding the whole machine learning pipeline, from data collection at the very beginning to data cleaning/cleansing, data management and storage, is also important; data management in particular if you are working with sensitive data like healthcare records. On the other hand, roles that simply use existing software like Power BI, Tableau or even Excel to perform data analysis tasks would usually be considered data analysts, not data scientists, though broadly speaking, simple curve fitting and regression analysis in Excel are also a type of machine learning.
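As a side note on that last sentence, here is a minimal sketch, assuming numpy, of the same least-squares line fit that an Excel trendline performs; the data points are made up for illustration:

```python
# The least-squares straight-line fit behind Excel's "trendline",
# done with numpy. The data points are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit y = slope * x + intercept by minimising the squared error;
# "learning" these two parameters from data is already simple ML
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {slope:.2f} x + {intercept:.2f}")

# Use the fitted line to predict an unseen point
print("prediction at x = 6:", slope * 6 + intercept)
```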
Data science should be “scientific”
Apart from the above, as a former theoretical physicist, I think that if we are to call data science a “science”, “real” data science projects should also be “scientific”, i.e. uphold the spirit of natural science and be based on the scientific method. In other words, data scientists should understand (at least partially) the underlying logic and reasoning behind different machine learning algorithms, and sometimes even the causal relations between different datasets and the model simulations/predictions they make. Data scientists should not just focus on performing modelling and data analysis tasks, aim only for the best possible accuracy, then blindly trust the results and treat machine learning models like a black box.
As you know, “correlation does not imply causation”. With the gigantic amount of data in the world, it is unsurprising that you can always find spurious correlations between totally unrelated datasets; see [3] for some amusing examples.
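To see how easily such spurious correlations appear, here is a small illustrative sketch, assuming numpy: among enough pairs of completely independent random walks, a strongly “correlated” pair almost always turns up.

```python
# Spurious correlation demo: independent random walks share no causal
# link, yet among many pairs some will look strongly "correlated".
import numpy as np

rng = np.random.default_rng(seed=42)
n_steps = 200

def random_walk():
    # A trending series with no relation to any other walk we generate
    return np.cumsum(rng.normal(size=n_steps))

# Pearson correlation for 100 pairs of unrelated walks
correlations = [
    np.corrcoef(random_walk(), random_walk())[0, 1] for _ in range(100)
]
print(f"strongest |correlation| found: {max(abs(c) for c in correlations):.2f}")
```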
Besides interpreting machine learning model results, understanding how and why the model makes the predictions it does, and calculating and understanding (at least to some extent) feature importance in the models, communication and data visualisation are also important. Data scientists need to translate and explain this knowledge and these results with simple graphs and words that can be easily understood by a non-technical audience, which usually includes stakeholders and regulators.
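As one concrete example, here is a minimal sketch of inspecting feature importance with scikit-learn’s random forest; the coffee-themed data are synthetic, constructed so that only bean quality and barista skill actually drive the outcome:

```python
# Feature-importance sketch on synthetic, coffee-themed data:
# only bean_quality and barista_skill drive the rating; weather is noise.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(seed=0)
n = 500
bean_quality = rng.uniform(0, 1, n)
barista_skill = rng.uniform(0, 1, n)
weather = rng.uniform(0, 1, n)  # irrelevant by construction

# The "true" relation the model should uncover
coffee_rating = 3 * bean_quality + 2 * barista_skill + rng.normal(0, 0.1, n)

X = np.column_stack([bean_quality, barista_skill, weather])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, coffee_rating)

# Weather's importance should come out near zero
for name, importance in zip(["bean_quality", "barista_skill", "weather"],
                            model.feature_importances_):
    print(f"{name}: {importance:.2f}")
```

A model whose importances contradicted common sense (say, weather dominating bean quality) would be a prompt to investigate the data rather than to trust the black box.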
This is the end of my first data science/AI blog post. Hope you enjoyed it :) I also wrote this post in Cantonese/Chinese originally; check it out if you are interested: https://medium.com/@godfrey.leung.cosmo/%E6%A0%B9%E6%93%9Aai-%E5%A4%A7%E6%95%B8%E6%93%9A%E6%B2%96%E5%92%96%E5%95%A1-%E6%B7%BA%E8%AB%87%E7%94%9A%E9%BA%BC%E6%89%8D%E6%98%AF%E7%9C%9F-%E6%95%B8%E6%93%9A%E7%A7%91%E5%AD%B8-ac2123279669
References and further reading:
[1] https://www.marketing-interactive.com/preface-ai-coffee-big-data
[2] https://dandesim.one/2017/04/04/arist-coffee-brother-quits-nbition-scams-continue
[3] Examples of spurious correlations: https://www.tylervigen.com/spurious-correlations