Data science has no doubt been a hot topic in recent years. As machine learning becomes more and more popular, many companies believe it can turn their data into treasure. Some even think that applying machine learning will let them find things that humans can’t discover. Do you share these thoughts about data science? Are you eager to make machine learning part of your business? Is data scientist one of your dream jobs?
Last year, I had the chance to join a data science project. It was my first time dealing with real-world big data. My responsibility was to work with billions of traffic and car exam records and apply machine learning to help our client, a government department, improve its performance. Before joining the team, I was confident that machine learning could tell us everything valuable. After that project, however, I held very different opinions about putting data science to work in business. If you are considering a data science career or planning a data science project, I believe the facts below will help you prepare.
4 facts hidden behind the know-it-all mask of data science
1. Data and models can’t tell you everything.
You can’t expect data and machine learning to tell you everything automatically. To be more specific, they can’t develop a new business opportunity for your company by themselves. Moreover, they can’t tell you anything without clear definitions of problems and variables.
Take my experience as an example. On the first day of the data science project, my boss asked me to use machine learning to automatically find new, serious social events that our client hadn’t discovered. However, I gradually realized that machine learning models were just lines of code, not salespeople or managers who knew our client’s business. That is to say, they don’t know what drivers care about or what the latest road safety problems are. Eventually, I had to tell my boss, with some embarrassment, that I couldn’t complete the assignment.
In the same project, I also worked to find which factors affect the occurrence of car accidents. At first, I gave the model two features, the number of vehicles and the average speed, without specifying a group of factors or accident types. I was frustrated to learn that the two features were not directly correlated with car accidents. The second time, I redefined the objective: let the model discover which environmental factors affect the occurrence of car accidents caused by specific reasons. With a clear definition and the right variables provided, the model searched the associated features more thoroughly and gave me significant results.
2. Working with real-world big data requires a lot of human effort and knowledge.
Knowing that data science can’t provide you everything, I understood that you must put a lot of effort and knowledge into it to make it succeed. Unlike at school, in the real world we don’t have clean, small datasets, so we can’t simply apply models to them without considering data formatting, missing values, or computer storage. Furthermore, we don’t have the correct answers for building models.
Real-world project data is usually messy and comes in high volume. For instance, my project’s data was real-time and produced by sensors all over Taiwan’s roads. Because of storage limitations, I first had to extract or create the required columns, sample the data, load the data in sequence, and split it into smaller files for processing. Then I needed to deal with the organization issues by imputing missing values and adjusting formats. All this pre-processing usually took about 70% of the total project time.
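To make the pre-processing steps above more concrete, here is a minimal sketch in Python using only the standard library. It is not the code from my project: the column names, the sample rows, and the fill-in values are all made up for illustration, and a real pipeline would choose its imputation values from the data itself (for example, a median computed in a first pass).

```python
import csv
import io
import random
from itertools import islice

# Hypothetical raw sensor feed. Column names and values are invented.
RAW = """sensor_id,timestamp,vehicle_count,avg_speed,note
A1,2020-01-01 08:00,120,45.5,ok
A1,2020-01-01 08:05,,47.0,ok
B2,2020-01-01 08:00,80,,ok
B2,2020-01-01 08:05,95,52.5,ok
C3,2020-01-01 08:00,60,38.0,ok
"""

# Required columns and the type each one should be converted to.
CONVERTERS = {"sensor_id": str, "vehicle_count": int, "avg_speed": float}
# Placeholder values for missing cells (an assumption, not a recommendation).
FILL = {"vehicle_count": 90, "avg_speed": 45.0}


def read_chunks(handle, chunk_size):
    """Read rows in sequence, a few at a time, so the file never sits in memory whole."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def clean_row(row):
    """Keep only the required columns, impute missing cells, and fix the types."""
    cleaned = {}
    for column, convert in CONVERTERS.items():
        raw_value = row[column]
        cleaned[column] = convert(raw_value) if raw_value != "" else FILL.get(column)
    return cleaned


def preprocess(handle, chunk_size=2, sample_rate=1.0, seed=0):
    """Sample each chunk, then clean the sampled rows."""
    rng = random.Random(seed)
    for chunk in read_chunks(handle, chunk_size):
        sampled = [row for row in chunk if rng.random() < sample_rate]
        yield [clean_row(row) for row in sampled]


chunks = list(preprocess(io.StringIO(RAW)))
for number, chunk in enumerate(chunks):
    print(number, chunk)
```

Each cleaned chunk could then be written to its own smaller file. Lowering sample_rate below 1.0 keeps only a random share of the rows, which is one simple way to fit the data into limited storage.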
As for selecting features and modeling, I had to try different statistical learning and machine learning models and compare the results. This process required carefully checking each model’s fit and examining different variables. I also needed to consider the models’ interpretability and ease of use, as the models would be used by our client, who had little related professional knowledge. This work was usually done in the remaining 30% of the project time, by trial and error.
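The trial-and-error comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the numbers are invented, the two candidate models (a constant baseline and a straight-line fit) are deliberately simple stand-ins, and mean absolute error on held-out data is just one of several reasonable ways to compare fit.

```python
from statistics import mean

# Invented data: x = vehicles per minute, y = accidents per month.
train = [(10, 1.0), (20, 2.1), (30, 2.9), (40, 4.2)]
test = [(15, 1.6), (35, 3.4)]


def fit_mean(points):
    """Baseline model: always predict the average outcome seen in training."""
    average = mean(y for _, y in points)
    return lambda x: average


def fit_line(points):
    """Ordinary least-squares fit of a straight line y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def mean_abs_error(model, points):
    """Average size of the prediction error on the given points."""
    return mean(abs(model(x) - y) for x, y in points)


# Fit every candidate on the training data and score it on held-out data.
candidates = {"mean baseline": fit_mean, "linear fit": fit_line}
scores = {name: mean_abs_error(fit(train), test) for name, fit in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Both candidates here are easy to explain to a client without a statistics background, which matters as much as the score when the model will be used by non-specialists.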
3. Issues in many important fields are usually neglected.
The previous section stressed the importance of technical knowledge, but other matters, such as legal requirements related to privacy and data protection, are also vital for success in the data science business. For example, the European Union enacted the General Data Protection Regulation (GDPR) to limit the use of personal data and to restrict transfers of such data across borders. In other words, personal data cannot be used, analyzed, or moved unless it is de-identified or its use follows the law’s requirements. Failure to comply with the regulation can cost a huge amount of money. In 2019, Google was fined 50 million euros by the French data protection authority for violating the GDPR.
Diving deeper into the data science world, I also realized how often data and machine learning models can produce ethical problems. For instance, machine learning models are trained on past data to make future predictions. If a company adopts AI recruiting with training data that contains gender inequality or racial discrimination, the models will very likely lead the company to recruit more men or reject minorities unintentionally. From this situation, you can see that these ethical issues are easily overlooked, because data and models cannot recognize them by themselves.
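One simple way to spot this kind of imbalance before training is to compare how often each group was selected in the historical data. The sketch below assumes a made-up list of hiring records and hypothetical group labels. It is a rough first check, not a complete fairness audit.

```python
from collections import defaultdict

# Invented historical hiring records: (group, was_hired). Not real data.
history = [
    ("men", True), ("men", True), ("men", True), ("men", False),
    ("women", True), ("women", False), ("women", False), ("women", False),
]


def selection_rates(records):
    """Return the share of candidates hired within each group."""
    totals = defaultdict(int)
    hired = defaultdict(int)
    for group, was_hired in records:
        totals[group] += 1
        hired[group] += int(was_hired)
    return {group: hired[group] / totals[group] for group in totals}


rates = selection_rates(history)
# Ratio of the lowest selection rate to the highest. A value far below 1
# means the training data already favors one group over another.
ratio = min(rates.values()) / max(rates.values())
print(rates, round(ratio, 2))
```

A model trained on records like these would likely learn the same preference, so the check is worth running before any modeling starts.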
Furthermore, information bubbles created by recommendation systems are causing social problems. For example, social media recommends posts or news to users based on their preferences, social background, personality, connections, etc. When a social media platform recommends a political candidate and the user likes the post, it will show the user more posts viewed by other people who like that candidate. If the user keeps clicking the like button, more and more positive posts about that candidate will appear, meaning that the user will receive less negative news about that candidate. In the end, this may foster extremists who live in their own information bubbles and refuse to accept others’ opinions.
4. The most important skill is interpreting data meaningfully.
Finally, even with all these fields of knowledge considered, you should know that models’ final results still contain biases because of statistical principles, engineers’ subjective judgments, and the accuracy of the training data. That is to say, when you interpret the results produced by machine learning, you should try to understand the model’s assumptions, criteria, and variables, as well as what the original data looks like. Moreover, check the logic of every argument made by the engineers. In this way, you can interpret data, models, and final results more meaningfully.
A helpful method for interpreting data is to visualize your final results. Most of the time, your machine learning models will give you numerical outputs. Those results are very hard to communicate to people without a related background. To avoid confusing others and to convey your work, present your data with scatter plots, line charts, histograms, or pie charts so that others can understand it at a glance. Visualization tools such as Tableau or Python’s Matplotlib are useful for this interpretation process.
Last but not least, once you understand your data and analysis process, you must know where your models can properly be applied. To be more specific, machine learning models are trained on past data, and that data usually has specific characteristics, so the models cannot be applied to groups that lack them. For example, if your model predicts drivers’ car accident rates and was trained on a sample of drivers under age 45, using a 50-year-old man as input to predict his accident rate would be meaningless. Thus, apply models carefully to appropriate data to make the predictions meaningful.
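The age example above can be turned into a small guard that refuses to predict outside the range the model was trained on. This is a minimal sketch: the lower bound of 18, the names TrainingRange and predict_accident_rate, and the stand-in model that returns a fixed number are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRange:
    """Ages covered by the training sample (bounds are assumptions)."""
    min_age: int
    max_age: int

    def covers(self, age):
        return self.min_age <= age <= self.max_age


def predict_accident_rate(age, training_range, model):
    """Call the model only for ages the training data actually covered."""
    if not training_range.covers(age):
        raise ValueError(
            f"age {age} is outside the training range "
            f"{training_range.min_age}-{training_range.max_age}"
        )
    return model(age)


def toy_model(age):
    """Stand-in for a trained model; it returns a fixed, made-up rate."""
    return 0.05


# The sample contained drivers under 45, so 44 is the upper bound.
TRAINED_ON = TrainingRange(min_age=18, max_age=44)

print(predict_accident_rate(30, TRAINED_ON, toy_model))
try:
    predict_accident_rate(50, TRAINED_ON, toy_model)
except ValueError as error:
    print("refused:", error)
```

Raising an error is one design choice. Returning the prediction together with a warning would also work, as long as the person reading the result knows the input was outside the training data.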
Data science is definitely a field that can generate great value and improve our lives. However, it cannot know everything in this world without a great deal of human effort and knowledge put into it. With the above facts from behind the know-it-all mask of data science in mind, I believe we data enthusiasts can bring better quality and outcomes to society through thorough consideration and interpretation.