Your Data Science Toolkit: Requirements for Effectively Conducting Projects
Welcome to part two in our Demystifying Data Science (DS) series! Missed Part 1? Catch up here!
While the goal of Part 1 was to provide insight into DS basics, this installment aims to answer the question, “what is needed to make Data Science possible?”
In Part 1 we defined DS. This post identifies the key elements of a DS project, in no particular order of importance. Bear in mind that a DS project is similar to a research project: there is no single perfect approach or combination of elements.
Question or hypothesis
Recall the scientific method (yes, that thing you learned about in high school). Step one is determining your core question and, whenever possible, a hypothesis. This will drive your research forward, but bear in mind that it can be refined or even changed along the way. Changing direction is fine, as long as you have established ahead of time the value this research provides to your organization.
As an example, say you want to undertake a fraud detection DS project. For most financial institutions, fraud is an everyday problem. Fraudulent groups are constantly changing their schemes, while regular clients typically operate within routines that give rise to predictable patterns. It is simpler to model the latter (client behavior) than the former (fraud), so posing the question "what does fraud look like?" is less favorable. The better, more actionable question is "what does typical client behavior look like?"
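To make this concrete, here is a toy sketch of the "model the typical, flag the deviation" idea. The client history, the threshold, and the function name are all made up for illustration; a real system would use far richer features than purchase amounts.

```python
from statistics import mean, stdev

def flag_unusual(amounts, new_amount, threshold=3.0):
    """Model 'typical' client behavior as the mean and spread of past
    purchase amounts, and flag anything too many standard deviations
    away. A toy illustration of modeling the client, not the fraud."""
    mu, sigma = mean(amounts), stdev(amounts)
    return abs(new_amount - mu) > threshold * sigma

history = [42.0, 38.5, 45.0, 40.0, 43.5]  # a client's recent purchases
print(flag_unusual(history, 41.0))   # within the routine: not flagged
print(flag_unusual(history, 900.0))  # far outside the pattern: flagged
```

The point is the framing: you never need an example of fraud to build this; you only need enough examples of normal behavior.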
Data
Equally important to your hypothesis is your data: there can't be a project without it! Assume your data is garbage unless you have worked with it before.
For each new project you must conduct a data audit to assess quality. If your data fails this verification, you need to correct your collection process so that it generates useful data.
Example: say your local airport has wifi hotspots throughout the facility, and you want to use wifi connectivity data to model how people move through the airport. Each connection record shows the device, the hotspot it connected to, the time of connection, and the duration of connectivity. While parsing the data you find some impossibilities, such as a timestamp with a seconds field of 100 (e.g. 2017-01-01 06:20:100). This indicates an obvious mishandling of timestamps, and inconsistencies like these would render any analysis useless.
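A data audit like this can start very simply: try to parse every timestamp and collect the rows that fail. The field names below are hypothetical; a real audit would check many more properties (ranges, duplicates, gaps) than just parseability.

```python
from datetime import datetime

def audit_timestamps(records, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the records whose 'timestamp' field cannot be parsed.

    Any row that fails to parse (e.g. a seconds field of 100)
    is flagged for review."""
    bad = []
    for rec in records:
        try:
            datetime.strptime(rec["timestamp"], fmt)
        except ValueError:
            bad.append(rec)
    return bad

# Example connection records (hypothetical field names):
records = [
    {"device": "a1", "hotspot": "gate-3", "timestamp": "2017-01-01 06:20:10"},
    {"device": "b2", "hotspot": "gate-7", "timestamp": "2017-01-01 06:20:100"},
]
print(audit_timestamps(records))  # flags only the malformed second record
```

If a check like this flags more than a handful of rows, that is a signal to fix the collection pipeline before any modeling begins.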
If you don't have data, then the obvious next step is to gather it.
Example: say you have an online store and want to better understand consumer shopping patterns. First, you must ensure your software is instrumented to record not only the items purchased, but also the pages visited, the searches made (if any), the products viewed (even if not bought), and so on. Now say you have released a new product and want to know how it is being received on social media. You could build a web crawler to automatically pull mentions from social feeds like Facebook and Twitter. Additionally, you may want to run A/B tests with several different marketing campaigns to ensure you have enough data.
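Actually fetching posts is platform-specific (each network has its own API and terms of use), but once you have the raw text, the mention-gathering step can be as simple as a keyword filter. The product name and post texts below are invented for illustration:

```python
def find_mentions(posts, product_name):
    """Filter a list of post texts down to those mentioning the product.

    'posts' is assumed to be text already fetched through a platform's
    official API; the crawling itself is out of scope here."""
    needle = product_name.lower()
    return [p for p in posts if needle in p.lower()]

posts = [
    "Just tried the new AcmeWidget, love it!",
    "Weather is great today.",
    "Is the AcmeWidget worth the price?",
]
print(find_mentions(posts, "AcmeWidget"))  # keeps the two mentions
```

In practice you would follow this with deduplication and perhaps sentiment scoring, but the principle is the same: decide what signal you need, then make sure your pipeline actually captures it.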
Once you have quality data, it must be processed and evaluated to assess whether valuable conclusions can be drawn. You may have usage data (when, who, how long, which areas of the screen the user focused on, etc.), but if you want to truly analyze shopping patterns, that data alone does not guarantee the best insights. Even if you have the shopping data, further analysis should be performed by an interdisciplinary team composed of specialists in data science and the specific domain.
An interdisciplinary team
A Data Scientist is not the only role integral to DS. At a minimum you also require software engineers (or data engineers) and subject-matter experts working together.
The engineers help the data scientists improve the solution architecture to make it scalable and production-ready.
Example: when building a finance app, a data scientist can focus on accurately predicting the stock market, but should leave software scalability to the software engineers, since they are usually better prepared to deal with it.
While incredibly capable, data scientists are not wizards: they can't absorb years of domain knowledge in a couple of weeks of data exploration. This is why subject-matter experts are needed to guide the direction of the analysis and verify whether the DS findings are useful.
Patience
A core component of Data Science is patience! Reaching a conclusion takes several iterations, and there is no way to know a priori whether a dataset contains valuable insights.
Example: say you want to build a customer care solution that helps identify repeat requests or tickets. Because you are confident you are recording data correctly, you expect 95% accuracy in your predictions. Fast forward a month, and the DS team realizes some data is missing. The software engineering team takes another month to adjust the software so it collects the correct data. When all is said and done, the DS team realizes it needs to change focus, because the question and answers are different from when the project began.
Sometimes in DS, there may not be a pattern to unveil in the end. However, this shouldn’t keep your team from making an attempt to uncover one!
Stay tuned for our next post, where we will talk about the difference between Big Data and Data Science!
Want to drive impact through data science? Join the team, see open positions here!
About the authors:
Ana has held Data Science and Engineering roles at Accenture and Tec de Monterrey. She earned an M.Sc. in Big Data Science from Queen Mary University of London, and her main interests are machine learning, network analysis, and user behavior modeling.
Juan spent several years at HP Labs as a Research Assistant. He earned an M.Sc. in Statistics and Operational Research with distinction from the University of Edinburgh. He is currently a lecturer in the Industrial Engineering department of ITESM.