#66DaysOfData — Days 6 & 7: Hitting Walls and Hunting for More Data
On our last day of adventures, we hit a bit of a wall. Our data, although nice, clean and filled with great potential insights, had absolutely no information about our goal: attrition prediction (predicting whether a new hire will stay with the company for more than x months).
The first thing I want to reflect on is hitting walls, something that happens much more often than people realize or want to think about. Here is a dose of reality: the success of a data science project seldom lies in your technical skills. You can be the best programmer, the most amazing mathematician, a crack statistician, and even be able to come up with new and incredible neural network designs. Yet if you do not truly comprehend what you are trying to achieve and do not have a deep understanding of the data you are working with, odds are your project will either fail or be useless at the end of the day.
I have been working with entrepreneurs, and have been an entrepreneur myself, for a long time, and this has taught me the importance of understanding both the endgame and the process first. Knowing what the endgame is allows you to make changes quickly, on the fly, so as not to waste precious time building something that has little or no value. As an example, since the data I found has nothing about attrition, I need to make changes. I can either find more data that I can use, which will be the next step, or change what we are trying to accomplish. We could focus on building other types of predictions with personality data and research, and even a theoretical model might be useful. But let us investigate the attrition side first.
On the other side is the process, and here I will stick to the Lean Startup philosophy: “Fail fast, fail cheap.” Here is a hard truth: you will fail many, many times when building something new. But failure is not the end. It is just a step in the right direction, and many failures mean many steps. The important part is that those failures come fast, as soon as possible, and cheap, meaning you waste as few resources (in this case, time) as you can. Therefore, I always prefer to do light exploration first and keep an open mind about the question to solve. If this all sounds familiar, that is because it is: it is called the scientific method. And here is the most important idea about data science: “Data Science is applying the scientific method, through available tools, to data, in order to answer something important.”
In keeping with the experimentation process and our fast and cheap philosophy, the next logical step is to find more data, this time some that is linked to attrition.
Hunting for data can take many forms. Sometimes you get lucky and find the perfect dataset on platforms such as Kaggle or even GitHub. Other times you need to navigate complex research databases. A quick tip: if you know of any researcher or university engaged in research related to what you are trying to do, write to them. More often than not they respond and are very helpful. Just be patient, because this option has a high success rate but takes time. Finally, do remember that there are many industry-specific repositories and websites to look at.
For this case, since this is the #66DaysOfData challenge, I stuck with the quicker search-for-myself approach. Here are some of the results I found:
Although none of these datasets have information on OCEAN personality, they do seem interesting to explore, if only to understand what the available data is like out in the wild.
And this is the reason I call this data hunting: you need to search, take and investigate. Sometimes it's a hit, other times a miss, but slowly it helps you understand what data is really like, so you can start applying your tools and the scientific process to it.
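To keep the hunt fast and cheap, one option is to screen a candidate dataset's column names before investing any time in it. The snippet below is only a minimal sketch of that idea, not the code used in this series: the keyword lists, the function names and the example header are all assumptions made up for illustration, and a real screen would need keywords tuned to each source.

```python
from typing import Dict, Iterable, List

# Hypothetical keyword lists, chosen for illustration only.
ATTRITION_KEYWORDS = ("attrition", "turnover", "termination", "tenure")
OCEAN_KEYWORDS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def matching_columns(columns: Iterable[str], keywords: Iterable[str]) -> List[str]:
    """Return the column names that contain any keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    return [c for c in columns if any(k in c.lower() for k in lowered)]


def screen_dataset(columns: Iterable[str]) -> Dict[str, List[str]]:
    """Report which columns look related to attrition or OCEAN traits."""
    cols = list(columns)
    return {
        "attrition": matching_columns(cols, ATTRITION_KEYWORDS),
        "ocean": matching_columns(cols, OCEAN_KEYWORDS),
    }


# Made-up header for illustration; not taken from any dataset in this article.
example_columns = ["Age", "Attrition", "Department", "MonthlyIncome", "YearsAtCompany"]
report = screen_dataset(example_columns)
print(report)
```

A dataset that comes back with an empty list under "attrition" is a quick miss, and one with an empty list under "ocean" tells you that personality data will have to come from somewhere else. Name matching is crude, so a hit still deserves a proper look at the actual values.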
Next Time — Cracking open some more data.
Jack Raifer Baruch
About the Road to Data Science — #66DaysOfData Series
The Road to Data Science series began after I experienced the first round of Ken Jee's #66DaysOfData challenge back in 2020. Since we are starting the second round of the challenge, I thought it would be a good idea to add small articles every day where I can comment on my progress.
I will be sharing all the notebooks, articles and data I can on GitHub: https://github.com/jackraifer/66DaysOfData-Road-to-Data-Science
Please do understand that I might have to withhold some information, including code, data, visualizations and/or models, for confidentiality reasons. But I will try to share as much as possible.
Want to follow the #66DaysOfDataChallenge?
Just follow Ken Jee on Twitter @KenJee_DS and join the #66DaysOfData challenge.