On Data Science
So, you keep hearing terms like big data, data science and data lake thrown around the web lately. Your friends and colleagues who used to whine that their jobs were all about cleaning, querying and moving data from one place to another now proudly call themselves data engineers or even data scientists. Of course, your real “scientist” friends (the physicists, mathematicians, biologists and statisticians) have changed their CVs to say data scientist as well. You wonder what this data scientist thing, which has been called the “sexiest job of the 21st century”, actually is. More to the point, how do you become one?
I started my career almost a decade ago as a fresh graduate who wanted to become a C++ programmer and felt distraught when he was made to work as a data analyst instead. Today I have the data scientist title on my LinkedIn profile and have gone on to start my own company with “data” in the name, so I often get questions from people on this topic. This is my attempt to answer those questions as I’ve interpreted them for my company, and I hope it will help others too.
It may not be obvious to the uninitiated, but “data” has been a part of human civilisation for a long time. As far back as 18,000 BCE, prehistoric people were using marks on stones or bones to keep a tally of their supplies and to know when their food would run out. The word itself is the plural form of the Latin word “datum”, which means “(something) given”. Here I will use it in the modern sense: qualitative or quantitative information that is collected in a digital or digitizable form and can be analysed and reported on to draw knowledge or wisdom that leads to better decisions.
Defined in these terms, it’s easy to see that data is all around us. Everyone from researchers to businesses, and even normal folks like you and me, is producing it. With every phone call you make, every site you visit, every sale a business makes and every statistic the government collects, a phenomenal amount of data is generated every day. In fact, it is estimated that 2.5 quintillion bytes* of data are now produced each day, driven by the proliferation of billions of mobile devices. As Yoda would say: “It surrounds us, and binds us.”
Now that we know there is a lot of data out there, the next obvious step is to use this valuable resource to our advantage. In its original raw form, data is of relatively little value. Until about a decade ago, storing and processing large amounts of data was really expensive, and only the biggest companies and universities had the equipment (supercomputers) capable of crunching data on a large scale. Thanks to Moore’s Law, that has changed dramatically. I remember my first desktop in the early 2000s had 128MB of RAM and a 20GB hard disk. I’m now (in 2017) writing this blog post on a laptop with 8GB of RAM and half a terabyte of hard disk, while also running a large CPU-intensive data processing task on a server with 32 CPU cores and 128GB of RAM. In short, we now have a lot of data stored up and the computing power to process it. This means we need a lot of people who are good at crunching, analysing and presenting it. The data scientist was born.
Unfortunately, this field has been overrun by hype. The tech world has been drowned in terms, from Data Mining to Big Data, Data Science, and now Machine Learning & AI, which makes it hard to separate the signal from the noise. Despite that, it is good for any profession to have a name, and we have to pick one. Personally, “data science” is the one I prefer. The term data scientist must also have been derived from it, which makes it the more sensible term to use. But, as with its modern-day cousin computer science, we need to understand what the “science” in the name stands for. At its most fundamental level, computer science is about abstractions and how to combine simpler concepts to create more complex systems. Similarly, as I’ve understood it, the fundamental task of data science is the preparation of data and the application of mathematical models to that data to get a useful outcome. The final output can either take the visual form of dashboards and charts or serve as input to downstream software layers.
Data Science = prepare(data) => model(prepared_data) => use(model_data) => knowledge
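To make that one-liner concrete, here is a minimal Python sketch of the same three stages. Everything in it is an assumption made for illustration: the toy records, the choice of a straight-line least-squares fit as the “model”, and the function names, which simply mirror the formula. It is not how any particular project does it.

```python
from statistics import mean


def prepare(raw_rows):
    """Drop incomplete rows and convert the remaining values to floats."""
    cleaned = []
    for x, y in raw_rows:
        if x is None or y is None:
            continue  # skip records with a missing field
        cleaned.append((float(x), float(y)))
    return cleaned


def model(prepared):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in prepared]
    ys = [y for _, y in prepared]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in prepared)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def use(fitted, new_x):
    """Turn the fitted model into something actionable: a prediction."""
    slope, intercept = fitted
    return slope * new_x + intercept


# Hypothetical raw records: strings as they might arrive from a CSV, with gaps.
raw = [("1", "2"), ("2", "4"), (None, "5"), ("3", "6"), ("4", None)]

prepared_data = prepare(raw)    # prepare(data)
fitted = model(prepared_data)   # model(prepared_data)
forecast = use(fitted, 5)       # use(model_data) => knowledge
print(f"kept {len(prepared_data)} rows, forecast for x=5: {forecast}")
```

In a real project each stage is far bigger than this, but the shape stays the same: messy input goes in, a model is fitted to the cleaned data, and the result is handed to a person or to another piece of software.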
Just as someone who has only ever put together a WordPress site isn’t quite a computer scientist, someone who has only ever used Excel’s SUM() function or written some SQL queries isn’t really a data scientist. It’s rather a multi-disciplinary field that requires one to have the mathematical chops of a scientist or statistician, the engineering know-how of a seasoned developer, the aesthetic sense of a designer, the communication skills of a leader, and the relevant domain expertise. It’s easy to see that people with all of these skills will be very rare; this is what is called the “perfect data scientist” in the diagram below.
So, although we might have a lot of people with the title data scientist around these days, there are varying definitions of the term. The skill required to configure cloud servers to run a 100TB Spark job, or to visualise a billion data points so that even non-technical people can understand them, is fundamentally different from the skill required to lift the accuracy of a gradient boosting algorithm from 75% to 92% using a weighted ensemble. In fact, I’d say most of the time this is a job for a team of people who combine these different skills.
That was one of the reasons CraftData Labs was born. The aim is to bring together a team of statisticians, scientists, engineers, and designers who really know their craft, to enable businesses, organisations, governments and individuals to meet their objectives. We will not only be focussed on the technical skills of statistics, computer science and mathematics, but also on good design and communication skills. After all, the best models in the world are useless if we can’t make them consumable and actionable for the user.
If I had to define what a data scientist is at CraftData, it would be someone who is adept in statistical calculations and mathematical models, understands the rigorous scientific method, and usually has a research background. People with other titles, such as data engineers, web developers, UI/UX designers, front-end developers, dashboard developers, database architects and data journalists, serve important but different roles. I hope that having clear definitions of everyone’s role allows us to create a sustainable company capable of undertaking any challenging project.
Finally, eating my own dog food here, I’d say that I barely meet this definition of a data scientist myself. I certainly didn’t start out as one. But through the years of working in this discipline, I do hope I’ve earned that title.
Thank you for reading. I hope this has given you a better understanding of the data science phenomenon. I welcome any suggestions and comments here.