3 Hats of a Data Scientist

Adventures in swapping metaphorical headgear

Sharon H. Chou
Soluto Nashville
3 min read · Jan 29, 2018


Data Science is a strange creature, the chimeric offspring of machine learning and statistics. Its sightings are numerous yet confounding, because it’s often not well-defined and takes different forms depending on the data and people involved.

In the context of business problems, the “science” in data science really boils down to just “finding out more,” in the spirit of the Latin scientia, from scire, “to know.” It aligns well with my personal driving principle of making discoveries. I want to uncover previously unknown patterns in how processes work, whether physical, biological, or sociological.

During my time working as a data scientist, I have rotated among three different roles at various stages of data-centric projects, characterized as follows:

The Forager / Hunter

Every data science problem starts with finding the necessary data. In an ideal world, it lives in data warehouses or databases like neatly labeled grocery store aisles and shelves. (Thank goodness for data architects and engineers!) In the non-ideal case, it grows on hard-to-reach mountain peaks or hangs out at the bottom of a murky data lake. Most of the time it’s somewhere in between.

On good days, you can write brief SQL SELECT queries to fetch the data from a database. On other days, you might need to summon APIs and contend with access tokens and certificates. Sometimes just exposing the data can help tremendously — if no one has had access to it before. Different teams can have different uses for the same dataset, and simply extracting it to a central location is a great enabler, be it an AWS database or Google Sheet.
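To make that concrete, here is a minimal Python sketch of both paths, using pandas and requests. The database file, table name, endpoint, and token are all hypothetical stand-ins, not a real service:

```python
import sqlite3

import pandas as pd
import requests

# On a good day: a brief SELECT against a database
# (SQLite here for portability; "warehouse.db" and the
# users table are hypothetical).
conn = sqlite3.connect("warehouse.db")
users = pd.read_sql("SELECT id, email, signup_date FROM users", conn)

# On other days: summon an API and contend with an access token.
# The endpoint and token below are placeholders.
resp = requests.get(
    "https://api.example.com/v1/events",
    headers={"Authorization": "Bearer YOUR_ACCESS_TOKEN"},
    timeout=30,
)
resp.raise_for_status()
events = pd.DataFrame(resp.json())
```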

The Chef

Like food preparation, there’s often a lot of data prep involved before applying any statistical or machine learning algorithm. To ensure the best result possible, the raw data needs to be cleaned and trimmed to its most relevant essence.

In any given use case, there are far more ways for the data to be formatted wrong than right. Good datasets are mostly alike; bad datasets are each problematic in their own way: misspelled or missing names and emails, stray commas that break CSV files, timestamps made inconsistent by time zones or Daylight Saving Time.
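Here is a small pandas sketch of that kind of trimming, on a toy table with illustrative columns:

```python
import pandas as pd

# A toy version of a messy raw table (columns are illustrative).
raw = pd.DataFrame({
    "name": ["Ada", None, "Grace "],
    "email": ["ADA@example.com", "no-name@example.com", None],
    "created_at": ["2018-01-29 09:00:00-06:00",
                   "2018-01-29 15:00:00+00:00",
                   "not a date"],
})

clean = raw.dropna(subset=["name", "email"]).copy()  # drop rows missing key fields
clean["name"] = clean["name"].str.strip()            # trim stray whitespace
clean["email"] = clean["email"].str.lower()          # normalize case for matching

# Parse timestamps onto a single UTC clock, which sidesteps time-zone
# and Daylight Saving Time inconsistencies; unparseable values become NaT.
clean["created_at"] = pd.to_datetime(clean["created_at"], utc=True,
                                     errors="coerce")
clean = clean.dropna(subset=["created_at"])
```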

For data tables, the next step is to slice and dice, or merge and combine, using SQL, Python, or R. More intricate data structures, like nested JSON files, need to be peeled like onions until they burn the eyes.
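For instance, a toy merge in pandas, and a pass of json_normalize to peel a made-up nested payload one layer per key:

```python
import pandas as pd

# Slice and dice: merge two toy tables on a shared key.
orders = pd.DataFrame({"order_id": [1, 2], "product_id": [10, 20]})
products = pd.DataFrame({"product_id": [10, 20], "product": ["tea", "rice"]})
merged = orders.merge(products, on="product_id", how="left")

# Peel a nested JSON payload: each key in record_path is one onion layer.
payload = {
    "user_id": "u1",
    "sessions": [
        {"events": [{"type": "click", "ts": "2018-01-29T09:00:00Z"},
                    {"type": "scroll", "ts": "2018-01-29T09:01:00Z"}]},
    ],
}
events = pd.json_normalize(payload,
                           record_path=["sessions", "events"],
                           meta=["user_id"])  # carry the outer key along
```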

The most commonly used machine learning and statistical algorithms are well developed by now. They range from the simple-to-use-and-explain, like regressions (think pots and pans), to the simple-to-use-but-difficult-to-explain, like neural nets and random forests (think microwaves).

While these algorithms are often implemented in packaged libraries, the default settings don’t always work well. You cannot simply plug and play, because the algorithms have many input parameters that need to be adjusted for different problems. Frying a fish requires different seasonings than sautéing Brussels sprouts, in addition to different cooking times and heat settings.
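Here is a sketch of that knob-turning with scikit-learn’s random forest on a built-in toy dataset; the parameter grid is illustrative, not a recipe:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# The defaults are one heat setting; try a few others and
# let cross-validation pick the best combination.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={
        "n_estimators": [100, 300],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 5],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```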

Once you’re happy with the analysis results (plated dish), the next step is presentation.

The Waitstaff

Neighborhood diner fare, like sandwiches or meatloaf, can be brought out as is, because most people know what they are. The waiter does not (usually) need to explain where the ingredients came from. These dishes are the equivalent of an Excel table with a title and headers, or a bar chart with axis labels. The information should be easy to digest, the trends obvious, and the actionable items readily apparent.

More complex visualizations, such as cluster maps or multiple histograms on logarithmic scales, need more finesse to make them understandable, like when a maître d’ describes a dish as the perfect pairing of alpine moss foam and deep-sea bass. Complex plots can carry more nuanced information, but the key points still need to be highlighted so the audience is not overwhelmed.
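In that spirit, here is a small matplotlib sketch on synthetic data: a histogram on a logarithmic scale with the key point called out directly, so the audience doesn’t have to hunt for it:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, skewed data standing in for something like response times.
rng = np.random.default_rng(0)
latencies = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

fig, ax = plt.subplots()
ax.hist(latencies, bins=np.logspace(0, 3, 40))
ax.set_xscale("log")
ax.set_xlabel("Response time (ms)")
ax.set_ylabel("Requests")
ax.set_title("Most requests are fast; the long tail is not")

# Highlight the takeaway instead of making the audience infer it.
p95 = np.percentile(latencies, 95)
ax.axvline(p95, linestyle="--")
ax.annotate(f"95th percentile = {p95:.0f} ms",
            xy=(p95, ax.get_ylim()[1] * 0.8))
plt.show()
```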

The goal of data presentation may be to reveal insightful discoveries, to prompt action, or both. Stakeholders should see accurate representations of the data that they can clearly interpret and learn from, in order to make the best decisions going forward.

If you’re interested in joining our team, feel free to check out the job openings at Soluto Nashville and send me a note!
