Transition into data science: bagging some wins

Raúl Vallejo
All The Data We Cannot See
2 min read · Jan 26, 2019

Companies with a highly trained quant department will likely look into transitioning their existing team into a formal data science team. Among other reasons, medium-sized companies usually can’t afford to hire an entire set of data scientists and analysts outright.

The paddles are a combination of Python, R, self-motivation, patience and big monitors

Top priority: showing some wins along the way (to keep the data science initiative afloat)

In reality, the transformation into a data team doesn’t replace all previous responsibilities immediately. When starting to build data infrastructure, it makes sense to begin with something the team is already familiar with.

Problem #1: Juggling old tasks and new responsibilities

Solution #1: Find the main data source the team has worked with and start to automate the data ETL process
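As a rough illustration of what automating that first ETL step might look like, here is a minimal sketch using only the Python standard library. Everything specific in it is an assumption made for the example: the CSV columns (`date`, `region`, `amount`), the `sales` table, and the function names are hypothetical, not taken from any real system.

```python
import csv
import io
import sqlite3


def extract(csv_file):
    """Read raw rows from an open CSV file into a list of dicts."""
    return list(csv.DictReader(csv_file))


def transform(rows):
    """Trim text fields, cast amounts to float, and skip unparseable rows."""
    clean = []
    for row in rows:
        try:
            record = (
                row["date"].strip(),
                row["region"].strip().lower(),
                float(row["amount"]),
            )
        except (KeyError, AttributeError, TypeError, ValueError):
            continue  # a real pipeline would log rejected rows instead
        clean.append(record)
    return clean


def load(conn, records):
    """Append the cleaned records to a (hypothetical) sales table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (date TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


def run_etl(csv_file, conn):
    """Run the three steps in order and return the number of rows loaded."""
    return load(conn, transform(extract(csv_file)))
```

Once a script like this exists, the manual "download, clean in a spreadsheet, paste into a report" routine can be put on a schedule, which frees time for the new responsibilities.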

The principle behind building machine learning MVPs also applies when constructing data infrastructure. It is not reasonable to say “we will begin to work on machine learning solutions once the data infrastructure is done”.

Data science systems must be reproducible end to end, and this has to be achieved iteratively. Otherwise, the team can spend months working (very hard) without anything to show for it. Something of value has to be shown to both IT teams and business teams.

Problem #2: The data infrastructure will never be fully done

Solution #2: When writing data processing code, aim to also deploy an initial data product so that the pipeline is complete end to end, then scale up iteratively

The idea of quickly going through one iteration of the MVP process is to start replacing previous analyses/reports with automated outputs (like dashboards). Each sprint of the data infrastructure development process will knock out a different analytical task the team used to do manually.
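To make the idea of a first automated output concrete, here is a small sketch of a report step that could sit at the end of a pipeline. It assumes processed records arrive as `(date, region, amount)` tuples; that schema and the function names are hypothetical, and a plain-text table stands in for a real dashboard.

```python
from collections import defaultdict


def summarize(records):
    """Total the amount per region from (date, region, amount) tuples."""
    totals = defaultdict(float)
    for _date, region, amount in records:
        totals[region] += amount
    return dict(sorted(totals.items()))


def render_report(totals):
    """Render the totals as a plain-text table that could be emailed or published."""
    lines = ["region      total", "-----------------"]
    for region, total in totals.items():
        lines.append(f"{region:<10}{total:>7.2f}")
    return "\n".join(lines)
```

Even an output this simple completes one pass through the pipeline: raw data goes in and something a business team can read comes out. Later iterations can swap the text table for a proper dashboard without touching the processing code.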

Now we have parallelized the workload. While one person is coding data processing scripts, another will be coding front-end data visualizations.

Now we can sprint faster…

Raúl Vallejo

Actuary, statistician and certified Data Scientist. Music & concert junkie.