Collaboration between data engineers, data analysts and data scientists

How to efficiently release in production?

Effective teamwork (Apollo 13)

Wouldn’t it be great if we lived in a frictionless world where data engineers, data analysts and data scientists shared a perfect common ground for efficient exchanges? Unfortunately, we’re not quite there yet. In this article we look back at Dailymotion’s recent Data journey, focusing on how data engineers work with data scientists and data analysts to improve production releases. The ongoing challenge is to find the right balance between catering to each specific need and being in a generic customer/supplier relationship.


For those unfamiliar with our company, Dailymotion is a leading video platform company where 330 employees share their skills. Three years ago, our Data team was made up of 3 people; today there are 24 of us (7 data engineers, 5 data scientists, 8 data analysts and 4 project/product managers). As we scaled up, we had to learn how to exchange efficiently despite our different roles. This meant using the same vocabulary, going in the same direction and building a common framework where every data function had its place.

Our data products fall mainly into two kinds:

  • Releasing analytics products
  • Industrializing machine learning pipelines

Releasing analytics products

We’ll begin by tackling how to efficiently build data analytics products. This means making aggregated data available through different mediums reflecting different business dimensions, whether for use within the company or externally by partners. This section covers the collaboration between data engineers and data analysts.

The main challenges are:

  • Translating business rules into code without having data engineers on the critical path each time a rule is added or changed, and without having to build a new domain-specific language on top of SQL, exposed in a user-friendly web interface for business users.
  • Scheduling data aggregations/consolidations: the technical pattern of these aggregations/consolidations (as opposed to their business logic) is often the same. As the translation from business need to technical implementation is mainly done by data analysts (whose primary tools are SQL and Tableau, followed by Python), it would not be very efficient to keep data engineers in the loop just to repeat an existing technical workflow.

Our data lake is on BigQuery and is therefore accessible through SQL queries. Our scheduling tool is Apache Airflow, which lets us define our workflows (aka DAGs) in Python. This makes them versionable on GitHub and integrates them into our Dailymotion CI/CD process. Data engineers have delivered some dockerized tooling around Airflow and have also built generic operators to cover common needs, such as “check if this data is available before launching my computation” or “launch my SQL query on BigQuery and export the result to a storage”. This means that our data analysts can start this Docker image with all dependencies already installed and iterate on their workflow.
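To make the idea of a generic operator more concrete, here is a minimal sketch in plain Python, deliberately free of Airflow so that it runs anywhere. The names `run_when_available`, `is_available`, `run_query` and `export`, as well as the in-memory stand-ins, are illustrative assumptions and not the actual operators; in a real DAG each callable would be backed by an Airflow sensor or operator.

```python
from typing import Callable, Iterable, List


def run_when_available(
    is_available: Callable[[str], bool],
    run_query: Callable[[str], Iterable[dict]],
    export: Callable[[List[dict]], None],
    partition: str,
) -> bool:
    """Run the aggregation for one partition only if its input data exists.

    Returns True when the query ran and its result was exported, and
    False when the input was missing (a scheduler would retry later).
    """
    if not is_available(partition):
        return False
    rows = list(run_query(partition))
    export(rows)
    return True


# Hypothetical in-memory stand-ins for the data lake and the export target.
AVAILABLE_PARTITIONS = {"20190101"}
exported: List[dict] = []


def fake_is_available(partition: str) -> bool:
    return partition in AVAILABLE_PARTITIONS


def fake_run_query(partition: str) -> Iterable[dict]:
    # A real implementation would submit the analyst's SQL to BigQuery.
    return [{"day": partition, "views": 42}]


ran_first = run_when_available(
    fake_is_available, fake_run_query, exported.extend, "20190101"
)
ran_second = run_when_available(
    fake_is_available, fake_run_query, exported.extend, "20190102"
)
```

The point of this shape is that the analyst only supplies the query, while the availability check and the export are written once and reused.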

After iterating for a few months, we’ve finally converged on this framework:

Project workflow

The steps for putting a new project into production within this framework are as follows:

1. Data analysts and data engineers discuss the new workflow together: data analysts bring the business needs, and data engineers check that the existing operators can meet those needs. The data engineers are also responsible for the sustainability of the global architecture (no duplicate flows, sufficient optimization, potential points of failure identified…)

2. Data analysts implement the workflow

3. Data analysts can iterate and test it within a local environment which contains all the dependencies

4. When data analysts are ready to test the workflow in stage, data engineers review the code

5. Data analysts can test their code in stage

6. Data engineers and data analysts release in production together

7. Data analysts set up a monitoring dashboard and the alerting tool (in our case: Datadog)

8. Data analysts and data engineers are both owners of the run after the product release

Demystifying Apache Airflow brings autonomy to data analysts and frees up time for data engineers. Within this framework, data analysts can focus on the business/analytical side of a project and do not depend on data engineers each time there is a functional addition/modification on a pipeline. Data engineers can focus on pipeline idempotency, integration of new sources inside the data lake, on data lineage and tooling. The workload related to production release is therefore rebalanced between the two roles.

Industrializing machine learning pipelines

The second part of the journey consists of efficiently releasing new machine learning algorithms into production. This means updating a machine learning model each time new data is available, being able to apply a model in batch or in real time, and finally being able to track whether the model still performs well over time. We can separate these needs into three use cases: training, prediction, and evaluation. This section covers the collaboration between data engineers and data scientists from a technical standpoint to release into production.

Machine learning blueprint

Dialogue of the deaf (translation of a French expression)
Have you ever realized after a long debate that you were all talking about the same thing from the beginning but using different words?

The first step is to define general components that cover all of the data scientists’ implementations, which data engineers can then reuse and industrialize to fit each use case.

A typical machine learning algorithm is divided into five main components:

  • Extract: fetch the data to use from the database
  • Preprocess: build features from the data
  • Train: create a model from the features
  • Predict: apply the model to produce a score
  • Evaluate: track the performance of a model (and compare models) over time

We have decided that each component’s input and output should live in storage (in our case Google Cloud Storage, see below). This approach has several advantages, but the two main ones are that a run can be broken down into steps and that old runs can be investigated.

gs://dailymotion-$project-$env-$algoname/$version/$token/extract
gs://dailymotion-$project-$env-$algoname/$version/$token/preprocess
gs://dailymotion-$project-$env-$algoname/$version/$token/train
gs://dailymotion-$project-$env-$algoname/$version/$token/evaluate
gs://dailymotion-$project-$env-$algoname/$version/$token/predict
  • $project: the project’s name
  • $env: the environment (i.e. dev/stage/prod)
  • $algoname: the name of the machine learning algorithm
  • $version: the version of this algorithm
  • $token: a universally unique identifier that references a run
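The path convention above can be captured in a small helper. This is a minimal sketch, assuming the token is a UUID4 string; the function names `new_run_token` and `component_path` and the example values in the usage note are hypothetical.

```python
import uuid

# Component and environment names taken from the layout described above.
COMPONENTS = ("extract", "preprocess", "train", "evaluate", "predict")
ENVIRONMENTS = ("dev", "stage", "prod")


def new_run_token() -> str:
    """Return a universally unique identifier that references one run."""
    return str(uuid.uuid4())


def component_path(project, env, algoname, version, token, component):
    """Build the storage prefix for one component of one run."""
    if env not in ENVIRONMENTS:
        raise ValueError("unknown environment: {}".format(env))
    if component not in COMPONENTS:
        raise ValueError("unknown component: {}".format(component))
    return "gs://dailymotion-{}-{}-{}/{}/{}/{}".format(
        project, env, algoname, version, token, component
    )
```

For example, `component_path("data", "prod", "topicgeneralizer", "v1", new_run_token(), "train")` yields the prefix under which the train step of that run would read and write. Because every step of a run shares the same token, an old run can be inspected step by step simply by listing its prefix.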

Now that we have different components, we can mix them to fit our use cases:

  • Training: extract, preprocess, train, predict, evaluate
  • Prediction: extract, preprocess, predict
  • Evaluation: evaluate
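The mix of components per use case can be sketched as a simple mapping plus a runner. This is only an illustration under assumptions: `USE_CASES`, `run_use_case` and the recording steps are hypothetical names, and real components would read from and write to storage rather than append to a list.

```python
from typing import Callable, Dict, List

# Ordered components for each use case, as listed above.
USE_CASES: Dict[str, List[str]] = {
    "training": ["extract", "preprocess", "train", "predict", "evaluate"],
    "prediction": ["extract", "preprocess", "predict"],
    "evaluation": ["evaluate"],
}


def run_use_case(
    use_case: str, steps: Dict[str, Callable[[str], None]], token: str
) -> List[str]:
    """Run each component of a use case in order; return the executed names."""
    executed = []
    for name in USE_CASES[use_case]:
        steps[name](token)
        executed.append(name)
    return executed


# Hypothetical components that only record that they were called.
calls: List[str] = []


def make_step(name: str) -> Callable[[str], None]:
    def step(token: str) -> None:
        calls.append("{}:{}".format(name, token))

    return step


steps = {name: make_step(name) for name in USE_CASES["training"]}
executed = run_use_case("prediction", steps, token="run-1")
```

Since the components are shared, adding a new use case only means adding a new ordered list of existing component names.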

Blueprint instantiation

It’s not rocket science

For the extract step, our Data team mainly uses BigQuery: data scientists fetch the dataset via SQL queries.

Secondly, for the preprocess step, data scientists and data engineers work together to make sure the preprocessing computation is scalable. In our case, we use the Apache Beam Python SDK. Data engineers build a common science library that hides the cluster-computing framework behind a friendly scikit-learn-style interface.

Sample of a simple preprocessing step (cluster-computing framework abstracted):

p = (
    DailymotionDataflowSession()
    .set_runner("DataflowRunner")
    .set_machine_type("n1-highmem-8")
    .set_autoscaling_algorithm("THROUGHPUT_BASED")
    .build()
)
pcoll = DailymotionDataflowSession.read(
    pipeline=p,
    path="gs://…/",  # or a local path such as "/Users/local/.."
)
tokenized_pcoll = TokenizerModel.transform(
    pcoll,
    columns=["title", "description"],  # … further columns elided
    text_transformer=clean_and_tokenize,  # customised row transformer
)
bag_of_word = (
    BagOfWord()
    .set_standardize(True)
    .set_min_count(10)
)
bag_of_word_model = bag_of_word.fit(tokenized_pcoll)
transform_train_pcoll = bag_of_word_model.transform(pcoll)

See this article for more details on bag-of-words representation for video channels:

Custom method defined by data scientists:

def clean_and_tokenize(element):
    # Use specific code from beautifulsoup, nltk, polyglot…
    element_transformed = element  # placeholder: the real transformation goes here
    return element_transformed
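To give an idea of what such a row transformer might look like, here is a minimal sketch that relies only on the Python standard library instead of beautifulsoup, nltk or polyglot. It assumes each row is a dict of column name to text, and the `columns` parameter and the `_TextExtractor` helper are hypothetical; the transformers written by our data scientists are more elaborate.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_and_tokenize(element, columns=("title", "description")):
    """Return a copy of the row with each text column replaced by lowercase tokens."""
    element_transformed = dict(element)
    for column in columns:
        extractor = _TextExtractor()
        extractor.feed(element.get(column) or "")
        extractor.close()  # flush any text still buffered by the parser
        text = " ".join(extractor.parts)
        # Split on word characters; punctuation and whitespace are dropped.
        element_transformed[column] = re.findall(r"\w+", text.lower())
    return element_transformed
```

Keeping the transformer a plain function of one row is what lets the library apply it on a cluster or locally without the data scientist changing any code.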

Thirdly, for the training, prediction and evaluation stages, data scientists use TensorFlow or scikit-learn.

For the scheduling part, data engineers provide an Apache Airflow template to run all the steps. Each component is dockerized and triggered by a KubernetesPodOperator, so the DevOps side is abstracted away and data scientists can focus on the science part.

An example of how to trigger a training job via Airflow:

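# Import paths below assume Airflow 1.10 (contrib package layout).
from airflow.contrib.kubernetes.pod import Resources
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

train = KubernetesPodOperator(
    task_id='train',
    name='airflowjob-{}'.format('{{ ds_nodash }}'),
    image='docker-path/data-topicgeneralizer:{ENV}'.format(ENV=ENV),
    cmds=[
        'python',
        '-m',
        'topicgeneralizer.train',
        '--project={}'.format(dst.TOPICS_PROJECT),
        '--token={token}'.format(token=TMP_TOKEN_XCOM),
    ],
    resources=Resources(request_cpu='1000m', request_memory='11Gi'),

)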
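Since every component is launched the same way, the container command can be generated rather than written by hand for each step. The sketch below assumes that the other components expose the same module layout and flags as `topicgeneralizer.train`, which is an assumption on our part; `component_cmds` and `use_case_cmds` are hypothetical helpers, not part of the Airflow template.

```python
from typing import Dict, Iterable, List


def component_cmds(package: str, component: str, project: str, token: str) -> List[str]:
    """Build the container command that runs one component as a Python module."""
    return [
        "python",
        "-m",
        "{}.{}".format(package, component),
        "--project={}".format(project),
        "--token={}".format(token),
    ]


def use_case_cmds(
    package: str, components: Iterable[str], project: str, token: str
) -> Dict[str, List[str]]:
    """Build one command per component, all sharing the same run token."""
    return {c: component_cmds(package, c, project, token) for c in components}
```

Each resulting list could then be passed as the `cmds` argument of one pod operator, so that all steps of a run share the same token and therefore the same storage prefix.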

Furthermore, data engineers build APIs and set up CI/CD on the different repositories. To sum up the collaboration between data engineers and data scientists:

By working closely together, data scientists are able to focus on the scientific part and within this framework, they have the autonomy to build prod-ready pipelines. Data engineers can focus on scalability, reusability and ensure that the pipeline input/output respects the global architecture. The framework also ensures consistency between feature teams working on different machine learning pipelines.

Once the common ground is built, the technologies used by the three functions at Dailymotion can be represented as follows:


It all boils down to this: data engineers, data analysts, and data scientists should work collaboratively to deliver new products efficiently. This is done by striking the right balance between building generic services and implementing each specific need separately. This helps engineers develop analytical and scientific skills to write prod-ready code. Collaboration also improves knowledge sharing around the data lake between the three roles, making data projects more agile and delivering stronger long-term results.