Collaboration between data engineers, data analysts and data scientists

How to release to production efficiently?

Germain Tanguy
May 9
Effective teamwork (Apollo 13)

Project workflow

Industrializing the machine learning pipeline

Machine learning blueprint

A dialogue of the deaf (from the French expression "dialogue de sourds")
gs://dailymotion-$project-$env-$algoname/$version/$token/extract
gs://dailymotion-$project-$env-$algoname/$version/$token/preprocess
gs://dailymotion-$project-$env-$algoname/$version/$token/train
gs://dailymotion-$project-$env-$algoname/$version/$token/evaluate
gs://dailymotion-$project-$env-$algoname/$version/$token/predict
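
Every stage of the pipeline writes its artifacts under this same bucket layout, so any team can locate the output of a given run from the project, environment, algorithm name, version and token alone. As a minimal sketch (the helper name and the example values are hypothetical; only the path layout comes from the blueprint above):

# Sketch of a helper that builds the bucket path for one blueprint stage.
# build_stage_path and the example values are illustrative; only the
# project/env/algoname/version/token/stage layout is from the blueprint.
def build_stage_path(project, env, algoname, version, token, stage):
    bucket = "gs://dailymotion-{}-{}-{}".format(project, env, algoname)
    return "{}/{}/{}/{}".format(bucket, version, token, stage)

# e.g. where the "train" stage of a given run stores its artifacts
train_path = build_stage_path("data", "prod", "topicgeneralizer",
                              "v1", "20190509", "train")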

Blueprint instantiation

It’s not rocket science
p = (
  DailymotionDataflowSession()
    .set_runner("DataflowRunner")
    .set_machine_type("n1-highmem-8")
    .set_autoscaling_algorithm("THROUGHPUT_BASED")
    …
    .build()
)

pcoll = DailymotionDataflowSession.read(
  pipeline=p,
  path="gs://…/" or "/Users/local/.."
)

tokenized_pcoll = TokenizerModel.transform(
  pcoll,
  columns=["title", "description", …],
  text_transformer=clean_and_tokenize,  # customised row transformer
)

bag_of_word = (
  BagOfWord()
    .set_standardize(True)
    .set_min_count(10)
)
bag_of_word_model = bag_of_word.fit(tokenized_pcoll)
transform_train_pcoll = bag_of_word_model.transform(pcoll)
…
def clean_and_tokenize(element):
  # Use specific code from beautifulsoup, nltk, polyglot…
  return element_transformed
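
For illustration, a concrete row transformer could look like the sketch below; it assumes BeautifulSoup and NLTK (with the punkt tokenizer downloaded) and is not necessarily the exact code used at Dailymotion:

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

def clean_and_tokenize(element):
  # Strip HTML markup from the raw text field, lowercase it, then tokenize.
  text = BeautifulSoup(element, "html.parser").get_text()
  return word_tokenize(text.lower())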
train = KubernetesPodOperator(
  task_id='train',
  name='airflowjob-{}'.format('{{ ds_nodash }}'),
  image='docker-path/data-topicgeneralizer:{ENV}'.format(ENV=ENV),
  cmds=[
    'python',
    '-m',
    'topicgeneralizer.train',
    '--project={}'.format(dst.TOPICS_PROJECT),
    '--token={token}'.format(token=TMP_TOKEN_XCOM),
  ],
  resources=Resources(request_cpu='1000m', request_memory='11Gi'),
  …
)
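
Each blueprint stage runs as its own KubernetesPodOperator, so the whole pipeline is simply a chain of pods in the Airflow DAG. A sketch of the wiring, assuming extract, preprocess, evaluate and predict operators are defined the same way as train above (the variable names are illustrative):

# Hypothetical DAG wiring: one KubernetesPodOperator per blueprint stage,
# each built like `train` above, chained with Airflow's bitshift operators.
extract >> preprocess >> train >> evaluate >> predict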
