Tracking Data Provenance in an ML Pipeline with Dotscience

Luke Marsden
Dotscience
Feb 12, 2019

Hi everyone! I’m Luke, the founder here at Dotscience. Someone asked me recently if you can build data pipelines in Dotscience and track the provenance of each of the steps — the answer is yes!

I wanted to share how you can put together a simple ML pipeline, which:

  1. Generates 200MB of random numbers between 0–1, and an “answer” which says whether the number is > 0.5 (the threshold)
  2. Filters the random numbers, stripping out those below 0.1 or at 0.9 and above
  3. Guesses what the threshold is based on the data

Now, this is a deliberately silly example of machine learning (it doesn’t use any libraries like sklearn or TF; it just guesses a number), but the code is very simple, and it shows how, with just a bit of Dotscience annotation, the following Provenance Graph emerges:

Here’s the script I used to drive the pipeline:

#!/bin/bash
export PROJECT="<your dotscience project id>"
export IMAGE="quay.io/dotmesh/dotscience-python3:latest"
for STEP in ingest_data.py transform_data.py train_model.py; do
    echo "================================================="
    echo "Starting $STEP..."
    echo "================================================="
    ds run . $PROJECT $IMAGE python $STEP
done

And the three steps, ingest_data.py:

import dotscience as ds; ds.script()
import random, os

os.system("mkdir -p data")
f = open(ds.output("data/random_numbers.txt"), "w")
for i in range(10000000):
    n = random.random()
    f.write(repr(n) + "," + ("1" if n > 0.5 else "0") + "\n")
f.close()
ds.publish()
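
If you want to sanity-check what that step produced before wiring up the next one, a throwaway script like this will do (my own quick check, not part of the tracked pipeline, so it needs no Dotscience annotation):

# check_ingest.py -- quick sanity check of the generated file; not a pipeline step
count = 0
mislabelled = 0
for line in open("data/random_numbers.txt"):
    num, label = line.strip().split(",")
    # the label should be "1" exactly when the number is above the 0.5 threshold
    if (label == "1") != (float(num) > 0.5):
        mislabelled += 1
    count += 1
print("lines:", count, "mislabelled:", mislabelled)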

Then also transform_data.py:

import dotscience as ds; ds.script()

skipped = 0
kept = 0
f = open(ds.output("data/transformed.txt"), "w")
for line in open(ds.input("data/random_numbers.txt"), "r"):
    if line.startswith("0.0") or line.startswith("0.9"):
        skipped += 1
    else:
        kept += 1
        f.write(line)
f.close()
ds.summary("skipped", skipped)
ds.summary("kept", kept)
ds.publish()
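
A side note on that filter: it works on the string form of each number rather than parsing it, so the "0.0" and "0.9" prefixes stand in for “below 0.1” and “at or above 0.9”. (Values so small that Python writes them in scientific notation, like 1e-05, slip through, which doesn’t matter for this toy example.) A quick standalone demonstration, not part of the pipeline:

# demo of the prefix-based filter; not a pipeline step
for n in (0.05, 0.5, 0.95, 1e-05):
    s = repr(n)
    dropped = s.startswith("0.0") or s.startswith("0.9")
    print(s, "dropped" if dropped else "kept")
# prints: 0.05 dropped, 0.5 kept, 0.95 dropped, 1e-05 kept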

And finally, train_model.py:

import dotscience as ds; ds.script()

# try to estimate a lower bound for val == 1
# super stupid machine learning example
lowest_positive = 1
for line in open(ds.input("data/transformed.txt")):
    if "," in line:
        num, val = line.strip().split(",")
        num = float(num)
        val = int(val)
        if val:
            if num < lowest_positive:
                lowest_positive = num
f = open(ds.output("estimator.model"), "w")
f.write(repr(lowest_positive))
f.close()
ds.summary("lowest_positive", lowest_positive)
ds.publish()
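
The “model” here is just the lowest number that was labelled 1, so consuming estimator.model is a one-liner. Something like this (a hypothetical script of mine, not part of the pipeline) could use it for prediction:

# predict.py -- hypothetical consumer of estimator.model; not a pipeline step
threshold = float(open("estimator.model").read())

def predict(n):
    # classify a number the same way the training data was labelled
    return 1 if n >= threshold else 0

print(threshold)       # should come out close to 0.5
print(predict(0.7))    # 1
print(predict(0.3))    # 0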

Put them all together, and you get a nice simple data pipeline, with full data provenance tracked!

Note that:

  • the pipeline.sh script wraps each step in a ds run command, which executes it in a fully tracked & versioned environment on a Runner
  • each Python script declares its input files with ds.input and its output files with ds.output; that is all Dotscience needs to track provenance, and it also automatically tracks every version of the data & code
  • Python scripts which generate summary statistics (aka metrics) also declare those with ds.summary (see the minimal skeleton just after this list)
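
To summarise the annotation pattern, here is a minimal skeleton for a new pipeline step; the file names and the metric are placeholders of mine, and everything else about the step is ordinary Python:

import dotscience as ds; ds.script()

# read inputs via ds.input() so Dotscience records which versions were used
data = open(ds.input("data/some_input.txt")).read()

# ... the actual work of the step goes here ...

# write outputs via ds.output() so they are captured and versioned
open(ds.output("data/some_output.txt"), "w").write(data)

# optionally record summary statistics (metrics) for the run
ds.summary("bytes_processed", len(data))

# publish the run so the provenance, data and code versions are recorded
ds.publish()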

With the latter, you also get a nice Dashboard of how the summary statistics change over time:

Interested in trying Dotscience yourself? Get in touch! I’d love to show you around 😄

Want to see a more interesting example that queries & versions data from an ever-changing SQL database, uses real ML libraries with parameters as well as summary stats, and deploys the model into production on Kubernetes? We’re working on this E2E demo, watch this space…
