Extracting labels, windowing multivariate series, multiple TFRecord file shards and other useful tips for dealing with sequential data

Image by stux from Pixabay

The tf.data.Dataset API is a very efficient pipeline builder. Time series tasks can be a bit tricky to implement properly. In this article, we are going to dive deep into common tasks:

  • Windowing Labelled Data
  • Windowing Unlabelled Data by Looking Ahead
  • Sharding TFRecord Files: Tips for Efficiency and No Data Loss

Let’s begin!

Windowing Labelled Data

With the Dataset API this is simple to do. Assume the following configuration: the input feature is a and the label is b.

a, b
1, 0
2, 0
3, 1
4, 0
5, 0
6, 1

Each row can be described by a tensor of shape (2,): one feature and one label. …
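As a sketch of what this windowing looks like with tf.data (the window length and shift here are chosen purely for illustration):

```python
import tensorflow as tf

# Toy data matching the table above: feature column a, label column b.
features = tf.constant([1, 2, 3, 4, 5, 6])
labels = tf.constant([0, 0, 1, 0, 0, 1])

WINDOW = 3  # illustrative window length
ds = tf.data.Dataset.from_tensor_slices((features, labels))
# window() yields nested datasets of (feature, label) windows;
# zip + batch flattens each window back into a pair of tensors.
ds = ds.window(WINDOW, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda f, l: tf.data.Dataset.zip((f, l)).batch(WINDOW))

for f, l in ds.take(1):
    print(f.numpy(), l.numpy())  # [1 2 3] [0 0 1]
```

With `shift=1` and `drop_remainder=True`, the six rows above produce four overlapping windows.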

Quit depending on positional indices and input value ordering. Start relying on named inputs and outputs, avoiding data wiring errors.

Image by Daniel Dino-Slofer from Pixabay

Named inputs and outputs are essentially dictionaries with string keys and tensor values.


  1. Defence Against Feature Reordering
  2. Self-Sufficient Model Serving Signatures and Metadata
  3. Renaming and Absent Feature Protection

Most machine learning pipelines read data from a structured source (a database, CSV files, Pandas DataFrames, TFRecords), perform feature selection, cleaning, and (possibly) preprocessing, then pass a raw multidimensional array (tensor) to a model, along with another tensor representing the correct prediction for each input sample.

Reorder or rename input features in production? Useless results, or the client side breaks in production.

Absent features? Missing data? Bad output value interpretation? Mixing up integer indices by mistake? Useless results, or the client side breaks in…
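Concretely, with Keras this might look like the following sketch (the feature and output names are made up for illustration):

```python
import tensorflow as tf

# Hypothetical features; the dict keys, not tensor order, wire data to the model.
inputs = {
    "age": tf.keras.Input(shape=(1,), name="age"),
    "income": tf.keras.Input(shape=(1,), name="income"),
}
x = tf.keras.layers.Concatenate()([inputs["age"], inputs["income"]])
out = tf.keras.layers.Dense(1, activation="sigmoid", name="will_buy")(x)
model = tf.keras.Model(inputs=inputs, outputs={"will_buy": out})

# Key order in the feed dict does not matter; a renamed or absent key fails
# loudly instead of silently feeding the wrong column to the wrong input.
pred = model({"income": tf.constant([[2.0]]), "age": tf.constant([[35.0]])})
```

The output is also a dictionary, so clients read `pred["will_buy"]` by name rather than guessing a positional index.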

A creative PoseNet application that runs on your browser and tries to predict if you’re jumping, crouching, or staying still

Screenshot from chromedino.com

You all know what this game is about. This is the best service-offline-sorry page in the world. People have made everything from simple bots that time the dino’s jump to beat the game, to reinforcement learning agents with CNN state encoders.

It’s a game and we’re supposed to have fun. Today, I’ll walk you through how to write some JavaScript code to play the game by jumping around in your room.

This thing is hard to play.

You can try the game here and view the full source code here.

Overcoming Tech Barriers

Setting up a small webpage with basic JavaScript support to get a webcam feed and a dino game container is trivial for seasoned developers. All you need is the latest Chrome, a video tag, some JavaScript snippets from Stack Overflow to load a webcam feed, and the ripped t-rex game. …

An organised codebase enables you to implement changes faster and make fewer mistakes, ultimately leading to higher code and model quality. Read more to learn how to structure your ML projects with Tensorflow Extended (TFX), the easy and straightforward way.

Image by Francis Ray from Pixabay

Project Structure: Requirements

  • Enable experimentation with pipelines
  • Support both a local execution mode and a cloud execution mode. This ensures the creation of 2 separate running configurations, with the first being used for local development and end-to-end testing and the second used for running in the cloud.
  • Reuse code across pipeline variants if it makes sense to do so
  • Provide an easy-to-use entry point for executing pipelines with different configurations and data

A correct implementation also ensures that tests are easy to incorporate in your workflow.

Project Structure: Design Decisions

  • Use Python.
  • Use Tensorflow Extended (TFX) as the pipeline framework.

In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle. …

A quick API overview and a self-contained example of fluent-tfx

If this production e2e ML pipelines thing seems new to you, please read the TFX guide first.

On the other hand, if you’ve used TFX before or are planning to deploy a machine learning model, you’re in the right place.

Image by Michal Jarmoluk from Pixabay

But Tensorflow Extended is already fully capable of constructing e2e pipelines by itself, so why bother using another API?

  • Verbose and long code definitions. Actual preprocessing and training code can be as lengthy as the pipeline component definitions themselves.
  • Lack of sensible defaults. You have to manually specify inputs and outputs for everything. This allows maximum flexibility on one hand, but on the other, in 99% of cases most of the IO can be wired automatically.

Why it exists and how it’s used in Beam Pipeline Components

Image from https://www.tensorflow.org/tfx/guide/mlmd

ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developer and data scientist workflows.

TensorFlow Extended (TFX) is an end-to-end platform for deploying production ML pipelines

The current version of ML Metadata at the time of writing is v0.22 (tfx is also at v0.22). The API is mature enough to allow for mainstream usage and deployment on the public cloud. Tensorflow Extended uses it extensively for component-to-component communication, lineage tracking, and other tasks.

We are going to run a very simple pipeline that just generates statistics and a schema for a sample CSV of the famous Chicago Taxi Trips dataset. …

A practical and self-contained example using GCP Dataflow

The full end-to-end example that tensorflow extended provides produces 17 files scattered across 5 directories. If you are looking for a smaller, simpler and self-contained example that actually runs on the cloud and not locally, this is what you are looking for. Cloud services setup is also covered here.

Picture from pexels.com

What’s going to be covered

We are going to generate statistics and a schema for the Chicago taxi trips CSV dataset that ships with the TFX examples.

Generated artifacts such as data statistics or the schema are going to be viewed from a Jupyter notebook, by connecting to the ML Metadata store or just by downloading artifacts from simple file/binary storage. …

Motivation, intuition and the process behind this series of articles

Hi there. I’m Theodoros, a Computer Engineering Student here in Greece and I love deep learning.

Welcome to Understanding Machine Learning in Production. In this article we are going to go over the main objective of this series and a rough outline of what is going to be covered.

I’m creating these articles because, although the tensorflow ecosystem, high-level APIs like keras, and all the free (and non-free) tools and services that big companies provide online, like the famous google colab, lower the entry barriers to machine learning, the ecosystem itself has grown so big that it is hard to get a grasp of it.

and Custom TFX Pipeline Components


Apache Beam

Apache Beam was incubated at Google. It’s an evolution of MapReduce, in some sense. You see, MapReduce changed the way big data processing works. There are both open-source engines supporting the model (Hadoop, Spark) and cloud solutions provided as a service (GCP Dataflow).

Technologies have also been built for specific workloads: Apache Flink for strictly stream processing, Spark for batch loads.

Beam is not an execution engine like Spark or Dataflow. It’s an API to define streaming or batch processing workloads that run as a Directed Acyclic Graph, independent of the execution engine and programming language. …

Standardised interfaces, performance and ease of use.

This is a quick review on the various data sources and formats that are commonly used or recommended in the tensorflow ecosystem.


The Apache Beam pipeline runs each component as a different task that receives inputs and produces outputs, as a standalone workload. It can be thought of as a Directed Acyclic Graph.

Key Considerations

  1. It did not take long for tensorflow to evolve from a neural network library into a complete ecosystem. All the different components need standardised data formats and protocols to work in harmony. Ideally, every technology used should be language- and platform-independent.
  2. A step in the pipeline does more than use inputs from the previous step and produce outputs for the next components. Multiple outputs exist that get used by components more than one step ahead in the pipeline. There is also a need to asynchronously view the results of a run while it’s executing, or at an arbitrary time in the future. Hence the need for a persistent artifact store. …


Theodoros Ntakouris
