TensorFlow Data for Dummies

Demystifying the TensorFlow Documentation for Tf.Data

Revannth V
May 28, 2020 · 6 min read

Disclaimer: Before you read on, this isn't exactly a ‘dummies’ course. You will need a basic understanding of Machine Learning (which you can pick up here) and the realization that the TensorFlow Documentation is unnecessarily complex (which you can realize here).

This is a tutorial for tf.data in TensorFlow 2.0

Don't get me wrong! TensorFlow is a beautiful library. More often than not, Data Scientists turn towards it for most of their modeling activities. The USP (Unique Selling Point) of this library is how readily it integrates with any Machine Learning framework, irrespective of the infrastructure. Now, I could try and make you fall in love with it, but that's not why you’re here!


TensorFlow is vast and can be overwhelming to dive into. This article is my attempt to help you understand this library, specifically the TensorFlow Data API. So let’s jump right in!

A Complex TF WorkFlow

The flow above is representative of how TensorFlow Data is used, in one way or another, throughout a complex ML workflow. It shows how each component of the workflow interacts with the next, in a uni-directional flow. The data is generally present in external locations and is brought into memory in small chunks. These chunks are stored as Datasets, which are passed from one step to the next.

The TensorFlow Data API takes a chunk of data, processes it, and passes it on to the next step of the workflow. This processing is generally called transformation.

All in all, the TensorFlow Data API can be abstracted into three primary components :

  1. Data Extraction: The process of bringing data from an external location/previous step of workflow into the memory.
  2. Data Transformation: This is the business logic, if you will, wherein the data brought in is processed.
  3. Data Load: This step ensures that the transformed data is sent to the next step of the workflow/output location of your transformed data.

Disclaimer: You can skip the next paragraph and the diagram attached to it for a simplified understanding of the API.

Before we jump into these components, let’s quickly look at the structure of a Dataset. The Dataset is the backbone of the entire tf.data API. It takes the data from external locations and creates graph-like references, which are parsed using abstracted TensorFlow subclasses. Each Dataset, when iterated upon, calls the parsers and brings the data into memory.

Dataset Structure
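To make this concrete, here is a minimal sketch (with made-up values) showing that a Dataset is simply an iterable of structured elements whose layout is exposed through element_spec:

```python
import tensorflow as tf

# Hypothetical in-memory data wrapped into a Dataset.
dataset = tf.data.Dataset.from_tensor_slices(
    {"feature": [[1.0, 2.0], [3.0, 4.0]], "label": [0, 1]}
)

# The structure of each element is known without loading the data.
print(dataset.element_spec)

# Iterating over the Dataset pulls one element at a time into memory.
for element in dataset:
    print(element["feature"].numpy(), element["label"].numpy())
```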

Data Extraction

The data has to be moved in chunks from the source into memory before it can be processed. While we do that, we also have to be cognizant of the fact that the processing must be consistent for every chunk of data. To ensure that this happens, an input function is created which processes each chunk with the same set of operations.

Input Function structure
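A minimal sketch of such an input function, assuming in-memory arrays features and labels (hypothetical names) and arbitrary buffer/batch sizes:

```python
import tensorflow as tf

# A minimal sketch of an input function; every chunk of data flows
# through the same set of operations.
def input_fn(features, labels, batch_size=32, shuffle_buffer=1000):
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    dataset = dataset.shuffle(shuffle_buffer)
    dataset = dataset.batch(batch_size)
    return dataset
```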

Data Extraction deals with reading data and creating the Dataset object. This can be done in six ways:

  1. Consuming NumPy arrays

If all of your input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().
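A minimal sketch with made-up NumPy data:

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory arrays standing in for real training data.
images = np.random.rand(100, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=(100,))

# Each element of the Dataset is one (image, label) pair.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))
```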

2. Consuming CSV files

The CSV file format is a popular format for storing tabular data in plain text. Hence, the tf.data API has built-in support for reading it.
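A minimal sketch, assuming a local file named train.csv with a label column named survived (both hypothetical):

```python
import tensorflow as tf

# Reads the CSV lazily and yields batches of (features, label).
dataset = tf.data.experimental.make_csv_dataset(
    "train.csv",          # hypothetical file
    batch_size=32,
    label_name="survived",  # hypothetical label column
    num_epochs=1,
)

for features, label in dataset.take(1):
    print(list(features.keys()), label.shape)
```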

3. Consuming Python Generators

If your data is being served by a Python function (let's say a Python scraper), the tf.data API can easily pick this up and parse it.
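A minimal sketch wrapping a plain Python generator (a hypothetical counter) with Dataset.from_generator(), which needs the element types and shapes up front:

```python
import tensorflow as tf

# A hypothetical generator standing in for a scraper or any Python source.
def count_up_to(stop):
    for i in range(stop):
        yield i

# output_signature is available in recent TF 2.x versions.
dataset = tf.data.Dataset.from_generator(
    count_up_to,
    args=[10],
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)
```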

4. From TFRecord Data

This is a rarer and more complex scenario, but some workflows use the TFRecord format and it is good to know about it.

The API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.
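A minimal sketch, assuming a couple of local TFRecord files (hypothetical filenames):

```python
import tensorflow as tf

# Hypothetical shard files; each element is one raw serialized record.
filenames = ["data_shard_0.tfrecord", "data_shard_1.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)

# Records are typically serialized tf.train.Example protos and are
# usually parsed with a map() step further down the pipeline.
```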

5. Consuming text data

Many datasets are distributed as one or more text files (most Natural Language Processing use cases). tf.data.TextLineDataset provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files.
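A minimal sketch, assuming two local text files (hypothetical filenames); each element is one line of text:

```python
import tensorflow as tf

# Hypothetical text files; each element of the Dataset is one line.
files = ["reviews_part1.txt", "reviews_part2.txt"]
dataset = tf.data.TextLineDataset(files)

for line in dataset.take(3):
    print(line.numpy())
```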

6. Consuming sets of files deployed either on a GCS bucket or an S3 bucket

There are many datasets distributed as a set of files, where each file is an example.
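A minimal sketch, assuming a GCS bucket path (hypothetical); an S3 path such as s3://my-bucket/images/*.jpg works the same way when S3 filesystem support is available:

```python
import tensorflow as tf

# Build a Dataset of matching filenames from a (hypothetical) GCS bucket.
file_dataset = tf.data.Dataset.list_files("gs://my-bucket/images/*.jpg")

# Each element is a filename; file contents are read lazily in a later
# step, here with tf.io.read_file inside a map() transformation.
dataset = file_dataset.map(tf.io.read_file)
```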

Data Transformation

Any kind of operation done on the Dataset to change its natural structure can be called a transformation. There are four transformations that are generally applied to a dataset.

1. Shuffling

Shuffling is simply the process of randomizing the order in which the training data is selected from the external location. Once the data is extracted, it is jumbled randomly (not preserving its original order). This ensures that the model doesn't learn anything from the order in which the data happens to be stored.

With Shuffle
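A minimal sketch of shuffling; the buffer size of 1,000 is an arbitrary choice:

```python
import tensorflow as tf

# shuffle() keeps a buffer of 1,000 elements and draws each element
# at random from that buffer as the pipeline is consumed.
dataset = tf.data.Dataset.range(10_000)
dataset = dataset.shuffle(buffer_size=1000)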

2. Repeat

Most Machine Learning models are trained on several iterations of the training data. These iterations are referred to as epochs. Each epoch generally presents the data in a different order to ensure that the model isn’t picking up on the ordering of the dataset.

Repeat is the process of repeating the entire dataset multiple times.

With Repeat
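A minimal sketch of repeating a dataset for three epochs; shuffling before repeating reshuffles the data for each epoch:

```python
import tensorflow as tf

# repeat(3) streams the dataset three times, i.e. three epochs' worth
# of data; shuffle() before repeat() reshuffles on every pass.
dataset = tf.data.Dataset.range(100)
dataset = dataset.shuffle(buffer_size=100).repeat(3)
```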

3. Batching

Batching is the process of bringing a set of elements into memory together so that the transformation function can be applied to them at once. In the case of image data, this could mean adding some cells for padding. The simplest form of batching stacks consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: i.e. for each component i, all elements must have a tensor of exactly the same shape.

With Batching
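A minimal sketch of batching; batch(4) stacks every four consecutive scalar elements into one element of shape (4,):

```python
import tensorflow as tf

# Twelve scalar elements become three batched elements of shape (4,).
dataset = tf.data.Dataset.range(12).batch(4)

for batch in dataset:
    print(batch.numpy())  # [0 1 2 3], [4 5 6 7], [8 9 10 11]
```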

4. Map

Map is a comparatively less common transformation. It allows you to call a custom function on every element of your dataset. While using it, the developer has to make sure that the changes made are consistent across the records.
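A minimal sketch of map() applying a custom rescaling function (a hypothetical example) to every element:

```python
import tensorflow as tf

# A hypothetical per-element function: rescale pixel values to [0, 1].
def rescale(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

# Made-up placeholder data standing in for real images and labels.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.zeros([8, 28, 28], dtype=tf.uint8), tf.zeros([8], dtype=tf.int32))
)
dataset = dataset.map(rescale)
```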

Data Load

We are almost done here! Data load is a very simple concept. It ensures that the data received after transformation is ready to be loaded into the next step. This is achieved by iterating over the Dataset (for example, with a plain Python for loop or iter()), which generates an iterator that automatically fetches and transfers your data to the next step.
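A minimal sketch: in TensorFlow 2.x a Dataset is itself a Python iterable, and prefetch() lets it prepare the next batch while the current one is being consumed (for example by model.fit or a manual loop):

```python
import tensorflow as tf

# AUTOTUNE lives under tf.data.experimental in older TF 2.x releases.
dataset = tf.data.Dataset.range(100).batch(10).prefetch(tf.data.AUTOTUNE)

for batch in dataset:      # or pass the dataset directly to model.fit()
    pass  # each `batch` is ready to be handed to the next workflow step
```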

TF data API Cheatsheet

Thank you so much for reading!

Do let me know how you liked the article. I plan on demystifying more modules of the TensorFlow Library. Tune in for them!
