TensorFlow Data for Dummies
Disclaimer: Before you read on, this isn't exactly a 'dummies' course. You will need a basic understanding of Machine Learning (which you can pick up here) and the realization that the TensorFlow documentation is unnecessarily complex (which you can realize here).
This is a tutorial for tf.data in TensorFlow 2.0.
Don't get me wrong! TensorFlow is a beautiful library. More often than not, data scientists turn to it for their modeling activities. The USP (Unique Selling Point) of this library is how readily it integrates into a machine learning workflow, irrespective of the underlying infrastructure. Now, I could try to make you fall in love with it, but that's not why you're here!
TensorFlow is vast and can be overwhelming to dive into. This article is my attempt to help you understand the library, specifically the TensorFlow Data API. So let's jump right in!
The flow above represents how TensorFlow Data is used; it appears in one way or another throughout a complex ML workflow. The flow shows how each component of the workflow interacts with the next, and it is unidirectional. The data generally lives in external locations and is brought into memory in small chunks. These chunks are stored as Datasets, which are passed between the steps.
The TensorFlow Data API takes a chunk of data, processes it, and passes it on to the next step of the workflow. The processing is generally called a transformation.
All in all, the TensorFlow Data API can be abstracted into three primary components:
- Data Extraction: The process of bringing data from an external location/previous step of workflow into the memory.
- Data Transformation: This is the business logic, if you may, wherein the incoming data is processed.
- Data Load: This step ensures that the transformed data is sent to the next step of the workflow/output location of your transformed data.
Disclaimer: You can skip the next paragraph and the diagram attached to it for a simplified understanding of the API.
Before we jump into these components, let's quickly look at the structure of a Dataset. The Dataset is the backbone of the entire tf.data API. It takes data from external locations and creates graph-like references, which are parsed using abstracted TensorFlow subclasses. Each Dataset, when iterated upon, calls the parsers and brings the data into memory.
The data has to be moved in chunks from the source into memory before it is processed. While doing that, we also have to be cognizant of the fact that the processing must be consistent for every chunk of data. To ensure this, an input_function is created which processes each chunk with the same set of operations.
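As a minimal sketch of that idea (the arrays and the shuffle/batch parameters below are hypothetical stand-ins for real data):

```python
import numpy as np
import tensorflow as tf

def input_fn():
    # Hypothetical in-memory data standing in for an external source.
    features = np.arange(10, dtype=np.float32)
    labels = features * 2.0
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # Every chunk flows through the same set of operations.
    dataset = dataset.shuffle(buffer_size=10).batch(4)
    return dataset

for batch_features, batch_labels in input_fn():
    print(batch_features.numpy(), batch_labels.numpy())
```

Whatever operations you put inside the function are applied identically every time the Dataset is iterated.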
Data Extraction deals with reading data and creating the Dataset object. This can be done in six ways:
1. Consuming NumPy arrays
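A quick sketch of building a Dataset from in-memory NumPy arrays; the random arrays here are placeholders for your real features and labels:

```python
import numpy as np
import tensorflow as tf

# Placeholder arrays; in practice these would hold your real data.
features = np.random.rand(100, 4).astype(np.float32)
labels = np.random.randint(0, 2, size=(100,))

# Each (feature_row, label) pair becomes one element of the Dataset.
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
print(dataset.element_spec)
```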
2. Consuming CSV files
The CSV file format is a popular format for storing tabular data in plain text. Hence, tf.data supports it out of the box.
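A sketch using tf.data.experimental.make_csv_dataset; the tiny CSV written at the top is only there so the example is self-contained:

```python
import tensorflow as tf

# Write a tiny CSV so the example runs standalone; swap in your own file.
csv_path = "example.csv"
with open(csv_path, "w") as f:
    f.write("age,income,label\n25,50000,0\n32,64000,1\n47,81000,1\n")

# make_csv_dataset yields batches of (feature_dict, label) pairs.
dataset = tf.data.experimental.make_csv_dataset(
    csv_path, batch_size=2, label_name="label", num_epochs=1, shuffle=False)

for features, labels in dataset.take(1):
    print(features["age"].numpy(), labels.numpy())
```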
3. Consuming Python Generators
If your data is being served by a Python function (say, a Python scraper), the tf.data API can easily pick it up and parse it.
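For example, a Dataset can wrap a plain Python generator; the scraper function below is a made-up stand-in for a real one:

```python
import tensorflow as tf

def scraper():
    # Hypothetical generator standing in for a real scraper.
    for i in range(5):
        yield f"record-{i}"

# from_generator pulls elements lazily as the Dataset is iterated.
dataset = tf.data.Dataset.from_generator(scraper, output_types=tf.string)

for item in dataset:
    print(item.numpy().decode())
```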
4. From TFRecord Data
This is a rarer, more complex scenario, but some workflows use the TFRecord format, and it is good to know about it.
The tf.data API supports a variety of file formats so that you can process large datasets that do not fit in memory. For example, the TFRecord file format is a simple record-oriented binary format that many TensorFlow applications use for training data. The tf.data.TFRecordDataset class enables you to stream over the contents of one or more TFRecord files as part of an input pipeline.
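A self-contained sketch: write two tf.train.Example records to a TFRecord file, then stream and parse them back (the feature name "x" is arbitrary):

```python
import tensorflow as tf

path = "example.tfrecord"
# Write a couple of serialized tf.train.Example records so this runs standalone.
with tf.io.TFRecordWriter(path) as writer:
    for value in (1, 2):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))}))
        writer.write(example.SerializeToString())

# Stream the raw records, then parse each one back into tensors.
dataset = tf.data.TFRecordDataset([path])
parsed = dataset.map(lambda record: tf.io.parse_single_example(
    record, {"x": tf.io.FixedLenFeature([], tf.int64)}))

for item in parsed:
    print(item["x"].numpy())
```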
5. Consuming text data
Many datasets are distributed as one or more text files (most Natural Language Processing use cases). The tf.data.TextLineDataset class provides an easy way to extract lines from one or more text files. Given one or more filenames, a TextLineDataset will produce one string-valued element per line of those files.
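For example (the file written here just makes the snippet self-contained):

```python
import tensorflow as tf

path = "lines.txt"
with open(path, "w") as f:
    f.write("first line\nsecond line\nthird line\n")

# One string element per line of the file.
dataset = tf.data.TextLineDataset(path)
for line in dataset:
    print(line.numpy().decode())
```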
6. Consuming sets of files deployed on either a GCS bucket or an S3 bucket
There are many datasets distributed as a set of files, where each file is an example.
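Dataset.list_files handles this case; the same glob pattern works with gs:// or s3:// URLs when the corresponding filesystem support is available. Local files are used below only so the snippet runs anywhere:

```python
import os
import tensorflow as tf

# Create two local files; a bucket path like "gs://my-bucket/*.txt"
# would work the same way with GCS/S3 filesystem support.
os.makedirs("data_dir", exist_ok=True)
for name in ("a.txt", "b.txt"):
    with open(os.path.join("data_dir", name), "w") as f:
        f.write("example contents")

# One element per matched file path.
dataset = tf.data.Dataset.list_files("data_dir/*.txt", shuffle=False)
for path in dataset:
    print(path.numpy().decode())
```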
Any operation that changes the natural structure of a Dataset can be called a transformation. There are four transformations that are generally applied to a dataset: shuffle, repeat, batch, and map.
Shuffling is simply the process of randomizing the order in which training data is drawn from the external location. Once the data is extracted, it is shuffled randomly (the original order is not preserved). This ensures that the model doesn't learn any patterns tied to the order of the dataset.
Most machine learning models are trained on several iterations of the training data. These iterations are referred to as epochs. Each epoch generally presents the data in a different order to ensure that the model isn't picking up patterns innately present in the ordering of the dataset.
Repeat is the process of repeating the entire dataset multiple times.
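Both transformations in one short sketch: shuffle randomizes order within a buffer, and repeat(2) yields two epochs of the data:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)
# shuffle draws randomly from a buffer; repeat(2) replays the dataset twice.
dataset = dataset.shuffle(buffer_size=5).repeat(2)

values = [int(v) for v in dataset]
print(len(values))  # 10 elements: each of the 5 originals appears twice
```

Note the order of the calls: shuffling before repeating reshuffles each epoch independently.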
Batching is the process of bringing a set of elements into memory to apply the transformation function. In the case of image data, this could be adding some cells for padding. The simplest form of batching stacks n consecutive elements of a dataset into a single element. The Dataset.batch() transformation does exactly this, with the same constraints as the tf.stack() operator, applied to each component of the elements: i.e., for each component i, all elements must have a tensor of the exact same shape.
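For example:

```python
import tensorflow as tf

# Stack 4 consecutive elements into each batch.
dataset = tf.data.Dataset.range(10).batch(4)
for batch in dataset:
    print(batch.numpy())
# The final batch holds only the 2 leftover elements,
# unless drop_remainder=True is passed to batch().
```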
Map is a comparatively less common transformation. It allows you to call a custom function on your dataset. While using it, the developer has to make sure that the changes made are consistent across the records.
dataset = dataset.map(replace_nulls)  # replace_nulls: custom function to handle NULL values
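A runnable version of that idea, with a made-up replace_nulls that swaps NaNs for zeros:

```python
import tensorflow as tf

# Hypothetical cleanup function: replace NaNs with zeros in each record.
def replace_nulls(record):
    return tf.where(tf.math.is_nan(record), tf.zeros_like(record), record)

dataset = tf.data.Dataset.from_tensor_slices(
    [[1.0, float("nan")], [3.0, 4.0]])
# map applies the same function to every record, keeping the data consistent.
dataset = dataset.map(replace_nulls)

for record in dataset:
    print(record.numpy())
```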
We are almost done here! Data load is a very simple concept. It ensures that the data received after transformation is ready to be loaded into the next step. This is achieved using the prefetch method, which generates an iterator that automatically fetches and transfers your data to the next step.
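A minimal sketch; tf.data.experimental.AUTOTUNE lets the runtime pick the prefetch buffer size:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(100).batch(16)
# prefetch overlaps producing the next batch with consuming the current one.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

for batch in dataset.take(1):
    print(batch.numpy())
```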
Thank you so much for reading!
Do let me know how you liked the article. I plan on demystifying more modules of the TensorFlow Library. Tune in for them!