What is tf.data.Dataset? Part 1.

Apr 23, 2019

A common claim is that the simplest way to use tf.data.Dataset is the from_tensor_slices method, so let’s give it a try. Suppose I have a list of image file names stored in a variable named all_image_paths.

all_image_paths

The output looks something like this:

['../input/train_images/5998cfa4-23d2-11e8-a6a3-ec086b02610b.jpg',
'../input/train_images/588a679f-23d2-11e8-a6a3-ec086b02610b.jpg',
'../input/train_images/59279ce3-23d2-11e8-a6a3-ec086b02610b.jpg',
'../input/train_images/5a2af4ab-23d2-11e8-a6a3-ec086b02610b.jpg',
'../input/train_images/599fbd89-23d2-11e8-a6a3-ec086b02610b.jpg',

The list has 196,299 entries, so the last index is 196,298. That is rather long. A good programming habit is to start with a few lines of code and a small amount of data. If that works, we move on; if it doesn’t, we start debugging. With only a couple of data points and a few lines of code, the behavior is easy to predict and any errors are easier to fix. The point is not to write the full program in one go and expect it to run. It will not work!

So start with just one data point.

# all_image_paths[0]
a = tf.data.Dataset.from_tensor_slices( all_image_paths[0] )
# Unbatching a tensor is only supported for rank >= 1

Hmm… We get an error on the very first line of code. But the next line, which passes the whole list, works.

a = tf.data.Dataset.from_tensor_slices( all_image_paths )
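To see both calls side by side, here is a minimal sketch. It assumes TensorFlow 2.x with eager execution, and sample_paths is a hypothetical three-item stand-in for all_image_paths with made-up file names. As far as I know the failure surfaces as a ValueError, but treat the exact exception type as an assumption.

```python
import tensorflow as tf

# Hypothetical stand-in for all_image_paths (made-up file names).
sample_paths = ["a.jpg", "b.jpg", "c.jpg"]


def try_from_tensor_slices(data):
    """Return the dataset, or the error message if construction fails."""
    try:
        return tf.data.Dataset.from_tensor_slices(data)
    except (ValueError, TypeError) as err:  # assumed exception types
        return str(err)


single = try_from_tensor_slices(sample_paths[0])  # a bare string
whole = try_from_tensor_slices(sample_paths)      # a list of strings

print(type(single).__name__)                # the string holds the error text
print(isinstance(whole, tf.data.Dataset))   # the list version builds a dataset
```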

So is my advice to start small wrong? Actually, it is not. It gives us a chance to learn something we don’t need right now but will need in the near future.

Let’s take a look at the error message. It says something about rank. What is a rank? If you took a linear algebra course, you might have heard of it. If you didn’t, you are still in luck: this is not the same rank as in linear algebra. It’s much simpler.

Here is how the TensorFlow team explains it: the rank of a tensor is not the same as the rank of a matrix. The rank of a tensor is the number of indices required to uniquely select each element of the tensor. Rank is also known as “order”, “degree”, or “ndims.”
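The “number of indices” idea can be illustrated in plain Python, without TensorFlow. The helper nesting_rank below is hypothetical and only a rough sketch: it counts how deeply lists are nested by following the first item, so it assumes regularly nested, non-empty lists.

```python
def nesting_rank(value):
    """Count how many indices are needed to reach a single element.

    Hypothetical helper: it only follows the first item of each list,
    so it assumes regularly nested, non-empty lists.
    """
    rank = 0
    while isinstance(value, list):
        rank += 1
        value = value[0]
    return rank


print(nesting_rank("cat.jpg"))               # 0: no index needed
print(nesting_rank(["cat.jpg", "dog.jpg"]))  # 1: paths[i]
print(nesting_rank([[1, 2], [3, 4]]))        # 2: matrix[i][j]
```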

Ok, then how can we figure out the rank of our data? The function tf.rank is the answer.

print(tf.rank(all_image_paths[0]))
print(tf.rank(all_image_paths))

Since all_image_paths[0] is just a string, its rank is 0. Since all_image_paths is a list of strings, we need one index to uniquely select each file name, so its rank is 1. So how should we change the first line of code? Just make it a list.

a = tf.data.Dataset.from_tensor_slices( all_image_paths[0:1] )
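The difference between [0] and [0:1] is plain Python behavior: indexing returns the element itself, while slicing returns a list. A small sketch, again with a hypothetical sample_paths list standing in for all_image_paths:

```python
# Hypothetical stand-in for all_image_paths (made-up file names).
sample_paths = ["a.jpg", "b.jpg", "c.jpg"]

first_item = sample_paths[0]     # indexing returns the element itself
first_slice = sample_paths[0:1]  # slicing returns a list with one element

print(type(first_item).__name__, first_item)    # str a.jpg
print(type(first_slice).__name__, first_slice)  # list ['a.jpg']
```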

Now we have our first tf.data.Dataset.
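To check what the dataset actually holds, we can loop over it. This is a minimal sketch that assumes TensorFlow 2.x, where eager execution lets us iterate over a dataset directly; sample_paths is a hypothetical stand-in for all_image_paths.

```python
import tensorflow as tf

# Hypothetical stand-in for all_image_paths (made-up file names).
sample_paths = ["a.jpg", "b.jpg", "c.jpg"]

ds = tf.data.Dataset.from_tensor_slices(sample_paths[0:1])

# Each element is a rank-0 string tensor; .numpy() turns it into Python bytes.
items = [element.numpy() for element in ds]
print(items)
```

The dataset has one element because the slice has one item. TensorFlow stores strings as bytes, which is why the value comes back as a bytes object.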
