‘Data’ + ‘Source’ == ‘DataSource’ >> True
I don’t know why I am coming up with this cheeky title as an introduction to DataSource — maybe because that’s all it really is in FastAI? A DataSource really is the source for all your data — train, valid & test (can be more subsets too — depending on however many are required).
Think of DataSource as a central place consisting of items (data), with different pipelines applied to them to get the Xs and ys.

As is a common theme in most of my articles, it is important to first get an intuition about the topic before we start digging into the code. The intuition that I get from DataSource is shown in the image above: a DataSource has a central place (a source) that consists of all the items, and we apply a set of Transforms on these items to get the Xs and ys.
In the pets example (see pets tutorial here), since the items are .jpg images, if we want to get to a point where we can train a Neural Net, we will need two things from these images: a Tensor representing the image and a Label to train the model on. Therefore, if we apply two sets of transforms on this List of items, we can get our Xs and ys.
And this is what DataSource does!
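To make the intuition concrete, here is a minimal pure-Python sketch of "one list of items, two pipelines". It is not FastAI code: fake_open_image, label_from_name and apply_pipeline are hypothetical helpers invented for illustration, and the "image" is just a string.

```python
from pathlib import Path

# Hypothetical stand-ins: the real X pipeline opens an image and the real
# y pipeline also categorizes the label; here both only use the file name.
def fake_open_image(path):
    return f"<image {path.name}>"

def label_from_name(path):
    # "keeshond_34.jpg" -> "keeshond"
    return path.stem.rsplit("_", 1)[0]

def apply_pipeline(pipeline, item):
    # Run each function in order, feeding the output of one into the next
    for f in pipeline:
        item = f(item)
    return item

items = [Path("images/keeshond_34.jpg"), Path("images/Siamese_178.jpg")]
x_pipeline = [fake_open_image]
y_pipeline = [label_from_name]

# One shared list of items, two pipelines -> (X, y) pairs
pairs = [(apply_pipeline(x_pipeline, it), apply_pipeline(y_pipeline, it)) for it in items]
print(pairs[0])
```

The same items feed both pipelines; only the transforms differ.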
Inputs and Outputs
In this post, we will be creating the same DataSource object as in the last post here. We will create the Pets DataSource passing in the items, Transforms and Train/Test splits. Internally, when we were using the DataBlocks API, in step-4, the same DataSource object was created to extract the DataBunch from it by passing in item_tfms and batch_tfms as well.
So let’s get the things we need to create a DataSource:
np.random.seed(2)
path = untar_data(URLs.PETS)
pat = r'/([^/]+)_\d+.jpg$'
items = get_image_files(path/'images')[:5]
tfms = L([PILBase.create], [RegexLabeller(pat), Categorize])
splits = RandomSplitter(0.2)(items)
- items: The list of items or our Dataset (source of data);
- tfms: A list of Transforms that will be applied to the data, can be a list of two transforms, one for X and one for y;
- splits: Split indexes that tell how to split the data into train and valid set (or more);
I am only using the first 5 images in items as a dummy example for the purpose of this post. You can also run it on the complete dataset; just remove the [:5] at the end of the line items = get_image_files(path/'images')[:5].
items
>> (#5) [/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg,/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg]
tfms
>> (#2) [[<bound method PILBase.create of <class 'local.vision.core.PILBase'>>],[<local.data.transforms.RegexLabeller object at 0x7efee98d3cf8>, <class 'local.data.transforms.Categorize'>]]
splits
>> ((#4) [1,0,3,2], (#1) [4])
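The splits value is simply a pair of index lists, one per subset. As a rough sketch of the idea (this is not FastAI's RandomSplitter; random_split is a hypothetical helper and its shuffled indices will not match the ones printed above), a random split could look like this:

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    # Shuffle the indices, then cut off the first valid_pct share for validation
    rng = random.Random(seed)
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]

train_idxs, valid_idxs = random_split(5, valid_pct=0.2, seed=2)
print(len(train_idxs), len(valid_idxs))
```

With 5 items and valid_pct=0.2, four indices end up in the train list and one in the valid list, which is the same shape as the splits shown above.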
Along with items, we also pass in a list of Transforms: in this case, as mentioned, a set of X transforms and a set of y transforms. The X transforms convert the .jpg image to a PILImage, while the y transforms first extract the label from the file path and then categorize it, since we can’t feed str labels directly to train the Neural Net.
As long as we keep in mind that DataSource is really a set of X and y Transforms, we are all good! Inside FastAI, a TfmdList applies a set of transforms to every item in the list. So we have two TfmdLists: one for X to create the image and one for y to extract the label and categorize it.
Now, let’s create the DataSource object and check out the outputs. The TfmdLists get stored as the tls attribute on the object.
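The y side can be sketched in plain Python. The snippet below reuses the pat regex from the post with the standard re module; regex_label, the /data/images paths and the sorted-vocab mapping are assumptions made for illustration, not the actual RegexLabeller or Categorize code.

```python
import re

pat = r'/([^/]+)_\d+.jpg$'  # same pattern as in the post

def regex_label(path_str):
    # Capture the file name up to the trailing "_<digits>.jpg"
    return re.search(pat, path_str).group(1)

# Hypothetical paths standing in for the four training items
train_paths = [
    "/data/images/Siamese_178.jpg",
    "/data/images/keeshond_34.jpg",
    "/data/images/Abyssinian_92.jpg",
    "/data/images/german_shorthaired_94.jpg",
]
labels = [regex_label(p) for p in train_paths]

# Assumption: the vocab is the sorted unique labels, mapped to integers
vocab = sorted(set(labels))
label_to_idx = {lbl: i for i, lbl in enumerate(vocab)}
print(vocab)
print(label_to_idx["keeshond"])
```

Under that sorted-vocab assumption, keeshond lands at index 3 among the four training labels.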
dsrc = DataSource(items, tfms=tfms, splits=splits)
dsrc.tls
>> (#2) [TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg')]
tfms - (#1) [Transform: True (object,object) -> create ],TfmdList: [PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/keeshond_34.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Siamese_178.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/german_shorthaired_94.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/Abyssinian_92.jpg'), PosixPath('/home/ubuntu/.fastai/data/oxford-iiit-pet/images/basset_hound_111.jpg')]
tfms - (#2) [Transform: True (object,object) -> RegexLabeller ,Categorize: True (object,object) -> encodes (object,object) -> decodes]]
If you look closely, we have two TfmdLists having the same set of items but two different sets of X and y transforms. The first TfmdList has tfms [Transform: True (object,object) -> create ], while the second TfmdList has tfms [Transform: True (object,object) -> RegexLabeller ,Categorize: True (object,object) -> encodes (object,object) -> decodes]. I really wasn’t lying when I said that’s exactly what a DataSource does :) (bad attempt at a joke, I know).
Let’s check the outputs:
dsrc[0]
>> (<local.vision.core.PILBase image mode=RGB size=375x500 at 0x7FEE392BFF60>,
tensor(3))
So, as expected, we get a PILImage as our X and a category as our y. We can check what the category represents by looking inside the vocab of the DataSource object.
dsrc.vocab(3)
>> 'keeshond'
Therefore, the image that we have is that of a keeshond. Let’s visualise the image.
dsrc[0][0]
To summarise and reiterate, a DataSource has a list of TfmdLists, in this case 2 TfmdLists: one to extract the X variable with tfms [PILImage.create], and another to get the y variable with tfms [RegexLabeller, Categorize]. Finally, when we check the 0th item in dsrc, we get a PILImage and a category as tensor(3), which represents keeshond.
Show me the code, please
Now that we know what a DataSource does and also the inputs and outputs that go into it, let’s try and understand what really goes on underneath. I won’t be looking at TfmdLists in this post, but rather in a future one, where we will take a deep dive into TfmdLists and also Transforms.
Above is the source code for DataSource; a few things to note:
1. When we first instantiate a DataSource object, the first thing to happen is that tls gets added to the object, like so:
self.tls = L(tls if tls else [TfmdList(items, t, **kwargs) for t in L(ifnone(tfms,[None]))])
For every list of tfms (passed to the constructor), create a TfmdList passing in the items and Transforms as t. Therefore, we get two TfmdLists, one for X and one for y, since tfms was a list of two lists of transforms that, if you remember, looked like (#2) [[<bound method PILBase.create of <class 'local.vision.core.PILBase'>>],[<local.data.transforms.RegexLabeller object at 0x7efee98d3cf8>, <class 'local.data.transforms.Categorize'>]].
We could also have just created two TfmdLists and passed them in to create a DataSource object. This does occur inside the subset method of DataSource.
Also, a key thing to note is that both the TfmdLists get the same items.
2. Calling __getitem__ on the DataSource object calls __getitem__ on both the TfmdLists, returning a tuple. Since we have 2 TfmdLists in this example, we get a tuple of two items: X and y.
3. __len__ on dsrc returns the __len__ of the first TfmdList: def __len__(self): return len(self.tls[0]). Since the first TfmdList has the same number of items as the second, this essentially returns the number of items in dsrc.
4. dsrc has two subsets: subset(0) and subset(1). subset(0) represents the train set, while subset(1) represents the valid set. This is explained in more detail later in the post.
5. dsrc has an items attribute that returns the list of items in the first TfmdList, since both TfmdLists have the same items.
6. items.setter: items is a property inside the DataSource class. Therefore, when we access dsrc.items, it invokes the getter of items and returns self.tls[0].items, or in simpler terms, the list of items of the first TfmdList. However, the key thing to note here is that if we ever update the list of items, it needs to be updated in every TfmdList, and FastAI takes care of this by defining items.setter like so:
@items.setter
def items(self, v):
    for tl in self.tls: tl.items = v
Therefore, upon setting items, they get set on each of the TfmdLists. Otherwise, this would have led to some inconsistencies.
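The points above (one TfmdList per pipeline, __len__ delegating to the first list, and the items setter keeping every list in sync) can be sketched in plain Python. MiniTfmdList and MiniDataSource are hypothetical stand-ins written for this illustration; they are not the FastAI classes and skip everything else those classes do.

```python
class MiniTfmdList:
    # Hypothetical stand-in for TfmdList: items plus one pipeline of functions
    def __init__(self, items, tfms):
        self.items, self.tfms = items, tfms

    def __len__(self):
        return len(self.items)

class MiniDataSource:
    def __init__(self, items, tfms):
        # One MiniTfmdList per pipeline, all sharing the same items
        self.tls = [MiniTfmdList(items, t) for t in tfms]

    def __len__(self):
        # Every list has the same items, so the first one is enough
        return len(self.tls[0])

    @property
    def items(self):
        return self.tls[0].items

    @items.setter
    def items(self, v):
        # Update every list so they never drift apart
        for tl in self.tls:
            tl.items = v

ds = MiniDataSource(["a_1.jpg", "b_2.jpg"], tfms=[[str.upper], [len]])
print(len(ds.tls), len(ds))
ds.items = ["c_3.jpg"]
print([tl.items for tl in ds.tls])
```

Without the setter loop, assigning new items would only change one list and the X and y sides would silently disagree.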
Now let’s look at a few of the methods individually:
DataSource `__getitem__`
def __getitem__(self, it):
    res = tuple([tl[it] for tl in self.tls])
    return res if is_indexer(it) else list(zip(*res))
Let’s suppose we are looking for item 0; in that case it equals 0, and res becomes tuple([tl[0] for tl in self.tls]). In other words: for each of the TfmdLists, return the 0th item please, and return them together as a tuple. It is actually inside the TfmdLists that the Transforms get called; we will be looking at this in a later blog post.
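Here is a small pure-Python sketch of that indexing behaviour. The Mini classes are hypothetical, and isinstance(it, int) is a simplified stand-in for is_indexer, so treat it as an illustration of the tuple-versus-zip logic only.

```python
class MiniTfmdList:
    def __init__(self, items, tfms):
        self.items, self.tfms = items, tfms

    def _apply(self, o):
        for f in self.tfms:
            o = f(o)
        return o

    def __getitem__(self, it):
        # Transforms run lazily, at indexing time
        if isinstance(it, int):
            return self._apply(self.items[it])
        return [self._apply(o) for o in self.items[it]]

class MiniDataSource:
    def __init__(self, items, tfms):
        self.tls = [MiniTfmdList(items, t) for t in tfms]

    def __getitem__(self, it):
        res = tuple(tl[it] for tl in self.tls)
        # Single index -> one (x, y) tuple; slice -> list of (x, y) tuples
        return res if isinstance(it, int) else list(zip(*res))

ds = MiniDataSource(
    ["keeshond_34", "Siamese_178"],
    tfms=[[str.upper], [lambda s: s.rsplit("_", 1)[0]]],
)
print(ds[0])
print(ds[0:2])
```

For a slice, each list returns all of its Xs or ys, and zip(*res) regroups them into per-item pairs.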
DataSource `subset`
def subset(self, i): return type(self)(tls=L(tl.subset(i) for tl in self.tls), n_inp=self.n_inp)
Let’s check out subsets next. As we can see, the actual logic of subsets is implemented inside the TfmdList. All DataSource does is get the ith subset from each of the TfmdLists and return them as a new DataSource. Remember, there were two ways to create a DataSource:
- Pass in items, Transforms and splits
- Pass in TfmdLists
When we call subset, we use the second method to create a DataSource. As you will see in a future blog post, calling subset on a TfmdList returns a TfmdList.
dsrc.subset(0) can also be accessed via dsrc.train, and dsrc.subset(1) via dsrc.valid. Where does this magic happen?
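The two construction routes and the delegation to the lists can be sketched as follows. Again, MiniTfmdList and MiniDataSource are hypothetical simplifications (no n_inp, no lazy transforms), written only to show the shape of the logic.

```python
class MiniTfmdList:
    def __init__(self, items, tfms, splits=None):
        self.items, self.tfms, self.splits = items, tfms, splits

    def subset(self, i):
        # Keep only the items whose indices are in the i-th split
        return MiniTfmdList([self.items[j] for j in self.splits[i]], self.tfms)

class MiniDataSource:
    def __init__(self, items=None, tfms=None, splits=None, tls=None):
        # Either reuse ready-made lists, or build one per pipeline
        self.tls = tls if tls else [MiniTfmdList(items, t, splits) for t in tfms]

    def subset(self, i):
        # Second route: build a new object directly from subsetted lists
        return type(self)(tls=[tl.subset(i) for tl in self.tls])

    @property
    def items(self):
        return self.tls[0].items

ds = MiniDataSource(list("abcde"), tfms=[[str.upper], [ord]], splits=([1, 0, 3, 2], [4]))
train, valid = ds.subset(0), ds.subset(1)
print(train.items, valid.items)
```

The splits here are the same index lists as in the post, so the train subset holds four items and the valid subset holds one.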
It’s inside FilteredBase, the class DataSource inherits from.
FilteredBase.train,FilteredBase.valid = add_props(lambda i,x: x.subset(i), 2)
Inside FilteredBase, train is a property that refers to subset(0), while valid is a property that refers to subset(1).
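To see why that one line is enough, here is a sketch of what a helper like add_props could do. This version is my own minimal reconstruction for illustration and may differ from the library's definition; MiniFilteredBase is a hypothetical class whose subset just returns a string.

```python
from functools import partial

def add_props(f, n=2):
    # Sketch: build n properties, the i-th one calling f(i, self)
    return (property(partial(f, i)) for i in range(n))

class MiniFilteredBase:
    def subset(self, i):
        return f"subset {i}"

# Same assignment pattern as in the post
MiniFilteredBase.train, MiniFilteredBase.valid = add_props(lambda i, x: x.subset(i), 2)

obj = MiniFilteredBase()
print(obj.train, obj.valid)
```

Each property binds a different i, so obj.train calls subset(0) and obj.valid calls subset(1) without any explicit method definitions.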
DataSource `decode`
def decode(self, o, full=True): return tuple(tl.decode(o_, full=full) for o_,tl in zip(o,tuplify(self.tls, match=o)))
Something I haven’t touched on yet is the decode method inside DataSource. At this stage, if you have followed Jeremy’s walkthrus here, then you are already aware of encodes and decodes. If not, I highly recommend watching this video here to get an intuition. encodes is something that transforms an item from A to B.
For example, Categorize encodes a str such as keeshond to tensor(3), while decodes will decode tensor(3) back to keeshond. The calls to decode also occur inside the TfmdList. We will cover this in detail in my upcoming post. For now, I want you to have the intuition that encodes transforms something from A to B, while decodes, if implemented, usually brings it back from B to A, like so:
encoded = dsrc[0]; encoded
>> (<local.vision.core.PILBase image mode=RGB size=375x500 at 0x7F7E16640E48>,
tensor(3))
decoded = dsrc.decode(encoded); decoded
>> (<local.vision.core.PILBase image mode=RGB size=375x500 at 0x7F7E16640E48>,
'keeshond')
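The A to B and back again idea can be shown with a tiny pure-Python class. MiniCategorize is a hypothetical sketch that returns plain integers rather than tensors, and the vocab is assumed to be the four training labels from this post.

```python
class MiniCategorize:
    # Hypothetical sketch of a reversible label transform
    def __init__(self, vocab):
        self.vocab = list(vocab)
        self.o2i = {o: i for i, o in enumerate(self.vocab)}

    def encodes(self, o):
        # A -> B: label string to integer index
        return self.o2i[o]

    def decodes(self, i):
        # B -> A: integer index back to label string
        return self.vocab[i]

cat = MiniCategorize(["Abyssinian", "Siamese", "german_shorthaired", "keeshond"])
encoded = cat.encodes("keeshond")
decoded = cat.decodes(encoded)
print(encoded, decoded)
```

Decoding is what makes it possible to show a human-readable label for a prediction instead of a bare index.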
With this sort of intuition and understanding, let’s conclude this overview of DataSource. We will build on top of this post when we look into TfmdLists, which is where the call to Transforms occurs, and also look at the decodes method in more detail. This will help us understand how it is that we get a (PILImage, tensor(3)) when we ask for the first item in dsrc like so: dsrc[0].
For a deep dive into TfmdLists, and to further understand DataSource, refer to another post here.
