Archive: How to build an Intake catalog
September 13, 2018
Originally published at www.informaticslab.co.uk by Jacob Tomlinson.
Intake is a new library from Anaconda to take the pain out of loading common datasets into your Python analysis. It allows you to package a dataset so that it can be installed via Conda and imported into your Python session using the intake library.
This tutorial will walk through writing an Intake catalog, building a conda package and using it within your Python session.
Requirements
For this tutorial you will need a functioning Python environment and the Conda package manager. If you don’t have this I recommend you follow the docs to get set up.
We will also need to install Intake, the Conda build tools and the Anaconda Cloud tools.
$ conda install -c intake intake conda-build anaconda-client
You will need to create an Anaconda Cloud account and login using the command line. We will also configure conda to automatically upload all packages we build to Anaconda Cloud.
$ anaconda login
Using Anaconda API: https://api.anaconda.org
Username: <username>
<username>'s Password:
login successful
$ conda config --set anaconda_upload yes
Getting some data
For this tutorial we need some data to work with. To keep this tutorial simple we are going to steer clear of large, complex, multidimensional weather data and work with a smaller and more tabular dataset. Luckily here in Exeter we have a data portal called the Exeter Data Mill which contains datasets about the city. One of those datasets is historic ticket sales in our car parks which has been shared by Exeter City Council.
We are going to work with the full raw dataset, which is sales per hour for 28 car parks over four years. The data is stored as a CSV file containing just under one million rows and is around 88MB.
The data is hosted on AWS S3; the full URL appears as the urlpath in the Intake manifest below.
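For comparison, without Intake you could load the CSV directly with pandas. Here is a minimal sketch, assuming pandas is installed and using the same S3 URL that appears in the manifest below:
import pandas as pd

url = (
    "https://s3-eu-west-1.amazonaws.com/files.datapress.com/exeter/dataset/"
    "car-park-tickets-sold/2018-07-26T10%3A23%3A33.34/"
    "TickSalesbySiteByDateByHour_20140305-20180717.csv"
)

# Download and parse the CSV straight from the URL
df = pd.read_csv(url)
print(df.shape)  # just under one million rows
Packaging the dataset with Intake means nobody has to copy this URL around or remember which reader to use.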
Writing an Intake manifest
The Intake library automatically populates its catalog from YAML manifest files which are placed in $PREFIX/share/intake/, where $PREFIX is the Conda environment path. This means other Conda packages can drop YAML files into that directory and Intake will pick them up.
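If you want to check where that directory is on your machine, here is a quick sketch, assuming you are running the interpreter from the Conda environment in question:
import os
import sys

# sys.prefix points at the active Conda environment, so the directory
# Intake scans for catalog manifests is:
catalog_dir = os.path.join(sys.prefix, "share", "intake")
print(catalog_dir)

# List any manifests already installed (the directory may not exist yet)
if os.path.isdir(catalog_dir):
    print(os.listdir(catalog_dir))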
For our car park dataset we are going to create a minimal manifest file which will include a description of the data, the Intake driver to use, the URL to download the CSV from and some metadata about the canonical location.
# car-park-tickets-sold.yaml
sources:
  car_park_tickets_sold:
    description: Data about the number of tickets sold in Exeter car parks from Exeter City Council (https://exeterdatamill.com/dataset/car-park-tickets-sold)
    driver: csv
    args:
      urlpath: 'https://s3-eu-west-1.amazonaws.com/files.datapress.com/exeter/dataset/car-park-tickets-sold/2018-07-26T10%3A23%3A33.34/TickSalesbySiteByDateByHour_20140305-20180717.csv'
    metadata:
      origin_url: 'https://exeterdatamill.com/download/car-park-tickets-sold/cf542d64-0dea-4370-9006-a9e5f965ce1a/TickSalesbySiteByDateByHour_20140305-20180717.csv'
Note here that we are using the csv Intake driver, which means Intake will give us a Pandas DataFrame by default when we try to load this data, but can also provide us with a Dask DataFrame if we wish. We could also install Intake plugins to extend this and allow us to get the data in different ways.
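Jumping ahead slightly to once the catalog entry is installed (we build and install the package below), here is a rough sketch of the two ways the csv driver can hand the data back; the Dask variant assumes dask is also installed in the environment:
import intake

# The builtin catalog, populated from the YAML manifests described above
source = intake.cat.car_park_tickets_sold

df = source.read()      # the whole dataset as a Pandas DataFrame
ddf = source.to_dask()  # the same data as a lazily partitioned Dask DataFrame
print(type(df), type(ddf))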
Writing a Conda package
Now that we have our manifest file we need to package it so that anyone can install our Intake catalog.
A Conda package requires two files: a meta.yaml file which describes the package and a build.sh file which will be executed at build time.
To keep all of this together I've created an example repository on GitHub which you can have a look at. Within the repository we will create a directory called car-park-tickets-sold to put our Conda build files in.
META.YAML
The meta.yaml file only needs to contain the package.name and package.version properties to be a valid package, but we will also add a few other things. We will specify that this package is noarch: generic, as it is only a config file and the CPU architecture doesn't matter. We will specify that the package needs intake to be installed at run time; it wouldn't be much use without it! Finally we will include some information about data licensing, which I've copied from the original data on the Exeter Data Mill.
# meta.yaml
package:
  version: '1.0.3'
  name: 'data-exeter-car-park-tickets-sold'

build:
  number: 0
  noarch: generic

requirements:
  run:
    - intake
  build: []

about:
  description: Data about the number of tickets sold in Exeter car parks from Exeter City Council (https://exeterdatamill.com/dataset/car-park-tickets-sold)
  license: OGL v3
  license_family: OTHER
  summary: Data about Exeter car park ticket sales
BUILD.SH
The build.sh script simply makes sure the shared Intake directory exists and copies our Intake manifest into it.
#!/bin/bash

mkdir -p $PREFIX/share/intake
cp $RECIPE_DIR/car-park-tickets-sold.yaml $PREFIX/share/intake/
CAR-PARK-TICKETS-SOLD.YAML
We also need to copy the Intake manifest we wrote before into the package.
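With all three files in place, the car-park-tickets-sold recipe directory should look something like this:
car-park-tickets-sold/
├── meta.yaml
├── build.sh
└── car-park-tickets-sold.yaml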
Building and publishing the package
Now that we have these three files in our car-park-tickets-sold directory we need to build it with the Conda build tools. Note that we need to specify -c intake to ensure that Conda can find the intake package.
conda build -c intake car-park-tickets-sold
This will build the package, output a file called data-exeter-car-park-tickets-sold-1.0.3-0.tar.bz2 and upload it to the Anaconda repository thanks to the credentials we set up earlier. It will then be listed under our username as a new package.
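If you would rather not upload automatically, one possible alternative (a sketch assuming the same anaconda-client login; the exact package path is printed by conda build) is to turn auto-upload off and push the built file by hand:
$ conda config --set anaconda_upload no
$ conda build -c intake car-park-tickets-sold
$ anaconda upload <path to the .tar.bz2 printed by conda build>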
Installing our package
Now that our package exists and is publicly accessible we can install it using conda install with the -c flag pointing at our channel. For example, anyone can install my demo version of the car parking dataset with the following command:
conda install -c jacobtomlinson data-exeter-car-park-tickets-sold
Using our package
Now that we have installed our dataset we should be able to import intake in a Python session and see our data in the catalog list.
Python 3.6.3 |Anaconda, Inc.| (default, Nov 9 2017, 00:19:18)
[GCC 7.2.0] on linux
>>> import intake
>>> list(intake.cat)
['car_park_tickets_sold']
We can then read the dataset into a Pandas DataFrame and view the first five rows with head().
>>> df = intake.cat.car_park_tickets_sold.read()
>>> df.head()
Year Month Date Hour Site Tickets SiteSub
0 2014 Mar 2014-03-05 08:00 Purchase Count - Bampfylde Street Car Park 10.0 Bampfylde Street Car Park
1 2014 Mar 2014-03-05 16:00 Purchase Count - Bampfylde Street Car Park 2.0 Bampfylde Street Car Park
2 2014 Mar 2014-03-06 08:00 Purchase Count - Bampfylde Street Car Park NaN Bampfylde Street Car Park
3 2014 Mar 2014-03-06 09:00 Purchase Count - Bampfylde Street Car Park NaN Bampfylde Street Car Park
4 2014 Mar 2014-03-06 10:00 Purchase Count - Bampfylde Street Car Park NaN Bampfylde Street Car Park
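As a possible next step, here is a small sketch using the columns shown above to total the tickets sold per car park; pandas skips the NaN values by default:
# Total tickets sold per car park, largest first
totals = df.groupby("Site")["Tickets"].sum().sort_values(ascending=False)
print(totals.head())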
Conclusion
That’s it! We have packaged a very simple CSV dataset from an open data catalog into an Intake Conda package, installed it into our environment and loaded the data.
This is just scratching the surface of what Intake can do and I urge you to explore the documentation to learn more about what you can do.