Searching the College de France

Timothé Faudot
6 min read · Aug 28, 2017

“Docet omnia” (“It teaches all”)

This is part 1 of a series I plan to write to explain the development, testing and productionization of a project I am working on during my work leave.

We’ll go through the process of periodically scraping a website, storing the extracted information in a database, using a speech-to-text API to transcribe minutes/hours/days of audio, enabling intelligent full-text search over the transcriptions, setting up a Kubernetes cluster to orchestrate all of that in the cloud, talking a bit about finance, and finally serving the data in a reliable way, with our own domain, Google-backed DNS, and HTTPS using Let’s Encrypt.

I mainly worked on this project during my leave because I had been sitting on the idea for some time, and although I knew how to implement all of it for free with the infrastructure we have at work, I had very little idea of how to accomplish it in the outside world, or of how much it would actually cost to run… It was definitely a learning and humbling experience, which I hope this series will show, and hopefully it will also help other people who have run into the same issues I did.

What is the College de France?

The College de France is a French research institute; let me quote Wikipedia for a proper description:

The Collège is considered to be France’s most prestigious research university.[1][2] It does not grant degrees. Each professor is required to give lectures where attendance is free and open to anyone. Professors, about 50 in number, are chosen by the professors themselves, from a variety of disciplines, in both science and the humanities.

What I really love about the College de France is that most, if not all, of the lessons are available online, in audio-only format, video, or both. This makes great research available to everyone who understands the language (most of the lessons are in French, but foreign professors are sometimes invited for symposiums, so English makes its way in).

I found that listening to podcasts I would never have thought of listening to before really broadened my interests and made me aware of research happening outside my field of expertise. Who would think that a 10-hour lesson about recent archaeological discoveries in Pompeii, immunology, cellular replication mechanisms, or even 30 hours about the anthropology of landscape (I am almost at the end of that one…) would be interesting? I did not before; now I can’t wait for the next lessons!

Listening to all these podcasts, I noticed that there is a lot of overlap between lessons, professors, discoveries, and what we weirdly call the hard and soft sciences. Yet the College de France offers no way to search its website beyond the metadata of the lessons (title, professor, etc.). I want to go one level deeper and allow people to search the actual lessons themselves. This is where it all becomes a bit technical.

Scraping for good

Scraping is the process of going through a website and extracting information out of it. It might seem strange, but internally at Google we never perform such a task ourselves: the whole internet is already scraped for us and available in a somewhat structured format, on which we just run MapReduces or more complex pipelines to extract the data we want.

This time, however, we’ll have to do it ourselves. For this task I usually fire up an IPython interpreter and use the Beautiful Soup library to parse the content, and that is what we’ll do here too!

So let’s get going: install Python 3.6 and create a virtual environment with the necessary dependencies.
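Something along these lines does the trick; the package list is simply my guess at what we need for the rest of this post (requests and Beautiful Soup for the scraping, the Datastore client for storage, and IPython to experiment):

```bash
# Assumes Python 3.6 is already installed on your machine.
python3.6 -m venv cdf
source cdf/bin/activate

# Scraping, storage, and an interactive shell to experiment in.
pip install ipython requests beautifulsoup4 google-cloud-datastore
```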

Okay, now we can start playing with the website, fetch its contents, and store them somewhere. Looking at the dependencies, you saw that we’re going to use Google Cloud Datastore for that: it is a NoSQL database with a generous free tier, and it will come in handy in the next parts of this series, as you’ll see.

The audio files cannot be enumerated easily, and we’ll need the metadata around them anyway, so we’ll have to go through the whole archive. For that we can use the basic search interface that the College de France provides: we can restrict it to audio pages only and then scrape them one by one.
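In outline, the enumeration and the per-page scraping look something like the sketch below. The search URL, its parameters, and the CSS selectors are placeholders rather than the site’s real markup, so treat this as the shape of the code and not a copy-paste recipe:

```python
import requests
from bs4 import BeautifulSoup

# Placeholder endpoint: the real search URL, its pagination parameter and the
# audio-only filter on the College de France website differ from this sketch.
SEARCH_URL = "https://www.college-de-france.fr/search?type=audio&page={page}"

def list_audio_pages():
    """Yield the URL of every lesson page with audio, one search page at a time."""
    page = 0
    while True:
        resp = requests.get(SEARCH_URL.format(page=page))
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        links = [a["href"] for a in soup.select("a.search-result")]  # placeholder selector
        if not links:
            return  # past the last page of results
        yield from links
        page += 1

def scrape_lesson_page(url):
    """Fetch one lesson page and extract the metadata we care about (placeholder selectors)."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return {
        "title": soup.select_one("h1").get_text(strip=True),
        "professor": soup.select_one(".professor").get_text(strip=True),
        "date": soup.select_one(".lesson-date").get_text(strip=True),
        "start_time": soup.select_one(".lesson-time").get_text(strip=True),
        "audio_url": soup.select_one("audio source")["src"],
    }
```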

Professors at the College de France are amazing, but they cannot realistically give more than one lesson at the same time, so the tuple (Date, Professor, Start time) makes a pretty solid key for our lessons.
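With the google-cloud-datastore client, that translates to something like this (the kind name “Lesson” and the field names are illustrative choices, not the actual schema):

```python
from google.cloud import datastore

# The client picks up credentials from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
client = datastore.Client()

def lesson_key(date, professor, start_time):
    """Deterministic key: the same lesson always maps to the same entity."""
    return client.key("Lesson", "{}|{}|{}".format(date, professor, start_time))

def store_lesson(metadata):
    entity = datastore.Entity(key=lesson_key(metadata["date"],
                                             metadata["professor"],
                                             metadata["start_time"]))
    entity.update(metadata)
    client.put(entity)
```

Because the key is derived from the lesson itself, put() behaves like an upsert: re-scraping a page we already have simply rewrites the same entity instead of creating a duplicate.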

Writing the actual scraper, I found that there are about 6K pages with audio, of which I can successfully extract roughly 2.5K. The rest are pages where the date or end time is not specified, and I am refraining from scraping those for now; we’ll get to why later, when we talk finance a bit…

Playing with Docker and non-English locales

It should come as no surprise that most Docker images ship with English as their default locale. This hindered my development, because I needed to parse dates written in proper French, like “12 Juillet 2016”, and Python will only do that if the French locale is installed on the system and either set as the default or selected dynamically at runtime.

So I wrote a Dockerfile that starts from the slim Python 3.6 base image, adds the locale I want, and makes it the default.
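Mine looks roughly like the sketch below; the entry point and the requirements file at the end are placeholders for whatever your project actually contains:

```dockerfile
FROM python:3.6-slim

# The slim image ships with no generated locale, so install the locales package,
# enable fr_FR.UTF-8 and generate it.
RUN apt-get update \
    && apt-get install -y --no-install-recommends locales \
    && sed -i 's/^# *fr_FR.UTF-8 UTF-8/fr_FR.UTF-8 UTF-8/' /etc/locale.gen \
    && locale-gen \
    && rm -rf /var/lib/apt/lists/*

# Make French the default locale, and keep the timezone consistent with it.
ENV LANG=fr_FR.UTF-8 \
    LANGUAGE=fr_FR:fr \
    LC_ALL=fr_FR.UTF-8 \
    TZ=Europe/Paris

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "scraper.py"]
```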

The timezone here is not necessary, but I left it in for documentation purposes and for consistency between the locale and the timezone in the image.

With that done, parsing French dates works as intended.
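For example, a quick check along these lines (using the date format mentioned above):

```python
import locale
from datetime import datetime

# Works because fr_FR.UTF-8 is now installed in the image and set as the default.
locale.setlocale(locale.LC_TIME, "fr_FR.UTF-8")

date = datetime.strptime("12 Juillet 2016", "%d %B %Y")
print(date.date())  # 2016-07-12
```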

The next thing we want is to run this periodically, and possibly from a machine that is closer to France than where I am now (Japan…). Time to set up our Kubernetes cluster!

Setting up a Kubernetes cluster

We’ll need a cluster with just one node for now. An f1-micro instance would be enough, but we’ll need more later, so let’s go with a standard n1 machine type.
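The commands below are a sketch of what I ran; the project ID, cluster name, image name and file name are placeholders to replace with your own:

```bash
# Placeholders: my-project-id, college-de-france and scraper are names made up for this sketch.
gcloud config set project my-project-id
gcloud config set compute/zone europe-west1-b

# One node in Europe, on a standard n1 machine, with the latest available Kubernetes version.
gcloud container clusters create college-de-france \
    --num-nodes=1 \
    --machine-type=n1-standard-1 \
    --cluster-version=latest

# Build the scraper image and push it to the project's private Container Registry.
docker build -t gcr.io/my-project-id/scraper:latest .
gcloud docker -- push gcr.io/my-project-id/scraper:latest

# Point kubectl at the new cluster and launch the scraping job.
gcloud container clusters get-credentials college-de-france
kubectl create -f scraper-job.yaml
```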

A lot went in there, let’s recap:

  • We’ll be running in Europe to be close to the College de France website.
  • We created a cluster with one node and the latest Kubernetes version, to take advantage of some new features that I’ll go through later.
  • We built Docker images and stored them in our own private registry within Google Cloud.
  • We ran the scraper job.

Overall it should only take a few minutes to go through the 6K audio pages and store the results in Datastore.

The Kubernetes job config itself is pretty simple: it’s a basic “Job”, i.e. a program that runs to completion.
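Something along these lines; the image path, the secret name and the scraper’s command-line flag are placeholders standing in for the real ones:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: scraper
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: scraper
        image: gcr.io/my-project-id/scraper:latest
        # Hypothetical flag: stop as soon as we hit a page that is already in Datastore.
        args: ["--stop-on-existing"]
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /secrets/datastore/key.json
        volumeMounts:
        - name: datastore-credentials
          mountPath: /secrets/datastore
          readOnly: true
      volumes:
      - name: datastore-credentials
        secret:
          secretName: datastore-credentials  # created earlier with kubectl create secret
```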

You can see that we import the Docker image we just built, set a few requirements in terms of CPU and RAM (I found that the defaults were often overkill, so I scaled those down a bit), pass our credentials via a secret created previously, and run with custom args to stop whenever we notice a page that has already been scraped.

With that done, we’re all set up to actually do something with the data!

End of part 1

I hope this was entertaining and useful, it definitely was for me!

For now this has cost us nothing; if you created the cluster with an f1-micro instance you’re still in the free tier… That is going to change soon, however! Stay tuned for part 2, where we’ll go through the process of transcribing the audio files on demand and enabling full-text search indexing on the results!

As always, the code is free and open source:
