Searching the College de France — part 1.5

Timothé Faudot
2 min read · Sep 1, 2017


This is a follow-up to my first post of this series, after I realized I forgot to mention a few interesting things. So, without further ado:

Periodic scraping

I mentioned that I wanted to periodically scrape the College de France website but didn’t say how. That’s because it isn’t set up for that yet: right now I run the job manually whenever I feel like it.

Kubernetes supports cron jobs, and in fact I have the scraper CronJob checked into the Github repository, ready to be used. To use it on Google Container Engine, however, you need to bring up the Kubernetes API server with alpha features turned on. While Google’s betas are pretty solid (they are turned on by default, which is a good sign), alpha features are definitely not, and I don’t want all the other alpha features I don’t care about messing up my cluster, so I didn’t do it.
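For reference, a CronJob manifest along these lines would look like the sketch below. The name, schedule, and image are placeholders of my own, not the actual contents of the repository; note that at the time of writing, CronJob lived in the alpha API group, which is exactly why it needs the alpha features enabled.

```yaml
# Minimal CronJob sketch — names, schedule, and image are hypothetical.
apiVersion: batch/v2alpha1   # CronJob was still an alpha API in 2017
kind: CronJob
metadata:
  name: cdf-scraper
spec:
  schedule: "0 3 * * 0"      # e.g. every Sunday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: scraper
            image: gcr.io/my-project/cdf-scraper:latest
          restartPolicy: OnFailure
```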

Robots.txt

A friend mentioned that I should have followed the robots.txt directives, if any. I thought there were none because this URL gave a 404, but I had just missed the www. part… We’ll come back to that later when we set up our own DNS and a CNAME for the www. subdomain, which is really what the College de France should have done there.

In the meantime I used the robotparser module from the Python standard library and added it to my scraper image, so this is now done.
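The standard-library parser is simple to use. Here is a minimal sketch (in Python 3 the module is `urllib.robotparser`; the robots.txt content and user-agent string below are made up for illustration — in the real scraper you would point it at the site with `set_url()` and `read()` instead of `parse()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, fed in directly so the example
# runs without network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check each URL before fetching it in the scraper loop.
print(rp.can_fetch("my-scraper", "https://www.example.com/site/page"))  # True
print(rp.can_fetch("my-scraper", "https://www.example.com/private/x"))  # False
```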

I will now start writing the second part of this series, focusing on audio transcription and full-text indexing. Stay tuned!
