Searching the College de France — part 2

Timothé Faudot
10 min read · Sep 3, 2017

This is part 2 of a series; if you haven’t read part 1, you can find it here.

Quick recap: we have a scraper that collects lesson titles, lecturer names, audio file links, dates, language, and audio duration from pages on the College de France website. Now we can go over that data and try to do something with it.

Audio transcription via Google Speech to Text API

Google offers a speech-to-text API as part of its Cloud Platform, and we’re going to use it to transcribe lessons from the College de France.

It works in one of two ways depending on the length of the audio. For very short audio files (less than one minute) you can stream the content and receive the transcription synchronously. That obviously doesn’t work for us, so we’ll have to use the asynchronous API, which is still quite fast (as we’ll see) and allows up to 3 hours of audio to be transcribed.

Setting up the clients

We’ll mostly use Go for the rest of this project, so we can set up a Speech client via the official client libraries.
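Here is a minimal sketch of that setup, assuming the official cloud.google.com/go client libraries (error handling shortened):

package main

import (
	"context"
	"log"

	speech "cloud.google.com/go/speech/apiv1"
	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()

	// Client for the asynchronous (long-running) recognition API.
	speechClient, err := speech.NewClient(ctx)
	if err != nil {
		log.Fatalf("creating speech client: %v", err)
	}
	defer speechClient.Close()

	// Client for Cloud Storage, where the FLAC files will live.
	storageClient, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("creating storage client: %v", err)
	}
	defer storageClient.Close()
}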

We also need to create a storage bucket to hold the audio files to be transcribed. This is necessary because the asynchronous API only accepts files stored in Google’s storage system (like Amazon’s S3). We’ll use a regional bucket to save money, and because our downloads will only be coming from Europe, let’s put it in Europe:

Storage bucket with a name auto-generated by Google
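For reference, an equivalent regional bucket can be created from the command line with gsutil; the europe-west1 location here is an assumption (mine was created through the Cloud Console):

gsutil mb -c regional -l europe-west1 gs://healthy-cycle-9484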

Converting the files

The audio files also need to be in FLAC format, so we’ll first have to convert the MP3s offered by the College de France. This can be done easily with the sox command-line utility, and we can even split the files if needed to respect the 3-hour limit imposed by the Speech API.

# Install on OSX for local dev:
brew install sox --with-lame --with-flac
# Or on a distro with apt-get:
sudo apt-get install sox libsox-fmt-mp3
# Then we can run this command:
sox \
antoine-compagnon.2017-06-20-17-45-00-a-fr.mp3 \
antoine-compagnon.2017-06-20-17-45-00-a-fr.flac \
channels 1 rate 16k trim 0 10790 : newfile : restart
# The result file(s) are trimmed to <3H max each and are in the right format for the Speech API.

Calling this from Go is easy.
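Here is a sketch in the spirit of the actual code (the full version is on Github); soxPath mirrors the worker’s --sox_path flag, and the trim arguments match the command above:

// A sketch in the spirit of the actual code, not the exact excerpt.
package convert

import (
	"context"
	"fmt"
	"os/exec"
)

// ToFLAC converts an MP3 into mono 16kHz FLAC chunks of at most
// 10790 seconds (just under 3 hours) each, by shelling out to sox.
func ToFLAC(ctx context.Context, soxPath, mp3, flac string) error {
	cmd := exec.CommandContext(ctx, soxPath, mp3, flac,
		"channels", "1", "rate", "16k",
		"trim", "0", "10790", ":", "newfile", ":", "restart")
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("sox failed: %v: %s", err, out)
	}
	return nil
}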

We can now call the Speech API using the converted file’s path in our bucket as an input parameter, specifying French as the language where appropriate, and wait for the results to come in… This is our request:

req := &speechpb.LongRunningRecognizeRequest{
	Config: &speechpb.RecognitionConfig{
		Encoding:        speechpb.RecognitionConfig_FLAC,
		SampleRateHertz: 16000,
		LanguageCode:    language.Make(lang).String(), // Must be a BCP-47 identifier.
		SpeechContexts: []*speechpb.SpeechContext{
			{Phrases: hints},
		},
	},
	Audio: &speechpb.RecognitionAudio{
		AudioSource: &speechpb.RecognitionAudio_Uri{Uri: gcsURI},
	},
}

The Phrases hints are sentences you can give the API to help with the recognition; I used the chaire, lecturer, lesson name, and so on: whatever text I could associate with a given audio file. You can find the whole code on Github.

The transcriptions are returned to us by the Speech API in chunks of roughly one minute each. That’s great, because we can index them in chunks too, and we’ll be able to give some context to users who search for something!
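Continuing the excerpt above (client being the Speech client we set up earlier), the call itself and the walk over the chunked results look roughly like this:

op, err := client.LongRunningRecognize(ctx, req)
if err != nil {
	return err
}
// Wait blocks until the transcription completes.
resp, err := op.Wait(ctx)
if err != nil {
	return err
}
// Each result covers a sequential chunk of the audio; we index
// every chunk separately to give searchers some context.
for _, result := range resp.Results {
	for _, alt := range result.Alternatives {
		fmt.Printf("%q (confidence %.2f)\n", alt.Transcript, alt.Confidence)
	}
}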

Building the image

The Docker image for our worker is not as straightforward as it might seem, because we have to get sox in. We can do this with this Dockerfile:

FROM golang:1.8
WORKDIR /go/src/app
COPY . .
RUN go-wrapper download
RUN go-wrapper install
# Install sox.
RUN apt-get clean && apt-get -y update && apt-get install -y sox libsox-fmt-mp3
# Provide a sensible default run command.
CMD ["go-wrapper", "run", "--project_id=college-de-france", "--bucket=healthy-cycle-9484", "--sox_path=sox", "--elastic_address=http://127.0.0.1:9200"]

Full text indexing using Elasticsearch

We’ll use Elasticsearch for indexing. Why? Because it’s one of the most powerful indexing engines out there, it’s free and open source, it’s used in many contexts, and although I had played a bit with it before, I wanted to learn more about it.

Aside

I know AppEngine also supports some kind of full-text search, and this whole project could perhaps have been built as a single AppEngine application for much less than what this is going to cost me. But I wouldn’t have learned much doing so, and I don’t believe AppEngine’s analyzers match Elasticsearch’s yet; Elasticsearch also has awesome support for French normalization and a simple, easy query language. There is also the process of encoding the audio files to FLAC, which I’m not sure would have been feasible within the classic AppEngine environment; it may be feasible with the flexible environment.

Elasticsearch configuration

Configuration can be done via multiple mechanisms. We’ll be using the official Docker image, so we can either build our own image on top of it with some tuning for our needs, mount a special config YAML for Elasticsearch, provide configuration via environment variables, or pass settings on the start command line. We’ll go for the latter: although the environment-variable method sounds the safest (we wouldn’t touch any default setting the Docker image already provides), the Kubernetes version currently offered on GCP clusters doesn’t support dots in environment variable names, and dots are what Elasticsearch expects… The other options seem overkill; we don’t need a special image, and mounting a config of our own erases all the default config provided, which led me to hours of debugging earlier on…

There are a few important settings, duly noted in the official documentation, and the Docker image with our custom settings works nicely. I want to be cheap, so I’ll run a single node; the cluster will be yellow, not green, but whatever, that’s fine for our corpus size. We’ll pick a node that’s not preemptible because we want it always running, we’ll recreate the instance on update because we don’t have the resources for a rolling update, and we’ll disable username/password authentication via X-Pack because our instance is only exposed to our cluster, not the internet.
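Putting that together, here is a minimal sketch of the deployment (the image tag and volume names are assumptions; the cluster name esdb matches the health output shown later):

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: elasticsearch
spec:
  replicas: 1
  strategy:
    type: Recreate  # Not enough resources for a rolling update.
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        # Assumed tag; any official 5.x image behaves the same here.
        image: docker.elastic.co/elasticsearch/elasticsearch:5.5.2
        # Settings passed on the command line, since this Kubernetes
        # version rejects env var names containing dots.
        args: ["-Ecluster.name=esdb", "-Expack.security.enabled=false"]
        ports:
        - containerPort: 9200
        resources:
          requests:
            memory: "2.5Gi"  # The default JVM settings want ~2Gi.
        volumeMounts:
        - name: es-data
          mountPath: /usr/share/elasticsearch/data
      volumes:
      - name: es-data
        gcePersistentDisk:
          pdName: es-disk  # The 30GB disk created below.
          fsType: ext4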

The free tier for disks is 30GB, so I created a disk using:

gcloud compute disks create --size 30GB --type pd-standard es-disk

Having this disk mounted like that means we can only run one Elasticsearch instance, because gcePersistentDisks can only be mounted read/write by one instance at a time. This is totally fine for now: as I said, the size of our corpus and the ratio of writes to reads are so low that we can live with a single master node for a while.

Note that I also request 2.5Gi of RAM, because the default Java settings for this image request 2Gi, so I gave it a little buffer; my cluster currently has more RAM available than CPU…

Setting up the index

After we deploy our instance, we have to configure it, so let’s get its pod identifier and get a bash shell in it:

$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
elasticsearch-1333184046-x22zq   1/1       Running   0          2d
worker-3127522000-9ndt9          1/1       Running   0          2d
$ kubectl exec -it elasticsearch-1333184046-x22zq -- /bin/bash
[elasticsearch@elasticsearch-1333184046-x22zq ~]$ whoami
elasticsearch

You can see this runs as the elasticsearch user, which has access to the DB as-is because we disabled X-Pack. If we had kept it enabled, we could have changed the password here by doing:

curl -XPUT -u elastic 'localhost:9200/_xpack/security/user/elastic/_password' -H "Content-Type: application/json" -d '{
  "password" : "elaspass"
}'

But for now, we need to create our index:

curl -XPUT "http://localhost:9200/course" -H 'Content-Type: application/json' -d'
{
"mappings": {
"section": {
"properties": {
"transcript": {
"type": "text",
"analyzer": "french"
}
}
}
}
}'

Our index is named “course” and holds “section” documents, which contain a subset of what’s available in Datastore (chaire name, lesson title, lecturer, etc.) for a quick preview without having to do two roundtrips. Most importantly, the mapping sets the transcript field’s type to “text”, which enables full-text indexing and tokenization with the French analyzer; the analyzer takes care of indexing only what’s meaningful in French as well as normalizing the input (removing plurals, accents, etc.).
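You can see the French analyzer at work via the _analyze API, for example:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "analyzer": "french",
  "text": "les leçons inaugurales"
}'

The response contains roughly the tokens leçon and inaugural: the stopword les is dropped and the plurals are stemmed away.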

We can check the health of our cluster; as I said, it’s going to be yellow because we only have one node:

[elasticsearch@elasticsearch-1333184046-x22zq ~]$ curl -XGET 'localhost:9200/_cluster/health?pretty'
{
  "cluster_name" : "esdb",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 25,
  "active_shards" : 25,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 25,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}

And we can bulk index some documents:

curl -H "Content-Type: application/json" -XPOST 'localhost:9200/_bulk?pretty' --data-binary "@trans.json"
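Each document in trans.json is preceded by an action line, following Elasticsearch’s bulk format; a hypothetical entry (values elided) looks like this:

{ "index" : { "_index" : "course", "_type" : "section" } }
{ "title" : "…", "lecturer" : "…", "lang" : "fr", "transcript" : "…" }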

Then we can test a search query!

[elasticsearch@elasticsearch-1333184046-x22zq ~]$ curl -XGET "http://localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
> {
>   "query": {
>     "simple_query_string" : {
>       "query": "merci",
>       "fields": ["transcript"],
>       "analyzer": "french"
>     }
>   }
> }'
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 25,
    "successful" : 25,
    "failed" : 0
  },
  "hits" : {
    "total" : 6,
    "max_score" : 4.1156282,
    "hits" : [
      {
        "_index" : "course",
        "_type" : "transcript",
        "_id" : "AV5GKAw7ebtkbxwtBhZ1",
        "_score" : 3.487345,
        "_source" : {
          "title" : "Recherche fondamentale, Inventions et Innovations",
          "lecturer" : "Didier Roux",
          "function" : "Membre de l'Institut, Vice-président recherche, développement et innovation, Saint-Gobain",
          "lesson_type" : "Leçon inaugurale",
          "type_title" : "Recherche fondamentale, Inventions et Innovations",
          "chaire" : "Innovation technologique Liliane Bettencourt (2016-2017)",
          "lang" : "fr",
          "source_url" : "http://www.college-de-france.fr/site/didier-roux/inaugural-lecture-2017-03-02-18h00.htm",
          "Serial" : 10,
          "transcript" : " une telle activité ne vas pas sur certains signes de reconnaissance institutionnelle hauteur de plus de 150 à t'expliquer de 14 brevets c'est rien de nombreux prix et distinctions dans le Grand Prix IBM matériaux Le Grand Prix de l'Académie des Sciences mairie bourg d'Aix ce dernier prix c'est souvent le cas a précédé votre élection à l'Académie des Sciences ce que vous Coiff d'une double casquette et aussi à l'Académie l'âge de piste de l'espéron il est aujourd'hui le 11e titulaire de la chaire d'innovation technologique liliane bettencourt du Collège de France statistique je vais maintenant vous laisser la parole pour une leçon inaugurale dans l'intitulé recherche fondamentale inventions et innovations recouvre parfaitement le champ étendue de vos activités merci Alain merci beaucoup"
        }
      }
...

We’ll use simple_query_string because it’s a robust query mechanism (as in “never fails”, by design) that is safe to expose to frontends, and we’ll analyze the input query with the same French analyzer we used to index the documents.

You can see that the transcription excerpt here is quite long and fails on some proper nouns, but other than that it’s quite successful :-)

Connecting the dots

So we have our worker and an Elasticsearch instance running in the same cluster; how does the worker communicate with the Elasticsearch instance? Let’s explain that!

After applying the kubernetes deployment, we have to “expose” our elasticsearch service to the nodes in the cluster:

kubectl expose deployment/elasticsearch

Then we can check that we have an elasticsearch service and it points to an actual endpoint:

$ kubectl get services elasticsearch
NAME            CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
elasticsearch   10.43.252.70   <none>        9200/TCP   12d
$ kubectl get endpoints elasticsearch
NAME            ENDPOINTS         AGE
elasticsearch   10.40.1.73:9200   12d

Now, when deploying our worker, we have two ways of knowing where the elasticsearch endpoints are:

  • Via environment variables:
$ kubectl get pods
NAME                             READY     STATUS    RESTARTS   AGE
elasticsearch-1333184046-x22zq   1/1       Running   0          2d
worker-3127522000-9ndt9          1/1       Running   0          2d
$ kubectl exec -it worker-3127522000-9ndt9 -- printenv | grep ELASTICSEARCH_SERVICE
ELASTICSEARCH_SERVICE_PORT=9200
ELASTICSEARCH_SERVICE_HOST=10.43.252.70
  • Or simply via DNS: GCP automatically spawns a kube-dns service within your cluster that takes care of exposing services to the nodes within it. For example, our elasticsearch service can be reached at http://elasticsearch.default.svc.cluster.local:9200 from within the nodes, or even simpler, at http://elasticsearch:9200, because both the worker and elasticsearch live in the default namespace. What we need, however, is to specify that we want to get hosts from DNS when deploying our worker, which is done by setting the following environment variable in the deployment spec:
env:
- name: GET_HOSTS_FROM
value: dns

And then we can just specify our arguments to launch the worker:

args: ["run", "--project_id=college-de-france", "--bucket=healthy-cycle-9484", "--sox_path=sox", "--elastic_address=http://elasticsearch:9200"]

and it works!

As a best practice, I created a separate service account for the worker, since it accesses the Datastore in read/write mode but also the storage bucket to save audio files and text transcripts.

Final note on worker and composition

One of the reasons I love Go is how easy it makes composition: it doesn’t have classes to start with, and it forces you to think in terms of small interfaces that do one thing and do it well, à la UNIX, which you then compose into structs that actually do the job you want.

Here my worker is just:

type Worker struct {
	uploader    upload.FileUploader
	transcriber transcribe.Transcriber
	broker      money.Broker
	picker      pick.Picker
	indexer     indexer.Indexer
	soxPath     string
	httpClient  *http.Client
	health      health.Checker
}

The uploader handles storing audio files in cloud storage, the transcriber handles the audio transcription, the broker handles the money checks (we’ll come to that in a later story), the picker picks whatever is scheduled to be transcribed, the indexer handles the indexing, and health checks the healthiness of Elasticsearch before we attempt a heavy/costly operation, etc.

Six of the nine members are interfaces with one or two methods each, and at no point do they refer to a specific implementation, which means I could switch to AWS just by swapping the actual implementation of any of them. Each lives in its own package, isolated from the main program, and can be unit tested independently. I feel safe when I look at code like that: Go makes it easy to refactor things and swap interface implementations when needed, I don’t have to care about formatting my code as it’s done automatically, imports are handled too, and overall it just feels really nice :-)
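As an illustration, one of those members boils down to something like this hypothetical sketch (the real definitions live in their own packages):

// Hypothetical sketch: the worker only sees this one method, so the
// Elasticsearch-backed implementation can be swapped without touching it.
type Indexer interface {
	Index(ctx context.Context, sections []Section) error
}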

I’m saying all this because the next part of this series will be about the frontend! We’ll be building it in Node (because it’s 2017 and nothing else seems to exist anymore) + Angular (because I knew 1.0 and wanted to learn the new version) + Go (for the reasons listed above).

So stay tuned! I might rant a bit in the next part, because coming from the backend the frontend can be daunting, but we won’t rage-quit just yet, and we’ll see some really nice things too!

As always, the code is available on Github.
