Text Analytics in Cloud — Part 1
Using Google Cloud Functions to fetch data to cloud storage
As part of a recent project, I was working with text, specifically religious text.
We were curious about a few questions:
- Distinctions between pre-Abrahamic and Abrahamic religions
- Geography/Demographics by religion
- What do different religious books agree/disagree on?
- Evolution of themes in the texts over time: changes in tone, language, and concepts
- Leveraging text analytics and word vectors to search for similar ideas across different books, and comparing and contrasting them
We decided to pull books from Project Gutenberg, load them into Google Cloud Storage, use Apache Beam with Cloud Dataflow to preprocess and annotate the text, and then analyse it using Python.
You can follow the code on GitHub.
This article focuses on the leftmost part of the above diagram: getting the data. In later articles we will continue down the pipeline.
Getting the data
The first step in the whole process is to get the data. For this we created a Python script that uses BeautifulSoup and requests to download books from Project Gutenberg and save them to Cloud Storage.
The important piece here is how to interact with Google Cloud.
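As a rough illustration of the fetch side, the search-and-scrape logic might look like the sketch below. The search URL pattern and CSS selectors are assumptions, not the original implementation:

```python
# A minimal sketch of the fetch logic; the URL pattern and
# CSS selectors are assumptions, not the article's actual code.
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.gutenberg.org/ebooks/search/?query={}"

def extract_book_links(html):
    """Parse search-result HTML and return the /ebooks/<id> links."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("li.booklink a.link") if a.get("href")]

def search_books(query):
    """Fetch the search page for `query` and return the matching book links."""
    resp = requests.get(SEARCH_URL.format(query))
    resp.raise_for_status()
    return extract_book_links(resp.text)
```

Each returned link can then be followed to download the plain-text edition of the book.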
Google Cloud expects your Cloud Function to be packaged as follows:
- There should be a main.py file — with the actual code
- There should be a requirements.txt with all the dependencies specified.
Then you can zip up the folder and provide that archive to Cloud Functions.
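The resulting package might look like this (folder name and dependency list are assumed for illustration):

```text
my-function/
├── main.py          # the actual code, including the entry-point function
└── requirements.txt # e.g. requests, beautifulsoup4, google-cloud-storage
```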
For our use case, instead of doing this manually, we leveraged Terraform.
To create the zip file we used a null_resource that, behind the scenes, runs PowerShell commands to zip the folder.
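A minimal sketch of such a null_resource, assuming the function code sits in `./src` and the output archive is `./function.zip` (both names hypothetical):

```hcl
# Sketch only: paths and trigger are assumptions, not the original config
resource "null_resource" "zip_function" {
  triggers = {
    always_run = timestamp() # re-zip on every apply
  }
  provisioner "local-exec" {
    interpreter = ["PowerShell", "-Command"]
    command     = "Compress-Archive -Path ./src/* -DestinationPath ./function.zip -Force"
  }
}
```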
In the second image you can see that the first resource uploads the .zip file to Cloud Storage and the second resource creates a Cloud Function from the uploaded zip file.
Notice that in the second resource we specify the entry_point.
This entry point is the function in main.py that Cloud Functions invokes.
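The pair of resources might be sketched as below; the bucket name, runtime version, and archive path are assumptions for illustration:

```hcl
# Sketch only: names and runtime are assumed, not the original config
resource "google_storage_bucket_object" "function_zip" {
  name   = "function.zip"
  bucket = "my-functions-bucket" # assumed bucket name
  source = "./function.zip"
}

resource "google_cloudfunctions_function" "getbooks" {
  name                  = "getbooks"
  runtime               = "python39" # assumed runtime version
  source_archive_bucket = google_storage_bucket_object.function_zip.bucket
  source_archive_object = google_storage_bucket_object.function_zip.name
  trigger_http          = true
  entry_point           = "getbooks" # the function inside main.py
}
```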
The actual function is in main.py
The getbooks function takes the HTTP request as input, parses the arguments, and then invokes custom code to fetch the files and upload them to Google Cloud Storage.
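In outline, main.py might look like the following; the helper bodies are assumptions, not the original implementation:

```python
# main.py -- a minimal sketch of the entry point; helper bodies are
# assumptions, not the article's actual code.

def make_blob_name(folder, filename):
    """Build the destination object name inside the bucket."""
    return f"{folder.rstrip('/')}/{filename}" if folder else filename

def upload_file(bucket_name, blob_name, local_path):
    """Upload a local file (written under /tmp/) to Cloud Storage."""
    from google.cloud import storage  # imported lazily so the sketch reads without the SDK
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)

def getbooks(request):
    """HTTP entry point named by entry_point in the Terraform resource."""
    query = request.args.get("query")
    bucket = request.args.get("bucket")
    folder = request.args.get("folder", "")
    # fetch matching books from Project Gutenberg into /tmp/, then upload
    # each one with upload_file(bucket, make_blob_name(folder, name), path)
    return f"fetched books for query={query!r}"
```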
Once the function is created, we can invoke it via its HTTP URL.
query, bucket and folder are the parameters that the function defined above expects.
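As an illustration (the function URL below is hypothetical), the trigger URL can be assembled from those three parameters:

```python
# Hypothetical region/project in the URL; query, bucket and folder
# match the parameters the function expects.
from urllib.parse import urlencode

BASE_URL = "https://us-central1-my-project.cloudfunctions.net/getbooks"

def build_invocation_url(query, bucket, folder):
    """Assemble the GET URL that triggers the Cloud Function."""
    return f"{BASE_URL}?{urlencode({'query': query, 'bucket': bucket, 'folder': folder})}"

# requests.get(build_invocation_url("bible", "my-books-bucket", "raw")) would trigger the fetch
```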
A couple of gotchas to be aware of:
1. Cloud Functions have a read-only file system, so you can't write files just anywhere; writes have to go to the /tmp/ folder.
2. For authentication with Google Cloud, you should set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the service account key file.
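The first gotcha shapes how downloads are written. A minimal sketch (the helper name is hypothetical):

```python
import os

# Cloud Functions' file system is read-only except /tmp/, so all
# downloaded books must be written there before uploading to GCS.
def save_book(filename, text):
    path = os.path.join("/tmp", filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path
```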