Text Analytics in Cloud — Part 1

Abhijeet Pokhriyal
Published in Analytics Vidhya
Apr 16, 2020

Using Google Cloud Functions to fetch data into Cloud Storage

As part of a recent project, I was working with text, specifically religious texts.

We were curious about a few questions:

  • Distinctions between pre-Abrahamic and Abrahamic religions
  • Geography and demographics by religion
  • What do different religious books agree and disagree on?
  • How themes, tone, language, and concepts in the texts evolve over time
  • Whether text analytics and word vectors can be used to search for similar ideas in different books and to compare and contrast them

We decided to pull books from Project Gutenberg, load them into Google Cloud Storage, use Apache Beam with Cloud Dataflow to preprocess and annotate the text, and then analyse it using Python.

Architecture

You can follow along with the code on GitHub.

This article focuses on the leftmost part of the diagram above: getting the data. In later articles we will continue down the pipeline.

Getting the data

The first step in the whole process is to get the data. For this we created a Python script that uses BeautifulSoup and requests to download books from Project Gutenberg and save them to Cloud Storage.
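A minimal sketch of what such a script can look like is below. The helper names (find_book_links, upload_text, fetch_books), the Gutenberg URL patterns, and the CSS selectors are illustrative assumptions rather than the project's actual code.

    # requirements: requests, beautifulsoup4, google-cloud-storage
    import requests
    from bs4 import BeautifulSoup
    from google.cloud import storage

    GUTENBERG = "https://www.gutenberg.org"

    def find_book_links(query, max_books=5):
        """Search Project Gutenberg and return plain-text download links (assumed URL/layout)."""
        resp = requests.get(f"{GUTENBERG}/ebooks/search/", params={"query": query})
        soup = BeautifulSoup(resp.text, "html.parser")
        links = []
        for a in soup.select("li.booklink a.link")[:max_books]:
            book_id = a["href"].split("/")[-1]
            links.append(f"{GUTENBERG}/files/{book_id}/{book_id}-0.txt")
        return links

    def upload_text(bucket_name, folder, name, text):
        """Write the book text to gs://<bucket>/<folder>/<name>.txt."""
        client = storage.Client()
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(f"{folder}/{name}.txt")
        blob.upload_from_string(text, content_type="text/plain")

    def fetch_books(query, bucket_name, folder):
        """Download every matching book and push it to Cloud Storage."""
        for url in find_book_links(query):
            text = requests.get(url).text
            name = url.split("/")[-1].replace(".txt", "")
            upload_text(bucket_name, folder, name, text)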

The important piece here is how to interact with Google Cloud.

Google Cloud expects your Cloud Function to be packaged as follows:

  1. There should be a main.py file with the actual code.
  2. There should be a requirements.txt file listing all the dependencies.

Then you can zip up the folder and provide that archive to Cloud Functions.
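A typical source layout for such a function might look like this; the folder name and the exact dependency list are assumptions based on the libraries mentioned above, not the repo's actual files.

    data_pull_cloud_func/
        main.py             # getbooks entry point and helper code
        requirements.txt    # dependencies installed by the Cloud Functions runtime

    # requirements.txt
    requests
    beautifulsoup4
    google-cloud-storage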

For our use case, instead of doing this manually, we leveraged Terraform.

To create the zip file we used a null_resource that, behind the scenes, runs PowerShell commands to zip the folder.
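A rough sketch of such a resource is below; the paths and names are placeholders, and Terraform's archive_file data source would be a more portable alternative to shelling out to PowerShell.

    resource "null_resource" "zip_function_source" {
      # Re-zip whenever the function source changes
      triggers = {
        main_py = filemd5("${path.module}/cloud_function/main.py")
      }

      # Compress the source folder into function.zip using PowerShell
      provisioner "local-exec" {
        interpreter = ["PowerShell", "-Command"]
        command     = "Compress-Archive -Force -Path ${path.module}/cloud_function/* -DestinationPath ${path.module}/function.zip"
      }
    }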

Two further resources complete the deployment: the first uploads the .zip file to Cloud Storage, and the second creates a Cloud Function from the uploaded archive.
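Sketched in Terraform, those two resources might look roughly like this; the referenced bucket google_storage_bucket.code_bucket, the runtime, and the resource names are assumptions, not the project's exact configuration.

    resource "google_storage_bucket_object" "function_archive" {
      name       = "function.zip"
      bucket     = google_storage_bucket.code_bucket.name
      source     = "${path.module}/function.zip"
      depends_on = [null_resource.zip_function_source]
    }

    resource "google_cloudfunctions_function" "data_pull_cloud_func" {
      name                  = "data_pull_cloud_func"
      runtime               = "python37"
      trigger_http          = true
      source_archive_bucket = google_storage_bucket.code_bucket.name
      source_archive_object = google_storage_bucket_object.function_archive.name

      # Function inside main.py that Cloud Functions invokes
      entry_point = "getbooks"
    }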

Notice that in the second resource we specify the entry_point.

This entry point is the function in main.py that Cloud Functions invokes.

The actual function is in main.py

The getbooks function takes the request as input, parses the arguments, and then invokes custom code to fetch the books and upload them to Google Cloud Storage.
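A simplified sketch of such an entry point is below; fetch_books is the hypothetical helper sketched earlier, and the real main.py in the repo will differ in detail.

    # main.py (simplified)
    from books import fetch_books   # hypothetical helper module that scrapes and uploads books

    def getbooks(request):
        """HTTP entry point; Cloud Functions passes in a Flask request object."""
        query = request.args.get("query")
        bucket = request.args.get("bucket")
        folder = request.args.get("folder", "download")

        if not query or not bucket:
            return ("Missing required parameters: query and bucket", 400)

        fetch_books(query, bucket, folder)
        return f"Fetched books for '{query}' into gs://{bucket}/{folder}/"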

Once the function is created, we can invoke it via its URL as follows:

https://URL_TO_YOUR_CLOUD_FUNCTION.cloudfunctions.net/data_pull_cloud_func?query=hinduism&bucket=dsba6155pdatabucket&folder=download

query, bucket, and folder are the parameters expected by the function we defined above.
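For example, calling it from Python; the URL placeholder is kept from above, and if the function is not publicly invokable the request would also need an Authorization header with an identity token.

    import requests

    resp = requests.get(
        "https://URL_TO_YOUR_CLOUD_FUNCTION.cloudfunctions.net/data_pull_cloud_func",
        params={
            "query": "hinduism",
            "bucket": "dsba6155pdatabucket",
            "folder": "download",
        },
    )
    print(resp.status_code, resp.text)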

Caveats

A couple of gotchas to be aware of:

  1. Cloud Functions have a read-only file system, so you can't write files just anywhere; writes have to go to the /tmp/ folder (see the sketch below).

  2. For authentication with Google Cloud you should set the GOOGLE_APPLICATION_CREDENTIALS environment variable, pointing to a service account key file (see the Google Cloud documentation on creating a service account).
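A small sketch of both points, with a placeholder file name and key path:

    import os

    # 1. Only /tmp is writable inside a Cloud Function
    local_path = os.path.join("/tmp", "book.txt")
    with open(local_path, "w") as f:
        f.write("downloaded text goes here")

    # 2. When running the code outside Google Cloud (e.g. locally or via Terraform),
    #    point the client libraries at a service account key file before creating a client.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"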


Abhijeet Pokhriyal

School of Data Science @ University of North Carolina — Charlotte