Build a financial newsletter with Deep Learning and AWS — Part 2

Adria M.
7 min read · Jan 20, 2019


This is a continuation of my previous post, where I outlined the logic behind the news classifier. This second part covers the deployment of the application on AWS (some familiarity with AWS services is useful to follow along).

Newsletter application structure on AWS.

The picture above shows the different AWS services used to deploy the application:

  • CloudWatch: for monitoring and scheduling the crawling process.
  • Lambda: a handler function that dispatches the crawlers.
  • S3: for storing scraped article URLs, as well as the trained tokenizers and classifiers used to predict which articles are candidates for the newsletter.
  • EC2: for downloading the content of the articles (title, summary, text, keywords, …), predicting and clustering candidate articles, and building and sending the newsletter.
  • RDS: for storing articles included in the newsletter.

Notes: the original idea was to run the whole application on a serverless framework but, by the time this was put into production, Lambda’s time, size and computational limits made it impossible to run the classifiers there. The second-best solution was therefore to start and stop an EC2 instance to run the resource-intensive jobs. Additionally, I opted for RDS because there were no budget constraints, but given that only a handful of articles are stored every day, a cheaper option such as S3 would also have worked.

Now that some time has passed and I can see it in perspective, I would perhaps have approached the application differently: the design is a bit monolithic and could have benefited from a microservices structure, for example splitting it into separate services for crawling, classification, email building, and so on.

With this in mind, let’s dive into the details of the current deployment. There are two main Python projects (hosted on different AWS services):

  1. Lambda function: Crawlers
  2. EC2: Articles Download + Classification + Text Similarity + Email

These are the main components of the application in more detail:

Cron rule in CloudWatch

A cron rule in CloudWatch is responsible for waking up the crawling process every day. The cron expression (30 07 ? * MON-FRI *) triggers the connected Lambda function every weekday at 7:30am GMT. The call carries a small payload, a key-value pair that tells the Lambda handler which crawler to launch first. In this case, it is set to: { "newspaper" : "cincodias" } .
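For reference, this is roughly how such a rule and its target could be defined with boto3 (the rule name, function name, region and ARN below are placeholders, not the ones used in the project):

import json
import boto3

events = boto3.client('events', region_name='eu-west-1')  # region is illustrative

# Schedule: every weekday at 7:30 GMT
events.put_rule(
    Name='newsletter-crawl-schedule',
    ScheduleExpression='cron(30 07 ? * MON-FRI *)',
    State='ENABLED',
)

# Attach the Lambda function as the target, with the constant key-value payload
events.put_targets(
    Rule='newsletter-crawl-schedule',
    Targets=[{
        'Id': 'crawler-dispatcher',
        'Arn': 'arn:aws:lambda:eu-west-1:123456789012:function:crawler-dispatcher',
        'Input': json.dumps({'newspaper': 'cincodias'}),
    }],
)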

Lambda — Crawlers

The focus was placed on refactoring the original crawlers to be as lightweight as possible so that they could run on Lambda. Although they were originally written with Scrapy, I ended up using requests and BeautifulSoup for the same purpose.

A Lambda handler function is in charge of dispatching the different crawlers. It receives the payload from the cron rule and runs the first crawler (within the same Lambda invocation). Once that crawler has scraped all article URLs from its newspaper, the URLs are saved to a specific S3 bucket directory and the Lambda execution ends. The bucket emits an event back to the same Lambda every time a CSV is stored there. This is how recurrence is introduced into the process: the Lambda dispatches a crawler, URLs are stored in S3, S3 sends a callback to the Lambda specifying which URLs were stored, and the dispatcher launches the next crawler.

S3 events property (launches the Lambda when a file ending in urls.csv is put into the bucket).
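The same trigger can also be set up programmatically; a minimal boto3 sketch (bucket name and Lambda ARN are placeholders) would look like this:

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda whenever an object whose key ends in "urls.csv" is put into the bucket
s3.put_bucket_notification_configuration(
    Bucket='newsletter-crawlers',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:eu-west-1:123456789012:function:crawler-dispatcher',
            'Events': ['s3:ObjectCreated:Put'],
            'Filter': {
                'Key': {'FilterRules': [{'Name': 'suffix', 'Value': 'urls.csv'}]}
            },
        }]
    },
)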

In order to fully understand how to generate and ship the function and its dependencies to Lambda, I wrote the Python project myself and created a Makefile that is called inside a Docker container to install dependencies, test, and package the application together with its dependencies in a Lambda-compatible environment. If you want to make your life easier, you can use tools like Zappa that do all of this under the hood. Again, all I wanted was to “get my hands dirty” and learn more along the way.
You can see the details of the code here.

When the last crawler has finished its work, an S3 event re-executes the Lambda handler function and, this time, the Lambda starts the EC2 instance where all the other processes are executed.

The structure of the Lambda handler is written below:

from __future__ import print_function
import boto3
import datetime
from crawlers.cincodias import parse_cincodias
from crawlers.elconfidencial import parse_elconfidencial
from crawlers.eleconomista import parse_eleconomista
from crawlers.expansion import parse_expansion

# REGION, INSTANCE_ID and BUCKET_NAME are module-level constants defined elsewhere (not shown)


def lambda_handler(event, context):
    """Lambda handler for the scraping / downloading process"""
    ec2 = boto3.resource('ec2', region_name=REGION)
    instance = ec2.Instance(INSTANCE_ID)
    crawl_date = datetime.datetime.now()

    # Figure out which S3 object (if any) triggered this execution
    try:
        trigger_file = event['Records'][0]['s3']['object']['key']
    except (KeyError, IndexError):
        trigger_file = ''

    if trigger_file:
        # Each crawler's output in S3 triggers the next crawler in the chain
        if 'cincodias' in trigger_file:
            parse_elconfidencial(crawl_date, BUCKET_NAME)
        elif 'elconfidencial' in trigger_file:
            parse_eleconomista(crawl_date, BUCKET_NAME)
        elif 'eleconomista' in trigger_file:
            parse_expansion(crawl_date, BUCKET_NAME)
        elif 'expansion' in trigger_file:
            # Last crawler is done: start the EC2 instance for the heavy jobs
            response = instance.start()
    else:
        # Triggered by the cron rule: launch the first crawler
        parse_cincodias(crawl_date, BUCKET_NAME)

EC2 — Articles download, classification, text similarity and email

When the EC2 instance starts, a cron job (@reboot) executes the second main Python project (more precisely, it runs the following shell script).

#!/bin/bash

# Printing date
now="$(date)"
echo "--> Starting $now"

# Running application
cd /home/ubuntu/ec2_model
stdbuf -o0 /home/ubuntu/ec2_model/env/bin/python /home/ubuntu/ec2_model/main.py > log

The main.py script controls the execution of each step before the newsletter is sent. These are the main steps:

  • Download all articles from the newspapers: the crawlers have already parsed the newspapers of interest and stored the URLs of all articles from the last three days in S3 (this is necessary to capture relevant articles that might have been published over the weekend). Here I used a really convenient library (newspaper3k) to standardize the content downloaded from these different sources. Feel free to check the details in my previous post.
  • Classify articles: pre-trained tokenizers and classifiers (stored in S3) are used to predict a score for each article. Three features are considered separately (the title, summary and text of each article), with four different algorithms per feature (CNN, LSTM, Bidirectional LSTM and CNN+LSTM) and cross-validation across folds. The prediction loop looks roughly like this:
for feature in ['summary', 'text', 'title']:
    for model_name in ['CNN', 'LSTM', 'BiLSTM', 'CNNLSTM']:
        fold_scores = []
        for fold in FOLDS:
            model_path = f'models/{model_name}{feature}{fold}.h5'
            # download the model file from S3 and load the Keras classifier
            # (load_model_from_s3 and features are illustrative names)
            clf = load_model_from_s3(BUCKET_NAME, model_path)
            fold_scores.append(clf.predict(features[feature]))
        # average the scores across folds for this feature/model pair
# stack the averaged predictions of all feature/model pairs
# use xgboost on the stacked predictions to get the final score per article
  • Select candidate articles: articles with a score above 0.5 become candidates for the newsletter (some additional filtering could be implemented here).
  • Save articles to RDS: the candidates are saved with a date stamp in RDS, and a check is made for duplicates that were already included in a previous newsletter. If there are any, they are excluded from the current selection of candidate articles.
  • Text similarity to cluster articles: several high-scoring articles may cover the same piece of news in different newspapers. Ideally, these groups of articles with similar content are identified so they can be displayed together in the newsletter.
    To enable this, a text similarity step evaluates how similar all candidate articles are to each other. I used the gensim library for this purpose: implementing a tf-idf model is pretty straightforward and the results are quite accurate. The output is a similarity matrix that is used to cluster similar articles together based on a similarity measure (100% would mean two articles are exactly identical). After some investigation, a 30% threshold yielded good results in the majority of cases (a dynamic threshold or other algorithms could be used to refine this step); a small sketch of this step follows the examples below.
This article in Cincodías can also be read in Expansion and El Confidencial.
Similar article in Expansion.
Similar article in El Confidencial.
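As an illustration, this is roughly how the similarity matrix can be built with gensim (candidate_articles is a placeholder, and the real code tokenizes the text more carefully):

from gensim import corpora, models, similarities

# Tokenized text of the candidate articles (candidate_articles is illustrative)
documents = [article.text.lower().split() for article in candidate_articles]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# tf-idf model and similarity index over all candidate articles
tfidf = models.TfidfModel(corpus)
index = similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

# Entry (i, j) is the cosine similarity between articles i and j
sim_matrix = index[tfidf[corpus]]

# Group together articles whose similarity exceeds the 30% threshold
THRESHOLD = 0.30
clusters = [
    [j for j, score in enumerate(row) if score > THRESHOLD]
    for row in sim_matrix
]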
  • Sending the email: this is the last step. Once the candidate articles are grouped together and sorted by score (high-scoring articles show up at the top of the newsletter), it is time to generate and send the newsletter. To do so I ended up using the smtplib and email libraries.
    smtplib is used to create a client session object from which the email can be sent to an SMTP server (in this case Gmail’s).
    From email one can import the MIMEMultipart class (among others) to build the structure of the message.
    The email body is created in HTML. The header and footer are fairly static and the only part that needs to be generated is the body, where the articles are rolled out. This part is coded in plain Python, but one could use a templating engine like Jinja instead.
    The list of recipients, in this particular case, is fixed and hardcoded, but it could be dynamic and imported from another AWS service (e.g. a database), connected to a subscription page, etc. A minimal sketch of the sending step is shown below.
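As a rough sketch of that sending step (sender, recipients and credentials are placeholders; in practice the credentials should come from a secure store rather than the source code):

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

SENDER = 'newsletter@example.com'      # placeholder sender address
RECIPIENTS = ['reader@example.com']    # placeholder recipient list

# html_body would be the HTML string built from the scored and clustered articles
html_body = "<html><body><h2>Today's selection</h2></body></html>"

msg = MIMEMultipart('alternative')
msg['Subject'] = 'Financial newsletter'
msg['From'] = SENDER
msg['To'] = ', '.join(RECIPIENTS)
msg.attach(MIMEText(html_body, 'html'))

# Gmail's SMTP endpoint with STARTTLS
with smtplib.SMTP('smtp.gmail.com', 587) as server:
    server.starttls()
    server.login(SENDER, 'app-password')  # placeholder credentials
    server.send_message(msg)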

When the email has been sent, the EC2 instance is stopped until the next run. You can see the details of this part of the code here.
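One way to do that shutdown (a sketch, assuming the instance stops itself with boto3 and that REGION and INSTANCE_ID are available as constants):

import boto3

# Stop this same instance once the newsletter has been sent
ec2 = boto3.resource('ec2', region_name=REGION)
ec2.Instance(INSTANCE_ID).stop()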

This is what it looks like 🗞

Next

There are lots of things that could be built on top of this application (e.g. using the data stored in RDS). One could think of building a web app where users rate or like articles to keep training the models, get news recommendations, etc.

Additionally, there is plenty that could be done to improve the current application: better testing and monitoring, packaging and publishing the application, building a subscription service, transitioning to microservices, etc.

I hope you liked it and thanks for reading 🙏
