Build a financial newsletter with Deep Learning and AWS — Part 1

Adria M.
12 min read · May 13, 2018

Real case application for a financial consulting firm.

For more than two years, an experienced employee at this firm collected the most relevant press articles from different newspapers each day. The goal was to distribute a tailored, daily newsletter to all employees with interesting news about the financial and economic landscape (both national and international in scope). This process could take up to two hours every day and comprised tasks such as reading multiple newspaper front pages, keeping track of cross-cutting topics from other sectors, structuring and ranking articles, and manually writing and sending the newsletter.

We are lucky that scraping and natural language processing (NLP) can now do all this work for us. Thanks to recent developments in deep learning and NLP, we can reproduce her article-selection criteria very accurately.

This story gives an overview of the project, with a focus on the (1) data gathering and (2) modelling phases. All code is written in Python 3 and can be found here.

1. Data Gathering

This part is, in general, the most time-consuming task in any data project. In this case, however, the problem can be largely simplified by using a few Python libraries.

The idea is to build a dataset with a large collection of articles, some of which were included in past newsletters ("ground truth"). These "included" articles should be extracted from the long list of newsletters sent over the years.

Before moving forward, let's spend a second introducing the newspaper3k library. Newspaper is an amazing Python library for extracting and curating articles. In a nutshell, parsing and downloading an article is as easy as passing the article URL to the Article class from newspaper and using some of its methods.
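
For example, here is a minimal sketch of how an article can be downloaded and parsed (the URL is just a placeholder):

from newspaper import Article

# Download, parse and run NLP on a single article (placeholder URL)
article = Article('https://www.elconfidencial.com/some-article/')
article.download()
article.parse()
article.nlp()  # fills in keywords and summary

print(article.title)
print(article.authors)
print(article.summary)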


With this, the workload is reduced significantly, and we guarantee that all downloaded articles follow the same structure regardless of their source (text, title, authors, summary, …).

Fig 1. Data Gathering process.

We should now focus on how to access all previous newsletters so that we can extract the article URLs that were sent (that is all we need for the "ground truth"). We will use imaplib to connect to an IMAP server (in this case, over an SSL-encrypted socket) and email to parse the content of the emails.

import imaplib
import getpass

# Connect to the IMAP server over an SSL encrypted socket
M = imaplib.IMAP4_SSL('imap.gmail.com')

# Retrieve credentials
user = input('User email: ')
password = getpass.getpass()

# Login and select the default mailbox (INBOX)
M.login(user, password)
M.select()

# Search all emails sent from the newsletter address
typ, data = M.search(None, '(FROM "name@domain.com")')

Once we have processed the data (emails), we can easily extract the article URLs using regex (you can see all the details here).
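
Once the search has returned the matching message ids, a rough sketch of the extraction could look like this (the regex below is only an illustration, not the exact pattern used in the project):

import re
import email

url_pattern = re.compile(r'https?://[^\s"<>]+')

article_urls = []
for num in data[0].split():
    # Fetch the full message and parse it with the email module
    typ, msg_data = M.fetch(num, '(RFC822)')
    msg = email.message_from_bytes(msg_data[0][1])
    for part in msg.walk():
        if part.get_content_type() in ('text/plain', 'text/html'):
            body = part.get_payload(decode=True).decode('utf-8', errors='ignore')
            article_urls.extend(url_pattern.findall(body))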

Now let's focus on filling the dataset with articles that were not included in past newsletters ("negative examples"). To do this, we used Scrapy to crawl three newspapers (the most frequent ones in previous newsletters): Expansión, Cincodías and El Confidencial.

Every time you face a scraping problem, it is essential to spend time browsing the website to find the right page to start crawling from. In this case, since we are interested in downloading articles (links) published over the last two years, it is best to find some sort of newspaper library or archive where you can filter articles by publication date. Once we have spotted that starting URL, we just need to figure out how to loop over different dates and retrieve all the news for each date.
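
As an illustration, a minimal spider could loop over an archive of dates like this (the URL pattern and CSS selector are hypothetical; each newspaper needs its own):

import scrapy
from datetime import date, timedelta

class ArchiveSpider(scrapy.Spider):
    # Sketch of a date-archive spider; adapt start date, URL and selectors per newspaper
    name = 'archive'

    def start_requests(self):
        day = date(2016, 1, 1)
        while day <= date.today():
            url = 'https://www.example-newspaper.com/archive/{:%Y/%m/%d}/'.format(day)
            yield scrapy.Request(url, callback=self.parse)
            day += timedelta(days=1)

    def parse(self, response):
        # Yield every article link found on the archive page
        for href in response.css('a.article-link::attr(href)').getall():
            yield {'url': response.urljoin(href)}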

Considerations: before jumping into building your spiders, play around with the Scrapy shell to understand what kind of response you get back for a given URL. Some websites are rendered asynchronously, and you might need JavaScript integration.

In any case, you can find all the spiders used in this project here and you can also follow this tutorial if this is new to you.

At this point we have successfully collected news article links from:

  • Emails (“ground truth”): using imaplib + email
  • Online newspapers (“negative examples”): using Scrapy

Now we can import newspaper and download the content of each article. For this we have adapted the Article class to suit our needs.

We can now use this new class to parse, download and dump the content of each article to a local file. Note that we store everything in local JSON files because they can be handled comfortably this way.
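
A simplified sketch of that step might look as follows (the adapted class adds a few more conveniences, but the idea is the same; the helper name is hypothetical):

import json
from newspaper import Article

def article_to_json(url, flag, path):
    # Download an article and dump the fields we keep to a local JSON file
    a = Article(url)
    a.download()
    a.parse()
    a.nlp()
    record = {
        'authors': a.authors,
        'date': a.publish_date.isoformat() if a.publish_date else None,
        'domain': url.split('/')[2],
        'flag': flag,  # 1 if the article appeared in a past newsletter, 0 otherwise
        'keywords': a.keywords,
        'summary': a.summary,
        'text': a.text,
        'title': a.title,
        'url': url,
    }
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(record, f, ensure_ascii=False)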

These are the data fields, and this is how an arbitrary article looks:

  • authors: author(s) of the article (array)
  • date: date of article publication (datetime)
  • day_of_week: day of the week of article publication (string)
  • domain: newspaper website domain
  • flag: target variable (0: not included in BBB newsletter, 1: included)
  • keywords: list of keywords suggested by nlp application (array)
  • section: newspaper section
  • summary: brief summary of the article
  • text: body text of the article
  • title: title of the article
  • url: url link to the article
{
"authors":["Eduardo Segovia","José Antonio Navas","E. Sanz","M. Valero","Contacta Al Autor"],
"date":1478304000000,
"day_of_week":"Saturday",
"domain":"elconfidencial.com",
"flag":1,
"keywords":["todos","inmuebles","para","retasar","pisos","sus","en","y","banco","del","el","que","los","vivienda","se","las","rebajas","la","noticias"],
"section":"vivienda",
"summary":"Malos tiempos para encontrar un chollo entre los pisos que tienen a la venta los bancos.\nEsta norma obliga a volver a tasar (y provisionar) todos los inmuebles que tenga el banco en balance con el descuento aplicado a las ventas, por lo que cualquier alegría en los precios puede tener un enorme impacto en sus cuentas.\nLas provisiones es dinero que los bancos apartan para cubrir el posible impago de los créditos o pérdida de valor de inmuebles, acciones o bonos.\nNo se trata de una norma estándar para todo el sector, ya que en esto se aplican los llamados \"modelos internos\", que son diferentes para cada entidad.\nEn todo caso, con esta norma se refuerza la idea de que cuanto mejor estén provisionados los inmuebles, más fácil será venderlos.",
"text":"Malos tiempos para encontrar un chollo entre los pisos que tienen a la venta los bancos. El interés de algunos por librarse de las viviendas adjudicadas les ha llevado a ofrecer rebajas muy interesantes en el pasado, pero esta práctica se ha acabado con la nueva circular contable del Banco de España que acaba de entrar en vigor. Esta norma obliga a volver a tasar (y provisionar) todos los inmuebles que tenga el banco en balance con el descuento aplicado a las ventas, por lo que cualquier alegría en los precios puede tener un enorme impacto en sus cuentas.\n\nHasta ahora, cuando un banco vendía un piso por debajo de su valor de tasación, tenía la obligación de provisionar la diferencia con el precio al que lo tuviera valorado en su balance (valor en libros) y apuntarse la pérdida correspondiente. Pero solo para ese inmueble individual. Las provisiones es dinero que los bancos apartan para cubrir el posible impago de los créditos o pérdida de valor de inmuebles, acciones o bonos. Este dinero resta del beneficio de la entidad, o lo que es lo mismo, supone una pérdida.\n\nEl gobernador del Banco de España, Luis Linde (EFE)\n\nLo que cambia la nueva circular es que, a partir de ahora, esa pérdida no solo se referirá a cada piso individual, sino que esa rebaja tendrá que aplicarse a todos los inmuebles similares que tenga el banco en su balance. Y, en consecuencia, deberá provisionarse la pérdida de valor de todos ellos, y dado que la banca todavía tiene activos adjudicados por valor de 81.500 millones de euros (solo se han reducido un 4% desde 2011), estamos hablando de un impacto potencial muy importante en sus resultados.\n\nNo se trata de una norma estándar para todo el sector, ya que en esto se aplican los llamados \"modelos internos\", que son diferentes para cada entidad. Y algunos toman como referencia las ventas realizadas en los últimos tres meses, otros las de los últimos seis... Pero lo que es obligatorio es que estos modelos incluyan la referencia de las operaciones realizadas en los meses anteriores, según fuentes del sector. La justificación de esta medida es que los inmuebles deben estar valorado a precios de mercado y que ese precio debe ser similar para todos los de la misma entidad; por tanto, si vende algunos con un descuento, ese descuento debe aplicarse a todos.\n\nEl BdE critica que los bancos no vendan pisos\n\nResulta un tanto contradictorio que el mismo Banco de España que ha establecido esta norma -que entró en vigor en octubre y se aplicará ya a los resultados de cierre del año- critique en su último Informe de Estabilidad Financiera que los bancos no den salida a sus adjudicados con mayor celeridad: \"En el último año, este importe de activos improductivos [incluye los préstamos morosos] se ha reducido en un 12 %, si bien aún representa un porcentaje significativo del activo total de los bancos en su negocio en España y constituye un elemento de presión negativo sobre la cuenta de resultados y la rentabilidad de las entidades\".\n\nSede del Banco de España, en la Plaza de Cibeles en Madrid (EFE)\n\nComo adelantó El Confidencial, la nueva norma también obliga a las entidades a volver a tasar sus inmuebles adjudicados todos los años, en vez de cada tres ejercicios como sucedía hasta ahora. Esto también pretende que los bancos tengan su ladrillo valorado a niveles realistas y, de nuevo, que doten las provisiones necesarias para cubrir la diferencia con el valor al que se lo adjudicaron. 
En la presentación de los resultados del tercer trimestre, la mayoría de los grandes bancos explicaron que la nueva circular significará un trasvase de provisiones desde el crédito moroso a los adjudicados, pero que el efecto neto será mínimo.\n\nEn todo caso, con esta norma se refuerza la idea de que cuanto mejor estén provisionados los inmuebles, más fácil será venderlos. Esa es una de las razones que explican las dudas del mercado sobre la capacidad del Banco Popular de deshacerse de 15.000 millones de adjudicados en dos años, ya que su tasa de cobertura con provisiones se sitúa en el 36%, frente a una media del 50% en el sector.",
"title":"Rebajas: Los bancos acaban con los chollos de pisos para no retasar todos sus inmuebles. Noticias de Vivienda",
"url":"http:\/\/www.elconfidencial.com\/vivienda\/2016-11-05\/bancos-rebajar-pisos-retasar-inmuebles-adjudicados-circular-contable_1285149\/"
}

Next steps before jumping to modelling are:

  • EDA and data cleaning: some articles can be discarded because their content was not downloaded properly. Others were published in different languages or in newspapers that will not be considered in production. What actions should be taken with them?
  • Train / test split: the time component is important in the modelling part. Our goal is to classify future articles efficiently in production, so it makes more sense to assign the most recent articles to the test dataset and train the model on the older ones. This approach is arguable (a quick sketch follows this list).
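
For instance, assuming the articles live in a pandas DataFrame df with a date column, a time-based split is as simple as:

import pandas as pd

# Keep the most recent ~20% of articles as the test set (illustrative ratio)
df = df.sort_values('date')
cutoff = int(len(df) * 0.8)
train_df, test_df = df.iloc[:cutoff], df.iloc[cutoff:]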

Thousands of online articles have been downloaded over the last two years, and those included in previous newsletters have been labeled.

2. Modelling

This is where the real fun begins. The basic idea here is to build a classifier that correctly discriminates between interesting and non-relevant news.

I have hosted a Kaggle competition to engage other colleagues to take part in the modelling phase of the project. This is one of the reasons why the train / test split was done beforehand.

Since our interest is in training a model that generalizes correctly in production, and the problem consists of obtaining a single prediction for a given article (include / not include), the most straightforward and reliable architecture is a neural network-based text classifier (using only text and title here).

Fig 2. Modelling process.

Before defining the layout of our classifier, it is good practice to do some text pre-processing:

Words to Numbers

In this part we need to build a dictionary that will be used to convert words to numbers. Our dictionary will contain every word that appears at least 5 times across all articles. Once we have the unique list of such words, we assign a number to each word. The first and most frequent word (=1) is de, the second one (=2) is la, …, word #200 is situación, etc. Now we can convert sentences of words into lists of integers.

['los', 'nueve', 'consejeros', 'del', 'banco', 'de', 'inglaterra', 'han',
 'acordado', 'mantener', 'los', 'tipos', 'de', 'interés', 'en', 'el', '025', '.']

[12, 17, 4265, 202, 0, 1, 31026, 207, 1195, 9, 70, 332, 11, 1017, 31027]
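
A minimal sketch of this dictionary construction, assuming tokenized_articles is a list of token lists, could be:

from collections import Counter

# Count word occurrences across all articles and keep those appearing >= 5 times
counts = Counter(word for article in tokenized_articles for word in article)
vocab = [w for w, c in counts.most_common() if c >= 5]
word_to_id = {w: i + 1 for i, w in enumerate(vocab)}  # 1 = most frequent word

def encode(tokens):
    # Convert a tokenized article into a list of integers (unknown words dropped)
    return [word_to_id[t] for t in tokens if t in word_to_id]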

We cannot pass text sequences of differing lengths to the algorithm. Hence, we will pre-pad the sequences with a new <PAD> "word" in our dictionary and truncate the articles' text to a fixed number of words (integers). The threshold of our choice is 500 words for the text (a lower threshold would be used for the title). Sequences shorter than 500 will be pre-padded and sequences longer than 500 will be truncated.
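
With Keras, this padding and truncation can be done in one call (reusing the encode helper sketched above):

from keras.preprocessing.sequence import pad_sequences

# Shorter sequences are pre-padded with 0 (<PAD>), longer ones are truncated
encoded = [encode(tokens) for tokens in tokenized_articles]
X = pad_sequences(encoded, maxlen=500, padding='pre', truncating='pre', value=0)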

Neural network-based text classifiers

Inputs are ready to train the model’s parameters.

The basic structure used is:

  • Input (pre-padded text sequences)
  • Embeddings
  • Convolutional layers (CNN)
  • LSTM (RNN)
  • Fully connected part

There are countless deep learning frameworks available today but for this case study we are using Keras running on top of TensorFlow.

Embeddings

Embedding layers take a sequence of word ids as input and produce a sequence of corresponding vectors as output. Their functionality is really straightforward, and since the actual semantics of those vectors are not interesting for our problem, the only remaining question is: what is the best way to initialize the weights? Depending on the problem, the answer may be as simple as "generate your own synthetic labels, train word2vec on them, and init the embedding layer with them."

In Fig 3 we used Continuous Bag of Words (CBoW) with TensorFlow to build an embeddings matrix, and used the t-SNE technique for dimensionality reduction to plot a few words in 2D. It can be clearly seen how similar words cluster together (numbers, years, banks, etc.).

Fig 3. Using Continuous Bag of Words (CBoW) to build an embeddings matrix from all available articles.
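
If you want to reproduce something similar without writing the TensorFlow CBoW code, gensim offers a convenient alternative (sg=0 selects the CBoW architecture; the word list below is only illustrative):

from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Train CBoW embeddings on the tokenized articles
w2v = Word2Vec(tokenized_articles, min_count=5, sg=0)

# Project a handful of word vectors to 2D with t-SNE for plotting
words = ['banco', 'bce', 'millones', '2016', '2017']
vectors = [w2v.wv[w] for w in words if w in w2v.wv]
coords = TSNE(n_components=2, perplexity=3).fit_transform(vectors)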

The go-to solution here, however, is to use a trainable layer for the embeddings. LSTMs learn to distinguish important and unimportant parts of the sequence by themselves, but we can't be sure that the representation from the embedding layer is the best input, especially if we don't fine-tune the embeddings.

In other words, instead of building the embeddings matrix with word2vec and transferring the fixed weights to the embedding layer, we let the neural network train the embedding weights.

“Adding a layer that is applied to each word embedding independently can improve your results, acting as a simple attention layer.”

Before feeding the sequence model with data, we need to truncate and pad the input sequences so that they are all the same length for modelling. The model will learn that the zero values carry no information, so the sequences are effectively not the same length in terms of content, but same-length vectors are required to perform the computation in Keras.

Convolutional

Convolutional Neural Networks excel at learning the spatial structure in input data. We can easily add a one-dimensional CNN and max pooling layers after the embedding layer, which then feed the consolidated features to the LSTM. Here you can find a simple implementation of a Convolutional Neural Network for sentence classification.

LSTM

Recurrent Neural Networks like LSTMs generally suffer from overfitting, which is why these layers are normally combined with dropout.

Since the main work is being done in the recurrent layer, it’s important to make sure that it captures only the relevant information. It’s a frequent challenge for natural language applications and an open scientific problem.

Dense classifier

A fully-connected part performs a series of transformations on the deep representation and finally outputs the scores for each class. The best practice here is to apply the transformations in the following order (see the sketch after this list):

  1. Fully-connected layer
  2. Batch normalization
  3. (Optional) Non-linear transformation (hyperbolic tangent or ELU)
  4. Dropout
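
Putting these pieces together, a minimal Keras sketch of the whole stack could look like this (layer sizes, dropout rates and the vocabulary size are illustrative, not the exact values used in the project):

from keras.models import Sequential
from keras.layers import (Embedding, Conv1D, MaxPooling1D, LSTM,
                          Dense, BatchNormalization, Activation, Dropout)

vocab_size = 40000  # illustrative; use len(word_to_id) + 1 in practice

model = Sequential()
model.add(Embedding(vocab_size, 128, input_length=500))      # trainable embeddings
model.add(Conv1D(64, 5, activation='relu'))                  # 1D convolution
model.add(MaxPooling1D(pool_size=4))                         # consolidate features
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))     # recurrent layer
model.add(Dense(64))                                         # fully-connected block
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))                    # include / not include
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])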

To add more predictive power, we decided to stack two neural network-based text classifiers (one for the text, one for the title) and use their outputs to feed an XGBoost classifier (Extreme Gradient Boosting) on top of them.

Fig 4. Model layout.
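
Conceptually, the stacking step looks like the sketch below (names such as text_model or X_title_train are placeholders for the two trained Keras classifiers and their inputs):

import numpy as np
from xgboost import XGBClassifier

# The probabilities from the two Keras classifiers become the features
# of the meta-classifier
text_probs = text_model.predict(X_text_train).ravel()
title_probs = title_model.predict(X_title_train).ravel()
meta_X = np.column_stack([text_probs, title_probs])

meta_clf = XGBClassifier()
meta_clf.fit(meta_X, y_train)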

Evaluation

Receiver Operating Characteristic (ROC) curves are typically used in binary classification to study the output of a classifier. The AUC is the Area Under the Curve and is used here to evaluate the performance of the model. The AUC has several equivalent interpretations:

  • The expectation that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
  • The expected proportion of positives ranked before a uniformly drawn random negative.
  • The expected true positive rate if the ranking is split just before a uniformly drawn random negative.
  • The expected proportion of negatives ranked after a uniformly drawn random positive.
  • The expected false positive rate if the ranking is split just after a uniformly drawn random positive.
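
Computing it with scikit-learn is a one-liner (meta_X_test and y_test stand for the held-out, most recent articles):

from sklearn.metrics import roc_auc_score

# Score the stacked model on the held-out test articles
test_probs = meta_clf.predict_proba(meta_X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, test_probs))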

After this simple but powerful approach, we have reached an AUC larger than 0.95 on the test set!

Some simple HTML, and this is what the newsletter looks like…

There will be a second part where I explain how I deployed the full application on AWS.

I hope you liked it and thanks for reading 🙏
