Creating an Automated Serverless Data Pipeline for Visualization of Financial News Sentiment with Dash

Dmitriy · Published in More Python · Jul 6, 2019 · 5 min read

I wanted to get a historical outlook on news sentiment for specific stocks in the market. To do this, I would need a large amount of full-text article data, which is not readily or freely available. After an exhaustive Google search, the best historical data I could find went back only about 100 articles per company. Disappointed by this, I decided to build my own small data mining operation: periodically scan for new links to articles about the stocks I’m interested in, open them, store them, parse them with NLP, and analyze the saved data on a dashboard. In addition to the S&P index, I thought it would be reasonable to include all of the recent interesting IPOs, since that makes it possible to store their articles from day one.

My next thoughts were: how much would this cost, and what’s the most parsimonious way to do it while still learning important skills? I’m relatively new to the whole serverless trend, so I wanted to explore it a little further. It turns out that AWS Lambda functions and static storage on S3 are really cost-effective if used properly. You can read plenty more about this online, but to summarize: Lambda functions perform operations on a schedule or in response to events, and only when you need them to. It’s like having functions in the cloud without servers. Since this isn’t a resource-intensive mining operation, pure Lambda functions work great for all stages of ETL.

Creating The Pipeline

The plan I came up with involves three Lambda functions that are invoked every three days, plus a fourth that serves the data to the front end for analysis and exploration with Dash.

Figure 1. AWS architecture diagram for a single cycle of the data mining operation.

The diagram above is a broad overview of my architecture: the labels for the functions are accurate but don’t fully capture everything each one does. Some steps are combined differently across functions to work around limitations of the Lambda service (execution time and memory), but for the most part the ETL process is segregated as follows:

  • Extract URLs for articles, open them, save and parse the text from each article to S3. This is the raw data.
  • Grab the raw data from S3, transform it using NLP to get sentiment and summaries, drop some duplicates, store to a different S3 folder.
  • Aggregate and load the data into the final S3 folder so that it can be consumed via a REST API and visualized.
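The three stages above can be sketched as a pair of functions: one that builds per-ticker S3 keys for the raw dump, and a handler that stores the scraped articles and notifies the next stage. The bucket name, topic ARN, key layout, and tickers here are my own placeholders, not the actual resources from the project, and the scraping step is stubbed out:

```python
import json
import datetime

BUCKET = "news-sentiment-pipeline"  # hypothetical bucket name
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:extract-done"  # placeholder

def raw_key(ticker, when):
    """Build a per-ticker, per-date S3 key for the raw article dump."""
    return f"raw/{ticker}/{when:%Y-%m-%d}.json"

def scrape_articles(ticker):
    """Placeholder for the finviz/newspaper extraction step (not shown)."""
    return []

def handler(event, context):
    import boto3  # imported lazily; available in the Lambda runtime
    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    today = datetime.date.today()
    for ticker in ["AAPL", "LYFT", "UBER"]:  # example tickers
        articles = scrape_articles(ticker)
        s3.put_object(
            Bucket=BUCKET,
            Key=raw_key(ticker, today),
            Body=json.dumps(articles),
        )
    # Tell the Transform stage that fresh raw data is ready.
    sns.publish(TopicArn=TOPIC_ARN, Message=f"extracted {today}")
```

The Transform and Load stages follow the same shape: read from one prefix, write to another, publish to SNS.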

Now it’s time to talk about costs: four Lambda functions, a messaging service (SNS), and S3 storage. The S3 and messaging costs are negligible, a few pennies for gigabytes of storage. Lambda costs are based on the number of executions per month, memory allocation, and duration in milliseconds.

Figure 2. Lambda costs estimated via https://dashbird.io/lambda-cost-calculator/.

All I can say about my cost summary is that I love serverless. For ~10 cents per month, it's possible to set up an automated data mining operation in the cloud! Now, let’s talk about the execution of the plan.
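To make that pricing model concrete, here’s a rough back-of-the-envelope estimate for one function. The rates are the public Lambda prices as I understand them ($0.20 per million requests, $0.0000166667 per GB-second); check the current AWS pricing page before relying on them:

```python
REQUEST_PRICE = 0.20 / 1_000_000  # USD per invocation
GB_SECOND_PRICE = 0.0000166667    # USD per GB-second

def monthly_cost(invocations, memory_mb, avg_duration_s):
    """Estimate the monthly USD cost of a single Lambda function."""
    gb_seconds = invocations * (memory_mb / 1024) * avg_duration_s
    return invocations * REQUEST_PRICE + gb_seconds * GB_SECOND_PRICE

# A function invoked every 3 days (~10 times/month) at 512 MB for 60 s:
print(f"${monthly_cost(10, 512, 60):.4f}/month")  # about half a cent
```

Multiply by four functions and it’s still pocket change, which matches the calculator’s estimate.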

  1. The first lambda function, Extract.py, gathers data for specified stock tickers every 3 days, turns it into a JSON object, stores it to an S3 folder, and sends a message to SNS.
  2. Next, Transform.py detects the message from above, gets the new data, adds sentiment, drops duplicates, stores the processed data in a different folder of the same S3 bucket, and sends another message to SNS.
  3. Load.py sees the message and proceeds to do aggregation and storage of the final data to another folder on S3.
  4. Finally, API.py grabs data from S3 on request from the API gateway to safely and securely serve up a REST API for consumption through Dash.

Note: it’s important to set the appropriate permissions and function specifications as outlined in Figure 2, either in the config file before deployment or in the AWS console afterward. I deployed these functions with Chalice and modified the IAM roles, memory, and timeout specs in config.json.
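For reference, a Chalice config.json for one of these functions might look roughly like this; the role ARN and sizes are illustrative placeholders, not my actual values:

```json
{
  "version": "2.0",
  "app_name": "transform",
  "stages": {
    "dev": {
      "manage_iam_role": false,
      "iam_role_arn": "arn:aws:iam::123456789012:role/transform-role",
      "lambda_memory_size": 512,
      "lambda_timeout": 300,
      "reserved_concurrency": 1
    }
  }
}
```

Setting `manage_iam_role` to false is what lets you supply your own role instead of Chalice’s auto-generated one.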

Visualizing the Data

Now that we have some data, it’s time to visualize it. Dash is a trending Python framework for building data visualization apps and user interfaces in pure Python and I love it. The app I built uses a “datatable” component to differentially render data from the API based on user input with various filters.

Figure 3. A snapshot of the app. Deployed (be patient, it’s pulling data from static storage): https://news-sentiment-get-money.herokuapp.com/. Code.

I should mention that the awesomesauce of Dash also comes with a lack of up-to-date tutorials and out-of-the-box styling that is a little too basic. I’ll just say that styling the app was a bit of an annoyance, and front-end developer jobs will not be going away anytime soon. It definitely takes some effort to grok, but once you do, Dash is a very powerful, highly customizable, and expressive visualization tool.

Challenges

  1. Choosing the right tools for the pipeline. AWS has a very large suite of tools. I initially wanted to use Glue for my ETL jobs, DynamoDB to store the data, and Step Functions state machines for event-driven programming, but I found that everything I wanted to accomplish was possible with just static storage and talking Lambdas.
  2. Coming up with the best way to clean and process the extracted data. Even though I did a lot of testing in Jupyter before actually deploying the Lambda scripts into hyperspace, new weird data caused anomalies in my ETL. During live testing, I often found myself going back to modify my extract scripts to account for things like blank values and bad request errors.
  3. Lambda config. Specifying the correct configuration for every Lambda function before deployment was sort of a pain. Chalice’s automatically generated IAM roles were invalid, so I had to set them manually. I also initially didn’t reserve concurrency at 1, and some Lambdas ran 10 times when they should have run just once. This was mainly due to my lack of understanding of S3 events, and I quickly spotted the issue while monitoring the logs. Lesson: don’t put objects and listen for put events in the same S3 bucket with the same function; it causes an endless loop.
  4. Dependency packaging. The app uses quite a few awesome libraries such as finviz, newspaper, nltk, and pandas, so I had to figure out how to best split my functions to conform to the 250 MB size limit for dependencies in Lambda.
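A small guard reflecting the lesson in (3): derive each output key from the input key and refuse to process anything outside the expected input prefix, so a function can never be re-triggered by its own writes. The prefix names are placeholders:

```python
RAW_PREFIX = "raw/"
PROCESSED_PREFIX = "processed/"

def output_key(event_key):
    """Map raw/<rest> to processed/<rest>; reject keys outside raw/.

    Writing results under a different prefix than the one the S3 event
    notification is scoped to breaks the put-event feedback loop.
    """
    if not event_key.startswith(RAW_PREFIX):
        raise ValueError(f"unexpected key outside {RAW_PREFIX}: {event_key}")
    return PROCESSED_PREFIX + event_key[len(RAW_PREFIX):]
```

Pairing this with an S3 event notification filtered to the `raw/` prefix (or simply using separate buckets) makes the loop impossible by construction.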
