Creating a Near Real-Time Financial News Dataset With AWS Lambda

Daniel O'Keefe
Oct 15, 2020 · 4 min read
[Image: AWS Lambda’s logo]

Stock prices fluctuate over time depending on market sentiment. Firms can gauge current and historic market sentiment for individual stocks or entire markets by using financial news articles. With these articles, firms can apply natural language processing techniques such as named entity recognition and sentiment analysis to measure the outlook for specific stocks or the market as a whole. These methods can tag articles that mention publicly traded companies like Microsoft or Netflix and assign a sentiment rating indicating whether an article is positive, neutral, or negative.

This article will go over how you can compile a dataset of financial news from CNBC Finance in an S3 bucket that updates daily.

To find the top financial news articles of the day from CNBC Finance, we can use the requests library in Python to send a GET request to https://www.cnbc.com/finance/.
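A minimal sketch of that request. The browser-style User-Agent header is just an illustration; it makes it less likely the request is served a blocked or stripped-down page:

import requests

FINANCE_URL = "https://www.cnbc.com/finance/"

# The User-Agent value is illustrative; some sites serve different
# markup to clients that look like scripts.
response = requests.get(FINANCE_URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()
html = response.text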

We then parse the response with Beautiful Soup, identifying the “cards” on the page that contain links to the individual articles. We want to retrieve those URLs and record them in a list.
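Continuing from the snippet above, here is one way that parsing might look. The class name in the selector is an assumption: CNBC’s markup changes over time, so inspect the live page and adjust it.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Collect the link from each headline "card"; the class name is illustrative.
article_urls = []
for card in soup.find_all("a", class_="Card-title"):
    href = card.get("href")
    if href and href not in article_urls:
        article_urls.append(href)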

For each URL we retrieve, we create an Article object containing the date and text of the article.
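A sketch of what that object and its construction might look like; the date and body selectors are assumptions about CNBC’s article template, so adjust them to whatever the current pages use:

from dataclasses import dataclass
import requests
from bs4 import BeautifulSoup

@dataclass
class Article:
    date: str
    text: str

def fetch_article(url: str) -> Article:
    page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    # Both selectors are assumptions; inspect an article page to confirm.
    date_tag = soup.find("time")
    body = soup.find("div", class_="ArticleBody-articleBody")
    return Article(
        date=date_tag.get("datetime", "") if date_tag else "",
        text=body.get_text(" ", strip=True) if body else "",
    )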

To run this automatically with AWS, I will use AWS Lambda and store the results in an S3 bucket. Create an S3 bucket. Navigate to the IAM console and create a role that grants access to your S3 bucket.
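If you prefer to script that setup, the same role can be created with boto3. The role, policy, and bucket names below are placeholders; the inline policy grants only the s3:PutObject permission this function needs, plus the standard managed policy for CloudWatch logging:

import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the Lambda service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(RoleName="cnbc-scraper-role",
                AssumeRolePolicyDocument=json.dumps(trust_policy))

# Inline policy granting write access to the bucket (name is a placeholder)
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::your-cnbc-bucket/*",
    }],
}
iam.put_role_policy(RoleName="cnbc-scraper-role",
                    PolicyName="cnbc-s3-write",
                    PolicyDocument=json.dumps(s3_policy))

# Managed policy so the function can also write CloudWatch logs
iam.attach_role_policy(
    RoleName="cnbc-scraper-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
)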

Log into the AWS Lambda Console and click “Create function”. Select “Author from scratch”, select a Python 3.8 runtime, name your function, and click the “Create function” button again.

Since we are using two Python libraries that aren’t included in the AWS Lambda Python 3.8 runtime, requests and Beautiful Soup, we have to download them and zip them together with our code into a deployment package. From the command prompt, create a new directory on your computer named cnbc_dependencies and navigate to it. Download the requests package into that directory with

pip install requests -t .

Make sure to include the trailing period, which installs the package into the current directory. Then download Beautiful Soup:

pip install beautifulsoup4 -t .

We first need to make a few modifications to our code so it writes the article text files to S3. We also need to define the handler function that Lambda invokes. Name your Python script lambda_function.py. With these changes in place, the code should look something like this:
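Putting the pieces together, lambda_function.py might look something like the sketch below. The bucket name and CSS selectors are placeholders you will need to replace:

# lambda_function.py: a sketch of the full script. The bucket name and
# CSS selectors are placeholders, not guaranteed to match CNBC's live markup.
import datetime

import boto3
import requests
from bs4 import BeautifulSoup

BUCKET = "your-cnbc-bucket"  # replace with your S3 bucket name
FINANCE_URL = "https://www.cnbc.com/finance/"
HEADERS = {"User-Agent": "Mozilla/5.0"}

s3 = boto3.client("s3")

def get_article_urls():
    """Return the article links found on the CNBC Finance front page."""
    response = requests.get(FINANCE_URL, headers=HEADERS, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    urls = []
    for card in soup.find_all("a", class_="Card-title"):  # illustrative selector
        href = card.get("href")
        if href and href not in urls:
            urls.append(href)
    return urls

def get_article_text(url):
    """Return the body text of a single article, or an empty string."""
    page = requests.get(url, headers=HEADERS, timeout=10)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    body = soup.find("div", class_="ArticleBody-articleBody")  # illustrative selector
    return body.get_text(" ", strip=True) if body else ""

def lambda_handler(event, context):
    prefix = datetime.date.today().strftime("%Y/%m/%d")  # keys by year/month/day
    # Fetch every article first, then upload; this matches the memory
    # behavior discussed at the end of this article.
    articles = [get_article_text(url) for url in get_article_urls()]
    for i, text in enumerate(articles):
        if text:
            s3.put_object(Bucket=BUCKET,
                          Key=f"{prefix}/article_{i}.txt",
                          Body=text.encode("utf-8"))
    return {"statusCode": 200}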

Copy your Python file into the cnbc_dependencies directory and then zip the contents of the directory (the files themselves, not the folder, so that lambda_function.py sits at the root of the archive). You can modify your code later from within the code editor in the AWS console if needed.
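If you don’t have a zip utility on hand, a few lines of Python can build the archive. This helper is hypothetical; any tool that puts the files at the root of the archive works just as well:

import os
import zipfile

def build_package(src_dir, archive):
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to src_dir so lambda_function.py
                # ends up at the root of the archive, where Lambda expects it.
                zf.write(path, os.path.relpath(path, src_dir))

build_package("cnbc_dependencies", "deployment_package.zip")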

[Image: which files to include in the zipped deployment package]

From the Function code editor in the Lambda console, click on Actions > Upload a .zip file.

Scroll down to the Basic Settings section of your Lambda function and look at the Handler setting. It takes the form file.function: “lambda_function” should be the name of your Python file (without the .py extension), and “lambda_handler” should be the name of a function within that file. Lambda looks for this file and function each time the function is invoked.

Now let’s add an automated trigger to run this function daily. Navigate to the top of the Lambda function configuration page and click the “Add trigger” button. Select “EventBridge (CloudWatch Events)”. Create a new rule with a cron schedule expression of “cron(0 13 1/1 * ? *)” to run every day at 1300 UTC (9:00 a.m. EDT / 8:00 a.m. EST). Click “Add” to add the trigger.
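The console flow above is all you need, but the same rule can be created with boto3 if you’d rather script it. The rule name, function name, and ARN below are placeholders:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:cnbc-scraper"  # placeholder

# Rule that fires at 1300 UTC every day
rule = events.put_rule(Name="cnbc-daily-scrape",
                       ScheduleExpression="cron(0 13 1/1 * ? *)")

# Point the rule at the Lambda function
events.put_targets(Rule="cnbc-daily-scrape",
                   Targets=[{"Id": "cnbc-scraper", "Arn": FUNCTION_ARN}])

# Grant EventBridge permission to invoke the function
lambda_client.add_permission(
    FunctionName="cnbc-scraper",
    StatementId="cnbc-daily-scrape-invoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule["RuleArn"],
)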

To verify your Lambda function has access to both CloudWatch and S3, click on the Permissions tab and view the Resource summary.

Everything should be good to go now!

The articles will be saved as text files and organized by year, month, and day.
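To spot-check a day’s output, you can list the keys under its date prefix. The bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")
listing = s3.list_objects_v2(Bucket="your-cnbc-bucket", Prefix="2020/10/15/")
for obj in listing.get("Contents", []):
    print(obj["Key"])  # e.g. 2020/10/15/article_0.txt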

One possible improvement is reducing the amount of memory this Lambda function needs. The minimum you can allocate is 128 MB, but this function has been using 250–350 MB because I store the responses from every article web page before sending them to S3. If I instead sent each article to S3 and released it from memory before requesting the next one, I could scale the allocation down to 128 MB.
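Using the helpers from the lambda_function.py sketch above, the revised handler might look like this:

def lambda_handler(event, context):
    prefix = datetime.date.today().strftime("%Y/%m/%d")
    for i, url in enumerate(get_article_urls()):
        # Fetch, upload, and discard one article at a time so only a
        # single response body is ever held in memory.
        text = get_article_text(url)
        if text:
            s3.put_object(Bucket=BUCKET,
                          Key=f"{prefix}/article_{i}.txt",
                          Body=text.encode("utf-8"))
    return {"statusCode": 200}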
