Creating a Near Real-Time Financial News Dataset With AWS Lambda
Stock prices fluctuate over time depending on market sentiment. Firms can gauge the current and historic market sentiment for individual stocks or entire markets by using financial news articles. With these articles, firms can use natural language processing techniques such as named entity recognition and sentiment analysis to measure the outlook for specific stocks or the market as a whole. These methods help to tag articles about publicly traded companies like Microsoft or Netflix and to calculate a sentiment tag or rating indicating whether a financial article is positive, neutral, or negative.
This article will go over how you can compile a dataset of financial news from CNBC Finance in an S3 bucket that updates daily.
We request the CNBC Finance page and parse the response using Beautiful Soup. We identify the “cards” on the page that contain links to individual articles, retrieve their URLs, and record them in a list.
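As a sketch, the parsing step might look like the following. The Card-title class name is an assumption about CNBC’s markup; inspect the live page source to confirm the actual selector.

```python
from bs4 import BeautifulSoup

def extract_article_urls(html):
    """Collect the article links held in the page's 'cards'.
    The Card-title class is an assumption about CNBC's markup."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for card in soup.find_all("a", class_="Card-title"):
        href = card.get("href")
        if href and href not in urls:  # keep page order, skip duplicates
            urls.append(href)
    return urls

# In the scraper itself the HTML comes from the live page, e.g.:
# html = requests.get("https://www.cnbc.com/finance/").text
sample = '<a class="Card-title" href="https://www.cnbc.com/2020/09/01/example.html">Headline</a>'
print(extract_article_urls(sample))
```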
For each URL we retrieve, we create an Article object that contains the date and text of the article.
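One way to sketch that object is a small dataclass. The exact field names and the choice to pull text from every `<p>` tag are assumptions about the original script and CNBC’s article markup.

```python
from dataclasses import dataclass
from datetime import date

from bs4 import BeautifulSoup

@dataclass
class Article:
    url: str
    published: date
    text: str

def build_article(url, html, published):
    # Join the page's paragraph text; selecting every <p> tag is an
    # assumption about CNBC's article markup.
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return Article(url=url, published=published, text="\n".join(paragraphs))

sample = "<p>Shares rose on Tuesday.</p><p>Analysts were upbeat.</p>"
article = build_article("https://www.cnbc.com/example", sample, date(2020, 9, 1))
print(article.text)
```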
To run this automatically with AWS, I will use AWS Lambda and store the results in an S3 bucket. First, create an S3 bucket. Then navigate to the IAM console and create a role that Lambda can assume and that grants write access to your bucket.
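The role’s permissions policy might look something like this sketch (replace your-bucket-name with your bucket; also attach the AWSLambdaBasicExecutionRole managed policy so the function can write CloudWatch logs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}
```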
Log into the AWS Lambda Console and click “Create function”. Select “Author from scratch”, select a Python 3.8 runtime, name your function, and click the “Create function” button again.
Since we are using two Python libraries that aren’t included in the AWS Lambda Python 3.8 environment, requests and Beautiful Soup, we have to bundle them with our code in a deployment package. From the command prompt, create a new directory on your computer named cnbc_dependencies and navigate to it. Download the requests package into that directory with
pip install requests -t .
Make sure to include the period, which installs the package into the current directory. Then download Beautiful Soup:
pip install beautifulsoup4 -t .
We need to make a few modifications to our code before it can insert the article text files into S3. We also need to define the handler function that Lambda invokes. Name your Python script lambda_function.py. With these changes in place, the code should look like this:
Copy your Python file into the directory and then zip the contents of the directory. You can make modifications to your code later from within the text editor in the AWS console if needed.
From the Function code editor in the Lambda console, click on Actions > Upload a .zip file.
Scroll down to the Basic Settings section of your Lambda function and look at the Handler setting. It should read lambda_function.lambda_handler: “lambda_function” is the name of your Python file (without the .py extension), and “lambda_handler” is the name of the function inside it. Lambda looks for this file and function when the Lambda function is invoked.
Now let’s add an automated trigger to run this function daily. Navigate to the top of the Lambda function configuration page and click the “Add trigger” button. Select “EventBridge (CloudWatch Events)”. Create a new rule with a Cron schedule expression of “cron(0 13 1/1 * ? *)” to run every day at 1300 UTC (0900 EDT). Click “Add” to add the trigger.
To verify your Lambda function has access to both CloudWatch and S3, click on the Permissions tab and view the Resource summary.
Everything should be good to go now!
The articles will be saved as text files and organized by year, month, and day.
One possible improvement is reducing the amount of memory this Lambda function needs. The minimum you can allocate is 128 MB, but this function has been using 250–350 MB because I store the responses from every article web page before sending them to S3. If I instead sent each article to S3, deleted it from memory, and then requested the next article, I could scale the memory allocation down to 128 MB.
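A sketch of that change, with extract_text standing in for the Beautiful Soup parsing already in the script (the helper name and parameters are hypothetical):

```python
import requests

def extract_text(html):
    # Stand-in for the Beautiful Soup parsing in the main script.
    return html

def save_articles_one_at_a_time(urls, s3_client, bucket, today):
    # Upload each article as soon as it is fetched, so only one response
    # is held in memory at a time.
    for i, url in enumerate(urls):
        response = requests.get(url)
        text = extract_text(response.text)
        key = f"{today:%Y/%m/%d}/article_{i}.txt"
        s3_client.put_object(Bucket=bucket, Key=key, Body=text.encode("utf-8"))
        # Drop references before the next request so the previous
        # article can be garbage-collected.
        del response, text
```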