How to Web Scrape on a Schedule with Apache Cassandra, FastAPI, and Python
Author: Pieter Humphrey
Can there be other use cases for Apache Cassandra® beyond messaging and chat? In this tutorial, we show you how to web scrape on a schedule by integrating the Python framework called FastAPI with Astra DB, a serverless, managed Database-as-a-Service built on Cassandra.
Recently, I caught up with the Pythonic YouTuber Justin Mitchell from CodingEntrepreneurs and we discussed how today’s apps are tackling global markets and issues. He pointed out that Discord was storing 120 million messages a day with only four backend engineers, and that was back in 2017.
While scale is certainly not the only consideration for global applications, it’s one of the most difficult issues to tackle, because it often implies costly rewrites or re-architectures. Today, Discord stores billions of messages a day, still with a minimal team. Justin and I are both curious: what makes this possible?
As Discord writes on their blog,
“This is a lot of data that is ever increasing in velocity, size, and must remain available. How do we do it? Cassandra!”
As developers, we know that scalability on the app tier is just as important as the data tier. In the past decade, application frameworks and libraries have massively improved to tackle distributed applications. But what about distributed data?
Apache Cassandra® is an open-source NoSQL database built for astounding amounts of data, and used by global enterprises like Netflix, Apple, and Facebook. But can there be other use cases for Cassandra beyond messaging and chat? I asked Justin from CodingEntrepreneurs if he thought web scraping was an interesting use case.
In this video tutorial on the Coding for Entrepreneurs YouTube Channel, Justin shows you how to web scrape on a schedule by integrating the Python framework called FastAPI with Astra DB, a serverless, managed Database-as-a-Service built on Cassandra.
This tutorial build uses the following technologies:
- Astra DB: managed Cassandra database service
- FastAPI: web framework for developing REST APIs in Python based on pydantic
- Python: interpreted programming language
- pydantic: library for data parsing and validation
- Celery: open-source asynchronous task queue based on distributed message passing
- Redis: open-source, in-memory data store used as a message broker
We begin our video tutorial with a quick demo of implementing our own scraping client and methods to scrape a dataset from Amazon.com in a Jupyter Notebook. We show you how we grab all the raw data, put it in a dictionary, validate it, and store it in a database. What is amazing is that when FastAPI starts, it automatically syncs all the database tables with Astra DB. This makes web scraping really simple, easy, and fast.
What you need to web scrape on a schedule
We will be using Python to perform web scraping, which means extracting data from websites. Since we are web scraping on a quick schedule, we are going to get massive amounts of data really, really quickly, making it perfect for Cassandra. We will use Astra DB to automatically connect to Cassandra, which gives us up to 80 gigabytes for free. Start by signing up for an account here.
We also recommend that you have some experience with Python, like completing the programming challenge 30 days of Python. If you have a basic knowledge of classes, functions, and strings, you are good to go.
You will find all the tools you need on the Coding for Entrepreneurs GitHub. Don’t forget to install Python 3.9 and download Visual Studio Code as your code and text editor, and we are ready to set up our environment. Follow along in our YouTube tutorial.
Integrating Python with Cassandra
Once you’ve signed up for your free account on DataStax, we will create a FastAPI database on Astra DB with a keyspace name, a provider, and the region that’s physically closest to you. While we are waiting for the database, we will add a new token to the application, which we walk through step-by-step in our video tutorial.
When the FastAPI database is active, we will connect to it using the Cassandra driver for Python, which we can install with pip. After following all the steps, we should have a working connection between Python and Cassandra.
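The connection step can be sketched roughly as follows. This is a minimal sketch rather than the exact code from the video: the environment variable names and the `load_astra_settings` and `get_session` helpers are our own assumptions, and the driver imports are kept inside `get_session` so the module still loads on a machine where `cassandra-driver` is not installed yet.

```python
import os


def load_astra_settings(env=None):
    """Collect Astra DB credentials from environment variables.

    The variable names are hypothetical; use whatever names you chose
    when you saved your token and secure connect bundle.
    """
    env = os.environ if env is None else env
    required = (
        "ASTRA_DB_CLIENT_ID",
        "ASTRA_DB_CLIENT_SECRET",
        "ASTRA_DB_BUNDLE_PATH",
    )
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return {name: env[name] for name in required}


def get_session(settings):
    """Open a session on Astra DB and make it the cqlengine default."""
    # Third-party imports (pip install cassandra-driver).
    from cassandra.auth import PlainTextAuthProvider
    from cassandra.cluster import Cluster
    from cassandra.cqlengine import connection

    cluster = Cluster(
        cloud={"secure_connect_bundle": settings["ASTRA_DB_BUNDLE_PATH"]},
        auth_provider=PlainTextAuthProvider(
            settings["ASTRA_DB_CLIENT_ID"],
            settings["ASTRA_DB_CLIENT_SECRET"],
        ),
    )
    session = cluster.connect()
    # Register the session so cqlengine models can use it by default.
    connection.register_connection(str(session), session=session)
    connection.set_default_connection(str(session))
    return session
```

Calling `get_session(load_astra_settings())` once at startup should be enough for the rest of the application to share the same session.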
Creating data with your first Cassandra model
If you are following along with the video, you should have connected to a session. Now you can store data in your Cassandra database. Since we’ll be scraping Amazon, the data we want to store is the Amazon Standard Identification Number (ASIN), a unique identifier for each product, and a product title. If we have a million of these, it won’t be efficient in Python, but it will be very efficient in Cassandra. The ASIN is the primary key to look up products in our database.
Once the model is defined, we will register and set the default connection. The idea here is to store the product itself and its details. Because the ASIN is the primary key, saving a product whose ASIN already exists updates the other fields of that row instead of creating a duplicate.
One of my favorite things about Cassandra is that if you need to add a new column to the existing model, you can do it easily without having to run migrations or make changes to the actual tables. Follow along with the code to add a new column to an existing model here.
Tracking a Scrape Event
The next part is to track an actual scrape event. With the create_scrape_entry function, we are essentially adding new items to our database. The current model updates based on the ASIN that you’ve stored: if the ASIN already exists in the database, all the other fields are updated; if it doesn’t, a new row is added. To keep a record of every scrape instead of overwriting the previous one, we create a different table and set a UUID field as its primary key. As long as each entry gets its own UUID, every update is stored as a new row. Learn how to do this and more here.
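To make the two tables concrete, here is a minimal sketch using the driver's cqlengine models. It is an illustration under assumptions, not the code from the video: the `define_models` helper, the column set, and the signature of this `create_scrape_entry` (which takes the model classes as arguments so it can be tried without a database) are ours.

```python
import uuid


def define_models(keyspace):
    """Build the two models (requires cassandra-driver to be installed)."""
    from cassandra.cqlengine import columns
    from cassandra.cqlengine.models import Model

    class Product(Model):
        # One row per ASIN: writing an existing ASIN overwrites the row.
        __keyspace__ = keyspace
        asin = columns.Text(primary_key=True, required=True)
        title = columns.Text()

    class ProductScrapeEvent(Model):
        # One row per scrape: the UUID primary key keeps every event.
        __keyspace__ = keyspace
        uuid = columns.UUID(primary_key=True)
        asin = columns.Text()
        title = columns.Text()

    return Product, ProductScrapeEvent


def create_scrape_entry(data, product_model, event_model):
    """Update the product row and append a new scrape event row."""
    product = product_model.create(**data)
    # uuid1 is time-based, so the event also records when it happened.
    event = event_model.create(uuid=uuid.uuid1(), **data)
    return product, event
```

Scraping the same ASIN twice with this sketch would leave one row in the product table and two rows in the scrape event table.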
Using Jupyter with Cassandra Models
When we write everything in the Python shell, it is easy to forget what we wrote or for things to get lost in context. Jupyter Notebook saves us from having to exit the shell and rerun everything. Even if you haven’t used it before, you’ll find that it’s very user-friendly as long as you’re familiar with Python. Jupyter allows rapid iteration and testing, making it extremely useful for working with our Cassandra model and for scraping our web pages. Follow along with our YouTube tutorial to scrape events using Jupyter. On our GitHub, you can also find other methods apart from Jupyter that you can use for every model.
Validating Data and Implementing FastAPI
Validating our data makes our FastAPI application more robust when we start implementing it. pydantic is the best tool to do this in any given response or view, especially if you are converting an old project that is running on FastAPI to use a Cassandra database.
What is neat about pydantic is that if there is a data error, it automatically returns a validation error and gets rid of the unnecessary data. You can also change your requirements easily in pydantic without having to update your Cassandra database. Find all the instructions and code you need for validating data and implementing FastAPI in this video.
Convert Cassandra UUID to pydantic datetime str
There is one critical flaw in the event scraping process: it is missing the actual time an event occurred. You can recover this information from the UUID field that you put in the Cassandra model, which has a time element, or by using the TimeUUID field in the Cassandra driver.
But let’s go ahead and parse this UUID field into an actual date and time object. I created a gist specifically for this challenge which you can find on our GitHub. Just grab the values over to your schema and create a timestamp if you need to by following along with our video tutorial.
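For readers who want to see the idea without opening the gist, here is a minimal standard-library sketch of the conversion. It is not the gist itself: the function name `timeuuid_to_datetime` is our own, and it relies only on the fact that a version 1 UUID stores its timestamp as the number of 100-nanosecond intervals since 15 October 1582.

```python
import uuid
from datetime import datetime, timedelta, timezone

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
UUID_EPOCH_OFFSET = 0x01B21DD213814000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timeuuid_to_datetime(value: uuid.UUID) -> datetime:
    """Return the UTC datetime embedded in a time-based (version 1) UUID."""
    if value.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    # Convert 100-ns intervals to whole microseconds since the Unix epoch.
    microseconds = (value.time - UUID_EPOCH_OFFSET) // 10
    return UNIX_EPOCH + timedelta(microseconds=microseconds)
```

In a pydantic schema, a function like this could be called from a validator on the UUID field to fill in a timestamp field, which is the approach the text describes.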
Now, you’re ready to scrape the data from your local server and send it to your FastAPI project, which will then send it to your Astra DB. By now, you have pretty much built a foundation in terms of storing your data. If you want to add things to your database in the future, you can just add extra fields to your model and update your schema accordingly.
Implementing Celery to Delay and Schedule Tasks
Once you have a foundation, you want to create a worker process that is going to run on a regular basis. You would normally look at a function and call it when you want to execute that function. But what if you want the function to be fully executed in the future or for it to run on a specific day at a specific time?
You can use Celery to both delay tasks from running and schedule those tasks periodically. You can have a huge cluster of computers running these tasks for you without having to use your main web application computer or the current one you’re on, which is amazing.
Celery communicates through messages passed via a broker such as Redis, a key-value store that holds a large number of keys. Redis stores these keys, and Celery inserts and deletes them automatically. Find a complete guide on how to set up Redis on our GitHub.
Once you have set up Redis, follow the instructions in our tutorial to:
- Configure environment variables
- Implement a Celery application as a worker process to delay tasks
- Integrate Cassandra Driver with Celery
- Run periodic tasks on Celery
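The steps above might look roughly like the sketch below. The task name, the five-minute interval, the `redis_url` helper, and the `create_celery_app` factory are all assumptions made for illustration; the Celery import sits inside the factory so the settings can be inspected without Celery installed.

```python
from datetime import timedelta

SCRAPE_TASK_NAME = "scrape_products"  # hypothetical task name


def redis_url(host="localhost", port=6379, db=0):
    """Build the broker URL for a Redis instance."""
    return f"redis://{host}:{port}/{db}"


# Periodic schedule read by Celery beat: run the scrape task every 5 minutes.
BEAT_SCHEDULE = {
    "scrape-products-every-5-minutes": {
        "task": SCRAPE_TASK_NAME,
        "schedule": timedelta(minutes=5),
    },
}


def create_celery_app(broker_url):
    """Create the Celery app, point it at Redis, and register the task."""
    from celery import Celery  # third-party: pip install celery redis

    app = Celery("scraper")
    app.conf.update(broker_url=broker_url, beat_schedule=BEAT_SCHEDULE)

    @app.task(name=SCRAPE_TASK_NAME)
    def scrape_products():
        # Placeholder body: the real task would scrape and store the data.
        return "scrape finished"

    return app
```

With `app = create_celery_app(redis_url())` defined in a worker module, a call such as `app.send_task(SCRAPE_TASK_NAME, countdown=60)` would queue a one-off run delayed by a minute, while the beat schedule takes care of the recurring runs.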
Scraping with Selenium
Now that you know how to delay and schedule tasks, let’s get to the web scraping process using Selenium and ChromeDriver. We are using Selenium specifically to emulate an actual browser. But if it is not installed on your system or if you are having issues with it, feel free to use Google Colab or any kind of cloud service using Notebooks. Follow along with the instructions here:
- Set up Selenium and Chrome driver data
- Scrape events with Selenium
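As a rough idea of what the Selenium step involves, here is a minimal sketch that opens a page in headless Chrome and returns its HTML. The `product_url` and `fetch_page_source` helpers are hypothetical names, and the product URL pattern is an assumption rather than something taken from the video.

```python
def product_url(asin):
    """Build a product page URL from an ASIN (assumed URL pattern)."""
    return f"https://www.amazon.com/dp/{asin}/"


def fetch_page_source(url):
    """Load a page in headless Chrome and return the rendered HTML."""
    # Third-party imports (pip install selenium); Chrome must be available.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned HTML string is what gets handed to the parsing step later on.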
Once you are done with the scraping, remember to use the endpoint from our FastAPI along with ngrok to ingest the data.
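Sending the scraped data to that endpoint can be sketched with the standard library alone. The `/events/scrape` path and both function names are placeholders we made up; substitute the route your FastAPI project exposes and the public URL that ngrok prints for you.

```python
import json
from urllib import request


def build_scrape_request(base_url, data):
    """Prepare a JSON POST request for the ingest endpoint."""
    body = json.dumps(data).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/events/scrape",  # hypothetical route
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_scrape_event(base_url, data):
    """Send the scraped data and return the decoded JSON response."""
    req = build_scrape_request(base_url, data)
    with request.urlopen(req, timeout=10) as response:
        return json.load(response)
```

Keeping the request construction separate from the network call makes it easy to check what will be sent before anything leaves your machine.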
Implement our scrape client parser
It is time to parse our actual HTML string using requests-html. You could use BeautifulSoup or even the Selenium WebDriver instead, but I found requests-html to be easier. Next, implement the Scrape Client Parser, which means adding a system right in the scraping client to handle all of the scraping and parsing. Follow along with the steps here to:
- Parse your data using requests-html
- Set up Scrape Client Parser
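The two steps above could be combined along these lines. This is only a sketch: the `#productTitle` selector is an assumption about the page markup, and `parse_product_html` and `scrape_product` are hypothetical names rather than the ones used in the video.

```python
def parse_product_html(html_str):
    """Extract product fields from an HTML string with requests-html."""
    from requests_html import HTML  # third-party: pip install requests-html

    doc = HTML(html=html_str)
    # Assumed selector for the product title element.
    title_el = doc.find("#productTitle", first=True)
    return {"title": title_el.text if title_el is not None else None}


def scrape_product(asin, fetch, parse=parse_product_html):
    """Fetch a product page, parse it, and tag the result with its ASIN.

    `fetch` is any callable that takes an ASIN and returns an HTML string,
    for example a small wrapper around a Selenium browser session.
    """
    html_str = fetch(asin)
    data = parse(html_str)
    data["asin"] = asin
    return data
```

Passing the fetch and parse steps in as callables keeps the client easy to test in a notebook with canned HTML before pointing it at a live site.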
Then put the validated data, the Celery worker process, and the Scrape Client Parser together, and store the results in the Cassandra database to build up a massive dataset.
In this tutorial, you have learned how to:
- Integrate Astra DB with Python, FastAPI, and Celery
- Set up and configure Astra DB (a managed Cassandra database)
- Use Jupyter with Cassandra models
- Schedule and offload tasks with Celery
- Web scrape on a schedule
- Use Selenium and requests-html to extract & parse data
If we are scraping data for thousands of products, our data set is going to balloon up to become massive. Astra DB is built to handle data at this scale, allowing you to track price changes, observe trends or historical patterns, and really get into the magic of big data. Plus, it is really easy to implement everything on Astra DB!
If you want to learn more about Cassandra or Astra DB, check out DataStax’s complete course on Cassandra and YouTube channel with plenty of tutorials. Also, join the DataStax community and keep a look out for Cassandra developer workshops near you. The Coding for Entrepreneurs YouTube channel also intends to cover Cassandra and Astra DB more frequently in the future, so let us know what you want to see next in the comments!
- Astra DB: Multi-cloud Database-as-a-service Built on Cassandra
- Astra DB Sign Up Link
- What is FastAPI?
- How Discord Stores Billions of Messages
- 30-Days of Python
- YouTube Tutorial: Python, NoSQL & FastAPI Tutorial: Web Scraping on a Schedule
- Coding for Entrepreneurs YouTube Channel
- Coding for Entrepreneurs GitHub
- DataStax Academy: Apache Cassandra Course
- DataStax YouTube Channel
- DataStax Community
- DataStax Cassandra Developer Workshops