Sending Customized Coronavirus U.S. Case Updates Using Web Scrapers and Slack Webhooks

Devin Shah
Published in The Startup · Apr 22, 2020

As the coronavirus pandemic unfolds globally, massive amounts of coronavirus-related data are being published by governments, companies, and newspapers, to the point that getting concise, customized information about the crisis feels overwhelming. A person in India may need updates about cases in their state and a list of grocery stores open locally, whereas a family in San Francisco may need case updates for the city and alerts about hospital capacity with directions to care centers, yet both have to search multiple sites to gather this information. We can alleviate this problem by using web scrapers and Slack webhooks to send personalized messages containing specific data to a Slack channel (or through other forms of communication or data storage).

Web scraping is the process of harvesting and extracting data from websites. It usually involves software spiders, which contain instructions for fetching HTML documents and extracting the desired information. Slack webhooks allow for integrations with Slack channels through applications created using the Slack API. Combining three areas, web scraping, Slack integrations, and periodic tasks, lets us extract and send customized data to a Slack channel:

  1. Scraping: We’ll cover how to use Scrapy to crawl pages and extract certain pieces of information.
  2. Slack Integration: When the information is extracted, sending it to Slack requires a different type of integration through the Slack API. We’ll explore how to properly integrate Slack with our own web scraper.
  3. Periodic Tasks: Without going too much into daemons and task-handling, we’ll cover how to run periodic tasks with web scrapers on your machine. We’ll also look at other options for handling periodic tasks.

By the end of this article, you will be able to scrape coronavirus case data and send it to a Slack channel periodically, similar to the output shown below:

This article assumes that you have some knowledge of programming in Python. In order to send personalized messages to Slack, you also need the following:

  • A Slack account and workspace where you can test features of Slack’s API.
  • The latest version of Python installed on your machine, with your IDE’s interpreter pointing to that version. Scrapy must be installed via pip from the command prompt, and we’ll use the standard-library datetime module for timestamps.
  • Python fluency sufficient to understand this implementation. Here is a useful tutorial: Python tutorial.

Using Scrapy to Extract Data

We’ll be using a site that provides constant updates to U.S. coronavirus case data, which makes it ideal for scraping. In this article, we will extract four important pieces of data for the United States: total cases, new cases, total deaths, and new deaths. There are two main steps to locating the data we want:

Figure 1: Case Table and Inspect Element Page
  1. Inspect Element: Use the inspect element (Ctrl + Shift + I on Windows and Command + Option + I on Mac) to locate the HTML tag that corresponds with what you want to extract. In our example, since the data we want is in a table, I would look in the body of the HTML page for a table with rows corresponding to data points of different countries. Figure 1 shows a highlighted row that corresponds to the row of U.S. values that we want to extract.
  2. Classes: It is important for each tag to have a class or ID associated with it so that the spider knows where to locate the data points. For example, the row with our data points has a class called “even” that allows us to identify our row. However, if we look closer at the row, it looks as if the developer of this website did not identify the specific values using a class name in the U.S. row (shown in Figure 2). We’re in luck that the values are in a table, because we can identify those values by the table cell numbers.
Figure 2: Table Value Locations

In order to extract specific values with our spider, we use a function called response.css(), which lets us select data points from the table. Since we are working with cell indexes, we can call the extract() function and index the result with the cell number in brackets to grab that value. The top-left cell of the table is the first cell (index 0), and by counting cells you can determine the index of the value you want.

Take a look at Figure 3. Ignore the parts about a webhook URL; we will go over exactly what that is when we cover Slack integration. table td::text gives access to the text of the table cells, and the following function extracts the value from that text. The most overlooked but important part of the statement is the encode function. A problem I encountered, which you most likely will too, is that extracted values come surrounded by extra characters that make concatenation difficult; by using the encode function with specific parameters, we can eliminate those extra characters.

Figure 3: Extraction Code
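Since the original screenshot is not reproduced here, below is a minimal sketch of what the extraction might look like. The spider name, start URL, and cell indexes are placeholders, and the encode parameters are one reasonable way to strip the stray characters; count the cells yourself as described above to find the right indexes.

```python
import scrapy

class CovidSpider(scrapy.Spider):
    # Hypothetical spider name and start URL; substitute the site you are scraping.
    name = "covid_us"
    start_urls = ["https://www.example.com/coronavirus/"]

    def parse(self, response):
        # Grab the text of every table cell; the indexes below are placeholders
        # determined by counting cells in the table, as described above.
        cells = response.css("table td::text").extract()

        def clean(value):
            # encode/decode with 'ignore' drops the stray non-ASCII characters
            # that otherwise surround each value and break concatenation.
            return value.encode("ascii", "ignore").decode().strip()

        # Build the message payload in the format Slack's incoming webhooks expect.
        payload = {
            "text": "U.S. Coronavirus Update\n"
                    f"Total cases: {clean(cells[1])}\n"
                    f"New cases: {clean(cells[2])}\n"
                    f"Total deaths: {clean(cells[3])}\n"
                    f"New deaths: {clean(cells[4])}"
        }
        return payload
```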

This makes the output look like the following:

We have just established the format of our data payload! The general outline of a web scraper can be found in the Scrapy documentation, which also provides an excellent explanation of how each component works. The modifications made to the web scraper for Slack integration will be covered later in this article, but first I want to spend some time on scraper etiquette, specifically on avoiding anti-scraping defenses. Below are three of the best practices I use while web scraping:

  1. Don’t slow down sites: If your web scraper hits a site too often while extracting a lot of data, it can slow the website down for its users. Maintain a reasonable time gap between requests and control the amount of data being extracted (the settings sketch after this list shows one way to do this in Scrapy).
  2. Respect Data Copyright Laws: Republishing scraped-data as your own can be considered copyright infringement, so take a closer look at the data you’re scraping and how to use it responsibly.
  3. Reasonable Requests: Incorporate variability into your scraping; otherwise anti-scraping defenses can easily detect the extraction. These defenses are systems built by developers to keep their websites from being scraped, and they rely on pattern recognition (humans do not perform quick, repetitive tasks). Add some random actions so the site does not block you from accessing and extracting data. Here is a link to some advanced web-scraping techniques that can help you avoid these defenses: Advanced Python Scraping.
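One lightweight way to apply the first and third points in Scrapy is through the spider’s settings. The values below are illustrative rather than tuned recommendations, and the spider name and URL are placeholders; all of the settings themselves are standard Scrapy options.

```python
import scrapy

class PoliteCovidSpider(scrapy.Spider):
    name = "covid_us_polite"
    start_urls = ["https://www.example.com/coronavirus/"]

    custom_settings = {
        "DOWNLOAD_DELAY": 5,                  # wait roughly 5 seconds between requests
        "RANDOMIZE_DOWNLOAD_DELAY": True,     # jitter the delay so requests look less robotic
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hammer the site with parallel requests
        "AUTOTHROTTLE_ENABLED": True,         # back off automatically if the server slows down
        "ROBOTSTXT_OBEY": True,               # respect the site's robots.txt
    }

    def parse(self, response):
        pass  # extraction logic goes here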

Slack Integration

We are now going to cover how to send the extracted data to a specific Slack channel. Slack provides an API that makes it easy to send data through scripts and webhooks. There are several steps to creating a Slack app that can receive data, but first I want to go over some basic definitions.

  • Slack Workspace vs. Channel: Channels are smaller streams of communication that operate within the larger workspace.
  • Slack Apps: Customizable functionalities that expand the capabilities and enhance the creativity of a workspace. They’re like CSS to HTML, but more advanced.
  • Incoming Webhooks: Slack’s way of accepting messages from outside apps: a JSON payload (the message) is sent to a unique URL generated when you create a Slack app. A generic example of a webhook URL is shown at the bottom of Figure 4.
Figure 4: Generic Slack Webhook URL

Slack’s documentation provides a thorough layout of the process, but I want to focus on the integration: How can I get my Python spider to connect with my Slack App?

While this task sounds daunting, using the requests.post() function with the webhook URL as a parameter lets us send the JSON payload as a message to a specific channel. Through Slack’s website, you can generate a webhook URL for a particular channel and use it in your code. The requests.post() call is shown in Figure 3. It comes from Python’s requests library and issues an HTTP POST request, which asks a web server to accept the data in the payload (Slack’s servers then route it to the channel identified by the webhook URL). Before using the function, make sure requests is installed and up to date; you can install it with pip from your IDE’s terminal.

Our version of the call takes three arguments: the webhook URL, json, and headers. The URL is where the message is delivered, the json parameter carries the data payload, and headers is set to {'Content-Type': 'application/json'}, which tells Slack’s server that the body of the request is JSON. Put together, this lets the scraper crawl and extract data, then send that data through a webhook to the desired Slack channel.
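Here is a minimal sketch of that call. The webhook URL below is the generic placeholder used in Slack’s documentation; substitute the URL generated for your own channel, and assume payload is the dictionary we built while scraping.

```python
import requests

# Generic placeholder from Slack's documentation; replace with your channel's webhook URL.
webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

# The payload built by the spider; shown here with elided values for illustration.
payload = {"text": "U.S. Coronavirus Update\nTotal cases: ...\nNew cases: ..."}

response = requests.post(
    webhook_url,                                   # where Slack should deliver the message
    json=payload,                                  # requests serializes the dict to JSON for us
    headers={"Content-Type": "application/json"},  # tell Slack's server the body is JSON
)
```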

Error Handling: What if you created an app for a channel that no longer exists? What if your app isn’t authorized to operate in a specific workspace? Trying to catch errors with the web scraper within Slack is problematic because Slack doesn’t provide error logs for integrations. Instead, it would be best to build in error handling inside of the scraper class itself.

Figure 5: Error Handling for Slack Integration

Figure 5 shows a code snippet of the conditional statement used for error handling. The condition is based on HTTP response status codes, which indicate whether a given HTTP request completed successfully; they can help you narrow down where and how an error occurred. There are five classes of status codes: informational responses (100–199), successful responses (200–299), redirects (300–399), client errors such as incorrect syntax or user error (400–499), and server errors (500–599).

We must account for all possible errors, so our conditional statement checks whether the status code is anything other than 200 (success for a POST request), and only then does it send an error message. It raises a ValueError containing the error message, which is printed in the terminal.
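Since the snippet in Figure 5 is not reproduced here, a sketch of that check might look like the following; it continues from the requests.post() sketch above, and the exact wording of the error message is up to you.

```python
import requests

response = requests.post(
    webhook_url,
    json=payload,
    headers={"Content-Type": "application/json"},
)

# Anything other than 200 means Slack did not accept the message
# (for example, the channel was deleted or the app is not authorized for the workspace).
if response.status_code != 200:
    raise ValueError(
        f"Request to Slack returned an error {response.status_code}: {response.text}"
    )
```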

Yes, it’s that simple. It took me a while to find a variation of the requests.post() function with those specific parameters, but with it I was able to send a Slack message to a specific channel. That only gets us a single message, though; to send Slack messages periodically, we need a bit more machinery.

Periodic Tasks

We just went over how to send a single message to Slack with scraped data in one code execution, but what if you want to run it multiple times? There are two ways to do this:

  • Twisted Reactor or Scrapy-Do: Both of these yield control and sleep until certain events fire. Using the deferLater() function, we can put the process to “sleep” while periodically waking it up to scrape data.
  • Daemons: Computer programs that run as background processes without any user interaction. The system starts them at boot, and they can then respond to HTTP requests (such as requests to scrape data), hardware and firmware activity, and other tasks.

For the purposes of this task, it is simplest to use the Twisted reactor to run our scraper on a schedule. Daemons and server-hosted tasks offer more flexible scheduling, but Twisted is easier and faster to implement. We will use twisted.internet, the part of Twisted that provides the event loop and the code that dispatches events. The reactor sits at the core of that event loop, waiting for events and executing the tasks in a program. twisted.internet also provides a function called deferLater(), which takes three main arguments: the reactor, the number of seconds to delay, and a callable to run when the delay expires (often just a no-op lambda when all we want is the pause). Together, these fire the result of the callable (our scraping task) once the specified time has elapsed. twisted.internet and deferLater() are the key to executing periodic tasks.

While the spiders in Scrapy’s documentation are executed as one-off jobs, we will run ours through Twisted so that it can sleep between crawls. Figure 6 below shows a code snippet containing the sleep and crawl functions used by our scraper.

Figure 6: Sleep and Crawl Functions

The sleep function wraps deferLater in a form that is easier to reuse from other functions. The process variable just gathers the information needed for the crawling task, which matters in our crawl function. The crawl function is where the main action happens: it initiates the crawling process. Our goal with this function is to delay the process after it has executed once, and to keep delaying it after every future execution. We add two callbacks (functions that don’t execute until the previous one has finished): one that pauses the spider for a set period of time (three hours in this case) and one that starts the next crawl. We then return the deferred so that it can be accessed during task execution.
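Since Figure 6 is not reproduced here, the following sketch shows one way to wire this up with Scrapy’s CrawlerRunner and Twisted’s deferLater. The spider class name and the three-hour interval are carried over from the discussion above; everything else is an assumption about how the original code was structured.

```python
from twisted.internet import reactor, task
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


def sleep(seconds=3 * 60 * 60):
    # Wrap deferLater for reuse: return a Deferred that fires after `seconds`,
    # running a no-op callable once the delay expires.
    return task.deferLater(reactor, seconds, lambda: None)


def crawl(runner):
    # Start one crawl, then chain callbacks: sleep for three hours, then
    # schedule the next crawl. Each callback runs only after the previous
    # step has finished.
    deferred = runner.crawl(CovidSpider)  # CovidSpider is the spider class sketched earlier
    deferred.addCallback(lambda _: sleep())
    deferred.addCallback(lambda _: crawl(runner))
    return deferred


configure_logging()
runner = CrawlerRunner()  # gathers the information needed to run our crawling task
crawl(runner)
reactor.run()             # the reactor is the event loop that keeps everything alive
```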

We can now send periodic updates to Slack! You should see output similar to what is shown below.

So how would you run a process like this? You can use the methods above in two ways: running the process locally (on your desktop) or hosting it on a server. Running it on your desktop is simple and fast, but it must be constantly monitored; if the IDE restarts or the computer shuts down, the scraper stops extracting data. If you want your process to run unattended, you can host it on a server; AWS, Microsoft Azure, and Google Cloud Platform all offer this, though usually at a cost.

Conclusion

In this article, we have covered how to extract customized data using web scrapers, how to send messages to Slack, and how to run this task periodically on your machine. You can now send personalized data to your Slack channels, and I hope this helps keep you updated during this unprecedented coronavirus crisis while streamlining the data to meet your needs. Below are some helpful tips:

  • Remember to be aware when collecting and extracting data; don’t clog websites and respect copyright laws.
  • Include error handling even if it seems unnecessary; it definitely helps pinpoint errors with your scraper.
  • Keep your Slack webhook URL secure; anyone who has the link can post to your channel if you are not careful with it.

Here is the link to my GitHub repository, which contains the spider that sends U.S. case updates to Slack. Feel free to download and modify it, and I’d appreciate your feedback.

Happy scraping and stay healthy!

Devin Shah
Science and Tech Enthusiast, Research Intern @ Stanford School of Medicine