The premise of web scraping has always been the same. You use a library like Selenium or Puppeteer to load your target application and navigate through it using hacky DOM queries and screen parsing. I have never written a scraper for an application that provided any real reusability. That’s where Krawlr comes in. Krawlr is an event-driven web scraping library that breaks a scrape into multiple reusable components, which can then be chained together to accomplish tasks in an application. Krawlr also adds another way to extract data from an application: analyzing the network requests that the browser makes in response to stimuli from the scraping process. Krawlr was written with the intention of making web scraping a much more organized process, providing a way to abstract parts of a scrape into reusable subcomponents that can be modified easily when applications change.
Krawlr’s Library Architecture
The main component of Krawlr is the Activity. An Activity represents any given task that a scrape sets out to achieve, such as retrieving the statistics for a target tweet or the posts on a user’s timeline. An Activity needs two things to achieve a task: the Schedule and the Lifecycle. The Schedule contains the optional cron schedule (if it is null, the Activity runs once) and the callback to deliver the data to. The Lifecycle is the heart of the Activity and houses the components that make up the scrape. These components are known as Lifecycle events and come in two kinds, the Actor and the Extractor. The Actor is an event that stimulates the user interface for data extraction, such as scrolling through a timeline or clicking a button. The Extractor is an event that takes the data acquired from Actors and returns data to be delivered at the end of the Lifecycle through the callback defined in the Schedule.
The Lifecycle consists of two stages, the prep stage and the stimulus stage. The prep stage is the initial stage. It runs through its Lifecycle events only once and prepares the application for continuous data extraction in the stimulus stage, for example by loading the profile page for a targeted user. The stimulus stage executes once if no cron schedule is provided, or repeatedly if one is, and each time it runs through its events and collects any data provided by Extractors. At the end of the stimulus stage, the data is delivered to the provided callback if any is available. A stimulus run may well yield no data for delivery, for example when continually polling a Twitter account for new tweets.
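To make the two stages easier to picture, here is a minimal, synchronous model of the idea. It is only a sketch: every name in it (`Actor`, `Extractor`, `runStage`, `runActivity`, the `ticks` parameter standing in for a cron schedule) is hypothetical and is not taken from Krawlr’s actual API, and real browser work would be asynchronous.

```typescript
// Hypothetical, simplified model of the concepts described above.
// None of these names are taken from Krawlr's real API.

type Captured = string[]; // stand-in for whatever Actors accumulate

interface Actor {
  kind: "actor";
  act(captured: Captured): void;
}

interface Extractor<T> {
  kind: "extractor";
  extract(captured: Captured): T[];
}

type LifecycleEvent<T> = Actor | Extractor<T>;

interface Lifecycle<T> {
  prep: LifecycleEvent<T>[]; // runs once
  stimulus: LifecycleEvent<T>[]; // runs once per scheduled tick
}

// Run a list of events in order and return whatever the Extractors produced.
function runStage<T>(events: LifecycleEvent<T>[], captured: Captured): T[] {
  const out: T[] = [];
  for (const event of events) {
    if (event.kind === "actor") {
      event.act(captured);
    } else {
      out.push(...event.extract(captured));
    }
  }
  return out;
}

// Run prep once, then the stimulus stage `ticks` times.
// Delivery only happens for ticks that actually produced data.
function runActivity<T>(
  lifecycle: Lifecycle<T>,
  ticks: number,
  deliver: (data: T[]) => void
): void {
  const captured: Captured = [];
  runStage(lifecycle.prep, captured); // prep output is not delivered here
  for (let i = 0; i < ticks; i++) {
    const data = runStage(lifecycle.stimulus, captured);
    if (data.length > 0) deliver(data);
  }
}
```

In this model a polling Activity whose stimulus finds nothing new on a given tick simply skips the callback for that tick, which mirrors the "no data for delivery" case described above.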
Once you have created a full Activity with all of its required components, you can create an instance of the Crawler. A Crawler schedules Activities and houses the main instance of Puppeteer, which provides each Activity with the browser tab its scraper will live in. Since a Crawler contains only one browser instance, Activities can rely on authentication performed by other Activities, because they share session cookies. This makes it possible to reconstruct APIs for social media applications using Krawlr.
Here is a sketch of the Activity architecture:
An Example: Grabbing Tweet Statistics
Here’s an example of building an Activity that will extract statistics for a Tweet given the username of the user that posted the tweet and the ID of the tweet. We will break this process into three lifecycle events:
- Navigate to the desired tweet
- Scroll to the bottom of the conversation to see all the replies
- Extract the replies and the tweet itself from the network requests made as a result of the prior action
First we will create the event that navigates to the desired tweet, which is done in Krawlr through the NavigationEvent. A NavigationEvent is constructed with a handler that is given a reference to the Activity and will return a string for the page to navigate to. Here is the code creating a navigation event that navigates to the desired tweet:
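As a rough illustration of what such a handler computes, the sketch below builds the tweet’s URL from the two inputs as a plain function. The `TweetTarget` interface and the `tweetUrl` name are hypothetical and stand in for whatever the Activity actually carries; how the function is wired into a NavigationEvent is not shown because it depends on Krawlr’s real signatures.

```typescript
// Hypothetical inputs an Activity might carry; not Krawlr's real types.
interface TweetTarget {
  username: string;
  tweetId: string; // kept as a string: tweet IDs exceed Number.MAX_SAFE_INTEGER
}

// Build the permalink for a tweet. encodeURIComponent guards against
// stray characters in either field.
function tweetUrl(target: TweetTarget): string {
  const user = encodeURIComponent(target.username);
  const id = encodeURIComponent(target.tweetId);
  return `https://twitter.com/${user}/status/${id}`;
}
```

Keeping the URL logic in a small pure function like this makes it easy to test on its own, independently of the browser.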
Next we will create an Actor that scrolls through the conversation timeline and stimulates the browser to make the requests that retrieve all of the tweet information. The Actor keeps scrolling until the browser does not allow it to scroll any further, and it can be reused for many different applications. Here is the code:
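The core loop of such an Actor can be sketched independently of any browser. The version below assumes a hypothetical `Scrollable` interface whose `scrollDown()` resolves to the scroll offset after scrolling; with Puppeteer that could be backed by a `page.evaluate` call, but that wiring is an assumption and is left out so the sketch stays self-contained. The `maxRounds` cap is also an addition of mine, to keep a truly endless feed from looping forever.

```typescript
// Hypothetical minimal view of a page: scroll once and report the new offset.
interface Scrollable {
  scrollDown(): Promise<number>;
}

// Keep scrolling until the offset stops changing (the page will not scroll
// any further) or until maxRounds is reached. Returns the number of scroll
// calls that were made.
async function scrollUntilStuck(
  page: Scrollable,
  maxRounds = 50
): Promise<number> {
  let previous = -1; // no real offset is negative, so round 1 never matches
  for (let round = 1; round <= maxRounds; round++) {
    const offset = await page.scrollDown();
    if (offset === previous) return round; // offset unchanged: we are stuck
    previous = offset;
  }
  return maxRounds;
}
```

Against a real page you would likely also wait briefly after each scroll so lazily loaded content has time to arrive before the next offset is compared.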
Finally, we will create the Extractor that extracts the desired tweet data from all the requests made as a result of the prior action. This is done using the NetworkAnalyzer in Krawlr, which is a handler that takes in all the requests and responses made by the browser and returns data to be delivered to the Activity callback. Here is the code that extracts the tweets from the conversation requests:
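The general shape of this kind of handler is "filter the captured responses by URL, parse the bodies, pull out the records you care about". The sketch below shows that shape only. The `CapturedResponse` and `TweetStats` types, the `extractTweets` function, and above all the `{ tweets: [...] }` payload layout are assumptions made for illustration; Twitter’s real responses and Krawlr’s real NetworkAnalyzer signature will differ.

```typescript
// Hypothetical shape of a captured response; Krawlr's real type may differ.
interface CapturedResponse {
  url: string;
  body: string; // raw response text
}

interface TweetStats {
  id: string;
  text: string;
  likes: number;
}

// Keep only responses whose URL contains urlFragment, parse them as JSON,
// and collect well-formed tweets. The { tweets: [...] } layout is an
// assumed payload shape, not Twitter's actual one.
function extractTweets(
  responses: CapturedResponse[],
  urlFragment: string
): TweetStats[] {
  const tweets: TweetStats[] = [];
  for (const res of responses) {
    if (!res.url.includes(urlFragment)) continue;
    let parsed: unknown;
    try {
      parsed = JSON.parse(res.body);
    } catch {
      continue; // skip bodies that are not valid JSON
    }
    if (typeof parsed !== "object" || parsed === null) continue;
    const list = (parsed as { tweets?: unknown }).tweets;
    if (!Array.isArray(list)) continue;
    for (const item of list) {
      if (
        item &&
        typeof item.id === "string" &&
        typeof item.text === "string" &&
        typeof item.likes === "number"
      ) {
        tweets.push({ id: item.id, text: item.text, likes: item.likes });
      }
    }
  }
  return tweets;
}
```

Because the function only sees plain data, the same extractor can be pointed at responses produced by any Actor, which is the decoupling the library is built around.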
Now that we have created the Lifecycle events we need, we can construct the Activity itself. The Activity class in Krawlr is an abstract class with one abstract method, the setup method. The setup method must be implemented and is responsible for constructing the Lifecycle for the Activity. The Activity we create for extracting tweet information is shown below:
We can schedule this Activity by creating an instance of it along with a Crawler object, and then scheduling the Activity for execution. An example of the code for this is shown below:
Another Interesting Use Case
Another use case for Krawlr, one that isn’t easily possible with the traditional web scraping approach, is a polling analysis of the requests that the browser makes as you use an application yourself. For example, Krawlr can keep track of any tweets retrieved from XHR requests while you browse, so that you end up with a record of all the tweets you viewed during your time in the application. Below is code for an analyzer that extracts tweets from specific XHR requests that Twitter makes. Below the Analyzer is an example of an Activity that reuses some previously defined Network Analyzers to extract even more data as you use the application. There is an empty Actor in the Lifecycle so that Krawlr knows to stash the network requests the browser makes. Krawlr is firmly based on the decoupling of Extractors and Actors, and Actors are the ones that accumulate the data. Here is the code for the entire Activity:
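One practical detail with this kind of passive polling is that the same tweet tends to show up in many responses as you browse. A small accumulator on the callback side can keep the record free of duplicates. The `TweetLog` class and `SeenTweet` type below are hypothetical helpers written for this sketch and are not part of Krawlr.

```typescript
// Hypothetical record of a tweet seen while browsing.
interface SeenTweet {
  id: string;
  text: string;
}

// Accumulates tweets across polling runs, keyed by tweet ID.
class TweetLog {
  private readonly byId = new Map<string, SeenTweet>();

  // Record a batch and return only the tweets that were not seen before.
  record(batch: SeenTweet[]): SeenTweet[] {
    const fresh: SeenTweet[] = [];
    for (const tweet of batch) {
      if (!this.byId.has(tweet.id)) {
        this.byId.set(tweet.id, tweet);
        fresh.push(tweet);
      }
    }
    return fresh;
  }

  // Number of distinct tweets seen so far.
  get size(): number {
    return this.byId.size;
  }
}
```

A callback could pass each delivery through `record` and persist only the returned tweets, so a polling run that surfaces nothing new writes nothing.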
Krawlr is a web scraping library built on top of Puppeteer that aims to break web scraping into two parts: data stimulation and data extraction. Separating these events allows components of a scrape to be reused, which can increase development speed when writing a scraper for an application. I hope that this library will change the way you think about developing a scraper, shifting your design ideology toward building components that can be easily reused or changed. Feedback on the library would be greatly appreciated!
The code for this example is on my GitHub at this URL:
And the Krawlr library is located here: