Creating an Automated Text Extraction Workflow — Part 1

LogicWriter · Published in The Startup · May 5, 2020

Disclaimer: I’m the co-founder at extractorapi.com

Developing a workflow that automatically extracts relevant text from URLs can be laborious.

Whether you’re collecting articles or blog posts for a dataset, scanning press releases to spot mentions of competitors, or compiling a company news feed, the process of gathering URLs, extracting the text, and storing the results requires quite a bit of hand-holding.

What Tools Are Out There?

To actually gather your URLs, you might use the powerful News API (they have a decent free tier for non-commercial use), or the well-priced and speedy Newscatcher API (sign up for their beta).

If you’re crawling the same sources repeatedly, your IP might eventually be blocked, so you’ll need a service that offers proxy rotation. In our experience, scraperapi.com has been solid at fetching HTML while handling proxies and retries.

You’ll also need something that reliably extracts clean, boilerplate-free text from the HTML. The most robust free option I’ve come across is newspaper3k; if you’re at home in the Python world, it’s the requests of text extraction. For the most part, it’s accurate and fast.

The 600 lb gorilla, Diffbot, comes with a swath of solid APIs but starts at $300, which is ridiculous if you’re just extracting text. Scrapinghub’s News API, Extractor API, and plenty of others are better priced if you want an affordable alternative; plus, Extractor API includes a visual online tool for extracting hundreds of articles at once, if you’d rather do things via a UI.

Finally, you need to store the results somewhere — a local or hosted database you can easily query, like DigitalOcean’s Managed Databases or AWS’s DynamoDB. You can also use Extractor API to store all your extracted text and easily query your jobs via a RESTful API — we’ll dive into that soon.

The Technology We’ll Be Using

In this guide, we’ll be using the News API to gather relevant URLs and Extractor API to extract relevant data and store our work for later querying.

To use News API, head over to their pricing page and sign up for the free Developer plan. For Extractor API, the pricing page includes a free plan, which we’ll be using today.

When you’re set, let’s dive in!

Gathering the URLs

We’ll be using News API’s free tier to gather news URLs mentioning artificial intelligence, and I’ll be including some of my favorite tech news sources: Ars Technica, Wired, Bloomberg, Fast Company, MIT Technology Review, Gizmodo, and others.

Here’s how to set it up (News API also has a Python client setup guide in their docs):
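(A minimal sketch with the official newsapi-python client; the API key is a placeholder and the domain list simply mirrors the sources mentioned above.)

```python
from newsapi import NewsApiClient

# Placeholder key; use the key from your News API Developer account
newsapi = NewsApiClient(api_key="YOUR_NEWS_API_KEY")

# Search the "everything" endpoint for AI coverage from a few favorite tech outlets
response = newsapi.get_everything(
    q="artificial intelligence",
    domains=(
        "arstechnica.com,wired.com,bloomberg.com,"
        "fastcompany.com,technologyreview.com,gizmodo.com"
    ),
    language="en",
    sort_by="publishedAt",
    page_size=100,  # 100 is the maximum per request
)

urls = [article["url"] for article in response["articles"]]
print(len(urls), "URLs collected")
```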

Extracting and Storing the Text

Now that we have our 58 target URLs, it’s time we started extracting text and storing the results for later use. Extractor API helps us do this easily with the help of the Jobs endpoint.

If you take a look at the documentation, you’ll see the Create Jobs endpoint, which allows you to feed the API a list of URLs under a job_name of your choice.

A few notes about Jobs:

  • Extraction happens server-side, so you can check the progress via the API. When all the URLs in the job have finished processing, you’ll get a 100% completion status (I’ll show you how to query for this).
  • You can fetch the paginated list of URLs (along with extracted text, titles, and so on) at any time for downstream processing using a simple GET request. The extracted data is securely stored on the Extractor API server, so you’ll have quick access to your crawls.
  • You can access your jobs (and see their status) on the Jobs page on the Extractor API website, and download results in either .json or .csv format.

Once you’re logged in, you can head over to your Dashboard, where you can retrieve your API key.

Here’s how we go about creating a job with the Extractor API (we’re continuing the script from above):
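(The endpoint path, auth header, and response fields below are assumptions on my part; check the Create Jobs docs for the exact schema.)

```python
import requests

EXTRACTOR_API_KEY = "YOUR_EXTRACTOR_API_KEY"  # grab this from your Dashboard

# Assumed endpoint path and auth scheme; confirm against the Jobs documentation
JOBS_ENDPOINT = "https://extractorapi.com/api/v1/jobs/"
HEADERS = {"Authorization": f"Bearer {EXTRACTOR_API_KEY}"}

payload = {
    "job_name": "ai_news",  # any job name you like
    "urls": urls,           # the list we gathered with News API above
}

resp = requests.post(JOBS_ENDPOINT, json=payload, headers=HEADERS)
resp.raise_for_status()
job = resp.json()
job_id = job["id"]  # assumed field name for the job's identifier
print(job)
```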

We can check the status programmatically:
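(Again assuming the job detail lives under jobs/&lt;id&gt;/.)

```python
# A GET on the job itself (path assumed) returns its details, including progress
status_resp = requests.get(f"{JOBS_ENDPOINT}{job_id}/", headers=HEADERS)
status_resp.raise_for_status()
print(status_resp.json())  # look for the completion field to hit 100%
```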

Or we can head over to the Jobs page (under the Online Text Extractor dropdown in the menu). We can see that all but one of the URLs were successfully processed.

Querying Your Job

Once your job is done, you can inspect the results using the Job URLs endpoint. Just add /urls (no trailing slash needed) and you’ll get a paginated list of results.

count is the total number of results (in our case, 57 successfully extracted URLs and one error). By default, each “page” shows 10 results. To access the next or previous pages, simply use the URL found in the next or previous key.

For example, to collate all extracted text into a Python list, you could do this:
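(The results and text field names below are assumptions; check the Job URLs response schema in the docs.)

```python
all_texts = []

# Walk the paginated Job URLs endpoint; "next"/"previous" hold the page links,
# while the "results" and "text" field names here are assumed
page = f"{JOBS_ENDPOINT}{job_id}/urls"

while page:
    resp = requests.get(page, headers=HEADERS)
    resp.raise_for_status()
    data = resp.json()
    for item in data["results"]:
        if item.get("text"):      # skip the one URL that errored out
            all_texts.append(item["text"])
    page = data.get("next")       # None once we reach the last page

print(f"Collected text from {len(all_texts)} articles")
```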

Filtering Your Results

You might not want all the articles in your job — you might need a specific subset, say, all those that contain the word “acquisition” in the full text. In that case, you can apply a filter parameter:
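(Endpoint path and auth are assumed as in the earlier snippets.)

```python
# Case-insensitive full-text filter: only articles mentioning "acquisition"
params = {"text__icontains": "acquisition"}

resp = requests.get(f"{JOBS_ENDPOINT}{job_id}/urls", params=params, headers=HEADERS)
resp.raise_for_status()
filtered = resp.json()
print(filtered["count"], "articles mention 'acquisition'")
```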

For the Job URLs endpoint, you can search the title and text for any keyword. For the List Jobs endpoint, you can search the job name.

To construct the parameter, simply append __contains (for a case-sensitive match) or __icontains (for a case-insensitive match) to the field name. A few examples: text__icontains=facebook (for the Job URLs endpoint), title__contains=AI (for the Job URLs endpoint), and job_name__contains=my_articles (for the List Jobs endpoint).

Tools for Automation

That wraps up this part. As we saw above, with just two APIs you can piece together a mechanism for retrieving relevant URLs, extracting and saving their text, and then querying your jobs.

Next, we’ll be looking at interesting methods to automate the URL retrieval and text extraction process. Till next time!

Catch me on YouTube with my new channel, LogicWriter.
