How to download and scrape 10-K filings from SEC EDGAR

Jay
5 min read · Oct 28, 2022


This tutorial shows you how to download and scrape 10-K filings from SEC EDGAR to your local disk. We use Python 3 and the SEC-API.io Python package to help us find the links to all 10-K filings on EDGAR and then download them.

Our SEC filings download application will be structured into two components:

  1. The first component of our Python application finds all URLs of 10-K filings on EDGAR filed between 1995 and 2022. We also consider older 10-K variants, that is, 10-KT, 10KSB, 10KT405, 10KSB40, and 10-K405, as well as all amended filings, for example 10-K/A. Once we have generated a complete list of all 10-K filing URLs, we save the list to a file on our hard disk.
  2. The second component reads the URLs from the file, and downloads all annual reports. We download up to 20 filings in parallel using the Render API of the SEC-API package and use Python’s multiprocessing package to speed up the download process.

The complete Jupyter notebook is available here. You can also run the code directly in Google Colab here.

Getting Started

Let’s start by installing the SEC-API Python package.

pip install sec-api

Head over to https://sec-api.io to get your free API key so that we can start searching the SEC EDGAR database for 10-K filings.

The Query API is a search interface that allows us to search and find SEC filings across the entire EDGAR database by any filing metadata parameter. For example, we can find all filings filed by Microsoft using a ticker search (ticker:MSFT) or build more complex search expressions using Boolean and bracket operators. The Query API returns the metadata of SEC filings matching the search query, such as filer details (e.g. ticker and company name), URLs to the filing and all its exhibits, the filing date, the form type, and more.

We’re looking for all filings with form type 10-K and its variants: 10-KT, 10KSB, 10KT405, 10KSB40, 10-K405. So, the Query API form type filter comes in handy. The search query string looks like this:

formType:("10-K", "10-KT", "10KSB", "10KT405", "10KSB40", "10-K405")

The brackets tell the Query API to include a filing in the response if the form type is either 10-K, or 10-KT, or 10KSB, and so on.

Let’s start off simple by finding the most recently filed 10-K filing from Tesla.
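A minimal sketch of that first query could look like the following. YOUR_API_KEY is a placeholder for your own key, and the exact payload format may vary slightly between versions of the sec-api package:

```python
from sec_api import QueryApi

queryApi = QueryApi(api_key="YOUR_API_KEY")

# find Tesla's most recently filed 10-K
query = {
    "query": {"query_string": {"query": 'ticker:TSLA AND formType:"10-K"'}},
    "from": "0",
    "size": "1",
    "sort": [{"filedAt": {"order": "desc"}}],
}

response = queryApi.get_filings(query)
```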

The response of the Query API is a Python dictionary (dict for short) with two keys: total and filings.

The value of total is a dict itself and tells us, among other things, how many filings in total match our search query. The value of filings is a list of dicts, where each dict represents all meta data of a matching filing.

We use the json Python package to pretty-print the first filing to the console to explore the structure of a filing dict.
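Continuing from the snippet above, a quick way to do that:

```python
import json

# number of filings matching the query
print(response["total"])

# pretty-print the metadata of the first matching filing
print(json.dumps(response["filings"][0], indent=2))
```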

The URL of the 10-K filing is the value of the `linkToFilingDetails` key in each filing dict, for example:

https://www.sec.gov/Archives/edgar/data/1318605/000119312514069681/d668062d10k.htm

The response also includes information such as the filer’s ticker and CIK, the company name, and the links and types of all filing attachments (e.g. XBRL files). If you wanted to download, let’s say, only the XBRL attachments of 10-K filings, you could use the same approach we implement here.

To generate a complete list of 10-K URLs, we simply iterate over all filing dicts, read the linkToFilingDetails value, and write each URL to a local file.
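For a single batch of results, that step could look roughly like this (filing_urls.txt is the log file we use throughout this tutorial):

```python
# collect the filing URLs of the current batch ...
urls = [filing["linkToFilingDetails"] for filing in response["filings"]]

# ... and append them to the log file, one URL per line
with open("filing_urls.txt", "a") as log_file:
    log_file.write("\n".join(urls) + "\n")
```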

One more thing: the Query API returns a maximum of 200 filings per search request and a maximum of 10,000 filings per search universe. That’s why we paginate over the search results, i.e. we request the first “page” of 200 matches, then the second “page”, and so on, until we have iterated through all filings filed between 1995 and 2022.

1. Generate a list of URLs of all 10-K filings

This chapter implements the first of our two components and explains how to generate the list of 10-K URLs and save the list to a file.

The following base_query is reused and updated on each request, allowing us to page through all results in the next part of the code.
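It might look like this. The structure follows the Query API examples shown earlier; "PLACEHOLDER" is just a dummy string that gets overwritten before each request:

```python
base_query = {
    "query": {"query_string": {"query": "PLACEHOLDER"}},  # replaced before every request
    "from": "0",      # updated to page through the results, 200 filings at a time
    "size": "200",    # maximum number of filings per request
    "sort": [{"filedAt": {"order": "desc"}}],
}
```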

On each search request, the PLACEHOLDER in the base_query is replaced with our form type filter and with a date range filter. The complete Python code for downloading all URLs of filings filed between 1995 and 2022 is shown and explained below.
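The sketch below is one possible implementation and may differ in detail from the original notebook. It reuses the queryApi client and base_query defined above and uses a filedAt date-range filter to restrict each request to a single year:

```python
form_types = '"10-K", "10-KT", "10KSB", "10KT405", "10KSB40", "10-K405"'

# Quick-running example: the year 2022 and the first two "pages" only.
# Swap in the two commented-out lines to cover 1995-2021 as well.
# for year in range(2021, 1994, -1):
#   for from_batch in range(0, 9800, 200):
for year in range(2022, 2021, -1):
    for from_batch in range(0, 400, 200):
        # update the reusable base_query for this year and result page
        base_query["from"] = str(from_batch)
        base_query["query"]["query_string"]["query"] = (
            f"formType:({form_types}) AND "
            f"filedAt:[{year}-01-01 TO {year}-12-31]"
        )

        response = queryApi.get_filings(base_query)

        # stop paging once a year returns no more filings
        if len(response["filings"]) == 0:
            break

        # append this batch of filing URLs to the log file
        urls = [f["linkToFilingDetails"] for f in response["filings"]]
        with open("filing_urls.txt", "a") as log_file:
            log_file.write("\n".join(urls) + "\n")
```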

Be aware that it takes some time to download and save all URLs. Plan at least 30 minutes for running your application without interruption.

The URL downloader appends the newly found URLs to the log file filing_urls.txt on each processing iteration. In case you accidentally shut down your application, you can start off from the most recently processed year without having to download already processed URLs again.

> Uncomment the two lines below in your code if you want to generate all URLs at once. I deliberately commented them out to provide a quick running example of the entire code without having to wait 30+ minutes to see results.

for year in range(2021, 1994, -1):
    for from_batch in range(0, 9800, 200):
Results of running the URL downloader

2. Download all 10-Ks from SEC EDGAR

The second component of our filing download application loads all 10-K URLs from our log file filing_urls.txt into memory, and downloads 20 filings in parallel into the folder filings.

We use the Render API interface of the SEC-API Python package to download a filing by providing its URL. The Render API allows us to download up to 40 SEC filings per second in parallel. However, we don’t utilize the full bandwidth of the API because we would otherwise very likely run into out-of-memory errors, considering some filings are 400+ MB in size.

The download_filing function downloads the filing from the URL, generates a file name using the last two parts of the URL and saves the downloaded file to the filings folder.
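A possible implementation looks like this. get_filing is the Render API call of the sec-api package; the exact file-naming scheme and error handling here are my own choices and may differ from the notebook:

```python
import os
from sec_api import RenderApi

renderApi = RenderApi(api_key="YOUR_API_KEY")  # placeholder key

def download_filing(url):
    try:
        # fetch the filing's HTML through the Render API
        file_content = renderApi.get_filing(url)

        # build a file name from the last two URL parts, e.g.
        # .../000119312514069681/d668062d10k.htm -> 000119312514069681-d668062d10k.htm
        url_parts = url.split("/")
        file_name = url_parts[-2] + "-" + url_parts[-1]

        # the "filings" folder is created in download_all_filings below
        with open(os.path.join("filings", file_name), "w") as f:
            f.write(file_content)
    except Exception as e:
        print(f"Problem downloading {url}: {e}")
```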

The load_urls function reads the text content from the previously generated filing_urls.txt file, and creates a list of URLs by splitting the text content at each new line character ("\n").
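One way to write it:

```python
def load_urls():
    # read the log file and split it into one URL per line,
    # dropping any empty trailing lines
    with open("filing_urls.txt", "r") as f:
        urls = [url for url in f.read().split("\n") if url.strip()]
    return urls
```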

The download_all_filings function is the heart and soul of our application. Here, Python's built-in multiprocessing.Pool allows us to apply a function to a list of values in parallel. This way we can apply the download_filing function to the values of the urls list in parallel.

For example, setting number_of_processes to 4 results in 4 download_filing functions running in parallel where each function processes one URL. Once a download is completed, multiprocessing.Pool gets the next URL from the URLs list and calls download_filing with the new URL.

We use a small slice of URLs (urls = load_urls()[1:40]) to quickly test the code without having to wait hours for the download to complete. Uncomment the line urls = load_urls() to process all URLs instead.
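Putting the pieces together, download_all_filings might look like the sketch below. It reuses load_urls and download_filing from above, and number_of_processes = 20 mirrors the 20 parallel downloads mentioned earlier:

```python
import multiprocessing

def download_all_filings():
    print("Start downloading all filings ...")

    # make sure the target folder exists
    if not os.path.isdir("filings"):
        os.makedirs("filings")

    # urls = load_urls()          # uncomment to process all URLs
    urls = load_urls()[1:40]      # small slice for a quick test
    print(f"{len(urls)} filing URLs loaded")

    number_of_processes = 20

    # download up to 20 filings in parallel
    with multiprocessing.Pool(number_of_processes) as pool:
        pool.map(download_filing, urls)

    print("All filings downloaded")

# the __main__ guard is required when running this as a plain script
# (it is not strictly necessary inside a Jupyter notebook)
if __name__ == "__main__":
    download_all_filings()
```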

Finally, run download_all_filings() to start downloading all 10-K filings.

Downloaded 10-K filings

