Automating data extraction from SEC 10-K forms using Document AI and Generative AI

Harish Verma
Google Cloud - Community
4 min read · Apr 18, 2024

SEC 10-K forms are comprehensive annual financial reports that public companies file with the U.S. Securities and Exchange Commission (SEC) to disclose their financial performance. They can be very long, ranging from 50 to over 200 pages, and their size and complex format make extracting data from them time-consuming and challenging.

In this blog post, we show how to use Google Cloud’s Document AI and Generative AI to parse SEC 10-K forms and extract key information. This solution can save you time and effort, and it can help you make more informed investment decisions quickly.

Solution Architecture

The architecture of the SEC 10-K form parser, built with Document AI and Generative AI, is shown below. The solution consumes a PDF document and extracts a predefined set of fields.

Solution Architecture

The solution consists of the following components:

  • Document AI Custom Document Splitter (CDS): Splits an SEC 10-K form into individual sections.
  • Document AI Custom Document Extractor (CDE): Extracts key information presented in tabular form in different sections of the SEC 10-K form.
  • Generative AI: Extracts text-based information from the SEC 10-K form.
  • BigQuery: Stores the extracted data.
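The way these components fit together can be sketched in plain Python. This is only an illustration of the data flow, not the code behind the solution: the `Section` type, the section labels, and the four callables (`split`, `extract_tables`, `extract_text`, `store`) are hypothetical stand-ins for the splitter, the extractor, the Generative AI call, and the BigQuery write.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the sections that hold tabular data.
TABLE_SECTIONS = frozenset({"consolidated_balance_sheet", "statement_of_operations"})


@dataclass
class Section:
    label: str        # section label assigned by the splitter (assumed naming)
    pages: List[int]  # page indices of the section in the source PDF


def run_pipeline(
    pdf_bytes: bytes,
    split: Callable[[bytes], List[Section]],
    extract_tables: Callable[[bytes, Section], Dict[str, dict]],
    extract_text: Callable[[bytes, Section], Dict[str, str]],
    store: Callable[[dict], None],
) -> dict:
    """Route each section to the matching extractor and store the merged record."""
    record: dict = {}
    for section in split(pdf_bytes):
        if section.label in TABLE_SECTIONS:
            # Tabular sections go to the (stand-in) Custom Document Extractor.
            record.update(extract_tables(pdf_bytes, section))
        else:
            # Text sections go to the (stand-in) Generative AI step.
            record.update(extract_text(pdf_bytes, section))
    store(record)
    return record
```

Passing the four steps in as callables keeps the sketch runnable without cloud credentials; in a real deployment each callable would wrap the corresponding Google Cloud client.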

Data and Model Training

The solution was trained on a dataset of SEC 10-K forms. You can find SEC 10-K forms in the Kaggle dataset SEC Edgar Annual Financial Filings - 2021.

For Generative AI, fields such as the company name, address, and fiscal year-end date are extracted by passing the relevant content to the text-bison model.
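The prompt used with text-bison is not shown in this post, so the following is a minimal sketch of one way to do it: build a prompt that asks for a JSON object, then parse the model's reply defensively. The field names follow the sample output later in the post; the prompt wording and the helper names `build_prompt` and `parse_response` are assumptions, and the call to the model itself is left out.

```python
import json
from typing import Dict, List, Optional

# Field names taken from the sample output in this post.
FIELDS = ["company_name", "company_address", "fiscal_year"]


def build_prompt(section_text: str, fields: List[str]) -> str:
    """Build an extraction prompt (assumed wording, not the authors' prompt)."""
    field_list = ", ".join(fields)
    return (
        "Extract the following fields from the SEC filing excerpt below and "
        f"answer with a single JSON object with exactly these keys: {field_list}. "
        "Use null for any field that is not present.\n\n"
        f"Excerpt:\n{section_text}\n\nJSON:"
    )


def parse_response(response_text: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Pull the JSON object out of the model reply and keep only known fields."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    data = json.loads(response_text[start:end + 1])
    # Missing keys become None; unexpected keys are dropped.
    return {name: data.get(name) for name in fields}
```

Restricting the result to a fixed list of keys means a chatty or over-eager reply cannot add unexpected columns downstream.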

For the Custom Document Splitter, we divided the document into sections such as Introduction and Signature, and also identified important tables such as the Consolidated Balance Sheet and Statement of Operations. We labeled more than 50 documents and trained the splitter on them.
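Once the splitter has labeled the sections, a downstream step needs to know which pages belong to which section. The sketch below assumes the splitter result has already been reduced to a list of (label, pages) records; the `SplitEntity` type and the label strings are hypothetical, not the Document AI client types.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SplitEntity:
    label: str               # section label, e.g. "consolidated_balance_sheet" (assumed naming)
    pages: Tuple[int, ...]   # page indices covered by this section


def page_ranges(entities: Iterable[SplitEntity]) -> Dict[str, List[Tuple[int, int]]]:
    """Map each section label to its sorted (first_page, last_page) spans."""
    ranges: Dict[str, List[Tuple[int, int]]] = {}
    for entity in entities:
        if not entity.pages:
            continue  # skip sections with no page information
        ranges.setdefault(entity.label, []).append((min(entity.pages), max(entity.pages)))
    for spans in ranges.values():
        spans.sort()
    return ranges
```

With the spans in hand, only the pages holding the balance sheet or the statement of operations need to be sent on for table extraction.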

Snapshot of the Custom Document Splitter we developed

For the Custom Document Extractor, the documents were labeled to identify the relevant fields. Example labels from the Consolidated Balance Sheet and Statement of Operations tables include total current assets, total current liabilities, total net sales, and operating expenses, each mapped to its year. We labeled more than 50 documents and trained the extractor on them.
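Values extracted from financial tables arrive as text such as '164,795', so a small normalization step is useful before storing them. The sketch below is an assumption about how that could look: it treats each extracted value as a (field, period, text) tuple, where the period is 'current' or 'previous' as in the sample output later in the post. The tuple layout and the helper names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Dict, Iterable, Optional, Tuple


def parse_amount(text: str) -> Optional[Decimal]:
    """Convert a table cell such as '164,795' or '(1,234)' to a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "").replace(" ", "")
    # Accounting tables often write negatives in parentheses.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    if not re.fullmatch(r"\d+(\.\d+)?", cleaned):
        return None  # empty cells, dashes, and non-numeric text
    value = Decimal(cleaned)
    return -value if negative else value


def group_by_field(
    entities: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, Optional[Decimal]]]:
    """Group (field, period, text) tuples into {field: {period: amount}}."""
    grouped: Dict[str, Dict[str, Optional[Decimal]]] = {}
    for field, period, text in entities:
        grouped.setdefault(field, {})[period] = parse_amount(text)
    return grouped
```

Returning `None` for unparseable cells, rather than raising, lets a batch job flag bad extractions for review without stopping.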

Snapshot of the fields defined for the Custom Document Extractor

Below is a sample page containing a Consolidated Balance Sheet table from an SEC 10-K form.

Consolidated Balance Sheet table in an SEC 10-K form (Source)

Results

The solution was evaluated on a test set of 20 documents, with the following results:

  • 95%+ accuracy for the Custom Document Splitter in identifying the different sections of the forms
  • 90%+ accuracy for field extraction from tabular data using the Custom Document Extractor
  • 99%+ accuracy for field extraction from textual data using Generative AI
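The post does not say how accuracy was computed. One plausible definition for field extraction, shown here purely as an assumption, is the share of ground-truth fields whose predicted value matches exactly after light normalization (whitespace and case). The function names are hypothetical.

```python
from typing import Dict, List, Optional


def normalize(value: Optional[str]) -> str:
    """Collapse whitespace and ignore case; a missing value becomes ''."""
    if value is None:
        return ""
    return " ".join(value.split()).casefold()


def field_accuracy(
    predictions: List[Dict[str, str]],
    ground_truth: List[Dict[str, str]],
) -> float:
    """Fraction of ground-truth fields that were predicted correctly."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    correct = 0
    total = 0
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total += 1
            if normalize(predicted.get(field)) == normalize(expected_value):
                correct += 1
    return correct / total if total else 0.0
```

Counting over ground-truth fields means a field the model fails to return is scored as a miss rather than silently ignored.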

We tried the solution on the latest SEC 10-K form filing by Alphabet Inc., which is publicly available here. Below is a snapshot of the 50-page document.

Source

Here is the output the solution produced by directly ingesting the PDF published by Alphabet.

{'company_address': '1600 Amphitheatre Parkway Mountain View, CA 94043',
 'company_name': 'Alphabet Inc.',
 'company_phone': '(650) 253-0000',
 'fiscal_year': 'March 31, 2023',
 'form_type': '10-Q',
 'chief_financial_officer': 'Ruth M. Porat',
 'current_assets': {'previous': '164,795', 'current': '161,985', 'description': 'Total current assets'},
 'current_liabilities': {'previous': '69,300', 'current': '68,854', 'description': 'Total current liabilities'},
 'net_income': {'previous': '16,436', 'current': '15,051', 'description': 'Net income'},
 'total_net_sales': {'previous': '68,011', 'current': '69,787', 'description': 'Revenues'}}
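To store or analyze output like this, the nested values can be flattened into one row with a numeric column per period. The sketch below uses a subset of the values shown above; the column naming scheme (`<field>_<period>`) and the helper functions are assumptions for illustration, and the actual BigQuery load step is left out.

```python
from typing import Any, Dict

# A subset of the sample output shown in this post.
sample_output: Dict[str, Any] = {
    "company_name": "Alphabet Inc.",
    "form_type": "10-Q",
    "current_assets": {"previous": "164,795", "current": "161,985",
                       "description": "Total current assets"},
    "current_liabilities": {"previous": "69,300", "current": "68,854",
                            "description": "Total current liabilities"},
    "total_net_sales": {"previous": "68,011", "current": "69,787",
                        "description": "Revenues"},
}


def to_int(text: str) -> int:
    """Parse a whole-number amount printed with thousands separators."""
    return int(text.replace(",", ""))


def flatten(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn nested period values into flat '<field>_<period>' columns."""
    row: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for period in ("current", "previous"):
                if period in value:
                    row[f"{key}_{period}"] = to_int(value[period])
        else:
            row[key] = value
    return row


def current_ratio(row: Dict[str, Any]) -> float:
    """Current assets divided by current liabilities for the current period."""
    return row["current_assets_current"] / row["current_liabilities_current"]


row = flatten(sample_output)
print(row)
print(round(current_ratio(row), 2))
```

A flat row of scalars maps directly onto a table with one column per value, and derived metrics such as the current ratio become simple arithmetic on those columns.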

Conclusion

The integration of Document AI and Generative AI offers a powerful solution for automating and enhancing SEC Form 10-K parsing. By leveraging machine learning and natural language processing capabilities, investors, analysts, and stakeholders can extract structured data with high accuracy, gain contextual understanding, and unlock data insights that are crucial for making informed decisions.

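To learn more, see the documentation for the products used in the solution: Document AI (Custom Document Splitter and Custom Document Extractor), Generative AI (the text-bison model), and BigQuery.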