Document AI Warehouse: Introduction, Key Features & Sample Use Case Explained

Vasu Mittal
Google Cloud - Community
8 min read · Jan 27, 2023

Document AI Warehouse is an integrated, cloud-based platform to store, search, organize, govern, and analyze documents and their structured metadata (called Properties).

Documents include structured data (e.g. forms, invoices) and unstructured data (e.g. contracts, research papers).

Properties include AI-extracted data from documents and tags, which are either manually assigned or AI-assigned (for example, account number, loan ID, document type).

Document AI Warehouse has powerful search features that help your business find content faster. It lets you automate document handling, document management, and workflows, which saves cost and time and improves the overall efficiency of the document management process. With Document AI Warehouse, you store and access document contents at cloud scale. It helps mitigate risk by enforcing governance and compliance policies, and it connects with GCS, Doc AI, and other cloud services.

Key Features

Document AI Warehouse has several features that make it a powerful, cost-effective, and efficient document storage and management system. Here is a quick walkthrough of some of them:

  1. It has REST APIs to manage all your documents and their properties (i.e. extracted or tagged metadata).
  2. It has Metadata Management to manage extracted and tagged metadata.
  3. It supports end-to-end governance by integrating with IAM and corporate directories:
    - Fine-grained access control at the document and folder levels can be assigned to users and groups to view, edit, or manage (share, delete) documents.
    - Document AI Warehouse is integrated with IAM (Cloud Identity), so users and groups can be provisioned into Cloud Identity.
  4. Document AI Warehouse supports rich search capabilities, including:
    - Full-text search
    - Filtering search results by Properties (date, numeric, text, etc.)
    - Combining filters with AND and OR operators
    - Semantic search: supports common synonyms and misspellings.
  5. Document AI Warehouse supports flexible folder management.
    - Documents can be cataloged into one or more folders based on the application (for example, an ID card can be placed in a KYC folder, a Loan folder, and a Bank Account folder) without replicating the document.
    - Folders can be nested in one or more hierarchies (for example, AllLoans->State->Branch->Loans or LoanTypes->Loans).
  6. It includes a web-accessible UI with the following features:
    - Doc Explorer: search documents, filter search results, select documents to bulk-update properties or delete
    - Doc Viewer: view documents, view/update their properties, assign ACLs, add to folders
    - Upload: upload documents and run them through a DocAI extractor (either OCR or a supported specialised parser such as Invoice DocAI).
    - Folder Explorer: add documents to one or more folders, explore the folder hierarchy.
  7. Doc extractors (DocAI and others): Documents can be processed by an AI extraction pipeline so that the extracted data is ingested and managed in Document AI Warehouse (as metadata) along with the raw document. The extraction can be done by:
    - Document AI specialised parsers (for procurement forms, lending forms, and others)
    - OCR (only for supported files), AutoML, Forms parser (for images such as TIFF, PNG, etc.)
    - Other custom models
    - Text extraction tools for specialized document formats such as PDFs, Office documents, and others.
    - Note that Document AI Warehouse can work with any extraction pipeline that calls Document AI Warehouse APIs to ingest/update documents.
  8. It supports policy management and compliance enforcement: conditional and scheduled notifications trigger workflows that enforce policies (for example, records management, retention and disposition, legal holds) on specific documents.
  9. Supported files: text PDFs, images (scanned PDFs, JPEG files, etc.), and Office files (DOCX, PPTX, XLSX). These are run through OCR (only for supported files) and indexed.
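
To make the property-filter idea from the search feature more concrete, here is a minimal pure-Python sketch of combining filters with AND and OR. It does not call the Document AI Warehouse API. The helper names (`prop_equals`, `prop_at_least`, `all_of`, `any_of`) and the sample documents are hypothetical and only illustrate the concept.

```python
from typing import Callable, Dict, List

# A document is modeled here as a plain dict of property name -> value.
Doc = Dict[str, object]
Filter = Callable[[Doc], bool]


def prop_equals(name: str, value: object) -> Filter:
    """Match documents whose property `name` equals `value`."""
    return lambda doc: doc.get(name) == value


def prop_at_least(name: str, minimum: float) -> Filter:
    """Match documents whose numeric property `name` is >= `minimum`."""
    def check(doc: Doc) -> bool:
        value = doc.get(name)
        return isinstance(value, (int, float)) and value >= minimum
    return check


def all_of(*filters: Filter) -> Filter:
    """AND: every filter must match."""
    return lambda doc: all(f(doc) for f in filters)


def any_of(*filters: Filter) -> Filter:
    """OR: at least one filter must match."""
    return lambda doc: any(f(doc) for f in filters)


def search(docs: List[Doc], flt: Filter) -> List[Doc]:
    """Return the documents that satisfy the filter, in input order."""
    return [d for d in docs if flt(d)]


# Hypothetical sample data, not real Warehouse documents.
DOCS: List[Doc] = [
    {"id": "d1", "document_type": "invoice", "amount": 1200.0, "state": "NW"},
    {"id": "d2", "document_type": "invoice", "amount": 80.0, "state": "NW"},
    {"id": "d3", "document_type": "contract", "amount": 5000.0, "state": "SE"},
]

# (document_type = invoice AND amount >= 100) OR state = SE
query = any_of(
    all_of(prop_equals("document_type", "invoice"), prop_at_least("amount", 100)),
    prop_equals("state", "SE"),
)
matching_ids = [d["id"] for d in search(DOCS, query)]
print(matching_ids)
```

Building filters as small composable functions keeps the AND/OR nesting explicit; a real search request would express the same logic in whatever filter syntax the service accepts.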

Basic Document AI Warehouse Terminology:

  1. Document: A record in Document AI Warehouse that users can search, manage, and enforce access control on. It comprises the raw document (PDF, image, or other format) and its associated metadata.
  2. Raw Document: The raw content file (PDF, image, etc.) of the Document, i.e. the actual file.
  3. Schema: Every document has a document type, and that type is specified by a schema in Document AI Warehouse. For example, Invoice is a schema that defines properties such as Supplier Name, Vendor Name, and Invoice Amount.
  4. Property: Properties are the fields of the document schema. Their values may be extracted from the document or added (labeled) by users. Property types include free text, numeric, and date.
  5. Folders: A folder is a virtual collection of documents (virtual because the same document can be contained in more than one folder). It has a document type/schema and carries metadata and Access Control Lists, just like documents.
  6. Links: Links are used to add documents to folders or to link related documents together.
  7. Related Documents: Documents can be related by directional links from one document to another.
  8. Policy: A policy is evaluated when a document or folder is created or updated, and is used to validate or update document metadata and ACLs, or to add, move, or remove documents from folders.
  9. Faceted Search: A facet is a metadata filter used in a search query. For example, searching for Bank Statements with “Month = March 2022” and “Branch State = NW” filters the search results by these two facets.
  10. Semantic Search: Semantic search supports synonyms or “semantically related” terms in the search query. For example, a search for “driver license” also returns documents containing “driver permit”.
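
The relationship between documents, folders, and links can be sketched with a small in-memory model. This is only an illustration of the terminology above, not the Warehouse data model or API; the class and field names (`Document`, `Folder`, `Warehouse`, `raw_uri`, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Document:
    doc_id: str
    schema: str                      # document type, e.g. "id_card"
    properties: Dict[str, object] = field(default_factory=dict)
    raw_uri: str = ""                # hypothetical location of the raw file


@dataclass
class Folder:
    name: str
    # A folder holds links (document IDs), not copies of documents.
    linked_doc_ids: Set[str] = field(default_factory=set)


class Warehouse:
    """Toy stand-in used to show that folders are virtual collections."""

    def __init__(self) -> None:
        self.documents: Dict[str, Document] = {}
        self.folders: Dict[str, Folder] = {}

    def add_document(self, doc: Document) -> None:
        self.documents[doc.doc_id] = doc

    def link(self, folder_name: str, doc_id: str) -> None:
        """Add a document to a folder by creating a link to it."""
        if doc_id not in self.documents:
            raise KeyError(f"unknown document: {doc_id}")
        folder = self.folders.setdefault(folder_name, Folder(folder_name))
        folder.linked_doc_ids.add(doc_id)

    def folders_of(self, doc_id: str) -> List[str]:
        """Names of all folders that link to the document, sorted."""
        return sorted(
            name for name, f in self.folders.items() if doc_id in f.linked_doc_ids
        )


wh = Warehouse()
wh.add_document(Document("doc-1", "id_card", {"holder": "A. Customer"}))
for folder_name in ("KYC", "Loan", "Bank Account"):
    wh.link(folder_name, "doc-1")

print(wh.folders_of("doc-1"), len(wh.documents))
```

The ID card appears in three folders while only one `Document` record exists, which is the "without replicating the document" property described in the folder feature.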

Sample Use Case

Let’s look at a practical example.

Suppose that, impressed by Document AI Warehouse’s features, you decide to move all your documents to it. Where do you start?

One way to move your documents from an existing system (e.g. FileNet) to Document AI Warehouse is to use Storage Transfer Service (referred to as TSOP in this post).

Storage Transfer Service is a product that enables you to:

  • Move or back up data to a Cloud Storage bucket either from other cloud storage providers or from a local or cloud POSIX file system.
  • Move data from one Cloud Storage bucket to another, so that it is available to different groups of users or applications.
  • Move data from Cloud Storage to a local or cloud file system.
  • Move data between file systems.
  • Periodically move data as part of a data processing pipeline or analytical workflow.

Storage Transfer Service provides options that make data transfers and synchronization easier. For example, you can:

  • Schedule one-time transfer operations or recurring transfer operations.
  • Delete existing objects in the destination bucket if they don’t have a corresponding object in the source.
  • Delete data source objects after transferring them.
  • Schedule periodic synchronization from a data source to a data sink with advanced filters based on file creation dates, filenames, and the times of day you prefer to import data.

Storage Transfer Service does the following by default:

  • Copies a file from the data source if the file doesn’t exist in the data sink or if the source and sink versions differ.
  • Retains files in the source after the transfer operation.
  • Uses TLS encryption for HTTPS connections. The only exception is if you specify an HTTP URL for a URL list transfer.
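
The default copy behaviour listed above can be sketched in a few lines of Python. This is a simplified model, not the Storage Transfer Service API: source and sink are plain dicts, and "differs" is approximated by comparing SHA-256 digests, which is an assumption made for the example rather than a statement of how the service detects changes.

```python
import hashlib
from typing import Dict, List


def digest(data: bytes) -> str:
    """Content fingerprint used here to decide whether two objects differ."""
    return hashlib.sha256(data).hexdigest()


def plan_transfer(source: Dict[str, bytes], sink: Dict[str, bytes]) -> List[str]:
    """Return names of source objects that are missing from or differ in the sink."""
    to_copy = []
    for name, data in sorted(source.items()):
        if name not in sink or digest(sink[name]) != digest(data):
            to_copy.append(name)
    return to_copy


def run_transfer(source: Dict[str, bytes], sink: Dict[str, bytes]) -> List[str]:
    """Copy the planned objects; the source is left untouched (default behaviour)."""
    copied = plan_transfer(source, sink)
    for name in copied:
        sink[name] = source[name]
    return copied


# Hypothetical objects: a.pdf is unchanged, b.pdf differs, c.pdf is new.
source = {"a.pdf": b"alpha", "b.pdf": b"beta-v2", "c.pdf": b"gamma"}
sink = {"a.pdf": b"alpha", "b.pdf": b"beta-v1", "old.pdf": b"stale"}

copied = run_transfer(source, sink)
print(copied)
```

Note that `old.pdf` stays in the sink and all three source objects remain in place: deleting from either side is an option you enable, not the default.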

So TSOP handles the transfer, but how do we set up the entire architecture?

Sample Architecture

This sample architecture includes the following services:

  1. Google Cloud Storage: Cloud Storage is a managed service for storing unstructured data. You can store any amount of data and retrieve it as often as you want.
    In this architecture, a bucket (the Landing Bucket) is created in Cloud Storage as a landing zone to hold the files temporarily.
  2. Cloud SQL: Cloud SQL is a fully managed database service that helps you set up, maintain, manage, and administer your relational databases on Google Cloud Platform.
    For this use case, we used a Cloud SQL for PostgreSQL instance with one table (Audit Logging) for audit purposes. Whenever a file lands in the Landing Bucket, an entry is made for it in this table with the initial status “Received”. Once the file is moved to Document AI Warehouse, the status is updated to “Completed”. If an issue prevents the file from being moved, the record is updated with the status “Error” and the Error Message column is filled with a relevant description.
  3. Cloud Functions: A scalable, pay-as-you-go functions as a service (FaaS) product that lets you run your code in the cloud with no servers or containers to manage.
    Here, we created two Cloud Functions (written in Python):
    I. File Entry: This function watches the Landing Bucket continuously. As soon as a file lands there, it first makes an entry in the Cloud SQL table (created for logging purposes, to track which files are received, in process, completed, etc.) and then creates a Cloud Tasks task to invoke the second function, Load File.
    II. Load File: This function picks up the file from the Landing Bucket and loads it into Document AI Warehouse. If the file is loaded successfully, the function updates the record in the Cloud SQL table with the status “Completed”. If the file is not loaded (for any reason, for example because the exact same file already exists), it updates the status to “Error” and fills the Error Message column with a relevant description.
  4. Cloud Tasks: Cloud Tasks is a fully managed service that allows you to manage the execution, dispatch, and delivery of a large number of distributed tasks.
    A Cloud Tasks task is created for each file to load it from the Landing Bucket into Document AI Warehouse by calling the Load File function.
  5. Document AI Warehouse: This blog already describes it well!
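
The audit table described under Cloud SQL can be sketched as follows. The post does not show the actual schema, so the table name `audit_log` and its columns are assumptions. To keep the example runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for the Cloud SQL PostgreSQL instance; the SQL shown is simple enough to carry over with minor changes (for example, the parameter placeholder style depends on the PostgreSQL driver).

```python
import sqlite3
from typing import Optional, Tuple

# In-memory SQLite database standing in for the Cloud SQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE audit_log (
        file_name     TEXT PRIMARY KEY,
        status        TEXT NOT NULL,
        error_message TEXT
    )
    """
)


def record_received(db: sqlite3.Connection, file_name: str) -> None:
    """Insert the initial audit row when a file lands in the bucket."""
    db.execute(
        "INSERT INTO audit_log (file_name, status) VALUES (?, 'Received')",
        (file_name,),
    )
    db.commit()


def record_result(
    db: sqlite3.Connection, file_name: str, error: Optional[str] = None
) -> None:
    """Mark the file as Completed, or as Error with a description."""
    status = "Error" if error else "Completed"
    db.execute(
        "UPDATE audit_log SET status = ?, error_message = ? WHERE file_name = ?",
        (status, error, file_name),
    )
    db.commit()


def status_of(db: sqlite3.Connection, file_name: str) -> Tuple[str, Optional[str]]:
    """Return (status, error_message) for a file."""
    row = db.execute(
        "SELECT status, error_message FROM audit_log WHERE file_name = ?",
        (file_name,),
    ).fetchone()
    return (row[0], row[1])


record_received(conn, "contract-001.pdf")
record_received(conn, "contract-002.pdf")
initial = status_of(conn, "contract-001.pdf")

record_result(conn, "contract-001.pdf")
record_result(conn, "contract-002.pdf", error="Exact same file already exists")
print(initial, status_of(conn, "contract-001.pdf"), status_of(conn, "contract-002.pdf"))
```

Using the file name as the primary key is a simplification; a real table would likely also store timestamps and a unique identifier per upload.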

Note: Except for the parser capabilities, all the key features of Document AI Warehouse described above were used in this sample use case.

Once you have finalised the set of files that you want to move to Document AI Warehouse (“Discover Files” in the architecture image), you can use TSOP to migrate them to the Landing Bucket (a bucket created in Google Cloud Storage as a landing zone to hold the files temporarily). The “File Entry” Cloud Function watches the Landing Bucket, and as soon as a file lands there, it makes an entry in the Cloud SQL table and then creates a Cloud Tasks task to invoke the second function, “Load File”. The Load File function picks up the file from the Landing Bucket and loads it into Document AI Warehouse. If the file is loaded successfully, it updates the record in the Cloud SQL table with “Status” set to “Completed”. If the file is not loaded (for any reason, for example because the exact same file already exists), it sets “Status” to “Error” and fills the “Error Message” column with a relevant description. And it’s done!
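
The end-to-end flow can be simulated locally to see how the pieces hand off to each other. The sketch below is not the code used in this project and makes no Google Cloud calls: the Landing Bucket, the audit table, the Cloud Tasks queue, and Document AI Warehouse are all replaced by in-memory stand-ins, and the function names `file_entry` and `load_file` simply mirror the two Cloud Functions described above. Duplicate detection by content hash is an assumption used to trigger the “Error” path.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, Optional

# In-memory stand-ins for the managed services (all hypothetical).
landing_bucket: Dict[str, bytes] = {}              # Cloud Storage bucket
audit: Dict[str, Dict[str, Optional[str]]] = {}    # Cloud SQL audit table
task_queue: Deque[str] = deque()                   # Cloud Tasks queue
warehouse: Dict[str, str] = {}                     # content hash -> file name


def file_entry(file_name: str) -> None:
    """First function: log the file as Received, then enqueue a load task."""
    audit[file_name] = {"status": "Received", "error": None}
    task_queue.append(file_name)


def load_file(file_name: str) -> None:
    """Second function: load the file and record Completed or Error."""
    try:
        content = landing_bucket[file_name]
        key = hashlib.sha256(content).hexdigest()
        if key in warehouse:
            raise ValueError(f"identical file already loaded as {warehouse[key]}")
        warehouse[key] = file_name
        audit[file_name]["status"] = "Completed"
    except Exception as exc:
        audit[file_name].update(status="Error", error=str(exc))


def on_file_landed(file_name: str, content: bytes) -> None:
    """Simulate a file arriving in the Landing Bucket."""
    landing_bucket[file_name] = content
    file_entry(file_name)


def drain_queue() -> None:
    """Simulate the task queue dispatching each task to load_file."""
    while task_queue:
        load_file(task_queue.popleft())


on_file_landed("invoice-1.pdf", b"invoice one")
on_file_landed("invoice-1-copy.pdf", b"invoice one")   # same bytes, new name
drain_queue()

for name, row in audit.items():
    print(name, row["status"], row["error"])
```

Separating the entry step from the load step through a queue means a slow or failing load never blocks the logging of newly arrived files, which is the main reason to put a task queue between the two functions.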

Once the files are uploaded to Document AI Warehouse, you can view, search, and filter them using the web UI.

Keep Learning, Keep Growing!!!
