Stories by Patrick Guha on Medium

Four Steps for Startups to Master Visual Document Understanding with AI

Patrick Guha — Fri, 08 May 2026 18:40:11 GMT

AI agents are transforming the nature of work by automating complex workflows with speed, scale, and accuracy. For AI startups — whether you are building the next-generation AI assistant for legal professionals, automating digital mortgage and real estate closings, streamlining complex estate planning, or pioneering digital musculoskeletal health and physical therapy — processing complex, highly-formatted documents is at the core of your product.

Contracts, loan applications, and clinical records are rarely just plain text. They are densely packed with tables, signatures, charts, and highly specific spatial layouts. Traditionally, developers had to rely on brittle pipelines involving disparate optical character recognition (OCR) tools and text-only large language models (LLMs).

Today, native multimodal models have changed the game. Working closely with our DeepMind and Google Cloud AI teams, we’ve developed a simple four-step framework to help your startup build reliable, highly scalable, and economical document understanding systems using Gemini.

Step #1: Skip the OCR pipeline and leverage dynamic tokenization

For complex document processing, the gold standard is now direct inference through native multimodal models. Instead of preprocessing documents through a traditional OCR engine, pass the PDF pages directly to Gemini. Models like Gemini 3.1 Pro, 3 Flash, and 3.1 Flash-Lite natively understand document layouts, tables, and embedded images.

Optimize your PDF format: For text-heavy workloads like master service agreements or estate planning documents, try to ingest native PDFs (where text is rendered as text) rather than flat scanned images. This makes the text machine-readable, which is far easier for the model to edit, search, and manipulate.

Mastering Gemini 3 Tokenization: With the Gemini 3 models, document tokenization uses a variable sequence length, replacing the older Pan and Scan method for better quality and latency. You can now tightly control costs and performance by explicitly setting the media_resolution for your PDF inputs:

UNSPECIFIED (560 tokens/page): The default setting. The token count for this level varies significantly between Gemini 3 and earlier Gemini models.
LOW (280 tokens/page): Best for simple text extraction or global summaries.
MEDIUM (560 tokens/page): A balance between detail, cost, and latency. Ideal for standard document extraction.
HIGH (1120 tokens/page): Higher token count, providing more detail for the model to work with, at the expense of increased latency and cost. Reserve this for fine-grained detail on complex layouts, tiny serial numbers on hardware, or dense financial tables.
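
To make the trade-off concrete, here is a minimal sketch that estimates the media tokens for a PDF from the per-page figures listed above. The helper name and the 200-page example are illustrative assumptions, and the estimate covers only the page tokens, not the prompt text or the model output.

```python
# Per-page token counts for PDF inputs, taken from the list above.
TOKENS_PER_PAGE = {
    "LOW": 280,
    "MEDIUM": 560,
    "HIGH": 1120,
}


def estimate_pdf_tokens(pages: int, level: str = "MEDIUM") -> int:
    """Estimate the media tokens for a PDF (hypothetical helper)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * TOKENS_PER_PAGE[level]


# Compare the three levels for a 200-page closing packet.
for level in TOKENS_PER_PAGE:
    print(level, estimate_pdf_tokens(200, level))
```

Running the same page count through each level shows that HIGH costs four times as many page tokens as LOW, which is why it is worth reserving for pages that need fine detail.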

Pro-tip: You can set these resolutions at the top level of your generationConfig, or override them for individual media parts if you have a mixed batch of high-res charts and low-res text pages.

Example: Set media_resolution per individual media part:

from google import genai
from google.genai import types

# Assumes credentials and project settings are supplied through the environment.
client = genai.Client()

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[
        types.Part(
            file_data=types.FileData(
                file_uri="gs://cloud-samples-data/generative-ai/image/a-man-and-a-dog.png",
                mime_type="image/png",  # must match the file type
            ),
            media_resolution=types.PartMediaResolution(
                level=types.PartMediaResolutionLevel.MEDIA_RESOLUTION_HIGH  # High resolution
            ),
        ),
        types.Part(
            file_data=types.FileData(
                file_uri="gs://cloud-samples-data/generative-ai/video/behind_the_scenes_pixel.mp4",
                mime_type="video/mp4",
            ),
            media_resolution=types.PartMediaResolution(
                level=types.PartMediaResolutionLevel.MEDIA_RESOLUTION_LOW  # Low resolution
            ),
        ),
        "When does the image appear in the video? What is the context?",
    ],
)
print(response.text)

Example: Or set media_resolution globally (via GenerateContentConfig)

response = client.models.generate_content(
  model="gemini-3.1-flash-lite",
  contents=[
      types.Part(
          file_data=types.FileData(
              file_uri="gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
              mime_type="application/pdf",
          ),
      ),
      "Please summarize the given document for a general audience.",
  ],
  config=types.GenerateContentConfig(
      media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM,  # Global setting
  ),
)
print(response.text)

Step #2: Prompt engineering for spatial and mixed-media context

Many startups, particularly innovative legal case management platforms, deal with documents that mix dense text with embedded images (like exhibits in legal PDFs). When reasoning over these mixed-media documents, send the full PDF pages as-is to Gemini. Because Gemini processes the entire page visually, it preserves the spatial relationship between the text and the images.

To get the most out of the model, adhere to these structural prompt best practices:

Order matters: If your request contains a single PDF, always place the PDF file before the text prompt in your API request.

Example: The same request as above, with the PDF Part placed first in contents and the text prompt/instruction placed after it.

text_part = """
You are a professional document summarization specialist. Please summarize the given document for a general audience."""

response = client.models.generate_content(
  model="gemini-3.1-flash-lite",
  contents=[
      types.Part(
          file_data=types.FileData(
              file_uri="gs://cloud-samples-data/generative-ai/pdf/1706.03762v7.pdf",
              mime_type="application/pdf",
          ),
      ),
      text_part,
  ],
  config=types.GenerateContentConfig(
      media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM,  # Global setting
  ),
)
print(response.text)

Split massive files: If you are dealing with a long document (like a massive mortgage closing packet), consider splitting it into multiple smaller PDFs before processing it to improve retrieval accuracy and parsing speed. There are great tools out there powered by Gemini, such as LlamaParse, that can help you perform splitting and other tasks. It is also best practice to rotate pages to the correct orientation before uploading.

Example: Create a split job with category definitions with LlamaParse

job = client.beta.split.create(
    document_input={
        "type": "file_id",
        "value": file_id,
    },
    categories=[
        {
            "name": "essay",
            "description": "A philosophical or reflective piece of writing that presents personal viewpoints, arguments, or thoughts on a topic without strict formal structure",
        },
        {
            "name": "research_paper",
            "description": "A formal academic document presenting original research, methodology, experiments, results, and conclusions with citations and references",
        },
    ],
)

print(f"✅ Split job created: {job.id}")
print(f"   Status: {job.status}")

Part isolation: For multi-document reasoning across a corpus, structure your requests so each document is a separate Part in the prompt with clear document identifiers.

Example: Interleaving document identifiers inside the prompt of the multi-part Content message with PDFs as unique Parts.

response = client.models.generate_content(
    model="gemini-3.1-flash-lite",
    contents=[
        "Document 1: Master Service Agreement",
        types.Part(
            file_data=types.FileData(
                file_uri="gs:///msa_doc.pdf",
                mime_type="application/pdf",
            ),
            # Apply medium resolution specifically to Document 1
            media_resolution=types.PartMediaResolution(
                level=types.PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM
            )
        ),
        "Document 2: Statement of Work",
        types.Part(
            file_data=types.FileData(
                file_uri="gs:///sow_doc.pdf",
                mime_type="application/pdf",
            ),
            # Apply medium resolution specifically to Document 2
            media_resolution=types.PartMediaResolution(
                level=types.PartMediaResolutionLevel.MEDIA_RESOLUTION_MEDIUM
            )
        ),
        "Analyze both documents. Compare the liability clauses in Document 1 with the deliverables outlined in Document 2."
    ]
)
print(response.text)

Step #3: Build dynamic routing and mitigate model limitations

When building multi-agent systems (like those built on Google’s Agent Development Kit), give your agent the ability to dynamically route tasks. We recommend setting up two distinct tools:

General RAG (Text-based retrieval): Standard vector search over extracted textual data.
Multimodal RAG (Image-based retrieval): Sending relevant raw pages/images to Gemini for visual analysis.

Instruct your agent’s system prompt to use the image retrieval tool when the user’s query involves visual content (e.g., charts, signatures, or scanned forms). Think of this as an on-demand visual analysis tool.
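
As a rough illustration of the two-tool setup, the sketch below routes a query to one of two placeholder tools. In a real agent the model makes this choice based on the system prompt; the keyword check here is only a stand-in for that decision, and the tool names and cue words are hypothetical.

```python
from typing import Callable

# Hypothetical cue words suggesting that a query needs visual analysis.
VISUAL_CUES = ("chart", "signature", "scanned", "figure", "diagram", "stamp")


def text_rag(query: str) -> str:
    """Placeholder for vector search over extracted text."""
    return f"text_rag:{query}"


def multimodal_rag(query: str) -> str:
    """Placeholder for sending raw pages or images to a multimodal model."""
    return f"multimodal_rag:{query}"


def route(query: str) -> Callable[[str], str]:
    """Pick a tool with a keyword check (stand-in for the agent's own choice)."""
    lowered = query.lower()
    if any(cue in lowered for cue in VISUAL_CUES):
        return multimodal_rag
    return text_rag


tool = route("Is the signature on page 3 present?")
print(tool("Is the signature on page 3 present?"))
```

Keeping the two tools behind a single routing function makes it easy to swap the heuristic for the agent's own tool selection later.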

Designing for limitations: It is crucial to build your workflows with the model’s limitations in mind. While multimodal models are incredibly powerful, they are not perfect spatial calculators.

Approximations: The models are not precise at locating exact coordinates of text/objects and might only return approximated counts of objects on a page.
Handwriting: Be aware that models may hallucinate when interpreting messy handwritten text in PDFs (a common challenge for startups processing intake clinical notes or physical therapy records). Design your agent to flag highly ambiguous handwriting for “human-in-the-loop” (HITL) review.
Confidence Thresholding: Program your agent to return a “certainty score” alongside its extraction. For mission-critical data (like financial totals or medical IDs), set a threshold (e.g., 90%). If the model’s confidence is lower, route the task to a fallback OCR engine or a human-in-the-loop for verification.
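
The thresholding idea can be sketched in a few lines. This assumes the extraction step returns a certainty score between 0 and 1 alongside each value; the Extraction type, the field names, and the route labels are hypothetical. Self-reported scores are best treated as a heuristic signal rather than a calibrated probability.

```python
from dataclasses import dataclass


@dataclass
class Extraction:
    field: str
    value: str
    confidence: float  # model-reported certainty in [0, 1]; assumed format


def route_extraction(item: Extraction, threshold: float = 0.90) -> str:
    """Return 'accept' or 'human_review' based on the reported confidence."""
    if not 0.0 <= item.confidence <= 1.0:
        # Malformed scores are treated as untrusted.
        return "human_review"
    return "accept" if item.confidence >= threshold else "human_review"


print(route_extraction(Extraction("loan_total", "1,200.00", 0.95)))
print(route_extraction(Extraction("patient_id", "A-1O93", 0.72)))
```

The same function could return a third label for a fallback OCR pass if your pipeline has one.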

Step #4: Optimize economics for massive scale

At startup scale — processing tens of millions of pages per month — passing every single image to a full vision language model (VLM) during high-volume ingestion is economically unsustainable.

Here is our blueprint for massive, economical scale:

Adopt Multimodal Embeddings: Instead of sending every ingested image to Gemini for reasoning, send them to a multimodal embedding model such as Gemini Embedding 2. You pay a fraction of a cent to make the image searchable. You only invoke the more expensive Gemini model after a search finds a relevant image.
Use Gemini Flash-Lite for the heavy lifting: For initial extraction passes and OCR-ing simple tables, use Gemini 3.1 Flash-Lite. It has the cheapest per-token cost and is incredibly fast.
Leverage Context Caching: If your users have standard reference documents (like massive legal codebases) that your agent queries repeatedly, turn on explicit context caching. You pay to process the heavy document images once, and every subsequent query over the next hour is discounted by up to 90%.
Routing Requests: For global applications, route your document processing requests to the nearest Google Cloud region by using the global endpoint. This minimizes the “time to first token” (TTFT) for your users and allows you to utilize regions with higher availability. You can also leverage lower pricing for non-urgent batch processing of documents by using batch inference or Flex PayGo.
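
The embed-first pattern from the first item can be sketched without any API calls. The toy three-dimensional vectors below stand in for real multimodal embeddings, and the function names are illustrative assumptions. Only the pages returned by the search would then be sent to the more expensive model.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_pages(
    query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2
) -> List[str]:
    """Return the ids of the k pages most similar to the query vector."""
    ranked = sorted(
        index, key=lambda page_id: cosine(query_vec, index[page_id]), reverse=True
    )
    return ranked[:k]


# Toy vectors standing in for embeddings computed once at ingestion time.
index = {
    "page-1": [0.9, 0.1, 0.0],
    "page-2": [0.1, 0.9, 0.0],
    "page-3": [0.0, 0.2, 0.9],
}
candidates = top_k_pages([1.0, 0.0, 0.0], index, k=1)
print(candidates)  # only these pages go on to the full model
```

At production scale a vector database would replace the in-memory dictionary, but the cost structure is the same: cheap embedding at ingestion, expensive reasoning only on retrieved pages.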

Ready to start building visually intelligent agents? Let’s get started!

Explore the Startups technical guide: AI agents to dive deeper into agent architectures.
Try out Gemini 3.1 Pro, 3 Flash, and 3.1 Flash-Lite in Google AI Studio to test multimodal document capabilities.
Review the full Gemini Document Understanding best practices.
Gemini models can be configured to take advantage of GCP’s enterprise-grade security and compliance, so your team can spend more time building secure solutions.
No matter where you are with AI adoption, we are here to help. Contact our Startup team today to learn how you can get up to $350,000 USD in cloud credits with the Google for Startups Cloud Program.

Special thank you to my co-author, Nanditha Embar!

Disclaimer:

The views and opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of Google Cloud. Recommendations noted here should be tested in a non-production environment before production deployment. Users are responsible for assessing the cost and security implications of their own deployments.

Four Steps for Startups to Master Visual Document Understanding with AI was originally published in Google Cloud for Startups on Medium, where people are continuing the conversation by highlighting and responding to this story.

Simplifying B2B integrations with AWS Step Functions Workflow Studio

Patrick Guha — Mon, 04 Oct 2021 13:57:56 GMT

This post is written by Patrick Guha, Associate Solutions Architect, WWPS, Hamdi Hmimy, Associate Solutions Architect, WWPS, and Madhu Bussa, Senior Solutions Architect, WWPS

B2B integration helps organizations with disparate technologies communicate and exchange business-critical information. However, organizations typically have few options in building their B2B pipeline infrastructure. Often, they use an out-of-the-box SaaS integration platform that may not meet all of their technical requirements. Other times, they code an entire pipeline from scratch.

AWS Step Functions offers a solution to this challenge. It combines the code customizability of AWS Lambda with the low-code, visual building experience of Workflow Studio.

This post introduces a customizable serverless architecture for B2B integrations. It explains how the solution works and how it can save you time in building B2B integrations.

Overview

The following diagram illustrates components and interactions of the example application architecture:

This section describes building a configurable B2B integration pipeline with AWS serverless technologies. It walks through the components and discusses the flow of transactions from trading partners to a consolidated database.

Communication

The B2B integration pipeline accepts transactions both in batch (a single file with many transactions) and in real-time (a single transaction) modes. Batch communication is over open standard SFTP protocol, and real-time communication is over the REST/HTTPS protocol.

For batch communication needs, AWS Transfer Family provides managed support for file transfers directly into and out of Amazon Simple Storage Service (S3) or Amazon Elastic File System (EFS). EFS provides a serverless, elastic file system that lets you share file data without provisioning or managing storage.

Amazon EventBridge provides serverless event bus functionality to automate specific actions in the B2B pipeline. In this case, batch transaction uploads from a partner trigger the B2B pipeline. When a file is put in S3 from the Transfer SFTP server, EventBridge captures the event and routes it to a target Lambda function.

As batch transactions are saved via AWS Transfer SFTP to Amazon S3, AWS CloudTrail captures the events. It provides the underlying API requests as files are PUT into S3, which triggers the EventBridge rule created previously.
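
For reference, an EventBridge event pattern that matches S3 PutObject calls recorded by CloudTrail generally looks like the dictionary below. The bucket name is a placeholder, the rule in the sample repository may differ, and the matches function is a deliberately simplified stand-in for EventBridge's matching logic (exact values and nested objects only), included just to show how the pattern lines up with an event.

```python
from typing import Any, Dict

# Event pattern for S3 PutObject data events delivered through CloudTrail.
# The bucket name is a hypothetical placeholder.
PATTERN: Dict[str, Any] = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject"],
        "requestParameters": {"bucketName": ["example-sftp-bucket"]},
    },
}


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Simplified matcher: lists mean 'any of these values', dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {
        "eventSource": "s3.amazonaws.com",
        "eventName": "PutObject",
        "requestParameters": {"bucketName": "example-sftp-bucket", "key": "batch.xml"},
    },
}
print(matches(PATTERN, sample_event))
```

Scoping the pattern to one bucket and one API call keeps unrelated S3 activity from invoking the Lambda target.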

For real-time communication needs, Amazon API Gateway provides an API management layer, allowing you to manage the lifecycle of APIs. Trading partners can send their transactions to this API over the ubiquitous REST API protocol.

Processing

Amazon Simple Queue Service (SQS) is a fully managed queuing service that allows you to decouple applications. In this solution, SQS manages and stores messages to be processed individually.

Lambda is a fully managed serverless compute service that allows you to create business logic functions as needed. In this example, Lambda functions process the data from the pipeline to clean, format, and upload the transactions from SQS.

Step Functions manages the workflow of a B2B transaction. Step Functions is a low-code visual workflow service used to orchestrate AWS services, automate business processes, and build serverless applications. Workflows manage failures, retries, parallelization, service integrations, and observability so developers can focus on higher-value business logic.

API Gateway is used in the processing pipeline of the solution to enrich the transactions coming through the pipeline.

Amazon DynamoDB serves as the database for the solution. DynamoDB is a key-value and document database that can scale to virtually any number of requests. As the pipeline experiences a wide range of transaction throughputs, DynamoDB is a good solution to store processed transactions as they arrive.

Batch transaction flow

A trading partner logs in to AWS Transfer SFTP and uploads a batch transaction file.
An S3 bucket is populated with the batch transaction file.
A CloudTrail data event captures the batch transaction file being PUT into S3.
An EventBridge rule is triggered from the CloudTrail data event.
Lambda is triggered from the EventBridge rule. It processes each message from the batch transaction file and sends individual messages to SQS.
SQS stores each message from the file as it is passed through from Lambda.
Lambda is triggered from each SQS incoming message, then invokes Step Functions to run through the following steps for each transaction.
Lambda accepts and formats the transaction payload.
API Gateway enriches the transaction.
DynamoDB stores the transaction.
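
The fan-out in the Lambda step can be sketched as follows. This is not the code from the sample repository: the XML element and attribute names are hypothetical, and the send callable is injected so the sketch runs without AWS access. In a real Lambda function it could wrap boto3's sqs.send_message(QueueUrl=..., MessageBody=body).

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, List


def split_batch(xml_text: str) -> List[str]:
    """Turn each child element of the root into one JSON message body."""
    root = ET.fromstring(xml_text)
    messages = []
    for child in root:
        messages.append(json.dumps({
            "attributes": dict(child.attrib),
            "text": "".join(child.itertext()).strip(),
        }))
    return messages


def fan_out(xml_text: str, send: Callable[[str], None]) -> int:
    """Send one message per transaction and return how many were sent."""
    bodies = split_batch(xml_text)
    for body in bodies:
        send(body)
    return len(bodies)


# Hypothetical sample; the element and attribute names are assumptions.
SAMPLE = """<Transactions>
  <Transaction TransactionID="1"><Notes>Transaction made between user 57 and user 732.</Notes></Transaction>
  <Transaction TransactionID="2"><Notes>Transaction made between user 9824 and user 2739.</Notes></Transaction>
</Transactions>"""

sent: List[str] = []
count = fan_out(SAMPLE, sent.append)
print(count, sent[0])
```

Because each transaction becomes its own SQS message, a failure in one transaction does not block the rest of the batch.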

Single/real-time transaction flow

A trading partner uploads a single transaction via an API Gateway REST API.
API Gateway sends a single transaction to the Lambda SQS writer function via proxy integration.
SQS stores each message from the API POSTs.
Lambda is triggered from each SQS incoming message. It invokes Step Functions to run through the workflow for each transaction.
Lambda accepts and formats the transaction payload.
API Gateway enriches the transaction.
DynamoDB stores the transaction.
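
A Lambda function behind an API Gateway proxy integration receives the query string in event["queryStringParameters"] and must return a statusCode and body. The sketch below shows what the SQS writer step might look like under that contract. It is not the repository's code: the parameter names follow the sample query string used later in this post, and the send callable is injected in place of a real SQS client.

```python
import json
from typing import Any, Callable, Dict


def make_handler(send: Callable[[str], None]):
    """Build a proxy-integration handler that forwards transactions via send."""

    def handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
        params = event.get("queryStringParameters") or {}
        required = ("transactionId", "transactionMessage")
        missing = [name for name in required if name not in params]
        if missing:
            return {"statusCode": 400, "body": json.dumps({"missing": missing})}
        # One queue message per API call, mirroring the batch path.
        send(json.dumps({
            "transactionId": params["transactionId"],
            "transactionMessage": params["transactionMessage"].strip(),
        }))
        return {"statusCode": 200, "body": json.dumps({"queued": params["transactionId"]})}

    return handler


queue: list = []
handler = make_handler(queue.append)
result = handler({"queryStringParameters": {
    "transactionId": "12",
    "transactionMessage": " Transaction made between user 687 and user 329.",
}})
print(result["statusCode"], queue)
```

Writing to the same queue as the batch path means both flows share the downstream Step Functions workflow unchanged.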

Exploring and testing the architecture

To explore and test the architecture, there is an AWS Serverless Application Model (AWS SAM) template. The AWS SAM template creates an AWS CloudFormation stack for you. This can help you save time building your own B2B pipeline, as you can deploy and customize the example application.

To deploy in your own AWS account:

To install AWS SAM, visit the installation page.
To deploy the AWS SAM template, navigate to the directory where the template is located. Run the following bash commands in the terminal:

git clone https://github.com/aws-samples/simplified-serverless-b2b-application

cd simplified-serverless-b2b-application

sam build

sam deploy --guided --capabilities CAPABILITY_NAMED_IAM

Prerequisites

Create an SSH key pair. To authenticate to the AWS Transfer SFTP server and upload files, you must generate an SSH key pair. Once created, you must copy the contents of the public key to the SshPublicKeyParameter displayed after running the sam deploy command. Follow the instructions to create an SSH key pair for Transfer.
Copy batch and real-time input. The following XML content contains multiple example transactions to be processed in the batch workflow. Create an XML file on your machine with the following content:

  Transaction made between user 57 and user 732.

  Transaction made between user 9824 and user 2739.

  Transaction made between user 126 and user 543.

  Transaction made between user 5785 and user 839.

  Transaction made between user 83782 and user 547.

  Transaction made between user 64783 and user 1638.

  Transaction made between user 785 and user 7493.

  Transaction made between user 5473 and user 3829.

  Transaction made between user 3474 and user 9372.

  Transaction made between user 1537 and user 9473.

  Transaction made between user 2837 and user 7383.

Similarly, the following content contains a single transaction to be processed in the real-time workflow.
transactionId=12&transactionMessage= Transaction made between user 687 and user 329.
Download Cyberduck, an SFTP client, to upload the batch transaction file to the B2B pipeline.

Uploading the XML file to Transfer and POST to API Gateway

Use Cyberduck to upload the batch transaction file to the B2B pipeline. Follow the instructions here to upload the preceding transactions XML file. You can find the Transfer server endpoint in both the Transfer console and the Outputs section of the AWS SAM template.

Use the API Gateway console to test the POST method for the single transaction workflow. Navigate to API Gateway in the AWS Management Console and choose the REST API created by the AWS SAM template called SingleTransactionAPI.

In the Resources pane, view the methods. Choose the POST method. Next, choose the client Test bar on the left.

Copy the single real-time transaction into the Query Strings text box then choose Test. This sends the single transaction and starts the real-time workflow of the B2B pipeline.

Viewing Step Functions executions

Navigate to the Step Functions console. Choose the state machine created by the AWS SAM template called StepFunctionsStateMachine.

In the Executions tab, you see a number of successful executions. Each transaction represents a Step Functions state machine execution. This means that every time a transaction is submitted by a trading partner to SQS, it is individually processed as a unique Step Functions state machine execution. This granularity is useful for tracking transactions.

Viewing Workflow Studio

Next, view the Step Functions state machine definition. On the StepFunctionsStateMachine page, choose Edit. You see a code and visual representation of your workflow.

The code version uses Amazon States Language, allowing you to modify the state machine as needed. Choose the Workflow Studio button to get a visual representation of the services and integrations in the workflow.

The Workflow Studio helps you to save time while building a B2B pipeline. There are over 40 different actions you can take on various AWS services and flow states that can provide additional logic to the workflow.

One of the largest benefits of Workflow Studio is the time savings possible through built-in integrations to AWS services. This architecture includes two integrations: API Gateway request and DynamoDB PutItem.

Choose the API Gateway request state in the diagram. To make a request to the API Gateway REST API, you update the API Parameters section in Configuration. Previously, you may have used a Lambda function to perform this action, adding extra code to maintain a B2B pipeline.

The same is true for DynamoDB. Choose the DynamoDB PutItem state to view more. The configuration to put the item is made in the API Parameters section. To connect to any other AWS services via Actions, add AWS Identity and Access Management (IAM) permissions for Step Functions to access them. These examples include the necessary IAM permissions for Step Functions to both API Gateway and DynamoDB in the AWS SAM template.
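
To show roughly what these two integrations look like in Amazon States Language, the sketch below builds a two-state definition as a Python dictionary and serializes it to JSON. It is not the definition from the sample repository: the endpoint, stage, path, table name, and item attributes are all placeholders, and the Lambda formatting step is left out for brevity.

```python
import json

# Two-state sketch: enrich through API Gateway, then store in DynamoDB.
# Every endpoint, path, and table name below is a hypothetical placeholder.
state_machine = {
    "Comment": "Sketch of the enrich-then-store portion of the workflow",
    "StartAt": "EnrichTransaction",
    "States": {
        "EnrichTransaction": {
            "Type": "Task",
            "Resource": "arn:aws:states:::apigateway:invoke",
            "Parameters": {
                "ApiEndpoint": "example123.execute-api.us-east-1.amazonaws.com",
                "Method": "POST",
                "Stage": "prod",
                "Path": "/enrich",
                "RequestBody.$": "$",
                "AuthType": "IAM_ROLE",
            },
            "ResultPath": "$.enriched",
            "Next": "StoreTransaction",
        },
        "StoreTransaction": {
            "Type": "Task",
            "Resource": "arn:aws:states:::dynamodb:putItem",
            "Parameters": {
                "TableName": "TransactionsTable",
                "Item": {
                    "transactionId": {"S.$": "$.transactionId"},
                    "message": {"S.$": "$.transactionMessage"},
                },
            },
            "End": True,
        },
    },
}

definition = json.dumps(state_machine, indent=2)
print(definition)
```

Both Task states call the service directly through the Resource ARN, which is what removes the need for a Lambda function whose only job is to make the API call.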

Cleaning up

To avoid ongoing charges to your AWS account, remove the created resources:

Use the CloudFormation console to delete the stack created as part of this demo. Choose the stack, choose Delete, and then choose Delete stack.
Navigate to the S3 console, then empty and delete the S3 buckets created from the stack: [stackName]-cloudtrails3bucket-[uniqueId] and [stackName]-sftpservers3bucket-[uniqueId]
Navigate to the CloudWatch console and delete the following log groups created from the stack: /aws/lambda/IntakeSingleTransactionLambdaFunction, /aws/lambda/SingleQueueUploadLambda, and /aws/lambda/TriggerStepFunctionsLambdaFunction.

Conclusion

This post shows an architecture to share your business data with your trading partners using API Gateway, AWS Transfer for SFTP, Lambda, and Step Functions. This architecture enables organizations to quickly on-board partners, build event-driven pipelines, and streamline business processes.

To learn more about B2B pipelines on AWS, read: https://aws.amazon.com/blogs/compute/integrating-b2b-using-event-notifications-with-amazon-sns/.

For more serverless learning resources, visit Serverless Land.

Originally published at https://aws.amazon.com on October 4, 2021.

Using Oracle Identity Cloud Service for SSO in a Python 3 Flask Web Application

Patrick Guha — Fri, 29 May 2020 18:46:13 GMT

Using Oracle Identity Cloud Service for Authentication in a Python 3 Flask Web Application

Goal:

This article is meant to document the proof-of-concept experiment in using Oracle’s Identity Cloud Service (IDCS) for Single-Sign-On (SSO) authentication in a Python web application built using the Flask framework.

Motivation:

Oracle has previously released a software development kit (SDK) for using IDCS with Python. However, the most recent SDK was written to work with the Django web framework, and it only supports the now-deprecated Python version 2.7. For beginners creating simple web apps, Flask is much more lightweight and easier to stand up than Django. It is also the framework I am personally familiar with. Thus, I thought it would be useful to integrate IDCS with Flask. The move to Python 3.8 also improves the security of web apps, since Python 2.7 no longer receives security updates.

Before we go into how to integrate IDCS with a Flask app, let’s first understand what each of the components is.

What is IDCS:

Oracle’s Identity Cloud Service is an Identity and Access Management (IAM) platform that provides identity management, SSO, and identity governance for apps both in the cloud and on-premises. A service like this makes it much easier to allow granular access to your applications, and the provided SDK makes developers’ jobs easier by not making them write code from scratch. The focus of IDCS is to give enterprise-level access and security to an organization.

What is Flask:

Flask is a Python micro-framework that is very lightweight and easy to use. It does not natively include many components, such as form validation or file uploads. Instead, it is highly extensible, meaning that you can very easily add the best external components to your application as you see fit. Flask is built on the Werkzeug utility library, which is a toolkit for Web Server Gateway Interface (WSGI) applications. Werkzeug provides the software objects for requests and responses. Flask’s core is also built on Jinja2, which is a web template engine (templates being static HTML pages).

Now that we’ve gone over what the components are, let’s take a high level overview of how authentication works with IDCS and a generic web application.

1. The user requests a protected resource.
2. The authentication module uses the SDK to generate a request-authorization-code URL for Oracle Identity Cloud Service and sends this URL as a redirect response to the web browser.
3. The web browser calls the URL.
4. The Oracle Identity Cloud Service Sign In page appears.
5. The user submits their Oracle Identity Cloud Service sign-in credentials.
6. After the user logs in successfully, Oracle Identity Cloud Service creates a session for the user and issues an authorization code.
7. The web application makes a back-end (or server-to-server) call to exchange the authorization code for a user access token.
8. Oracle Identity Cloud Service issues the access token.
9. A session is established, and the user is redirected to the Home page.
10. The Home page of the web application appears.

The above steps describe what is called a “three-legged authentication flow,” with the three coming from each party involved (user, web app, and IDCS). The three-legged authentication flow, as you can see from step 6, uses the authorization code grant type.
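
To make the redirect that starts this flow more concrete, here is a minimal sketch of building an authorization-code request URL with the standard library. It is not the IDCS SDK, which generates this URL for you. The /oauth2/v1/authorize path, the tenant host name, and the helper name are assumptions based on the standard OAuth 2.0 authorization code flow.

```python
import secrets
from urllib.parse import urlencode


def build_authorize_url(
    base_url: str, client_id: str, redirect_uri: str, scope: str, state: str
) -> str:
    """Build an authorization-code request URL (hypothetical helper)."""
    query = urlencode({
        "client_id": client_id,
        "response_type": "code",  # asks for an authorization code
        "redirect_uri": redirect_uri,
        "scope": scope,
        "state": state,  # echoed back so the app can detect forged callbacks
    })
    # The endpoint path is an assumption, not taken from the SDK.
    return f"{base_url.rstrip('/')}/oauth2/v1/authorize?{query}"


state = secrets.token_urlsafe(16)
url = build_authorize_url(
    "https://idcs-example.identity.oraclecloud.com",  # placeholder tenant URL
    "your-client-id",
    "http://localhost:8000/callback",
    "urn:opc:idm:t.user.me openid",
    state,
)
print(url)
```

The browser is redirected to this URL, the user signs in, and the identity provider redirects back to redirect_uri with the authorization code and the same state value.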

Let’s now look at the required setup you will need to do in IDCS to add a web application for SSO. *This requires an active IDCS account.*

IDCS Setup:

1. Log in to the IDCS console.
2. In the console, expand the Navigation Drawer, and then click Applications.
3. On the Applications page, click Add.

4. In the Add Application chooser dialog, click Confidential Application.

5. Populate the Details pane as follows, and then click Next.

Name: SDK Web Application
Description: SDK Web Application

6. In the Client pane, select Configure this application as a client now, and then populate the fields of this pane as follows:

Allowed Grant Types: Select Client Credentials and Authorization Code.
Allow non-HTTPS URLs: Select this check box. The sample application works in non-HTTPS mode.
Redirect URL: http://localhost:8000/callback
Post Logout Redirect URL: http://localhost:8000

7. In the Client pane, scroll down, click the Add button below Grant the client access to Identity Cloud Service Admin APIs.

8. In the Add App Role dialog window, select Authenticator Client and Me in the list, and then click Add.

9. Click Next in the Client pane and in the following panes until you reach the last pane. Then click Finish.

10. In the Application Added dialog box, make a note of the Client ID and Client Secret values (because your web application needs these values to integrate with Oracle Identity Cloud Service), and then click Close.

11. To activate the application, click Activate.

12. In the Activate Application? dialog box, click Activate Application. The success message The SDK Web Application application has been activated should appear.

13. In the IDCS console, click the user name at the top-right of the screen, and click Sign Out.

Ok, now that we’ve configured IDCS, let’s jump into the code for our application. This is a very simplistic HTML web app meant to demonstrate the process. We will look at both the project structure and the individual files in the project.

Project Structure:

Constants.py

Part 1 of 2 of the IDCS Python SDK. I included it in my GitHub repository, so there is no need to download it again from the IDCS console. This file simply holds constant variables for the main IdcsClient.py. This code is a bit long to post here, so refer to the GitHub.

IdcsClient.py

Part 2 of 2 of the IDCS Python SDK. This file is the critical client that connects to IDCS. I made no changes to this file. I simply found out how to use it in the Flask framework, rather than Django. This code is a bit long to post here, so refer to the GitHub here, as well.

config.json

Holds the JSON object that’s loaded with the Oracle Identity Cloud Service connection information. Replace the placeholder values (Client ID, Client Secret, base URL, and audience service URL) with your own:

{
  "ClientId": "your-client-id",
  "ClientSecret": "your-client-secret",
  "BaseUrl": "your-base-url",
  "AudienceServiceUrl": "your-audience-service-url (same as base url)",
  "scope": "urn:opc:idm:t.user.me openid",
  "TokenIssuer": "https://identity.oraclecloud.com/",
  "redirectURL": "http://localhost:8000/home",
  "logoutSufix": "/oauth2/v1/userlogout",
  "LogLevel": "INFO",
  "ConsoleLog": "True"
}
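
A small loader can catch a missing or misspelled key before the app tries to contact IDCS. The sketch below is an illustration rather than the code in main.py, and the list of required keys is an assumption based on the fields shown above.

```python
import json

# Keys assumed to be required, based on the config.json shown above.
REQUIRED_KEYS = (
    "ClientId", "ClientSecret", "BaseUrl", "AudienceServiceUrl",
    "scope", "TokenIssuer", "redirectURL", "logoutSufix",
)


def load_options(raw: str) -> dict:
    """Parse the config text and fail early if a required key is missing."""
    options = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in options]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return options


example = json.dumps({key: "placeholder" for key in REQUIRED_KEYS})
options = load_options(example)
print(sorted(options))
```

In the app itself you would pass in the contents of config.json, for example by reading the file with open("config.json").read().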

main.py

Main Flask web application. I’ll dive deep into this in a later section, since this is the crucial part of this experiment.

requirements.txt

Simple list of requirement libraries needed to run the IdcsClient.py and main.py.

flask
requests
jwt
six
cryptography
simplejson
lru-ttl

To install these Python packages, run:

$ pip install -r requirements.txt

templates → login.html & home.html

HTML static templates that Flask renders. These define what our app looks like. Flask requires the HTML templates to be in the folder named templates. These templates also connect to main.py, so I’ll go over them in more depth in a later section.

main.py:

The main.py file has many key chunks of code that I’d like to go over here. These explanations will help describe both how IDCS and the Flask framework operate.

https://medium.com/media/327a19ae734afa198afeeae4472a033a/href

Flask Basics

If you are unfamiliar with Flask, the first thing every Flask program does is create a Flask object, here called app, as on line 6. To run the app, you simply call app.run with the port you want to serve the web app on (shown in lines 97–98).

You’ll also see @app.route pop up a lot. In Flask, these @app.route decorators define the pages of your web application. For example, the route /home ties to the Python function home() on lines 63–64. So, when someone goes to your-website.com/home, the home() code gets invoked. home() will render the home.html template code that we wrote. If there is no @app.route above a Python function definition, then it is not accessible as a web page; it is only callable from the code (for example, the getoptions() function on line 32).
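To make the routing idea concrete without needing Flask installed, here is a toy stand-in written in plain Python. TinyApp is a hypothetical class invented for this illustration and is not how Flask is implemented; it only shows that a route decorator records which function serves which path, and that an undecorated function has no URL.

```python
class TinyApp:
    """Toy stand-in for a Flask app: it only remembers which function serves which path."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        def register(func):
            self.routes[path] = func
            return func  # the function itself is left unchanged
        return register

    def handle(self, path):
        view = self.routes.get(path)
        if view is None:
            return "404 Not Found"
        return view()


app = TinyApp()


@app.route("/home")
def home():
    return "home page"


def getoptions():
    # No decorator, so no URL maps to this function.
    return {"BaseUrl": "https://example.invalid"}


print(app.handle("/home"))        # home page
print(app.handle("/getoptions"))  # 404 Not Found
```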

Flask and Sessions

The last basic Flask concepts that I want to mention are app.secret_key and session. Sessions encompass a key difference between how IDCS interacts with Flask versus how it interacts with Django. app.secret_key is a required key (in production, please make it far more random than ‘secret’, as used on line 9) that must be set to enable sessions in Flask. This key is used to cryptographically sign the session so that it cannot be tampered with. A session object in Flask holds per-user variables that persist across requests. It works like a Python dictionary, but it can also track modifications to its contents. Session data in the Flask app is stored in the browser as a cookie.
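The idea of a signed, client-side session can be sketched with the standard library alone. This is a simplified illustration and not Flask's actual cookie format; the helper names sign_session and load_session are made up for the example.

```python
import base64
import hashlib
import hmac
import json
import secrets

# In production, generate the key once and keep it out of source control.
SECRET_KEY = secrets.token_hex(32)


def sign_session(data, key):
    """Serialize the session dict and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode("utf-8"))
    signature = hmac.new(key.encode("utf-8"), payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def load_session(cookie, key):
    """Return the session dict, or None if the cookie was altered."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(key.encode("utf-8"), payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Note that the payload here is only encoded, not encrypted: the signature stops a user from changing the cookie, but anyone holding the cookie can still read it.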

Now let’s go back to how sessions are different in Flask and Django. Sessions in the Flask framework are client-side (stored on the user’s local machine) by default, while sessions in Django are server-side (stored on the cloud server). This affects IDCS integration greatly, and I’ll describe how with this example. The documentation for IDCS and Django integration shows how to work with an IDCS session variable such as id_token (aka the user access token from the three-legged authentication flow) by simply referencing it like this:

request.session['id_token'] = id_token

Because the id_token is stored on a secure IDCS server in the cloud, and Django sessions are server-side, they could easily get the id_token by just calling the session variable. If you were to try and access the session variable, id_token, in Flask the same way, you would get an error, because the Flask (client-side) session is on your local machine, and it has no idea what an id_token is. Therefore, the Flask web app must get the id_token by calling the IDCS REST API (I will describe how to do this in a later section).

Note: because sensitive information (id_token) is being sent over the internet via a REST API call, it is imperative that this connection be secure and resistant to man-in-the-middle attacks in production environments. Namely, you should be using the HTTPS protocol, which involves adding an SSL/TLS (1.2/1.3) certificate to the web server (Apache, NGINX, etc.) that the Flask app will be hosted on. I will leave it at that since it is beyond the scope of this experiment, but it is still worthwhile bringing up.

Now that I’ve gone over the basics of Flask apps, routes, and sessions, I’ll go into each @app.route code more closely.

@app.route (‘/’)

Starting on line 58 of main.py, this is the base page that users go to when they visit your Flask web app online. It simply renders the HTML template login.html shown here:

https://medium.com/media/ecca3f73e0a91ae9ae421cffb492ecef/href

The key code to look at is in line 30. This HREF tag points to the next important @app.route, /auth.

@app.route (‘/auth’)

This route starts on line 13 and runs the auth() function. This function is responsible for authenticating the user to IDCS. It directly interfaces with the IDCS SDK (aka IdcsClient) and uses the getoptions() function to unpack the required credentials so that the SDK can use them. You might be wondering how we get from /auth to our actual web app home page, /home. This is actually done by IDCS in the form of a redirect URL. In the config.json file from earlier, we defined the redirect URL to be “http://localhost:8000/home”. This URL explicitly takes us back to the /home route of our app once IDCS finishes authenticating the user.

@app.route (‘/home’)

This route starts on line 63 and runs the home() function. This function is responsible for 1) rendering the home.html template and 2) calling the IDCS REST API for the authenticated user’s id_token I spoke about earlier. Since 1) is very similar to what I showed in the login.html, I’ll avoid talking about it again here. The more interesting responsibility is 2). The goal of calling the IDCS REST API here is to add the id_token to the Flask local client-session, or browser cookie. This id_token is needed to perform IDCS tasks. In our case, it will be needed to log out successfully.

Now, let’s look at the anatomy of calling the IDCS REST API in Python. First, it calls the getoptions() function to load the IDCS credentials. Next, on line 74, you can see that we run this code:

session['code'] = request.args.get('code')

This line both creates a dictionary entry in the Flask session called ‘code’ and initializes it with the query parameter ‘code’ from the current URL. That’s a loaded statement, so let’s break it down. Flask’s request object is used here to read the URL, and the argument it takes from the URL is ‘code’. ‘code’ is actually the authorization code granted to the user once they pass authentication with IDCS in the /auth route. It is tacked onto the end of the redirect URL after authentication succeeds. It is what’s known as a query parameter (aka whatever shows up after the “?” in a URL).
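To see what a query parameter looks like outside of Flask, the standard library can pull one out of a URL. The redirect URL and the helper name get_query_param below are hypothetical values chosen for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def get_query_param(url, name):
    """Return the first value of a query parameter, or None if it is absent."""
    values = parse_qs(urlsplit(url).query).get(name)
    return values[0] if values else None


# Hypothetical redirect back from the identity provider.
redirect = "http://localhost:8000/home?code=abc123&state=xyz"
print(get_query_param(redirect, "code"))                      # abc123
print(get_query_param("http://localhost:8000/home", "code"))  # None
```

The second call mirrors what happens when someone opens /home directly: there is no code parameter to read, so the lookup comes back empty.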

Next, on line 76, we set the specific fields needed to make the correct call to the IDCS REST API endpoint. These fields are enclosed in a simple JSON object called data. We essentially do the same thing for the headers needed on line 82. The actual call to the IDCS endpoint is performed through the Python library, requests, and shown again here on line 86:

response = requests.post(options['BaseUrl'] + '/oauth2/v1/token?', data=data, headers=headers, auth=(options['ClientId'], options['ClientSecret']))

The requests library is not to be confused with the ‘request’ object used earlier. request is a part of the Flask framework, while requests is an external library used to make REST calls with Python. Through a POST call to the IDCS endpoint ‘/oauth2/v1/token’, we will get back the user’s id_token, or access token. I highly recommend using the Python requests library as I did here. You could try making the call on your own through a similar cURL library, but I always ended up receiving an HTTP error code of 40x, meaning that something was wrong with the way I was calling it. Letting the requests library handle the authentication through auth=(options['ClientId'], options['ClientSecret']) is a much easier solution.
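For readers curious about what the auth=(...) tuple saves you from writing, the sketch below builds roughly the same pieces by hand with the standard library: an HTTP Basic Authorization header and a form-encoded body. The field names grant_type and code follow the standard OAuth 2.0 authorization code grant and are an assumption about what the data object in main.py contains; the helper names are made up for this example.

```python
import base64
from urllib.parse import urlencode


def basic_auth_header(client_id, client_secret):
    """Build the Authorization header that HTTP Basic authentication uses."""
    raw = f"{client_id}:{client_secret}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def token_request_body(code):
    """Form-encode the fields of an OAuth 2.0 authorization code exchange."""
    return urlencode({"grant_type": "authorization_code", "code": code})


print(basic_auth_header("my-client-id", "my-client-secret"))
print(token_request_body("abc123"))  # grant_type=authorization_code&code=abc123
```

Getting the encoding or the header name slightly wrong is an easy way to end up with the 40x responses described above, which is a good argument for letting requests do it.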

Once you finally make the successful REST call, the resulting id_token will be sent back to you in the JSON object we call response. Line 89 simply parses that response JSON object and puts the id_token field into our local Flask session (or browser cookie), which we will use to log out successfully.

Line 91 implements an important security check that covers an interesting edge case.

if response.status_code != 200:
    return render_template('login.html')

If a user were to try and access our web app by purposely bypassing the base / route or even the /auth route (ex: going straight to http://my-web-app/home), this if statement blocks them from accessing our protected web app (home.html) without being expressly authenticated. Remember when I stated that I kept receiving an HTTP error code 40x when unsuccessfully trying to call the IDCS API? Well, that turned out to be a useful exercise. If a user goes straight to the /home route, the session variable ‘code’ (which can only be correctly set after successfully going through the /auth route) will be empty (None). The IDCS API call requires ‘code’ as a parameter, so a would-be hacker will always get a failed HTTP error code 40x from the API as a response.

One might be thinking that they could simply spoof or fake an authorization code by giving a dummy ‘code’ query parameter (ex: http://my-web-app/home?code=spoof). This would also block the hacker, because, again, the HTTP response code returned by the API would be 40x, not the success code 200. Only the exact authorization code can be used to return the user access token. Any response code other than 200 results in the user being sent back to the login.html page, where they would be forced to authenticate with IDCS. If the response code is 200, the user can finally see the protected web app, home.html.
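The decision described above can be isolated in a small function that is easy to test without calling IDCS. This is a sketch under assumptions: page_for_token_response is a hypothetical helper, and status_code and body stand in for response.status_code and response.json() from the real call.

```python
def page_for_token_response(status_code, body, session):
    """Decide which template to show after the token call."""
    if status_code != 200:
        # Missing or forged authorization code: send the visitor back to log in.
        return "login.html"
    # Only a successful exchange carries an id_token worth keeping.
    session["id_token"] = body["id_token"]
    return "home.html"


session = {}
print(page_for_token_response(400, {"error": "invalid_grant"}, session))  # login.html
print(page_for_token_response(200, {"id_token": "example-token"}, session))  # home.html
```

Checking the status before reading id_token is a deliberate choice in this sketch: an error response has no id_token field, so reading it first would raise a KeyError instead of showing the login page.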

@app.route(‘/logout’)

Once the user of the web app is finished with their session, they will click the logout button on the HTML page, which is backed by an HREF tag pointing to the /logout route (similar to the link to /auth that we saw in login.html earlier). The logout() function is key to keeping our app secure, as well. We’ll go over how it does so here.

The first thing we do on line 42 is grab the id_token we stored in our session in the /home route. Then, as usual, we call getoptions() to load the configuration. We then append each of those values to a string called url. The final value, id_token, is the last query parameter, token_hint, in the url. This long url is the redirect URL to IDCS, which handles ending the session on the server. We also clear all of our local session variables in Flask by running session.clear(). Lastly, on line 55, we redirect the browser to the url we built, thus ending our secure web session from Flask to IDCS.
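As an alternative to concatenating the string by hand, the logout URL could be assembled with urlencode, which takes care of escaping. The parameter names post_logout_redirect_uri and id_token_hint are assumptions borrowed from the OpenID Connect logout convention, and build_logout_url and return_to are hypothetical names; check main.py for the exact spelling it uses.

```python
from urllib.parse import urlencode


def build_logout_url(options, id_token, return_to="http://localhost:8000"):
    """Assemble the IDCS logout redirect from the loaded configuration."""
    query = urlencode({
        "post_logout_redirect_uri": return_to,  # assumed parameter name
        "id_token_hint": id_token,              # assumed parameter name
    })
    return options["BaseUrl"] + options["logoutSufix"] + "?" + query
```

In a Flask view, you would typically call session.clear() and then return redirect(build_logout_url(options, id_token)), reading id_token from the session before clearing it.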

Closing Remarks

I hope this code overview and process flow was useful in explaining the inner workings of both the Oracle Identity Cloud Service and the Python Flask micro web framework.

By taking the time to understand this code, one can really get a better appreciation of secure web technology, in general.

Thank you for reading!

Final Code on GitHub:

ironspur5/python3-flask-oracle-idcs

Resources:

Use Oracle Identity Cloud Service’s Software Development Kit (SDK) for Authentication in Python Web Applications: https://www.oracle.com/webfolder/technetwork/tutorials/obe/cloud/idcs/idcs_python_sdk_obe/idcs-python-sdk.html

Authenticate a Python application with Oracle Identity Cloud Service: https://docs.oracle.com/en/solutions/authenticate-sample-python-app-with-identity-cloud/learn-authentication-python-applications-and-oracle-identity-cloud-service1.html#GUID-2048B437-1DE8-445B-99D1-7053766044B1

Review user and application access with Oracle IDCS Audit Reports: https://blogs.oracle.com/imc/review-user-and-application-access-with-oracle-idcs-audit-reports

Flask vs Django: http://www.mindfiresolutions.com/blog/2018/05/flask-vs-django/

Authorization Code Grant Type: https://docs.oracle.com/en/cloud/paas/identity-cloud/rest-api/AuthCodeGT.html

Generate Access Token and Other OAuth Runtime Tokens to Access the Resource: https://docs.oracle.com/en/cloud/paas/identity-cloud/17.4.2/rest-api/op-oauth2-v1-token-post.html

Sessions in Flask: https://overiq.com/flask-101/sessions-in-flask/

Oracle Identity Cloud Service’ SDK Python Sample Application: https://github.com/oracle/idm-samples/tree/master/idcs-sdk-sample-apps/python

Parameters as Query String Values: https://howto.caspio.com/parameters/parameters-as-query-string-values/