Gemini API: Revolutionizing Content Generation with Direct PDF Input

Kanshi Tanaike
Google Cloud - Community
Jul 23, 2024

Abstract

Gemini API now enables direct PDF processing for content generation, eliminating image conversion and reducing costs. This report provides a sample script to demonstrate this new capability and its potential applications.

Introduction

Gemini API has recently introduced the ability to directly process PDF data for content generation, significantly enhancing its capabilities. Previously, to use PDF data for content generation, each PDF page had to be converted into a separate image. This time-consuming and resource-intensive step is no longer needed, which substantially reduces the processing cost.

By directly ingesting PDF content, Gemini API unlocks new possibilities for various applications. This report will provide a sample script demonstrating how to effectively harness this feature by generating content directly from PDF data using the Gemini API.

Sample script

This script uses GeminiWithFiles, a Google Apps Script library. Please install it before testing the following script.

Copy and paste the following script into the script editor of Google Apps Script, and set your API key.

function sample() {
  const apiKey = "###"; // Please set your API key.

  // PDF files to summarize.
  const urls = [
    "https://journals.aps.org/pr/pdf/10.1103/PhysRev.48.73", // from https://journals.aps.org/pr/abstract/10.1103/PhysRev.48.73
    "https://arxiv.org/pdf/1706.03762.pdf", // from https://research.google/pubs/attention-is-all-you-need/
  ];

  // Download all PDFs and keep them as blobs.
  const blobs = UrlFetchApp.fetchAll(urls).map((r) => r.getBlob());

  // The schema describes both the task and the shape of the expected JSON output.
  const jsonSchema = {
    description:
      "Summarize the following papers within 50 words. Also, retrieve each title and authors of each paper.",
    type: "array",
    items: {
      type: "object",
      properties: {
        title: { description: "Title of the paper", type: "string" },
        authors: { description: "Authors of the paper", type: "string" },
        summary: {
          description: "Summary of the paper within 50 words.",
          type: "string",
        },
      },
      required: ["title", "summary"],
      additionalProperties: false,
    },
  };

  const g = new GeminiWithFiles.geminiWithFiles({
    apiKey,
    response_mime_type: "application/json",
    model: "models/gemini-1.5-pro-latest",
    doCountToken: true,
  });

  // Upload the PDF blobs to Gemini API.
  const fileList = g.setBlobs(blobs).uploadFiles();

  console.log(fileList); // Here, you can see the metadata of the uploaded data.

  // Generate content from all uploaded files with one request.
  const res = g
    .withUploadedFilesByGenerateContent(fileList)
    .generateContent({ jsonSchema });

  // Delete the uploaded files after use.
  g.deleteFiles(fileList.map(({ name }) => name));

  console.log(res);
}

The URLs in the above script are as follows.

  • https://journals.aps.org/pr/pdf/10.1103/PhysRev.48.73: This is from https://journals.aps.org/pr/abstract/10.1103/PhysRev.48.73
  • https://arxiv.org/pdf/1706.03762.pdf: This is from https://research.google/pubs/attention-is-all-you-need/

In this script, the PDF data is downloaded and used with Gemini API. The flow of this script can be seen in the top image of this report.
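The flow described above (download, upload, generate, delete) can be sketched as a small wrapper that always deletes the uploaded files, even when content generation throws. This is only a sketch: runWithCleanup and the steps object are hypothetical names introduced here for illustration and are not part of GeminiWithFiles.

```javascript
// Hypothetical helper (not part of GeminiWithFiles): runs the four steps of
// the flow and guarantees that uploaded files are removed afterwards.
function runWithCleanup(steps, urls) {
  const blobs = steps.download(urls);
  const fileList = steps.upload(blobs);
  try {
    return steps.generate(fileList);
  } finally {
    // Runs whether generate() returned normally or threw an error.
    steps.remove(fileList.map(({ name }) => name));
  }
}

// Possible wiring in Apps Script, reusing the calls from the sample script
// (g, jsonSchema, and urls are assumed to be defined as in that script):
// const res = runWithCleanup(
//   {
//     download: (u) => UrlFetchApp.fetchAll(u).map((r) => r.getBlob()),
//     upload: (b) => g.setBlobs(b).uploadFiles(),
//     generate: (f) =>
//       g.withUploadedFilesByGenerateContent(f).generateContent({ jsonSchema }),
//     remove: (names) => g.deleteFiles(names),
//   },
//   urls
// );
```

Passing the steps in as functions keeps the wrapper independent of Apps Script services, so it can be checked with stub functions before it is connected to the real API.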

On August 3, 2024, GeminiWithFiles was updated to v2.0.0, and the above script was updated accordingly.

Result

When this script is run, the following result is obtained. Multiple PDF files can be parsed by one API call.

[
{
"title": "The Particle Problem in the General Theory of Relativity",
"authors": "A. EINSTEIN AND N. ROSEN",
"summary": "This paper explores an atomistic theory of matter and electricity using general relativity and electromagnetism. It modifies gravitational equations to admit regular solutions for static spherically symmetric cases, representing particles as \"bridges\" connecting two identical sheets of space. The theory explains the absence of negative mass particles and offers a unified treatment of field and motion."
},
{
"title": "Attention Is All You Need",
"authors": "Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Aidan N. Gomez, Illia Polosukhin, Łukasz Kaiser, Jakob Uszkoreit",
"summary": "This paper introduces the Transformer, a novel network architecture based solely on attention mechanisms for sequence transduction tasks. It replaces traditional recurrent and convolutional layers with multi-head self-attention, enabling superior parallelization and performance. Experiments on machine translation show significant improvements in BLEU scores and training time, achieving state-of-the-art results."
}
]
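Because the output is plain JSON, it can be checked before it is used further. The following is a minimal sketch, assuming the result is either a JSON string or an already parsed array; checkSummaries is a hypothetical helper name. It reports which required fields are missing and how many words each summary contains, since a word limit written in a schema description is a request to the model rather than a guarantee.

```javascript
// Hypothetical helper: checks each returned item against the keys marked as
// required in the schema and counts the words in each summary.
function checkSummaries(result, requiredKeys = ["title", "summary"]) {
  const items = typeof result === "string" ? JSON.parse(result) : result;
  if (!Array.isArray(items)) {
    throw new Error("Expected an array of paper summaries.");
  }
  return items.map((item, index) => {
    // A key counts as missing when it is absent, not a string, or blank.
    const missing = requiredKeys.filter(
      (key) => typeof item[key] !== "string" || item[key].trim() === ""
    );
    const wordCount =
      typeof item.summary === "string" && item.summary.trim() !== ""
        ? item.summary.trim().split(/\s+/).length
        : 0;
    return { index, title: item.title, missing, wordCount };
  });
}

// Example use with the value logged by the script:
// checkSummaries(res).forEach((r) => console.log(r));
```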

From v2, GeminiWithFiles can use PDF blobs directly, as in the above script. With v1, the second argument of setBlobs determines how PDF data is handled. When it is false, as in const fileList = await g.setBlobs(blobs, false).uploadFiles();, the script feeds the PDF data directly to the API. When it is true, as in const fileList = await g.setBlobs(blobs, true).uploadFiles();, the script converts each page of the PDFs into an individual image before sending the images to the API.

This experiment compared the processing times between directly using PDF data and converting PDFs to images. Here’s a breakdown:

  • Direct PDF processing: 15 seconds
  • Image-based processing: 120 seconds

(Tested on PDFs with 5 and 15 pages from https://journals.aps.org/pr/pdf/10.1103/PhysRev.48.73 and https://arxiv.org/pdf/1706.03762.pdf, respectively)

As these results show, feeding PDF data directly to the Gemini API was about 8 times faster than the image-based approach (15 seconds versus 120 seconds). This is likely due to the overhead of converting each page to an image.
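The arithmetic behind this comparison can be written out as a short sketch. The function name compareTimings is hypothetical, and the total of 20 pages is taken from the 5 and 15 pages stated above. These are the timings reported in this report, not a general benchmark.

```javascript
// Hypothetical helper: compares the two reported processing times.
function compareTimings(directSec, imageSec, totalPages) {
  return {
    // How many times faster the direct approach is.
    speedup: imageSec / directSec,
    // Share of the image-based time that is saved, in percent.
    reductionPercent: ((imageSec - directSec) / imageSec) * 100,
    // Average time per page for each approach.
    directSecPerPage: directSec / totalPages,
    imageSecPerPage: imageSec / totalPages,
  };
}

console.log(compareTimings(15, 120, 20));
// { speedup: 8, reductionPercent: 87.5, directSecPerPage: 0.75, imageSecPerPage: 6 }
```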

Note

This sample script is for Google Apps Script, but the same approach can also be used in other languages.
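Without the library, the approach comes down to one HTTP request whose body refers to the uploaded PDFs. The following sketch only builds that request and does not send it. The function name buildGenerateContentRequest is hypothetical, and the endpoint and field names (v1beta, file_data, file_uri, generationConfig) follow my understanding of the public Gemini REST API, so please check them against the current API reference before relying on them. Each entry of files is assumed to carry the uri and mimeType of an already uploaded file.

```javascript
// Hypothetical helper: builds the URL and JSON body of a generateContent call
// that refers to already uploaded PDF files. It performs no network access.
function buildGenerateContentRequest({ apiKey, model, prompt, files }) {
  // model is expected in the form "models/gemini-1.5-pro-latest".
  const url =
    "https://generativelanguage.googleapis.com/v1beta/" +
    model +
    ":generateContent?key=" +
    encodeURIComponent(apiKey);

  // One text part for the prompt, then one file part per uploaded PDF.
  const parts = [{ text: prompt }].concat(
    files.map(({ uri, mimeType }) => ({
      file_data: { mime_type: mimeType || "application/pdf", file_uri: uri },
    }))
  );

  const body = {
    contents: [{ role: "user", parts }],
    generationConfig: { response_mime_type: "application/json" },
  };
  return { url, body };
}

// The returned body can be serialized with JSON.stringify and sent as a POST
// request with contentType "application/json" by any HTTP client, for example
// UrlFetchApp.fetch in Apps Script or fetch in Node.js.
```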
