Automating PDF Text Extraction from Google Drive using Python

GOOGLE DRIVE | WORD & PDF DOCUMENTS | GOOGLE API

bedy kharisma
Data And Beyond


You may have come across the article “Getting Dark Data from Google Drive — No Google API” on Medium. The process outlined there combines JavaScript and Python: one part runs in Google Sheets with Apps Script, the other in a Python environment. That approach has its advantages, such as not requiring a Google API, but working across two languages and juggling separate environments can be challenging and time-consuming. If you prefer a streamlined approach without the hassle of mixing languages and enduring tedious processes, this article is for you.

In this article, we’ll explore a simplified method to extract dark data from Google Drive using only Python and the Google Drive API. By leveraging Python libraries and dropping the extra language and setup steps, the entire workflow becomes accessible to anyone who is primarily comfortable with Python.

We’ll walk through a script that automates extracting text from PDF files stored in a Google Drive folder, using libraries such as google-api-python-client, PyPDF2, gdown, and requests. Let’s dive into the script and understand each step.

Step 1: Setting Up

First, we need to install the necessary libraries by running the following command:

!pip install -q google-api-python-client PyPDF2 gdown

Next, we import the required modules:

import os

import pandas as pd
from google.oauth2 import service_account
from googleapiclient.discovery import build
import gdown
import requests
import PyPDF2

Step 2: Authentication and Setup

Before interacting with the Google Drive API, we need to set up authentication by providing the path to the service account JSON key file and the required scopes. If you do not know how, the article below might help you, especially step number 3.

SERVICE_ACCOUNT_KEY_PATH = './pydrive-*****-*****.json'
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

If you already have a service account, all you have to do is share the target folder with the service account’s email address, as one commenter pointed out. The snippet below shows how to find that address.
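The email address lives in the client_email field of the key file. A minimal sketch to print it, assuming the key path defined above:

import json

# Print the service account's email address so you can share the folder with it
with open(SERVICE_ACCOUNT_KEY_PATH) as f:
    key_data = json.load(f)
print(key_data["client_email"])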

Additionally, we’ll prompt the user to enter the folder ID of the target Google Drive folder:

folder_id = input("Enter folder ID: ")

Next, we build the service object for the Google Drive API using the authentication credentials:

credentials = service_account.Credentials.from_service_account_file(SERVICE_ACCOUNT_KEY_PATH, scopes=SCOPES)
service = build('drive', 'v3', credentials=credentials)
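As an optional sanity check (not part of the original script), you can confirm the credentials work by asking the API which account you are authenticated as:

# Optional: verify the credentials by fetching the authenticated user's info
about = service.about().get(fields="user").execute()
print(about["user"]["emailAddress"])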

Step 3: Recursive Function to Get File Metadata

To retrieve the list of files and folders within the specified Google Drive folder, we define a recursive function get_files_in_folder(). This function takes the parent folder ID as input and returns a concatenated DataFrame of the metadata for all files and folders within the folder.

def get_files_in_folder(parent_id):
    # List all non-trashed items whose parent is the given folder
    query = f"'{parent_id}' in parents and trashed=false"
    response = service.files().list(
        q=query,
        fields='files(name,id,mimeType,webViewLink,createdTime,modifiedTime)'
    ).execute()
    files = response.get('files', [])
    dfs = [pd.DataFrame(files)]
    # Recurse into any subfolders and collect their contents too
    for file in files:
        if file['mimeType'] == 'application/vnd.google-apps.folder':
            dfs.append(get_files_in_folder(file['id']))
    return pd.concat(dfs, ignore_index=True)
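One caveat: files().list() returns results in pages (100 items by default), so large folders will be silently truncated. A variant of the function (not in the original script) that follows nextPageToken might look like this:

def get_files_in_folder(parent_id):
    query = f"'{parent_id}' in parents and trashed=false"
    files, page_token = [], None
    # Keep requesting pages until the API stops returning a nextPageToken
    while True:
        response = service.files().list(
            q=query,
            fields='nextPageToken, files(name,id,mimeType,webViewLink,createdTime,modifiedTime)',
            pageToken=page_token,
        ).execute()
        files.extend(response.get('files', []))
        page_token = response.get('nextPageToken')
        if page_token is None:
            break
    dfs = [pd.DataFrame(files)]
    for file in files:
        if file['mimeType'] == 'application/vnd.google-apps.folder':
            dfs.append(get_files_in_folder(file['id']))
    return pd.concat(dfs, ignore_index=True)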

Step 4: Main Code Execution

We retrieve the folder name using the provided folder ID:

folder = service.files().get(fileId=folder_id, fields='name').execute()
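The returned dict contains only the requested field, which we can print to confirm we are targeting the right folder:

# Confirm we are looking at the intended folder
print(folder['name'])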

Then, we use the get_files_in_folder() function to obtain a DataFrame containing the metadata for all files and folders within the specified Google Drive folder:

df = get_files_in_folder(folder_id)

We filter the DataFrame to include only PDF files:

df = df[df['mimeType'] == 'application/pdf']

We add an empty column named “text” to store the extracted text from the PDF files:

df["text"] = ""

We create a temporary directory if it doesn’t already exist:

if not os.path.exists("./temp_dir"):
    os.makedirs("./temp_dir")

Next, we iterate over the DataFrame and download each PDF file using its webViewLink:

for index, value in df['webViewLink'].items():
    try:
        url = value
        # Quick reachability check before handing the URL to gdown
        r = requests.get(url)
        if r.status_code == 200:
            output = r"./temp_dir/" + df["name"][index] + ".pdf"
            # fuzzy=True lets gdown extract the file ID from the Drive URL
            gdown.download(url, output, fuzzy=True)
    except Exception:
        # Skip files that cannot be downloaded
        pass
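Note that webViewLink points at the Drive preview page, not the raw file, so gdown only succeeds for files that are shared publicly. Since we already have authenticated credentials, a more robust alternative (a sketch, not part of the original script) is to download through the API itself using each file’s id from the metadata DataFrame:

import io
from googleapiclient.http import MediaIoBaseDownload

def download_pdf(file_id, output_path):
    # Stream the file's raw bytes through the authenticated Drive service
    request = service.files().get_media(fileId=file_id)
    with io.FileIO(output_path, 'wb') as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()

# Usage inside the loop above: download_pdf(df["id"][index], output)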

Finally, we iterate over the DataFrame again and extract the text from each PDF file, storing it in the “text” column:

for index, value in df['name'].items():
    try:
        output = r"./temp_dir/" + df["name"][index] + ".pdf"
        # Open each downloaded PDF and concatenate the text of all pages
        with open(output, 'rb') as pdfFileObject:
            pdfReader = PyPDF2.PdfReader(pdfFileObject)
            text = ""
            for page in pdfReader.pages:
                text += page.extract_text() + "\n"
        # .loc avoids pandas' chained-assignment pitfall
        df.loc[index, "text"] = text
    except Exception:
        # Skip files that could not be parsed
        pass

Step 5: Saving the Result

Finally, we save the DataFrame to a pickle file for future use:

df.to_pickle("dataframe.pkl")
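In a later session, the pickle can be read straight back into a DataFrame to pick up where you left off:

# Reload the saved DataFrame in a later session
df = pd.read_pickle("dataframe.pkl")
print(df[["name", "text"]].head())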

By following this script, you can easily extract text from multiple PDF files stored in Google Drive and save the results for further analysis.
