GETTING DARK DATA FROM GOOGLE DRIVE — NO GOOGLE API
GOOGLE DRIVE | WORD & PDF DOCUMENTS | NO GOOGLE API
We have learned how to get data from files, from SQL on localhost, and from SQL Server, and most of the time the data came in the form of a table. But what if we need to get data from documents (Word and PDF) stored in Google Drive? Then this article might be right for you, especially since no Google API setup is required!
Consider you are given a task:
“Hey, I have a bunch of documents saved on Google Drive. Is it possible to read all those Word and PDF files and sum them up into a table?”
Being the good data analyst you are, of course you will answer, “Sure.”
So let’s get into it. First, you will need a script that lists the files inside a specific Google Drive folder. To do this, create a new Google spreadsheet, go to Extensions, and choose Apps Script.
This will take you to the Apps Script editor.
In Code.gs, write the following code and replace the folderId variable with your desired folder id. I am sure you know where to find it: yup, that random string in your address bar.
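If it helps, the folder id is simply the last path segment of the folder URL. A quick sketch of pulling it out (the URL below is a made-up example):

```python
# Extract the folder id from a Drive folder URL (made-up example URL)
url = "https://drive.google.com/drive/folders/1AbCdEfGhIjKlMnOp"
folder_id = url.rstrip("/").split("/folders/")[-1]
print(folder_id)  # -> 1AbCdEfGhIjKlMnOp
```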
// TODO: Set folder ID
var folderId = 'your folder id here';
var array = [];

// Main function: list all files & folders, and write them into the current sheet
function listAll() {
  getFolderTree(folderId, true);
}

// Get the folder tree and write the results to the first sheet
function getFolderTree(folderId, listAll) {
  try {
    // Get the folder by id
    var parentFolder = DriveApp.getFolderById(folderId);

    // Activate the first sheet and clear it
    var spreadsheet = SpreadsheetApp.getActiveSpreadsheet();
    SpreadsheetApp.setActiveSheet(spreadsheet.getSheets()[0]);
    var sheet = SpreadsheetApp.getActiveSheet();
    sheet.clear();

    // Collect files and folders recursively.
    // Note: this walks the subfolders of the given folder, so files
    // sitting directly in the top-level folder itself are not listed.
    getChildFolders(parentFolder.getName(), parentFolder, sheet, listAll);

    // Write the header row and all collected rows in one go
    sheet.getRange(1, 1, 1, 6).setValues([
      ['location', 'name', 'id', 'url', 'DateCreated', 'LastUpdated']
    ]);
    if (array.length > 0) {
      sheet.getRange(2, 1, array.length, 6).setValues(array);
    }
  } catch (e) {
    Logger.log(e.toString());
  }
}

// Collect the list of files and folders and their metadata recursively
function getChildFolders(parentName, parent, sheet, listAll) {
  var childFolders = parent.getFolders();

  // Walk each subfolder
  while (childFolders.hasNext()) {
    var childFolder = childFolders.next();

    // List files inside the subfolder
    var files = childFolder.getFiles();
    while (listAll && files.hasNext()) { // logical &&, not the bitwise &
      var childFile = files.next();
      array.push([
        parentName + "/" + childFolder.getName() + "/" + childFile.getName(),
        childFile.getName(),
        childFile.getId(),
        childFile.getUrl(),
        childFile.getDateCreated(),
        childFile.getLastUpdated(),
      ]);
    }

    // Recurse into the subfolder
    getChildFolders(parentName + "/" + childFolder.getName(), childFolder, sheet, listAll);
  }
}
This script reads all files under the folder in Google Drive and writes them into a table in your newly created spreadsheet: each file's location relative to the specified folder, its name, id, URL, creation date, and last-updated date. Before continuing to the next step, please set the spreadsheet's sharing so that anyone with the link has, at minimum, view access.
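The sharing step matters because everything that follows fetches the sheet anonymously through its CSV export endpoint, with no credentials attached. A minimal sketch of that URL pattern (the sheet id is a placeholder):

```python
def csv_export_url(sheet_id: str) -> str:
    """Build the anonymous CSV export URL for the first tab of a Google Sheet."""
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

# Placeholder id, for illustration only
print(csv_export_url("your-sheet-id"))
```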
This first table tells us where all the documents are located. From here on we will work in Python, starting with the following code:
import pandas as pd
import re

# Id of the spreadsheet created by the Apps Script above
sheet_id = 'your sheet id here'

# Convert the Google Sheet to CSV for easy handling
csv_url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"

# Create a dataframe from the CSV
df = pd.read_csv(csv_url)

# Keep only the desired file types (escape the dots so they are
# matched literally, not as the regex "any character")
desiredfiletype = [".pdf", ".doc", ".docx"]
df = df[df['name'].str.contains('|'.join(map(re.escape, desiredfiletype)))]

# Drop rows with missing values
df = df.dropna()

# Create the new columns "text" and "url_final"
df["text"] = ""
df["url_final"] = ""

# PDFs can be downloaded from their own URL; Word documents are fetched
# through the Docs export endpoint, which converts them to PDF
for index, row in df.iterrows():
    if str(row['name']).endswith(".pdf"):
        df.at[index, 'url_final'] = row['url']
    else:
        df.at[index, 'url_final'] = "https://docs.google.com/document/d/" + row['id'] + "/export?format=pdf"

# The original url column is no longer needed
del df['url']
df
The script requires your newly created Google spreadsheet id. It converts the spreadsheet to CSV via the export URL, lets pandas build a dataframe out of that CSV, and then filters the dataframe down to the desired file extensions; in this case “.pdf”, “.doc”, and “.docx”.
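One detail worth noting about the filter: pandas' `str.contains` treats its pattern as a regular expression, where a bare dot matches any character, so a pattern like ".doc" would also match names such as "xdocs". Matching on the end of the name sidesteps this entirely; a plain-Python sketch of the same idea with toy file names:

```python
# Toy file names; only the document types should survive the filter
names = ["report.pdf", "notes.docx", "memo.doc", "image.png"]
desired = (".pdf", ".doc", ".docx")

docs = [n for n in names if n.lower().endswith(desired)]
print(docs)  # -> ['report.pdf', 'notes.docx', 'memo.doc']
```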
For the Word documents (.doc and .docx), I prefer to download them as PDF first; the Docs export URL does the conversion, which makes the content much easier for Python to read. So first, we download everything to a local folder using the code below. You will need two libraries: gdown and requests.
import gdown
import requests

for index, value in df['url_final'].items():
    try:
        url = value
        # Check the file is reachable before handing it to gdown
        r = requests.get(url)
        if r.status_code == 200:
            output = r"D:/Documents/ISA/" + df["name"][index] + ".pdf"
            gdown.download(url, output, fuzzy=True)
    except Exception:
        pass
Now that all the files are on our local machine, we can read the documents in one pass with the following code. I have tried several PDF-reader libraries, and in my humble opinion, PyPDF2 works like a charm.
import PyPDF2

for index in df.index:
    try:
        # Open the PDF downloaded in the previous step
        output = r"D:/Documents/ISA/" + df["name"][index] + ".pdf"
        with open(output, 'rb') as pdfFileObject:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
            count = pdfReader.numPages
            text = ""
            # Extract the text page by page
            for i in range(count):
                page = pdfReader.getPage(i)
                text += page.extractText() + "\n"
        df.at[index, "text"] = text
    except Exception:
        pass
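To close the loop on the original request ("sum it into a table"), the filled dataframe can now be summarized and saved. A minimal sketch with toy values; the word-count column is my own illustrative addition, not part of the original task:

```python
import pandas as pd

# Toy stand-in for the dataframe built above (values are made up)
df = pd.DataFrame({
    "name": ["report.pdf", "notes.docx"],
    "text": ["quarterly results summary", "meeting notes draft"],
})

# Add a simple per-document word count and save the final table
df["word_count"] = df["text"].str.split().str.len()
df.to_csv("documents_summary.csv", index=False)
print(df["word_count"].tolist())  # -> [3, 3]
```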
So voilà!
I am sure the Google API is great, but for some of us, setting up projects, credentials, JSON files, et cetera is sometimes just too much, especially when there is an alternative that delivers the same result.