Writing Your First Forensic Tool-1: Extracting Google Chrome Browser Data with Python and LLM

Ishtiaque Foysol
13 min read · Apr 25, 2024


A female binary wizard summoning her Python Djinn to fetch and parse forensic data during an investigation session. — AI Generated Photo

A Note for the Readers

This article is all about writing a rudimentary forensic Python tool that extracts and parses visited URLs, download URLs, bookmarks and searched keywords, among other data, from a Google Chrome browser.

Please note that several established cross-platform tools like browser-history, Chromagnon and Dumpzilla offer a more optimized and feature-rich approach to browser forensics investigations.

This write-up is one of the hands-on parts of an online course titled Python Scripting for Geeky Testers. The course, once run by the author, is now in limbo due to various factors like ‘work, life and interest balance’. So the author decided to make the class lectures and code public for testing enthusiasts, hoping to get valuable feedback from them to improve his own scripting skills.

This article aims to accomplish the following tasks to hone the reader’s expertise:

  • Manually locating and exploring interesting file locations of a web browser to gather insights into it
  • Connecting to a local database, scripting SQL queries to extract forensic artifacts and parsing the extracted data in a meaningful way
  • Utilizing Large Language Models as personal assistants to save the time spent googling, asking and waiting for answers to basic queries on the internet. We will also do some prompt engineering, feeding the LLMs the necessary information or documentation so they generate data tailored to the context, which engages the reader more deeply with the solution to the problem at hand.
  • Writing and structuring a basic script that will enable readers to explore how browser-history and dumpzilla work under the hood, as well as enabling them to use the tools more efficiently, tweak the source code for special needs and contribute to the projects.

This article assumes that the audience has some basic knowledge of SQL and the Python programming language. Readers without any prior knowledge are assumed to be proactive enough to use offline and online resources and LLM assistants to explore them.

A free-of-cost LLM can be a trusted friend for exploring new things in the deep, dark wild if fed enough context and precise guidelines. — AI Generated Photo

Section 1: Exploring Interesting Data Manually

The very basic task of a forensic tool is extracting and parsing interesting data from a (digital) artifact to get a deeper insight into what incidents took place in it. In this section we will write SQL queries that will fetch

  • Browsed URLs with their titles and respective browsing times
  • Keywords searched on a site, with the respective URL and time, and
  • Download links with respective times

This script will run both on *nix and nt systems, but first let us MANUALLY explore where the Google Chrome browser keeps its interesting data in a *nix system.

Fire up your Gemini terminal and input the following prompt

Consider a *nix system
- in which file does google chrome browser keep browsing data, searched keywords and download urls ?

Now open your terminal and connect to the db file

$ sqlite3 ~/.config/google-chrome/Default/History

The expected result is

SQLite version 3.34.1 2021-01-20 14:10:07
Enter ".help" for usage hints.
sqlite>

Now type .schema in the interactive DB shell. If you see an error message like

Error: database is locked

then your Chrome browser is open and holding a lock on the database file. In this case, close the Chrome browser and try again.
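
Alternatively, if you prefer not to close the browser, and in the spirit of forensics never touch the live database at all, a common workaround (not part of the original walkthrough) is to work on a copy of the History file:

$ cp ~/.config/google-chrome/Default/History /tmp/History
$ sqlite3 /tmp/History

For the rest of this walkthrough, though, we will keep things simple, close the browser and use the original path.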

The .schema command will print out the schema of all the tables in the History database. Here are the tables of interest (truncated view)

-- search term keywords
keyword_search_terms (keyword_id INTEGER NOT NULL,url_id INTEGER NOT NULL,term LONGVARCHAR NOT NULL,normalized_term LONGVARCHAR NOT NULL);
...
-- Downloads
downloads (id INTEGER PRIMARY KEY,guid VARCHAR NOT NULL,current_path LONGVARCHAR NOT NULL,target_path LONGVARCHAR NOT NULL,start_time INTEGER NOT NULL,received_bytes INTEGER NOT NULL,total_bytes INTEGER NOT NULL,state INTEGER NOT NULL,danger_type INTEGER NOT NULL,interrupt_reason INTEGER NOT NULL,hash BLOB NOT NULL,end_time INTEGER NOT NULL,opened INTEGER NOT NULL,last_access_time INTEGER NOT NULL,transient INTEGER NOT NULL,referrer VARCHAR NOT NULL,site_url VARCHAR NOT NULL,embedder_download_data VARCHAR NOT NULL,tab_url VARCHAR NOT NULL,tab_referrer_url VARCHAR NOT NULL,http_method VARCHAR NOT NULL,by_ext_id VARCHAR NOT NULL,by_ext_name VARCHAR NOT NULL,by_web_app_id VARCHAR NOT NULL,etag VARCHAR NOT NULL,last_modified VARCHAR NOT NULL,mime_type VARCHAR(255) NOT NULL,original_mime_type VARCHAR(255) NOT NULL);
downloads_url_chains (id INTEGER NOT NULL,chain_index INTEGER NOT NULL,url LONGVARCHAR NOT NULL, PRIMARY KEY (id, chain_index) );
...
-- URLs
urls(id INTEGER PRIMARY KEY AUTOINCREMENT,url LONGVARCHAR,title LONGVARCHAR,visit_count INTEGER DEFAULT 0 NOT NULL,typed_count INTEGER DEFAULT 0 NOT NULL,last_visit_time INTEGER NOT NULL,hidden INTEGER DEFAULT 0 NOT NULL);
...

Fetching Searched Terms with URLs

We want to fetch the following data:

  • The searched keywords of a Chrome user
  • The title of the site where each keyword was searched
  • The URL corresponding to the keyword
  • The time of the search
  • How many times the user searched for the keyword, and
  • The results sorted from the highest to the lowest search count

If we look carefully at the output of the .schema command, we will find that the keyword_search_terms and urls tables are related through the keyword_search_terms.url_id and urls.id columns.

Let’s build an SQL query and run it in the interactive sqlite3 shell

SELECT
keyword_search_terms.term,
urls.title, urls.url, urls.visit_count,
datetime(urls.last_visit_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime')
FROM
keyword_search_terms
INNER JOIN
urls
ON
keyword_search_terms.url_id = urls.id
ORDER BY
urls.visit_count
DESC;

We take only the search term from the keyword_search_terms table and fetch the detailed data from the urls table. Let’s break down the SQL query using our LLM assistant.

Tables and Columns Explained by My LLM Assistant

  • keyword_search_terms: This table likely stores data related to search terms entered in the browser.
  • term: This column likely contains the actual search keywords used.
  • urls: This table likely stores information about visited URLs.
  • title: This column likely contains the title of the webpage.
  • url: This column stores the actual URL address of the webpage.
  • visit_count: This column likely keeps track of how many times a particular URL was visited.
  • last_visit_time: This column stores a timestamp representing the last time the URL was visited.

Join Operation

  • INNER JOIN keyword_search_terms ON keyword_search_terms.url_id = urls.id:

This clause joins the two tables based on a common field url_id. This means it retrieves entries where a url_id in keyword_search_terms matches an id in urls. This establishes a connection between search terms and the URLs they were used on.

Date Conversion

  • datetime(urls.last_visit_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime'): This part converts the last_visit_time stored in the database into a human-readable format. Let's break it down further:
  • urls.last_visit_time / 1000000: Chrome stores last_visit_time as microseconds since January 1, 1601, 00:00:00 UTC (the WebKit epoch). Dividing by 1,000,000 converts it to seconds since that date.
  • (strftime('%s', '1601-01-01')): This returns the Unix timestamp of January 1, 1601, which is a large negative number. Adding it shifts the value from the WebKit epoch onto the Unix epoch (January 1, 1970, 00:00:00 UTC).
  • datetime(...): This function converts the calculated seconds since the epoch into a datetime format.
  • 'unixepoch': This specifies that the input time is based on the Unix epoch (seconds since January 1, 1970, 00:00:00 UTC).
  • 'localtime': This specifies that the output datetime should be converted to the user's local time zone.
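
If you want to sanity-check this conversion outside SQL, here is a minimal Python sketch of the same idea (the sample timestamp below is hypothetical):

from datetime import datetime, timedelta, timezone

# Chrome stores timestamps as microseconds since 1601-01-01 00:00:00 UTC (the WebKit epoch)
WEBKIT_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)

def chrome_time_to_datetime(chrome_microseconds: int) -> datetime:
    """Convert a raw last_visit_time value into an aware UTC datetime."""
    return WEBKIT_EPOCH + timedelta(microseconds=chrome_microseconds)

# Hypothetical last_visit_time value pulled from the urls table
print(chrome_time_to_datetime(13356736188000000))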

This is a truncated view of the output of the above SQL query

2.debian.pool.ntp.org|2.debian.pool.ntp.org at DuckDuckGo|https://duckduckgo.com/?q=2.debian.pool.ntp.org|3|2024-03-29 14:34:32
...
Exhuma|Exhuma - Google Search|https://www.google.com/search?q=Exhuma|3|2024-03-26 23:29:48
...

Explanation of the output

Exhuma|Exhuma - Google Search|https://www.google.com/search?q=Exhuma|3|2024-03-26 23:29:48

The above output shows that the Chrome user searched for the keyword Exhuma on the Google search engine three times. The last search was on March 26, 2024 at 23:29:48 local time.

Equipped with the aforementioned information, we will write a query for downloaded files

SELECT
downloads_url_chains.url,
datetime(downloads.end_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime')
FROM
downloads_url_chains
INNER JOIN
downloads
ON
downloads_url_chains.id = downloads.id;

And another for Browsed URLs

SELECT
title, url,
datetime(last_visit_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime'),
visit_count
FROM
urls
ORDER BY
visit_count
DESC;

The above SQL queries are similar to the first query.

Tips for Geeks and Beginners

If you want to play a bit more with your LLM assistant, please feed it with the prompts of your choice.

An example prompt I would have used when SQL was totally an alien language to me

Explain the following SQL Query 
- Explain each sql statement in a separate section
- Explain likely data of each table under each statement section
- How do tables under INNER JOIN and ON statement work
SELECT
downloads_url_chains.url,
datetime(downloads.end_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime')
FROM
downloads_url_chains
INNER JOIN
downloads
ON
downloads_url_chains.id = downloads.id;

So far, the tedious task of manually collecting interesting data from the Chrome browser database has been completed. In the second part of this write-up, we are going to automate it.

Ahem, excuse me. Just a humble reminder: please understand clearly what NOT to automate and do some paperwork before getting your hands dirty in the code editor. Thanks. — AI Generated Photo

Section 2: Automating Data Extraction with Python

For a hobbyist hacker or a geek, repeating the same task is tedious, and memorising all these queries is pointless, because the locations, DB names and queries will change over time as new versions of the Chrome browser appear. Let us automate this task with a maintainable codebase that can be updated along with the upcoming changes. Here comes the power of the Python programming language: automating tasks so we can stay focused on the task at hand.

This is also one of the prime reasons behind the popularity of Python among all sorts of testers and automation engineers.

As a hands-on example, we may have a look at the source code of the Mobile Verification Toolkit (MVT).

This forensic tool is popular worldwide among high-risk journalists, activists and forensic investigators for finding traces of compromise on an Android or iOS device. If we go through its documentation and source code, we will find that the scripts automate the task of fetching various interesting data from Android and iOS artifacts and match those data against Structured Threat Information eXpression (STIX) indicator databases to find any Indicator of Compromise (IOC).

Have a look at this code snippet from its indicators.py module to understand how the toolkit finds IOCs

...
# If all parsing worked, we start walking through available domain
# indicators.
for idx, ioc in domain_matcher.iter(final_url.domain.lower()):
    # First we check the full domain.
    if final_url.domain.lower() == ioc["value"]:
        if orig_url.is_shortened and orig_url.url != final_url.url:
            self.log.warning(
                "Found a known suspicious domain %s "
                'shortened as %s matching indicator "%s" from "%s"',
                final_url.url,
                orig_url.url,
                ioc["value"],
                ioc["name"],
            )
...

Let’s come back to our task. We will feed our LLM assistant Python’s sqlite3 documentation and filter out the methods we need to automate our task.

Example prompt

I have access to a database using sqlite3 and I want to automate the following tasks using Python's sqlite3 
- Connecting to the database
- Run some sql queries
- Return query results from some functions
- Close the connection
Read the total Python sqlite3 documentation from here: <https://docs.python.org/3.9/library/sqlite3.html> and suggest the methods needed to accomplish my tasks and script.
Write down only the necessary method names. You don't need to generate code.

The key points are shown in the truncated response below

... 
Connecting to the database:

connect(database_path): This method establishes a connection to a SQLite database at the specified path.

Running SQL queries:

cursor.execute(query, params=()): This method executes an SQL query on the database connection. You can optionally pass parameters using a tuple in the params argument.
Closing the connection:

connection.close(): This method closes the connection to the database.

...
Remember to replace "path/to/your/database.db" with the actual path to your database file and adjust the SQL queries according to your needs.
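
To see those methods together before we wire them into a script, here is a minimal sketch of the suggested connect/execute/close flow (the database path is a placeholder):

import sqlite3

# Minimal sketch of the suggested flow; "path/to/your/database.db" is a placeholder
connection = sqlite3.connect("path/to/your/database.db")
cursor = connection.execute("SELECT name FROM sqlite_master WHERE type = 'table';")
print(cursor.fetchall())
connection.close()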

So far we have found the basic methods needed to write our script.

Notice the answer for closing the DB connection with the close() method. Instead of closing the connection by hand, we will use a Python context manager (a with block) so we never have to remember to call close() ourselves. One caveat worth knowing: sqlite3’s connection context manager only commits or rolls back the surrounding transaction, it does not close the connection itself, which is fine for a short-lived script like ours. Either way, notice that we are not taking our LLM assistant’s suggestion literally here.
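
If you do want the connection closed deterministically, as the assistant suggested, a minimal sketch (with a placeholder path) is to wrap the connection in contextlib.closing:

import sqlite3
from contextlib import closing

DB_PATH = "path/to/History"  # placeholder; the script below builds the real path

# closing() guarantees connection.close() is called on exit, while sqlite3's own
# context manager only commits or rolls back the surrounding transaction.
with closing(sqlite3.connect(DB_PATH)) as connection:
    cursor = connection.execute("SELECT count(*) FROM urls;")
    print(cursor.fetchone()[0])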

Let’s Get Our Hands Dirty

First, structure the codebase:

  • Put SQL queries into a query.py file
  • Create a chrome_history.py file where we will write and execute our functions

The query.py file looks like the following

"""
DB Schema

search term keywords
keyword_search_terms (keyword_id INTEGER NOT NULL,url_id INTEGER NOT NULL,term LONGVARCHAR NOT NULL,normalized_term LONGVARCHAR NOT NULL);

Downloads
downloads (id INTEGER PRIMARY KEY,guid VARCHAR NOT NULL,current_path LONGVARCHAR NOT NULL,target_path LONGVARCHAR NOT NULL,start_time INTEGER NOT NULL,received_bytes INTEGER NOT NULL,total_bytes INTEGER NOT NULL,state INTEGER NOT NULL,danger_type INTEGER NOT NULL,interrupt_reason INTEGER NOT NULL,hash BLOB NOT NULL,end_time INTEGER NOT NULL,opened INTEGER NOT NULL,last_access_time INTEGER NOT NULL,transient INTEGER NOT NULL,referrer VARCHAR NOT NULL,site_url VARCHAR NOT NULL,embedder_download_data VARCHAR NOT NULL,tab_url VARCHAR NOT NULL,tab_referrer_url VARCHAR NOT NULL,http_method VARCHAR NOT NULL,by_ext_id VARCHAR NOT NULL,by_ext_name VARCHAR NOT NULL,by_web_app_id VARCHAR NOT NULL,etag VARCHAR NOT NULL,last_modified VARCHAR NOT NULL,mime_type VARCHAR(255) NOT NULL,original_mime_type VARCHAR(255) NOT NULL);
downloads_url_chains (id INTEGER NOT NULL,chain_index INTEGER NOT NULL,url LONGVARCHAR NOT NULL, PRIMARY KEY (id, chain_index) );

URLs
urls(id INTEGER PRIMARY KEY AUTOINCREMENT,url LONGVARCHAR,title LONGVARCHAR,visit_count INTEGER DEFAULT 0 NOT NULL,typed_count INTEGER DEFAULT 0 NOT NULL,last_visit_time INTEGER NOT NULL,hidden INTEGER DEFAULT 0 NOT NULL);
"""

SEARCH_TERMS_QUERY = """
SELECT
keyword_search_terms.term,
urls.title,
urls.url,
urls.visit_count,
datetime(urls.last_visit_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime')
FROM
keyword_search_terms
INNER JOIN
urls
ON
keyword_search_terms.url_id = urls.id
ORDER BY
urls.visit_count
DESC;
"""

DOWNLOAD_URL_QUERY = """
SELECT
downloads_url_chains.url,
datetime(downloads.end_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime')
FROM
downloads_url_chains
INNER JOIN
downloads
ON
downloads_url_chains.id = downloads.id;
"""

BROWSED_URLS_QUERY = """
SELECT
title,
url,
datetime(last_visit_time / 1000000 + (strftime('%s', '1601-01-01')), 'unixepoch', 'localtime'),
visit_count
FROM
urls
ORDER BY
visit_count
DESC;
"""

Note: We put the DB schema in a docstring so we can write queries easily without connecting to the DB unnecessarily.

Now let’s come to the chrome_history.py file.

First, we will import the necessary modules and ensure that the script runs on both Linux and Windows systems:

#!/usr/bin/env python3

# Portion 1: Import necessary modules
import os
import query
import sqlite3

# Portion 2: Ensure portability
DB_PATH = ''
if os.name == 'posix':
    DB_PATH = os.path.expanduser("~/.config/google-chrome/Default/History")  # Note 1
elif os.name == 'nt':
    local_app_data = os.getenv('LOCALAPPDATA')
    folder_path = local_app_data + r'\Google\Chrome\User Data\Default'
    DB_PATH = os.path.join(folder_path, 'History')  # Note 2
else:
    print('Unknown OS')
    exit(0)

Note 1:

os.path.expanduser("~/.config/google-chrome/Default/History") translates into /home/{USERNAME}/.config/google-chrome/Default/History. So we do not need to hard-code the username in the DB path.

Note 2:

The final value of the DB_PATH variable will be C:\Users\<username>\AppData\Local\Google\Chrome\User Data\Default\History. This serves the same purpose of not hard-coding the username in the Windows DB path.
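
One optional addition worth considering at this point (a sketch, not part of the original script): sqlite3.connect() does not complain if the path does not exist, it simply starts a new, empty database, so a quick existence check right after building DB_PATH can save some head-scratching.

# Optional sanity check: bail out early if the History database is missing,
# because sqlite3.connect() would otherwise create an empty database file.
if not os.path.isfile(DB_PATH):
    print(f'Chrome History database not found at {DB_PATH}')
    exit(1)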

Now we will define three self-explanatory functions, fetch_search_terms(), fetch_download_info() and fetch_browsing_info(), to automate our tasks:

def fetch_search_terms():
    print("A list of Searched Terms")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.SEARCH_TERMS_QUERY)
        for entry in cursor:
            print(f'Search Term: {entry[0]}')
            print(f'Title: {entry[1]}')
            print(f'Last Visit: {entry[4]}')
            print(f'URL: {entry[2]}')
            print(f'Search Count: {entry[3]}')
            print("-" * 100)


def fetch_download_info():
    print("Downloads")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.DOWNLOAD_URL_QUERY)
        for entry in cursor:
            print(f'Download URL: {entry[0]}')
            print(f'Time: {entry[1]}')
            print('-' * 100)


def fetch_browsing_info():
    print("Browsing Data")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.BROWSED_URLS_QUERY)
        for entry in cursor:
            print(f'Title: {entry[0]}')
            print(f'URL: {entry[1]}')
            print(f'Last Visit Time: {entry[2]}')
            print(f'Visit Count: {entry[3]}')
            print('-' * 100)

Notice that we could have written a single function to fetch data from the connected DB, but we wrote three different functions that look pretty similar to each other.

There are three reasons for this ‘repetition’

  • Each SQL query returns a different number of columns. A single function returning differently shaped data would create ambiguity and a hard-to-maintain codebase in the future
  • If any of the queries changes in the future, we only need to maintain the respective function and query
  • The second aphorism of the Zen of Python: explicit is better than implicit

Let’s do some test runs and see if you get the expected results on your box; one quick way to do that is sketched below.
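
For a quick test, you could temporarily append something like the following to the bottom of chrome_history.py (a throwaway sketch; the command-line interface added in the next section replaces it):

if __name__ == '__main__':
    fetch_search_terms()
    fetch_download_info()
    fetch_browsing_info()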

Adding Command Line Arguments

Finally, we will add command line options to have control over the output data. After adding this functionality, the script has the following final look

#!/usr/bin/env python3

import os
import query
import sqlite3
import argparse

# Portability
DB_PATH = ''
if os.name == 'posix':
    DB_PATH = os.path.expanduser("~/.config/google-chrome/Default/History")  # Get the full path with the username
elif os.name == 'nt':
    local_app_data = os.getenv('LOCALAPPDATA')
    folder_path = local_app_data + r'\Google\Chrome\User Data\Default'
    DB_PATH = os.path.join(folder_path, 'History')
    print(DB_PATH)
else:
    print('Unknown OS')
    exit(0)

parser = argparse.ArgumentParser()
parser.add_argument('-d', '--data', type=str,
                    required=False, help='-d or --data search, download, keyword, all')
arguments = parser.parse_args()


def fetch_search_terms():
    print("A list of Searched Terms")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.SEARCH_TERMS_QUERY)
        for entry in cursor:
            print(f'Search Term: {entry[0]}')
            print(f'Title: {entry[1]}')
            print(f'Last Visit: {entry[4]}')
            print(f'URL: {entry[2]}')
            print(f'Search Count: {entry[3]}')
            print("-" * 100)


def fetch_download_info():
    print("Downloads")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.DOWNLOAD_URL_QUERY)
        for entry in cursor:
            print(f'Download URL: {entry[0]}')
            print(f'Time: {entry[1]}')
            print('-' * 100)


def fetch_browsing_info():
    print("Browsing Data")
    with sqlite3.connect(DB_PATH) as connection:
        cursor = connection.execute(query.BROWSED_URLS_QUERY)
        for entry in cursor:
            print(f'Title: {entry[0]}')
            print(f'URL: {entry[1]}')
            print(f'Last Visit Time: {entry[2]}')
            print(f'Visit Count: {entry[3]}')
            print('-' * 100)


if arguments.data == 'search':
    fetch_browsing_info()
elif arguments.data == 'download':
    fetch_download_info()
elif arguments.data == 'keyword':
    fetch_search_terms()
elif arguments.data == 'all':
    fetch_browsing_info()
    fetch_download_info()
    fetch_search_terms()
else:
    parser.print_help()
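
Some example invocations, assuming the script is saved as chrome_history.py next to query.py:

$ python3 chrome_history.py -d keyword
$ python3 chrome_history.py --data all
$ python3 chrome_history.py             # no option prints the help text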
While following instructions, do not forget to take a break and ask yourself ‘What am I doing?’ — AI Generated Image

Final Thoughts and Key Takeaways

The article is about fetching forensic data from a Google Chrome web browser. There are four key points the author tried to underscore throughout the writing

  • How to look for interesting forensic data in a browser database. This is an oversimplified version of a forensic investigation, but these basics will hone learners’ skills and show a way to find data in their applications of interest.
  • How to utilize LLMs to boost productivity and save time, and when to discard their suggestions based on the investigator’s common sense and experience.
  • How to automate a task that is well defined and well directed. Remember, automation is not everything in any sort of testing or digital investigation; rather, automated scripts are tools that help an investigator explore an artifact with more insight.
  • How to explore the seemingly magical Python tools we use every day and gain more insight into what is actually going on under the hood while we use them. This will also encourage readers to develop more interesting features for Free and Open Source testing/forensics tools written in Python.

But notice that we did not write the bookmark script as promised in the introduction!!

Bonus

The above script is a rudimentary one, tailored for this tutorial. For a more feature-rich, working browser forensics tool, please use dumpdino.

You can get the source code in my Github repository https://github.com/ph03n1x13/dumpdino

The tool is tested both on *nix and nt systems and fetches the following data

  • All the URLs the browser has interacted with
  • Download URLs
  • Search terms
  • Login info
  • Most browsed sites
  • Bookmarks

Enjoy! Happy Hacking !!


Ishtiaque Foysol

An avid learner, a parent, a self taught hardcore tester who breaks things to fix them.