Data-Driven Deal Evaluation using PyPI Public Dataset

Jamesin Seidel
3 min readJul 3, 2024

--

Originally published on Substack on June 20th, 2024: https://jamesin.substack.com/p/data-driven-deal-evaluation-using

Introduction

Last week, I was doing diligence on a Seed stage company we were evaluating — and I used the public Big Query PyPI table to verify the number of Python downloads listed as a key metric on the deck.

I’m somewhat embarrased to admit this because data is my thing — but before last week, I didn’t know this dataset existed.

So, to share the data love, I figured I’d write a quick post with the queries in case this type of data validation comes up in anyone else’s work or if you’re interested in tracking Python package downloads.

Context

What is PyPI? PyPI (Python Package Index) is the official third-party package repository for the Python programming language.

What is the PyPI public dataset? It is a public dataset that anyone can query to get metrics about Python package downloads.

How do you get started? This is the main resource doc: Analyzing PyPI package downloads

Analysis and Queries

Below is an easy base query to get started that returns package downloads in the last year.

Note that you’ll need to have Big Query set up.

SELECT 
timestamp,
DATE(timestamp) AS date,
project,
country_code,
details.installer.name AS details_installer_name,
details.python AS details_python_version,
file.version AS file_version
FROM `bigquery-public-data.pypi.file_downloads`
WHERE DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)

If you’re interested in the most popular packages in the last 30 days.

SELECT file.project, COUNT(*)
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
AND details.installer.name = 'pip'
GROUP BY 1
ORDER BY 2 DESC

What about looking at the monthly growth for the top 100 packages in the past year?

WITH file_downloads AS (
SELECT
DISTINCT
file.project AS file_project,
DATE(timestamp) AS date,
COUNT(*) OVER (PARTITION BY file.project, FORMAT_TIMESTAMP('%Y-%m', DATE(timestamp))) AS download_count
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
details.installer.name = 'pip'
AND
TIMESTAMP(timestamp) BETWEEN TIMESTAMP('2023-05-01') AND TIMESTAMP('2024-05-31')
AND
EXTRACT(MONTH FROM TIMESTAMP(timestamp)) = 5
),
top_projects AS (
SELECT
file_project,
download_count AS downloads_2024
FROM file_downloads
WHERE date = '2024-05-01'
ORDER BY download_count DESC
LIMIT 100
),
previous_year_downloads AS (
SELECT
file_project,
download_count AS downloads_2023
FROM file_downloads
WHERE date = '2023-05-01'
)
SELECT
A.file_project,
RANK() OVER(ORDER BY A.downloads_2024 DESC) AS rank,
A.downloads_2024,
B.downloads_2023,
SAFE_DIVIDE((A.downloads_2024 - B.downloads_2023), B.downloads_2023) AS percenage_growth
FROM top_projects A
LEFT JOIN previous_year_downloads B
ON A.file_project = B.file_project
ORDER BY 4 DESC

Lastly, that Pydantic growth from the table above is too juicy not to visualize. In case you’re not familiar, Pydantic is a data validation library for Python. It’s similar to dataclasses — it allows developers to define models and verify the data format.

Below is the of Pydantic monthly downloads. Look at that growth!! 🚀

SELECT DATE(timestamp) AS date, file.project AS file_project
FROM`bigquery-public-data.pypi.file_downloads`
WHERE
DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 365 + 90 DAY)
AND details.installer.name = 'pip'
AND file.project = 'pydantic-core'

Outro

Anyway, thanks for reading! Hope that this comes up in your future work and you know where to look 😊 📊

--

--