Data-Driven Deal Evaluation using PyPI Public Dataset

3 min readJul 3, 2024

Originally published on Substack on June 20th, 2024: https://jamesin.substack.com/p/data-driven-deal-evaluation-using

Introduction

Last week, I was doing diligence on a Seed stage company we were evaluating — and I used the public Big Query PyPI table to verify the number of Python downloads listed as a key metric on the deck.

I’m somewhat embarrased to admit this because data is my thing — but before last week, I didn’t know this dataset existed.

So, to share the data love, I figured I’d write a quick post with the queries in case this type of data validation comes up in anyone else’s work or if you’re interested in tracking Python package downloads.

Context

What is PyPI? PyPI (Python Package Index) is the official third-party package repository for the Python programming language.

What is the PyPI public dataset? It is a public dataset that anyone can query to get metrics about Python package downloads.

How do you get started? This is the main resource doc: Analyzing PyPI package downloads

Analysis and Queries

Below is an easy base query to get started that returns package downloads in the last year.

Note that you’ll need to have Big Query set up.

SELECT 
  timestamp, 
  DATE(timestamp) AS date, 
  project, 
  country_code, 
  details.installer.name AS details_installer_name,
  details.python AS details_python_version,
  file.version AS file_version
FROM `bigquery-public-data.pypi.file_downloads`
WHERE DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)

If you’re interested in the most popular packages in the last 30 days.

SELECT file.project, COUNT(*)
FROM `bigquery-public-data.pypi.file_downloads`
WHERE 
  DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  AND details.installer.name = 'pip'
GROUP BY 1
ORDER BY 2 DESC

What about looking at the monthly growth for the top 100 packages in the past year?

WITH file_downloads AS (
  SELECT 
    DISTINCT
    file.project AS file_project,
    DATE(timestamp) AS date,
    COUNT(*) OVER (PARTITION BY file.project, FORMAT_TIMESTAMP('%Y-%m', DATE(timestamp))) AS download_count
  FROM `bigquery-public-data.pypi.file_downloads`
  WHERE 
    details.installer.name = 'pip'
    AND 
    TIMESTAMP(timestamp) BETWEEN TIMESTAMP('2023-05-01') AND TIMESTAMP('2024-05-31')
    AND 
    EXTRACT(MONTH FROM TIMESTAMP(timestamp)) = 5
),

top_projects AS (
  SELECT 
    file_project, 
    download_count AS downloads_2024
  FROM file_downloads
  WHERE date = '2024-05-01'
  ORDER BY download_count DESC
  LIMIT 100
),previous_year_downloads AS (
  SELECT
    file_project,
    download_count AS downloads_2023
  FROM file_downloads
  WHERE date = '2023-05-01'
)SELECT 
  A.file_project, 
  RANK() OVER(ORDER BY A.downloads_2024 DESC) AS rank,
  A.downloads_2024, 
  B.downloads_2023,
  SAFE_DIVIDE((A.downloads_2024 - B.downloads_2023), B.downloads_2023) AS percenage_growth
FROM top_projects A
LEFT JOIN previous_year_downloads B
ON A.file_project = B.file_project
ORDER BY 4 DESC

Lastly, that Pydantic growth from the table above is too juicy not to visualize. In case you’re not familiar, Pydantic is a data validation library for Python. It’s similar to dataclasses — it allows developers to define models and verify the data format.

Below is the of Pydantic monthly downloads. Look at that growth!! 🚀

SELECT DATE(timestamp) AS date, file.project AS file_project
  FROM`bigquery-public-data.pypi.file_downloads`
  WHERE 
    DATE(timestamp) > DATE_SUB(CURRENT_DATE(), INTERVAL 365 + 90 DAY)
    AND details.installer.name = 'pip'  
    AND file.project = 'pydantic-core'

Outro

Anyway, thanks for reading! Hope that this comes up in your future work and you know where to look 😊 📊

Data-Driven Deal Evaluation using PyPI Public Dataset

Introduction

Context

Analysis and Queries

Outro

Written by Jamesin Seidel