Use Python to Read and Download a Large CSV from a URL

Amiri McCain
7 min read · Dec 20, 2023


We will be working with the database from Open Food Facts

The Solution

To skip the scenic route and go straight to the code solution, scroll down 👇🏽 or cmd+F to “Solution Code” or “Solution Code Steps”; otherwise, enjoy the stroll 🚶🚶‍♀️🚶‍♂️

The solution uses Python to download a large CSV into a Pandas dataframe, saves that large dataframe out as “n” smaller CSV files, and finally converts each of those smaller CSV files to a Parquet file.

Programmatically Accessing a CSV URL

Accessing and downloading a CSV from a URL using Python is typically pretty straightforward. However, there are some gotchas to be aware of, especially when you are attempting to download a large file.

Why?

Why not just click the URL or copy/paste it into your browser and download the file the easy way? Yeah, that’s an option, and I did that. But why do it the easy way when I can code it out? Look, do you want to just single-click on the CSV URL with your index finger like a caveman and download it that way? Or do you want to have some fun and code this out with Python in a Jupyter Notebook? I knew you were game! ;)

Okay, seriously: it is not uncommon to need to download data from a URL endpoint like this. You may have a client that posts delta exports for you to download periodically. For example, Open Food Facts provides daily delta exports covering the previous 14 days, so you can imagine a scenario where you need to check a URL endpoint every day for new data to process in your data pipeline.
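
If you wanted to automate that kind of check, a minimal sketch might look like the snippet below. Note that the index URL and the plain-text listing format are assumptions on my part (check the Open Food Facts data page for the actual delta endpoint); the point is the shape of a daily polling step, not the exact URL.

import requests

# Hypothetical daily check for new delta exports.
# The index URL is an assumption; confirm the real endpoint on the
# Open Food Facts data page before relying on it.
DELTA_INDEX_URL = 'https://static.openfoodfacts.org/data/delta/index.txt'

response = requests.get(DELTA_INDEX_URL, timeout=30)
response.raise_for_status()

# Assume the index is a plain-text list of delta file names, one per line
delta_files = [line for line in response.text.splitlines() if line.strip()]
print(f'{len(delta_files)} delta exports currently listed')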

Open Food Facts

We are going to download Open Food Facts’ (OFF) database. Open Food Facts makes their database easily available via different data exports:

  • MongoDB dump,
  • Delta Export,
  • JSONL data export,
  • CSV data export,
  • RDF data export and
  • Image data export.

They have made it really easy to access their data in a number of different ways.

I enjoy eating (and growing) good-quality, clean food, and Open Food Facts’ efforts and large database of foods and food ingredients can help all of us make healthier food choices. They also have a food-scanning app available for Android and Apple. Once you scan a food, if it is in their database, the app provides a lot of detailed information, including a nutritional rating and an eco score, and these details are color-coded to quickly and easily assist you in making a healthy food choice.

And besides, this diverse, large 10GB (as of December 2023) dataset makes for some interesting and fun data experiments. There are currently 3,027,506 food products in their database! The Open Food Facts team seems like a great bunch and for years many have volunteered their time and effort to make this possible. It is easy to sign up and start personally contributing to this ever-growing database.

Photo by Edgar Castrejon on Unsplash

Procedure at a High Level

  1. Download large CSV (tab delimited) file from URL into a Pandas dataframe
  2. Replace all empty fields (NaN, null) in the dataframe with the string “null”
  3. Check for empty rows and columns
  4. Convert large (~10GB) dataframe to “n” number of smaller CSV files
  5. Create a table with column names: file, rows, columns, and size (MB) for CSV file diagnostics
  6. Make sure the row count from the dataframe matches the sum total row count from all of the chunked CSV files
  7. Create “n” number of Parquet files from chunked CSV files and create a table for Parquet file diagnostics

Solution Code 🧭

  • For some simple data validation, we compare the number of rows and columns of the large CSV file dataframe to the sum total number of rows and columns of all the smaller CSV files, and we do the same comparison with the Parquet files.
  • In the dataframe, I replaced all NaN (null) fields with the string “null.” I did this because, although Snowflake has the ability to identify NaN/nulls (e.g. EMPTY_FIELD_AS_NULL = TRUE in the FILE FORMAT object), it did not seem to be working as I expected. Replacing the NaN fields with the string “null” fixed the issue (e.g. NULL_IF = ('NULL', 'null') in the FILE FORMAT object in Snowflake). Since Snowflake is outside the scope of this article, I plan on discussing this a little more in a future article.

Solution Code Steps

  1. Download large CSV (tab delimited) file from URL into a Pandas dataframe:
import pandas as pd
import numpy as np
import os

# Download CSV with read_csv
df = pd.read_csv('https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv',
                 sep='\t', low_memory=False)
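
As a side note: if holding the entire ~10GB dataframe in memory is a problem on your machine, pd.read_csv can also stream the file in chunks via its chunksize parameter. Here is a rough sketch of that alternative (not the approach used in the rest of this article; the 350,000-row chunk size is an arbitrary choice):

# Alternative sketch: stream the download in fixed-size row chunks and write
# each chunk straight to disk, so the full dataframe never sits in memory
os.makedirs('./chunk/csv', exist_ok=True)
reader = pd.read_csv('https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv',
                     sep='\t', chunksize=350_000)
for index, chunk in enumerate(reader):
    chunk = chunk.fillna('null')  # same NaN handling as step 2 below
    chunk.to_csv(f'./chunk/csv/en.openfoodfacts.org.products_{index}.csv', index=False, sep=',')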

2. Replace all empty fields (NaN, null) in the dataframe with the string “null”:

default_value = 'null'
df.fillna(value=default_value, inplace=True)

3. Check for empty rows and columns (since step 2 already replaced every NaN with the string “null”, both counts should come back as 0 here, which simply confirms the fill worked):

#  Check for empty rows
empty_rows = df.isnull().all(axis=1).sum()
print('number of empty rows: ', empty_rows)

# Check for empty columns
empty_cols = df.isnull().all(axis=0).sum()
print('number of empty columns: ', empty_cols)

4. Convert large (~10GB) dataframe to “n” number of smaller CSV files:

create_number_of_files = 10  # number of files to create and populate
os.makedirs('./chunk/csv', exist_ok=True)  # make sure the output directory exists
for index, chunk in enumerate(np.array_split(df, create_number_of_files)):
    chunk.to_csv(f'./chunk/csv/en.openfoodfacts.org.products_{index}.csv', index=False, sep=',')

5. Create a table with column names: file, rows, columns, and size (MB) for CSV file diagnostics:

# Get name of each file and number of rows in each file, add up rows for sum total

# Initialize an empty list to hold data for each CSV file
data = []

# Loop through CSV files in the 'chunk/csv/' directory
for file in os.listdir('chunk/csv/'):
    if file.endswith('.csv'):
        file_path = os.path.join('chunk/csv', file)
        try:
            # Read the CSV file and get its shape
            df2 = pd.read_csv(file_path, sep=',', low_memory=False)
            rows, columns = df2.shape

            # Get file size
            file_size = os.path.getsize(file_path)
            file_size_mb = round(file_size/1000000, 1)

            # Append this file's stats to the list
            data.append({'file': file, 'rows': rows, 'columns': columns, 'size (MB)': file_size_mb})
        except Exception as e:
            print(f"Error processing {file}: {str(e)}")

# Create a DataFrame from the list of data
df3 = pd.DataFrame(data)

# Add up total number of rows and size
print('Aggregate number of rows: ', round(df3['rows'].sum(), 1))
print('Aggregate size of files (MB): ', round(df3['size (MB)'].sum(), 1))

# Display the resulting DataFrame
df3

6. Make sure the row count from the dataframe matches the sum total row count from all of the chunked CSV files:

# make sure the counts match from the original data and the chunked and re-combined data
print('Chunked Data Total Row Count:\t', df3['rows'].sum())
print('Original CSV File Row Count:\t', df.shape[0])
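
As an optional extension (a sketch, not part of the original checks above), you can also confirm that the column counts line up, since every chunk should have exactly the same columns as the source dataframe:

# Every chunked CSV should report the same column count as the original dataframe
print('Chunked Data Column Count(s):\t', df3['columns'].unique())
print('Original CSV Column Count:\t', df.shape[1])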

7. Create “n” number of Parquet files from chunked CSV files and create a table for Parquet file diagnostics:

# Get each csv file and create a corresponding Parquet file
# Get name of each file and number of rows in each file, add up rows for sum total

# Initialize an empty list to hold data for each Parquet file
data = []

# Make sure the Parquet output directory exists
os.makedirs('chunk/parquet', exist_ok=True)

# Loop through CSV files in the 'chunk/csv/' directory
for file in os.listdir('chunk/csv/'):
    if file.endswith('.csv'):
        file_path = os.path.join('chunk/csv', file)
        try:
            # Read the CSV file and get its shape
            df4 = pd.read_csv(file_path, sep=',', low_memory=False)

            # Build file name with .parquet extension
            file_parquet = os.path.splitext(file)[0]  # file name without the .csv extension
            file_parquet = file_parquet + '.parquet'

            # Create Parquet file (snappy compression is the default)
            df4.to_parquet('chunk/parquet/' + file_parquet)

            # Get file size
            file_size = os.path.getsize('chunk/parquet/' + file_parquet)
            file_size_mb = round(file_size/1000000, 1)

            # Get df shape
            rows, columns = df4.shape

            # Append this file's stats to the list
            data.append({'file': file_parquet, 'rows': rows, 'columns': columns, 'size (MB)': file_size_mb})
        except Exception as e:
            print(f"Error processing {file}: {str(e)}")

# Create a DataFrame from the list of data
df5 = pd.DataFrame(data)

# Add up total number of rows and size
print('Aggregate number of rows: ', round(df5['rows'].sum(), 1))
print('Aggregate size of files (MB): ', round(df5['size (MB)'].sum(), 1))

# Display the resulting DataFrame
df5
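
If you want one more sanity check beyond the diagnostics table, a quick spot check (again, just a sketch) is to read a single Parquet file back and compare its shape with the CSV chunk it was built from:

# Spot check: the first Parquet chunk should have the same shape as its CSV source
check_csv = pd.read_csv('chunk/csv/en.openfoodfacts.org.products_0.csv', sep=',', low_memory=False)
check_parquet = pd.read_parquet('chunk/parquet/en.openfoodfacts.org.products_0.parquet')
print('CSV chunk shape:\t', check_csv.shape)
print('Parquet chunk shape:\t', check_parquet.shape)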

Some Pitfalls Along the Way

  • The low_memory parameter in pd.read_csv should be set to False for large files (the default is True). With low_memory=True I was able to read the CSV in about 9 minutes, but I could not consistently get it to run without errors. With low_memory=False I get no errors, but the run takes about 27 minutes to complete.
  • When I first used pd.read_csv with the URL, I was getting an error; if memory serves, it was something like this: Mac OSX python ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:749). On my MacBook Pro, navigating to and double-clicking the “Install Certificates.command” script fixed this issue.
  • Note that even though the source file has a .csv extension, the documentation for this dataset states, “The file encoding is Unicode UTF-8. The character that separates fields is <tab> (tabulation).” So this is really a tab-delimited (TSV) file; be sure to set sep='\t' when using pd.read_csv.
  • When you save this large dataframe into smaller files, you can save them with the .csv extension and set sep=',' in chunk.to_csv. Initially I used sep='\t' as the separator here while assuming the files were comma-separated everywhere else downstream in my code, and that caused problems.
  • At first, my Pandas pd.read_csv call looked like this: df = pd.read_csv(io.StringIO(s), encoding='UTF-8', delimiter='\t', chunksize=1000). This ran without error in about 11 minutes; however, I realized that the leading zeros in the numerical codes in my “code” column had been stripped out, so I got rid of io.StringIO() (and I also dropped encoding='UTF-8'). I likely had something configured wrong and would like to revisit this in the future. One possible workaround is sketched below.
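
One workaround for those last two pitfalls (a sketch, not what the solution code above uses) is to read every column in as a string. With an explicit dtype there is no mixed-type inference for low_memory to manage, and the leading zeros in the “code” column survive because nothing is converted to a number:

# Sketch: read everything as text; no dtype inference, no stripped leading zeros.
# The trade-off is that any genuinely numeric columns must be cast later,
# and the resulting dataframe still has to fit in memory.
df_str = pd.read_csv('https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv',
                     sep='\t', dtype=str)
print(df_str['code'].head())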

What’s Next?

I will continue to build on this project. Next, in an upcoming article, I will be pushing the CSV and Parquet files that we created to Amazon S3.

Before you go!

Did you see what happens when you click and hold the clap 👏🏽 button? You could clap up to 5️⃣0️⃣ times, that is, if you enjoyed the article. Try it out! 😀

Author: Amiri McCain, Data Engineer

