Exploit the power of DuckDB to explore the largest open database in the food market.
What if I told you that you could access any information about the food industry at your fingertips? For free? A database containing products from all over the world, completely open access?
This database exists and is provided by Open Food Facts, a non-profit organization with the mission to make food consumption transparent.
With over 3 million products from various countries, this database includes detailed information about ingredients composition, nutri-scores, allergens, and more. It is updated daily and provided in JSONL or CSV format.
However, since this database is large (10GB to 43GB depending on the amount of information), manipulating the data proved to be a challenging task without the right infrastructure of toolsβ¦
Quack, Quack,
Enter DuckDB, an in-process analytical tool designed to process data blazingly fast using an enhanced SQL language!
In this article, we will walk you through exploring and exploiting the Open Food Facts database using DuckDB for your usage.
Open Food Facts: The Largest Open Food Database in the World
Open Food Facts is an open-source project dedicated to providing consumers with total transparency over food products. This database is composed of over 3 million products from various countries, making it the largest open product database in the world.
The Open Food Facts database has been instrumental in the success of todayβs projects such as Yuka or Foodvisor. You, too, can access this open database directly from the Open Food Facts website in various formats, such as JSONL, CSV, or MongoDB dump.
However, even though accessing the data is easy, interacting with it requires computational resources and technical skills. To this day, the JSONL file alone takes up 43GB of memory once decompressed, making it challenging to exploit.
Luckily for us, tools to process this large amount of data have been developed over the last few years. One of these tools is DuckDB.
DuckDB: The Fastest In-Process Analytical Database
DuckDB is an open-source project developed to analyze and process large amounts of data using a feature-rich SQL language.
For years, Apache Spark has been the most used tool for handling large-scale datasets. However, its complexity and setup requirements can be seen as overkill for mid-sized data (10 to 100GB).
Similarly, SQL is one of the oldest and most widely used languages for database communication. Yet, it often needs external database infrastructure such as PostGreSQL or MySQL, which can be overly complex for analyzing simpler data formats like Parquet or CSV.
On the opposite, Pandas has long been the favorite library for Data Scientists and Data Analysts due to its intuitive syntax and comprehensive documentation. However, when working with large datasets, Pandas can be slow and consume significant amounts of memory during processing.
DuckDB strikes a perfect balance between these two approaches:
- An enhanced SQL language for efficient querying of large datasets.
- An optimized memory usage to process vast amounts of data quickly.
Letβs dive into DuckDB to explore and leverage the Open Food Facts database.
Installation
You can find the installation guide in the DuckDB official documentation. It supports several languages such as Python, Rust, or Javascript.
But in this guide, we will use the best tool to leverage the power of DuckDB: the Command Line Interface (CLI).
If youβre on Windows, the best way to get linux without formating your entire computer is to set up WSL2. (Hereβs a guide to help you set it up)
Download the latest version of DuckDB binaries using the following command (update the link with the latest versions):
sudo wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip &&\
unzip duckdb_cli-linux-amd64.zip &&\
mv duckdb /usr/local/bin
Type duckdb --version
to check if DuckDB was correctly installed:
duckdb --version
v1.0.0 1f98600c2c
You are now set up to play with the Open Food Facts database π.
Explore the Open Food Facts data
We first download the database from the Open Food Facts website. On this page, you can download several types of data format, such as the MongoDB dump, CSV, or JSONL.
For this use case, we will use the JSONL file since it possesses more information than the CSV file. But it is big: ~7GB compressed, and over 43GB once decompressed!
wget https://static.openfoodfacts.org/data/openfoodfacts-products.jsonl.gz
Once downloaded, we can start using DuckDB to explore its content.
βWait, we donβt decompress the file first?β
Under the hood, DuckDB automatically recognizes the file as compressed in a .gz
format and decompresses it for us, enabling us to save memory and time.
Letβs start the DuckDB CLI by creating a database where all the tables from our analysis will be stored.
duckdb off.db
DuckDB creates a custom database off.db
that compacts the data in a columnar storage type as the Parquet format does, which makes column queries fast. Additionally, data stored in the database takes less memory than their file format equivalent (CSV, JSON, etc ...).
Letβs take a look at the Open Food Facts data using the DESCRIBE
command:
.timer on --Activate the timer
DESCRIBE (SELECT * FROM read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True)) --ignore_errors is necessary since some elements in the data are in the wrong format
Note:
--
is used to comment SQL code
βββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ¬ββββββββββ
β column_name β column_type β null β key β default β extra β
β varchar β varchar β varchar β varchar β varchar β varchar β
βββββββββββββββββββββββββββββββββββββββββββββΌββββββββββββββΌββββββββββΌββββββββββΌββββββββββΌββββββββββ€
β _id β VARCHAR β YES β β β β
β other_nutritional_substances_prev_tags β VARCHAR[] β YES β β β β
β last_edit_dates_tags β VARCHAR[] β YES β β β β
β last_editor β VARCHAR β YES β β β β
β nova_group_debug β VARCHAR β YES β β β β
β ingredients_text β VARCHAR β YES β β β β
β traces_from_ingredients β VARCHAR β YES β β β β
β editors_tags β VARCHAR[] β YES β β β β
β informers_tags β VARCHAR[] β YES β β β β
β pnns_groups_1_tags β VARCHAR[] β YES β β β β
β additives_debug_tags β VARCHAR[] β YES β β β β
β nutriscore_grade β VARCHAR β YES β β β β
β packagings_materials β JSON β YES β β β β
β nutrition_data_per β VARCHAR β YES β β β β
β packaging_recycling_tags β VARCHAR[] β YES β β β β
β data_quality_info_tags β VARCHAR[] β YES β β β β
β traces_hierarchy β VARCHAR[] β YES β β β β
β states_hierarchy β VARCHAR[] β YES β β β β
β additives_original_tags β VARCHAR[] β YES β β β β
β minerals_prev_tags β VARCHAR[] β YES β β β β
β Β· β Β· β Β· β Β· β Β· β Β· β
β Β· β Β· β Β· β Β· β Β· β Β· β
β Β· β Β· β Β· β Β· β Β· β Β· β
β ingredients_text_en_ocr_1580151419_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1580151419 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1544855786_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1544855786 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562578542 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562578542_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1563189139_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1563189139 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1547074706_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1547074706 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562841907_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562841907 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1653660842_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1653660842 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1582747034_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1582747034 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1563072801 β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1563072801_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562124853_result β VARCHAR β YES β β β β
β ingredients_text_en_ocr_1562124853 β VARCHAR β YES β β β β
βββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ΄ββββββββββ€
β 1031 rows (40 shown) 6 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Run Time (s): real 16.220 user 8.963006 sys 5.099856
Letβs walk through this command to understand what happened:
SELECT * FROM <table>
is the typical way in SQL to output the entire table. In DuckDB, you can also do the same action by just usingFROM <table>
read_ndjson('openfoodfacts-products.jsonl.gz')
is a function provided by DuckDB to read a JSONL file. As you can see, there's no need to decompress the file manuallyβDuckDB handles it for us! Depending on the file type, you can also useread_csv()
orread_parquet()
.
In addition, we can count the number of products stored in the database:
SELECT count(*) from read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True);
βββββββββββββββ
β count(code) β
β int64 β
βββββββββββββββ€
β 3185774 β
βββββββββββββββ
The database is massive, with over 3 millions products and detailed information for each one of them.
However, not every feature is relevant for our analysis; many are actually legacy. We only need a few features for our analysis.
Since DuckDB is optimized for columnar queries, this will make the process really fast.
Letβs examine a single product to get an idea of what the data looks like.
You can custom the output visual using
.mode <option>
and.maxrows <n>
. This is particularly useful if the data doesn't fit in tables. Use the.help
mode to see the available options.
.mode lines --Configure outputs to be printed as lines instead of box
FROM read_ndjson('openfoodfacts-products.jsonl.gz') OFFSET 10000 LIMIT 1; --Drop the first 10000 lines and output only one
_id = 0011233077226
other_nutritional_substances_prev_tags =
last_edit_dates_tags = [2023-01-04, 2023-01, 2023]
last_editor = kiliweb
nova_group_debug =
ingredients_text = Wheat flour*, corn syrup, vegetable oil (palm and/or soybean), sugar, water, pecans*, whole eggs*, contains less than 2% of the following: food starch - modified, butter*, salt, dextrin, maltodextrin, natural and artificial flavor, potassium sorbate and sodium propionate (preservatives), nonfat milk*, whey (milk)*, dextrose, rice syrup, citric acid, ascorbic acid, soy lecithin*, l-cysteine enzymes.
traces_from_ingredients =
informers_tags = [usda-ndb-import, clockwerx, org-database-usda, kiliweb]
pnns_groups_1_tags = [composite-foods, known]
additives_debug_tags = [en-e322i-added, en-e300-removed]
nutriscore_grade = e
packagings_materials = {}
nutrition_data_per = 100g
packaging_recycling_tags = []
data_quality_info_tags = [en:no-packaging-data, en:ingredients-percent-analysis-ok, en:ecoscore-extended-data-not-computed, en:food-groups-1-known, en:food-groups-2-known, en:food-groups-3-unknown]
traces_hierarchy = []
states_hierarchy = [en:to-be-completed, en:nutrition-facts-completed, en:ingredients-completed, en:expiration-date-to-be-completed, en:packaging-code-to-be-completed, en:characteristics-to-be-completed, en:origins-to-be-completed, en:categories-completed, en:brands-to-be-completed, en:packaging-to-be-completed, en:quantity-to-be-completed, en:product-name-completed, en:photos-to-be-validated, en:packaging-photo-to-be-selected, en:nutrition-photo-to-be-selected, en:ingredients-photo-to-be-selected, en:front-photo-selected, en:photos-uploaded]
additives_original_tags = [en:e1400, en:e202, en:e281, en:e330, en:e322i]
minerals_prev_tags = []
ecoscore_grade = unknown
data_quality_tags = [en:no-packaging-data, en:ingredients-percent-analysis-ok, en:ecoscore-extended-data-not-computed, en:food-groups-1-known, en:food-groups-2-known, en:food-groups-3-unknown, en:serving-quantity-defined-but-quantity-undefined, en:ecoscore-origins-of-ingredients-origins-are-100-percent-unknown, en:ecoscore-packaging-packaging-data-missing, en:ecoscore-production-system-no-label]
traces =
amino_acids_prev_tags = []
...
Nutri-score, list of ingredients in different languages, additives, β¦ Each product comes with valuable information we can use for our applications.
Now that weβve introduced DuckDB and explored the structure of the Open Food Facts database, itβs time to dive into some analysis.
What are the most predominant languages?
First, since we donβt need the entire dataset for our analysis, we can segment our exploration by creating tables in the DuckDB database. This will allow us to persist our work and make any future queries faster and more efficient than querying the entire dataset every time.
Letβs start by creating a table for our analysis:
CREATE TABLE lang_product AS
SELECT lang, code, product_name FROM read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True)
WHERE code IS NOT NULL;
We can show the list of tables using .tables
. In addition to being lightweight, the table is highly optimized for SQL queries.
We can start querying the data to find out which languages are most predominant.
SELECT lang, count() AS count
FROM lang_product GROUP BY lang ORDER BY count
DESC LIMIT 20;
βββββββββββ¬ββββββββββ
β lang β count β
β varchar β int64 β
βββββββββββΌββββββββββ€
β fr β 1209122 β
β en β 1068194 β
β es β 319308 β
β it β 225794 β
β de β 220944 β
β pt β 20032 β
β nl β 19696 β
β ru β 15675 β
β pl β 11395 β
β ro β 10730 β
β bg β 6506 β
β cs β 6358 β
β sv β 5547 β
β th β 4315 β
β ar β 4260 β
β fi β 4048 β
β nb β 4026 β
β lt β 3614 β
β hu β 2994 β
β hr β 2418 β
βββββββββββ΄ββββββββββ€
β 20 rows β
βββββββββββββββββββββ
Run Time (s): real 0.040 user 0.112340 sys 0.035817
0.04 seconds to process over 3 millions products! This is how fast DuckDB can be to process the Open Food Facts database.
List all products containing specific ingredients
We could use the database in a food search engine application.
To retrieve all products containing a specific ingredient, we can first create a table containing a subset of the database relevant to our task:
CREATE TABLE products AS
SELECT code, product_name, ingredients_text, lang FROM read_ndjson('openfoodfacts-products.jsonl.gz', ignore_errors=True)
WHERE product_name IS NOT NULL AND ingredients_text IS NOT NULL;
Letβs say we want to retrieve all products containing βcheeseβ. A method would be to use the pattern matching feature from DuckDB and perform a search in the ingredients_text
column.
.maxrows 10
FROM products
WHERE ingredients_text ILIKE '%cheese%'; -- `%` matches any sequence of zero or more characters - ILIKE is case insensitive - LIKE for case sensitive search
βββββββββββββββββ¬βββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ¬ββββββββββ
β code β product_name β ingredients_text β lang β
β varchar β varchar β varchar β varchar β
βββββββββββββββββΌβββββββββββββββββββββββΌβββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββΌββββββββββ€
β 0000000018340 β Fire Roasted Hatchβ¦ β Dry roasted almonds, hatch green chile seasoning (organic cheddar cheese powder [organic cheddar cheese {culβ¦ β en β
β 0000000032117 β Tricolor Tortellini β Egg pasta (refined durum semolina wheat flour, pasteurized eggs, dehydrated tomato and spinach), filling (grβ¦ β en β
β 0000020043070 β Biscuit β Enriched wheat flour (niacin, reduced iron, thiamin mononitrate, riboflavin, folic acid, malted barley flourβ¦ β en β
β 0000020043087 β Biscuit β Enriched wheat flour (niacin, reduced iron, thiamin mononitrate, riboflavin, folic acid, malted barley flourβ¦ β en β
β 0000168175589 β Mcvitie's, digestiβ¦ β Fortified wheat flour (with calcium, iron, niacin, thiamin), vegetable oil (palm), sugar, wholemeal wheat flβ¦ β en β
β Β· β Β· β Β· β Β· β
β Β· β Β· β Β· β Β· β
β Β· β Β· β Β· β Β· β
β 29397297 β M&S Chicken Caesarβ¦ β full fat soft cheese (milk) rapeseed oil water parmigiano reggiano cheese , , (milk) chicken breast (4,5%) (β¦ β en β
β 9414323991982 β Beef Lasagne β Meat Sauce (58%) [Tomato PurΓ©e (34%), Beef (19%), Onion (2.3%), Maize Thickener (1422), Brown Sugar, Salt, Hβ¦ β en β
β 01154047 β Sweet Maui rings β yellow corn flour, white rice flour, vegetable oil (contains one or more of the following: corn oil, sunflowβ¦ β en β
β 0005050215800 β Garlic mushroom, mβ¦ β Fortified Wheat Flour (Wheat Flour, Calcium Carbonate, Iron, Niacin, Thiamin), Water, Mozzarella Cheese (Cowβ¦ β en β
β 5208046258194 β β Egg pasta 72% (durum wheat semolina, egg 19.4%), filling 28% [breadcrumbs (soft wheat flour, water, yeast, sβ¦ β en β
βββββββββββββββββ΄βββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ΄ββββββββββ€
β 38483 rows (10 shown) 4 columns β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Run Time (s): real 0.299 user 2.891531 sys 0.009861
Over 38483 matches in less than 0.3 seconds!
Letβs now export this result in a parquet format for further analysis.
COPY(
FROM products WHERE ingredients_text ILIKE '%cheese%'
) TO cheese_products.parquet (FORMAT PARQUET);
Run Time (s): real 0.422 user 3.038148 sys 0.019620
Using the COPY
feature, we can export portions of the database in no time. Many file formats are supported, such as CSV, parquet, JSON and even EXCEL.
Conclusions
DuckDB is a powerful tool that empowers users to exploit and analyze large amounts of data swiftly and efficiently, all without the need for complex infrastructure. Rapidly evolving with strong community support, DuckDB is becoming a go-to solution for data analysis tasks.
Coupled with the Open Food Facts database, you gain access to a vast catalog of products that can be leveraged for various projects.
Food transparency in the palm of your hand. π¦π
How to contribute to Open Food Facts?
Since Open Food Facts is a collaborative project, you can contribute to enhancing food transparency worldwide.
You can either:
- Contribute to the Open Food Facts GitHub: explore open issues that align with your skills,
- Download the Open Food Facts mobile app: add new products to the database or improve existing ones by simply scanning their barcodes,
- Join the Open Food Facts Slack and start discussing with other contributors in the OFF community.
We canβt wait to see you join the community!