Building a “Bakehouse”: Baking Better Bread using the Databricks Lakehouse Platform

Justin Kolpak
10 min read · Jul 27, 2023


Disclaimer: Opinions expressed are solely my own and do not express the views or opinions of my employer.

Introduction

As an avid hobbyist sourdough bread baker, I’m always striving to improve my process and bake better bread. With meticulous detail, I maintain a “bread log” document in Google Docs, where I record every step of the process. The ultimate goal of the document is to capture all of the inputs (e.g. quantities of flour, water, salt, etc.) and process steps (e.g. duration of bulk fermentation, number of stretch and folds, dough temperatures, etc.), and associate those with the results of the bake (e.g. open or dense crumb, has an ear or doesn’t have an ear, etc.), so I can begin to isolate the elements that make the perfect loaf. Historically, I’ve tried to isolate those elements by manually scrolling through the bread log document and mentally deducing which factors are more likely to contribute to a successful bake, but then it occurred to me: What if I could take a data-driven approach to bake the perfect loaf? Could I systematically process the data from the bread log document and transform it into something useful that can help guide my bread baking process? The answer is, of course, yes, and in this blog, I’ll detail how I use the Databricks Lakehouse Platform to help me bake better bread.

High-Level Design

At a high level, the solution uses Databricks to:

  • Ingest raw text data from Google Docs
  • Store the data in Unity Catalog Volumes and Tables
  • Apply NLP techniques via OpenAI GPT-3.5 and the ai_generate_text function to impose structure on the text data
  • Transform the data with SQL queries
  • Orchestrate the entire process with Databricks Workflows

In this blog post, we’ll walk through each of these components in more detail.

Data Ingestion

As mentioned above, I keep the source bread log stored as a Google Doc, so the first order of business is to extract the Google Doc and persist it in object storage for downstream processing and analysis.

Screenshot of my Bread Log Google Doc

My personal programming language of choice is Python, so I used the Google API Python Client to write a simple script that reads from the bread log and writes it out as a .txt file. I chose to store the .txt files in a Databricks Unity Catalog Volume, which provides all of the benefits of Unity Catalog (fine-grained access controls, visibility into upstream and downstream data lineage, and much more) while working with unstructured/non-tabular data. For this project, I’m working with both non-tabular data (like these .txt files) and tabular data (more on that in the next section), and through Unity Catalog Volumes, I’m able to govern all of my data with a single experience in Databricks.
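
For context, here’s a minimal sketch of what that extraction script can look like, assuming a service account with read access to the document; the document ID, credentials file, and Volume path are placeholders rather than my actual configuration:

# Hypothetical sketch only: DOC_ID, the credentials file, and the Volume path
# are placeholders, not the actual values used in this project.
from datetime import datetime

from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/documents.readonly"]
DOC_ID = "<google-doc-id>"

creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES
)
docs_service = build("docs", "v1", credentials=creds)

# Pull the document body and flatten every paragraph into plain text
doc = docs_service.documents().get(documentId=DOC_ID).execute()
text_chunks = []
for element in doc.get("body", {}).get("content", []):
    paragraph = element.get("paragraph")
    if not paragraph:
        continue
    text_chunks.append(
        "".join(
            run.get("textRun", {}).get("content", "")
            for run in paragraph.get("elements", [])
        )
    )

# Write the export to the Unity Catalog Volume with a timestamped file name,
# which matches the pattern the Auto Loader code below parses into _input_file_timestamp
file_name = f"bread_log_{datetime.now().strftime('%Y_%m_%d_%H_%M')}.txt"
with open(f"/Volumes/<path-to-raw-data>/{file_name}", "w") as f:
    f.write("".join(text_chunks))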

Once the data lands in the Volume as a .txt file, Auto Loader incrementally processes any new files and writes them to a Delta Lake table. The code below implements Auto Loader ingestion with just a few lines of Python.

from pyspark.sql.functions import col, current_timestamp, regexp_extract, to_timestamp

def auto_loader_bronze():
    # `spark` is provided by the Databricks notebook/job context
    checkpoint_path = "/Volumes/<path-to-checkpoints>"
    catalog_name = "justin_kolpak"
    schema_name = "bread_analysis"

    (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "text")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        # Bread log entries are separated by blank lines, so treat "\n\n" as the record separator
        .option("lineSep", "\n\n")
        .load("/Volumes/<path-to-raw-data>/")
        # Capture file metadata and derive a timestamp from the export file name
        .withColumn("_bronze_metadata", col("_metadata"))
        .withColumn("_etl_timestamp", current_timestamp())
        .withColumn(
            "_input_file_timestamp",
            to_timestamp(
                regexp_extract(col("_metadata.file_name"), r"(\d{4}_\d{2}_\d{2}_\d{2}_\d{2})", 1),
                "yyyy_MM_dd_HH_mm",
            ),
        )
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .trigger(availableNow=True)
        .toTable(f"{catalog_name}.{schema_name}.bronze_bread_log")
    )

Just like that, all of the raw bread log data is available in a Delta Lake table, ready for additional processing and analysis.

Data Transformation

Step 1: Natural Language Processing with an LLM using Databricks SQL ai_generate_text

Alright — now I have access to the data in a Delta Table, but it doesn’t provide much analytical value in its current form. Here’s an example of what an entry in the bread log looks like:

7/10/23 country sourdough
Yield: 1 loaf
8am: feed starter, no discard and fed last night
12:15pm: mix dough: 450g bread flour, 50g high extraction wheat flour, 350g water 80F, 100g starter
Dough temperature: 78.5F at 12:15pm
12:45pm: add 10g salt, 25g water, mix by hand for 3 minutes
Bulk started at 12:45pm
Turns: 1:20pm, 2pm, 3:30pm
Preshape at 3:45pm
Bulk ended at 3:45pm
Shape at 4:30pm, straight to fridge
Baked on 7/11 at 425F on baking stone
Results: excellent rise, opened up very nicely, beautiful ear - very proud of this one! Crumb was pretty decent - wasn’t super open but looked well proved

Through NLP techniques, like Named Entity Recognition, we can turn this unstructured text into meaningful, structured data that can be useful for downstream processing and analysis. Large Language Models (LLMs) are excellent at these types of NLP tasks, and through the Databricks SQL ai_generate_text function and some thoughtful prompt engineering, I’m able to harness the power of LLMs without leaving the comfort of my SQL Editor. In this case, I’m using ai_generate_text to call OpenAI GPT-3.5 to translate the raw text into JSON format.

Screenshot of part of this query in the Databricks SQL Editor

/* SQL query that handles prompt engineering and making requests to GPT-3.5 */
SELECT b.value
, AI_GENERATE_TEXT(
CONCAT("""
Read the input data and translate the raw, unstructured data into a Python dictionary with the following information:
1) recipe name, which should include the date as well (e.g. 6/3/23 country sourdough with olives and herbs)
2) yield (if not included, set equal to '1 loaf')
3) date (will be included in the first line of the recipe, along with the recipe name), expressed in MM/dd/yyyy format
4) ingredient name, quantity, unit of measurement, temperature of the ingredient (if provided) and the classification of each ingredient, which can be: flour, water, salt, leavening agent (sourdough starter, yeast, etc.), or extra ingredients.
5) number of turns and the time of each turn
6) dough temperatures
7) The duration of the bulk fermentation process (starts when salt is added and is finished at the preshape step)
8) Sentiment analysis of the results of the loaf, including a rating on a scale of 1-10. It is necessary to include a rating - if there isn't enough information available to provide a rating, return 'Not Enough Info'
9) Areas for improvement for the next bake
10) Other comments

If any values are unknown, return a NULL value.

The resulting dictionary should conform precisely to the JSON schema below:
"""
, sts.recipe_json_raw
, """
Input Data:
"""
, b.value),
'openai/gpt-3.5-turbo',
'apiKey', SECRET('kolpak-scope', 'azure-openai-key'),
"temperature", CAST(0.0 as DOUBLE)
) AS recipe_json_raw
, b.`_bronze_metadata`
, current_timestamp() AS _etl_timestamp
FROM bronze_bread_log b
LEFT JOIN silver_bread_log s on md5(b.value) = md5(s.raw_text) -- only want new or changed records
LEFT JOIN sample_target_schema sts ON 1=1 -- only one record in this table, which serves as sample target schema; sample schema is very large, so for code readability, keeping definition stored in a separate table
WHERE s.raw_text IS NULL -- only want new or changed records

The result looks something like this:

{
  "recipe_name": "7/10/23 country sourdough",
  "date": "7/10/23",
  "yield": "1 loaf",
  "ingredients": [
    {
      "name": "bread flour",
      "quantity": "450",
      "unit": "grams",
      "classification": "flour"
    },
    {
      "name": "high extraction wheat flour",
      "quantity": "50",
      "unit": "grams",
      "classification": "flour"
    },
    {
      "name": "water",
      "quantity": "350",
      "unit": "grams",
      "classification": "water"
    },
    {
      "name": "starter",
      "quantity": "100",
      "unit": "grams",
      "classification": "leavening agent"
    },
    {
      "name": "salt",
      "quantity": "10",
      "unit": "grams",
      "classification": "salt"
    },
    {
      "name": "water",
      "quantity": "25",
      "unit": "grams",
      "classification": "water"
    }
  ],
  "turns": [
    {
      "time": "1:20pm",
      "number": 1
    },
    {
      "time": "2pm",
      "number": 2
    },
    {
      "time": "3:30pm",
      "number": 3
    }
  ],
  "dough_temperatures": [
    {
      "time": "12:15pm",
      "temperature": "78.5F"
    }
  ],
  "bulk_fermentation_duration": {
    "start_time": "12:45pm",
    "end_time": "3:45pm"
  },
  "sentiment_analysis": {
    "sentiment_rating": 9,
    "comment": "Results: excellent rise, opened up very nicely, beautiful ear - very proud of this one! Crumb was pretty decent - wasn’t super open but looked well proved"
  },
  "areas_for_improvement": null,
  "comments": "Shaping was very sticky, wasn’t great final shape, placed in boule banneton and right in fridge, Baked on 7/11 at 425F on baking stone"
}

This looks much more useful for downstream processing!

Step 2: Building the Medallion Architecture

With the raw bread log now structured as JSON data, thanks to the ai_generate_text function and GPT-3.5, we’re well on our way to unlocking valuable insights and trends in the data. However, to fully harness the potential of this data and make it consumable for downstream reporting & analytics, we need to further structure the data into a robust data model.

While Python is my language of choice for scripting, I generally prefer to write SQL for data modeling & transformation tasks. Using Databricks SQL Serverless as the underlying compute engine, I wrote a set of SQL queries to parse out various elements from the JSON data, clean up column names and data types, apply business logic (e.g. how do you calculate a baker’s percentage for an ingredient?), and ultimately write the data out to a medallion architecture (bronze, silver, and gold sets of tables). With the data structured in this medallion architecture-inspired data model, it’s finally in a format that is easy to understand and ready for analysis.
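
To give a flavor of what these queries look like, here’s a simplified, illustrative sketch (not my exact production query) that explodes the ingredients array out of the LLM-generated JSON and computes baker’s percentages; the table and column names are assumptions, and I’m wrapping the SQL in spark.sql() so it can run from a notebook as well as the SQL Editor:

# Illustrative sketch: parse the ingredients array out of the JSON string
# produced by ai_generate_text and compute each ingredient classification's
# baker's percentage per loaf. Table/column names are assumptions.
bakers_pct = spark.sql("""
    WITH ingredients AS (
        SELECT
            recipe_json_raw:recipe_name::STRING AS recipe_name,
            explode(
                from_json(
                    recipe_json_raw:ingredients,
                    'ARRAY<STRUCT<name STRING, quantity STRING, unit STRING, classification STRING>>'
                )
            ) AS ingredient
        FROM silver_bread_log
    )
    SELECT
        recipe_name,
        ingredient.classification AS classification,
        SUM(CAST(ingredient.quantity AS DOUBLE)) AS total_grams,
        -- Baker's percentage: ingredient weight relative to total flour weight
        ROUND(
            100 * SUM(CAST(ingredient.quantity AS DOUBLE))
                / SUM(SUM(CASE WHEN ingredient.classification = 'flour'
                               THEN CAST(ingredient.quantity AS DOUBLE) END))
                  OVER (PARTITION BY recipe_name),
            1
        ) AS bakers_pct
    FROM ingredients
    GROUP BY recipe_name, ingredient.classification
""")

bakers_pct.show(truncate=False)

The same pattern (extract fields with the : operator, parse arrays with from_json, then aggregate) covers the other elements of the JSON, like turns and dough temperatures.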

Reporting & Analytics

DBSQL Dashboard

Using Databricks SQL, I built a quick dashboard that exposes key information about the inputs, process, and results of each loaf. Ultimately, this information will help me identify trends in my process and drive towards more consistent, high-quality loaves. Specifically, as a first iteration for this project, I included information that helps me understand the following (a sketch of one of the underlying queries follows the list):

  • How many loaves I’m baking over time
  • The trends in quality (based on sentiment analysis of my notes in the log) of those loaves over time
  • Duration of bulk fermentation process
  • Baker’s percentages (hydration %, salt %, etc.)
  • Mix of flours and other raw ingredients
  • Cost of raw ingredients
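
As a concrete example, here’s roughly the shape of the query that could sit behind the loaf count and quality trend tiles; the gold table and column names below (gold_bread_log, bake_date, sentiment_rating, bulk_fermentation_minutes) are hypothetical stand-ins for my actual gold model:

# Hypothetical sketch of the query behind the "loaves over time" and
# "quality trend" tiles; names are illustrative, not the exact gold schema.
weekly_quality = spark.sql("""
    SELECT
        date_trunc('week', bake_date) AS bake_week,
        COUNT(*) AS loaves_baked,
        ROUND(AVG(sentiment_rating), 1) AS avg_quality_rating,
        ROUND(AVG(bulk_fermentation_minutes) / 60.0, 1) AS avg_bulk_hours
    FROM gold_bread_log
    GROUP BY date_trunc('week', bake_date)
    ORDER BY bake_week
""")

weekly_quality.show()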

This is just the beginning, and as I bake more loaves and use this dashboard more, I plan to continually refine both the raw data I capture on the bread log and the visuals on the dashboard.

Orchestration

There are various components of the system (.py files for ingestion, SQL queries for transformation, DBSQL dashboard), and to ensure all of these components are refreshed on a regular basis, we’ll need a tool for orchestration. Luckily, I don’t even need to leave Databricks for such a solution. Databricks Workflows provides a fully-managed orchestration service to build reliable, maintainable, and observable data pipelines.

Databricks Workflows provides a rich set of functionality to orchestrate these pipelines. For my bread project, I found these capabilities to be absolutely critical:

  • Orchestrate multiple types of tasks: As mentioned, the ingestion process is written as .py files, the transformation tasks are SQL queries, and the dashboard is in Databricks SQL. To be able to stitch all of these together with a common orchestration framework was critical.
  • Provide observability and alerting on job successes and failures: All job runs are tracked through the Job Runs UI, so I have full visibility into job- and task-level successes and failures, execution durations, and other important metrics. I can also choose to be notified (via email, Slack, or custom Webhook to send notifications elsewhere) in the event of a failure. These types of observability capabilities are of paramount importance for any production pipeline.
Job that orchestrates the data transformation queries and refreshes the Databricks SQL Dashboard
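
For completeness, here’s a hedged sketch of how a multi-task job like this could be defined in code with the Databricks SDK for Python; the job can just as easily be built in the Workflows UI, and the notebook path, cluster, query, dashboard, warehouse IDs, and email below are all placeholders:

# Sketch only: one way to define a multi-task job programmatically with the
# Databricks SDK for Python. All IDs, paths, and the email are placeholders.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="bread_log_pipeline",
    tasks=[
        # Python ingestion task (Google Doc -> Volume -> bronze table)
        jobs.Task(
            task_key="ingest_bread_log",
            notebook_task=jobs.NotebookTask(notebook_path="/Repos/<user>/bakehouse/ingest_bread_log"),
            existing_cluster_id="<cluster-id>",
        ),
        # SQL transformation task (bronze -> silver -> gold)
        jobs.Task(
            task_key="transform_bread_log",
            depends_on=[jobs.TaskDependency(task_key="ingest_bread_log")],
            sql_task=jobs.SqlTask(
                query=jobs.SqlTaskQuery(query_id="<query-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
        # Refresh the DBSQL dashboard once the gold tables are up to date
        jobs.Task(
            task_key="refresh_dashboard",
            depends_on=[jobs.TaskDependency(task_key="transform_bread_log")],
            sql_task=jobs.SqlTask(
                dashboard=jobs.SqlTaskDashboard(dashboard_id="<dashboard-id>"),
                warehouse_id="<warehouse-id>",
            ),
        ),
    ],
    # Notify on any task failure
    email_notifications=jobs.JobEmailNotifications(on_failure=["<my-email>"]),
)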

Summary

In conclusion, embarking on this data-driven journey with Databricks has been a game-changer for my bread baking process. Going forward, I’ll have a deeper understanding of the inputs, processing steps, and results of each loaf, and I can use that information to bake better bread over time.

To summarize the process from a technical perspective, this project leverages the Databricks Lakehouse Platform to:

  • Ingest unstructured data from Google Docs using the Google API Python Client via a Python script
  • Govern the unstructured files using Unity Catalog Volumes
  • Incrementally process files from cloud object storage using Auto Loader
  • Extract meaningful information from .txt files and represent the data as JSON using ai_generate_text and GPT-3.5
  • Clean the data and store in a medallion architecture-inspired data model using Databricks SQL queries
  • Expose the data and key insights via a Databricks SQL Dashboard
  • Orchestrate the entire process using Databricks Workflows

Since building out this process and using the dashboard, I’ve already identified a few key trends that have guided me to bake better bread over the past couple of weeks:

  • Approx. 3 hours for bulk fermentation yields the best results, in terms of oven spring and crumb structure. The ideal duration will likely change over time, as the seasons change (bulk fermentation tends to happen faster during the summer), but it’s helpful to establish a baseline for this key part of the process.
  • A final dough temperature (after mixing the dough) of 78–80℉ typically yields the best results for me
  • If I perform fewer than 3 turns (or “stretch and folds”) on the dough, then the dough doesn’t develop enough strength, resulting in a loaf with minimal rise / oven spring

I started off baking a lot of loaves like this — flat/little oven spring, and no ear. Still very delicious, don’t get me wrong, but not the ideal loaf.

As I began focusing on specific inputs, like the duration of the bulk fermentation and the final dough temperature, I started seeing better results. Very proud of the ear on this one!

Good oven spring and a moderately open crumb. I still have some work to do to produce a more open crumb, but this is a good starting point.

Typically when I add inclusions (in this one, I added dried herbs & kalamata olives), the dough loses a lot of strength and structure, but with this one, as a result of adequate stretch-and-folds in the first half of the bulk fermentation, the loaf remained strong, resulting in a nice rise and good crumb structure.

As I bake more bread and use this system to track my progress over time, I look forward to driving more insights and baking even better bread!

If you have any questions, feedback, or would just like to chat about data and/or bread, feel free to reach out to me at justin.kolpak@databricks.com!

Special thanks to Mark Hahn and Elliott Herz for helping me write this blog!

