[Book Notes] Designing Machine Learning Systems, Chapter 3, Data Engineering Fundamentals

Joanna
7 min read · Dec 12, 2023

My notes from Designing Machine Learning Systems by Chip Huyen

Table of Contents:

  1. Data Sources
  2. Data Formats
  3. Data Models
  4. Data Storage Engines and Processing
  5. Modes of Dataflow
  6. Batch Processing Versus Stream Processing

1. Data Sources

Source 1: User input data

  • User input data is easily malformed, so it requires more thorough validation and processing.
  • Users also have little patience. In most cases, when we input data, we expect to get results back immediately. Therefore, user input data tends to require fast processing.
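
To make the "checking" part concrete, here is a minimal sketch of validating user input before it reaches the rest of the system. The `SignupForm` type, its fields, and the `parse_signup` helper are hypothetical names I made up for illustration; they are not from the book, and the email check is deliberately loose.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SignupForm:
    """Hypothetical, already-validated form data."""
    email: str
    birth_date: date


def parse_signup(raw: dict) -> SignupForm:
    """Normalize raw user input and raise ValueError on malformed fields."""
    email = str(raw.get("email", "")).strip().lower()
    # Loose sanity check only: one "@", text before it, a dot in the domain.
    local, sep, domain = email.partition("@")
    if not (sep and local and "." in domain and "@" not in domain):
        raise ValueError(f"malformed email: {email!r}")

    try:
        birth_date = date.fromisoformat(str(raw.get("birth_date", "")).strip())
    except ValueError as exc:
        raise ValueError("birth_date must be an ISO date such as 1990-05-17") from exc

    return SignupForm(email=email, birth_date=birth_date)
```

Checks like these are cheap, which matters here because the user is waiting for a response while they run.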

Source 2: System-generated data

  • Includes various types of logs and system outputs, such as model predictions.
  • Logs: Record system state and significant events (e.g., memory usage, services called). They help with debugging and improving the system. They are not always actively monitored, but they are crucial when issues arise.
  • Logging practices: It is common to log extensively in ML systems because they are hard to debug. Two problems:

⚠️ High log volume can make it challenging to identify relevant information. Solution: Services like Logstash, Datadog, Logz.io use ML models…
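
As a side illustration of system-generated data, here is a minimal sketch of logging a model prediction as one JSON object per line, which keeps high-volume logs easier to filter later. This is my own assumption about a reasonable setup, not something the book prescribes; `JsonFormatter`, `make_logger`, the `fields` key, and the logged values are all hypothetical.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra structured fields passed via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str) -> logging.Logger:
    """Build a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


logger = make_logger("prediction_service")  # hypothetical service name
logger.info(
    "prediction served",
    extra={"fields": {"model_version": "v3", "latency_ms": 41.7}},  # made-up values
)
```

Because every record has the same keys, a log service or a simple script can filter by `model_version` or `level` instead of searching free text.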

Joanna

Data Product @ TikTok | Adjunct Professor of Data Science | Python, R, ML, DL