[Book Notes] Designing Machine Learning Systems, Chapter 3, Data Engineering Fundamentals

Dec 12, 2023

My notes from Designing Machine Learning Systems by Chip Huyen

Table of Contents:

Data Sources
Data Formats
Data Models
Data Storage Engines and Processing
Modes of Dataflow
Batch Processing Versus Stream Processing

1. Data Sources

Source 1: User input data

User input data is easily malformed, so it requires heavier validation and processing.
Users also have little patience. In most cases, when we input data, we expect to get results back immediately. Therefore, user input data tends to require fast processing.
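
As a rough illustration of the checking that user input needs, here is a minimal sketch in Python. The form fields (`email`, `age`), the function name `parse_signup`, and the validation rules are all hypothetical, chosen only for the example; a real system would have its own schema and stricter rules.

```python
from typing import Optional


def parse_signup(raw: dict) -> tuple[Optional[dict], list[str]]:
    """Validate a hypothetical sign-up form; return (clean_record, invalid_fields)."""
    errors: list[str] = []

    # Normalize whitespace and case before checking, since users type inconsistently.
    email = str(raw.get("email", "")).strip().lower()
    if "@" not in email or email.startswith("@") or email.endswith("@"):
        errors.append("email")

    # Numeric fields arrive as free text, so the conversion itself can fail.
    age = None
    try:
        age = int(str(raw.get("age", "")).strip())
        if not 0 < age < 130:
            errors.append("age")
    except ValueError:
        errors.append("age")

    if errors:
        return None, errors
    return {"email": email, "age": age}, []


clean, problems = parse_signup({"email": " Ana@Example.com ", "age": "31"})
bad, bad_fields = parse_signup({"email": "not-an-email", "age": "thirty"})
```

Keeping the check cheap and returning the list of invalid fields in one pass fits the point above: the user expects an immediate response, including an immediate explanation of what went wrong.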

Source 2: System-generated data

Includes various types of logs and system outputs, such as model predictions.
Logs: Record system state and events (e.g., memory usage, services called). They help with debugging and system improvement. Not always actively monitored, but crucial when issues arise.
Logging Practices: Common to log extensively in ML systems due to debugging complexity. Two problems:

⚠️ High log volume can make it challenging to identify relevant information. Solution: Services like Logstash, Datadog, Logz.io use ML models…
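
The services named above are described as using ML models; the sketch below does not reproduce that. It only shows a simpler, complementary idea under stated assumptions: writing logs as structured JSON so that relevant entries can be filtered out of a large volume. The logger name `prediction_service`, the `model_version` field, and the helper names are hypothetical.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Hypothetical extra field, passed via logging's `extra` argument.
            "model_version": getattr(record, "model_version", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("prediction_service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output confined to our stream
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def errors_only(log_text: str) -> list[dict]:
    """Keep only ERROR and CRITICAL entries from JSON-lines log text."""
    records = [json.loads(line) for line in log_text.splitlines() if line]
    return [r for r in records if r["level"] in ("ERROR", "CRITICAL")]


stream = io.StringIO()
logger = build_logger(stream)
logger.info("prediction served", extra={"model_version": "v1"})
logger.error("feature store timeout", extra={"model_version": "v1"})
relevant = errors_only(stream.getvalue())
```

Because every entry carries the same fields, filtering by level, service, or model version becomes a simple query rather than a text search.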

Written by Joanna

Data Product @ TikTok | Adjunct Professor of Data Science | Python, R, ML, DL
