[Book Notes] Designing Machine Learning Systems, Chapter 3, Data Engineering Fundamentals
My notes from Designing Machine Learning Systems by Chip Huyen
Table of Contents:
- Data Sources
- Data Formats
- Data Models
- Data Storage Engines and Processing
- Modes of Dataflow
- Batch Processing Versus Stream Processing
1. Data Sources
Source 1: User input data
- User input data is easily malformed, so it requires more heavy-duty checking and processing.
- Users also have little patience. In most cases, when we input data, we expect to get results back immediately. Therefore, user input data tends to require fast processing.
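To make the "heavy-duty checking" point concrete, here is a minimal sketch of validating one piece of user input before it reaches the rest of the system. The `SignupInput` record, its `email` and `age` fields, and the `parse_signup` helper are hypothetical names made up for illustration, not something from the book. The checks are deliberately loose and cheap, since user input also needs a fast response.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SignupInput:
    # Hypothetical record; the field names are illustrative only.
    email: str
    age: int


def parse_signup(raw: dict) -> Tuple[Optional[SignupInput], List[str]]:
    """Return (record, errors). The record is None if any check fails."""
    errors: List[str] = []

    # Normalize before checking: users add stray spaces and mixed case.
    email = str(raw.get("email", "")).strip().lower()
    # Loose check: exactly one "@" with text on both sides.
    local, sep, domain = email.partition("@")
    if not (sep and local and domain and "@" not in domain):
        errors.append("email: expected something like name@example.com")

    age = None
    try:
        age = int(str(raw.get("age", "")).strip())
    except ValueError:
        errors.append("age: not an integer")
    else:
        if not 0 < age < 130:
            errors.append("age: out of range")

    if errors:
        return None, errors
    return SignupInput(email=email, age=age), errors
```

Collecting every error instead of stopping at the first one lets the caller report all problems to the user in a single round trip, which fits the low-patience point above.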
Source 2: System-generated data
- Includes various types of logs and system outputs, such as model predictions.
- Logs: Record system state and events (e.g., memory usage, services called). Help debugging and system improvement. Not always actively monitored but crucial when issues arise.
- Logging Practices: ML systems commonly log extensively because they are hard to debug. Two problems:
⚠️ High log volume can make it challenging to identify relevant information. Solution: Services like Logstash, Datadog, Logz.io use ML models…