Building Effective Development and Test Sets in Machine Learning Projects: An Andrew Ng-Inspired Approach Illustrated with Examples

Vaishnavi Tamilvanan
5 min read · Mar 15, 2024


Imagine you’ve developed a social media platform specifically tailored for football enthusiasts. Users can upload images or videos of football matches they’ve attended, showcasing memorable moments, incredible goals, or their favorite players in action. To enhance user engagement and content discoverability, you implement a feature that automatically identifies and tags football players present in the uploaded media.

This functionality allows users to easily search for specific players or browse through content featuring their favorite athletes. Additionally, it enables the platform to generate personalized recommendations based on users’ interests, such as suggesting similar videos or highlighting trending players. Overall, this feature enhances the user experience, fosters community interaction, and promotes the sharing of captivating football-related content.

In your mobile app, users upload pictures of all kinds of things, and you want to automatically identify the ones that contain football players. Your team assembles a large training set by downloading pictures of football players (positive examples) and non-football-player images (negative examples) from various websites, then splits the dataset 70%/30% into training and test sets. Using this data, they build a football player detector that performs well on both the training and test sets. Yet when you deploy the classifier in the mobile app, its performance is surprisingly poor. Why did this happen?

You discover that users upload lower-quality images taken with mobile phones, and even cartoons, unlike the high-quality website images used for training. Your algorithm fails to generalize across this difference in image distribution, so it performs poorly once deployed in the mobile app.

To avoid this issue, we need to ensure that we set up development and test sets correctly. Let’s examine the important factors to consider when implementing this.

  • Select development and test sets that mirror the data you expect to encounter in the future, rather than merely splitting off a fixed percentage (say, 70%/30%) of whatever data you have. This ensures your model is evaluated on relevant examples when the data it will face differs from the training data, as in the shift from website images to mobile phone uploads (see the data-splitting sketch after this list).
  • A related problem is having development and test sets drawn from different distributions: you risk tuning a model that does well on the development set but poorly on the test set. Whenever possible, draw both sets from the same distribution. For example, if the development set for the football player detector contains clear, well-lit images while the test set consists of low-light, blurry ones, choices that look good during development may not hold up at test time.
  • Choose a single-number evaluation metric to streamline your team’s decision-making. If you care about multiple objectives, either combine them into a single formula or designate satisficing and optimizing metrics. A single metric gives a clear ranking among classifiers and speeds up decisions, especially when you are choosing among many candidates (see the metric-selection sketch after this list).
  • Machine learning involves trying numerous ideas before finding success. That’s why having development and test sets, along with a metric, is crucial. By measuring your idea’s performance on the development set, you can quickly determine if it’s promising. Without these, evaluating each new classifier in the app manually would be slow. Plus, even small accuracy improvements, like 95.0% to 95.1%, might be missed without a metric. A development set and metric help you identify successful ideas efficiently for refinement and discard ineffective ones.
  • In developing the football player identifier app, using development and test sets alongside a single evaluation metric accelerates progress. For instance, by scoring different image processing techniques on a diverse dev set of football match photos, the team can quickly see which approaches work best and refine the app with each cycle.
  • For new apps, set up dev/test sets and metrics within a week. More time may be needed for mature apps.
  • For problems with ample data, the traditional 70%/30% train/test split may not be optimal. Dev and test sets can be smaller than 30% of the data. Your dev set should be sizable enough to detect significant accuracy changes in your algorithm but need not be excessively large. Similarly, your test set should be sufficiently large to provide a confident estimate of your system’s final performance.
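To make the first two points concrete, here is a minimal Python sketch of building dev and test sets from the distribution you actually care about, mobile-app uploads, while still using the scraped web images for training. The file lists and split sizes are hypothetical; substitute your own data.

```python
import random

# Hypothetical file lists; replace with your own data sources.
web_images = [f"web_{i}.jpg" for i in range(100_000)]  # scraped from websites (training only)
app_images = [f"app_{i}.jpg" for i in range(10_000)]   # uploaded through the mobile app

random.seed(0)
random.shuffle(app_images)

# Dev and test sets come only from the distribution the product will see:
# images uploaded through the app, so both sets share the same distribution.
dev_set   = app_images[:2_500]
test_set  = app_images[2_500:5_000]
train_set = web_images + app_images[5_000:]  # leftover app images enrich training

print(len(train_set), len(dev_set), len(test_set))  # 105000 2500 2500
```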
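And here is a metric-selection sketch for the satisficing/optimizing idea above. The model names, accuracies, and latency budget are invented for illustration; the pattern is simply to filter candidates by the satisficing metric and then rank the survivors by the optimizing metric.

```python
# Each candidate has an accuracy (optimizing metric) and a per-image
# latency in milliseconds (satisficing metric). Numbers are illustrative.
candidates = {
    "model_a": {"accuracy": 0.951, "latency_ms": 180},
    "model_b": {"accuracy": 0.948, "latency_ms": 45},
    "model_c": {"accuracy": 0.955, "latency_ms": 320},
}

MAX_LATENCY_MS = 100  # satisficing threshold: anything slower is unacceptable

def pick_best(models, max_latency):
    """Keep models that meet the latency constraint, then maximize accuracy."""
    feasible = {name: m for name, m in models.items()
                if m["latency_ms"] <= max_latency}
    return max(feasible, key=lambda name: feasible[name]["accuracy"])

print(pick_best(candidates, MAX_LATENCY_MS))  # -> model_b
```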

We have discussed factors to consider when setting up development and test sets. Now, let’s delve into when to change these sets and metrics.

  • In mature deep learning applications such as anti-spam systems, teams may spend months refining dev/test sets. Even so, if you find that your initial dev/test set or metric no longer reflects what your product needs, change it promptly. For instance, if your current setup ranks classifier A above classifier B, but your team is convinced B is better for the product, that is a sign the dev/test sets or the evaluation metric need revising. A common cause is that the dev/test distribution does not match the data you actually need to do well on: if your initial set mainly contains images of famous football players but users mostly upload lesser-known players, update your dev/test sets to reflect the real distribution.
  • You may have overfit to the development set if your algorithm consistently performs better on the dev set than on the test set. In this case, obtaining a fresh dev set is advisable. To track your team’s progress, you can regularly evaluate the system on the test set, but refrain from making decisions based on test set performance to avoid overfitting. This ensures an unbiased estimate of your system’s performance, crucial for research or business decisions.
  • When the chosen metric, such as plain classification accuracy in a football player detection app, stops reflecting the project’s priorities, it is time to reassess. For instance, if classifier A is deemed superior because of its higher accuracy but occasionally lets inappropriate content through, the metric fails to capture what success actually means for the app. In that case, switch to an evaluation metric that penalizes inappropriate content heavily (a sketch follows this list). Rather than persisting with an unreliable metric, adopt a new one promptly so the team has a clear target. Changing dev/test sets or metrics mid-project is common and healthy for fast iteration; if they no longer point the team in the right direction, adapt them and communicate the new approach clearly.
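To illustrate the last point, here is a minimal sketch of a weighted error metric. The tiny hand-written dev_results list and the penalty of 10 are purely illustrative; the idea is that mistakes on inappropriate images cost far more than ordinary mistakes, so a classifier that lets such content through no longer looks better just because its raw accuracy is higher.

```python
# Hypothetical dev-set evaluation records: was the prediction wrong,
# and is the image inappropriate? In practice these come from running
# your classifier over the dev set.
dev_results = [
    {"wrong": False, "inappropriate": False},
    {"wrong": True,  "inappropriate": False},
    {"wrong": True,  "inappropriate": True},   # misclassified inappropriate image
    {"wrong": False, "inappropriate": False},
]

def weighted_error(results, penalty=10.0):
    """Charge a mistake on an inappropriate image `penalty` times as much
    as an ordinary mistake, then normalize by the total weight."""
    weight = lambda r: penalty if r["inappropriate"] else 1.0
    total_weight = sum(weight(r) for r in results)
    weighted_mistakes = sum(weight(r) for r in results if r["wrong"])
    return weighted_mistakes / total_weight

print(f"weighted error: {weighted_error(dev_results):.3f}")  # 0.846
```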

In conclusion, these are practical guidelines for setting up development and test sets in machine learning projects, and a roadmap for navigating the challenges that come with them.
