Distributed Parallel Computing Made Easy with Ray
Illustrated with an example of multimodal offline batch inference with CLIP
(Intro) Data Quality over Data Quantity
This is a technical post summarizing my experience with the Ray library for distributed data processing, illustrated with an example of using Ray for scalable offline batch inference.
Recently, I had to process a large amount of data in a short time, and the quality of that data was critical for the downstream application.
For example, if you want to train an LLM, the success of training hinges on the quality of the training data.
You may ask, isn’t data quantity the secret to making a strong model?
It is not. Let me first explain why engineering effort should go into constructing a good dataset.
The dataset is the key to achieving good performance: it represents the upper bound of the knowledge the model can acquire (emergent behavior aside). Moreover, if the dataset contains duplicates or errors, they will negatively impact the model (see this interesting blog post about DALL-E’s dataset or this article from Anthropic).
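To make the duplicates point concrete, here is a minimal sketch (plain Python, not yet using Ray) of exact deduplication by content hashing, one of the simplest cleaning steps such a pipeline might include. The record schema and the "text" field are hypothetical, for illustration only.

```python
import hashlib

def dedupe_exact(records):
    """Drop records whose text content is an exact duplicate.

    `records` is assumed to be an iterable of dicts with a "text"
    field (hypothetical schema for illustration).
    """
    seen = set()
    unique = []
    for record in records:
        # Hash the normalized text so duplicates that differ only in
        # surrounding whitespace or letter case are still caught.
        normalized = record["text"].strip().lower().encode("utf-8")
        digest = hashlib.sha256(normalized).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique

# Example usage with toy data
data = [
    {"text": "The quick brown fox."},
    {"text": "the quick brown fox."},  # duplicate up to case
    {"text": "A different sentence."},
]
print(dedupe_exact(data))  # keeps 2 of the 3 records
```

Real pipelines typically go further (near-duplicate detection, filtering malformed samples), but even exact deduplication like this can meaningfully improve training data.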