
Distributed Parallel Computing Made Easy with Ray

Illustrated with an example of multimodal offline batch inference with CLIP

Betty LD
Towards Data Science
21 min read · Jan 6, 2025


(Intro) Data Quality over Data Quantity

This is a technical post summarizing my experience with the Ray library for distributed data processing, illustrated with an example of using Ray for scalable offline batch inference.
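
To give a flavor of what this looks like, here is a minimal sketch of offline batch inference with Ray Data and a Hugging Face CLIP model. The input and output paths, batch size, and concurrency below are placeholders for illustration, not the values from the actual pipeline:

```python
import ray
import torch
from transformers import CLIPModel, CLIPProcessor


class CLIPImageEmbedder:
    """Stateful worker: loads CLIP once per actor, then reuses it across batches."""

    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = (
            CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
            .to(self.device)
            .eval()
        )
        self.processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def __call__(self, batch):
        # batch["image"] is a NumPy array with one image per row.
        inputs = self.processor(
            images=list(batch["image"]), return_tensors="pt"
        ).to(self.device)
        with torch.no_grad():
            features = self.model.get_image_features(**inputs)
        batch["embedding"] = features.cpu().numpy()
        return batch


ray.init()

# Placeholder path: any local directory or S3 bucket of images works.
ds = ray.data.read_images("s3://my-bucket/images/")
ds = ds.map_batches(CLIPImageEmbedder, batch_size=64, concurrency=4)
ds.write_parquet("s3://my-bucket/clip-embeddings/")  # placeholder output path
```

The key idea: passing a callable class to `map_batches` makes the transform stateful, so each Ray actor loads the model weights once and reuses them for every batch it processes, instead of reloading them per batch.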

Recently, I had to process a lot of data in a short amount of time, and the quality of that data was critical for the downstream application.

For example, if you want to train an LLM, the success of the training depends on the quality of its training data.

You may ask, isn’t data quantity the secret to making a strong model?

Tons of data. Thanks to https://unsplash.com/@jjying for the picture.

It is not. First, let me share why engineering effort should go into constructing a good dataset.

The dataset is the key to achieving good performance: it represents the upper bound of the knowledge the model can acquire (emergent behavior aside). Moreover, if the dataset contains duplicates or errors, it will have a negative impact on the model (see this interesting blog post about DALL-E’s dataset, or this article from Anthropic).




Written by Betty LD

AI and Geospatial Scientist and Engineer.
