Machine Learning: No Dataset Available? Use The Hustle Strategy

Hadi Saghir
4 min read · Jan 26, 2024

Navigating the ever-changing landscape of machine learning and artificial intelligence, one quickly realizes that data is the lifeblood fueling innovation’s engines. However, what do you do when your startup or organization doesn’t have the luxury of a tailor-made dataset to power your unique machine learning needs? That’s where the “Hustle Strategy” comes to the rescue.

The Data Dilemma

Startup ventures and organizations often face a data dilemma. They grasp the potential of machine learning to revolutionize their operations but lack the essential data required to effectively train and optimize models. While public datasets exist, they tend to be generic and rarely align precisely with specific requirements. This is when the hustle for data kicks into gear.

The Hustle for Data

The “Hustle Strategy” involves the proactive pursuit of data from diverse sources to construct a custom dataset tailored precisely to your machine learning objectives. This approach proves invaluable when your business demands specialized data not readily available in existing repositories.

Synthetic Data Generation

Synthetic data, essentially data artificially crafted to emulate real-world information, steps in when collecting data proves challenging or infeasible. Its applications span model development, algorithm validation, data augmentation, privacy compliance, simulations, and educational purposes.

Creating synthetic data relies on three primary methods:

  1. Statistical Distribution-Based: Reproducing similar data by random sampling, informed by statistical distributions observed in real-world data. This method employs distributions like the normal, chi-square, and exponential.
  2. Agent-Based Modeling: Building a model that explains an observed behavior, then generating data consistent with that same model.
  3. Deep Learning-Based: Leveraging deep learning models, including Variational Autoencoders, Generative Adversarial Networks, and Diffusion Models, to produce synthetic data.
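To make the first method concrete, here is a minimal sketch (using only the standard library, with made-up numbers) that fits a normal distribution to a small "real" sample and draws synthetic values from it:

```python
import random
import statistics

# A small sample of "real" order values (illustrative numbers only)
real_orders = [23.5, 41.0, 37.2, 29.9, 45.1, 33.8, 38.6, 27.4]

# Estimate the parameters of a normal distribution from the sample
mu = statistics.mean(real_orders)
sigma = statistics.stdev(real_orders)

# Draw synthetic order values from the fitted distribution
random.seed(42)  # seed for reproducible output
synthetic_orders = [round(random.gauss(mu, sigma), 2) for _ in range(1000)]

print(len(synthetic_orders))
```

The same pattern generalizes to other distributions (chi-square, exponential) once you have identified which one best matches your real data.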

Open Source Tools

The power of open-source tools shines when it comes to synthetic data generation, offering cost savings, flexibility, and collaborative development. Tools such as Gretel, CTGAN, Copulas, DoppelGANger, and Twinify cater to a wide array of industry needs.

Challenges and Considerations

Despite its advantages, working with synthetic data presents challenges such as data reliability, mimicking outliers, and the need for expertise. Effective data management and evaluation become paramount, with tools like Git LFS aiding in dataset version control and metadata documentation ensuring clarity.
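Metadata documentation can be as lightweight as a JSON sidecar file recording how a dataset was produced. A minimal sketch, where the file name and field names are just one possible convention, not a standard:

```python
import json

# Hypothetical metadata describing how a synthetic dataset was generated
metadata = {
    "dataset": "synthetic_orders_v1.csv",
    "generator": "statistical-distribution (normal)",
    "source_sample_size": 8,
    "rows_generated": 1000,
    "random_seed": 42,
    "created": "2024-01-26",
    "notes": "Synthetic data; rare outliers are not represented.",
}

# Write the sidecar file next to the dataset so its origin stays traceable
with open("synthetic_orders_v1.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Reload to confirm the file round-trips cleanly
with open("synthetic_orders_v1.meta.json") as f:
    loaded = json.load(f)
```

Checking the sidecar into version control alongside the dataset (or its Git LFS pointer) keeps generation parameters reviewable.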

Practical Application: Building a Product-Based Search Engine

Now, let’s delve into the “Hustle Strategy” through a real-world example: constructing a product-based search engine.

Picture this: You’re in the process of developing a product-focused search engine, akin to those found on e-commerce platforms. Your users not only seek specific products but also expect suggestions for related items, similar options, and frequently purchased combinations. Achieving this level of search sophistication demands access to data like user click history, order records, and product associations. However, at the outset, such data may be out of reach.
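Even before real logs exist, synthetic order records can bootstrap the “frequently purchased combinations” piece. A minimal sketch that counts co-occurring product pairs (the orders below are made up):

```python
from collections import Counter
from itertools import combinations

# Made-up order records; in practice these would come from your own logs
orders = [
    ["jacket", "shoes"],
    ["jacket", "shoes", "earbuds"],
    ["handbag", "earbuds"],
    ["jacket", "handbag"],
    ["jacket", "shoes"],
]

# Count how often each unordered product pair appears in the same order
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(set(order)), 2):
        pair_counts[pair] += 1

# The most common pair is a candidate "frequently bought together" suggestion
top_pair, count = pair_counts.most_common(1)[0]
print(top_pair, count)  # → ('jacket', 'shoes') 3
```

Once real order data starts flowing in, the synthetic records can simply be swapped out for the genuine logs.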

The Hustle Begins: API Calls and Open Source Models

  1. API Calls to Websites: An avenue worth exploring is tapping into resources like OpenAI’s chat models, which stand out as some of the most advanced options available. With the right approach and API engineering prowess, they can serve as a valuable source of data to enrich your machine learning endeavors.

import requests

# Example of making an API call to retrieve product information
# (example.com stands in for a real product API endpoint)
response = requests.get("https://example.com/api/products")
response.raise_for_status()  # fail fast on HTTP errors
data = response.json()

2. Open Source Models:

Another valuable resource for hustling data is open source models and tools. Consider using a pre-trained model like Falcon, an open-source large language model, to analyze product descriptions and extract useful information automatically. Falcon can help you identify product attributes, such as color, brand, and size, from unstructured text.

Step 1: Create a List of Products

First, let’s create a list of products for which you want to generate related search terms and extract attributes. Hopefully, the retail store has a thousand products in some database; for illustration, a handful will do.

products = [
    "Stylish Blue Jacket by XYZ",
    "Men's Running Shoes",
    "Smartphone with High-Resolution Camera",
    "Classic Red Handbag",
    "Wireless Bluetooth Earbuds",
]

Step 2: Initialize Falcon Model

Now, let’s initialize the Falcon model to use for extracting attributes and generating related search terms. Falcon is distributed through Hugging Face, so make sure the transformers library (plus a backend such as PyTorch) is installed in your environment:

from transformers import pipeline

# Load an instruction-tuned Falcon model via Hugging Face Transformers
# (the weights are several GB and download on first use)
falcon_model = pipeline("text-generation", model="tiiuae/falcon-7b-instruct")

Step 3: Generate Related Search Terms

Loop through the list of products, prompt Falcon to generate related search terms, and save them for later use:

related_search_terms = {}

for product in products:
    # Prompt Falcon to generate related search terms
    prompt = f"Generate 5 related search terms for '{product}':"
    response = falcon_model(prompt, max_new_tokens=50)  # adjust as needed

    # The pipeline echoes the prompt, so strip it before splitting into terms
    generated = response[0]["generated_text"][len(prompt):]
    terms = [t.strip() for t in generated.split("\n") if t.strip()]
    related_search_terms[product] = terms[:5]  # save the first 5 terms

Step 4: Extract Attributes

You can also use Falcon to extract attributes from the product descriptions and save them for later use. Let’s assume you have product descriptions for each item:

product_descriptions = {
    "Stylish Blue Jacket by XYZ": "A stylish blue jacket from XYZ brand with multiple pockets.",
    "Men's Running Shoes": "High-performance men's running shoes with cushioned soles.",
    "Smartphone with High-Resolution Camera": "A smartphone featuring a high-resolution camera for stunning photos.",
    "Classic Red Handbag": "A classic red handbag made from genuine leather.",
    "Wireless Bluetooth Earbuds": "Wireless Bluetooth earbuds for on-the-go music and calls.",
}

# Initialize a dictionary to store extracted attributes
extracted_attributes = {}

for product, description in product_descriptions.items():
    # Prompt Falcon to pull out attributes such as color, brand, and size
    prompt = f"List the color, brand, and size mentioned in: '{description}'"
    response = falcon_model(prompt, max_new_tokens=30)
    extracted_attributes[product] = response[0]["generated_text"][len(prompt):].strip()

You can save both related_search_terms and extracted_attributes for future use or for training a BERT model as needed.

In this example, the search engine utilizes the data you’ve collected or extracted to provide search results tailored to the user’s query. These results may include product recommendations, similar items, and more, all based on the custom dataset you’ve built.
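As a rough sketch of that last step, the lookup below matches a user’s query against product names and related search terms. The data here is illustrative, standing in for what the earlier steps collected:

```python
# Illustrative data standing in for the collected related_search_terms
related_search_terms = {
    "Stylish Blue Jacket by XYZ": ["blue jacket", "winter jacket", "XYZ outerwear"],
    "Wireless Bluetooth Earbuds": ["bluetooth headphones", "wireless earphones"],
}

def search(query: str) -> list[str]:
    """Return products whose name or related terms contain the query."""
    query = query.lower()
    hits = []
    for product, terms in related_search_terms.items():
        haystack = [product.lower()] + [t.lower() for t in terms]
        if any(query in text for text in haystack):
            hits.append(product)
    return hits

print(search("bluetooth"))  # → ['Wireless Bluetooth Earbuds']
```

A production search engine would use an inverted index or embeddings rather than substring matching, but the custom dataset plays the same role either way.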

The Power of Custom Datasets

By hustling for data and putting together your own datasets, you’re giving your machine learning models the tools they need to perform their best, even when you can’t find the right datasets off the shelf. These custom-made datasets let you fine-tune your models for specific tasks, making the user experience better and uncovering valuable insights.
