Creating Datasets for Machine Learning. My Own Lessons and Experience.
Today I want to get more technical and talk about the core of machine learning and data science: data.
Without data, machine learning, data science, and all the related fields wouldn’t exist. In fact, these fields are popular now because huge amounts of data are generated daily. Fun fact: for some use cases there still isn’t enough data.
So what is a dataset?
A dataset is a collection of data. It can take the form of a table of columns and rows, where columns are features and rows are observations. Image datasets can be stored as file structures on disk or in dedicated databases. Text datasets can come in multiple forms: CSV, Excel, plain text files, or databases.
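To make the "columns are features, rows are observations" idea concrete, here is a minimal sketch using only Python's standard library. The column names and values are made-up toy data for illustration, not a real dataset, and treating the last column as the label is my assumption.

```python
import csv
import io

# Hypothetical toy data: each line after the header is one observation (row),
# and each comma-separated field belongs to one column (feature or label).
RAW_CSV = """height_cm,weight_kg,label
170,65,a
182,80,b
158,52,a
"""

def load_table(text):
    """Parse CSV text into a list of dicts, one dict per observation."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = load_table(RAW_CSV)
column_names = list(rows[0].keys())

print(column_names)  # the columns of the table
print(len(rows))     # the number of observations
```

Note that the csv module reads every value as a string, so numeric columns would still need converting before any modeling.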
The need for datasets
Without data, there’s no machine learning and no analytics. We need data to gain business insights, understand our problem, form hypotheses, guess what model architectures to use, and train our ML system.
Diving into our subject: as a data scientist, why do I need to learn how to create my own dataset?
I’m speaking here purely from personal experience, and I’m not a working professional yet.
1- Simply put, the problem you’re solving or the business you’re in may not have a ready-to-use dataset that you can just download and get to work with.
When working on a project, and before getting excited about the idea, make sure you have access to training data or at least know how you will create a dataset. As for me, sometimes I just LOVE getting excited about projects where I’ll create my own dataset.
I ran into this while working on my graduation project. The problem we were tackling was specific to the Delta in Egypt and there were no ready-to-use datasets, so we went ahead and read research papers on how to create a dataset with the required specifications in an automated manner ;)
2- It’s a very useful skill for every data scientist or ML engineer, not just for data engineers.
Learning it will strengthen your skill set and your resume, and it will also help with personal projects or when you’re working with a small team.
3- It’s fun (at least I think so).
I had a lot of fun creating an image dataset and labeling it on my own, as well as a music features dataset and a music genres dataset. Each time I learned something new, gained domain knowledge, picked up new tools, and dealt with different APIs.
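For a hand-labeled image dataset, one common layout is a folder per label with the images inside it. The sketch below assumes that layout (it is not necessarily how I organized mine) and turns it into a path/label table. The function names build_manifest and write_manifest are hypothetical helpers made up for this example.

```python
import csv
from pathlib import Path

# Assumption: these are the only image file types we care about.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def build_manifest(root):
    """Return sorted (relative_path, label) pairs for every image under root.

    The label is taken from the name of the folder that directly contains
    the image, e.g. root/cat/001.jpg gets the label "cat".
    """
    root = Path(root)
    pairs = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            pairs.append((path.relative_to(root).as_posix(), path.parent.name))
    return sorted(pairs)

def write_manifest(pairs, csv_path):
    """Write the (path, label) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["path", "label"])
        writer.writerows(pairs)
```

Calling build_manifest on the dataset folder and passing the result to write_manifest gives you a single CSV you can share or load later, instead of relying on everyone knowing the folder convention.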
Formulating the problem
As I said, before getting too excited about the project’s flashy idea, ask yourself these questions. These 3 questions are everything and are EQUALLY IMPORTANT. Once they’re answered, everything else is easy.
1- What is the solution I’m building?
Am I building a chatbot, a web app, a computer vision app, or what?
2- Where can I get training data?
Is it available as open source, will I pay for it, is it stored in a database and in need of preprocessing, will I collect it myself, will I collect and label it? Know everything.
3- How much domain knowledge do I have?
Building a medical diagnosis system? What do I know about medical diagnosis, or will I contact a domain expert (in this case a doctor)? Building a sign language translation system? Do I even know sign language? And so on.
Invest in this stage as much as you can so that you don’t lose time and resources. Try to document as many insights as you can to know exactly what you’re searching for.
My advice for you is
1- If you’re new to ML/data science, use datasets from Kaggle or any other resource to learn more about data distributions, hierarchies, and so on. Don’t just hop into creating your own, as it’s kinda advanced: it needs more experience and a sense of the problem, the solution, and the available products, plus reading research papers or using other tools like Apache tools, AWS, and SQL.
2- If your problem can be solved without this hassle, just download the dataset and don’t waste your time.
3- Ask an expert when you get stuck. This is actually the most important thing to do: experience cuts down more than 90% of the time you could spend just searching and being lost (also from personal experience).
4- Don’t be grumpy about it. ML is not about building models with a few lines of Keras, it’s about learning from DATA. So if your team decided to go with that option and you’re against it, either don’t complain about how hard and time-consuming it is or just do something else.
5- SHARE what you learned and how you created it. Oh my god, this would’ve saved me months of searching. The research community is not that kind, tbh: I read 30+ research papers just to find out how they created a change detection dataset, and no one bothered to illustrate the process or share the code. Then I came across someone who published a paper on his own dataset and, guess what, he wrote only 4 lines on how he did it. Luckily, I understood those 4 lines.
6- Finally have fun and focus on the end goal which is expanding your skillset. It’s all about growing and creating cool apps that solve human problems, isn’t it?
Thank you! Please feel free to share your thoughts with me, and to request any topics, fill in this form ❤