An Innovative Way of Dataset Creation for Deep Learning

Swagat Parida
Data Science Innovations
5 min read · Sep 10, 2020

Motivation

Unleashing the Power of Deep Learning: The Key Lies in Data

In deep learning, the algorithms themselves are often the simple part. The true challenge lies in collecting and curating a robust and relevant dataset.

Let’s be clear: the more relevant and accurate your data, the more likely your model will deliver remarkable accuracy. Deep learning thrives on the foundation of high-quality data.

Collecting and preparing a good dataset can be a complex task. It requires diligent efforts to ensure the data represents the diverse scenarios and variations present in the real world. This involves ensuring data accuracy, addressing biases, handling missing values, and maintaining overall data quality.

Investing in acquiring comprehensive and well-prepared data is paramount. It forms the bedrock of your deep learning model’s success. By embracing the challenge of data collection and preparation, you unlock the true potential of your model and pave the way for transformative results.

Stay motivated, remain focused on acquiring relevant and accurate data, and witness the incredible capabilities of deep learning unfold before your eyes.

Hence, just remember to keep the following formula in mind:

Embracing Innovation in Data Collection for Deep Learning

As deep learning enthusiasts, we often encounter diverse challenges when collecting data for different problems. It’s no secret that data collection can be a time-consuming process, often taking longer than anticipated.

However, let’s challenge ourselves to shift our perspective on how we approach data collection. Can we adopt a more innovative mindset?

By embracing innovation, we can explore alternative methods and strategies to expedite and optimize the data collection process. Here are a few ideas to consider:

  1. Leveraging existing datasets: Look for publicly available datasets or repositories that align with your problem domain. This can save time and provide a valuable starting point for your deep learning projects.
  2. Data augmentation techniques: Explore methods to generate synthetic data or expand your existing dataset through techniques like image transformations, text augmentation, or data synthesis. This can help diversify your data and increase its relevance.
  3. Crowdsourcing and collaboration: Engage with communities, researchers, or domain experts who may be willing to contribute or share relevant data. Collaborative efforts can accelerate data collection and foster a spirit of collective innovation.
  4. Transfer learning and pre-trained models: Consider leveraging pre-trained models or transfer learning techniques, which allow you to utilize existing models trained on large datasets and adapt them to your specific problem. This can significantly reduce the need for extensive data collection.
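To make the data augmentation idea from the list above concrete, here is a minimal sketch. It assumes an image is just a small grid of pixel values stored as nested Python lists; the helper names (`hflip`, `vflip`, `rot90`, `augment`) are made up for this illustration, and a real project would more likely rely on an image library.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [list(row) for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the original image plus three simple geometric variants."""
    return [img, hflip(img), vflip(img), rot90(img)]


# A toy 2x3 "image" of pixel values (illustrative data only).
image = [[1, 2, 3],
         [4, 5, 6]]

variants = augment(image)
for variant in variants:
    print(variant)
```

One source image becomes four training samples here. Whether a given transformation is safe depends on the problem: a flipped tiger is still a tiger, but flipped text is no longer readable.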

Remember, innovation often stems from challenging conventional approaches. By thinking creatively and exploring new avenues for data collection, we can overcome challenges more efficiently and drive breakthroughs in the field of deep learning. Let’s embrace the power of innovation and unlock new possibilities in our data-driven endeavors.

The Thought Processes

Unlocking Infinite Possibilities: Harnessing Real-World Image Data for Complex Datasets

Allow me to illustrate the process of creating a dataset of images with an example. When approaching any problem, I often immerse myself in the real world, seeking alignment with its intricacies before attempting to provide solutions. By shifting our perspective, we can uncover countless avenues to generate complex datasets that effectively address intricate problems.

Consider exploring the real world from a fresh perspective, focusing on capturing image data that has already been processed by human brains. For instance, let’s take the example of a hashtag on a social media platform. Within this vast realm, an abundance of real-world image data awaits us. This provides an opportunity to collect an enormous variety of images that capture diverse contexts and subjects.

Moreover, by performing diligent work, we can go beyond merely collecting raw images. We can strive to obtain better annotated data, enriching the dataset with additional information that adds valuable context. Additionally, employing data augmentation techniques allows us to create augmented data, expanding the dataset’s diversity and enhancing its suitability for solving complex problems.

By leveraging the power of imagination and embracing the abundance of real-world image data, we unlock the potential to create comprehensive and dynamic datasets. These datasets, coupled with meticulous annotation and augmentation, equip us with the tools to tackle intricate challenges and drive innovation in the realm of deep learning. Let’s explore the infinite possibilities that lie within the real world and revolutionize the way we approach complex problems.

Looking for Realtime Data

Let us take Instagram as an example:

Example 1: #tiger


What do we see in the above search result?

  1. More than 9 million images
  2. Many different types of images
  3. Images that already include augmentations
  4. Images that already include annotations
  5. Negative examples
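The annotation and negative-data points above can be sketched in a few lines. This is only an illustration under stated assumptions: the captions are invented sample strings, and `weak_label` and the hint words are hypothetical choices, not part of any platform API. The idea is that hashtags written by people act as rough, "weak" labels that still need human review.

```python
import re

# Matches "#word" and captures the word without the leading "#".
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(caption):
    """Return the set of lowercase hashtags found in a caption."""
    return {tag.lower() for tag in HASHTAG_RE.findall(caption)}


def weak_label(caption, target, negative_hints):
    """Guess a rough label for a post from its hashtags alone."""
    tags = extract_hashtags(caption)
    if target not in tags:
        return "unrelated"
    if tags & negative_hints:
        # The target tag is present, but another tag suggests that the
        # picture does not show the real thing (a candidate negative).
        return "likely_negative"
    return "likely_positive"


# Invented sample captions, for illustration only.
captions = [
    "Spotted on safari today #Tiger #wildlife",
    "My new #tiger #tattoo",
    "Sunset at the beach #nofilter",
]

hints = {"tattoo", "toy", "drawing"}
labels = [weak_label(caption, "tiger", hints) for caption in captions]
print(labels)
```

Labels produced this way are only a starting point. They help sort a large collection into piles before the manual cleanup and annotation steps.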

Now, just imagine the power of this dataset. Let us look at some more examples.

Example 2: #elephant

Example 3: #beer

After seeing the examples above, we as data scientists can recognise the thought process behind achieving something innovative. Let’s continue with our innovation mindset.

What happens next?

  • Extract the images from these #hashtag(s)
  • Find false positives and negative images through data cleanup
  • Annotate the data (using either your own tool or external tools)
  • Train the model (as per your requirement)

Example of data extraction from #tiger
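As a rough sketch of the cleanup and annotation steps, the snippet below removes exact duplicate images and prepares a manifest that an annotation tool could consume. It assumes the images have already been extracted and are available as `(file name, bytes)` pairs; the function names and the manifest fields are hypothetical, chosen only for this example.

```python
import hashlib


def content_hash(data):
    """Fingerprint the raw image bytes."""
    return hashlib.sha256(data).hexdigest()


def deduplicate(images):
    """Keep the first occurrence of each distinct image content."""
    seen = set()
    unique = []
    for name, data in images:
        digest = content_hash(data)
        if digest in seen:
            continue  # exact duplicate of an image we already kept
        seen.add(digest)
        unique.append((name, data))
    return unique


def build_manifest(images, label):
    """Create one annotation record per image, awaiting manual review."""
    return [
        {
            "file": name,
            "sha256": content_hash(data),
            "label": label,
            "reviewed": False,
        }
        for name, data in images
    ]


# Stand-in bytes instead of real image files (illustrative data only).
downloaded = [
    ("a.jpg", b"\x01\x02"),
    ("b.jpg", b"\x01\x02"),  # same content as a.jpg
    ("c.jpg", b"\x03"),
]

unique_images = deduplicate(downloaded)
manifest = build_manifest(unique_images, label="tiger")
print([entry["file"] for entry in manifest])
```

Hashing the bytes only catches exact copies. Reposts that were resized or filtered would need a fuzzier comparison, which is outside the scope of this sketch.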

Let us also look at some special accounts/hashtags/public portals, where the data are already classified and well annotated. For example,

These data have great potential to make predictions more accurate and realistic.

Assumptions:

  • The image data are public and available to extract.
  • Data must be extracted carefully to avoid misusing any personal information.
  • Make sure to give credit to the respective and deserving users/portals/sites/organisations etc. :)
  • Use the different API filters to extract related image data, e.g. free images/no license images
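The assumptions above can also be enforced in code. The sketch below is one possible way to do it, assuming each extracted image comes with a metadata record; the field names (`is_public`, `license`, `author`, `location`) and the allowed licence values are hypothetical and will differ from one source or API to another.

```python
# Licences we are willing to use (an assumption for this sketch).
ALLOWED_LICENSES = {"cc0", "cc-by"}


def is_usable(record):
    """Accept only public images with an allowed licence."""
    license_name = record.get("license", "").lower()
    return bool(record.get("is_public")) and license_name in ALLOWED_LICENSES


def to_dataset_entry(record):
    """Keep only the fields we need; everything else is dropped."""
    return {
        "url": record["url"],
        "license": record["license"],
        "credit": record["author"],  # retained so credit can be given
    }


# Invented metadata records, for illustration only.
records = [
    {"url": "https://example.com/1.jpg", "is_public": True,
     "license": "CC-BY", "author": "user_a", "location": "somewhere"},
    {"url": "https://example.com/2.jpg", "is_public": False,
     "license": "CC0", "author": "user_b"},
    {"url": "https://example.com/3.jpg", "is_public": True,
     "license": "all-rights-reserved", "author": "user_c"},
]

entries = [to_dataset_entry(r) for r in records if is_usable(r)]
print(entries)
```

Copying only an explicit set of fields means extra personal details, such as the location in the first record, never reach the dataset, while the author name stays available for attribution.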

What’s in the next article?

I will share some techniques and thoughts on utilizing these extractions for better predictions.

Stay tuned and cheers!


Swagat Parida
Data Science Innovations

Innovative Engineering Leader | Mobile Technologies Expert | AI, ML, AR, VR Enthusiast | Holder of 14+ IPs (6+ Patents, 8+ Defensive Publications)