Collect Data Properly for AI Machine Learning

Brian Ka Chan
Taming Artificial Intelligence
3 min read · Jan 22, 2019

Practicing Mindful Data Collection for Successful Artificial Intelligence

How to Collect Data Properly for Artificial Intelligence and Machine Learning Success

*Syndicated content from the Applied AI blog Mind Data

In the current wave of Artificial Intelligence technologies, machine learning is closely associated with AI, and in many cases it is treated as equivalent to Artificial Intelligence. Machine learning is actually a subset of Artificial Intelligence. The discipline relies on data to train a machine, whether that training is supervised or unsupervised.

Supervised machine learning trains the machine on a sample of labeled data that shows it what is right and what is wrong. After thousands to millions of samples, the machine learns to recognize the patterns.

Unsupervised learning, on the other hand, lets the machine learn on its own by trying to identify patterns in the data provided. The machine isn’t told which data is useful and which is not, nor which data is correct.
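To make the contrast concrete, here is a minimal sketch in plain Python. It is not how a production system would be built: the one-dimensional values, the "small" and "large" labels, and the function names are all assumptions made up for illustration. The supervised part learns one average per label from labeled samples, while the unsupervised part receives the same numbers without labels and has to split them into two groups on its own.

```python
from statistics import mean


def train_supervised(samples):
    """Learn one centroid (average value) per label from (value, label) pairs."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return the label whose centroid is closest to the given value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def cluster_unsupervised(values, iterations=10):
    """Split unlabeled values into two groups with a basic 1-D two-means loop."""
    low, high = min(values), max(values)  # initial guesses for the group centers
    near_low, near_high = list(values), []
    for _ in range(iterations):
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        if not near_high:  # all values identical: nothing to separate
            break
        low, high = mean(near_low), mean(near_high)
    return sorted(near_low), sorted(near_high)


# Hypothetical labeled sample: the labels tell the machine what is what.
labeled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
           (9.0, "large"), (9.5, "large"), (10.1, "large")]
centroids = train_supervised(labeled)
print(predict(centroids, 2.0))

# The same values with the labels removed: the machine must find the groups.
unlabeled = [value for value, _ in labeled]
print(cluster_unsupervised(unlabeled))
```

In both halves the result depends entirely on the values that go in, which is why the quality of the data matters more than the learning step itself.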

In both of the above cases, the most important factor is not the learning process but the quality of the data. In my experience with data science and applied intelligence projects, the most time-consuming part is not waiting for the machine to learn but preparing the data required to train it.

On average, 80% of the time my team spends on AI or data science projects goes to preparing data. Preparing data includes, but is not limited to:

  1. Identify the data required
  2. Identify the availability of the data and where it is located
  3. Profile the data
  4. Source the data
  5. Integrate the data
  6. Cleanse the data
  7. Prepare the data for learning
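Two of the steps above, profiling and cleansing, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `customer_id` and `country` fields, the sample records, and the function names are hypothetical, and a real project would usually rely on a dedicated data tool rather than hand-written loops.

```python
def profile(records, fields):
    """Count missing values and distinct raw values for each field."""
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        present = [v for v in values if v is not None and str(v).strip() != ""]
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


def cleanse(records, required):
    """Trim strings, drop rows missing a required field, drop exact duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        row = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
        if any(row.get(field) in (None, "") for field in required):
            continue  # a required value is missing
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # identical row already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned


# Hypothetical source records with typical problems.
records = [
    {"customer_id": "C1", "country": " CA "},  # stray whitespace
    {"customer_id": "C1", "country": "CA"},    # duplicate once trimmed
    {"customer_id": "C2", "country": ""},      # missing country
    {"customer_id": "C3", "country": "US"},
]

report = profile(records, ["customer_id", "country"])
cleaned = cleanse(records, required=["customer_id", "country"])
print(report)
print(cleaned)
```

The profile counts " CA " and "CA" as two distinct raw values on purpose, because surfacing that kind of inconsistency before cleansing is the point of the profiling step.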

Even though there are only 7 steps, they will determine whether your machine learning project is a success or another common failure.

How to avoid a Machine Learning failure by Mindful AI Data Collection

To avoid spending too much time preparing data and still ending up with a machine learning AI project that delivers no value, my best suggestion is to practice “Mindful Data Collection”. No pun intended on the name Mind Data, by the way.

What is “Mindful Data Collection”?

Mindful data collection is the practice of considering the uses of data before you even create them in your environment. In a typical case, when you create data in your ecosystem, you have only one purpose in mind: the transaction. The data are created because we need to perform a transaction in our system, so we define the data the way we want them to be. Mindful data collection takes it a step further. A mindful data collector would consider whether such data points already exist within the organization.

If the data point already exists within the organization, the better choice is to reuse its metadata and format rather than reinvent the wheel. To practice mindful data collection, there are a few places you can review:

  1. existing data dictionary
  2. your data governance organization
  3. owners of major processes within the organization
  4. public data standards
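As a rough illustration of the first item, the check below looks up a proposed field in an existing data dictionary before anyone defines it again. Everything here is an assumption for the sake of the example: the dictionary contents, the field names, and the naming convention in `normalize` are hypothetical, and a real data dictionary would normally live in a catalog or governance tool rather than in code.

```python
# Hypothetical extract of an organization's existing data dictionary.
EXISTING_DICTIONARY = {
    "customer_id": {"type": "string", "format": "C followed by digits"},
    "order_date": {"type": "date", "format": "YYYY-MM-DD"},
}


def normalize(name):
    """Map spellings such as 'Order Date' or 'order-date' to 'order_date'."""
    return name.strip().lower().replace(" ", "_").replace("-", "_")


def find_existing_definition(proposed_name, dictionary):
    """Return the existing definition for a proposed field, or None if it is new."""
    return dictionary.get(normalize(proposed_name))


for proposed in ["Order Date", "shoe_size"]:
    definition = find_existing_definition(proposed, EXISTING_DICTIONARY)
    if definition is None:
        print(f"{proposed}: not found, define it with the governance team")
    else:
        print(f"{proposed}: reuse existing format {definition['format']}")
```

A lookup like this only catches fields that differ in spelling. Finding fields that mean the same thing under a different name still takes a conversation with the data owners listed above.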

Mindful AI Data Collection includes Mindful Data Quality

Mindful data collection in artificial intelligence and machine learning is not only about how data are collected; the quality of the data is also important. There are many factors in data quality:

  1. data quality requirements
  2. data rules
  3. data policies
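Data rules become most useful when they are written down in a form that can be checked at the moment data are created. The sketch below shows one possible shape for that, assuming three hypothetical fields and rules that are invented for this example rather than taken from any standard.

```python
import re
from datetime import date


def is_iso_date(value):
    """True if value is a string that parses as an ISO calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical data rules: one check per field.
RULES = {
    "customer_id": lambda v: isinstance(v, str) and re.fullmatch(r"C\d+", v) is not None,
    "order_date": is_iso_date,
    "quantity": lambda v: isinstance(v, int) and v > 0,
}


def violations(record, rules=RULES):
    """Return the fields in a record that are missing or break their rule."""
    return [field for field, check in rules.items()
            if field not in record or not check(record[field])]


good = {"customer_id": "C42", "order_date": "2019-01-22", "quantity": 3}
bad = {"customer_id": "42", "order_date": "2019-13-01", "quantity": 0}
print(violations(good))
print(violations(bad))
```

Running checks like these when a record is captured is far cheaper than discovering the same problems during data preparation for a machine learning project.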

Many of my clients used to consider data quality as a management opportunity, or a fix that is required to manage data as a strategic asset. They are not wrong, but data collection and data quality are more than that in today’s world.

The opportunity and value of managing data collection and data quality up front are big. Remember that I said 80% of the time in a machine learning or data science project is spent on data cleansing and management? Imagine if you could shorten that time by 50%.

Imagine the dollar value of the time you could save for skilled (and expensive) data scientists and machine learning engineers by being a bit more mindful.

Imagine the flexibility and agility your team would gain if every hypothesis you test took that much less time.
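To put rough numbers on the 80% and 50% figures above: if preparation takes 80% of a project and you cut preparation time in half, the whole project gets shorter by 0.8 × 0.5 = 0.4, or 40%. The short calculation below works that through for a 10-week project, where the project length is a hypothetical value chosen only to make the arithmetic easy to follow.

```python
def overall_time_saved(prep_share, prep_reduction):
    """Fraction of total project time saved when only data preparation gets faster."""
    return prep_share * prep_reduction


PROJECT_WEEKS = 10  # hypothetical project length

saved = overall_time_saved(prep_share=0.8, prep_reduction=0.5)
weeks_saved = PROJECT_WEEKS * saved
print(f"{saved:.0%} of the project, or {weeks_saved:.0f} of {PROJECT_WEEKS} weeks")
```

Under these assumptions the team gets 4 of the 10 weeks back, which is time that can go into testing more hypotheses.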

If you think there is an opportunity in your organization to take advantage of Mindful Data collection, explore it and apply it.

Let me know your thoughts, or if you have any questions. This is Mind Data Intelligence.


Brian Ka Chan
Taming Artificial Intelligence

Technology Strategist, AI Researcher, Human Rights Advocate, High-Impact Philanthropist