Why is Data Annotation so important?
We have moved beyond the time when talking about Artificial Intelligence (AI) felt like science fiction. AI is now pretty much everywhere, with real everyday use cases in our phone apps, automobiles, financial products, marketing campaigns, healthcare advances and most business decisions. The hype is everywhere, but the promise of a perfectly customized and automated AI use case rarely materializes due to one less sexy word: data!
We are at the point where it is well understood that AI and Machine Learning (ML) systems require large amounts of data to continuously learn from: to identify patterns and trends, predict outcomes, categorize and classify data, and so on. These are tasks that humans simply can't handle at scale but algorithms can. Even more critical than the quantity of data is its quality, which in my own experience is one of the major culprits behind many low-performance, low-accuracy ML implementations.
There’s a tendency in the AI world to focus on the more complex engineering tasks when building an AI project end to end: creating ML models from scratch or reusing pre-trained models via Transfer Learning, and trying to improve accuracy by fine-tuning and tweaking parameters while relying on free public datasets that are not particularly well suited to the custom use case. There’s not much emphasis on the data-related tasks (collection, preparation, clean-up and annotation), especially in small to mid-sized projects where resources are scarce. Carefully choosing our data annotation strategy (using the right sampling methodology to feed our initial ML model, ensuring we have high-quality, customized annotated data, or even using more advanced Active Learning techniques) can yield bigger wins than a perfectly tuned model, not to mention a better overall understanding of our models’ behavior and of the edge cases that might require more training data.
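To make the active learning idea concrete, here is a minimal sketch of uncertainty sampling, one common strategy for deciding which unlabeled examples to send to annotators first. The model probabilities and document names below are illustrative stand-ins, not output from any real system.

```python
from math import log

def entropy(probs):
    """Shannon entropy of a probability distribution; higher = more uncertain."""
    return -sum(p * log(p) for p in probs if p > 0)

# Hypothetical model predictions for four unlabeled documents (binary task).
unlabeled = {
    "doc_a": [0.98, 0.02],  # model is confident
    "doc_b": [0.55, 0.45],  # model is unsure, worth annotating
    "doc_c": [0.90, 0.10],
    "doc_d": [0.51, 0.49],  # most uncertain of all
}

# Rank documents by prediction entropy, most uncertain first,
# and hand the top of the list to human annotators.
to_annotate = sorted(unlabeled, key=lambda d: entropy(unlabeled[d]), reverse=True)
print(to_annotate[:2])  # ['doc_d', 'doc_b']
```

The intuition: labeling examples the model is already sure about adds little, while labeling the ones near the decision boundary teaches it the most per annotation dollar.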
For many small organizations, data-related challenges have been the major barrier to entering the AI space, and for those that have overcome them, the lack of quality data has been the main reason projects haven’t delivered on expectations. This usually happens when data hasn’t been part of the initial discussion when kicking off an AI project. Companies of all sizes need to start designing their data strategies, including gathering data to back up the problem statement, collection, clean-up, annotation and analysis, and need to start thinking of AI as the engine behind many novel use cases, from internal process automation to predictions that support strategic decisions across departments. Like any engine, AI needs the right fuel to run efficiently, and this is where Data Annotation should play a big part in any company’s future AI strategy.
What is Data Annotation?
We understand that data is important, and it’s generally agreed that the more we have the better, since we can model more real-world scenarios and improve the accuracy of our ML systems. We briefly discussed why a good data strategy is critical at the project level, and even better company-wide, so more teams have the fuel to develop their own AI use cases without building their data pipelines and processes from scratch. Now let’s look at the part of the data process that is the most expensive in time and resources, yet the most relevant to a successful implementation.
Data Annotation is the process by which we enrich our data by labeling the content and/or objects of our texts, images, videos and audio with ‘known’ information. These labels make our data smart, so our ML models can fully understand its meaning (i.e. a picture of a cat is a cat, Barcelona is a location in a text, or a newspaper article talks about sports). Annotated data provides the initial setup a machine learning model needs for training, and supplies ground-truth examples of how to map inputs to accurate outputs.
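As a concrete illustration, here is what a tiny span-based text annotation might look like in code, mirroring the “Barcelona is a location” example above. The format (character offsets plus a label) is a hypothetical simplification of what real annotation tools export.

```python
# One sentence and its annotations: each label records the character
# offsets of an entity in the text and the entity type.
sentence = "Barcelona hosted the conference last spring."

annotations = [
    {"start": 0, "end": 9, "label": "LOCATION"},  # the span "Barcelona"
]

def extract_spans(text, labels):
    """Return (surface_form, label) pairs for each annotated span."""
    return [(text[a["start"]:a["end"]], a["label"]) for a in labels]

print(extract_spans(sentence, annotations))  # [('Barcelona', 'LOCATION')]
```

A model trained on thousands of such examples learns to predict the labels for sentences it has never seen.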
There are multiple types of data annotation processes depending on the problem we want to solve and the data at hand. The four most common data types are text, images, audio and video, and each has its own particularities, annotation strategies, tools and skillsets required of the humans-in-the-loop. In all of them, people are needed to identify and annotate data so that algorithms can learn from human expertise and then classify, categorize, summarize, translate, transcribe or produce any other type of prediction.
Coming from an international background (I started my engineering career in the localization and i18n fields, building tools and hanging out with a bunch of international folks), I’ve always had a thing (or two) for languages, international cultures and traveling, and personally there’s nothing that fits more nicely into the AI world than a perfectly designed NLP use case, with all the nuances of language comprehension, culture and domain knowledge. When all these things come together, it feels like magic.
Unlike other annotation tasks (mostly image and video), text annotation requires a human-in-the-loop with specific knowledge (language, domain knowledge/topic expertise, and others) and involves a broader range of subjective input that makes it more challenging from a data quality point of view:
- Language, linguistics and cultural background: a deep knowledge of the language in which annotation tasks are performed is essential to create labels that reflect real-world behavior. Add to the mix the local knowledge required to derive the meaning of a particular word, sentence or document, which may depend on local context and vary greatly with a multitude of factors (location, social perception, cultural background, etc.), and you end up with a rather complex mission: obtaining a near-perfect, unbiased annotated dataset that represents the real world with which to train your model.
- Subject matter expertise: categorizing a legal text is not the same as summarizing a healthcare article, and the background and expertise of the people doing such tasks should reflect that domain knowledge. As obvious as it seems, not every organization makes this a priority, and companies often end up with training datasets that are not carefully crafted and annotated by the right people, hurting the overall performance and accuracy of the ML system.
- Human Intelligence: some tasks are pretty simple and binary (positive/negative, yes/no, etc.), but text annotation tasks (coreference resolution, question answering, document categorization and many others) frequently involve a degree of subjectivity, so two humans annotating the same document might end up with different labels. This is where we need techniques to obtain the best training dataset possible and mitigate subjective input and annotator bias. From common inter-annotator agreement measures to more advanced error modeling across multiple annotators, it’s a field in constant evolution that is now starting to be considered critical to improving our models.
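To ground the inter-annotator agreement idea, here is a minimal sketch of Cohen’s kappa, one of the standard agreement measures for two annotators labeling the same items. It corrects raw agreement for the agreement we’d expect by chance; the labels below are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement from each annotator's
    label distribution. 1.0 = perfect agreement, 0 = chance level."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Two annotators labeling the sentiment of the same ten documents.
ann1 = ["pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg", "pos", "pos"]
ann2 = ["pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]

print(round(cohens_kappa(ann1, ann2), 3))  # 0.583
```

A low kappa on a pilot batch is a cheap early warning that the annotation guidelines are ambiguous and need revision before labeling at scale.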
At the end of the day, cutting-edge ML technology and the most advanced algorithms cannot solve real-world problems without the right data. Having access to large amounts of data matters, but having access to learnable, high-quality annotated data at scale is the biggest advantage that companies pushing the boundaries of AI have today.