How to build a comprehensive AI/ML system

A high-level guide to building a comprehensive AI/ML system based on our own experiences

Coupang Engineering
Coupang Engineering Blog
18 min read · Oct 12, 2018


By Hong Chen

This post is also available in Korean.

AI/ML systems are built differently in industrial and academic settings. In industry, models are developed using a heuristic, knowledge-based approach that leverages large amounts of data. AI/ML research in academia, on the other hand, takes a mathematical, model-based approach to model configuration, relying heavily on statistics and probability.

In practice, however, the knowledge-based and model-based approaches should be used in coordination to build a comprehensive AI/ML system. Each approach has shortcomings that the other can compensate for. When the two are combined in the right proportion, their synergy can improve overall system performance.

Definitions

First, let’s look at the difference between knowledge and model-based approaches.

Wikipedia defines knowledge as “a familiarity, awareness, or understanding of someone or something… acquired in many different ways and from many sources.” When we limit the definition of knowledge to the specific context of AI, it can be thought of as ‘something described by language’, or a set of labels. For example, a labeled set of brand names or a list of logo names can be construed as knowledge.

A model-based approach to AI/ML refers to functional mapping: the mapping of a given dataset to its appropriate labels, or knowledge. All parametric and non-parametric algorithms map input values to output values through statistical approximation and optimization techniques.
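As a minimal illustration of such a functional mapping, consider a parametric model in its simplest form: a line y = w·x + b whose parameters are chosen by statistical optimization (ordinary least squares). The function names below are our own illustrative choices, not any particular library’s API.

```python
# A parametric model at its simplest: a linear map y = w*x + b whose
# parameters are found by statistical optimization (ordinary least squares).
def fit_linear(xs, ys):
    """Fit w and b so that w*x + b best approximates the (x, y) pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

# Map inputs (features) to outputs (labels): this data roughly follows y = 2x.
w, b = fit_linear([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
predict = lambda x: w * x + b
```

The mapping is learned purely from the data distribution; the “knowledge” here is nothing more than the labeled y values.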

Limitations of a knowledge-based approach

Knowledge-based systems rely on human knowledge. However, it is not easy to develop a completely knowledge-based model for three main reasons.

First, it is impossible for humans to thoroughly digest the vast amount of knowledge in any single field. For this reason, a system that depends on human knowledge is not scalable.

Second, knowledge is time sensitive. Knowledge can quickly become obsolete, especially in the fast-paced environment of technology. For example, although the iPhone is widely recognized today, it may not be in 100 years or so. The human knowledge base evolves organically, which is tricky to model computationally.

Finally, knowledge is context dependent. Knowledge only gains meaning through context. Apple refers to the company in tech and business contexts, but an apple in food-related contexts refers to the fruit. It is extremely difficult to model such diverse contexts using a knowledge-based approach, while a statistical model can easily identify such contextual correlations using data distributions.

Limitations of a model-based approach

That is not to say the model-based approach is without its own limitations.

First, statistical models approximate the most generalized fit of the entire dataset. In this process, a small number of outliers can receive inaccurate outputs, and most data points will not fall exactly on the model’s fitted curve. A good curve only closely predicts the distribution of the majority of the data points. If a model’s output passes exactly through every single data point, the model is overfitting and will not generalize to other data.

Second, human decision-making entails a process of logical reasoning. Human-like logic is difficult to mimic in a statistical or deep learning model, while it is relatively easy to realize in a knowledge-based model.

Figure 1. An example of a function that is a good fit versus one that is overfitted to the data (source)
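The overfitting point can be made concrete with a toy sketch of our own (not from the figure above): a “model” that memorizes every training point achieves zero training error but has nothing to say about unseen inputs, while a simple generalized fit still gives a plausible answer.

```python
# A toy illustration of overfitting: a model that memorizes every training
# point has zero training error but cannot generalize, while a simple
# linear fit tracks the underlying trend (data roughly follows y = 2x).
train = {1: 2.1, 2: 3.8, 3: 6.3, 4: 7.9}

def memorizer(x):
    # "Passes through every single data point" -- perfect on train, useless elsewhere.
    return train.get(x)  # returns None for unseen inputs

def linear(x):
    # Generalized fit: closely predicts the distribution, not each point.
    return 2.0 * x

train_err_memo = sum(abs(memorizer(x) - y) for x, y in train.items())
print(train_err_memo)  # 0.0 -- suspiciously perfect on the training data
print(memorizer(5))    # None -- no answer for unseen data
print(linear(5))       # 10.0 -- a plausible extrapolation
```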

Case study

Let’s look at examples where the knowledge-based and model-based approaches joined forces in a successful model.

The Google Search engine is one successful case that not only learns search rankings from algorithms and data but also realizes knowledge-based logic. For instance, search “嫩滑”, a Chinese term meaning soft and glossy, on different search engines. On certain search engines you will get adult content, but on Google you will get innocuous images of tofu. Other search engines use purely data-driven models and reflect the fact that most searchers are looking for adult content. Google, on the other hand, applies knowledge from outside the data-driven model to show results that are more ‘appropriate’.

Another example is autonomous driving. On the surface, autonomous driving seems to be composed of mapping algorithms that learn to map input signals from sensors to output signals such as steering, braking, and accelerating. However, according to the autonomous vehicle system architecture presented in a Udacity lecture, models must also be trained on contextual knowledge such as lane and traffic signal detection. This type of knowledge-based training generates perception.

Figure 2. The system architecture of an autonomous vehicle (source)

Objectives

When building an AI/ML system, there are several ways to use both knowledge and model-based approaches. However, they are difficult to integrate harmoniously. Keep the objectives below in mind to develop an optimal model.

Open-domain generalization

The system should generalize to data outside of the training data. For example, if you have a model that extracts brand names given a block of text, it should be able to detect the names of brands that are not included in the training dataset.

Self-learning

A self-learning model should be a dynamic system, equipped with the ability to accumulate new training data for continuous learning. Furthermore, a self-learning model should be developed to periodically evaluate itself to initiate the re-learning process when necessary.
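Such a loop can be sketched minimally as follows, assuming hypothetical `evaluate` and `retrain` callables (our own names, not a real API): the system periodically scores itself and initiates re-learning only when the metric degrades past a threshold.

```python
# A minimal sketch of a self-learning loop. `evaluate` and `retrain` are
# hypothetical stand-ins for the real evaluation and training pipelines.
def maybe_retrain(model, eval_data, evaluate, retrain, threshold=0.9):
    """Evaluate the model; retrain on accumulated data if it has degraded."""
    score = evaluate(model, eval_data)
    if score < threshold:
        return retrain(model, eval_data), True   # re-learning initiated
    return model, False                          # model still healthy
```

In production this check would run on a schedule, with `eval_data` drawn from the continuously accumulated training data.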

Adaptable

Although humans can intuitively deduce the needed output format when given a task, machines cannot. Instead, models should be built so that their outputs are adaptable to a diverse range of inputs given certain parameters.

Building an AI system

In this section, we will discuss how to architect a model that efficiently utilizes both the knowledge and model-based approaches.

Modules

To build such an AI system, the following modules are required:

  • Data collection module. Current machine learning techniques depend on large amounts of data. Without data, there is no knowledge and no model.
  • Data labeling module. Initially, data must be labeled manually. As knowledge and the database evolve, data labeling must be partially automated.
  • Model training and evaluation module. Models are trained and evaluated based on the available data.

Operations

The three modules above must work in coordination to continually improve the system. Here we will discuss four functionalities that must be prioritized to develop a successful AI system.

First, the model must fully utilize all available knowledge. For example, suppose a visual recognition model is trained to determine whether the products in two separate images are the same. To advance model complexity and improve performance, the model can also be trained to recognize properties in the images, such as product categories like shoes or monitors. Using this extra information, the model can better determine whether the two products are the same.

Second, the model must have the ability to discover new knowledge in an open-domain environment, because the model is likely to be exposed to unseen data during production. Let’s take an example of a model that is trained to output a brand given a product name. To have this open-domain adaptability, the model must be built in the following ways:

  • Use properties in classification problems. Rather than simply mapping the product name to the brand name, model performance can be improved by deriving the token embeddings of the product and brand names. These embeddings carry information about the properties of each token that can be used in classification. With token embeddings, the product name is no longer a single class but becomes a collection of properties.
  • Convert a closed-domain problem to an open-domain problem. The closed-domain approach is to train a classifier on a set of product names labeled with their corresponding brand names. However, the open-domain solution is to train a model to derive meaningful token embeddings of the product and brand names.
  • Utilize one-shot learning. One-shot learning is a method of training a model on a small, randomly selected subset of a larger dataset. The goal is to attain the same level of performance whether the model is trained on a small or large dataset.
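The open-domain idea in the bullets above can be sketched as follows. This is a hedged stand-in: a real system would compare learned token embeddings, while here character trigram sets play the role of embeddings. The key property carries over either way: an unseen brand string can still be matched by representation similarity, rather than being rejected because it was never a training class.

```python
# Open-domain sketch: instead of a closed classifier over a fixed brand list,
# compare the *representation* of a product name against brand representations
# and pick the nearest one. Character trigram sets stand in for embeddings.
def ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def nearest_brand(product_name, brands):
    """Return the brand whose representation is closest to the product's."""
    p = ngrams(product_name)
    return max(brands, key=lambda b: jaccard(p, ngrams(b)))

# Works even for brand strings the "classifier" never saw as labeled classes.
print(nearest_brand("nike air zoom running shoes", ["adidas", "nike", "puma"]))
# nike
```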

Third, model performance must be monitored with an accurate evaluation metric. If the model is not fitting the data as expected, the evaluation metric must be followed closely across training iterations to design improvement methods. In addition, a mechanism to measure the validity and efficiency of the training data is useful during training.
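One simple way to follow the metric across iterations, sketched here with a hypothetical helper of our own naming: flag runs whose evaluation metric has stopped improving, so that improvement methods can be designed before more iterations are wasted.

```python
# Flag a training run whose evaluation metric (higher is better) has stopped
# improving for the last `patience` iterations.
def plateaued(metric_history, patience=3, min_delta=1e-3):
    """True if the metric has not improved by min_delta in `patience` steps."""
    if len(metric_history) <= patience:
        return False
    best_before = max(metric_history[:-patience])
    recent_best = max(metric_history[-patience:])
    return recent_best - best_before < min_delta
```

A monitoring job would call this after each evaluation round and alert the team, or trigger the self-learning loop, when it returns True.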

Lastly, knowledge can be applied to the model in two different ways. The first is to use knowledge as labeled training data; in this case, the knowledge trains the model to output the desired results. The second is to use human logic as knowledge to create an ensemble of multiple models and ultimately build an end-to-end system. Below is an example of an end-to-end deep learning model that removes duplicate products from a list of products. The modules can be trained separately and later assembled.

Figure 3. An example of an end-to-end AI model that removes duplicate products
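The assemble-then-deduplicate idea can be sketched as a pipeline of stand-in modules. In a real system each stage would be a separately trained model combined by human logic; the trivial functions below are only placeholders for those modules.

```python
# End-to-end sketch: separately built modules (a representation stage and a
# matching stage) assembled into a deduplication pipeline.
def dedupe(products, represent, same):
    """Keep the first product of each duplicate group."""
    kept = []
    for p in products:
        if not any(same(represent(p), represent(q)) for q in kept):
            kept.append(p)
    return kept

# Stand-in modules: normalize the title, then compare exact representations.
represent = lambda p: p.lower().replace("-", " ").split()
same = lambda a, b: a == b

print(dedupe(["Red Shoe", "red-shoe", "Blue Shoe"], represent, same))
# ['Red Shoe', 'Blue Shoe']
```

Because the stages are decoupled, each module can be retrained or replaced independently without rebuilding the whole pipeline.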

Conclusion

In this post, we discussed a high-level guide to building a comprehensive AI/ML system based on our own experiences. The most important takeaway should be that both the knowledge-based and model-based approaches must be used to secure the best model.
