Building AI Models at Scale for an OTA

Jonathan Liono
Published in tiket.com · Jun 13, 2022 · 8 min read

Have you ever wondered how intelligent models can cope with the ever-changing and massive influx of transactional data to provide accurate predictions and recommendations in AI systems? For most Data Scientists and Machine Learning Engineers, the key is to build scalable AI models that are ready for production.

As it grows from the ground up, every organisation requires a certain amount of automation to keep everyday business operations efficient. Eventually, Machine Learning/AI comes into the picture to drive the productivity of its members so that the organisation can scale its business operations and pursue its mission of global service expansion. tiket.com faces this challenge as well, and AI provides terrific data-driven opportunities to reduce laborious and stressful tasks. By delivering more accurate predictions and smarter recommendations in daily tasks, these AI systems prove themselves to be great tools that help busy professionals become more efficient and effective in driving business impact across the organisation.

In this article, we will walk through the modelling aspects a Data Scientist should worry about when building intelligent models that can scale in production, from the perspective of an Online Travel Agency (OTA).

Inspired by several existing articles (e.g. [1], [2] and [3]), we point out several important highlights based on our experience at an OTA such as tiket.com.

Many aspects go into building a robust model for production at scale. These include the activities performed intensively by both Data Scientists (DS) and Machine Learning Engineers (MLE) from end to end. At the Proof of Concept (PoC) stage, DS experiments are often performed on a selected data time frame for Exploratory Data Analysis (EDA) [4]. The activities then carry on towards data preparation, model training, testing, and evaluation, and a model is ultimately produced as the final output of the model selection process in a DS pipeline. However, it does not stop here. The model then needs to be made readily available as a service to support human decision-making or provide recommendations for platform users. At tiket.com, many of our AI services provide smart feedback to our users (both internal, for staff productivity, and external, for our valuable customers).

Scalable Data Preparation

As most data scientists have experienced, data preparation can take up to 80% of the time needed to deliver an impactful AI model for production.

Data scientists, according to interviews and expert estimates, spend 50 percent to 80 percent of their time mired in the mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets. — Steve Lohr, 2014 [5]

Data Quality and Integrity

Besides the data collection, labelling and cleaning that we need to focus on, data integrity and quality also require critical attention for model training, testing and evaluation. With this in mind, we need to ask the following questions for more scalable data preparation:

  • Is the data really representative enough?
  • Do we have a sufficient number of data rows for model training?

While we focus on the data preparation activity, there is a common pitfall we can run into: we assume that the correct number of data points (especially for a DS problem that requires data expansion) has made it into the training set, yet this assumption may silently fail to hold.

Let us consider the hotel search scenario at tiket.com, where we need to provide an optimal hotel display for our customers according to their intrinsic preferences, based on historical hotel searches and purchases.

Use case of data preparation: for all the features we have prepared for training, we need to insert the data (features and labels) into a BigQuery cloud database. The next workflow then ingests all the data in the DB for model training. Every user query in hotel search for a particular destination requires ideal hotel candidates and labels to be stored in the DB. Hence, our data insert operation (after all data cleaning and preprocessing) expands according to the number of hotels available, including their ideal ranks and labels. Imagine the objective function is set for the top 30 hotels, where the paid/clicked hotel for each user search should be placed at the top of the ordered list. The space required to store all this training data grows tremendously: the space complexity can be denoted as O(n·l·m), where n is the number of users' keyword searches, l is the location/destination for a given user search (n), and m is the number of candidate hotels available in location l. Here, the (sessiondate, user_id) pair uniquely identifies every user search.
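To make this growth concrete (with purely illustrative numbers): if we log one million keyword searches (n) and each search surfaces the top 30 candidate hotels (m) for its destination, the expanded table already holds roughly 30 million rows before a single feature column is even counted.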

Due to limited resources and network capacity, your machine/cloud instance may not be able to cope with inserting all records at once, causing the data load/insert to fail. In the worst-case scenario, you may have inserted the records only partially without realising it at all. A more scalable solution is to split this big task into sizeable chunks of data to load/insert into the DB.

The following function performs a progressive insert of user search keywords, including their destinations and candidate hotels, loading the data in chunks split by session date.

import pandas as pd
from google.cloud import bigquery


def load_chunk(df_chunk, project_name, dataset_name, table_name, first_time):
    # Load client
    client = bigquery.Client(project=project_name)

    job_config = bigquery.job.LoadJobConfig()
    if first_time:
        # The first chunk recreates the table from scratch
        job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    else:
        # Subsequent chunks are appended
        job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

    # Define table name, in format project.dataset.table_name
    table = f'{project_name}.{dataset_name}.{table_name}'

    # Load data to BQ (asynchronous: this returns a job handle immediately)
    job = client.load_table_from_dataframe(df_chunk, table, job_config=job_config)
    # job.result()  # deliberately commented out to demonstrate the issue below


def load_all_data(df_all_data):
    unique_sessiondates = df_all_data['sessiondate'].unique().tolist()

    print(f'Unique Session Dates: {len(unique_sessiondates)}')

    first_time = True
    for sessiondate in unique_sessiondates:
        df_chunk = df_all_data[df_all_data['sessiondate'] == sessiondate]

        load_chunk(df_chunk, 'tiket-experiment', 'sandbox_datascientist',
                   'dsexp_withoutfix', first_time)

        first_time = False

Notice that job.result() is commented out in the function above. Due to the asynchronous nature of BigQuery client load jobs, iterating quickly through every DataFrame chunk can leave the insert only partially completed by the time your script finishes.

-- Without the fix: records are only partially inserted.
-- Note: 1,018,397 out of 1,188,131 inserted (checked again after a day).
SELECT COUNT(1)
FROM `tiket-experiment.sandbox_datascientist.dsexp_withoutfix`;

-- With the fix: all records are fully inserted,
-- thanks to the wait operation via job.result().
-- Note: 1,188,131 out of 1,188,131 inserted.
SELECT COUNT(1)
FROM `tiket-experiment.sandbox_datascientist.dsexp_withfix`;

When job.result() is applied, the BigQuery client waits for each iteration's insert to finish before continuing to the next. This small experiment was done over a small number of hotel searches in tiket.com; with a partial insert already occurring in this small sample, the loss of data points will only grow at the full scale of data coverage. While we put effort into scaling the data loading using chunked uploads, such a data integrity issue would have a tremendous impact on model training. In other words, the model we build may not be representative enough, or worse, it may perform well during experimentation yet poorly in production.
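For completeness, here is a minimal sketch of the fixed loader behind the dsexp_withfix table (the function name is ours for illustration); the only change is waiting on each load job before moving on:

from google.cloud import bigquery

def load_chunk_with_fix(df_chunk, project_name, dataset_name, table_name, first_time):
    client = bigquery.Client(project=project_name)
    job_config = bigquery.job.LoadJobConfig()
    job_config.write_disposition = (
        bigquery.WriteDisposition.WRITE_TRUNCATE if first_time
        else bigquery.WriteDisposition.WRITE_APPEND
    )
    table = f'{project_name}.{dataset_name}.{table_name}'
    job = client.load_table_from_dataframe(df_chunk, table, job_config=job_config)
    # Block until this chunk is fully inserted before returning to the caller.
    job.result()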

Representative Data for Model Training

While the number of historical records accumulates over time, the number of features produced in your data preparation step may also grow significantly. During the PoC stage, the time frame may be restricted to a short period; for example, we prepare October's data for model training and evaluate the model on November's data. This scenario may work for a quick analysis that shows opportunities to support business operations.

However, proper model building and analysis should be done over a longer period of time, especially if the model is going to be used in production. For example, in the hotel search scenario at tiket.com, we use all historical data (user searches, clicks and paid hotels) within six months to one year. For this type of activity, the amount of data produced grows dramatically due to the construction of features and top-k candidate hotels for every keyword search at a particular location.

Let us assume that three months of data for training/validation/testing retrieves 18 million rows (with 600 columns/features) from our DB. This data retrieval process will fail if we do not have sufficient memory capacity to hold the data before we even attempt training, validation or testing. The main questions then become:

“How do we scale the retrieval of data so that we can proceed with the model training? Can we do random sampling? Would this be representative enough for model training?”

While random sampling could be one possible solution to this problem, sampling according to the data distribution is preferable for proper model training. Nevertheless, sampling techniques that preserve representativeness remain a complex, ongoing problem in computer science research.
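As an illustration of distribution-aware sampling (a hypothetical sketch, not our production code), stratifying by destination keeps each location's share of searches roughly intact while shrinking the data to a size that fits in memory:

import pandas as pd

def stratified_sample(df: pd.DataFrame, strata_col: str, frac: float) -> pd.DataFrame:
    # Sample the same fraction from every stratum so the distribution
    # over strata_col is roughly preserved in the smaller dataset.
    return (
        df.groupby(strata_col, group_keys=False)
          .apply(lambda g: g.sample(frac=frac, random_state=42))
    )

# e.g. keep 10% of rows per destination rather than 10% of rows globally
# df_train = stratified_sample(df_all_data, 'destination', frac=0.1)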

Scalable Modelling

In order to scale the modelling operations, there are two dimensions that we need to take into consideration:

  • Scaling our modelling process due to the addition or removal of data sources or features. This is often associated with the continuation of your data science projects in the organisation, as business objectives and requirements change over time.
  • Scaling our modelling process for faster convergence and for handling large amounts of training data. A large amount of data comes with a bigger responsibility to train the model in a timely manner. One option data scientists typically consider is parallelisation in model training: some algorithms are designed to be easily distributed across multiple nodes for scalable modelling, and parallelisation can even be applied to cross-validation, hyperparameter tuning and model selection (see the sketch below).
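As a minimal sketch of such parallelisation (assuming a scikit-learn style workflow rather than our actual pipeline), a cross-validated hyperparameter search can fan out across all available CPU cores via n_jobs:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical model and search space, purely for illustration.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [3, 5],
}
search = GridSearchCV(
    GradientBoostingClassifier(),
    param_grid,
    cv=5,        # 5-fold cross-validation per candidate
    n_jobs=-1,   # evaluate candidate/fold combinations in parallel
)
# search.fit(X_train, y_train)  # X_train/y_train assumed to exist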

Scalable Production and Model Maintenance

Producing the best model from a comprehensive evaluation process is not the final chapter for data scientists. In fact, machine learning engineers work closely with data scientists to bring out the full potential of the AI models so that they can be used in production at scale. This is the most crucial stage for most data scientists and machine learning engineers: designing an architecture that can scale for real-world production scenarios. Hence, there are several factors (described as questions below) that we need to consider when bringing AI models to life as a service:

  • Does the speed of prediction matter? Is the model contributing to high latency as a part of the global tech workflow?
  • Do we need to precompute the features, prediction results or recommendations?
  • How does the AI model scale across many domains? A typical example in an OTA: would the pipeline, architecture and model be suitable to cover multiple countries, given that we only trained the model for Indonesia?
  • How do we handle potential model performance degradation? Can we easily retrain the model on the most recent historical data? How frequently should we retrain the AI model?
  • How do we handle concept drift? In the travel industry, data change is prevalent, especially as human behaviour shifts with the transition from the pandemic to the endemic phase of COVID-19. One way to identify such concept drift is an integrated tool for data and model monitoring in production (a minimal sketch follows this list).
  • Should we run different experiments concurrently in production? This situation is typically handled by configuring an A/B testing experiment or a shadow model deployment.
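As a toy illustration of drift monitoring (a hypothetical one-feature sketch, not our production tooling), a two-sample Kolmogorov-Smirnov test from scipy can flag when a live feature distribution has moved away from the training-time reference:

from scipy.stats import ks_2samp

def detect_drift(reference, live, alpha=0.01):
    # Two-sample Kolmogorov-Smirnov test on a single numeric feature:
    # a small p-value suggests the live distribution has drifted away
    # from the distribution the model was trained on.
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# e.g. compare last week's observed prices against the training data;
# detect_drift() returning True could trigger a (hypothetical) retraining job.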

Nevertheless, many other aspects may contribute to the design of scalable AI, especially for an OTA. This article aims to make clear that building a model that works only within the scope of experimentation is never enough; building AI models at scale is what brings valuable impact to organisations. Scaling AI models is, in fact, a non-trivial task for Data Scientists and Machine Learning Engineers.

REFERENCES:

  1. https://towardsdatascience.com/interviewers-favorite-question-how-would-you-scale-your-ml-model-56e4fa40071b
  2. https://towardsdatascience.com/how-leading-companies-scale-ai-4626189faed2
  3. https://www.techopedia.com/why-is-scalable-machine-learning-important/7/32984
  4. https://www.geeksforgeeks.org/what-is-exploratory-data-analysis/
  5. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
