Smart Scheduling: Looker Whitelist Decision System

Denizhan Yılmaz
Published in Trendyol Tech · 6 min read · Jul 22, 2024

In this article I will describe how we built a decision support system for a manual process we run: responding to business teams’ Looker schedule whitelist requests. First, I want to explain what this whitelist process is and how much effort we spend on it every week.

Overview

We work with business teams and individuals to schedule Looker reports for operational analyses and reporting. These reports, exported as CSV, Excel, or image files, are sent via email, Google Sheets, etc., on a recurring basis (daily, weekly, hourly).

Schedules may be set for specific periods, such as during events or for daily order checks over several months. Once these periods end, the reports can become redundant and burden the server. Additionally, most reports are requested around 9 am, causing server overload due to limited slot capacity, so we direct report owners to schedule their reports at slightly different times, such as 9:05 am, to prevent overload. To address this, we, the DWH analysts, implemented a whitelist procedure: if a schedule is not reviewed by an analyst and whitelisted, it is automatically deleted after one month.

To maintain database performance and prevent unnecessary reports from bogging down Trendyol’s system, on-call analysts review all schedule requests weekly, checking them against specific rules based on the Looker Explore used, the filters applied in the report, the cron schedule time, and so on. Approximately 200 to 250 scheduling requests are received each week, and analyzing each report takes about 1 to 2 minutes, so this review is a significant, time-consuming responsibility for the on-call analyst.
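To give an idea of what one such check looks like, here is a hypothetical sketch of a single rule; the field names, cron format, and threshold are illustrative only, not our actual rule set:

def passes_basic_rules(schedule: dict) -> bool:
    """Illustrative rule: avoid the 9:00 am peak and require a Business Unit filter."""
    # Hypothetical cron string, e.g. "0 9 * * *" (minute, hour, ...)
    minute, hour, *_ = schedule["crontab"].split()
    runs_at_peak = hour == "9" and minute == "0"
    # Hypothetical filter key; the real field name may differ
    has_bu_filter = "business_unit.name" in schedule.get("filters", {})
    return (not runs_at_peak) and has_bu_filter

# A daily 9:00 am schedule without a Business Unit filter would be flagged
print(passes_basic_rules({"crontab": "0 9 * * *", "filters": {}}))  # False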

For this project, our goal was to automate the process using a machine learning application, thereby reducing the time and effort involved. Initially, we considered creating a checklist for business owners, allowing them to perform the checks we usually conduct before the whitelist process. However, this approach could lead to an unpredictable number of schedule additions without a supervised check. Therefore, we opted to develop an application that assists in determining whether a report should be added to the whitelist. This solution is more reliable and sustainable for an operation of this scale.

Whitelist Scheduling Process

Our main aim has been to maximize customer and business satisfaction while minimizing database slot usage, with as little human interaction as possible.

Project Steps

  1. Collect Data
  2. Parse and Clean the Data
  3. Implement and Analyze Machine Learning Algorithms
  4. Deploy
  5. Gather Feedback and Improve Results

Collect Data

We collect whitelist requests in a Google Drive sheet with the schedule name, ID, and creation date. Using the IDs, we can retrieve details from Looker sources such as the schedule run date and time in cron format, the Explore used in the Look, and the fields and filters applied to the Look. To collect this data in our application, we use the Looker SDK, taking the scheduleId as input and returning the Look details as output:

import looker_sdk
from looker_sdk import models40

sdk = looker_sdk.init40("looker.ini")

# Example schedule ID taken from a whitelist request in the sheet
scheduleId = 12345

### GET THE QUERY FROM THE LOOK
query = sdk.look(look_id="60971").query

### CREATE A NEW QUERY AND ADD A SCHEDULE FILTER
new_query = sdk.create_query(
    body=models40.WriteQuery(
        model="system__activity",
        view=query.view,
        fields=query.fields,
        filters={"scheduled_plan.id": str(scheduleId)},
    )
)

### UPDATE THE LOOK WITH THE NEW QUERY ID
sdk.update_look(
    look_id="60971",
    body=models40.WriteLookWithQuery(query_id=new_query.id),
)

### RUN THE LOOK
result = sdk.run_look(look_id="60971", result_format="json")

Thanks to the Looker SDK, the collected data is now a JSON file that includes all the fields analysts use to decide whether a schedule should be approved.

Parse and Clean the Data

The collected data includes the report filters in JSON format. We need to verify the necessary filters, such as whether the Business Unit Name is filtered. The data source used for the report is also crucial, as scheduling hourly reports from a daily source is inefficient. Additionally, most reports have date filters like “Order Date: 2023/01/01 to 2024/01/01” or “last 4 weeks.” A function must accurately parse these periods and use them as a criterion for approving or rejecting the request, to prevent server lags and timeout errors.
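As an illustration, a small parser for such absolute and relative date filter values might look like the sketch below; the filter strings and the conversion to days are simplified assumptions, not our production parser:

import re
from datetime import date

def filter_period_in_days(filter_value):
    """Convert a date filter value into an approximate period in days."""
    # Absolute range, e.g. "2023/01/01 to 2024/01/01"
    absolute = re.match(r"(\d{4}/\d{2}/\d{2}) to (\d{4}/\d{2}/\d{2})", filter_value)
    if absolute:
        start = date(*map(int, absolute.group(1).split("/")))
        end = date(*map(int, absolute.group(2).split("/")))
        return (end - start).days

    # Relative range, e.g. "last 4 weeks" or "last 30 days"
    relative = re.match(r"last (\d+) (day|week|month)s?", filter_value)
    if relative:
        amount, unit = int(relative.group(1)), relative.group(2)
        return amount * {"day": 1, "week": 7, "month": 30}[unit]

    return None  # unparsed filters fall back to manual review

print(filter_period_in_days("2023/01/01 to 2024/01/01"))  # 365
print(filter_period_in_days("last 4 weeks"))              # 28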

After parsing and collecting the applied filters, we filled or dropped rows with null values to make the data suitable as input for the machine learning algorithm.
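As a minimal sketch of this step with pandas (the fill choices here are illustrative; the feature names come from the model described below):

import pandas as pd

df = pd.read_csv('./trains/train_df.csv', index_col=0)

# Fill missing numeric features with a neutral default and drop rows
# that still lack the categorical features the model needs
df['Time Period in Days'] = df['Time Period in Days'].fillna(0)
df = df.dropna(subset=['cron_schedule'])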

Implement and Analyze Machine Learning Algorithms

The requests are either approved or not, so the algorithm needs to return a binary result. The fields used in the model are either categorical or numerical variables. For example, Time Period in Days is a numerical variable, while cron_schedule is a categorical variable.

We have around 7,000 requests in our dataset, with a 10% rejection rate, making it an imbalanced dataset. Initially, we applied data balancing techniques such as undersampling and oversampling with Random Forest Classification and Gradient Boosting algorithms to create our model. The best results came from SMOTE combined with the Gradient Boosting algorithm. SMOTE stands for Synthetic Minority Oversampling Technique; it addresses imbalanced datasets by creating synthetic samples of the minority class in the training data.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import pandas as pd
import category_encoders as ce
from sklearn.model_selection import train_test_split



# Load the parsed training data
df = pd.read_csv('./trains/train_df.csv', index_col=0)

# Ordinal encoding for categorical variables
encoder = ce.OrdinalEncoder(cols=[…])



# Split the data into training and test sets
X = df.drop('Is Added to Whitelist?', axis=1)
y = df['Is Added to Whitelist?']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67, random_state=1)

# Apply ordinal encoding to the training and test sets
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

# Initialize and train the Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=300, random_state=100)
gbc.fit(X_train, y_train)

# Predict the test set results
y_pred = gbc.predict(X_test)

# Evaluate the model
print(f'Model accuracy score: {accuracy_score(y_test, y_pred):.4f}')
print('Feature importances:')
print(pd.Series(gbc.feature_importances_, index=X_train.columns).sort_values(ascending=False))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))
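The snippet above shows only the baseline classifier; the SMOTE step mentioned earlier is not included. A minimal sketch of how that oversampling could be applied, continuing from the variables defined above and assuming the imbalanced-learn library (the exact setup in our pipeline may differ), is:

from imblearn.over_sampling import SMOTE

# Oversample the minority ("No") class on the encoded training set only,
# so the test set keeps the real class balance
smote = SMOTE(random_state=100)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

gbc = GradientBoostingClassifier(n_estimators=300, random_state=100)
gbc.fit(X_train_res, y_train_res)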

Oversampling worked well on our dataset, with the initial results below for the Random Forest and XGBoost algorithms.

Initial Results for Random Forest and XGBoost Algorithms

These were the initial results. After trying different variations of inputs and sampling methods, we now use the Gradient Boosting algorithm to support our decisions as a first pass, because it flagged “No” decisions more accurately, meaning its false positive rate was the most promising. With continuous improvement and feedback, we aim to fully automate this process with minimum human interaction. The model creation code is shown above.

Deploy

In the first phase, we adapted the Google Sheet where we collect requests: we added the results from our model, compared the algorithm’s results with our own decisions, and analyzed edge cases and common model mistakes to enhance the model’s performance.
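As a sketch of that integration (the sheet name, credentials file, and predictions structure are assumptions for illustration, not our actual setup), model results could be written back to the sheet with the gspread library:

import gspread

# Authenticate with a service account and open the request-tracking sheet
gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("Looker Whitelist Requests").sheet1

# Hypothetical list of (schedule_id, model_decision, probability) tuples
predictions = [(12345, "Yes", 0.93), (67890, "No", 0.21)]

# Append one row per scored request so analysts can compare the model's
# output with their own decisions
for schedule_id, decision, proba in predictions:
    worksheet.append_row([schedule_id, decision, f"{proba:.2f}"])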

Conclusion

We are in the Gather Feedback and Improve Results step of the project. This project needs a continuous learning mechanism as standards, states, and decision criteria can change over time. Implementing a decision support system for Looker Schedules Whitelist Requests has streamlined our process. Using the Looker SDK for data collection, filter parsing, and machine learning, we’ve reduced manual effort and improved accuracy.

Gradient Boosting outperformed Random Forest, achieving high accuracy. During deployment, we refined our model and integrated it with Google Sheets for real-time feedback. This system maximizes efficiency and satisfaction while minimizing database load, aiming for full automation with minimal human intervention.

By automating the decision-making process, we’ve significantly reduced the need for manual reviews and checks at the first step, where we use it as a decision support system. Previously, analysts spent considerable time manually verifying and processing requests, but soon the machine learning model will handle these tasks efficiently, ultimately eliminating the need for human intervention. This transition will lead to more consistent and objective decision-making, allowing our team to focus on more strategic activities. As a result, the on-call analyst will spend 3 hours less on these tasks.

About Us

Ready to take your career to the next level? Join our dynamic team and make a difference at Trendyol.
