Insights From Telkomsel’s Data Engineering Team — Framework, Key Principles and Tools

Eri · Published in Life at Telkomsel · Aug 3, 2021

For the past year, I have been part of the new Data Engineering team in Telkomsel's Data Management Platform (DMP) project, and I have really enjoyed the journey. We are a small team, and each of us comes from a different industry vertical. Our goal is to build a data engineering pipeline to support Telkomsel's new business.

We recently completed phase one of development, and the pipeline is now running in production. In this post, I will share this journey and how every member of our team strove to build a solid data engineering pipeline; for me personally, it has been a unique experience. I will start by explaining the agile framework we use, then go deeper into the key guiding principles behind our effective, scalable data pipeline. Finally, I will share the tools and technologies that we use in Telkomsel Data Engineering.

Agile Framework

Agility is about being flexible and adapting your plan based on feedback from incremental deliverables. Agile is a way of working that started gaining popularity in the early 2000s and is now widely used to manage software development projects. It suits fast-paced development cycles and accommodates specification changes during the design and build process. It is flexible and strives for iterative, incremental improvement of the product through team collaboration. In short, Agile is plan, build, test, learn, repeat. At Telkomsel Data Engineering, we use this framework.

What aspects of the agile framework work well with data engineering?

1. Planning and prioritization

Having regular planning and prioritization meetings gives business users a better understanding of the time associated with each data engineering effort.

2. Data Engineering Ideation

Every time the team is developing something, they might find a new way to tweak the system or generate new ideas. We gather these in our backlog and schedule them for the next sprint.

3. Retrospectives

At each retrospective, the team reflects on the past month's activities. There are many ways this can be done, but here is an approach I have found to work: everyone fills up the whiteboard with points under three headings: start doing, stop doing, and keep doing.

4. Active PR Review

All team members act as reviewers, not only the senior ones. That is why every PR must contain a standard description, so any member has enough information to review it.

5. Monthly sharing session

To encourage everyone to speak up, we conduct monthly sharing sessions in which team members share their experience in solving technical problems.

6. Cluster booking system

We have three main queues: development, staging, and production. Sometimes a development or ad hoc job needs more resources than the development queue can accommodate, so we have a booking system for the staging and production queues to run those big jobs.

Routines that bring improvements

- Active decision log

Every development initiative logs its approach. That way, other team members can give input to make the process more efficient.

- Book Day Club

We use Percipio as our learning platform. We pick a topic, then pick a course, split the chapters among ourselves, and each person explains the content of their chapter.

- Telco knowledge sharing session

Not all data engineering team members, especially the new hires, come from the telecommunications industry, so it helps to have sessions where senior members share telco domain knowledge. Knowing the telco domain helps direct data exploration and greatly speeds up (and enhances) feature engineering. Once features are generated, knowing which relationships between variables are plausible helps with basic sanity checks.

- Better retrospective

The mad-sad-glad retrospective is a great way to boost morale, uncover barriers to productivity, and get our team working better together.

Data Pipeline: Key Concept, Architecture, and Tools

Telkomsel operates one of the largest on-premises Big Data installations in Southeast Asia. Our data engineering pipeline is built on top of this Big Data platform to process hundreds of terabytes of data each day. To handle this volume, we designed the pipeline with four layers:

Key principle: the four-layer data pipeline

1. Aggregation Layer

This layer reads the ingested data sets from disparate sources in the data lake via data connectors. Aggregation processes are also performed here, such as data typing, basic cleansing, and standardization.
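
To make this more concrete, here is a minimal PySpark sketch of that kind of typing, cleansing, and standardization. The table path and column names (msisdn, trx_date, recharge_amount) are hypothetical, and our actual connectors may be implemented with other tooling such as Ab Initio.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw recharge feed landed in the data lake with string-typed fields.
raw = spark.read.parquet("/datalake/raw/recharge")

clean = (
    raw
    # Data typing: cast raw string fields to proper types.
    .withColumn("trx_date", F.to_date("trx_date", "yyyy-MM-dd"))
    .withColumn("recharge_amount", F.col("recharge_amount").cast("long"))
    # Basic cleansing: drop exact duplicates and rows without a subscriber id.
    .dropDuplicates()
    .filter(F.col("msisdn").isNotNull())
    # Standardization: normalize MSISDN formatting, e.g. strip a leading "+".
    .withColumn("msisdn", F.regexp_replace(F.trim(F.col("msisdn")), r"^\+", ""))
)
```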

Sub-Selecting Data:

Our data lakes store a lot of data, including years of history. Since the predictive models we build are mostly concerned with only the latest few months of data, we sub-select it and keep only the required columns. Removing unnecessary data from the pipeline improves its performance.
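
A minimal sketch of this sub-selection, again in PySpark with hypothetical column names and an assumed three-month cutoff:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical daily transaction table produced by the aggregation layer's connectors.
daily = spark.read.parquet("/datalake/recharge/daily")

recent = (
    daily
    # Keep only the columns the downstream models actually need.
    .select("msisdn", "trx_date", "recharge_amount")
    # Keep only the latest few months of history (three months assumed here).
    .filter(F.col("trx_date") >= F.add_months(F.current_date(), -3))
)
```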

Weekly Aggregation Data:

Since the data science models we build normally work on rolling windows of weeks (1 week, 2 weeks, etc.), we do not actually need to store the data at the daily transaction level. Aggregating the data to weekly level reduces its size considerably, which improves pipeline performance.
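
As an illustration, the weekly aggregation could look like the following sketch, assuming the weekstart is the Monday of each week (consistent with the weekstart dates used later in this post):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

daily = spark.read.parquet("/datalake/recharge/daily")  # hypothetical path

weekly = (
    daily
    # date_trunc("week", ...) maps every date to the Monday of its week,
    # which becomes the "weekstart" key used throughout the pipeline.
    .withColumn("weekstart", F.date_trunc("week", F.col("trx_date")).cast("date"))
    .groupBy("msisdn", "weekstart")
    .agg(
        F.sum("recharge_amount").alias("recharge_amount"),
        F.count("*").alias("recharge_trx"),
    )
)
```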

2. Primary Layer

The Primary Layer performs several tasks: filtering partner data, scaffolding, and time series filling. Its purpose is to bring all the data into a common, clean format.

Filter Partner Data:

This step selects a specific set of MSISDNs (mobile phone numbers) from the weekly aggregated data. Weekly aggregation is done on the complete database, so data for all Telkomsel customers is present in it. For training, however, only a specific set of MSISDNs is used, so we do not need to process the whole customer base in the subsequent steps. This improves training pipeline performance significantly.
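
A sketch of this filtering, assuming a hypothetical partner MSISDN list; a left-semi join keeps only the matching subscribers without adding extra columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

weekly = spark.read.parquet("/dmp/aggregation/recharge_weekly")   # hypothetical
partner_msisdns = spark.read.parquet("/dmp/partner/msisdn_list")  # hypothetical, single msisdn column

# A left-semi join keeps only the weekly rows whose msisdn appears in the
# partner list, without bringing any columns across from that list.
training_base = weekly.join(partner_msisdns, on="msisdn", how="left_semi")
```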

Scaffolding:

Scaffolding is the combination of each MSISDN with all the weeks of that MSISDN's lifetime. Suppose an MSISDN “X” is activated on June 26, 2019, and deactivated on Sep 18, 2019: the lifetime of the MSISDN is June 26, 2019, to Sep 18, 2019, and all the weeks between June 26 and Sep 18 will be in the scaffold.

A scaffold will look like this:

Scaffold for MSISDN “X” (one row per weekstart in its lifetime)

As we only select data from 2019 onwards in the pipeline, the scaffold will also only contain weeks from Jan 2019 onwards.
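
A rough PySpark sketch of building such a scaffold from a hypothetical lifetime table, using the MSISDN “X” example above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical lifetime table: one row per MSISDN with activation/deactivation dates.
lifetimes = spark.createDataFrame(
    [("X", "2019-06-26", "2019-09-18")],
    ["msisdn", "activation_date", "deactivation_date"],
).select(
    "msisdn",
    F.col("activation_date").cast("date").alias("activation_date"),
    F.col("deactivation_date").cast("date").alias("deactivation_date"),
)

scaffold = (
    lifetimes
    # One weekstart (Monday) per week of the MSISDN's lifetime.
    .select(
        "msisdn",
        F.explode(
            F.sequence(
                F.date_trunc("week", F.col("activation_date")).cast("date"),
                F.date_trunc("week", F.col("deactivation_date")).cast("date"),
                F.expr("interval 7 days"),
            )
        ).alias("weekstart"),
    )
    # The pipeline only uses data from 2019 onwards.
    .filter(F.col("weekstart") >= F.lit("2019-01-01").cast("date"))
)
```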

What’s the use of Scaffolding?

Scaffolding is used to fill the time series in the data.

Time Series Filling:

Time series filling is the process of filling in the missing weeks in the weekly aggregated data. Because the TSEL data is daily transaction-level data, an MSISDN can have weeks with no data at all. For example, if an MSISDN “X” recharges on July 02, 2019, and the next recharge happens on Aug 14, 2019, the weekly aggregated recharge data will not have any record for the weekstarts from July 15 to Aug 12. Adding these missing weeks to the data is time series filling.

Example: the raw recharge transactions for MSISDN “X”, the weekly aggregated recharge data, and the recharge data after time series filling.
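
A minimal sketch of the filling step itself: left-join the weekly aggregates onto the scaffold and treat missing weeks as zero activity (paths and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

scaffold = spark.read.parquet("/dmp/primary/scaffold")            # hypothetical
weekly = spark.read.parquet("/dmp/aggregation/recharge_weekly")   # hypothetical

# Left-joining the weekly aggregates onto the scaffold keeps every
# (msisdn, weekstart) pair of the subscriber's lifetime; weeks without any
# transactions come out as nulls and are treated as zero activity.
filled = (
    scaffold
    .join(weekly, on=["msisdn", "weekstart"], how="left")
    .fillna(0, subset=["recharge_amount", "recharge_trx"])
)
```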

Why is time series filling required?

Several of our metrics are rolling windows of 2 weeks, 3 weeks, and so on. If we do not fill the missing time series, we will end up losing data for these metrics.

In the example above, for weekstart 2019-07-15:

- The 2-week rolling recharge feature will have a value of 170,000.

- The 1-week recharge feature will have a value of 0.

If we do not fill the time series, we lose the July 15 weekstart entirely, because the base table has no record for MSISDN “X” in that weekstart.
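
The sketch below illustrates this reasoning, assuming the 2-week feature means the current plus the previous weekstart: once the series is filled, consecutive rows per MSISDN are consecutive weeks, so the rolling sum can safely look one row back; without filling, the same window definition would silently skip over the missing weeks.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

filled = spark.read.parquet("/dmp/primary/recharge_filled")  # hypothetical

# After time series filling, consecutive rows per MSISDN are consecutive weeks,
# so "current row plus the previous row" really is a 2-week window. Without
# filling, the same window definition could silently span a multi-week gap.
w2 = Window.partitionBy("msisdn").orderBy("weekstart").rowsBetween(-1, 0)

features = (
    filled
    .withColumn("recharge_1w", F.col("recharge_amount"))
    .withColumn("recharge_2w", F.sum("recharge_amount").over(w2))
)
```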

3. Feature Layer

Here, we engineer the features that represent the underlying business problem, which assists in verifying the initial hypotheses. In this layer, we consider reusability, discoverability, backfilling, and precomputation of features.
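
As a loose illustration of reusability and precomputation, a single parameterized helper could precompute rolling sums for several window lengths at once (the helper name, paths, and columns are hypothetical):

```python
from pyspark.sql import SparkSession, DataFrame, functions as F, Window

spark = SparkSession.builder.getOrCreate()

def add_rolling_sums(df: DataFrame, value_col: str, weeks: list) -> DataFrame:
    """Precompute rolling-window sums of value_col for several window lengths,
    so different use cases can reuse the same precomputed feature table."""
    for n in weeks:
        w = Window.partitionBy("msisdn").orderBy("weekstart").rowsBetween(-(n - 1), 0)
        df = df.withColumn(f"{value_col}_{n}w", F.sum(value_col).over(w))
    return df

filled = spark.read.parquet("/dmp/primary/recharge_filled")  # hypothetical
recharge_features = add_rolling_sums(filled, "recharge_amount", weeks=[1, 2, 3, 4])
```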

4. Master Layer

The Master Layer creates a unified view of all the features of the use case at the granular level of the unit of analysis. Thereafter, we apply feature selection, encoding, and imputation to prepare the final model input layer, which contains the actual features to be used by the ML model.
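
A simplified sketch of assembling such a master table from hypothetical feature tables, followed by a very basic imputation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical feature tables, all keyed on (msisdn, weekstart).
recharge_feats = spark.read.parquet("/dmp/feature/recharge")
usage_feats = spark.read.parquet("/dmp/feature/usage")
handset_feats = spark.read.parquet("/dmp/feature/handset")

keys = ["msisdn", "weekstart"]

master = (
    recharge_feats
    # Outer joins give one wide row per unit of analysis, even when a
    # feature family has no data for that msisdn/week.
    .join(usage_feats, on=keys, how="outer")
    .join(handset_feats, on=keys, how="outer")
    # A very simple imputation: missing numeric features become 0.
    .fillna(0)
)
```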

Data Pipeline Tools

To build such a rich data infrastructure, we use a mix of programming languages, data management tools, data warehouses, and a whole set of other tools for data processing, data analytics, and AI/ML. The following are the tools we use at Telkomsel to build effective, efficient data infrastructure:

- JIRA, for project tracking

- Confluence, for documentation

- Cloudera, for the on-premises Hadoop cluster ecosystem

- Ab Initio, as a comprehensive ETL tool and orchestrator

- Apache NiFi, to automate the flow of data between systems

- Git, for version control

- Kubernetes, a container orchestration system for Docker containers

Conclusion

This agile methodology, data pipeline, and set of tools were born out of the Data Management Platform (DMP) project. I am now working on several other projects and might find better options to replace the existing ones. I am sure there are lots of other principles and tools out there, so please let me know of any approaches you have found effective in managing data pipelines.
