Top E-Commerce Data Science Skills

Published in Tokopedia Data · 8 min read · Feb 4, 2020

Top 5 skills for an E-Commerce Data Scientist: a guide for aspiring Data Scientists

E-commerce has revolutionized how we buy and sell around the world. Take Tokopedia, for example. Founded in 2009 with the mission to democratize commerce through technology, it has become the leading e-commerce company in Indonesia, a market of over 260 million people!

With more than 90 million monthly active users, Tokopedia has driven more than 1% of Indonesia’s economy.

A couple of months back, I was thrilled to have the opportunity to join Tokopedia’s Data Science team in Singapore. Since then, working closely with my colleagues in Singapore and Jakarta, I’ve discovered the skills that successful data scientists in this industry share, and I would like to list them here to help anyone looking to break into this field. Fret not if you lack any of these specific skills! A growth and learning mindset is highly encouraged at Tokopedia. As long as you identify your areas of improvement and work hard, the sky’s the limit! Without further ado, here are the top 5 skills of an e-commerce data scientist:

Skill 1: SQL and Python

No surprises here! In 2020 and for the foreseeable future, cutting-edge data science requires significant coding. However, to do any data science, you first need to access data. Chances are, the data you need will sit in a database that you connect to and query using SQL (or a variant of it).
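To make the extraction step concrete, here is a minimal sketch of querying a database with SQL from Python. It uses an in-memory SQLite database as a stand-in for a real warehouse, and the `orders` table, its columns, and its rows are made up purely for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a production database.
# The table name, columns, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 101, 50.0), (2, 101, 25.0), (3, 102, 10.0), (4, 103, 40.0)],
)

# Aggregate order count and spend per buyer, largest spenders first.
query = """
    SELECT buyer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spend
    FROM orders
    GROUP BY buyer_id
    ORDER BY total_spend DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for buyer_id, n_orders, total_spend in rows:
    print(buyer_id, n_orders, total_spend)
```

The same SELECT / GROUP BY / ORDER BY pattern carries over to most SQL dialects, although connection details and some syntax will differ from one database to another.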

Once you have the data extracted, Python comes into play. Python is the de facto data science language today. Python is so ubiquitous that in 2019, for the first time in history, it overtook Java as the second most popular language on GitHub [ref]. One of the significant reasons for Python’s popularity is the availability of useful packages that add substantial functionality on top of an already feature-packed programming language. Specifically for data science work, the Python libraries that you must be an expert in are Pandas, NumPy, and Matplotlib.

The Pandas package [ref] is used to load and manipulate tabular data, which is the basis of a significant portion of data science work like feature engineering.
The NumPy library [ref] provides efficient ways to interact with multi-dimensional arrays and is the basis for more advanced frameworks.
Finally, Matplotlib [ref] is used to plot visualizations and charts to depict your data in a graphical format.
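As a rough illustration of how Pandas and NumPy fit together in feature engineering, the sketch below turns a small order table into per-buyer features. The column names and values are invented for this example and do not come from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical order-level data; in practice this would come from a SQL query.
orders = pd.DataFrame(
    {
        "buyer_id": [101, 101, 102, 103],
        "amount": [50.0, 25.0, 10.0, 40.0],
    }
)

# Pandas: collapse to one row per buyer with simple aggregate features.
features = (
    orders.groupby("buyer_id")["amount"]
    .agg(n_orders="count", total_spend="sum", mean_spend="mean")
    .reset_index()
)

# NumPy: log-transform the spend column, a common way to tame skewed values.
features["log_total_spend"] = np.log1p(features["total_spend"])

print(features)
```

A table like `features` is typically what gets handed to a model, with one row per entity and one column per engineered feature.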

Once you are confident working with SQL, Pandas, NumPy, and Matplotlib, you are all set to jump into the exciting world of machine learning.

Skill 2: Machine Learning (ML)

Let’s face it: Machine Learning is the bread and butter of data scientists today. Once you complete feature engineering, the next step is to build, tune, and select an appropriate model or an ensemble of models to explain and predict your target variables. Sometimes, feature engineering and model building go hand in hand and need to be tuned together via hyperparameter tuning.

ML is a vast topic involving mathematics, statistics, and programming. As such, a solid understanding of the mechanics behind various ML techniques is necessary. A great resource is the book “The Elements of Statistical Learning” [ref], which has served me well throughout the years and continues to serve as a reference when I need to refresh some concepts.

Out-of-the-box models and hyperparameter optimization, such as those provided by the Python packages “scikit-learn” [ref], “XGBoost” [ref], and “hyperopt” [ref], are a good starting point for those new to this field. But as you start solving real-world problems, a solid understanding of the underlying mechanics goes a long way in helping you select, tune, and even construct your own custom modeling functions, evaluation metrics, and hyperparameter search optimizers.

Knowing how to construct an ML problem from a business problem so that you can select, adapt, and tune the appropriate ML techniques is critical for success. For example, Data Scientists have a key role in business risk management at Tokopedia. Fraud is one of the significant types of risk, and hence we have models to detect the various types of fraud. Generally, these are classification tasks, and hence the XGBoost model is our basis. However, each task is different in terms of its features and complexity, so a framework like scikit-learn’s “Pipeline” [ref] that ties together feature transforms and XGBoost modeling to perform hyperparameter searching saves us a lot of time. We can then focus on investigating new types of fraud.
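To give a feel for the Pipeline idea, here is a minimal sketch that chains a feature transform with a gradient-boosted classifier and searches over hyperparameters in one go. It is not our production setup: the data is synthetic, the step names and parameter grid are arbitrary, and scikit-learn’s `GradientBoostingClassifier` is used as a stand-in so the example needs no extra packages. XGBoost’s `XGBClassifier` follows the same scikit-learn estimator interface, so it should be possible to swap it into the `"model"` step.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for labeled fraud examples.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Feature transform and model tied together as one estimator.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("model", GradientBoostingClassifier(random_state=0)),
    ]
)

# Parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "model__n_estimators": [50, 100],
    "model__max_depth": [2, 3],
}

# Cross-validated search over the whole pipeline, scored by ROC AUC.
search = GridSearchCV(pipeline, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)

best_params = search.best_params_
best_auc = search.best_score_
print(best_params, round(best_auc, 3))
```

Because the transform lives inside the pipeline, it is refit on each training fold during the search, which helps avoid leaking information from the validation folds.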

Skill 3: Computer Vision or Natural Language Processing or both

Strictly speaking, Deep Learning (DL) is a subtopic within ML. However, in the past few years, thanks to the advent of revolutionary techniques like ResNet, the Transformer, and Transfer Learning, the field has taken off, solving problems on which we traditionally did not expect machines to compete with humans.

Take Computer Vision (CV), for example. At Tokopedia, we are harnessing the advances in CV to fast-track product cataloging and visual product search. Our AI models generate deep tags for products and help merchants create the most detailed product catalogs in minutes. The deep tags turn product images into metadata-rich signals that are easy to search, discover, and sell. Also, buyers can upload pictures from their camera or photo gallery to perform an image search. We automatically tag incoming images and fast-track the buyer’s transaction by returning the most visually similar products within milliseconds.
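One common way to implement "most visually similar" is to embed each image as a vector and rank catalog items by cosine similarity to the query. The sketch below illustrates that general idea only; it is not a description of Tokopedia’s system. The embeddings are random numbers standing in for the output of a real image model, and `top_k_similar` is a hypothetical helper name.

```python
import numpy as np


def top_k_similar(query, catalog, k=3):
    """Return indices and scores of the k catalog rows most cosine-similar to query."""
    catalog_unit = catalog / np.linalg.norm(catalog, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = catalog_unit @ query_unit  # cosine similarity per catalog row
    top = np.argsort(-scores)[:k]       # highest similarity first
    return top, scores[top]


# Random vectors standing in for image embeddings of 1,000 products.
rng = np.random.default_rng(0)
catalog = rng.normal(size=(1000, 64))

# A query that is a slightly perturbed copy of product 42.
query = catalog[42] + 0.01 * rng.normal(size=64)

indices, scores = top_k_similar(query, catalog, k=3)
print(indices, scores)
```

A brute-force scan like this is fine for small catalogs; at larger scale, an approximate nearest-neighbor index is the usual choice to keep lookups fast.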

If CV is not your cup of tea, perhaps Natural Language Processing (NLP) is? Here at Tokopedia, as you can imagine, there are several text data sources such as product titles, product descriptions, and customer reviews. We use NLP information extraction and named entity recognition (NER) models on product titles to categorize products better, to improve search results, or to give price suggestions to sellers. To improve customer experience, we use Natural Language Understanding to power automatic replies (chatbots), and we analyze customer reviews with topic modeling. We also run sentiment analysis on users’ feedback to improve our services.

Whether you choose to deep dive into CV or NLP, the common Python deep learning frameworks you must know are “TensorFlow” [ref] and “PyTorch” [ref]. I started my DL journey with “Keras” [ref], which is a higher-level interface to TensorFlow, and recently picked up PyTorch for its flexibility. In addition to these frameworks, you must be confident with the techniques of Transfer Learning, where we adapt a “pre-trained” model that was built on a large dataset, like all of the text on Wikipedia, to solve our own task, which has less labeled data [ref].
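The mechanics of one simple form of Transfer Learning (freezing a pre-trained model and training only a new output layer) can be sketched in a few lines of PyTorch. Everything here is a toy assumption: the `backbone` is a tiny randomly initialized network standing in for a real pre-trained model, and the data is synthetic. In practice you would load actual pre-trained weights instead.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pre-trained feature extractor (hypothetical, randomly
# initialized here; a real one would be loaded with pre-trained weights).
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False  # freeze: no gradients, no updates

# New task-specific head, trained from scratch on our small dataset.
head = nn.Linear(32, 2)
model = nn.Sequential(backbone, head)

# Small synthetic labeled dataset for the target task.
X = torch.randn(64, 16)
y = (X[:, 0] > 0).long()

# Snapshots so we can check what changed during training.
backbone_before = [p.detach().clone() for p in backbone.parameters()]
head_before = head.weight.detach().clone()

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(head.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Freezing the whole backbone is the cheapest variant; another common option is to unfreeze some or all of the pre-trained layers and fine-tune them with a small learning rate once the head has settled.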

Skill 4: Cloud computing

As you might imagine, real-world data science is a lot more than what you learned at university (or in online courses). In an E-Commerce company at the scale of Tokopedia, more than 90 million monthly active users generate such a humongous volume of data that data science is no longer done on a laptop or a desktop, but on the Cloud.

One of the Cloud platforms we use for Data Science at Tokopedia is the Google Cloud Platform (GCP). Data warehouses hosted on GCP’s BigQuery (BQ) store all the data from our 10-year history. We extract data from BQ using standard SQL, making it super convenient. We then develop our models on GCP Virtual Machines and GCP AI Platform notebooks. Model training is distributed across several machines (using distributed TensorFlow [ref], for example), and serving APIs are deployed to AI Platform. Cloud computing goes a long way in improving our day-to-day productivity without having to worry about the underlying infrastructure.

Moving forward into 2020 and beyond, I only see the pace of Cloud usage accelerating with many traditional companies also moving some or all of their workload to the Cloud. Hence, knowledge of at least one major cloud service provider (AWS, Azure, GCP, etc.) is becoming more and more essential to data scientists in E-Commerce.

Skill 5: Soft Skills and Collaboration Tools

Last but not least are soft skills. Data Science has the power to transform our lives. Even many traditionally non-tech companies are building their data science teams to take advantage of the AI revolution. But many times as data scientists, we get so focused on solving a problem that we do not do an excellent job of communicating our insights to the business. I firmly believe that to be a great data scientist, you also need to be a good storyteller using data. Along those lines, effective presentation skills (think Steve Jobs and TED talks) are an absolute must-have for data scientists.

Tokopedia’s first “3 DNA” culture is to “Focus on consumer” [ref]. As data scientists, our consumers are the stakeholders from the various departments who collaborate with us. Having empathy is vital for effective collaboration across multiple departments within the company. Data scientists need to understand each stakeholder’s unique requirements, expectations, and timelines, and ensure we are committing to what we can reliably deliver, with the scope and dependencies spelled out at the beginning of the project.

And finally, data science work in big organizations like Tokopedia is highly teamwork-oriented. As technologies get more complicated and coding standards mature, data science is beginning to look more and more like software development, where teams of data scientists collaborate to solve an impactful problem. As such, typical software collaboration tools like Git [ref], JIRA [ref], and Confluence [ref] have become indispensable tools that you will use every day, and thus knowledge of them is critical.

That’s it for my top 5 skills for an E-Commerce data scientist! Data Science, in general, is a vast discipline with many possible avenues of specialization. I hope this post gave you some clarity on what specific skills and technologies you need to focus on if E-Commerce data science is your interest. If you like the post, do let us know below! And keep watching this space for more in-depth posts on some of the topics I highlighted above. Thank you for reading and happy learning!

The original article was published on Medium at Link.

Special thanks to the Singapore Data Science team at Tokopedia for helping me with this post.