Top Scorer — Network Analysis via Networks and Graph theory

Angela W.
Published in Analytics Vidhya
4 min read · Jul 10, 2021

This is the solution my team (Jun Lin Goh and I) crafted for the Shopee Code League 2021 competition, Multi-Channel Contacts.

Summary: We used networks and graph theory to model the relationships between the submitted tickets. With this approach, we managed to link 100% of the previously unlinked customers via phone number, email address or order ID.

Problem Statement:
Customers can contact customer service via various channels, such as the livechat function, filling in certain forms or calling in for help. Each time a customer contacts us with a new contact method, a new ticket is automatically generated. A complication arises when the same customer contacts us using different phone numbers or email addresses, resulting in multiple tickets for the same issue. Hence, our challenge here is to identify how to merge relevant tickets together to create a complete picture of the customer issue and ultimately determine the RCR.

Dataset:
You can download the dataset from https://www.kaggle.com/c/scl-2021-da/data

Solution:
The solution is simple and sweet: after minor cleaning, we used a basic network to map out the relationships between tickets. This let us formulate the solution and map out the customer network 100% within the time limit.

First, we load the dataset:

import pandas as pd
import networkx as nx
df = pd.read_json('dataset/contacts.json')
df
  • Each Order ID represents a transaction in Shopee.
  • Each Id represents the Ticket Id made to Shopee Customer Service.
  • All Phone Numbers are stored without the country code and the country code can be ignored.
  • Contacts represents the number of times a user reached out to us in that particular ticket (Email, Call, Livechat etc.)
  • If a value is NA, it means that the system or agent has no record of that value.
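
To make the field descriptions above concrete, here is a minimal sketch using made-up records shaped like the dataset. It assumes missing values appear as empty strings, which is what the `!= ''` checks in the cleaning code suggest; the helper name `known_identifiers` is hypothetical.

```python
# Hypothetical records shaped like contacts.json; every value here is made up.
sample_tickets = [
    {"Id": 0, "Email": "a@example.com", "Phone": "", "Contacts": 1, "OrderId": ""},
    {"Id": 1, "Email": "", "Phone": "91234567", "Contacts": 2, "OrderId": "ORD1"},
    {"Id": 2, "Email": "a@example.com", "Phone": "91234567", "Contacts": 3, "OrderId": ""},
]

def known_identifiers(ticket):
    """Return the (field, value) pairs that are actually recorded for a ticket."""
    fields = ("Email", "Phone", "OrderId")
    return [(field, ticket[field]) for field in fields if ticket[field] != ""]

for ticket in sample_tickets:
    print(ticket["Id"], known_identifiers(ticket))
```

Tickets 0 and 1 share no identifier directly, but both overlap with ticket 2, so all three would belong to the same customer.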

Data Cleaning

Data cleaning is a mandatory step before any modeling.
We prepend "Email_", "Phone_" and "OrderId_" to the corresponding values to help us identify the data type later on.

df.Email = df.Email.apply(lambda x: "Email_" + x if x != '' else '')
df.Phone = df.Phone.apply(lambda x: "Phone_" + x if x != '' else '')
df.OrderId = df.OrderId.apply(lambda x: "OrderId_" + x if x != '' else '')
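
One more reason the prefixes are useful, shown here as a small sketch with made-up values: graph nodes are identified by their value, so a phone number and an order ID that happen to share the same digits would otherwise collapse into a single node. The helper name `tag` is hypothetical and only mirrors the lambdas above.

```python
def tag(kind, value):
    """Prefix a non-empty value with its kind; keep empty (missing) values empty."""
    return f"{kind}_{value}" if value != "" else ""

# Made-up colliding values: identical strings with different meanings.
raw_phone, raw_order = "12345678", "12345678"

print(raw_phone == raw_order)                                   # True
print(tag("Phone", raw_phone) == tag("OrderId", raw_order))     # False
print(tag("Phone", raw_phone), tag("OrderId", raw_order), repr(tag("Email", "")))
# Phone_12345678 OrderId_12345678 ''
```

The prefixes also keep every identifier node non-numeric, which is what lets a purely numeric check pick out the ticket IDs later.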

Creating the Graph network

1. Create the nodes, keyed by Id, with Contacts as a node attribute

# Create the nodes with Id and with attributes as Contacts
nodes = []
for _, Id, _, _, Contacts, _ in df.itertuples():
    nodes.append((Id, {"Contacts": Contacts}))

# Add the nodes into the new graph G
G = nx.Graph()
G.add_nodes_from(nodes)
# Now that we have 500000 nodes, 0 edges
G.number_of_nodes(), G.number_of_edges()

2. Create the edges between: Id ↔ Email, Id ↔ Phone, Id ↔ OrderId

The Email, Phone and OrderId nodes are created automatically in the process.

# Create the edges between: Id ↔ Email, Id ↔ Phone, Id ↔ OrderId
G.add_edges_from(df[df.Email != ''][['Id', 'Email']].to_records(index=False))
G.add_edges_from(df[df.Phone != ''][['Id', 'Phone']].to_records(index=False))
G.add_edges_from(df[df.OrderId != ''][['Id', 'OrderId']].to_records(index=False))
# Now that we have 1129135 nodes, 837231 edges
G.number_of_nodes(), G.number_of_edges()
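
To see why the node count grows when only edges are added, here is a rough standard-library sketch of the same structure. It swaps networkx for a plain adjacency dictionary and uses made-up, already-prefixed tickets; it is an illustration, not the code we ran.

```python
from collections import defaultdict

# Made-up tickets: (Id, Email, Phone, OrderId), with "" meaning missing.
tickets = [
    (0, "Email_a@example.com", "", ""),
    (1, "", "Phone_91234567", "OrderId_ORD1"),
    (2, "Email_a@example.com", "Phone_91234567", ""),
]

adjacency = defaultdict(set)
for ticket_id, *identifiers in tickets:
    adjacency[ticket_id]  # touch the key so tickets without identifiers still get a node
    for identifier in identifiers:
        if identifier != "":
            adjacency[ticket_id].add(identifier)
            adjacency[identifier].add(ticket_id)  # the identifier node appears here

n_nodes = len(adjacency)
n_edges = sum(len(neighbours) for neighbours in adjacency.values()) // 2
print(n_nodes, n_edges)  # 6 5
```

Three tickets plus three distinct identifiers give six nodes, and each recorded identifier contributes one edge.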

3. List the connected components. Each entry in the list is a set holding the nodes of one connected component

# List down the connected component
conn_comp = list(nx.connected_components(G))
conn_comp
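
For readers curious about what happens under the hood, connected components can also be found with a union-find (disjoint-set) structure. The sketch below is a standard-library alternative to `nx.connected_components`, not how networkx implements it; the function names and the toy edges are made up for illustration.

```python
def find(parent, node):
    """Return the representative of node's group, shortening the path as we go."""
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node

def components(nodes, edges):
    """Group nodes into connected components given undirected edges."""
    parent = {node: node for node in nodes}
    for a, b in edges:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups
    groups = {}
    for node in list(parent):
        groups.setdefault(find(parent, node), set()).add(node)
    return list(groups.values())

# Tickets 0 and 1 share nothing directly, but both are linked through ticket 2.
edges = [(0, "Email_a@example.com"), (2, "Email_a@example.com"),
         (1, "Phone_91234567"), (2, "Phone_91234567")]
result = components(nodes=[0, 1, 2, 3], edges=edges)
ticket_groups = sorted(sorted(n for n in group if isinstance(n, int)) for group in result)
print(ticket_groups)  # [[0, 1, 2], [3]]
```

Ticket 3 has no identifiers in this toy data, so it ends up in a component of its own.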

Extract the graph network

The idea is one line per ticket_id, with its connected ticket IDs concatenated and the number of contacts made summed up.

For each connected component, i.e. each set of nodes, we concatenate the ticket IDs within it and sum the contacts made (stored as node attributes earlier).

output = []
# For each connected component, collect the ticket IDs within it (to concatenate later)
for each_connected_component in conn_comp:
    id_list = []
    for each_node in each_connected_component:
        # Ticket IDs are the only purely numeric nodes; all other nodes carry a prefix
        if str(each_node).isnumeric():
            id_list.append(each_node)

    sum_of_contacts = 0
    for ticket_id in id_list:
        # Sum the Contacts attribute stored on each ticket node
        sum_of_contacts += G.nodes[ticket_id]['Contacts']

    output_str = '-'.join([str(each_node) for each_node in sorted(id_list)]) + ', ' + str(sum_of_contacts)
    for ticket_id in id_list:
        output.append([ticket_id, output_str])
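
As a quick illustration of the output format, here is a toy version of the same step applied to a single made-up component. It is a sketch with a hypothetical helper `trace_rows`, and it uses an `isinstance` check instead of `isnumeric()` to pick out ticket IDs, which assumes ticket IDs are stored as integers.

```python
def trace_rows(component, contacts):
    """Build one (ticket_id, 'id-id-..., total') row per ticket in a component."""
    ticket_ids = sorted(node for node in component if isinstance(node, int))
    total = sum(contacts[ticket_id] for ticket_id in ticket_ids)
    trace = "-".join(str(ticket_id) for ticket_id in ticket_ids) + ", " + str(total)
    return [(ticket_id, trace) for ticket_id in ticket_ids]

# Made-up component and contact counts.
component = {2, 0, 1, "Email_a@example.com", "Phone_91234567"}
contacts = {0: 1, 1: 2, 2: 3}

for row in trace_rows(component, contacts):
    print(row)
# (0, '0-1-2, 6')
# (1, '0-1-2, 6')
# (2, '0-1-2, 6')
```

Every ticket in the component gets the same trace string, so each ticket row carries the full picture of the customer.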

We convert the output to a DataFrame, sort it according to the competition requirements and export it as a CSV file.

output_final = pd.DataFrame(output)
output_final = output_final.rename(columns={0: 'ticket_id', 1: 'ticket_trace/contact'})
output_final.sort_values('ticket_id').to_csv('output.csv', index=False)
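
One detail worth knowing about the exported file: the trace string contains a comma, so it is wrapped in quotes in the CSV. The sketch below shows this with Python's built-in `csv` module and made-up rows rather than pandas; as far as I know, `to_csv` applies the same minimal quoting by default.

```python
import csv
import io

# Made-up rows in the same (ticket_id, ticket_trace/contact) shape.
rows = [(3, "3, 4"), (0, "0-1-2, 6"), (2, "0-1-2, 6"), (1, "0-1-2, 6")]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ticket_id", "ticket_trace/contact"])
writer.writerows(sorted(rows))  # sort by ticket_id, as in the submission

csv_text = buffer.getvalue()
print(csv_text)
```

The first data line comes out as `0,"0-1-2, 6"`, so a CSV reader still sees exactly two columns per row.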

Voila, we are done! I hope you enjoyed this article / tutorial.

If you have any questions or liked the article, please leave a comment.

Github repository: https://github.com/angelawongsw/Top-Scorer---Network-Analysis-Networks-and-Graph-theory
