Challenge of Wits
Challenge #2 Data Science — Check out the solution!
Background
Defence Science & Technology (MINDEF)’s Challenge of Wits returned for its second edition! CSIT put up a Data Science challenge that ran from 15 to 28 August 2023.
Read on as we unveil the challenge and its solution!
The Challenge
Scenario: Millions in stolen cash were stashed away by terror organization APOCALYPSE in their safe. A thumb drive containing log files was recovered when they were captured. The thumb drive has been handed to you, and your team’s mission is to find the secret key from the “Logs.zip” files that will unlock the safe to recover the stolen cash.
Armed with programming expertise and knowledge of text processing, you know how best to represent insights from data to answer mission requirements. You are told APOCALYPSE used a 2D shape as their secret key. Can you find the secret key to unlock the safe?
A Logs.zip file was given via the challenge website with a hint: Write a script to solve the challenge.
Motivation
According to Amazon,
“Data science is the study of data to extract meaningful insights for business. It is a multidisciplinary approach that combines principles and practices from the fields of mathematics, statistics, artificial intelligence, and computer engineering to analyze large amounts of data. This analysis helps data scientists to ask and answer questions like what happened, why it happened, what will happen, and what can be done with the results.”
Data scientist Abraham Enyo-one Musa has a comprehensive summary of the data science lifecycle here.
In this post, I will focus on the parts of the lifecycle that this challenge aligns with, specifically:
- Problem Understanding
- Data Collection
- Data Cleaning and Processing
- Data Visualization
These 4 stages were chosen because they focus on thinking and reasoning, which matter more in data science than the remaining stages. Model building and evaluation rely on plug-and-play code that makes it hard to test one’s understanding, and are better left to online competition platforms such as Kaggle. Model deployment and maintenance are less relevant to student participants.
Solution
This solution was run in a Jupyter notebook and requires Logs.zip. Launch the notebook from the folder where Logs.zip is placed.
Problem Understanding
At this stage, a data scientist receives a problem statement, which is often broadly defined or unclear. The task is to distil what the mission requires, and possibly elicit more information.
The challenge description can be broken down into bite-sized pieces.
Data Collection
Knowing mission requirements helps you identify what data you need and whether the collected data is able to answer the question.
For this challenge, you are tasked with finding APOCALYPSE’s secret key, which is a 2D shape.
A 2D shape is a set of coordinates represented in a 2D space, and each coordinate is a pair of real numbers. Let’s sample a few log files to see if they fulfill what we need.
First, unzip Logs.zip (aka the thumb drive). It contains 2,255 log files.
!unzip -q -o Logs.zip
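If the unzip command is not available on your machine, Python's standard zipfile module can do the same job. The following is only a minimal sketch: the helper name extract_logs is hypothetical, and it assumes Logs.zip sits in the notebook's working directory and unpacks into a Logs folder, as the rest of this post expects.

```python
import os
import zipfile


def extract_logs(zip_path, dest="."):
    """Extract every member of zip_path into dest and return the file names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        # Directory entries end with "/", so leave them out of the count
        return [name for name in archive.namelist() if not name.endswith("/")]


# Only runs when the archive is actually present (assumed file name)
if os.path.exists("Logs.zip"):
    names = extract_logs("Logs.zip")
    print(len(names), "files extracted")
```

Counting the returned names is a quick way to check that the extraction matches the expected number of log files.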
Next, sample two files and look at their content.
log_1617.xml
<?xml version ="1.0" encoding ="UTF-8" ?>
<data>
<location>161,43</location>
<convo>im not sure how i feel im shocked honestly</convo>
<class>casual</class>
</data>
log_2155.xml
<?xml version ="1.0" encoding ="UTF-8" ?>
<data>
<location>264,142</location>
<convo>i feel very happy and excited since i learned so many things</convo>
<class>casual</class>
</data>
Notice that each sample contains 3 XML elements — location, convo and class. What we want to extract is the coordinates in the location element.
Data Cleaning and Processing
String Slicing — Wrong Approach
Some participants might have attempted to use string slicing, but this method yields incorrect coordinates. A closer look at the same two samples explains why. If we extract the 58th through 63rd characters of the input text, we get the correct coordinates for log_1617.xml but the wrong coordinates for log_2155.xml, because its coordinates are one character longer and the fixed-width slice cuts them short.
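To make the failure concrete, here is a small sketch that rebuilds the two samples as strings and applies one fixed slice to both. The exact whitespace of the real files is an assumption here (one element per line, no indentation), so the slice positions are derived from the first sample rather than hard-coded.

```python
# Assumption: files look like the two samples above, one element per line
# with no indentation; the real files may differ in whitespace.
HEADER = '<?xml version ="1.0" encoding ="UTF-8" ?>\n<data>\n'
log_1617 = HEADER + '<location>161,43</location>\n'
log_2155 = HEADER + '<location>264,142</location>\n'

# Derive a fixed slice from the first sample only
start = log_1617.index('<location>') + len('<location>')
end = log_1617.index('</location>')

print(log_1617[start:end])  # 161,43
print(log_2155[start:end])  # 264,14  (the last digit is cut off)
```

A slice tuned to one file silently returns a wrong y-coordinate for any file whose coordinates have a different number of digits.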
Regular Expressions — Suggested Approach
Here, I present one approach. A regular expression describes a pattern of characters to search for in a text. RegexOne is one online resource for learning them.
The search pattern is <location>(\d+),(\d+)<\/location>. This extracts <location>264,142</location> from the example text.
Each (\d+) is a capturing group that matches one or more digits. The first (\d+) captures 264 and the second (\d+) captures 142.
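As a quick check of the pattern on its own, the sketch below runs it against the location line from log_2155.xml using Python's re module. The variable names are only illustrative.

```python
import re

text = "<location>264,142</location>"
match = re.search(r"<location>(\d+),(\d+)<\/location>", text)

print(match.group(0))  # the whole matched text
print(match.group(1))  # first capturing group: 264
print(match.group(2))  # second capturing group: 142
```

Note that the groups are returned as strings, so they still need to be cast to integers before plotting.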
Now, let’s apply this to all the log files.
import os
import re

files = os.listdir('Logs')
x, y = [], []  # lists to store coordinates

# iterate over all log files
for file in files:
    # read the file content (os.path.join keeps the path portable across operating systems)
    with open(os.path.join('Logs', file), 'r') as f:
        text = f.read()
    # search for the location pattern
    pair = re.search(r"<location>(\d+),(\d+)<\/location>", text)
    # extract coordinates and cast to integers
    x_coord = int(pair.group(1))
    y_coord = int(pair.group(2))
    x.append(x_coord)
    y.append(y_coord)

# every file should have contributed exactly one coordinate pair
assert len(files) == len(x)
assert len(files) == len(y)
Other approaches are also possible. If your approach differs from mine, feel free to share!
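One such alternative, since the logs are XML, is to let an XML parser find the location element instead of a regular expression. This is a sketch using the standard library's xml.etree.ElementTree, not the method used to build the challenge; the helper name parse_location is hypothetical, and it assumes every file has a location element holding two comma-separated integers.

```python
import xml.etree.ElementTree as ET


def parse_location(xml_text):
    """Return the (x, y) integers stored in the <location> element."""
    root = ET.fromstring(xml_text)
    # findtext returns the text inside the first matching child element
    x_text, y_text = root.findtext("location").split(",")
    return int(x_text), int(y_text)


sample = (
    '<?xml version ="1.0" encoding ="UTF-8" ?>\n'
    "<data>\n"
    "<location>264,142</location>\n"
    "<convo>i feel very happy and excited since i learned so many things</convo>\n"
    "<class>casual</class>\n"
    "</data>\n"
)
print(parse_location(sample))  # (264, 142)
```

The parser does not depend on character positions or on how the element is laid out in the file, at the cost of failing loudly on any file that is not well-formed XML.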
Data Visualization
With the extracted coordinates, the remaining step is to visualize them. I used matplotlib. Feel free to use any Python visualization libraries you are comfortable with.
Running the following code, you will get this plot.
!pip install matplotlib # run this if you have not installed library before
import matplotlib.pyplot as plt
plt.plot(x,y)
plt.axis('off');
Uh oh. The plot doesn’t seem right. What went wrong?
The above image is a line plot, which assumes that the data points follow an order (think of time series or trend plots). The challenge, however, did not mention any inherent order in the extracted coordinates.
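To see why order matters, here is a small sketch with made-up points (not taken from the dataset). It lists the segments a line plot would draw for the same four points in two different orders; the helper line_segments is hypothetical and only mimics how a line plot joins each point to the next.

```python
# Hypothetical points tracing three sides of a square, in drawing order
ordered = [(0, 0), (0, 1), (1, 1), (1, 0)]
# The same points in an arbitrary order, e.g. the order files happen to be listed in
shuffled = [(0, 0), (1, 1), (0, 1), (1, 0)]


def line_segments(points):
    """Segments a line plot would draw: each point joined to the next one."""
    return list(zip(points, points[1:]))


print(line_segments(ordered))    # three sides of the square
print(line_segments(shuffled))   # includes diagonals that are not part of the shape
print(set(ordered) == set(shuffled))  # True: the points themselves are identical
```

The segments change with the list order, while the set of points stays the same, so a plot that draws only the points is unaffected by the order in which the files were read.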
Instead, use a scatterplot and you will derive the correct answer.
import matplotlib.pyplot as plt
plt.scatter(x,y)
plt.axis('off');
APOCALYPSE’s secret key is the shape of the letter A. Did you manage to unlock the safe?
Congratulations to all participants who solved the challenge!
Summary
Challenge too difficult?
If you weren’t able to get the answer, don’t be discouraged! Hopefully this blog entry has shed some light on the thought process behind solving this challenge.
Final words
In this post, I went through the solution to CSIT’s challenge in the Challenge of Wits step by step, aligning each step with a stage of the data science lifecycle.
I’d like to thank Tay YQ for collaborating with me on creating this challenge. Tay YQ sourced the texts in the convo and class elements, which, together with self-generated coordinates in the location elements, form the dataset Logs.zip.
While this challenge does not aim to provide a complete picture of the lifecycle, I hope it has highlighted the more important stages and the lessons that can be learnt:
- Distilling information from the question — what is the problem you’re trying to solve
- Garbage in, garbage out — when preprocessing is not done right
- How best to represent insights from data — when to use what visualization
Toodaloo!
P.S. Find out more about Defence Science & Tech sector’s Challenge of Wits here.