The What and Why
For an advertising company, it is extremely valuable to understand where people spend their time, in order to categorize audiences or map individual interests to products and brands. During the COVID-19 pandemic, however, governments ordered shutdowns of shops, restaurants, theaters, and other commercial venues. As a result, people shifted their movement and visitation from commercial locations to outdoor lifestyle and hobby activities such as beach-going, hiking, and camping. To adapt to these shifts in behavior, my intern project this summer at Xandr has focused on identifying these non-commercial visitations through consumer insights available on mobile devices.
Implementation and Techniques
The volume of GPS data being gathered is massive (billions of records per day), and a substantial subset of it is erroneous or irrelevant to our problem. The goal of the data cleaning process is to remove this unnecessary information by identifying, for example, devices with too few data points, pings with low location accuracy, and movement at excessive speed (driving or flying).
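As a rough illustration of these three filters (not the production pipeline, which runs at far larger scale), the sketch below cleans a small in-memory list of pings. The field names (`device_id`, `ts`, `lat`, `lon`, `accuracy_m`) and every threshold are assumptions chosen for the example.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def clean_pings(pings, max_accuracy_m=100.0, max_speed_mps=3.0, min_points=3):
    """Return {device_id: [ping, ...]} after applying the three filters.

    All thresholds are hypothetical defaults, not values from the project.
    """
    by_device = defaultdict(list)
    for p in pings:
        # Filter 1: drop fixes whose reported accuracy radius is too large.
        if p["accuracy_m"] <= max_accuracy_m:
            by_device[p["device_id"]].append(p)

    cleaned = {}
    for device_id, rows in by_device.items():
        rows.sort(key=lambda p: p["ts"])
        kept = []
        for p in rows:
            if kept:
                prev = kept[-1]
                dt = p["ts"] - prev["ts"]
                dist = haversine_m(prev["lat"], prev["lon"], p["lat"], p["lon"])
                # Filter 2: drop pings implying travel faster than walking pace
                # (or duplicate timestamps) relative to the last kept ping.
                if dt <= 0 or dist / dt > max_speed_mps:
                    continue
            kept.append(p)
        # Filter 3: drop devices that end up with too few usable pings.
        if len(kept) >= min_points:
            cleaned[device_id] = kept
    return cleaned
```

Comparing each ping with the last kept ping is a simplification; a real pipeline might use smoothed speeds or look at both neighbors.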
With the data cleaned and in the proper format, the next step is to identify areas where a device has many densely packed GPS pings, which indicate an activity with some amount of dwell time. To accomplish this, I settled on the DBSCAN algorithm, which supports geospatial clustering, does not require the number of clusters as an input, and can identify non-circular clusters. This allows us to identify geographic locations where people are (relatively) stationary and to filter out noise, that is, non-clustered data (Figure 1).
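To make the idea concrete, here is a minimal pure-Python sketch of DBSCAN over latitude/longitude points using haversine distance. It is a small illustrative version with a quadratic neighbor search, not the implementation used in the project; the `eps_m` and `min_samples` values are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000
NOISE = -1


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) tuples in degrees."""
    p1, p2 = math.radians(a[0]), math.radians(b[0])
    dphi = p2 - p1
    dlmb = math.radians(b[1] - a[1])
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m=50.0, min_samples=3):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(points)
    # Brute-force neighborhoods (each point counts as its own neighbor).
    neighbors = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels
```

Because clusters grow by chaining dense neighborhoods, they can take any shape, and points in sparse areas are left labeled as noise.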
Given spatial clusters, we can identify where people are spending their time, but not for how long or how often. Sequencing is the process of separating spatial clusters temporally, so that repeated trips to the same place become distinct visits. See Figure 2 below for an illustration.
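One simple way to sequence a cluster, shown here only as an assumed approach, is to sort its ping timestamps and start a new visit whenever the gap between consecutive pings exceeds a threshold. The function names and the one-hour `max_gap_s` default are hypothetical.

```python
def sequence_visits(timestamps, max_gap_s=3600):
    """Split ping timestamps (in seconds) from one spatial cluster into visits.

    A new visit starts whenever two consecutive pings are more than
    max_gap_s apart. The threshold is an illustrative assumption.
    """
    visits = []
    for ts in sorted(timestamps):
        if visits and ts - visits[-1][-1] <= max_gap_s:
            visits[-1].append(ts)  # continues the current visit
        else:
            visits.append([ts])  # gap too large: start a new visit
    return visits


def summarize(visits):
    """Return (start, end, dwell_seconds) for each visit."""
    return [(v[0], v[-1], v[-1] - v[0]) for v in visits]
```

The number of visits gives the "how often" and each dwell time gives the "how long" that spatial clustering alone cannot provide.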
Once spatial-temporal clusters are identified, we need to determine whether a device actually visited a non-commercial location. Here I find the cluster(s) within the polygon of interest and derive a score representing our confidence that the given cluster(s) correspond to a visit. To build this confidence score, I consider how far into the polygon each cluster lies and, for clusters close to the border, calculate their local centrality (Figure 3).
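The sketch below illustrates only the "how far into the polygon" part of such a score; it does not attempt the local centrality calculation. It assumes planar coordinates in meters (for example, after projecting latitude/longitude), and the linear scoring rule with a `full_confidence_depth` parameter is a hypothetical choice made for the example.

```python
import math


def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def dist_to_segment(pt, a, b):
    """Shortest planar distance from pt to the segment a-b."""
    px, py = pt
    ax, ay = a
    dx, dy = b[0] - ax, b[1] - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def depth_score(centroid, polygon, full_confidence_depth=100.0):
    """Score in [0, 1]: 0 outside the polygon, rising linearly with the
    distance to the nearest edge until full_confidence_depth (assumed value)."""
    if not point_in_polygon(centroid, polygon):
        return 0.0
    n = len(polygon)
    depth = min(
        dist_to_segment(centroid, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )
    return min(1.0, depth / full_confidence_depth)
```

A cluster centroid deep inside the polygon scores 1.0, while one hugging the border scores close to 0, which is where an additional signal such as local centrality would matter most.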
Personal value / growth
My internship this summer has taught me more than I could have hoped for. Leading up to the program I was concerned about working from home, but I’ve come to enjoy the setup. While socializing is more difficult in a virtual setting, everyone I have spoken with has adapted positively, and the benefits of not commuting can’t be overstated.
In terms of career growth, this program has surpassed my expectations and left my previous experiences in the dust. I’ve been able to brush up on existing technical skills and learn some new ones. I spent a lot of time doing data analysis and manipulation to better understand the dataset and its relationship to the problem I was solving. I learned how to handle data at scale by using distributed computing tools such as PySpark and EKS nodes. I had to expand my knowledge of clustering theory and design my own solutions rather than just applying off-the-shelf models. Aside from the technical skills, I’ve also had many opportunities to learn about Xandr and the advertising industry as a whole, and to network with people from various domains and backgrounds. I have improved my public speaking and presentation skills, especially in a business setting. The breadth of experiences you have access to as a Xandr intern is really up to you: how much do you want to get out of the program?
About the Author
Wayne is a graduate student at The University of Texas at Dallas focusing on Data Science and Intelligent Systems. Outside of work he enjoys spending time with friends, playing board games, playing soccer, and camping.