Assessing batch-processed analytics of anonymized rideshare data

Farhan Juneja

Sharing my experience of a successful big data implementation using some of the most prominent analytic tools for data storage, management, visualization, and batch ETL processing.

The City of Chicago is the first city in the country to publish anonymized rideshare data from companies including Uber, Lyft, and Via. Ride-hailing platforms rely heavily on data-driven decisions at many levels to deliver safe and reliable transportation. Using this public dataset, I explore a Big Data solution modeled on the kind of insight needs that, at these platforms, have produced petabytes of analytical data.

  • Creating a Data Strategy for meaningful insights
  • Building a cloud-based Big Data Architecture
  • Designing the Dimensional Data Model
  • Big Data processing using HDFS and Hive
  • Deploying effective analytics for Insights and Data Visualization

Creating a Data Strategy for meaningful insights

(Image: Uber Engineering)

Data reports related to drivers and trips are provided by Chicago’s licensed ride-hailing services (Uber, Lyft, and Via). The anonymized rideshare data offers an instructive view of common travel patterns within Chicago and other large metropolitan areas. I decided to assess the impact of ride-hailing services on economically connected areas and on traffic peaks during commute periods, along with other insights that generate real business value.

Building a cloud-based Big Data Architecture

In the interest of building an effective data infrastructure similar to the ones used by ride-hailing services, my Big Data platform consisted mostly of the Hadoop ecosystem. Apache Hadoop offers a fast, efficient, and reliable platform for analytical data through distributed processing across a cluster of machines. Following the Apache Hadoop documentation, the cluster was built on Linux virtual machines in Azure.


The Azure virtual machines were configured within constraints sized to the dataset, with each virtual machine representing a node of the cluster.

The following factors were considered before creating the appropriate VM:

  • Application resource group
  • Storage resource region
  • Base operating system
  • Data disk size and encryption type
  • Maximum number of VMs that can be created
  • VM-related resources

Upon creating the resource group ‘BigData’ shared amongst all of the VMs, each VM was placed in the same region, ‘East US’, to keep the security setup simple. Due to the demanding workload, I opted for 128 GB and 64 GB SSD-based data disks with Ubuntu Server for low latency and high performance. All of the virtual machines shared an SSH public key, ‘bigdata_azure’, for authentication, and network traffic was filtered within the Azure virtual network. The ‘Hadoop’ tags were simple name-value pairs that helped categorize resources and manage access control.

Site-specific configurations for a fully distributed Hadoop cluster:
etc/hadoop/core-site.xml
etc/hadoop/hdfs-site.xml
etc/hadoop/yarn-site.xml
etc/hadoop/mapred-site.xml
Standalone Operation:
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-Distributed Operation:
$ vi etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
$ vi etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
(Image: Hive 3.1.2 setup)

Designing the Dimensional Data Model


Dimensional modeling techniques were drawn from The Data Warehouse Toolkit, Third Edition, by Ralph Kimball and Margy Ross:

  1. Selecting the business process to model: Customer trips, including shared rides
  2. Declaring the grain of the business process: Individual TNC (transportation network company) trips
  3. Choosing the dimensions to apply to each fact table row: Location and Time
  4. Identifying the numeric facts that will populate each fact table row: Revenue, miles, trips, and ride duration
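
To make these choices concrete, the location and time dimensions could be sketched in Hive roughly as below. The table and column names are illustrative assumptions on my part; the actual fact table DDL used in the project is shown later in the post.

-- Hypothetical dimension tables implied by the four steps above.
CREATE TABLE IF NOT EXISTS dim_location (
  location_id     INT,        -- surrogate key referenced by the fact table
  address         STRING,     -- reverse-geocoded, human-readable address
  community_area  STRING,
  latitude        DOUBLE,
  longitude       DOUBLE);

CREATE TABLE IF NOT EXISTS dim_time (
  time_id         INT,
  time_bucket     STRING);    -- trip times rounded to 15-minute buckets, e.g. '08:15'

-- A calendar date dimension (weekday, weekend, and holiday flags) completes the
-- time side of the model; its construction is described in the next section.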
(Image: A Day in the Life of Uber)

Big Data processing using HDFS and Hive

(Image: Hadoop UI)

Schema on read: the dataset was moved to HDFS, and external tables in Hive were built to read the CSV files in place. A sketch of one such external table follows the path mapping below.

  • Location dimension (Location_text): The trip dataset contains latitude and longitude coordinates for the pickup and dropoff centroid locations. With a reverse geocoding API, the geographic coordinates were converted into human-readable addresses for analysis.
  • Calendar date dimension (Dates_text): The trip dataset timestamps were in MM/DD/YYYY format. Using PostgreSQL, a relational date table with attributes such as weekday, weekend, and public holiday flags was created.
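
As an illustration of that step, a calendar table of this kind can be generated in PostgreSQL with generate_series. The names below are assumptions, and the output would then be exported to CSV and landed in HDFS with the other dimension files.

-- Hypothetical date dimension for the analysis window (Nov-Dec 2018), built in PostgreSQL.
CREATE TABLE dim_date AS
SELECT
  TO_CHAR(d, 'YYYYMMDD')::INT                        AS date_id,        -- surrogate key
  d::DATE                                            AS calendar_date,
  TRIM(TO_CHAR(d, 'Day'))                            AS weekday_name,
  EXTRACT(ISODOW FROM d) IN (6, 7)                   AS is_weekend,
  d::DATE IN (DATE '2018-11-22', DATE '2018-12-25')  AS is_holiday      -- Thanksgiving, Christmas
FROM generate_series(DATE '2018-11-01', DATE '2018-12-31', INTERVAL '1 day') AS t(d);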
locations_text -> user/data/dimensions/locations
dates_text -> user/data/dimensions/date
time_buckets -> user/data/dimensions/time
trips_data -> user/data/fact
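
To show the schema-on-read pattern, here is a hedged sketch of an external table over the raw trips CSV at the location mapped above. The column list is abbreviated and the exact names are assumptions based on the public dataset.

-- Hypothetical external table: Hive reads the CSV files in place, and dropping
-- the table removes only the metadata, not the files in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS trips_data (
  trip_id                     STRING,
  trip_start_timestamp        STRING,
  trip_end_timestamp          STRING,
  trip_seconds                INT,
  trip_miles                  DOUBLE,
  pickup_centroid_latitude    DOUBLE,
  pickup_centroid_longitude   DOUBLE,
  dropoff_centroid_latitude   DOUBLE,
  dropoff_centroid_longitude  DOUBLE,
  fare                        DOUBLE,
  tip                         DOUBLE,
  additional_charges          DOUBLE,
  trip_total                  DOUBLE,
  trips_pooled                INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/fact'                    -- path as mapped above (absolute HDFS path assumed)
TBLPROPERTIES ('skip.header.line.count'='1');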

Using the external tables, managed tables for each dimension were created. The fact table was then loaded using the dimension keys; a sketch of that load follows the table definition below.

CREATE TABLE IF NOT EXISTS fact_trips (
  trip_id               STRING,
  pickup_date_id        INT,
  dropoff_date_id       INT,
  pickup_location_id    INT,
  dropoff_location_id   INT,
  pickup_time_id        INT,
  dropoff_time_id       INT,
  fare                  DOUBLE,
  tip                   DOUBLE,
  additional_charges    DOUBLE,
  total                 DOUBLE,
  trip_duration_seconds INT,
  trip_miles            DOUBLE,
  trip_pooled           INT)
PARTITIONED BY (trip_part_date INT);
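
The load itself is not shown above, so here is a hedged sketch of how the fact table could be populated from the external trips table by joining to the (assumed) managed dimension tables, writing one partition per day with Hive’s dynamic partitioning. The timestamp pattern and join keys are assumptions.

-- dim_date, dim_location, and dim_time are the (assumed) managed dimension
-- tables built from the *_text external tables above.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE fact_trips PARTITION (trip_part_date)
SELECT
  t.trip_id,
  pd.date_id,                          -- pickup_date_id
  dd.date_id,                          -- dropoff_date_id
  pl.location_id,                      -- pickup_location_id
  dl.location_id,                      -- dropoff_location_id
  pt.time_id,                          -- pickup_time_id
  dt.time_id,                          -- dropoff_time_id
  t.fare,
  t.tip,
  t.additional_charges,
  t.trip_total,
  t.trip_seconds,
  t.trip_miles,
  t.trips_pooled,
  pd.date_id AS trip_part_date         -- dynamic partition column: one partition per day
FROM trips_data t
JOIN dim_date pd
  ON pd.calendar_date = to_date(from_unixtime(
       unix_timestamp(t.trip_start_timestamp, 'MM/dd/yyyy hh:mm:ss a')))
JOIN dim_date dd
  ON dd.calendar_date = to_date(from_unixtime(
       unix_timestamp(t.trip_end_timestamp, 'MM/dd/yyyy hh:mm:ss a')))
JOIN dim_location pl
  ON pl.latitude = t.pickup_centroid_latitude
 AND pl.longitude = t.pickup_centroid_longitude
JOIN dim_location dl
  ON dl.latitude = t.dropoff_centroid_latitude
 AND dl.longitude = t.dropoff_centroid_longitude
JOIN dim_time pt
  ON pt.time_bucket = from_unixtime(
       unix_timestamp(t.trip_start_timestamp, 'MM/dd/yyyy hh:mm:ss a'), 'HH:mm')
JOIN dim_time dt
  ON dt.time_bucket = from_unixtime(
       unix_timestamp(t.trip_end_timestamp, 'MM/dd/yyyy hh:mm:ss a'), 'HH:mm');

Joining locations on the published centroid coordinates is workable here only because the dataset reports fixed centroids; a production pipeline would key locations on a stable identifier instead.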

Over 280,000 records, about 74 MB of data, were produced for an average day. Using the external tables and SQL, I built partitioned fact and dimension tables for the attributes described in the dimensional data model. Partitioning by day kept the data easier to manage and queries faster to resolve as the volume of processed data grew.
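
For example, a single-day question then touches only one partition rather than the whole table (assuming trip_part_date is stored as a YYYYMMDD integer):

-- Hive prunes to the 2018-11-15 partition instead of scanning every day.
SELECT COUNT(*)   AS trips,
       SUM(total) AS revenue
FROM fact_trips
WHERE trip_part_date = 20181115;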


A de-normalized aggregated table was created to support ad-hoc queries. The table could be further aggregated to generate weekly and monthly snapshots.
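
A hedged sketch of such an aggregate, rebuilt on each batch run; the table name and grouping columns are assumptions:

-- Hypothetical daily aggregate by pickup location, used for ad-hoc queries
-- and as the base for weekly and monthly snapshots.
CREATE TABLE IF NOT EXISTS agg_trips_daily AS
SELECT
  trip_part_date,
  pickup_location_id,
  COUNT(*)                                             AS trips,
  SUM(CASE WHEN trip_pooled > 1 THEN 1 ELSE 0 END)     AS shared_trips,   -- assuming trip_pooled counts trips pooled together (1 = individual)
  SUM(total)                                           AS revenue,
  SUM(trip_miles)                                      AS miles,
  AVG(trip_duration_seconds)                           AS avg_duration_seconds
FROM fact_trips
GROUP BY trip_part_date, pickup_location_id;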

(Image: Batch process, by author)

Analytics for Insights and Data Visualization

Data on passenger behavior reveals how ride-hailing platforms use it to improve their service and illustrates their impact on the city’s transportation system. Visualizing the results with Apache Superset highlighted the importance of collecting and processing big data to drive business growth.

(Image: Total trips, Week 44, by author)

Over the two-month period from 11/01/18 to 12/31/18, travel patterns deviated somewhat from the norm because of the major holidays.

  • A weekly average of over 1.2 million trips was taken, with 84% of them individual bookings and the rest shared.
  • Trip start and end times are rounded to the nearest 15 minutes, and the median trip length was 3.4 miles. Trips tended to cluster around early-morning commute hours and “nightlife” hours, and the average speed of TNC trips declined during traditional commute hours and TNC usage peaks.
  • Trips to and from economically connected areas (83% of the 12 million TNC trips) were the larger generator of travel, had shorter trip lengths, and included a lower proportion of shared rides than trips taken outside those areas.
  • Pickup locations were concentrated around the central business district, West Loop, River North, Midway, and O’Hare International Airport.
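
For reference, headline numbers like the weekly trip volume and the individual-versus-shared split can be reproduced with a query along these lines against the tables above. This is a sketch only; the date-dimension columns and the meaning of trip_pooled are assumptions.

-- Weekly trip counts and the share of individual (non-pooled) bookings.
SELECT
  weekofyear(pd.calendar_date)                                        AS week_of_year,
  COUNT(*)                                                            AS total_trips,
  ROUND(100.0 * SUM(CASE WHEN f.trip_pooled <= 1 THEN 1 ELSE 0 END)
              / COUNT(*), 1)                                          AS pct_individual
FROM fact_trips f
JOIN dim_date pd ON pd.date_id = f.pickup_date_id
GROUP BY weekofyear(pd.calendar_date)
ORDER BY week_of_year;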