Sharing my experience of a successful big data implementation using some of the most prominent analytic tools for data storage, management, visualization, and batch ETL processing.
The City of Chicago is the first city in the country to publish anonymized rideshare data from companies including Uber, Lyft, and Via. Such ride-hailing platforms rely heavily on data-driven decisions at many levels to deliver safe and reliable transportation. I explore a Big Data solution built on this public dataset, motivated by the need for insights from the petabytes of analytical data such platforms generate.
The proposed distinct phases of my solution are the following:
- Creating a Data Strategy for meaningful insights
- Building a cloud-based Big Data Architecture
- Designing the Dimensional Data Model
- Big Data processing using HDFS and Hive
- Deploying effective analytics for Insights and Data Visualization
Creating a Data Strategy for meaningful insights
Data reports related to drivers and trips are provided by Chicago’s licensed ride-hailing services (Uber, Lyft, and Via). The anonymized rideshare data offers a useful first step toward illustrating common travel patterns within Chicago and other large metropolitan areas. I decided to assess the impact of ride-hailing services on economically connected areas and on traffic peaks during commute periods, along with other insights that generate real business value.
Building a cloud-based Big Data Architecture
In the interest of building an effective data infrastructure similar to the ones used by ride-hailing services, my Big Data platform consisted mostly of the Hadoop ecosystem. Apache Hadoop offers a fast, efficient, and reliable platform for analytical data through distributed storage and processing across a cluster. Following the Apache Hadoop documentation, the cluster was built on Linux virtual machines in Azure.
The Azure Virtual Machines were configured within the constraints of the dataset, with each virtual machine representing a node of the cluster.
The following factors were considered before creating the appropriate VM:
- Application resources group
- Storage resources region
- Base operating system
- Data disk size and encryption type
- Maximum number of VMs that can be created
- VM-related resources
Upon creating the resource group ‘BigData’ shared amongst all of the VMs, each VM was placed in the same storage region, ‘East US’, for simplicity in security. Due to the demanding workload, I opted for 128 GB and 64 GB SSD-based data disks with Ubuntu Server for low latency and high performance. All of the virtual machines shared an SSH public key, ‘bigdata_azure’, for authentication, and network traffic was filtered within the Azure virtual network. The ‘Hadoop’ tags were simply name-value pairs that helped with access control and categorizing resources.
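As a rough illustration, a node with this profile could be provisioned with the Azure CLI; the VM name hadoop-node1, the admin user, and the VM size below are hypothetical placeholders, not values from the original setup.

$ az vm create \
    --resource-group BigData \
    --name hadoop-node1 \
    --image UbuntuLTS \
    --size Standard_D4s_v3 \
    --admin-username hadoop \
    --ssh-key-values ~/.ssh/bigdata_azure.pub \
    --data-disk-sizes-gb 128 \
    --tags Hadoop=node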
Hadoop HDFS and YARN configurations for a fully distributed Hadoop cluster:
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/yarn-site.xml
- etc/hadoop/mapred-site.xml
# Smoke-test the Hadoop installation in standalone mode
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.0.jar grep input output 'dfs[a-z.]+'
$ cat output/*
# Edit the site configurations for the distributed cluster
$ vi etc/hadoop/core-site.xml
$ vi etc/hadoop/hdfs-site.xml
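For reference, a minimal core-site.xml and hdfs-site.xml for a distributed setup might look like the sketch below; the NameNode host name master and the replication factor are assumptions, not the original cluster’s values.

<!-- core-site.xml: every node points at the NameNode (host name "master" is assumed) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: HDFS block replication across the cluster (factor of 3 is assumed) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>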
Designing the Dimensional Data Model
Dimensional modeling techniques were drawn from The Data Warehouse Toolkit, Third Edition, following its four-step design process:
- Selecting the business process to model: Customer trips including shared rides
- Declaring the grain of the business process: TNC trips
- Choosing the dimensions to apply to each fact table row: Location and Time
- Identifying the numeric facts that will populate each fact table row: Revenue, miles, trips, and ride duration
Big Data processing using HDFS and Hive
Schema on read: the dataset was moved into HDFS, and external tables in Hive were built to read directly from the CSV files.
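As a sketch, the HDFS load could be done with the hdfs dfs utility; the file name trips.csv is a placeholder, and the paths match the staging layout listed below.

$ hdfs dfs -mkdir -p /user/data/fact
$ hdfs dfs -put trips.csv /user/data/fact/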
- Location dimension (Location_text): The trip dataset contains latitude and longitude coordinates for pickup and dropoff centroid locations. Using a reverse geocoding API, the geographic coordinates were converted into human-readable addresses for analysis.
- Calendar date dimension (Dates_text): The trip dataset timestamps were in MM/DD/YYYY format. Using PostgreSQL, a relational table with attributes such as weekday, weekend, and public holidays was created; a sketch follows this list.
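A minimal sketch of that date dimension in PostgreSQL, assuming the two-month analysis window; the table and column names are illustrative, and the public-holiday lookup is omitted:

-- Generate one row per calendar day with basic weekday/weekend attributes
CREATE TABLE dates_dim AS
SELECT d::date                          AS cal_date,
       to_char(d, 'Day')                AS weekday_name,
       EXTRACT(ISODOW FROM d) IN (6, 7) AS is_weekend
FROM generate_series('2018-11-01'::date, '2018-12-31'::date, interval '1 day') AS d;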
External tables for staging data:
locations_text -> user/data/dimensions/locations
dates_text -> user/data/dimensions/date
time_buckets -> user/data/dimensions/time
trips_data -> user/data/fact
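As an illustration, the trips staging table might be declared as follows; the column list is hypothetical since the original DDL is not shown, and the HDFS location assumes the layout above:

CREATE EXTERNAL TABLE IF NOT EXISTS trips_data (
  trip_id      STRING,
  trip_start   STRING,
  trip_end     STRING,
  trip_seconds INT,
  trip_miles   DOUBLE,
  fare         DOUBLE,
  pickup_lat   DOUBLE,
  pickup_lon   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/data/fact'
TBLPROPERTIES ('skip.header.line.count'='1');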
Populating the managed tables
Using the external tables, managed tables for each dimension were created. The fact table was then loaded using the dimension keys.
CREATE TABLE IF NOT EXISTS fact_trips (
  location_key INT, date_key INT, time_key INT,   -- illustrative keys; original columns not shown
  fare DOUBLE, trip_miles DOUBLE, trip_seconds INT)
PARTITIONED BY (trip_part_date INT);
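Loading the partitions could then rely on Hive dynamic partitioning. The following is a sketch against the hypothetical columns above; the dimension table names and join keys are assumptions, not the original load script.

-- Allow Hive to create partitions from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE fact_trips PARTITION (trip_part_date)
SELECT loc.location_key,
       dt.date_key,
       tb.time_key,
       t.fare,
       t.trip_miles,
       t.trip_seconds,
       dt.date_key AS trip_part_date               -- partition column goes last
FROM trips_data t
JOIN dim_locations loc ON loc.latitude = t.pickup_lat AND loc.longitude = t.pickup_lon
JOIN dim_dates dt      ON dt.cal_date = to_date(t.trip_start)
JOIN dim_time tb       ON tb.hour_of_day = hour(t.trip_start);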
Over 280,000 records, about 74 MB of data, were produced for an average day. Using the external tables and SQL, I built partitioned fact and dimension tables for the attributes identified in the dimensional data model. Partitioning by day made the data easier to manage and queries faster to resolve as the volume of processed data grew.
A de-normalized aggregate table was created to support ad-hoc queries; it could be further aggregated to generate weekly and monthly snapshots.
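A daily rollup in that spirit might look like this; the measures mirror the hypothetical fact columns above:

CREATE TABLE agg_trips_daily AS
SELECT trip_part_date,
       COUNT(*)          AS trips,
       SUM(fare)         AS revenue,
       AVG(trip_miles)   AS avg_miles,
       AVG(trip_seconds) AS avg_duration_sec
FROM fact_trips
GROUP BY trip_part_date;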
Analytics for Insights and Data Visualization
Data on passenger behavior reveals how ride-hailing platforms improve their service and illustrates their impact on the city’s transportation system. Visualizing with Apache Superset highlighted the importance of collecting and producing big data for driving business growth.
Over the two-month period from 11/01/18 to 12/31/18, travel patterns partially deviated from the norm due to the major holidays.
- A weekly average of over 1.2 million trips was taken, with 84% booked as individual rides and the rest shared.
- Trip start and end times are rounded to the nearest 15 minutes, and the median trip length was 3.4 miles. Trips tend to cluster around early-morning commute hours and “nightlife” hours. The average speed of TNC trips declined during traditional commute hours and TNC usage peaks.
- Trips to and from economically connected areas (83% of 12 million TNC trips) were the largest generator of travel, had shorter trip lengths, and included a lower proportion of shared rides than trips taken outside those areas.
- As the map shows, pickup locations were concentrated around the central business district, West Loop, River North, Midway, and O’Hare International Airport.