DC Department of Transportation Service Requests

Jdavitz
INST414: Data Science Techniques
6 min read · Feb 20, 2024

The question I set out to answer was whether all regions serviced by the DC Department of Transportation (DDOT) are equally efficient at completing service requests. A number of stakeholders would be interested in DDOT's efficiency, such as taxpayers in the District of Columbia or supervisors at the Department of Transportation. For this analysis I will focus on the viewpoint of a DDOT supervisor. If this analysis finds areas where service requests are not being fulfilled as efficiently as in other regions, that can inform a supervisor's decision to allocate more resources and manpower to a specific region, or give cause to investigate that region's performance.

The data that can answer this question is the list of work tickets that DDOT received and worked recently. Specifically, I wanted to see fields for when each ticket was first generated and the date the ticket was closed out and the work completed. To sort by region, some sort of location information would also be needed, preferably a city name or borough. I was able to find a dataset on the DC open data website that contained the information I needed and more. The data came directly from the government as a CSV file containing a large amount of information. To start my analysis, I began by significantly cutting down the number of columns in the dataset, from 51 down to just 13.
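Loading the export is a single pandas call. A minimal sketch is below; the filename is my own placeholder rather than the exact name from the open data portal.

import pandas as pd

# Load the raw CSV export from DC open data (placeholder filename)
service_requests = pd.read_csv('dc_service_requests.csv')
print(service_requests.shape)  # expect roughly 1.3 million rows and 51 columns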

service_requests = service_requests[[
    'OBJECTID', 'DESCRIPTION', 'STATUS', 'WARD', 'REQUESTCATEGORY',
    'INITIATEDDATE', 'CLOSEDDATE', 'NEIGHBORHOODCLUSTERS', 'NEIGHBORHOODNAMES',
    'DAYSTOCLOSE', 'LONGITUDE', 'LATITUDE', 'SUFFIX'
]]

A few of the columns I kept are listed below with explanations. Not every column had an informative name, and I was unable to locate a data dictionary for the dataset, so some are listed with my best guess at what they represent (a quick inspection snippet follows the list).

· ObjectID — This numerical attribute appeared to be an index of all of the service requests included in the dataset.

· Description — A short text description of the work being performed.

· Status — A text description of the service request status; the most common being Closed, Pending, Open, and INSPCOMP.

· InitiatedDate — This date field seems to be when the issue was first submitted.

· ClosedDate — This date field seems to be when the service request was closed, and all work was completed.

· DaysToClose — This text field is the amount of time it took for the service request to be closed.

· Ward — This numerical value represents the ward in the city based on established city wards.
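A quick way to sanity-check these guesses is to look at the inferred types and the most common values in each column. This is a sketch of that inspection step rather than output from my notebook.

# Check the inferred type of each kept column
print(service_requests.dtypes)

# See which statuses appear most often (e.g. Closed, Pending, Open, INSPCOMP)
print(service_requests['STATUS'].value_counts().head())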

Next, I needed to eliminate most of the rows in the data frame; there were over 1.3 million entries dating back to 2004. I decided to only look at data from 2020 onwards, leaving me with a more manageable 264 thousand rows.

def convertToDate(row):
    # Parse the text date fields into proper pandas timestamps, one row at a time
    row.INITIATEDDATE = pd.to_datetime(row.INITIATEDDATE)
    row.CLOSEDDATE = pd.to_datetime(row.CLOSEDDATE)
    return row

service_requests = service_requests.apply(convertToDate, axis=1)

# Keep only requests initiated from 2020 onwards
service_requests_2020 = service_requests[service_requests.INITIATEDDATE.dt.year > 2019]

This required converting the text date fields into actual dates, which was a time-consuming process, before the values could be filtered by date. If you are going to replicate this process, definitely make a copy of your data at this point and output it to a CSV. With a manageable amount of data, I only needed to take one more step before I could start looking at the data and visualizations. The ward value mentioned above divides the city into a few chunks, with a numerical value representing each ward. One row in the data had a non-numerical value of 'N' which had to be removed. While I could have located the ward from the longitude and latitude of the address listed, I chose to simply drop this row since what 'N' meant was not explained anywhere.

service_requests_2020 = service_requests_2020[service_requests_2020.WARD != 'N']
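For the checkpoint suggested above, a minimal version might look like the following; the filename is my own choice, and the commented-out line notes a vectorized conversion that would be much faster than the row-wise apply if you are starting from scratch.

# Save a checkpoint so the slow date conversion never has to be repeated
# (filename is my own choice, not from the original notebook)
service_requests_2020.to_csv('service_requests_2020_checkpoint.csv', index=False)

# Note: converting whole columns at once is much faster than a row-wise apply, e.g.:
# service_requests['INITIATEDDATE'] = pd.to_datetime(service_requests['INITIATEDDATE'])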

Once that row had been dropped and the ward value was fully numeric, I could begin to create visualizations based on the ward column and the 'days to close' column. I started with a box and whisker plot so I could tell whether the minimum or maximum of any ward's data was drastically out of line with the other wards. Wards 4 and 5 seemed to have the longest typical days to close, especially compared to wards 2 and 7.

import seaborn as sns

# Days to close, broken out by ward
sns.boxplot(x=service_requests_2020['WARD'], y=service_requests_2020['DAYSTOCLOSE'])
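As a numeric companion to the plot, the per-ward medians can be computed directly; this is a sketch rather than output from the original analysis.

# Median days to close per ward, largest first
print(service_requests_2020.groupby('WARD')['DAYSTOCLOSE'].median().sort_values(ascending=False))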

Next, I wanted to see if there were any seasonal challenges with closing service requests in a timely manner, especially around the end of the year when many employees may be off for the holidays. To take a look at this I created another box and whisker plot, this time plotting 'days to close' against the month of the year. This gave me a very surprising result, the opposite of what I was expecting: the days to close clearly went down as the year progressed.

sns.boxplot(x=service_requests_2020.INITIATEDDATE.dt.month, y=service_requests_2020['DAYSTOCLOSE'])

To dig a bit deeper into this, I started to look at how many service requests were opened each month. The number of service requests submitted at the end of the year did drop, but not noticeably more than in June and July, where there was also a dip.

# Count of service requests opened in each month
sns.countplot(x=service_requests_2020.INITIATEDDATE.dt.month)

I am not sure exactly what caused the downward trend in days to close as the year went on; this is something that would need further in-depth analysis to determine the exact cause. Moving on from the mysterious downward trend, I wanted to make one final visualization, plotting the service requests on a map to see whether the physical distribution showed anything. After the visualization had been completed, there were too many points to draw any conclusions from the map.

import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.geometry import Point

# Load the DC boundary shapefile
dc_map = gpd.read_file('shapefile/Washington_DC_Boundary.shp')

# Build point geometries from the longitude/latitude columns
geometry = [Point(xy) for xy in zip(service_requests_2020['LONGITUDE'], service_requests_2020['LATITUDE'])]
geo_df = gpd.GeoDataFrame(service_requests_2020, crs='EPSG:4326', geometry=geometry)

# Plot the city boundary in light grey and color each request by ward
fig, ax = plt.subplots(figsize=(20, 20))
dc_map.to_crs(epsg=4326).plot(ax=ax, color='lightgrey')
geo_df.plot(ax=ax, column='WARD')

There are a few possible errors with my data and analysis. First, I was unable to locate a data dictionary for the dataset. One of the main fields I focused on was 'InitiatedDate', which I assumed was when the service request was first put in; if this assumption is incorrect, then the graphs would be incorrect as well. In addition, my decision to use the established wards could have introduced some bias into the analysis, since the city has large areas of parks, rivers, and water that are also included in each ward. A measure of road miles per square mile inside each ward, compared to the number of service requests, could be a better option, but that information was not included in this data set.
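If road mileage per ward were available, the normalization I have in mind would be straightforward. The sketch below uses made-up placeholder numbers purely to show the shape of the calculation (requests per mile of road in each ward); it is not real data.

# Hypothetical road mileage per ward -- placeholder values, not real figures
road_miles = pd.Series({1: 350, 2: 300, 3: 320, 4: 380, 5: 360, 6: 340, 7: 330, 8: 310})

# Requests per ward, then requests per mile of road
requests_per_ward = service_requests_2020['WARD'].astype(int).value_counts().sort_index()
requests_per_road_mile = requests_per_ward / road_miles
print(requests_per_road_mile)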

Full Code available here: GitHub.

The full dataset is available here: Google Drive
