Towards a Priority Framework for Open Data

Open Data Toronto
Feb 6, 2020

Original article by Ryan Garnett, republished with permission

1.0: Purpose

The City of Toronto developed an Open Data Master Plan that was unanimously accepted by Toronto City Council on January 31, 2018. The Open Data Master Plan provides strategic direction for the improvement and delivery of the City’s open data program. The plan, available here, consists of four themes and 13 actions, with one action focusing exclusively on the prioritization of data releases.

1.1: Theme 1b – Prioritize Data Releases

The prioritization framework, including expected value scores and weighting methodology, will be shared with the public for enhanced visibility into planned open data releases.

Where we are

  • Low visibility into release prioritization and publication status reduces transparency
  • Corporate data inventory is currently in progress and can be leveraged as a starting point
  • Centralized but manual process to request and provide feedback on datasets limits scaling capacity
  • Community input on release prioritization not captured via a formal mechanism

What we need to do

  • Publish the corporate data inventory with regular refreshes
  • Identify the publishing status (i.e. In Review, Restricted, or Open) for each item in the corporate data inventory, with estimated release dates
  • Provide a public listing of all open dataset requests
  • Identify where open datasets link to strategic initiatives
  • Target open data releases around key civic issues
  • Publish the prioritization framework, including expected value scores and weighting methodology

What we’ll measure

  • Degree to which open data released enables the City to address key civic issues
  • Gaps in open data around key civic issues
  • Publication of datasets in corporate data inventory
  • Alignment between demand for data, data released, and prioritization method

2.0: The Framework

What is the priority framework? The Open Data team receives numerous requests to publish new data, as well as to update existing datasets. Currently, data releases and updates are handled in the order the requests are received, prioritizing the request date rather than the civic benefit of the dataset.

While the Open Data team believes that all available data should be made open, the team is focusing on releasing datasets that provide value for the City’s residents, businesses, visitors, and employees. To support this, the team has developed a prioritization framework to assist with objectively ranking dataset priority against a set of criteria, organized into four groups, each containing different elements:

  • Civic Issue
  • Requester
  • Output
  • Source

The framework will continue to be developed and will evolve over time. It has been designed so that elements can be added, modified, or removed based on feedback and direction from Council and the public. Each element and group is assigned a value and a weight, which are combined in a single calculation, allowing the process to be repeatable and objective.
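
The short sketch below illustrates the calculation for one hypothetical request: each group receives a score from the elements selected in the request, each group score is multiplied by its group weight, and the weighted scores are summed into a single priority value. The element values and weight factors shown are those defined in Section 3.2.

#Sketch of the weighted-score calculation for one hypothetical request
#Element values and weight factors are those defined in Section 3.2
requesterScore <- 0.75  #requested by the public
civicIssueScore <- 1    #aligns with one civic issue
sourceScore <- 0.25     #no source system connection
outputScore <- 0.61     #education output
#Weighted sum of the group scores gives the priority score
priorityScore <- (requesterScore * 0.4) + (civicIssueScore * 0.3) +
  (sourceScore * 0.1) + (outputScore * 0.2)
priorityScore
## [1] 0.747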

2.1: Civic Issue

The Open Data team heard a lot of feedback during the development of the Open Data Master Plan around civic issues and open data publishing. The information gained during the Open Data Master Plan development phases resulted in the identification of five actions (1b, 2c, 3a, 3b, and 3c) and eight deliverables tied to civic issues. Aligning with strategic initiatives identified by Toronto City Council, the Open Data team has decided on the following five civic issues:

  • Affordable housing
  • Climate change
  • Fiscal responsibility
  • Mobility
  • Poverty reduction

Further information on the Open Data civic issue initiative can be found here.

2.2: Output

Releasing open data is only the first step; the benefits are realized when datasets are activated to produce a beneficial and consumable output. Outputs have been characterized into the following six elements:

  • Application development
    Consists of outputs related to the creation of software applications (e.g. mobile apps, visualizations, or web apps) or commercial business ventures.
  • City
    Consists of outputs related to the creation of City reports, studies, or other documents, which can be developed by either internal or external groups.
  • Education
    Consists of outputs related to the development of educational courses at the K-12 (public or private), college, or university level.
  • Media
    Consists of outputs related to the delivery of media-based products (e.g. newspaper articles, blog posts, or TV/radio interviews).
  • Personal
    Consists of outputs related to actions that primarily have a personal benefit (e.g. personal development, learning, interest, or hobbies).
  • Research
    Consists of outputs related to the enhancement of academic research projects.

2.3: Requester

  • Council
  • Decision Support: Open data has been published to coincide with the release of City initiatives that used data-driven approaches to influence a process or policy direction.
  • Public
  • Other

2.4: Source

This group relates to the level of effort required to publish. A source system connection allows the automated publishing process to be used, unlocking a wide range of enhanced features on the Open Data Portal.

Table 1: Priority Element Weight Factor
Table 2: Priority Group Weight Factor

The input elements and groups are used within an equation that produces a priority score for each requested dataset. This score is used to rank each request (new or update), allowing for an objective evaluation of every request.

Priority Framework Equation: Rds = (rScore × rWF) + (cScore × cWF) + (sScore × sWF) + (oScore × oWF)
Table 3: Priority Framework Equation Codes
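
For example, using the weight factors and element values defined in Section 3.2, a dataset requested by Council (rScore = 1.0), aligned with one civic issue (cScore = 1.0), with a source system connection (sScore = 1.0), and intended for application development (oScore = 0.69) would score Rds = (1.0 × 0.4) + (1.0 × 0.3) + (1.0 × 0.1) + (0.69 × 0.2) = 0.938.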

3.0: Test Evaluation of the Framework

3.1: Setting up the Environment in RStudio

library(tidyverse)
## -- Attaching packages ------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.0
## v tibble  2.0.1     v dplyr   0.8.0.1
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ---------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
#Import data
odpSource <- read_csv("C:/Temp/openData/OpenDataPriorityFramework.csv")
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## `Dataset name` = col_character(),
## Division = col_character(),
## `Who is requesting the dataset` = col_character(),
## `Identify the civic issue that the request aligns to` = col_character(),
## `Is there a source system connection` = col_character(),
## `Category of outputs` = col_character()
## )
#Rename columns
odpSource <- odpSource %>%
drop_na() %>%
rename(dataset = `Dataset name`) %>%
rename(division = Division) %>%
rename(requestor = `Who is requesting the dataset`) %>%
rename(civicissue = `Identify the civic issue that the request aligns to`) %>%
rename(datasource = `Is there a source system connection`) %>%
rename(output = `Category of outputs`)
odpSource

3.2: Framework Weighting

#Framework weight factors
#Requester weight factor
rWF <- 0.4
#Civic issue weight factor
cWF <- 0.3
#Source weight factor
sWF <- 0.1
#Output weight factor
oWF <- 0.2
#Requester variables
r1 <- 1
r2 <- 0.85
r3 <- 0.75
r4 <- 0.25
#Civic issue variables
c1 <- 1
c2 <- 1
c3 <- 1
c4 <- 1
c5 <- 1
#Source variables
s1 <- 1
s2 <- 0.25
#Output variables
o1 <- 0.69
o2 <- 0.75
o3 <- 0.61
o4 <- 0.59
o5 <- 0.37
o6 <- 0.59

4.0: Framework Variables

4.1: Requestors Entries

#Clean requestor data
odpR <- odpSource %>%
select(dataset, requestor)

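#Split the comma-separated requester responses into separate columns (r1 to r4)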
odpR <- odpR %>%
separate(requestor, into = c("r1", "r2", "r3", "r4"), sep = ",")
## Warning: Expected 4 pieces. Missing pieces filled with `NA` in 11 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
#Convert NA values to 0
odpR[is.na(odpR)] <- 0
#Function to remove trailing whitespace
cTrim <- function(a) {
str_trim(a, side = "both")
}
#Run whitespace function on columns 2 to 5
odpR <- data.frame(lapply(odpR[2:5], cTrim))
odpR
#Prepare for rScore
#Function to recode text to numeric
cvtRFun <- function(R) {
case_when(
R == "Council" ~ r1,
R == "Decision support" ~ r2,
R == "Public" ~ r3,
R == "0" ~ 0,
TRUE ~ r4
)
}
#Apply convert function to all requester columns
odpR <- data.frame(lapply(odpR, cvtRFun))
#Calculate rScore
odpR <- odpR %>%
mutate(rScore = r1 + r2 + r3 + r4)
odpR

4.2: Civic Issue Entries

#Clean civic issue data
odpC <- odpSource %>%
select(dataset, civicissue)

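#Split the comma-separated civic issue responses into separate columns (c1 to c5)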
odpC <- odpC %>%
separate(civicissue, into = c("c1", "c2", "c3", "c4", "c5"), sep = ",")
## Warning: Expected 5 pieces. Missing pieces filled with `NA` in 11 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].
#Convert NA values to 0
odpC[is.na(odpC)] <- 0
#Function to remove trailing whitespace
cTrim <- function(a) {
str_trim(a, side = "both")
}
#Run whitespace function on columns 2 to 6
odpC <- data.frame(lapply(odpC[2:6], cTrim))
odpC
#Prepare for cScore
#Function to recode text to numeric
cvtCFun <- function(C) {
case_when(
C == "Affordable housing" ~ c1,
C == "Climate change" ~ c2,
C == "Fiscal responsibility" ~ c3,
C == "Mobility" ~ c4,
C == "Poverty reduction" ~ c5,
TRUE ~ 0
)
}
#Apply convert function to all civic issue columns
odpC <- data.frame(lapply(odpC, cvtCFun))
#Calculate cScore
odpC <- odpC %>%
mutate(cScore = c1 + c2 + c3 + c4 + c5)
odpC

4.3: Source Entries

#Clean source data
odpS <- odpSource %>%
select(dataset, datasource)
#Calculate sScore
odpS <- odpS %>%
mutate(sScore = case_when(
datasource == "Yes" ~ s1,
TRUE ~ s2
))
odpS

4.4: Output Entries

#Clean output data
odpO <- odpSource %>%
select(dataset, output)
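#Extract each selected output category keyword into its own column (NA when not selected)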
odpO <- odpO %>%
mutate(o1 = str_extract(output, "Application")) %>%
mutate(o2 = str_extract(output, "City")) %>%
mutate(o3 = str_extract(output, "Education")) %>%
mutate(o4 = str_extract(output, "Media")) %>%
mutate(o5 = str_extract(output, "Personal")) %>%
mutate(o6 = str_extract(output, "Research"))
#Convert NA values to 0
odpO[is.na(odpO)] <- 0
odpO
#Prepare for oScore
#Function to recode text to numeric
cvtOFun <- function(O) {
case_when(
O == "Application" ~ o1,
O == "City" ~ o2,
O == "Education" ~ o3,
O == "Media" ~ o4,
O == "Research" ~ o5,
O == "Personal" ~ o6,
TRUE ~ 0
)
}
#Apply convert function on columns 3 to 8
odpO <- data.frame(lapply(odpO[3:8], cvtOFun))
#Calculate oScore
odpO <- odpO %>%
mutate(oScore = o1 + o2 + o3 + o4 + o5 + o6)
odpO

4.5: Rds Score

#Create score dataframe by combining from each of the dataframes
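#rev(x)[1] selects the last column of each data frame, i.e. the calculated score column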
odpScore <- cbind(odpSource[1:3], rev(odpR)[1], rev(odpC)[1], rev(odpS)[1], rev(odpO)[1])
odRds <- odpScore %>%
mutate(Rds = (rScore * rWF) + (cScore * cWF) + (sScore * sWF) + (oScore * oWF)) %>%
select(Timestamp, dataset, division, Rds)
odpScore
#Theoretical priority values
#Low: 0 - 1.2; Medium: 1.21 - 2.4; High: 2.41 - 3.6
odRds <- odRds %>%
mutate(priorityT = case_when(
Rds <= 1.2 ~ "Low",
Rds > 2.4 ~ "High",
TRUE ~ "Medium"
)) %>%
arrange(desc(Rds))
#Actual priority values
#Low: 0.683 - 1.138; Medium: 1.138 - 1.593; High: 1.593 - 2.049
odRds <- odRds %>%
mutate(priorityA = case_when(
Rds < 1.138 ~ "Low",
Rds >= 1.593 ~ "High",
TRUE ~ "Medium"
)) %>%
arrange(desc(Rds))
odRds
inner_join(odpScore, odRds) %>%
arrange(desc(Rds))
## Joining, by = c("Timestamp", "dataset", "division")
odRds %>%
ggplot() +
aes(x = priorityA) +
geom_bar(fill = "#0c4c8a") +
labs(title = "Open Data Publishing Priority",
x = "Priority Classes",
y = "Count") +
theme_minimal()
Open Data Publishing Priority Classes
#Value vs. Effort visualization
VE <- odpScore %>%
mutate(vScore = (rScore + cScore + oScore)) %>%
mutate(value = (rScore + cScore + oScore)/max(rScore + cScore + oScore)) %>%
mutate(ease = sScore)
VE$id <- seq(1, 11)
VE
ggplot(data = VE) +
aes(x = ease, y = value) +
geom_point(color = "#0c4c8a", size = 3.5) +
labs(title = "Ease of publishing vs. Dataset Value",
x = "Ease of publishing (low to high)",
y = "Dataset value (low to high)") +
geom_hline(yintercept = 0.5) +
geom_vline(xintercept = 0.5) +
xlim(0, 1) +
ylim(0, 1) +
#geom_text(aes(label = id),hjust = -1, vjust = 0) +
theme_minimal()
Ease of Publishing (Low to High) vs. Dataset Value (Low to High)

5.0 Next Steps

The first step of the Open Data priority framework provides a reproducible and objective process for dataset publication. We plan to integrate additional elements, such as data quality, to improve the uptake of open data. Additionally, the Open Data team intends to publish the prioritized dataset list for public consumption, so that users are aware of what is coming up.

The priority framework was developed to be modular, allowing users to adjust the weighting factors as priorities change. The team envisions community members and City staff using the priority framework to proactively assess open datasets and understand where their request fits according to the City’s strategic directions.
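
As a hypothetical illustration of that modularity, shifting weight from the requester group to the civic issue group only requires changing the weight factors before recomputing the score on the odpScore data frame built in Section 4.5:

#Hypothetical alternative weighting that emphasizes civic issue alignment
rWF <- 0.2
cWF <- 0.5
#sWF (0.1) and oWF (0.2) remain unchanged
odRdsAlt <- odpScore %>%
  mutate(Rds = (rScore * rWF) + (cScore * cWF) + (sScore * sWF) + (oScore * oWF)) %>%
  arrange(desc(Rds))
odRdsAlt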

For questions about this data story, please contact the City of Toronto Open Data team via Twitter at Open Data — Toronto, or through email via opendata@toronto.ca.
