My Internship Experience at DCP
In January of this year, I started my first internship at the New York City Department of City Planning, working within the Enterprise Data Management Division on the Data Engineering team. It has been a great experience, which I want to share with you in this blog post, including my contributions to the ongoing Data Library project.
Data Library is Data Engineering’s newest system for storing input data. It consists of a Python package that downloads, reformats, and stores a given dataset in our S3 cloud storage. This input data staging system is important because the data used to create the team’s data products comes from different sources and in various formats. Sources include the NYC Open Data API, other government open data stores, and even data scraped from websites. Files arrive in various formats, including CSVs, GeoJSONs, and Shapefiles. Furthermore, spatial data formats and projections vary between spatial datasets.
Before ingesting input datasets to create our data products, such as the Facilities Database, we must first parse, standardize, and save this input data in a central location. Having a consistent format and naming convention across input datasets enables us to easily ingest them into multiple data products. To accomplish this goal, Data Library uses configuration files containing the necessary instructions on how to retrieve, read, store, and reformat a dataset.
My first task on the Data Library project was to write a batch of configuration files like the one sketched below. For a configuration file to work in Data Library, each field must follow a set of rules or contain a specific value. For example, the name field takes only a string as a value, while other fields, such as geometry or url, have predefined schemas (more on this later). Some fields also have restricted values: acl, which contains information on the access type, must be either “public-read” or “private”, and the geometry field must have a valid geometry type supported by GDAL.
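As a rough illustration, a configuration file could look something like the sketch below. The field names mirror the Pydantic schemas discussed later in this post, but the dataset name, URL, and values are hypothetical, and the destination and info sections are only indicative of the kind of content they hold.

name: example_facilities
version: "2021-07-01"
acl: public-read
source:
  url:
    path: https://example.com/data/facilities.csv
  geometry:
    SRS: EPSG:4326
    type: POINT
  options: []
destination:
  geometry:
    SRS: EPSG:4326
    type: POINT
info:
  description: A hypothetical facilities dataset used for illustration.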
After creating and testing a few of these files, I noticed that Data Library had no way to determine whether a given configuration file was valid. If required fields were missing or unexpected values were passed in, the program would still execute, but it would crash when it reached an invalid value, and the user would receive no indication as to why the file failed. This opened up an opportunity for me to build a new feature in Data Library.
In order to validate our files, I made use of Pydantic, a Python library that lets us perform data validation against a specified schema using Python’s type annotations.
These are the imports I used for the validation:
from typing import List, Literal
from pydantic import BaseModel, ValidationError
A basic validation schema would look like this:
class Url(BaseModel):
    path: str
    subpath: str = None
In the example above, we are specifying that a url field must have a path as a string and may have an additional subpath, which is not required (it can be None).
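As a quick, hypothetical check (using the imports above, with made-up values):

# A valid input: path is present, subpath defaults to None.
Url(path="https://example.com/data/facilities.csv")

# An invalid input: path is missing, so Pydantic raises a ValidationError.
try:
    Url(subpath="data")
except ValidationError as e:
    print(e.json())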
We can also restrict possible values like this:
VALID_GEOMETRY_TYPES = ("NONE", "GEOMETRY", "POINT", "LINESTRING",
"POLYGON", "GEOMETRYCOLLECTION", "MULTIPOINT", "MULTIPOLYGON",
"MULTILINESTRING", "CIRCULARSTRING", "COMPOUNDCURVE", "CURVEPOLYGON", "MULTICURVE", "MULTISURFACE")class GeometryType(BaseModel):
SRS: str = None
type: Literal[VALID_GEOMETRY_TYPES]
Here, I passed a Literal to specify that type must be one of the values within VALID_GEOMETRY_TYPES.
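For example (again with made-up values):

# "POINT" is one of the allowed geometry types, so this passes.
GeometryType(type="POINT")

# "TRIANGLE" is not in VALID_GEOMETRY_TYPES, so Pydantic raises a ValidationError.
try:
    GeometryType(type="TRIANGLE")
except ValidationError as e:
    print(e.json())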
A schema may also be passed as a field of another schema like this:
class SourceSection(BaseModel):
    url: Url = None
    socrata: Socrata = None  # Socrata is another schema, not shown here
    script: str = None
    geometry: GeometryType
    options: List[str] = []
In this example, I am referencing the previous Url and GeometryType schemas as fields of a bigger schema describing our source field. Finally, the main schema specifying a configuration file looks like this:
class Dataset(BaseModel):
    name: str
    version: str
    acl: Literal[VALID_ACL_VALUES]  # VALID_ACL_VALUES, DestinationSection, and InfoSection are defined similarly (not shown)
    source: SourceSection
    destination: DestinationSection
    info: InfoSection = None
Now we are able to pass a file as an input and validate it like this:
try:
    input_ds = Dataset(**config)  # config is a dictionary parsed from a .yml file
except ValidationError as e:
    print(e.json())
Here, I try to initialize an instance of the Dataset schema. The input is a dictionary that, in this case, comes from a parsed .yml file. If the input is not valid, Pydantic raises a ValidationError containing useful details that can be viewed as JSON. When parsing a file missing an acl value, the returned error looks like this:
[
    {
        "loc": [
            "acl"
        ],
        "msg": "field required",
        "type": "value_error.missing"
    }
]
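Putting it all together, a minimal end-to-end sketch might look like the following. The file name is hypothetical, PyYAML is assumed to be available for parsing the configuration file, and the code follows the Pydantic v1 API used throughout this post.

import yaml  # PyYAML, assumed here for parsing the .yml configuration file

# Hypothetical path to a configuration file like the one sketched earlier.
with open("example_facilities.yml") as f:
    config = yaml.safe_load(f)

try:
    input_ds = Dataset(**config)
    print(f"{input_ds.name} passed validation")
except ValidationError as e:
    print(e.json())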
I hope you have found my experience interesting, and my demonstration useful if you find yourself working on a similar project. The entire Data Library package, along with the validation module, can be viewed on GitHub. Thanks for reading!