Wrestling to Define Myself as a GIS Data Scientist
I am beginning to understand what a data scientist is at this point. I’ve held this title for a little over a year and I wrestle with it more often than I’d like to admit. On the surface, its definition is easily uncovered with a quick Google search and a Wikipedia entry: ‘Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured…’. But this defines every job I’ve had, short of mucking stalls on a farm, since I was an undergrad Physics major at a liberal arts school.
‘Data Science’ is redundant in my mind. Is there a science where data doesn’t define it? Perhaps that is the root of my confusion. To the dismay of those in my personal life, I seek identity through my job. So in wrestling with what my title (printed neatly under my name on a business card) means, I seek a relationship with its intention. As a PhD Astrophysicist, ‘Scientist’ was always concrete and undisputed. But in my new role as ‘Data Scientist’, I’m unsure if the science implication is as universally concrete. I’ve met a few data scientists at this point and they encompass psychology undergrads and PhD statisticians alike. This is the inherently bothersome predicament: the title is un-quantitative in its application to what is (as it turns out) a highly quantitative role.
In this ruminating, I have settled on what I believe is an archetypical (GIS) data science problem that defines my existence and (current) knowledge base at OmniEarth. Our data science team has been very interested in evaluating residential water use in CA. As it turns out, a significant water shortage is troubling the Western U.S. Our end-to-end, data-driven solution has a clear goal: what is the one action an individual could take based on the (hidden) combination of data sources, mathematics, and coding? This is the archetype of GIS data science I identify with: (1) ingesting high-resolution aerial and satellite multi-spectral imagery (insert astronomy degree utility here + cloud computing; there are a lot of pixels), (2) performing land cover machine learning to identify objects based on spectral and textural features (insert machine learning and statistics), (3) fusing regional weather information such as evapotranspiration as a function of land cover type (insert meteorology), (4) evaluating tabulated vector information that identifies parcel-level details about residential districts (insert data management, cleansing, organizing and a lot of Python pandas/geopandas coding), to (5) yield a per-parcel budget of the outdoor water needed by each of the 12+ million households in CA to maintain their landscaping, pools, etc. Once this is generated, the budget may be compared with actual residential usage. This, now, is the usefulness of the solution: armed with the difference between budget and actual use, individuals can take action to modify their water use based on scientific data assessment and not because someone just told them to stop watering their lawn.
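To make step (5) and the budget-versus-actual comparison a little more concrete, here is a toy pandas sketch. It is only an illustration under loud assumptions, not the model used at OmniEarth: the parcel IDs, land cover classes, areas, per-class coefficients, the single reference evapotranspiration depth, the metered usage values, and the function name `outdoor_water_budget` are all made up. The only idea taken from the text is that a parcel's budget depends on how much of each land cover type it contains and on evapotranspiration.

```python
import pandas as pd

# Hypothetical output of steps (2) and (4): land cover area per parcel, in square metres.
landcover = pd.DataFrame({
    "parcel_id": ["A", "A", "B", "B", "B"],
    "cover": ["turf", "pool", "turf", "shrub", "pool"],
    "area_m2": [100.0, 30.0, 50.0, 80.0, 20.0],
})

# Assumed, illustrative water-demand coefficients per land cover class (unitless).
coefficient = pd.Series({"turf": 0.8, "shrub": 0.5, "pool": 1.0}, name="coefficient")

# Assumed reference evapotranspiration depth for the region, in metres per year.
ET0_M_PER_YEAR = 1.2


def outdoor_water_budget(landcover, coefficient, et0_m):
    """Sum area * coefficient * ET depth over each parcel's land cover (cubic metres)."""
    df = landcover.join(coefficient, on="cover")
    df["budget_m3"] = df["area_m2"] * df["coefficient"] * et0_m
    return df.groupby("parcel_id")["budget_m3"].sum()


budget = outdoor_water_budget(landcover, coefficient, ET0_M_PER_YEAR)

# Hypothetical metered outdoor use per parcel, in cubic metres per year.
actual = pd.Series({"A": 150.0, "B": 100.0}, name="actual_m3")
actual.index.name = "parcel_id"

# Positive values mean the parcel used more water than its budget.
report = pd.concat([budget, actual], axis=1)
report["over_budget_m3"] = report["actual_m3"] - report["budget_m3"]
print(report)
```

The last column is the actionable piece: a parcel with a positive `over_budget_m3` is using more than its landscape would need under these assumptions, and a negative value means it is under budget.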
In my year of reflection as an active GIS data scientist, I will venture to define my role as a data scientist: an epoxy of Franken-code and tears which produces a semi-curated emporium of digital awesomeness that can be sold for money (sometimes), but more importantly, provides a piece of actionable information anyone, technical or not, can understand and utilize.