Data Science Teams: GEOINT Analysis of the Future
By Rafael de Ameller and Michael Hauck
This article was originally published in USGIF’s State & Future of GEOINT Report 2017.
Who would have thought an augmented reality game would reach 75 million users within a month of its release, and that the average user would spend 43 minutes a day in that artificial reality? In July 2016, Pokémon GO rose to the No. 1 spot in both the Apple and Android app stores in record-breaking time. It is a harbinger of the future, and a prime example of the concept of immersive GEOINT discussed in the 2016 State of GEOINT Report.
Pokémon GO, however, is just part of the story. The big news is the continued exponential increase in data, much of which includes location information: where the data was created, where it is stored, where it is used, or its relationship to other location-based data. The volumes of data are mind-boggling, and so is the rate at which they grow. The new reality is that far more data is available than there are human eyes to observe it. The days of a team of experts analyzing the latest image from a one-of-a-kind source are over for all but the most denied areas.
In this new reality, the GEOINT Community will need teams of diverse experts because the work being performed is increasingly complex, thanks in part to the proliferation of multimodal collection systems that require some level of expertise for each collection component. Moreover, customer expectations have now been influenced by the likes of Pokémon GO.
What Customers Want
Imagine public safety stakeholders knowing, before a hurricane makes landfall, which areas face an increased risk of additional flooding and fallen trees due to soil saturation, and would thus be susceptible to outages or road closures. Emergency managers could proactively deploy personnel. For example, data science teams could exploit NASA’s Soil Moisture Active Passive (SMAP) Earth observation mission for this purpose.
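The pre-landfall screening described above can be sketched as a simple rule that combines a soil-moisture layer with forecast rainfall. The thresholds, field names, and county values below are illustrative assumptions for the sketch, not values drawn from actual SMAP products:

```python
# Hypothetical pre-landfall risk screen: flag areas where saturated soil
# coincides with heavy forecast rainfall. Both thresholds are assumed.
SATURATION_THRESHOLD = 0.40   # volumetric soil moisture (m^3/m^3), assumed
RAIN_THRESHOLD_MM = 100.0     # forecast storm-total rainfall (mm), assumed

def flag_high_risk(areas):
    """Return names of areas where saturated soil meets heavy forecast rain."""
    return [a["name"] for a in areas
            if a["soil_moisture"] >= SATURATION_THRESHOLD
            and a["forecast_rain_mm"] >= RAIN_THRESHOLD_MM]

# Invented example inputs: only County A exceeds both thresholds.
areas = [
    {"name": "County A", "soil_moisture": 0.45, "forecast_rain_mm": 180.0},
    {"name": "County B", "soil_moisture": 0.22, "forecast_rain_mm": 210.0},
    {"name": "County C", "soil_moisture": 0.41, "forecast_rain_mm": 95.0},
]
print(flag_high_risk(areas))  # prints ['County A']
```

In practice the thresholds themselves would come from the domain experts on the team; the remote sensing specialists would supply the moisture layer, which is exactly the division of labor the team approach argues for.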
There are hurdles, however, to practical use of Earth observation data in areas such as emergency management. Much Earth science research data is shared across departments, agencies, and the public, yet the data sets are often not accessible to decision-makers. One reason is a lack of expertise in accessing the data in a relevant and timely fashion. Data science teams would need both remote sensing and emergency management expertise.
Cloud-based, Software-as-a-Service data visualization solutions combined with open data standards provide a way forward, facilitating the necessary collaboration. They allow consumers, industry, governments, and non-profit organizations around the world to integrate GEOINT into their data visualization tools and operations.
Currently, most organizations store and manage data in their own manner, with little or no valuable metadata or expertise to exploit the information outside of their own systems. Cloud technology providers such as Amazon Web Services allow organizations to benefit from massive economies of scale on shared infrastructure, facilitating data accessibility while reducing costs.
For example, the National Weather Service handles massive amounts of data and data products. However, the agency is moving away from a product-based operating model to one in which interpretation and relevant information are provided through Impact-based Decision Support Services (IDSS). Success depends on relationships with the weather service’s core partners, such as the emergency management and water resources management communities. Those users need more actionable weather, water, and climate information from the agency. So, the agency is placing staff closer to where IDSS workloads are located, deepening the relationship between the agency and stakeholders. Essentially, it is building ad hoc data science teams.
Advances in Imagery Compel a Team Approach
One of the factors that compel a team approach is the explosion of geospatial data. This has come as a result of advances in technologies developed primarily outside the geospatial industry. New materials, higher capacity batteries, miniaturization of electronic components, more powerful processors, increased bandwidth, the proliferation of wireless-connected smartphones, open-source software, the global use of social media, and cloud platforms all contribute. In the geospatial industry, these technologies have underpinned the development of low-cost sensors that can be deployed on commercial unmanned aerial systems (UAS), terrestrial mobile mappers, and small satellites. Today, even small companies can command a specialized imaging satellite that only the wealthiest nation states could afford just a decade or two ago.
Newly available commercial sensors facilitate new commercial imagery products and services. For example, short-wave infrared (SWIR) can see fire through smoke. Synthetic aperture radar allows collection of topographic data through cloud cover. Multi-band LiDAR facilitates seamless bathymetric and topographic mapping in transition zones. High-definition video cameras allow a continuous, real-time look at the Earth from space. New multi-band visible and infrared imagers can completely cover half of the Earth every few hours.
New sensors deployed on new platforms increase the availability of data from various points of view, ranging from space-based observation to ground-based, from kilometer to millimeter resolution, from whole-Earth images to tightly defined areas of interest. Inexpensive commercial UAS can now carry sophisticated LiDAR sensors to create close-range, high-precision maps. Airplanes carrying high-resolution thermal sensors can see cracks in dams and pipes. Robotic tunneling vehicles can map underground spaces. Geospatially registered, full-motion video is commercially available from vehicles in space, in the air, and on the ground. As technology advances, an unprecedented array of commercial sensors can now be placed on a similarly unprecedented array of platforms to collect imagery never before possible.
This means it is increasingly difficult for a single specialist to possess all of the necessary expertise. It also means organizations are driven to employ generalists because they cannot find, or cannot afford, all of the specialists needed to cover each source technology or data type. While the proliferation of sensors and data creates new opportunities for analysis, it also makes analysis more difficult if done the “old-fashioned” way. A new way is needed: one that leverages multidisciplinary teams.
Beyond Imagery: Geospatial Big Data
In addition to imagery, consider the wide variety of other location-based data being generated every day by human and machine activity. Reflect on the activities and interactions of people revealed in social media. Factor in the location information intrinsic to telecommunications traffic — particularly mobile. Add the measurements and actions of infrastructure elements managed with Supervisory Control and Data Acquisition systems. Couple that with the engineering information contained within infrastructure asset inventories. Add geo-tagged banking and point-of-sale transactions. Finally, at a high level, integrate the economic and social information contained within census data.
The spatial, spectral, and temporal resolution of commercially available imagery and other location-based data continues to progress with no end in sight. As the volume of data has increased, so has the complexity of the data, which means the potential number of cross-correlations increases exponentially. Certainly, the scale of data in question is well beyond terabytes and petabytes. Although the term “big data” is without a consensus definition, geospatial data as described above certainly qualifies, given the volume, velocity, and variety of data in flux. This situation presents a compelling need for a new analytic framework appropriate for the age of “geospatial big data.”
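The growth in potential cross-correlations can be made concrete with a quick count. In this sketch (the layer counts are arbitrary illustrations), pairwise correlations among n data layers grow quadratically, while the number of multi-layer combinations grows exponentially:

```python
# Counting relationships among n data layers.
def pairwise(n: int) -> int:
    """Number of distinct layer pairs: n choose 2."""
    return n * (n - 1) // 2

def multi_layer(n: int) -> int:
    """Subsets of two or more layers: all 2**n subsets minus the
    empty set and the n single-layer subsets."""
    return 2 ** n - n - 1

for n in (10, 100, 1000):
    print(n, pairwise(n), multi_layer(n))
```

Ten layers yield only 45 pairs but over a thousand multi-layer combinations; at a hundred layers the combination count is astronomically large, which is why automated analysis becomes unavoidable.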
Geospatial Data Science
There is currently no consensus definition of “data science,” although the term is routinely applied to the analysis of the equally loosely defined “big data.” Arising from the field of statistics in the 1960s, data science has recently come to encompass various interdisciplinary approaches to computer-based analysis. To analyze big data, the data scientist needs to consider the nature of the data: Is it structured or unstructured, streaming or stored, real-time or historic, qualitative or quantitative, measured or simulated, real or virtual? Other fields related to data science include data mining, machine learning, knowledge discovery, and predictive analytics.
Given the volume of data, much of the analysis must be automated, so data science has at its heart algorithms to extract knowledge from data. More than simple coding ability is needed to write a useful algorithm. The analysis is likely to require whole libraries of algorithms — which is one reason open-source software and data are so appealing to many data scientists, who also tend to appreciate the massive storage and computation power of cloud platforms. Visualization is where today’s data science seems to fall short. The visualizations look great in the movies, but in real life are incredibly difficult to generate for general problems for which visualizations have not already been designed.
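As a minimal illustration of such an algorithm, consider a toy hotspot detector that bins geotagged events into 0.1-degree grid cells and flags busy cells. The coordinates, cell size, and threshold are invented for the example:

```python
from collections import Counter

def hotspots(points, min_count=3):
    """Bin (lat, lon) pairs into 0.1-degree cells and return the
    cells containing at least min_count events."""
    counts = Counter((round(lat, 1), round(lon, 1)) for lat, lon in points)
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Invented geotagged events: three clustered near one cell, one outlier.
events = [(38.90, -77.04), (38.91, -77.03), (38.94, -77.01),
          (40.71, -74.01)]
print(hotspots(events))  # prints {(38.9, -77.0): 3}
```

Real workloads would swap this toy binning for library implementations of clustering or density estimation, which is precisely why whole libraries of algorithms, and the open-source ecosystems that supply them, matter to data scientists.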
While visualization certainly includes traditional, two-dimensional maps and charts, some emerging technologies for visualizing data have gained commercial traction. Sophisticated analytic solutions include Palantir, Recorded Future, and Spotfire, but there are simple, consumer-facing visualizations as well. Examples include mobile games such as Pokémon GO, storytelling applications such as Story Maps, and navigation units such as Garmin or apps such as Waze. The exciting new frontier is immersive visualization, for which the technology is driven largely by computer games and industrial design and simulation. Anyone who has donned 3D glasses to watch a Pixar animated film has experienced storytelling within a virtual world. If the virtual world is a replica of the real world, then virtual walk-throughs are possible for places one could not otherwise visit. Alternatively, as one walks through the physical world with augmented reality technology, one virtually “sees” the pipes behind the walls of a building or labels for objects that have been geo-located. For training and simulation, what one sees can be altered so that, for example, an electrician could see a live wire as red-hot. Thanks to consumer and high-end industrial markets, these data visualization technologies are rapidly becoming commercial off-the-shelf products.
What Makes a Data Scientist?
Like any good analyst, a good data scientist asks the right questions, then looks for answers hidden within the data. Therefore, a data scientist should understand the problem domain, be it politics, military science, engineering, economics, etc. The extra dimension of data science is the volume and complexity of data for analysis, so a data scientist should also understand computing and be comfortable with complexity. He or she must be able to perform formatted and free-format data input/output, organize storage of the data, and be able to write or use tools to analyze the data. Writing algorithms to explore the data can be very challenging, because one has to figure out how to turn a question into code. This is detective work, in which discovery leads to discovery, yet not all hypotheses test positive. A good data scientist must be able to dig through the details, while at the same time maintaining a high-level perspective. Once answers are obtained, the data scientist must be able to effectively share the insights with others. He or she must be able to clearly state the question(s), articulate the answers, and explain how the insights were obtained, all of which is best done through visualization. But, possessing the personality and skills to successfully perform all of the aforementioned is a tall order for one person.
The Case for Data Science Teams
Since few people possess all data science knowledge, skills, and abilities, teams will be required to fully exploit all that data science has to offer. What constitutes a good team depends on the problem at hand, but the most important player is someone who knows how to ask the right questions: likely a domain expert, ideally with good communication skills and a curiosity that is not easily satisfied. Geographers, historians, political scientists, military scientists, engineers, geologists, and others would likely fit this profile. In addition to domain specialists, useful expertise on a data science team includes statistics, signal processing, database architecture, visualization, and modeling and simulation. Ingesting, organizing, and processing data each require specialized expertise as well. Where can one find such people? Academic departments in mathematics, physics, computer science, electrical engineering, operations research, philosophy, architecture, economics, marketing, and linguistics, to name a few. An effective data science team must be able to wrestle with data in its full complexity. That is the new reality.
Perhaps this is the future of GEOINT analysis. Data science teams will interact with stakeholders to find answers to their hardest problems, empowering individuals to make better decisions when interacting with GEOINT data. The products created by data science teams will be designed to be easy to digest and share among decision-makers, subject matter experts, and all who could benefit from the information. The data will be accessible anytime and anywhere, regardless of stakeholders’ location or device. Well-rounded data science teams and ubiquitous data will help all stakeholders unlock the full value of geospatial big data and the GEOINT it makes possible.