What is a Dow Jones DNA Snapshot, and why does it exist?

Published in Dow Jones Tech, Apr 8, 2019 (3 min read)

By Patricia Walsh, Dow Jones Principal Technology Product Manager

This is the second post in a series focusing on DNA. To read the first blog post, please go here.

Data scientists are charged with using algorithms to find patterns and meaning in large sets of data. A data scientist wants to spend their time tuning models to better inform business decisions, not wrestling with complex integrations. The Data Engineering team was tasked with finding a way to make the over 30-year archive of premium news (with rights cleared for text mining), dating back to the 1950s, available to data scientists. The solution had to be easy to integrate and had to work with varying tech stacks, so that it would not disrupt existing workflows.

The Dow Jones DNA platform, which stores the archive as well as ingests all incoming data, is made available by the Data Engineering team.

A data scientist can run a query on the licensed content to acquire a data set. The data is then made available via a snapshot, which is a downloadable file containing the DJID metadata as well as the article content.

Snapshots are technology agnostic. They can be used on-prem behind a firewall, on a cloud provider such as AWS or GCP, as well as on any hybrid system. For data scientists, this means less time wasted troubleshooting and more time dedicated to solving problems.

The data archive consists of articles from information providers who have given us rights for text mining. Each information provider is identified as a separate source.

A snapshot is created by an API call containing a SQL WHERE clause. Google BigQuery serves as the query engine, acting as a search index.
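
As a rough illustration of what such a call might carry, here is a minimal sketch that serializes a request body around a WHERE clause. The payload shape (`query` / `where`) and the column names `language_code` and `publication_date` are assumptions made for this example; this post does not describe the actual API schema.

```python
import json


def build_snapshot_request(where_clause: str) -> str:
    """Serialize a snapshot request whose only input is a SQL WHERE clause.

    The {"query": {"where": ...}} shape is hypothetical, chosen for illustration.
    """
    if not where_clause.strip():
        raise ValueError("a snapshot request needs a non-empty WHERE clause")
    return json.dumps({"query": {"where": where_clause}})


# Hypothetical column names; substitute the fields your data set actually exposes.
body = build_snapshot_request(
    "language_code = 'en' AND publication_date >= '2018-01-01'"
)
print(body)
```

The point of the sketch is only that the data scientist supplies a filter, not a full query; everything else is handled on the platform side.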

The WHERE clause is passed to the BigQuery backend services as a simple query. BigQuery first checks it against a table listing the sources licensed for text mining; each source with the appropriate rights is appended to the query. The resulting query is the original WHERE clause plus the allowed sources. That query is then run against a table containing the archive, and BigQuery returns the list of indexes that match.
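
The composition step can be sketched in a few lines of plain Python. This is not the platform's code: the rights table is modeled as a dictionary, and the column name `source_code` and the source identifiers are hypothetical.

```python
def restrict_to_licensed(where_clause: str, rights: dict) -> str:
    """Combine a user WHERE clause with the sources cleared for text mining.

    `rights` maps a source identifier to True when text-mining rights exist.
    It stands in for the licensed-sources table described above.
    """
    allowed = sorted(source for source, cleared in rights.items() if cleared)
    if not allowed:
        raise PermissionError("no sources are licensed for text mining")
    # Escape single quotes so each identifier is a valid SQL string literal.
    quoted = ", ".join("'" + s.replace("'", "''") + "'" for s in allowed)
    # `source_code` is an assumed column name, used here for illustration only.
    return f"({where_clause}) AND source_code IN ({quoted})"


rights = {"SRC_A": True, "SRC_B": False, "SRC_C": True}
print(restrict_to_licensed("language_code = 'en'", rights))
# prints: (language_code = 'en') AND source_code IN ('SRC_A', 'SRC_C')
```

Wrapping the original clause in parentheses keeps an `OR` in the user's filter from bypassing the source restriction. A production service would also validate the incoming clause rather than concatenate it directly.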

The returned indexes become the input to a Dataflow job. The job pulls every index in the list so that the corresponding content can be included in a downloadable snapshot. Dataflow runners are available continuously to perform the join and transformations.
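
Conceptually, the join looks something like the sketch below. It uses plain Python rather than Dataflow, and the in-memory `archive` dictionary, the field names, and the whitespace-trimming transformation are all assumptions made for illustration.

```python
def build_snapshot_rows(matched_ids: list, archive: dict) -> list:
    """Join matched indexes against archived articles and apply a light transform.

    `archive` maps an index to a dict with hypothetical "title" and "body" fields.
    """
    rows = []
    for doc_id in matched_ids:
        article = archive.get(doc_id)
        if article is None:
            # In this sketch, an index with no archived article is skipped.
            continue
        rows.append(
            {
                "id": doc_id,
                "title": article["title"].strip(),
                "body": article["body"].strip(),
            }
        )
    return rows


archive = {
    "doc-1": {"title": " Markets rally ", "body": "Stocks rose on Monday.\n"},
    "doc-2": {"title": "Rates hold", "body": "The central bank held rates."},
}
rows = build_snapshot_rows(["doc-2", "doc-1", "doc-9"], archive)
print(len(rows))  # prints: 2
```

In a distributed runner the same lookup-and-transform step would be spread across workers, which is what makes it practical for large result sets.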

The result is a set of compiled files ready for download. These files are made available in client storage, which is a GCS bucket. The data scientist is then notified that the snapshot is ready for download.

The value of snapshots to data scientists is that they provide easy access to the data archive and fit into existing workflows, which saves time and lessens the need to become familiar with new tools. Snapshots remove barriers to integrating large custom content sets into those workflows, in a manner that is technology agnostic.

To read more about the depth of use cases enabled by Dow Jones DNA, visit dowjones.com/dna.

Additionally, I will be speaking alongside my colleague, Dylan Roy, about Dow Jones DNA at Google Next. Be sure to attend our talk, if you will be there!
