How Dow Jones DNA powers data scientists with the tools they need to succeed

Patricia Walsh
Jun 4, 2019 · 2 min read

By Patricia Walsh, Dow Jones Principal Technology Product Manager & Engineering Manager


Enterprise organizations rely on data scientists to find patterns in data. These patterns can then be used to identify risks, find optimizations, or evaluate new opportunities. The Dow Jones archive is a data corpus well suited to the appetite of data scientists, containing over 30 years of licensed articles, starting as early as the 1950s.

Data scientists want to spend time solving problems, not spinning their wheels validating licensing rights. They need to be confident that they can use the data for their machine learning and big data workloads.

Enter Dow Jones DNA, a solution for data scientists to consume licensed content at scale. Key to making the data available at scale is programmatic validation of licensing rights. One way to validate these rights is the implementation of a whitelist. The content processing pipeline ingests articles from information providers, each considered a source. A whitelist is a list of sources allowed to be included in a snapshot or stream. Each article has a flag identifying the source from which it was published. Applying a whitelist compares the list of allowed sources against each article to programmatically verify licensing rights.
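
As a minimal sketch of that comparison, assume each article carries a `source_code` field and the whitelist is a set of source codes. The field name, the `apply_whitelist` helper, and the sample codes are all hypothetical and are not taken from the DNA platform itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    article_id: str
    source_code: str  # hypothetical flag identifying the information provider
    body: str


def apply_whitelist(articles, allowed_sources):
    """Keep only the articles whose source appears in the whitelist."""
    allowed = frozenset(allowed_sources)
    return [a for a in articles if a.source_code in allowed]


# Illustrative data only; these source codes are made up.
articles = [
    Article("a1", "SRC_A", "..."),
    Article("a2", "SRC_B", "..."),
    Article("a3", "SRC_A", "..."),
]
licensed = apply_whitelist(articles, {"SRC_A"})
print([a.article_id for a in licensed])
```

Because the check is a set membership test per article, it can run over every article without any manual review of licensing terms.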

Whitelists are built on Google Cloud Pub/Sub, a messaging service that follows the publisher-subscriber model. The publisher sends messages and is decoupled from the subscriber, which receives them.

In the case of the Dow Jones DNA platform, the publisher is the content processing pipeline, and the subscriber is the data archive. All content is consumed from the content processing pipeline. The data archive has subscribed to the topic that maps to sources licensed for text mining.
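
The decoupling can be illustrated with a toy in-memory broker. This is a sketch of the publisher-subscriber pattern only, not the Google Cloud Pub/Sub client library. The topic names and the idea that the pipeline picks a topic per article are assumptions made for illustration.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a messaging service; not the Cloud Pub/Sub client."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher only names a topic; it never references a subscriber.
        for callback in self._subscribers[topic]:
            callback(message)


broker = InMemoryBroker()
archive = []  # plays the role of the data archive

# Hypothetical topic name for sources licensed for text mining.
broker.subscribe("text-mining-licensed", archive.append)

# The content pipeline (publisher) sends each article to a topic.
broker.publish("text-mining-licensed", {"id": "a1", "source": "SRC_A"})
broker.publish("display-only", {"id": "a2", "source": "SRC_B"})

print([m["id"] for m in archive])
```

Only the message published to the subscribed topic reaches the archive; the other message is never delivered to it.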

The data archive lives on Google Cloud Platform, in both BigQuery and Google Cloud Storage regional buckets. The data scientist requests a snapshot or stream via an API call that includes a query. That query is submitted to BigQuery to return a list of articles, which are compared against a BigQuery table of allowed sources. Only articles that originate from information providers that allow text mining are then included in the snapshot or stream for delivery to data scientists.
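
The comparison against a table of allowed sources can be pictured as a join. The sketch below uses Python's built-in `sqlite3` as a local stand-in for BigQuery so that it runs anywhere. The table names, column names, sample rows, and the `snapshot_ids` helper are hypothetical, and the actual DNA queries may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id TEXT, source_code TEXT, title TEXT);
    CREATE TABLE allowed_sources (source_code TEXT PRIMARY KEY);
    INSERT INTO articles VALUES
        ('a1', 'SRC_A', 'Markets rally'),
        ('a2', 'SRC_B', 'Markets fall'),
        ('a3', 'SRC_A', 'Weather report');
    INSERT INTO allowed_sources VALUES ('SRC_A');
    """
)


def snapshot_ids(conn, title_pattern):
    """Apply the user's filter, then keep only rows from allowed sources."""
    rows = conn.execute(
        """
        SELECT a.article_id
        FROM articles AS a
        JOIN allowed_sources AS s ON s.source_code = a.source_code
        WHERE a.title LIKE ?
        ORDER BY a.article_id
        """,
        (title_pattern,),
    ).fetchall()
    return [row[0] for row in rows]


print(snapshot_ids(conn, "Markets%"))
```

Article `a2` matches the user's filter but is dropped by the join, since its source is absent from the allowed-sources table.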

By implementing whitelists as Pub/Sub topics, data scientists never have to worry about licensing rights, so they can stay heads down in the data where they thrive. The whitelist makes it possible to ingest data at scale, providing assurance to data scientists that they can use Dow Jones DNA content for their text mining, machine learning, and big data workflows.

To read more about the depth of use cases enabled by Dow Jones DNA, visit dowjones.com/dna.


Patricia Walsh

Written by

Dow Jones Technology Product Director

Dow Jones Tech

Dow Jones is a global provider of news and business information, and has produced unrivaled quality content for over 130 years. Read insights from our innovative Technology team who work across our publications and products, including Wall Street Journal, Factiva and Barron’s.
