How Dow Jones DNA powers data scientists with the tools they need to succeed
By Patricia Walsh, Dow Jones Principal Technology Product Manager & Engineering Manager
Enterprise organizations rely on data scientists to find patterns in data. These patterns can then be used to identify risks, find optimizations, or evaluate new opportunities. The Dow Jones archive is a data corpus well suited to meet the appetite of data scientists, containing over 30 years of licensed articles, starting as early as the 1950s.
Data scientists want to spend time solving problems, not spinning their wheels validating licensing rights. They need to be confident that they can use the data for their machine learning and big data workloads.
Enter Dow Jones DNA, a solution for data scientists to consume licensed content at scale. Key to making the data available at scale is programmatic validation of licensing rights. One way to validate these rights is the implementation of a whitelist. The content processing pipeline ingests articles from information providers, each considered a source. A whitelist is a list of sources allowed to be included in a snapshot or stream. Each article has a flag identifying the source from which it was published. Applying a whitelist compares the list of allowed sources against each article to programmatically verify licensing rights.
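The whitelist check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the DNA implementation: the `Article` class, the `source_code` field, and the source codes themselves are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    article_id: str
    source_code: str  # hypothetical name for the flag identifying the source
    title: str


def apply_whitelist(articles, allowed_sources):
    """Keep only the articles whose source appears in the whitelist."""
    allowed = frozenset(allowed_sources)  # set lookup keeps each check O(1)
    return [a for a in articles if a.source_code in allowed]


# Made-up sample data: three articles from three different sources.
articles = [
    Article("a1", "SRC_A", "Markets open higher"),
    Article("a2", "SRC_B", "Regional bank earnings"),
    Article("a3", "SRC_C", "Commodity outlook"),
]

# Hypothetical whitelist: only SRC_A and SRC_C permit this use.
whitelist = {"SRC_A", "SRC_C"}

licensed = apply_whitelist(articles, whitelist)
print([a.article_id for a in licensed])
```

Because the check is a simple membership test on each article's source flag, it can run on every article without manual review, which is what makes validation at scale practical.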
Whitelists are built on Google Cloud Pub/Sub, a messaging service that follows the publisher-subscriber model. The publisher sends messages to a topic and is decoupled from the subscriber, which receives the messages.
In the case of the Dow Jones DNA platform, the publisher is the content processing pipeline, and the subscriber is the data archive. All content is consumed from the content processing pipeline. The data archive has subscribed to the topic that maps to sources licensed for text mining.
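To make the publisher-subscriber relationship concrete, here is a minimal in-memory sketch in Python. It deliberately does not use the Google Cloud Pub/Sub client library; the `InMemoryBroker` class and both topic names are hypothetical stand-ins that only mimic the decoupling between the pipeline (publisher) and the archive (subscriber).

```python
class InMemoryBroker:
    """Toy message broker: a stand-in for a real Pub/Sub service."""

    def __init__(self):
        self._subscribers = {}  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # The publisher never references subscribers directly, only the topic.
        for callback in self._subscribers.get(topic, []):
            callback(message)


# Hypothetical topic names, assumed for illustration only.
TEXT_MINING_TOPIC = "licensed-for-text-mining"
OTHER_TOPIC = "not-licensed-for-text-mining"

broker = InMemoryBroker()

# The "data archive" subscribes only to the topic for licensed sources.
archive = []
broker.subscribe(TEXT_MINING_TOPIC, archive.append)

# The "content processing pipeline" publishes articles to topics.
broker.publish(TEXT_MINING_TOPIC, {"article_id": "a1", "source_code": "SRC_A"})
broker.publish(OTHER_TOPIC, {"article_id": "a2", "source_code": "SRC_B"})

print([m["article_id"] for m in archive])
```

Only the article published to the subscribed topic reaches the archive; the other message is never delivered to it, so the subscription itself acts as the filter.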
The data archive lives on Google Cloud Platform, in both BigQuery and Google Cloud Storage regional buckets. The data scientist requests a snapshot or stream via an API call that includes a query. That query is run against BigQuery to return a list of articles, which are compared against a BigQuery table of allowed sources. Only articles that originate from information providers that allow text mining are then included in the snapshot or stream for delivery to data scientists.
By leveraging whitelists as Pub/Sub topics, data scientists never have to worry about licensing rights; they can stay heads-down in the data, where they thrive. The whitelist makes it possible to ingest data at scale, providing assurance to data scientists that they can use Dow Jones DNA content for their text mining, machine learning, and big data workflows.
To read more about the depth of use cases enabled by Dow Jones DNA, visit dowjones.com/dna.
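The comparison against a table of allowed sources amounts to a join. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for BigQuery so that it runs anywhere; the table names, column names, and sample rows are assumptions made for illustration and do not reflect the actual DNA schema.

```python
import sqlite3

# In-memory SQLite database standing in for the BigQuery data archive.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE articles (article_id TEXT, source_code TEXT, title TEXT);
    CREATE TABLE allowed_sources (source_code TEXT PRIMARY KEY);

    INSERT INTO articles VALUES
        ('a1', 'SRC_A', 'Oil prices rise'),
        ('a2', 'SRC_B', 'Oil demand forecast'),
        ('a3', 'SRC_C', 'Tech earnings');

    INSERT INTO allowed_sources VALUES ('SRC_A'), ('SRC_C');
    """
)


def run_snapshot_query(conn, title_keyword):
    """Return IDs of matching articles whose source is on the allowed list."""
    rows = conn.execute(
        """
        SELECT a.article_id
        FROM articles AS a
        JOIN allowed_sources AS s ON s.source_code = a.source_code
        WHERE a.title LIKE ?
        ORDER BY a.article_id
        """,
        (f"%{title_keyword}%",),
    ).fetchall()
    return [row[0] for row in rows]


snapshot_ids = run_snapshot_query(conn, "Oil")
print(snapshot_ids)
```

Two articles match the keyword, but only the one from an allowed source survives the join, so the user's query can never pull in content from a source that is not licensed.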
Previous articles in this series: