Google Cloud Next ‘19 Recap: Transforming to a Data Business

Patricia Walsh
Published in Dow Jones Tech
Nov 7, 2019
Patricia Walsh and Dylan Roy presenting at Google Cloud Next ‘19

This past April, Dylan Roy and I had the great opportunity to present at Google Cloud Next ‘19 on the topic of transforming Dow Jones into a data business. We spoke about how we ran experiments using cloud services that then informed our architecture decisions. We used these experiments to better understand the trade-offs between cloud cost optimization and cloud performance benefits. We walked the audience through the progression of our architecture, framing each evolution around the usage pattern we observed and the cost-to-performance trade-off that best met customer expectations and business objectives.

Designing a platform that makes data available for machine learning and big data workflows involves trade-off decisions. Cloud-based services present opportunities for cost savings, including preemptible instances, sustained use discounts, tiered storage, and per-minute billing. They also offer ways to ensure service quality for customers, such as managed instances, performant response times, and autoscaling. What we discovered in designing the Dow Jones DNA Platform is that, for any architecture change, you need to evaluate the trade-off between a potential cost optimization and a performant experience.
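The shape of that trade-off can be sketched as a back-of-the-envelope cost model. The sketch below compares on-demand and preemptible compute for a batch job; the hourly rate, discount, and retry overhead are illustrative assumptions, not actual Google Cloud prices.

```python
# Illustrative sketch: on-demand vs preemptible compute cost for a
# batch workload. All rates below are HYPOTHETICAL placeholders, not
# quoted Google Cloud prices; preemption risk is modeled crudely as a
# fixed fraction of retried work.

def batch_cost(vcpu_hours: float, hourly_rate: float,
               preemptible: bool = False,
               discount: float = 0.8,
               retry_overhead: float = 0.1) -> float:
    """Estimate the cost of a batch job.

    Preemptible instances trade a steep discount (assumed 80% here)
    for the risk of interruption, modeled as extra retried work.
    """
    if preemptible:
        effective_hours = vcpu_hours * (1 + retry_overhead)
        return effective_hours * hourly_rate * (1 - discount)
    return vcpu_hours * hourly_rate

on_demand = batch_cost(1000, hourly_rate=0.05)
preempt = batch_cost(1000, hourly_rate=0.05, preemptible=True)
```

Under these assumed numbers the preemptible run is far cheaper even after paying for retried work, which is why interruptible batch workloads are the natural place to spend preemptible capacity.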

What does it mean to be a data business?

Being a data business means that we both create and sell data. We use data gathered on users to inform what data we create and sell. Monetization opportunities include ad targeting based on anonymized user preferences, increasing subscription renewals with recommendation engines built on anonymized reader-affinity data, selling content as data through Factiva membership, and selling data as a service through the Dow Jones DNA product.

Running experiments in Cloud Services:

The case for moving to the cloud is often centered around cost savings. What we have discovered in practice is that cost savings are never a guarantee. Costs in cloud services depend heavily on the use case and the service leveraged, and the actual cost is not always evident without some experimentation. For example, when Dow Jones DNA was first launched, we chose Google Cloud Datastore because it integrated well with Google Cloud Dataflow pipelines for ingestion, offered high availability, and provided performant reads and writes at scale.

In using Google Cloud Datastore as the storage layer, we discovered a cost to maintaining our own DSL: read/write operations were expensive at scale. We needed querying and extraction capability at scale. Google Cloud Datastore met the need but was not optimized for our use case.
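The "expensive at scale" lesson falls out of simple arithmetic: per-operation pricing is negligible for small workloads but multiplies quickly when extracting or rewriting a large corpus entity by entity. The rates below are illustrative placeholders, not quoted Cloud Datastore prices.

```python
# Back-of-the-envelope sketch of why per-operation pricing bites at
# scale. The per-100k rates are ASSUMED for illustration only.

def monthly_op_cost(ops_per_day: int, rate_per_100k: float) -> float:
    """Cost of entity operations over a 30-day month."""
    return ops_per_day * 30 / 100_000 * rate_per_100k

# Extracting a large corpus entity-by-entity inflates read counts:
reads = monthly_op_cost(ops_per_day=50_000_000, rate_per_100k=0.06)
writes = monthly_op_cost(ops_per_day=5_000_000, rate_per_100k=0.18)
```

At tens of millions of operations a day, even fractions of a cent per hundred thousand operations compound into a material monthly bill, which is what pushed us to re-examine whether an operation-priced store fit an extraction-heavy usage pattern.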

This first experiment with Google Cloud Datastore uncovered a piece of the puzzle for further development: in order to actualize cloud cost savings, the usage pattern has to be understood. Capturing usage data on how your services are consumed provides insight into customer expectations and business needs. In a few instances, the usage patterns we observed were different than expected. Capturing this data made it possible for the team to adapt.

Matching usage patterns to trade-offs between performance expectations and cost savings:

There is no one cloud service that is better than the others – it is a matter of finding the cloud service that strikes the right balance for what you are trying to serve to customers. Cloud savings come in the form of preemptible instances, sustained use discounts, archival storage, and storage tiers, among others. Cloud performance comes in the form of managed services, response times, the ability to query large data sets over distributed systems, and analytics capabilities. Once we identified where there were savings opportunities and where there were performance expectations, we were equipped to match services to usage patterns. For example, there was an opportunity to err on the side of cost savings when processing incoming data, so we leveraged preemptible instances, as it is batch processing. In the case of running a snapshot, the customer expectation was a performant response, so we erred on the side of managed services, with Google Cloud BigQuery as a search index.
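The matching exercise described above can be sketched as a simple routing rule: classify each observed workload, then send it to the cost-leaning or performance-leaning side of the trade-off. The service names mirror the article; the mapping logic itself is an illustrative simplification.

```python
# Sketch of matching usage patterns to services. The decision rule is
# an illustrative simplification of the process described in the text.

def pick_strategy(workload: dict) -> str:
    """Choose a cost- or performance-leaning strategy for a workload."""
    if workload["latency_sensitive"]:
        # Customer-facing: err on the side of managed, performant services.
        return "managed service (BigQuery as a search index)"
    if workload["batch"]:
        # Interruptible background work: err on the side of savings.
        return "preemptible instances"
    return "standard instances"

# The two usage patterns called out above:
ingestion = {"batch": True, "latency_sensitive": False}
snapshot = {"batch": False, "latency_sensitive": True}
```

The point of writing the rule down, even informally, is that it forces the question the article keeps returning to: for each workload, is the customer paying for latency or for throughput?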

Lessons we learned in iterating on our architecture:

  • Understanding the usage patterns is key to actualizing both performance expectations and cost optimization.
  • Building a flexible architecture makes it possible to respond to usage patterns as they become apparent or as they change. Flexibility also makes it possible to adapt, as new services become available or as existing services improve.
  • Being an early adopter is a double-edged sword, meaning you can have influence over the direction of a cloud service, but also you’re building on a moving target. An alpha service can be discontinued at any time.
  • There are hidden expenses in data shuffling, joins across distributed systems, data extraction, and read/write operations, all of which differ based on the service.

What we landed on is Google Cloud BigQuery as the search index, joined with Google Cloud Storage as the storage layer. Understanding usage patterns helps us deliver the customer benefit without going bankrupt in the process.
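One reason this pairing works economically is BigQuery's scan-based on-demand pricing: queries are billed per byte scanned, so anything that prunes the data read (partitioning, clustering, column selection) cuts cost directly while bulk content stays cheaply in Cloud Storage. The sketch below models that; the $/TiB rate is an assumption for illustration, not a quoted price.

```python
# Rough sketch of a scan-based query cost model. PRICE_PER_TB is an
# ASSUMED rate for illustration, not an actual BigQuery price.

PRICE_PER_TB = 5.00  # assumed USD per TiB scanned

def query_cost(bytes_scanned: int) -> float:
    """On-demand cost of a query that scans the given number of bytes."""
    return bytes_scanned / 2**40 * PRICE_PER_TB

full_scan = query_cost(10 * 2**40)   # query scanning a 10 TiB corpus
pruned = query_cost(200 * 2**30)     # same query pruned to 200 GiB
```

Under this model a query whose scan is pruned from 10 TiB to 200 GiB costs roughly 2% as much, which is why schema design that limits bytes scanned matters as much as the service choice itself.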

It is our belief that the lessons learned through architecture experimentation, understanding trade-off decisions, and matching cost to performance based on usage patterns will help future-proof Dow Jones as it transforms into a data business.

To view our full Google Next talk, go here. To view our slides, go here.

To read more about the depth of use cases enabled by Dow Jones DNA, visit dowjones.com/dna.

Previous articles in this series:

What is DNA content and Where does it come from?

What is a DNA Snapshot and Why does it Exist?

Google Next 19 Transforming to a Data Business
