The Future of Data Collaboration

Data Sharing, Data Exchanges, Data Marketplaces

In this article, we will cover the future of Data Sharing and Collaboration, including the evolution we are seeing around the increased usage of Snowflake’s Data Sharing, Data Exchanges, and Data Marketplaces. We will focus on how the Snowflake Data Cloud is empowering the future of structured and semi-structured data collaboration for organizations across the world. Data sharing is one of Snowflake’s major differentiators: the ease of sharing governed data securely sets it apart from other RDBMS, cloud database, data warehouse, and data lake systems. Snowflake’s unique micro-partition architecture has allowed it to grow into a full Data Cloud where thousands upon thousands of public, semi-private, and completely private data sets can be shared with no copying within a region, which is HUGE!

Accenture has worked with Snowflake to build data sharing offerings that provide highly available, highly scalable data to clients for global and vertically focused data sharing solutions. Snowflake’s unique metadata-driven architecture allows it to serve as a full Data Cloud single source of truth for one organization or for multiple organizations around the world. Accenture’s deep knowledge of customers’ data sharing concerns makes us a top implementation specialist for Snowflake data sharing solutions. Our technical knowledge of the cloud providers, combined with our deep vertical industry business process experience and expertise, enables Accenture to deliver a broad set of joint digital transformation data sharing vertical solutions.

As this data sharing functionality and vision continues to grow, it drives a powerful conceptual paradigm change that moves all of us toward more data interconnectivity and single copies of live data sets that can be shared in near real time. These data sets can also be shared with a virtually unlimited number of data consumers, enabled in part by Snowflake’s separation of storage and compute. We can see a future where initially thousands, and then millions, of organizations, including public companies, private companies, other organizations, and governments, share data in this interconnected Data Cloud. This requires a conceptual data sharing paradigm shift.
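To make the no-copy model concrete, here is a minimal consumer-side sketch using the Snowflake Python connector. The account, share, warehouse, and table names are hypothetical placeholders, but the pattern is the standard one: the consumer mounts the provider's share as a read-only database and queries it with its own virtual warehouse, and no copy of the data is ever made.

import os
import snowflake.connector

# Consumer-side sketch (hypothetical account, share, and object names).
conn = snowflake.connector.connect(
    account="consumer_account",
    user="ANALYST",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Mount the provider's share as a local, read-only database; no data is copied.
cur.execute("CREATE DATABASE SHARED_REFERENCE FROM SHARE provider_account.country_share")

# Query the shared data with the consumer's own virtual warehouse,
# so the provider's compute is never used.
cur.execute("USE WAREHOUSE ANALYTICS_WH")
cur.execute("SELECT code, name FROM SHARED_REFERENCE.PUBLIC.COUNTRIES LIMIT 10")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()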

Data Processing Paradigm Shift

In the past, humans and their corporations duplicated most of their “common” data because of the scaling limits of on-prem databases and architectures. Duplicating and copying data was required to share it along the business value chain with multiple types of constituents (suppliers, distributors, etc.). The reality, though, is that the constant push for more innovation, automation, simplification, ease of use, and higher quality centralized data will continue to move organizations toward sharing certain data sets. Organizations do not need millions upon millions of copies of the same standard list tables, such as the countries, states, and cities of the world. Current data copying techniques also create unnecessary duplicated storage and increased energy consumption that impacts climate change. At the same time, if a single version or a limited set of controlled copies of a data set is used by many organizations, those organizations become dependent on the quality and security of that data set.

Data Sharing vs. Current Data Copying

The major difference between using Snowflake Data Sharing versus previous data copying and transferring is the transformative reduction of friction and cost to process and use the data from the data source provider.

The old mechanisms involved sizable friction and monetary costs at each step of transferring and using data. Let’s recap what was involved in data copying and sharing before Snowflake’s Data Sharing technology.

Data Transfer Agreement Specification.

Every time a data provider and data consumer wanted to work together and transfer data, they needed to go through an elaborate and cumbersome process to specify the transfer technically. Depending on the amount of data and the data structures, replicating the data structures from the data provider over to the data consumer could be incredibly complex. At a minimum, it involved some form of programming or an export of the data source to a common ingestion structure in delimited, fixed-width, or JSON formats. Every previous data transfer technique was an engineering project with time, friction, and monetary costs to the organization. Let’s recap some of the popular data copying techniques used for data sharing.

API Data Transfer.

If the data copying and sharing mechanism was an API, then depending on the complexity and amount of data, this added moderate to very large costs and friction. Programmers needed to be engaged on both sides of the data copying process; it became an engineering project for both the data provider and the data consumer. Often the data provider developed the APIs once, but in reality faced continuous maintenance and upkeep to keep the APIs up to date and working 24x7.
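As a hedged illustration of what the consumer side of an API transfer often involves, here is a small Python sketch of a paginated pull using the requests library. The endpoint, token, and response fields are hypothetical; the real point is that both sides have to write, version, and maintain code like this indefinitely.

import requests

# Hypothetical endpoint and credentials for a data provider's REST API.
BASE_URL = "https://api.provider.example.com/v1/transactions"
HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}

def pull_all_pages(page_size=500):
    """Pull every page of the provider's data set and return the combined rows."""
    rows, page = [], 1
    while True:
        resp = requests.get(
            BASE_URL,
            headers=HEADERS,
            params={"page": page, "page_size": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        rows.extend(batch)
        page += 1
    return rows

if __name__ == "__main__":
    print(f"Pulled {len(pull_all_pages())} rows from the provider API")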

Data Transfer through automated file transfer or copy protocols.

This data transfer process involved writing and maintaining batch-type data transfer jobs (FTP, SFTP, SCP, etc.) that extracted and transferred data to the consumer-designated retrieval area. Each of these data transfer jobs typically required some form of workflow and process that ALSO required knowledge of the data set and its structure so the data could be properly transferred, ingested, and eventually used for business value. This again required sizable maintenance and operational costs to keep it up and running. The process was often not well secured either, especially with most data lakes, and it also required security maintenance and operations on a constant 24x7 basis.
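For reference, here is a minimal sketch of the consumer half of such a batch job, pulling a nightly extract over SFTP with the paramiko library. The host, credentials, and file paths are hypothetical; in practice this job also needs scheduling, monitoring, retries, and credential rotation to keep running 24x7.

import paramiko

# Hypothetical SFTP host, credentials, and file paths.
HOST = "sftp.provider.example.com"
USER = "consumer_user"
REMOTE_FILE = "/outbound/daily_extract.csv"
LOCAL_FILE = "daily_extract.csv"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(HOST, username=USER, password="CHANGE_ME")

# Download the provider's nightly extract into the consumer's landing area.
sftp = client.open_sftp()
sftp.get(REMOTE_FILE, LOCAL_FILE)
sftp.close()
client.close()

print(f"Retrieved {REMOTE_FILE} to {LOCAL_FILE}; the ingestion job runs next.")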

Data Transfer through proprietary or open-source tools.

This data transfer process typically involved acquiring extract, transform, load (ETL) or extract, load, transform (ELT) tools, which can be a combination of visual and coded toolsets. These tools required additional training and specialized expertise from employees or consultants. Besides the initial costs in time and money to install the tool on-prem or in the cloud, there were again sizable maintenance and operational costs to keep these vendor tools running 24x7 or as often as required.
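To show the kind of work these toolsets automate, here is a toy, hand-rolled extract-transform-load sketch in Python using only the standard library. The file, table, and column names are hypothetical; whether this logic lives in a vendor tool or in code, someone has to build it, schedule it, and keep it running.

import csv
import sqlite3

# Extract: read the provider's delimited export (hypothetical file and columns).
with open("countries_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: normalize country codes and drop rows with no code.
cleaned = [
    {"code": r["code"].strip().upper(), "name": r["name"].strip()}
    for r in rows
    if r.get("code")
]

# Load: write the cleaned rows into the consumer's local copy of the table.
conn = sqlite3.connect("consumer_warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS countries (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT OR REPLACE INTO countries (code, name) VALUES (:code, :name)",
    cleaned,
)
conn.commit()
conn.close()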

Sidenote: New data connector tools such as Fivetran and Stitch have turned the EL portion into a partially business-user task, because “known” data sources can be connected directly to the final cloud destination with minimal setup.

Data Transfer through semi-manual movement and copies.

This process, in my opinion, is a horrible solution in most cases, but it is still done a lot today. Exporting, transferring, and potentially loading data manually typically involves a business or technical employee, or a consultant, going to the source system and either manually scheduling or manually performing an extraction of data into specifically formatted file(s). Once they have the files on some medium, they must transfer them to the agreed-upon consumer pickup location. Even if the consumer has everything automated from that point, it is still a partially manual process that can be disrupted when the employees who perform the operation leave, or by any of the other issues that come with depending on manual human tasks. In an era of IoT and near real-time data processing, this approach becomes a real cost to the business, both through potential human errors that affect data quality and through delayed data-driven decisions.

Data Source Specification and Data Structure Understanding

After the data provider and consumer agree on the data transfer mechanism (API, file transfer, vendor tool, semi-manual), the data consumer still needs to understand the data tables, schemas, columns, and often the grain and aggregation of the data. This again can involve a large amount of business and technical friction, time, and cost, especially if it is not just a few data tables being copied and shared. At best, it is another small engineering project to finalize the transformation, curation, and consumption of the data before it is shared with all data consumers within the business.

Beyond the costly overhead of these previous data sharing techniques, with their physical data copying, engineering resources, and vendor tools, there remains a real business complexity cost to having different copies of data. Let’s go into the business challenges and analytics integrity issues of data copies in more detail.

Data Copies Everywhere Create Data Integrity Issues

Data copies, copies, copies, everywhere. Hey, I get it, we humans want control. We want to own our data and know that no one can take it from us. The problem is that copied data without strict control and organization is BOTH an organizational and a technical nightmare, one that can lead to truly bad business outcomes based on inaccurate or simply outdated data.

Before Snowflake’s groundbreaking Data Sharing feature came to market, we data professionals spent decades building millions of data copying jobs in organizations across the world. We copied and copied data like this:

· We copied data within one database.

· We copied data from one database to another database on the same platform.

· We aggregated data to scale querying more efficiently within one database or multiple databases.

· We copied data from one database vendor to another one to try and achieve better performance or just easier usability or security across business units and/or partners.

· We copied data into CSV files, opened them in text editors or Excel, and then loaded them into other areas for analysis.

· We copied data into Tab Separated files and other delimited files everywhere.

The challenge with copying (and this is still very prevalent right now) is this: how do an organization’s analytical and regular business users know for certain that they are working off the correct, most up-to-date, quality-validated copy of a data set? When you have multiple copies of data, often ungoverned, this becomes much more challenging. It is all too easy for a data analyst or decision maker to be analyzing data that is inaccurate and out of date.

Many vendor technologies have also exacerbated this data quality and governance challenge by creating additional extracts of data that are often ungoverned. Too many times we have seen analysts come to different calculations and conclusions because they used different copies of the data for their analysis. Let’s cover how the future of data collaboration works with our improved data sharing technology.

Data Sharing Technology takes Data Collaboration to New Levels of Speed and Quality

Accenture and Snowflake have enabled customers to move to near real-time, improved data collaboration by implementing data sharing technology in which only a single, provider-governed copy of the data is made available to data consumers, with no data copying required. This is enabled by Snowflake’s micro-partitioned, hybrid shared-nothing and shared-disk architecture. The micro-partition technique enables the separation of storage and compute, so the data provider’s compute is separated from the data consumer’s compute. The amazing part is that this is far less complex, faster, and cheaper than previous data copying solutions. The same underlying technology, with features such as Time Travel and zero-copy cloning, also enables joint Accenture and Snowflake data life-cycle solutions around data protection and continuous integration and delivery (CI/CD).
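For illustration, here is a minimal provider-side sketch using the Snowflake Python connector: the provider creates a share over its single governed copy of the data, grants read access to the shared objects, and entitles a consumer account. The account, database, and share names are hypothetical placeholders, not a prescription of the joint Accenture and Snowflake solutions described above.

import os
import snowflake.connector

# Provider-side sketch (hypothetical account, database, and share names).
conn = snowflake.connector.connect(
    account="provider_account",
    user="DATA_ADMIN",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

# Publish a governed, read-only share over the single copy of the data.
cur.execute("CREATE SHARE IF NOT EXISTS COUNTRY_SHARE")
cur.execute("GRANT USAGE ON DATABASE REFERENCE_DB TO SHARE COUNTRY_SHARE")
cur.execute("GRANT USAGE ON SCHEMA REFERENCE_DB.PUBLIC TO SHARE COUNTRY_SHARE")
cur.execute("GRANT SELECT ON TABLE REFERENCE_DB.PUBLIC.COUNTRIES TO SHARE COUNTRY_SHARE")

# Entitle a specific consumer account; it mounts the share without any copy being made.
cur.execute("ALTER SHARE COUNTRY_SHARE ADD ACCOUNTS = consumer_account")

# The same storage layer also supports zero-copy cloning for dev and test work.
cur.execute("CREATE OR REPLACE TABLE REFERENCE_DB.PUBLIC.COUNTRIES_DEV CLONE REFERENCE_DB.PUBLIC.COUNTRIES")

cur.close()
conn.close()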

Private Data Sharing Exchanges

Let’s also look at how the Accenture and Snowflake partnership around private data exchanges has been transforming how our joint customers digitize their businesses. We are seeing amazing data-led business digitization and transformation based on our solutions in Financial Services, Healthcare, Life Sciences, Media and Communications, Energy, Oil, and other industries. The massive removal of friction that Accenture delivers through vertical, data-led transformation using Snowflake’s data sharing helps our joint partnership change how businesses collaborate and share data, at near real-time speeds that were not previously possible. Through our joint private data sharing solutions, we can now move supply chain decisions to minutes instead of the much longer cycles of previous solutions. Let’s cover what an Accenture and Snowflake private data exchange and data sharing solution looks like.

Our joint private data sharing exchange solution allows Accenture or the client’s private data exchange owner to control which specific objects are shared and the security around those objects, providing well-established governance at petabyte scale. This is changing the game for our joint customers, transforming their businesses so data operations and decisions happen at the speed the business needs. We remove the friction of copying data and enable our customers to share and join data easily and securely. We have also added functionality through customized secure user-defined functions (UDFs) on Snowflake that allow double-blind data sharing and joining, giving our customers new solutions that help them drive business results while meeting privacy regulations. We have built some amazing Data Clean Room solutions for customers. This is an area I helped pioneer with one of my clients and Snowflake over two years ago, when we built the original proof of concept of a Data Clean Room on Snowflake.
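As a hedged sketch of the secure-UDF pattern behind such double-blind joins (not the specific Accenture Data Clean Room implementation), the provider can define a secure SQL function that returns only an aggregate overlap count and grant it to a share, so the partner never sees row-level identifiers. All database, table, and share names below are hypothetical.

import os
import snowflake.connector

# Hypothetical clean-room database, tables, and share names.
conn = snowflake.connector.connect(
    account="provider_account",
    user="DATA_ADMIN",
    password=os.environ["SNOWFLAKE_PASSWORD"],
    role="ACCOUNTADMIN",
)
cur = conn.cursor()

# The secure function exposes only an aggregate match count, never raw rows or PII.
cur.execute("""
CREATE OR REPLACE SECURE FUNCTION CLEANROOM.PUBLIC.AUDIENCE_OVERLAP(SEGMENT_NAME VARCHAR)
RETURNS TABLE (MATCHED_CUSTOMERS NUMBER)
AS
$$
    SELECT COUNT(*) AS MATCHED_CUSTOMERS
    FROM CLEANROOM.PUBLIC.MY_CUSTOMERS m
    JOIN CLEANROOM.PUBLIC.PARTNER_CUSTOMERS p
      ON m.EMAIL_HASH = p.EMAIL_HASH
    WHERE p.SEGMENT = SEGMENT_NAME
$$
""")

# Share only the secure function, so the partner can compute overlap without seeing the rows.
cur.execute(
    "GRANT USAGE ON FUNCTION CLEANROOM.PUBLIC.AUDIENCE_OVERLAP(VARCHAR) "
    "TO SHARE CLEANROOM_SHARE"
)

cur.close()
conn.close()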

Data Marketplaces (Snowflake and Accenture)

Both Snowflake and Accenture provide data marketplaces for our customers. The Snowflake Data Marketplace, which relaunched under that brand in 2020, is just the beginning of the functionality for sharing open and private data sets. It is still relatively immature, but it has great potential and continues to grow and improve. I have worked with Snowflake since the beginning of the data exchange’s creation in 2018, providing guidance for improving data sharing on both private data sharing exchanges and the data marketplace. The Snowflake Data Marketplace has really improved and grown since its debut in June 2019, when it was named the Data Exchange. Accenture and Snowflake are jointly working on new capabilities and functionality around our data marketplaces to bring the most reliable, governed, and accessible data to our joint customers in all of the industries we support together.

Overall, data marketplaces are very new, and for many data professionals they represent an extremely large mental paradigm shift. Data and business professionals must move from corporate data silos and corporate-controlled dimensions of data to a more interconnected data collaboration mindset. It will take time for data professionals, business professionals, and corporations to change, but once the change happens, it will enable amazing business data and digital transformations that tremendously improve business decisions and bottom lines. The unlocking of data for businesses will be one of the most powerful data-led business transformations of the next few years.

The Future of Data Collaboration and the Overall Power of Instantaneous Data Sharing

Over three years ago, I wrote an article about the ‘Power of Instantaneous Data Sharing on Snowflake.’ While the paradigm shift takes time, because education and understanding of these enormous data sharing and data collaboration benefits take time, it is really starting to speed up. The key use cases from that article are still accurate today and cover why data sharing is so extremely valuable to businesses.

Some of the major business and organizational challenges this solves today are the following use cases:

1. Cross-Enterprise Sharing. In many of the Fortune 500 companies I have consulted for, there remain many silos of data. The easiest way to break down those silos with scale, speed, and no copying is to engage Accenture’s pre-built solutions, which leverage Snowflake and enable “secure” data sharing with complete data governance.

2. Partner/Extranet Data Sharing. This can be enabled with basic data sharing or, for more administrative control and ease of use, with the private data sharing exchange functionality. Most organizations have many suppliers and other business partners and can benefit from securely sharing governed data with them instead of relying on extranets, EDI (Electronic Data Interchange), APIs, etc.

3. Data Provider Marketplace Sharing. Many new and old companies are moving to share their data and monetize it through a data marketplace. Data providers like FactSet and others have reduced their cost to share data through Snowflake’s Marketplace.

The Future of Data Collaboration

IDC says that by 2025 worldwide data will grow to 175 zettabytes, with as much of the data residing in the cloud as in data centers. I believe this is an underestimate. The world is constantly moving toward more automation and data capture, which keeps increasing data creation. The ease of use of our data sharing solutions will increase the sharing and collaboration around all of that captured data.

Based on statistics and on surveys of data decision makers and professionals, we predict the following for structured and semi-structured data collaboration:

· Data collaboration happening within hours or minutes, no longer constraining certain business processes to daily, weekly, or monthly decisions.

· Data value chain durations decreasing. Removing data sharing and collaboration friction will shorten the time it takes to capture business value. Businesses are often modeled with the value chain concept introduced by Porter long ago. We see this data value chain accelerating as the market learns about this data sharing 2.0, where sharing involves no copies and is securely governed, cheaper, and faster.

· Heavily increased usage of private data sharing exchanges.

· The predominant copying of data and batch jobs will increasingly move to data sharing functionality due to the lower costs of implementation, operations, and maintenance.

· Increased competition around being the preferred and best “no-copy” data sharing technology.

· Increased usage of data marketplaces.

· Machine learning will continue to grow quickly, and it will be greatly improved by near real-time data sharing solutions. Snowflake’s ML capabilities with Snowpark will continue to grow.

· Self-shared, self-describing, and self-processed data. I have done this myself with my own data inventions, and I believe this will be a major trend that grows tremendously. I see it starting to happen as humans push toward more and more automation. The technology is there; it just needs standards, more human creativity, and implementations.

Sources:

https://snowflakesolutions.net/the-power-of-instantaneous-data-sharing/

https://www.networkworld.com/article/3325397/idc-expect-175-zettabytes-of-data-worldwide-by-2025.html
