Insights from our Small & Synthetic Data roundtable v.1
DataSeries | VRT280720
In July, DataSeries, an OpenOcean-led initiative, hosted a Virtual Roundtable about “Small & Synthetic Data”.
INSIGHTS GATHERED:
CHALLENGES:
85% of the data collected is “small data” (IDC, Gartner…)
The challenge of small data is different from big data: combining data sets is a rarity, so decent data quality is crucial, and the data must be in a standardized format. An obvious challenge is that if a pattern isn’t present in the data, there is no way for the algorithm to learn it. According to MyDataModels, evolutionary and genetic algorithms offer a great opportunity in the small data field (we’ve hosted a follow-up roundtable where we dive deeper into this; see here).
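The roundtable did not go into implementation detail, but as a rough, hypothetical sketch of the evolutionary idea (not MyDataModels’ actual product; the data set and linear model below are invented for illustration), a genetic algorithm can fit a simple model to a small data set by repeatedly mutating, recombining and selecting candidate solutions:

```python
# Hypothetical sketch: a tiny genetic algorithm that evolves the weights of a
# linear model on a *small* data set (a few dozen rows), instead of relying on
# data-hungry gradient-based training. Illustrative only.
import random

# A small synthetic data set: y ~= 3*x1 - 2*x2 + 1 plus noise.
random.seed(0)
DATA = [((x1, x2), 3 * x1 - 2 * x2 + 1 + random.gauss(0, 0.1))
        for x1, x2 in ((random.random(), random.random()) for _ in range(30))]

def fitness(genome):
    """Negative mean squared error of the linear model encoded by the genome."""
    w1, w2, b = genome
    errors = [(w1 * x1 + w2 * x2 + b - y) ** 2 for (x1, x2), y in DATA]
    return -sum(errors) / len(errors)

def mutate(genome, scale=0.3):
    """Add small Gaussian perturbations to every gene."""
    return tuple(g + random.gauss(0, scale) for g in genome)

def crossover(a, b):
    """Pick each gene from one of the two parents at random."""
    return tuple(random.choice(pair) for pair in zip(a, b))

def evolve(pop_size=40, generations=200):
    population = [tuple(random.uniform(-5, 5) for _ in range(3))
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 4]          # keep the fittest quarter
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=fitness)

print(evolve())   # should end up near the true weights (3, -2, 1)
```

Because every candidate is scored on the full (small) data set, the search needs no gradients and tolerates very few rows; the trade-off is many fitness evaluations.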
Data transparency and interoperability are key challenges, and companies are increasingly aware of their importance, also in the context of compliance. How the model is built has to be understood.
Privacy regulations can force/enable companies to break down silos and use the data in new ways. A key challenge is still to educate corporates on privacy-related issues (e.g. What am I allowed to do with my data? What counts as private when data is anonymized? etc.).
Also, GDPR has created a new class of data (e.g. data requests that force firms to share data with the user/client). However, the data you receive is most of the time completely unusable (e.g. delivered as an archive file). A challenge here is to understand this oftentimes invasive data and create meaningful insights for both the developer and the customer. By being able to understand this type of data, we can enhance its utility and use accurate historical data for predictive analytics (e.g. ancestral data combined with activity data).
Beyond addressing privacy issues, an interesting field for synthetic data is the enhancement of very niche-specific data: data that makes up only a very small percentage of a large data set and is difficult to obtain (e.g. medical endoscopy, where 60 fps video generates massive data sets of which only a small amount is relevant to the machine learning model).
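As a hedged illustration of that enhancement idea (the frame shapes, counts and augmentations below are made up for the example), the few relevant frames can be expanded into a larger synthetic set with simple transformations:

```python
# Illustrative sketch: expand a handful of relevant frames (e.g. the few
# endoscopy frames that actually show a finding) into a larger synthetic set
# with simple augmentations. Shapes and parameters are invented for the example.
import numpy as np

rng = np.random.default_rng(42)
rare_frames = rng.random((12, 128, 128, 3))   # stand-in for the few relevant frames

def augment(frame, rng):
    """Return a synthetic variant of one frame: random flip, brightness jitter, mild noise."""
    out = frame.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                                            # horizontal flip
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)                 # brightness jitter
    out = np.clip(out + rng.normal(0.0, 0.01, out.shape), 0.0, 1.0)      # mild noise
    return out

def expand(frames, copies_per_frame, rng):
    """Create `copies_per_frame` synthetic variants of every rare frame."""
    return np.stack([augment(f, rng) for f in frames for _ in range(copies_per_frame)])

synthetic = expand(rare_frames, copies_per_frame=10, rng=rng)
print(rare_frames.shape, "->", synthetic.shape)   # (12, 128, 128, 3) -> (120, 128, 128, 3)
```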
In health tech, getting access to data was one of the key challenges a few years back. This has changed significantly with the implementation of safe havens (deploying your ML models at a hospital behind a firewall). Hospitals and health tech providers are now starting to open up their data as long as it’s anonymized.
Data agility is still a key challenge. Organizations have traditionally been hamstrung in their use of data by incompatible formats, rigid database limitations, and the inability to effectively combine data from multiple sources. It is also crucial to understand the processes that generate the data, and its lineage, in order to see the strategic picture.
Example: retail banks are sitting on massive amounts of data distributed across their organization, yet they are oftentimes unable to utilize their transactional and customer data effectively (e.g. for targeted marketing campaigns). Many of them are also still hesitant to move to the cloud, but that’s changing: National Australia Bank aims to catapult its current public cloud base of just 25% to almost total public cloud adoption.
There is a serious lack of professionals who can shape and understand the business strategy whilst executing on the data strategy (e.g. a data artist with an architectural point of view). A key challenge is explaining the strategic aspect to data scientists: to create a decent solution, everyone has to have a crystal-clear understanding of the problem and the objectives.
OPPORTUNITIES:
- Have privacy regulations created new data sets for a B2B market opportunity? Can these be opened up and democratized?
- Federated Machine Learning to enhance data security and also to reduce latency, especially with live interconnectivity (a minimal sketch follows this list).
- Opening up (personal) data sources for developers to create opportunities for organizations. E.g. Facebook has millions of data points per user.
- Data providers should move beyond just sourcing and aggregating data, towards cleaning, engineering, tagging, enriching, etc.
- Synthetic data can address privacy and data-storage issues in organizations. Proper data management in this context can enable companies to break down silos and open up their (anonymized) data to third parties, potentially enabling new business models.
- Edge computing allows more efficiency: data can be analyzed at the source of its production rather than shipped to central infrastructure (which gets too expensive and raises security issues), especially as the volume of data being produced grows at an exponential rate.
- Will the EU force big players to open up their vast amounts of data, thereby offering a competitive advantage to startups (data access, insight extraction, etc.)?
- An increasing opportunity lies in synthetic methods: data augmentation and generative models that simulate what the data might look like from different viewpoints.
- Small data developments: with a relatively small amount of data, we can accelerate the building of certain models rather than analyzing massive data sets.
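On the federated machine learning point above, the roundtable stayed at a high level; as a toy, hedged sketch of the core mechanism (federated averaging, with invented client data and a simple linear model), each client trains on its own private data and only model weights, never the raw data, leave the client to be averaged:

```python
# Toy sketch of federated averaging (FedAvg): each client fits a local linear
# model on its own private data; only the model weights leave the client and
# are averaged by the server. Data, model and parameters are made up here.
import numpy as np

rng = np.random.default_rng(0)
TRUE_W = np.array([2.0, -1.0])

def make_client(n_rows):
    """Private data that never leaves the client: X and y = X @ TRUE_W + noise."""
    X = rng.normal(size=(n_rows, 2))
    y = X @ TRUE_W + rng.normal(scale=0.1, size=n_rows)
    return X, y

clients = [make_client(n) for n in (20, 35, 50)]   # three clients with small local sets

def local_update(w, X, y, lr=0.1, epochs=20):
    """A few steps of local gradient descent on the client's own data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def federated_round(w_global):
    """One FedAvg round: clients train locally, server averages weights by data size."""
    local_weights = [local_update(w_global.copy(), X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_weights, axis=0, weights=sizes)

w = np.zeros(2)
for _ in range(10):
    w = federated_round(w)
print(w)   # approaches TRUE_W without any raw data ever being centralized
```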