Data Engineering in a Personal Data Lake

Enterprises have long embraced the concept of the data lake, which is in effect a storage repository holding a large amount of raw data until it is needed. At Prifina, we apply this same concept at the individual level with our “personal data engine”, organizing data in a comparable personal data lake (“personal data cloud”). Beyond data storage, this raises several interesting data engineering challenges that we introduce and explore in this post.

By Markus Lampinen. Published in Prifina, Aug 24, 2021.

In a nutshell:

Whether data is stored in individual clouds or in a central location may be secondary.
Several established and new data engineering techniques (including federated learning) can be applied to personal data clouds.
Companies like Google are bringing innovative techniques to processing data at the edge, including the Federated Learning of Cohorts (FLoC) model, which, combined with more available data, could deliver value well beyond privacy alone.

Narrow data on a large population versus vast data on an individual-by-individual basis

A typical data engineering setup may look like this: collect a set of data points on a large population, store them in a set of tables, and segment them by parameters such as age, gender, and socioeconomic status. This works well for a large population with narrow data points. However, in an individual data cloud it does not work: filtering by age, for example, will never return any age other than the individual’s own, no matter how you twist and turn it.

Imagine a scenario where you have a personal data cloud with activity data from an active individual and their wearable devices: Fitbits, smart scales, smart rings, bloodwork, genetics, the entire show. Despite the abundance of data, you cannot segment it by age or gender. What can you segment and organize it by? It turns out that you can use attributes such as uniform timestamps to cluster data points and examine the contents.

For example, you could look at which data points were generated or updated between 1 and 4 pm on a Sunday afternoon for our active individual. This could give you data points such as location data, activity events (e.g., a run or a swim, with the attached data fields), and measurements such as heart rate. You might then cross-reference these with public data points, such as the weather at the GPS locations, to give the data more context and richness.
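The time-window idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, field names (source, kind, ts, value) and sample values are hypothetical and not Prifina’s actual data model.

```python
from datetime import datetime

# Hypothetical records from one personal data cloud; all fields are illustrative.
events = [
    {"source": "watch", "kind": "heart_rate", "ts": "2021-08-22T13:05:00", "value": 142},
    {"source": "phone", "kind": "location", "ts": "2021-08-22T13:06:00", "value": (37.77, -122.42)},
    {"source": "watch", "kind": "run", "ts": "2021-08-22T15:30:00", "value": {"km": 8.2}},
    {"source": "scale", "kind": "weight", "ts": "2021-08-23T07:10:00", "value": 71.4},
    {"source": "watch", "kind": "heart_rate", "ts": "2021-08-22T16:00:00", "value": 98},
]


def in_window(ts, weekday, start_hour, end_hour):
    """True if an ISO timestamp falls on `weekday` (Mon=0 .. Sun=6)
    with its hour in the half-open range [start_hour, end_hour)."""
    t = datetime.fromisoformat(ts)
    return t.weekday() == weekday and start_hour <= t.hour < end_hour


def cluster_by_window(records, weekday, start_hour, end_hour):
    """Keep only records inside the time window and group them by kind."""
    grouped = {}
    for record in records:
        if in_window(record["ts"], weekday, start_hour, end_hour):
            grouped.setdefault(record["kind"], []).append(record)
    return grouped


# Everything generated on a Sunday between 1 pm and 4 pm.
sunday_afternoon = cluster_by_window(events, weekday=6, start_hour=13, end_hour=16)
```

Each location record in the resulting group could then be joined against a public weather source by its coordinates and timestamp; that lookup is left out here because it depends on which weather service is used.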

Another example is clustering a user’s Spotify data. The features to be analyzed are described in the table below:

Since there are 5 features, they cannot all be visualized at once. We can choose 2 features at a time to represent in 2-D space.
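As a rough sketch of what such a clustering could look like, the snippet below runs a small hand-written k-means over five features and then picks two of them for a 2-D plot. The choice of k-means, the feature names and the track values are all assumptions made for illustration; the original feature table and clustering method are not reproduced here.

```python
import random

# Hypothetical feature names and values, each already scaled to the range 0..1.
FEATURES = ["danceability", "energy", "acousticness", "valence", "tempo"]
tracks = [
    [0.80, 0.90, 0.10, 0.70, 0.85],
    [0.75, 0.85, 0.15, 0.80, 0.90],
    [0.82, 0.88, 0.05, 0.75, 0.80],
    [0.20, 0.15, 0.90, 0.30, 0.25],
    [0.25, 0.10, 0.85, 0.20, 0.30],
    [0.15, 0.20, 0.95, 0.25, 0.20],
]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def project(points, x_name, y_name):
    """Pick two of the five features so the clusters can be drawn in 2-D."""
    xi, yi = FEATURES.index(x_name), FEATURES.index(y_name)
    return [(p[xi], p[yi]) for p in points]


labels = kmeans(tracks, k=2)  # clustering uses all five features
xy = project(tracks, "energy", "acousticness")  # plotting uses only two of them
```

The clustering itself still uses every feature; only the picture is limited to two dimensions, so different feature pairs can be plotted to view the same clusters from different angles.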

These techniques can help make sense of data in the personal data cloud. Once the data makes sense, the next step is providing utility to the user. This can mean recommending songs suited to the user’s location (for example, music suited to a library) or incorporating the results into a full application.

Combining data points and results from different individuals

Now what if you wanted to examine more than just this one individual, however active they are? The same analysis you performed for that one individual could be recreated in other individuals’ personal data clouds. You could then look for patterns across the different individuals and apply segments such as age and gender.

If you wanted to apply a distributed algorithm to analyze the data “locally” in each individual’s personal data cloud (which can perform functions and run apps), you could choose not to combine the data itself, but only the results of your algorithm’s analysis. This could test a hypothesis, such as whether people under 35 exercise outdoors more on Sunday afternoons in the San Francisco Bay Area even when the Air Quality Index (AQI) is above 100.
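A minimal sketch of this “combine the results, not the data” pattern might look like the following. The cloud structure, the field names and the local_analysis and combine functions are hypothetical; they only illustrate that raw events never leave the individual cloud.

```python
# Hypothetical personal data clouds; in practice each would live in a separate environment.
clouds = [
    {"age": 28, "events": [
        {"outdoor": True, "weekday": "Sun", "aqi": 120},
        {"outdoor": True, "weekday": "Sun", "aqi": 130},
        {"outdoor": False, "weekday": "Sun", "aqi": 120},
    ]},
    {"age": 31, "events": [
        {"outdoor": True, "weekday": "Sun", "aqi": 110},
        {"outdoor": True, "weekday": "Sat", "aqi": 150},
    ]},
    {"age": 52, "events": [
        {"outdoor": True, "weekday": "Sun", "aqi": 90},
        {"outdoor": True, "weekday": "Sun", "aqi": 105},
    ]},
]


def local_analysis(cloud):
    """Runs inside one personal data cloud and returns only a small summary."""
    sessions = sum(
        1 for e in cloud["events"]
        if e["outdoor"] and e["weekday"] == "Sun" and e["aqi"] > 100
    )
    segment = "under_35" if cloud["age"] < 35 else "35_and_over"
    return {"segment": segment, "sessions": sessions}


def combine(summaries):
    """Runs centrally; it sees summaries only, never the underlying events."""
    totals = {}
    for s in summaries:
        sessions, people = totals.get(s["segment"], (0, 0))
        totals[s["segment"]] = (sessions + s["sessions"], people + 1)
    return {segment: sessions / people for segment, (sessions, people) in totals.items()}


# Average number of outdoor Sunday sessions on high-AQI days, per segment.
averages = combine(local_analysis(cloud) for cloud in clouds)
```

Note that even a small summary like this can reveal something about an individual when few people fall into a segment, which is one reason such results are often combined with privacy-preserving techniques.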

Training data, machine learning, and personal data clouds

Training an algorithm takes a lot of data, and even then it is a lengthy process. In most cases, this data is in a massive central location and the server is running non-stop. This may seem like a challenge in a distributed network of personal data clouds. Yet, there is a logical analogy in the world of mobile devices where companies like Google have already come up with a pragmatic solution for training algorithms on the phone and sharing results (not the data).

Federated learning as a concept is powerful, and applying it to individual clouds makes a lot of sense. This is a topic we are exploring with some of our partners, and we expect to release several utilities for developers to leverage. An all-time favorite resource on the subject in the Prifina team is Google’s library of content around federated learning and differential privacy. They’ve even released a comic strip.
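To make the “train locally, share only the results” idea concrete, here is a toy sketch of federated averaging on a one-parameter linear model. It is an illustration under simplifying assumptions, not Google’s implementation or a Prifina utility; the data, learning rate and function names are all hypothetical.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Runs inside one personal data cloud: a few gradient steps on y = w * x.
    Only the updated weight and the sample count leave the cloud."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, len(data)


def federated_average(updates):
    """Runs on the coordinator: average the weights, weighted by sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical per-cloud training data, all roughly following y = 2x.
clouds = [
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
    [(1.0, 2.1), (4.0, 7.9)],
    [(2.0, 3.9), (3.0, 6.1), (5.0, 10.0), (6.0, 12.0)],
]

global_w = 0.0
for _ in range(50):  # communication rounds
    updates = [local_update(global_w, data) for data in clouds]
    global_w = federated_average(updates)
```

After the rounds complete, global_w ends up close to 2 even though the coordinator never saw a single (x, y) pair. Real systems add secure aggregation, client sampling and much larger models, which this sketch leaves out.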

For the privacy-conscious: federated learning has grown popular alongside privacy-preserving techniques such as differential privacy, which, put simply, adds noise to the data that largely cancels out in aggregate, making it difficult to trace unique or outlier data points back to their source.
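One common way to add such noise is the Laplace mechanism, sketched below in its local form: each individual perturbs their own value before sharing it. The value range, the epsilon setting and the heart-rate data are assumptions chosen for illustration, and a production system should rely on a vetted differential privacy library rather than hand-rolled sampling.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def privatize(value, lo, hi, epsilon, rng):
    """Clip the value to [lo, hi], then add Laplace noise scaled to (hi - lo) / epsilon."""
    clipped = min(max(value, lo), hi)
    return clipped + laplace_noise((hi - lo) / epsilon, rng)


rng = random.Random(42)

# Hypothetical resting heart rates, one per individual.
true_values = [rng.uniform(50, 90) for _ in range(10000)]

# Each individual shares only a noisy value; a single one says little about its owner.
noisy_values = [privatize(v, lo=40, hi=100, epsilon=1.0, rng=rng) for v in true_values]

true_mean = sum(true_values) / len(true_values)
noisy_mean = sum(noisy_values) / len(noisy_values)
```

Any single noisy value can be far from the truth, while the mean over many individuals stays close to the true mean because the zero-centered noise averages out. A smaller epsilon means more noise and stronger privacy, at the cost of needing more participants for the same accuracy.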

Many opportunities, yet nontrivial challenges

Individuals have an abundance of data. This data is underutilized and lacks the kind of pipelines that exist in the enterprise context. Organizing it in personal data lakes creates new opportunities for personal data and calls for a rethink of which methods work best. We invite our data scientist network to suggest their favorite approaches, libraries, and even challenges, so that we can turn the most valuable techniques into developer utilities.

Companies like Google are already pushing the envelope on processing data on the user’s machine with their Federated Learning of Cohorts (FLoC) model. That processing at the edge becomes far more useful when the edge also holds much more data. This is where applications and personal data can deliver great utility.

Connect With Us and Stay in Touch

Prifina is building resources for developers to help create new apps that run on top of user-held data. No back-end needed. Individual users can connect their data sources to their personal data cloud and get everyday value from their data.

Follow us on Twitter, Medium, LinkedIn, and Facebook, or listen to our podcast. Join our Facebook group Liberty. Equality. Data., where we share notes about Prifina’s progress. You can also explore our GitHub channel and join us on Slack.