What do we want from a dream data platform as a Service?

Niall Robinson
Met Office Informatics Lab
May 18, 2020

Over the last couple of years we’ve been working with technologies like Jupyter and cloud computing to create modern data platforms, largely as part of the Pangeo project. We’ve started rolling out these tools to real users which has given us a good perspective on what we think needs to be done next. We thought it was about time to take a step back and try to really verbalise, from a user-centric perspective, what functionality a dream data platform needs. As such, we’ve come up with a series of imagined user stories.

It’s important to know that we are describing a particular perspective. Our organisation, the UK Met Office:

  • Is relatively big, with hundreds of users
  • Is built on cutting-edge research science
  • Provides operational real-world advice/value based on this science
  • Regularly collaborates with others including sharing and comparing big datasets
  • Already has a big legacy data science IT estate

The advent of various data platforms as a service (PaaS) has been a huge step forward in recent years, answering the needs of many users. However, we think there is still scope for new functionality that a PaaS should offer organisations like ours.

We’ve tried to express our user stories like this: As a ____, I need to ____, so that ____. We’ve sketched out a few core user personas, though we’ve also played fast and loose by introducing ad hoc ones where it helps. We’ve constrained ourselves to user stories that we think are currently under-served.

The personas

Analyst: Someone who wants to extract information from data, for instance a research scientist, often working with very big data. They are proficient coders in their own domain, but not necessarily up to speed on how to use and configure the myriad cloud services on offer. They write code as a means to an end: the less time they spend thinking about code and computers, and the more time they spend thinking about their problem domain, the happier and more productive they will be. Not surprisingly for a data analysis platform, most of the user stories belong to this persona!

  1. “As an analyst, I want an easy point-and-click based user interface (UI) which allows me to access the power features (see below!) so that I can concentrate on gaining insight from my data and not spend time learning a new system/language/service/what cloud computing means.” This user story is foundational! It’s implicit in every other user story here. A big barrier to entry at the minute is that our users don’t want to learn how to use the cloud consoles, or a cloud provider Python module: they just want to get going. Many of these users already use Jupyter, so that’s a good start. The barrier to entry for any extra features needs to be as low as possible and the return on the “time to learn” investment clear.
  2. “As an analyst working with big data, I need to be able to scale my compute horizontally (more computers) and vertically (bigger/faster computers) so that I can perform my analysis in a reasonable timeframe.” The research that our analysts perform ranges from embarrassingly parallel problems that scale well across hundreds or thousands of small nodes, to highly coupled problems that require a single large machine, perhaps with GPUs or other specialist features. (There’s a sketch of what elastic scaling could feel like after this list.)
  3. “As an analyst, I need to be able to quickly create and publish user-friendly applications so that I can push new discoveries from research to applied science/operations/the real world.” Pull-through from science to operational weather forecasting is a key part of the Met Office mission.
  4. “As an analyst, I need to be able to share and publish notebooks so that I can work on problems with colleagues.” This also means sharing everything needed to run the notebook: the software environment; routes to access any data used; accompanying modules of code. This is the minimum: ideally we’d have real-time co-editing of notebooks, complete with tracked changes, and full code and data version history.
  5. “As an analyst, I need to be able to control the rate at which I spend my allocated funding/resource, so that I can choose to spend it effectively.” As Uncle Ben said in Spider-Man, with great power comes great responsibility. If we give people tools that let them analyse big datasets, we need to give them the ability to do it wisely.
  6. “As an analyst, I need a way to easily browse, understand and load datasets into my analysis session so that I use the best data available and spend more time doing actual science.” Ideally this would comprise drag-and-drop loading, natural-language-searchable datasets, versioning, human-readable descriptions, and so on. Loading means getting an analysis-ready object, not a file path string or 10,000 individual blobs that need hand-stitching together. Finding, accessing and loading the right data is one of the major blockers in data-led science. (There’s a sketch of what this could feel like after this list.)
  7. “As an analyst, I need to effortlessly work with various compute estates which are coupled to datasets, so that I do not have to change platform when I use a new dataset.” A lot of cloud platforms implicitly assume that we can move all our data into one cloud platform. This is probably true for a lot of small data-centric businesses, but it’s unthinkable for a whole scientific community: users can’t move all of their data up to the platform, and the platform can’t reach their data where it already lives. We need a hybrid approach.
  8. “As an analyst, I need to be able to submit and monitor long-running tasks to a robust compute service, so that I can complete analysis that takes too long to actively monitor.” Whilst we are fans of interactive workflows for science, sometimes nothing beats batch processing. This service should mirror the interactive environment as closely as possible, so that moving analyses from one to the other is trivial. (See the batch sketch after this list.)
  9. “As an analyst, I want to customise my Jupyter instance with the tools and extensions I find useful, so I can be most productive.” We can’t predict all the tools people need, we want to enable a good level of self service/customisation.
  10. “As an analyst, I want to customise my software stack so that I can access the most useful specialist tools.” The analysts are best placed to know what tools will be useful for them, so ideally they should be in control of their software environment, rather than a Sys Admin.
  11. “As an analyst who provides information to others (be that scientific publication, or consultancy with decision makers), I need to be able to prove the chain of analysis, so that I can: evidence my conclusion; analyse my level of confidence; correct mistakes; apportion revenue up the chain.” Provenance of information is fundamental to whether you can trust it or not, as well as how to recreate it.
  12. “As a research software engineer, I want to be able to seamlessly move between writing expressive interactive science analysis, and writing more traditional software so that I can develop high quality, powerful tools.” The notebook environment is great for research and exploration but it is not designed for writing high quality libraries and modules. The people who write the tools and libraries that our analysts depend on need to use more traditional IDEs and developer tools. Doing this on the same platform as the analysts they are serving and being able to interact seamlessly with the Notebook environment will result in the best tools.
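
To make story 2 concrete, here’s a minimal sketch of what elastic scaling could feel like from inside a notebook, using Dask Gateway (as many Pangeo deployments do). The gateway address and the worker options are assumptions about a hypothetical deployment, not a real service:

```python
from dask_gateway import Gateway

# connect to the platform's (hypothetical) Dask Gateway endpoint
gateway = Gateway("https://dask-gateway.example-platform.org")

# scale vertically: ask for bigger workers (which options exist depends
# on how the deployment is configured; worker_memory is illustrative)
options = gateway.cluster_options()
options.worker_memory = 16  # GB per worker
cluster = gateway.new_cluster(options)

# scale horizontally: from a couple of workers to hundreds, on demand
cluster.adapt(minimum=2, maximum=200)

client = cluster.get_client()  # subsequent Dask computations use the cluster
```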
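
Story 6 is roughly the gap that cataloguing tools like Intake are reaching for. Again a sketch: the catalogue URL and dataset name are made up, and we assume the entry is backed by an intake-xarray driver. The point is that the user gets back an analysis-ready object, never a pile of file paths:

```python
import intake

# browse a (hypothetical) organisation-wide catalogue of published datasets
cat = intake.open_catalog("https://data.example-platform.org/catalog.yaml")
list(cat)  # e.g. ["global_sst", "uk_rainfall", ...]

# one line from dataset name to an analysis-ready, lazily loaded xarray object
ds = cat["global_sst"].to_dask()
print(ds)  # dimensions, coordinates and human-readable metadata, not blobs
```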
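
And story 8 hints at how a batch service could mirror the interactive one: make the unit of batch work the very same notebook the analyst was just editing, in the way papermill already does. The notebook names and parameters below are made up:

```python
import papermill as pm

# run an existing analysis notebook as a batch job, injecting parameters;
# a platform batch service could wrap exactly this behind a "submit" button
pm.execute_notebook(
    "sst_analysis.ipynb",            # the notebook developed interactively
    "runs/sst_analysis_2020.ipynb",  # executed copy, kept for the record
    parameters={"year": 2020, "region": "north_atlantic"},
)
```

Conveniently, the executed copy doubles as a record of what was run, which links back to the provenance needs in story 11.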

Data generator: Someone who produces data and datasets, which are often very large. They want people inside or outside the organisation to find, understand and use their data. They understand their data and know how to work with and manipulate it.

  1. “As a data generator, I need to be able to publish my datasets so that the appropriate people can discover and use them.”
  2. “As a data generator, I need to be able to evidence what my data is being used for, so I can justify my funding.”
  3. “As a data generator, I need tools for publishing many individually meaningless chunks of data as a meaningful dataset which can be used by other humans.”
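
That last story is exactly the gap that tools like xarray and Zarr are starting to fill: turning thousands of individually meaningless files into one coherent, chunked, cloud-friendly dataset. A minimal sketch, assuming NetCDF model output, made-up dimension names, and an S3 bucket we can write to:

```python
import fsspec
import xarray as xr

# open thousands of individual files as one logical dataset
ds = xr.open_mfdataset("model_output/*.nc", combine="by_coords", parallel=True)

# rechunk into analysis-friendly pieces (dimension names are assumptions)
ds = ds.chunk({"time": 100, "latitude": 180, "longitude": 360})

# publish as a single, self-describing Zarr store that others can lazily load
store = fsspec.get_mapper("s3://example-bucket/global-model-output.zarr")
ds.to_zarr(store, consolidated=True)
```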

System administrator: They design and maintain the system. They make sure it stays up, develop and improve it, make sure it’s safe, secure, cost effective and fit for purpose. It is their business to know and understand how complex systems fit together and work. They want to understand performance and cost of a system, and where these characteristics are coming from.

  1. “As a system administrator, I want a backstop on how much money my users spend, so that they don’t accidentally run up large bills.” These billing controls should be hierarchical: a backstop for the whole organisation, for departments, for teams, for individuals, and so on. Management of these budgets needs to be assignable to the relevant person, not a management overhead for the system administrators. (There’s a sketch of this logic after this list.)
  2. “As an enterprise company with a large and complex existing data estate, I need access to a series of well-decoupled services, so that I can combine them with existing services.” A lot of prior data platform offerings have been monolithic, which makes it impossible to couple them to existing enterprise systems. For example, there is no point in provenance-as-a-service (see analyst story 11) if we can only address it from one particular cloud Jupyter notebook service. We also need legacy systems (such as something obscure like a bespoke satellite post-processing suite coded in Fortran, running on an on-prem server) to be part of the same ecosystem.
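
We don’t know of a service that offers the hierarchical backstops in story 1 out of the box, but the logic is simple enough to sketch. Everything here (the account paths, the figures) is hypothetical:

```python
# hypothetical hierarchical spending backstops, in pounds
BUDGETS = {
    "office":                 1_000_000,  # whole-organisation backstop
    "office/research":          200_000,  # departmental backstop
    "office/research/niall":      5_000,  # individual backstop
}

def spend_allowed(account: str, already_spent: dict, amount: float) -> bool:
    """Approve a spend only if it fits under every backstop above it."""
    parts = account.split("/")
    for depth in range(1, len(parts) + 1):
        level = "/".join(parts[:depth])
        if already_spent.get(level, 0.0) + amount > BUDGETS[level]:
            return False  # would breach this level's backstop
    return True

# e.g. an individual request is checked against their own budget,
# their department's, and the whole organisation's:
spend_allowed("office/research/niall", {"office/research": 199_900}, 500)
# -> False (the departmental backstop would be breached)
```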

Final Comments

Of all these user stories, there are some which are really crucial.

Fundamentally, we have to move towards a one-stop, effortless user experience for our analysts, where they spend as much time as possible doing their job: gaining insight from data. Having to learn a cloud console is a blocker. Fortunately, most users are familiar with the Jupyter ecosystem, and JupyterLab plug-ins can provide a nice way to offer new services and features with a low barrier to entry.

It’s also incredibly important that we create an ecosystem of services which play nicely together but stand on their own. For the foreseeable future, our organisation is likely to have specialist, home-made systems as part of our data estate, and we really want to be able to couple them into an ecosystem of tools.

Related to that, it’s becoming increasingly apparent that there is no such thing as “a Data Platform.” We regularly mix our data with data stored at peer organisations, universities, other cloud providers, other supercomputing centres, and so on. A data and tooling strategy that can consolidate all of these things would be very powerful. Ultimately, we need a discoverable, interwoven network of data resources that can be shared and combined. This probably implies standards; indeed, it’s not so far off the World Wide Web.

There is possibly also a role for social networks here. Corporate social networks and document management systems (e.g. Yammer, SharePoint) are rapidly becoming established ways of organising groups of people, as well as sharing and finding things. Perhaps data platforms need to move towards this world? Is a dataset really so different from a Word document? It has owners, users, versions; it needs to be published and found. Jupyter notebooks are even closer to traditional documents. Perhaps, by ceasing to treat data platforms as separate ecosystems, we can piggyback on proven approaches.

Finally, there are things we haven’t touched upon here. One notable omission is the task of turning ad hoc research into robust production data pipelines, including the running of very heavy supercomputer models… but that’s a topic for another day!

What else do you think should go on the list? Let us know in the comments below.

with thanks to Theo McCaie and the rest of the Informatics Lab.
