Lab on the Road

Informatics Lab at UKRI Cloud ‘Unconference’ 2019

Kevin Donkers
Met Office Informatics Lab
5 min read · Sep 27, 2019


Nothing like a canal-side cycle to burn off all those workshop calories

UK Research and Innovation (UKRI) have a Cloud Working Group which meets periodically to discuss, present and hack on all things cloud computing in UK research and academia. On 11 September 2019 we met at the University of Birmingham for a technical workshop 'unconference': an event with no fixed agenda in which researchers, developers and DevOps engineers could talk tech, give informal talks and share their experience of working in the parallel, scalable and cloud computing space.

Cloud vs ‘On Prem’ Computing

One of the key discussion points was the debate between public cloud, private cloud and on-premises compute infrastructure. While some would point out that this debate has gone on for years and bemoan the lack of movement, hearing the concerns and opinions of those working with these systems highlighted the nuanced situation that many research groups and institutes find themselves in.

For clarity, I'll summarise the benefits and concerns of each approach, then the overall solutions being considered.

Public cloud

What:
Purchasing compute time from a 'Big Tech' provider like Amazon Web Services or Microsoft Azure. This involves buying no hardware at all. Because providers operate at such huge scale, resources appear seemingly "unlimited" to clients. Access is exclusively over the internet.

Pros:

  • No hardware needs to be purchased, which eliminates multi-year procurement-installation-testing cycles. Just rent the resources you need for the job you want to execute.
  • Elastically scalable compute resource (if configured correctly). If you have bursty workloads, you only pay for what you use.
  • Clearer billing: many on-premises systems are shared by multiple users and projects simultaneously, and it can be very difficult to allocate cost. Allocating cost to each user/activity on public cloud is easy because metering is baked into the service.

Cons:

  • Vendor lock-in. Moving all your infrastructure to one provider makes you vulnerable to future price hikes, discontinuation of (bespoke) core services and the provider going bust. This is particularly scary if you store all your data in the public cloud.
  • Moving data to/between/from public cloud infrastructure is slow and potentially very expensive.
  • Requires a non-trivial change in funding models from capital expenditure (CAPEX) to operational expenditure (OPEX).

Private cloud

What:
Purchasing or renting a large cluster of compute resource hardware and providing a similar service to public cloud providers. The hardware can be on your premises or in a data centre elsewhere, but the hardware is owned by/assigned to just you. Resources can be made available within your organisation and to external collaborators/customers, but always through a private network rather than openly through the internet.

Pros:

  • Potentially less reliant on the services of another company, especially if they decide to discontinue a bespoke service.
  • Potentially closer to your data (if on premises), particularly if you have it archived. This can reduce the time and expense of moving data.
  • More control and ownership over your resources. Private network enhances security.

Cons:

  • Potentially a very large expense, and a lot of effort to maintain.
  • Often doesn't come close to the scale of public cloud providers, so you can run up against scaling limits.

On premises

What:
More traditional compute resources inside an organisation, including data analysis clusters, HPC and data archiving. Access from outside the owning organisation is notoriously difficult.

Pros:

  • Maximum control, either through high security or simple obscurity.
  • Fits with traditional capital expenditure (CAPEX) funding model.

Cons:

  • Poor access from outside the organisation/premises, which makes collaboration frustrating.
  • Expensive and slow to update.
  • People's time is often spent inefficiently in order to maximise efficient use of the compute resource (e.g. by batch-queuing jobs), usually to justify its cost.

Solutions

The above arguments would suggest an all-or-nothing approach, but many of the best solutions are a hybrid of resources. Here are some of the key opinions:

  • Open Standards:
    Accept the risks of public cloud and mitigate them by using open standards and technologies, e.g. OpenStack. This makes it easier to shift to another provider, or back to your own premises, if need be (see the sketch after this list).
  • Hybrid/Federated Cloud:
    Use a combination of on-premises, private cloud and public cloud that makes sense for your organisation. This can involve running workloads on different systems where finances or data gravity are limiting factors. If an organisation already has infrastructure in place (particularly data storage), integrating it into a hybrid cloud setup can make much more efficient use of what it already has, rather than a wholesale 'lift-and-shift' to a public cloud provider. IRIS, the Science and Technology Facilities Council (STFC) eInfrastructure project, is one example. Many industries have found solutions using this approach and consequently changed their funding models to OPEX, or a blend of CAPEX and OPEX, where appropriate.
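
To make the open-standards point concrete, here is a minimal sketch using the OpenStack SDK for Python. The cloud names in clouds.yaml ("on-prem" and "public") and the image, flavour and network names are assumptions for illustration; the point is that the same code can drive a private OpenStack deployment or an OpenStack-compatible public provider.

```python
# Minimal sketch: provider-agnostic provisioning via the OpenStack SDK.
# Assumes a clouds.yaml defining two clouds, "on-prem" and "public"
# (hypothetical names); the same code drives either one.
import openstack

def launch_worker(cloud_name: str):
    # Credentials and endpoints come from clouds.yaml, not the code
    conn = openstack.connect(cloud=cloud_name)

    image = conn.compute.find_image("ubuntu-18.04")
    flavour = conn.compute.find_flavor("m1.medium")
    network = conn.network.find_network("default")

    # The API is identical regardless of which cloud backs it, which is
    # what makes shifting between providers feasible.
    server = conn.compute.create_server(
        name="analysis-worker",
        image_id=image.id,
        flavor_id=flavour.id,
        networks=[{"uuid": network.id}],
    )
    return conn.compute.wait_for_server(server)

# launch_worker("on-prem") and launch_worker("public") differ only in config.
```

Because all the provider-specific detail lives in configuration, moving a workload becomes a config change rather than a rewrite.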

Tools to keep an eye on

As well as some healthy discussion over where the future of cloud sits within UK academia, there was some interesting tech to keep an eye on:

Billing

A theme that came up a number of times was how to effectively bill users of a platform for the time and resources they use. One technology to address this is CloudKitty, part of the OpenStack project. It collects metrics on each user's resource consumption, which can then be fed into billing software to work out how much each user is spending on a platform.
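
As a hedged illustration of the rating idea (not CloudKitty's actual API), the sketch below multiplies per-user metered usage by operator-defined unit prices. In a real deployment CloudKitty pulls these metrics from OpenStack telemetry and applies configurable rating rules; the metric names and prices here are invented.

```python
# Illustrative rating calculation in the style of CloudKitty: multiply
# each user's metered resource consumption by a unit price. The metric
# names and prices are invented for illustration only.
from collections import defaultdict

# Hypothetical rating rules: price per unit
PRICES = {"instance_hours": 0.05, "storage_gb_hours": 0.001}

# Metered usage samples: (user, metric, quantity)
usage = [
    ("alice", "instance_hours", 730),
    ("alice", "storage_gb_hours", 5000),
    ("bob", "instance_hours", 48),
]

bills = defaultdict(float)
for user, metric, quantity in usage:
    bills[user] += quantity * PRICES[metric]

for user, cost in bills.items():
    print(f"{user}: £{cost:.2f}")
```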

Scientific reproducibility

Exactly how reproducible is modern, computer-driven science? Different versions of software, its dependencies, its data and even the infrastructure it runs on can lead to different results when experiments are reproduced. Factor in lost software and data and you can see that reproducibility is under strain. Projects like Zenodo for data archiving and Software Heritage for software archiving are key players in tackling this.
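
For data archiving, Zenodo exposes a REST API. Below is a hedged sketch of depositing a results file so that the exact dataset behind a piece of work gets a citable DOI; the access token and filename are placeholders, and the metadata and publish steps are omitted for brevity.

```python
# A sketch of archiving a results file on Zenodo via its REST API.
# ACCESS_TOKEN and the filename are placeholders.
import requests

BASE = "https://zenodo.org/api"
ACCESS_TOKEN = "..."  # personal token from your Zenodo account

# 1. Create an empty deposition
r = requests.post(f"{BASE}/deposit/depositions",
                  params={"access_token": ACCESS_TOKEN}, json={})
r.raise_for_status()
deposition = r.json()

# 2. Attach the data file to it
with open("results.csv", "rb") as fh:
    r = requests.post(f"{BASE}/deposit/depositions/{deposition['id']}/files",
                      params={"access_token": ACCESS_TOKEN},
                      data={"name": "results.csv"},
                      files={"file": fh})
    r.raise_for_status()

# Adding metadata and publishing (which mints the DOI) follow the same
# pattern; see Zenodo's REST API documentation for the full workflow.
```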

Distributed data

Should an organisation opt for a hybrid/federated cloud system, how do its users access data located in many different places around the world? Solutions can be found in Onedata and the CERN project Dynafed, which expose globally distributed datasets under a single namespace for the user.
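
As a rough sketch of the experience such a federation layer aims for, the snippet below reads from a single logical HTTP namespace using fsspec. The federation endpoint and path are hypothetical; in a Dynafed deployment the server redirects each read to whichever storage site actually holds a replica of the file.

```python
# Sketch of reading from a federated namespace over plain HTTP.
# The endpoint URL and path below are hypothetical.
import fsspec

FEDERATION = "https://dynafed.example.ac.uk/myfed"  # hypothetical endpoint

fs = fsspec.filesystem("http")

# The path looks like a single tree even though the underlying data may
# live at several sites around the world.
with fs.open(f"{FEDERATION}/climate/obs/station_temps.nc") as f:
    header = f.read(64)
    print(header[:8])  # e.g. the NetCDF/HDF5 magic bytes
```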

Authentication

Authentication is not a new problem and has been solved in many ways, but it can be a very involved process to implement. Keycloak takes a lot of the hard work out of Identity and Access Management, integrating with existing LDAP or Active Directory servers and offering log-in via a plethora of authentication methods (GitHub, Google, Kerberos, etc.). And it's open source!
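
For a flavour of how little client code this takes, here is a hedged sketch of obtaining a token from a Keycloak realm using the OpenID Connect direct access (password) grant. The server URL, realm, client and credentials are all placeholders, and the client would need direct access grants enabled in Keycloak.

```python
# Sketch: obtaining an OpenID Connect token from Keycloak using the
# resource owner password grant. Server, realm, client and credentials
# are placeholders; 2019-era Keycloak serves its endpoints under /auth.
import requests

KEYCLOAK = "https://auth.example.ac.uk/auth"   # hypothetical server
REALM = "research"                             # hypothetical realm

token_url = f"{KEYCLOAK}/realms/{REALM}/protocol/openid-connect/token"
resp = requests.post(token_url, data={
    "grant_type": "password",
    "client_id": "my-portal",      # a client registered in the realm
    "username": "kevin",
    "password": "correct-horse",   # placeholder credentials
})
resp.raise_for_status()
tokens = resp.json()

# The access token is then sent as a Bearer header to any service that
# trusts this realm, whether the user record lives in Keycloak itself
# or in a federated LDAP/Active Directory.
headers = {"Authorization": f"Bearer {tokens['access_token']}"}
```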

JASMIN+Pangeo

Pangeo is now available on JASMIN, the NERC and STFC data analysis environment! Instructions on how to get started can be found here.
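
For those unfamiliar with it, a minimal Pangeo-style workflow looks like the sketch below: xarray for labelled data, Dask for lazy parallelism. The filename is a placeholder, and on a platform like JASMIN the Dask cluster would typically be provided for you rather than started locally.

```python
# A minimal Pangeo-style workflow: xarray backed by Dask for lazy,
# parallel analysis. The filename is a placeholder; LocalCluster stands
# in for a platform-managed Dask cluster.
import xarray as xr
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4)
client = Client(cluster)

# Open a dataset lazily, chunked for parallel computation
ds = xr.open_dataset("air_temperature.nc", chunks={"time": 100})

# Nothing is computed until .compute() is called; Dask then runs the
# reduction across the workers.
monthly_mean = ds["air"].groupby("time.month").mean().compute()
print(monthly_mean)
```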

Conclusion

The debate about cloud vs on-prem is still alive and kicking within the UK research community. However, many solutions and examples are emerging, and they paint a picture of approaches that sit on a spectrum between fully on-premises and fully public cloud. One of the key struggles is working out how future funding models will work: whether they move towards operational expenditure (OPEX), a blend with capital expenditure (CAPEX), or something new entirely. UKRI are working on this and their findings will have a big impact on the future of UK research.


Kevin Donkers
Met Office Informatics Lab

Research technologist at the Met Office and PhD student with the Environmental Intelligence CDT, University of Exeter. Culinary adventurer and bicycle fiend.