Combine AWS cloud with OVH storage for handling sensitive EU data

Kris Peeters · Published in datamindedbe · 8 min read · Oct 5, 2020

Now that the Privacy Shield has been invalidated, there are legal disputes about whether European companies and government agencies can still store sensitive data with the big 3 cloud providers (AWS, Azure and GCP). They are owned by American companies, and as such, there are no guarantees that the US government won’t force these companies to give them access to that data.

https://ec.europa.eu/newsroom/just/item-detail.cfm?item_id=33392

Based on this ruling, the Flemish Oversight Committee published a recommendation (in Dutch) basically advising against using AWS for storing certain sensitive education data. If you read through it, the crux is this: US intelligence agencies can request data from these cloud providers, regardless of whether the servers are located in the US or not.

Finding an alternative

Let’s begin with a disclaimer: I am not a lawyer. I do not know the details of the Privacy Shield. I cannot give advice on which data you can store where. Seek advice from legal counsel.

With that out of the way, let’s look at some options. In its advice, the Oversight Committee points to Belgian cloud providers such as g-cloud. Which is like saying: “No sorry, you can’t participate in the race with a Porsche. But here’s a unicycle. Good luck.” We all know that these providers are nowhere near the capabilities of AWS / Azure / GCP. It’s not even a comparison. It got me throwing a complete fit on Twitter (in Dutch) about how stupid this advice was.

But it got me thinking. What if… What if we had to store data on servers owned by EU companies? How far could we take this? Let’s look at the EU cloud market by opening a Forrester report (I know…).

Forrester Wave to the rescue!

This is a report for private cloud. I’m not a big believer in those, as you lose all the flexibility and scalability you get with public cloud. But it’s a starting point. Let’s go through the list:

  • Rackspace: Owned by a US company. No.
  • Orange: For a previous client, we investigated the Orange cloud a few years ago. It was a tyre fire. Maybe that has changed by now. But I don’t feel like investigating it. Still coping with PTSD. No.
  • CenturyLink: Never heard of it, sorry. No.
  • TCS: Not European. No.
  • Atos: *Shivers*. No.
  • Sungard: What? No.

So, that leaves us with OVH, which is French. They have a public cloud offering, and I had a good experience with them in the past, just renting a server. So, YES.

A peek into OVH Cloud

They seem to offer all the right services for a modern cloud-native stack.

https://www.ovhcloud.com/en-gb/public-cloud/

Of course, it’s nothing like AWS, Azure or GCP. The big 3 offer a million services, which they run for a million clients, probably at a scale a million times bigger than OVH. That being said, I do believe there is some commoditisation at play in the cloud world.

  • Object storage: Really, that’s been around for ages. OVH offers it at a competitive price, and they even claim to have an S3-compatible API. Interesting!
  • Databases: Sure, BigQuery, Synapse and Redshift are awesome. But sometimes a Postgres db is good enough. As long as it is managed.
  • Containers: More and more workloads run on containers. They offer Managed Kubernetes, a container registry and some more extras. Let’s go!

If you can offer the above 👆 together with some basic networking and decent Identity and Access Management, I think you can cater for 80% of the data solutions out there. They even offer some data & analytics specific services, but let’s keep those out of scope for now. Again: will it be as good, as smooth or as flexible as the big cloud providers? Definitely not. Will it be good enough? Let’s find out.

OVH Object storage

I registered on their website, and I immediately got charged EUR 1. Damn you! AWS gives stuff for free if you want to try it out. But ok. I can spare the euro I guess.

Cool kids define their infrastructure in Terraform, and there is an OVH provider for Terraform. But I didn’t feel like figuring that out just yet. Let’s start with the basics. Through the OVH UI it’s pretty straightforward to first create a project and then add a storage container, which is the equivalent of an S3 bucket. In OVH, a container doesn’t need a globally unique name; it’s linked to your account.

Nice, let’s manually upload a file.

It immediately gives you a CNAME so you can set up some nicer DNS names. But I don’t really care about that at the moment. Now I want to use a CLI or something to access this. That threw me down a rabbit hole, and the documentation was not helpful at all:

  • Apparently OVH is OpenStack behind the scenes.
  • To manage an OpenStack cluster, you need to open the Horizon UI.
  • To open the Horizon UI, you first need to create a user with the appropriate rights.

Ok, done. I can log in through the Horizon UI. Now what?

  • Now you download any of 3 different config files to use in a CLI.
  • Then install the Swift client: pip install python-swiftclient
  • Then you need to activate one of those config files by sourcing it; it’s basically a bunch of export statements.

🎉 Tadaa, we’re there. We can use the swift command line, which is fairly simple.
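
For what it’s worth, the python-swiftclient package you just installed can also be used as a library instead of a CLI. Here’s a minimal sketch, assuming you’ve sourced the openrc file so the OS_* environment variables are set (the exact variable names depend on the file OVH gives you), and with “my-container” as a made-up container name:

import os
from swiftclient.client import Connection

# Build a connection from the OS_* variables exported by the sourced
# openrc file (Keystone v3 authentication).
conn = Connection(
    authurl=os.environ["OS_AUTH_URL"],
    user=os.environ["OS_USERNAME"],
    key=os.environ["OS_PASSWORD"],
    auth_version="3",
    os_options={
        "project_name": os.environ["OS_TENANT_NAME"],
        "region_name": os.environ["OS_REGION_NAME"],
    },
)

# List the objects in a container ("my-container" is a placeholder).
headers, objects = conn.get_container("my-container")
for obj in objects:
    print(obj["name"], obj["bytes"])

# Upload a small object.
conn.put_object("my-container", "hello.txt", contents=b"hello from swift")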

But I want S3 compatibility. Where is it? The OVHcloud docs are really short, and more or less unusable. It’s better to read the OpenStack docs directly. But this one is actually explained quite well in this blog. You need to create credentials first, which you do by installing yet another Python library through pip install python-openstackclient and then running this command:

openstack ec2 credentials create

Weird. I don’t want ec2 credentials. What does that even mean? I think someone at OpenStack was confused. Or maybe that’s another ec2: presumably these are credentials for OpenStack’s EC2-compatibility API. Anyway, you get a nice access key and secret back.

And yes, I deleted those before publishing this blog post :-)

Now I needed another pip install: pip install awscli-plugin-endpoint to get this to run with the AWS CLI, which I already had installed. By the way, do all of this pip installing in a virtual environment; you’ll probably hit version conflicts if you don’t.

Then, I added this to ~/.aws/config:

[plugins]
endpoint = awscli_plugin_endpoint

And this to ~/.aws/credentials:

[ovh]
aws_access_key_id = 525b52908af844a697b35755d8ce9bbb
aws_secret_access_key = 306b3eed565143b8a429401db6d808e9
region = gra
s3 =
    endpoint_url = https://s3.gra.cloud.ovh.net
    signature_version = s3v4
    addressing_style = virtual
s3api =
    endpoint_url = https://s3.gra.cloud.ovh.net

And yes, again, I deleted those credentials before posting this blog :-). The region “gra” is Gravelines, a town in the north of France, by the way. We should be done. And indeed: an aws s3 ls --profile ovh now happily lists the container we created earlier.

Pretty cool. If it walks like a duck and quacks like a duck, it must be a duck.

How can you use this?

The fact that you can just add another profile to your AWS config is big. It means this will also work with libraries like boto3 or AWS Wrangler. There is no reason why you wouldn’t be able to run a job on an AWS Kubernetes or EMR cluster, read the data from OVH, process it, and write it back.
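
One caveat: boto3 doesn’t pick up the endpoint settings from the CLI plugin above, so you have to pass the endpoint explicitly. A minimal sketch, reusing the (revoked) [ovh] profile; the container and object names are made up:

import boto3

# Reuse the keys from the [ovh] profile, but pass the OVH endpoint
# explicitly: the awscli_plugin_endpoint config only applies to the CLI.
session = boto3.session.Session(profile_name="ovh")
s3 = session.client(
    "s3",
    region_name="gra",
    endpoint_url="https://s3.gra.cloud.ovh.net",
)

# List all containers (buckets) in the project.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Read an object back ("my-container" and "hello.txt" are placeholders).
body = s3.get_object(Bucket="my-container", Key="hello.txt")["Body"].read()
print(body.decode())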

Of course, there are a few drawbacks:

  • You move data around between cloud providers. That can become slow and expensive.
  • You have to manage your OVH credentials somehow. In AWS you typically work with IAM roles that have permission to access data, with temporary credentials managed behind the scenes for you.
  • You are still processing the data in a data center owned by a US company.

But it also brings benefits:

  • Your data is stored by an EU company. Although I have no clue how extensive their encryption support is.
  • If a US intelligence agency wants your data, they can’t just shop for it in S3. They would have to actively monitor your processing cluster on AWS and hack into it to steal the data in flight. Not impossible. But not trivial either. Or they could steal the OVH keys that you have to store somehow, somewhere in AWS. Again, I’m not too concerned about that, given the strict security measures AWS employs. But this entire exercise is based on the assumption that US intelligence agencies have access to your AWS account.

Conclusion and next steps

The conclusion so far is that yes, you can leave your data with an EU company and still do all your processing on your favourite cloud, with minimal disruption. Whether that’s good enough really depends on your use case. I see three obvious next steps:

  1. It would be cool to actually set up a Kubernetes job on AWS that reads and writes from OVH object storage and from a regular S3 bucket seamlessly. Given that the API indeed seems to be compatible, I see no reason why that wouldn’t work; a rough sketch of what such a job could do follows this list.
  2. We can take this one step further. Let’s also explore the data processing offering of OVH. Then you could end up in a scenario where you store and process your most sensitive data on OVH. Only the “sanitized” version of that data, whatever that may mean, you push to AWS.
  3. As an extra bonus, imagine you built your data platform on Kubernetes (shameless plug: like we did with Datafy). Then having workloads on AWS, on OVH or on another cloud provider should be a seamless experience for you. You simply declare where you want the data to reside, and your data platform routes the compute to the right cloud. Disclaimer: no, Datafy doesn’t support this (yet 😂). It’s easier said than done, but once up and running, you can enjoy the full ecosystem that AWS offers for the bulk of your workloads, while doing the sensitive workloads on EU providers.
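
To make step 1 a bit more concrete, here’s a rough sketch of what such a dual-cloud job could look like, with one boto3 client per cloud. The bucket names and the “sanitize” step are made up, and I assume the AWS side authenticates through its usual credential chain (e.g. an IAM role):

import boto3

# One client per cloud: explicit endpoint for OVH, default chain for AWS.
ovh = boto3.session.Session(profile_name="ovh").client(
    "s3",
    region_name="gra",
    endpoint_url="https://s3.gra.cloud.ovh.net",
)
aws = boto3.client("s3")

# Read the sensitive data from OVH, sanitize it, push the result to AWS S3.
raw = ovh.get_object(Bucket="sensitive-data", Key="input.csv")["Body"].read()
sanitized = raw.upper()  # placeholder for the real anonymisation logic
aws.put_object(Bucket="my-aws-bucket", Key="sanitized/input.csv", Body=sanitized)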

Definitely let me know what you think and what your experiences are with OVH! Curious to learn more.


Kris Peeters is a data geek at heart, and the founder and CEO of Data Minded.