“Ways people use Sentinel” or what lessons we’re learning while sharing Open Data
An anomalous number of data requests to the Sentinel AWS bucket got us thinking about how to share open data while keeping costs under control, and how to exploit such valuable resources in the cloud most efficiently.
Recently we noticed an anomaly in the usage pattern of the Sentinel Open Public Dataset on AWS:
Someone managed to rack up more than $10,000 in costs over two weeks with GET requests to S3. Using the cost of a single request ($0.00000043), we estimated that more than 30 billion reads were executed (the requests came from within the same AWS region, so there were no data transfer costs). Digging in a bit more, we found that the requests only accessed Sentinel-2 metadata files like this one. It seems that a script was building a database of Sentinel-2 tiles. We offer an OpenSearch service which should make database copies unnecessary, but we understand that having one's own custom copy has its advantages. However, there are only 4 million tiles available, so the script must have read each individual JSON file about 10,000 times. After two weeks the activity stopped, so we are guessing that somebody noticed the costs on the other side or that the script had finally run its course. One way or another, we wish we knew more about the intent behind this processing, so that we could help achieve the goal more efficiently if a similar need occurs in the future.
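For the curious, the back-of-envelope arithmetic behind those estimates, using only the figures quoted above, looks like this:

```python
# Sanity check of the figures quoted in the post.
PRICE_PER_GET = 0.00000043   # USD per S3 GET request, as quoted
REQUESTS = 30e9              # the estimated number of reads
TILES = 4e6                  # Sentinel-2 tiles available at the time

cost = REQUESTS * PRICE_PER_GET    # ~$12,900, i.e. "more than $10,000"
reads_per_tile = REQUESTS / TILES  # 7,500 reads of each JSON file

print(f"estimated cost:  ${cost:,.0f}")
print(f"reads per tile:  {reads_per_tile:,.0f}")
```

The numbers are consistent: 30 billion requests at that price comes to roughly $12,900, and spread over 4 million tiles it means each metadata file was fetched thousands of times over.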
At Sinergise we do not actually pay the cost of the Sentinel Open Public Dataset — AWS kindly covers that for the benefit of the whole community. However, we are well aware that it is not pocket change, so we try to manage the budget as if it were our own. Events like these make us stop and think, not least because our Sentinel Hub business depends significantly on the data being freely available.
What can we learn from this experience?
Nothing is really free. Neither is Sentinel data. The European Union invested billions in the Copernicus programme. Hosting and distributing the data also cost quite a bit, and as the volume of data rises, so do the costs of processing it. Cloud infrastructure is still a novelty for most of us, and we are not used to having supercomputer capabilities at our fingertips. It is easy to run an image processing script at global scale over a petabyte of data. It feels as simple as working with a couple of tiles on a desktop machine, and it doesn't take much longer. But there is a huge difference when it comes to costs — a simple inefficiency or a bug that would hardly have been noticeable before can now have a real and heavy impact on the credit card. If we want to build sustainable systems working with such large datasets, we have to be especially careful and keep a very close eye on the resources we consume.
At Sinergise we've managed to create a viable business model with Sentinel Hub, opening Sentinel data to just about anyone for free. This is not due to an unlimited amount of credits (Sentinel Hub costs are not covered by AWS) but because we have designed our system with smart use of cloud infrastructure in mind. Obviously, we make use of the “free” data hosted in the cloud, but we only download the chunks of the images that a user actually requires — users hardly ever look at 10,000 sq. km of imagery at full resolution. We do not store any images locally, not even temporarily. We launch Spot instances as demand peaks and kill them when usage levels off. And we do not run millions of Lambda processes without real cause… In the end, all of this doesn't just make the system cost efficient; it also makes it very fast, which is what our users love.
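The "download only the chunks you need" idea boils down to HTTP range requests, which S3 supports out of the box. A minimal stdlib-only sketch (the bucket URL reflects the public dataset, but the object key below is purely a placeholder):

```python
import urllib.request

# Public Sentinel-2 L1C bucket endpoint; the key used later is illustrative only.
BASE = "https://sentinel-s2-l1c.s3.amazonaws.com"


def byte_range(start, length):
    """HTTP Range header value covering `length` bytes from offset `start`."""
    return f"bytes={start}-{start + length - 1}"


def read_window(url, start, length):
    """Fetch only a window of a remote file instead of the whole object."""
    req = urllib.request.Request(url, headers={"Range": byte_range(start, length)})
    with urllib.request.urlopen(req) as resp:
        return resp.read()  # at most `length` bytes travel over the wire
```

For example, `read_window(BASE + "/tiles/<path>/B04.jp2", 0, 16 * 1024)` would pull just the first 16 KB of a band file (enough for a JPEG2000 header) rather than the whole ~100 MB image — the difference between a cheap, fast system and an expensive, slow one when multiplied across millions of requests.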
When the Sentinel Open Public Dataset was established two years ago, getting hold of Sentinel imagery was quite a challenge. It was distributed in unwieldy chunks, Copernicus SciHub had trouble managing the demand, and there was no other place to get it. At that time we decided to make the data available as easily as possible, using just basic HTTP requests. Things are changing now, with the collaborative ground segment running, four DIAS-es coming in a few months, and the data generally becoming more easily accessible.
This is why, in a few months, we plan to direct the AWS Sentinel repository back towards its original purpose — processing of data in the cloud. Access to the Sentinel-2 L1C bucket will change to “requester pays”, the same policy that has been in place for the Sentinel-2 L2A and Sentinel-1 buckets for about a year. This should not introduce any changes for anyone using the data from the eu-central-1 (Frankfurt) region, where it is hosted. For those in other regions or on the open Internet there will be some additional data transfer costs involved, but they should be small and manageable for anyone using the data wisely (currently $0.02–$0.09 per GB transferred). On the other hand, it will reduce the internal costs of running this public bucket and help avoid unexpected charges like the one described above.
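For anyone adapting their scripts, a requester-pays read mostly amounts to one extra parameter on an authenticated request. A minimal sketch with boto3 (the object key below is a placeholder; the transfer is billed to the AWS credentials making the call):

```python
BUCKET = "sentinel-s2-l1c"   # the Sentinel-2 L1C bucket
REGION = "eu-central-1"      # Frankfurt, where the data is hosted


def get_object_args(key):
    """Arguments for an S3 GET that accepts the requester-pays charge."""
    return {"Bucket": BUCKET, "Key": key, "RequestPayer": "requester"}


def fetch(key):
    """Read one object from the requester-pays bucket (signed request)."""
    import boto3  # imported lazily so the helper above stays stdlib-only
    s3 = boto3.client("s3", region_name=REGION)
    return s3.get_object(**get_object_args(key))["Body"].read()
```

Anonymous requests are rejected on requester-pays buckets, so the request must be signed — boto3 takes care of that once credentials are configured. Running your processing inside eu-central-1 keeps the transfer free, exactly as before.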
This will also allow us to get some interesting new datasets on board, the first of which we plan to announce early this summer.
P.S. The attitude of the AWS folks after noticing this, as on many other occasions in the past two years, was outstanding. Instead of showing any kind of discomfort (I guess they still need to justify the costs of these Open Data programs to somebody), they started thinking about how they could improve AWS services to spare the user from running a query 30 billion times…