Can Parquet file encryption make you a safer driver?
Eliot Salant (firstname.lastname@example.org), Gidon Gershinsky (email@example.com)
The research leading to these results has received funding from the European Community’s Horizon 2020 research and innovation programme under grant agreement n° 731678
The last ten years have seen a dramatic rise in the wireless transmission and use of automotive sensor data, commonly known as “telematics”. UPS, an early pioneer in telematics for delivery fleet management, collects over 1.25 billion telematics records per week and, through analysis of this data, is able to save nearly one million gallons of gasoline a year as well as improve delivery times (History and Evolution of Telematics, 2015).
With more and more new cars expected to be produced with integrated global telematics over the coming years, new business models are emerging to take advantage of this growing technology. Insurance companies, for example, now offer several forms of telematics insurance, such as Pay-As-You-Drive (PAYD) and Pay-How-You-Drive (PHYD), which use vehicle-monitoring parameters collected from their insured drivers. The telematics transmitting devices typically monitor GPS location, speed, acceleration and time of day, amongst other parameters. Insurance premiums can then be set based on a specific driver’s driving habits, such as:
· The time of day or night for trips
· The speed driven on different sorts of roads
· A history of sharp braking or acceleration
· Whether or not the driver takes breaks on long journeys
· Total highway miles
· Total mileage
· The total number of journeys typically made in a period of time.
Drivers are incentivized to drive safely through the offer of lower premiums and, in some cases, are given monthly safe-driving bonuses based on the telematics data collected and the observed driving patterns (What is Black Box Car Insurance, 2019).
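As a rough illustration of how habits like these could be derived from raw telematics records, the sketch below counts sharp-braking events and night-time samples for a single trip. The record fields (`timestamp`, `speed_kmh`), the braking threshold and the night-hour window are all assumptions made for this example; they are not taken from any insurer's actual scoring model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical thresholds, chosen only for illustration.
SHARP_BRAKE_KMH_PER_S = 12.0  # speed drop per second counted as sharp braking
NIGHT_START_HOUR = 22         # night window runs from 22:00 ...
NIGHT_END_HOUR = 6            # ... until 05:59


@dataclass
class Sample:
    timestamp: datetime
    speed_kmh: float


def is_night(ts: datetime) -> bool:
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def summarize(samples: list[Sample]) -> dict:
    """Count sharp-braking events and night-time samples in one trip."""
    sharp_brakes = 0
    for prev, cur in zip(samples, samples[1:]):
        dt = (cur.timestamp - prev.timestamp).total_seconds()
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        deceleration = (prev.speed_kmh - cur.speed_kmh) / dt
        if deceleration >= SHARP_BRAKE_KMH_PER_S:
            sharp_brakes += 1
    night_samples = sum(1 for s in samples if is_night(s.timestamp))
    return {
        "samples": len(samples),
        "sharp_brakes": sharp_brakes,
        "night_samples": night_samples,
    }


# A four-second trip at 23:00 containing one drop of 17 km/h within a second.
start = datetime(2019, 3, 1, 23, 0, 0)
speeds = [60.0, 62.0, 45.0, 44.0]
trip = [Sample(start + timedelta(seconds=i), v) for i, v in enumerate(speeds)]
summary = summarize(trip)
```

A real scoring pipeline would also need road type, speed limits and trip boundaries, none of which are modelled here.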
However, with great data comes great responsibility. Massive breaches of stored personal data have unfortunately become extremely prevalent. This year alone, a massive breach uploaded over 770 million unique emails and over 21 million decrypted passwords to the internet, and the data collectors who were compromised can be held legally responsible for damages. In the past, fines in the UK under the 1998 Data Protection Act for failing to adequately protect data were limited to £500,000; the new GDPR legislation raises the maximum penalty to €20 million (£17.6 million) or 4% of global annual turnover, whichever is higher. To illustrate, had current GDPR regulations been in effect at the time, the breach of three billion user accounts at Yahoo in 2013–2014 would have cost the company between $80 million and $160 million in fines (Marr, 2018).
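The "whichever is higher" rule is simple arithmetic, and a small sketch makes the crossover point visible. The turnover figures below are made-up inputs for illustration, not data about any real company.

```python
GDPR_FIXED_CAP_EUR = 20_000_000  # fixed ceiling of EUR 20 million
GDPR_TURNOVER_PCT = 4            # 4% of global annual turnover


def gdpr_max_fine(global_annual_turnover_eur: float) -> float:
    """Upper bound of the fine: the higher of the fixed cap and 4% of turnover."""
    turnover_based = GDPR_TURNOVER_PCT * global_annual_turnover_eur / 100
    return max(GDPR_FIXED_CAP_EUR, turnover_based)


# Hypothetical turnovers.
small = gdpr_max_fine(100_000_000)    # 4% is EUR 4 million, so the fixed cap applies
large = gdpr_max_fine(3_000_000_000)  # 4% is EUR 120 million, which exceeds the cap
```

The two bounds meet at a turnover of €500 million; above that, the percentage-based ceiling is the one that matters.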
The problem of being able to securely store data becomes even more acute when one considers that these data breaches were of private data centers — storing data on public clouds makes it even more exposed to attacks.
Building on IBM-led work on adding encryption to Apache Parquet files (See , , ) and our work on the European Union sponsored Horizon 2020 project, RestAssured (https://restassuredh2020.eu/), we have implemented a prototype showing how an end-to-end, cloud-based PHYD system can leverage the power of Apache Spark analytics while protecting the privacy rights of the data subjects.
The basic use-case scenario and implemented architecture can be seen in Figure 1. A secure hardware enclave, in our case AMD Secure Encrypted Virtualization (SEV) (represented by a dotted blue box in the figure), is used to protect data-in-use. Virtual machines brought up with SEV protection enabled are isolated both from one another and from the hypervisor through memory encryption, with one key (managed by the AMD Secure Processor) being used per virtual machine. We installed all software that needs protection for data-in-use, such as the individual RestAssured components, Parquet-encryption-enabled Spark, HashiCorp Vault, and a Kafka broker, inside SEV enclaves.
IBM has been leading the open source effort to introduce encryption to Apache Parquet files. Our prototype uses Spark 2.3.0 with a modified build of Parquet 1.8.2. Parquet encryption works at the column level, with a separate key per column, while still supporting column projection and predicate push-down to allow for highly efficient execution of big data queries even on encrypted columns. All encryption and decryption of stored data files is done on the Spark side, which means that neither unencrypted data nor sensitive data keys are exposed to the storage system or its administrators. Stored data is now safe, even on public clouds.
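The configuration interface of the modified Parquet 1.8.2 build used in the prototype is not shown in this post. As a point of reference only, later upstream Apache Parquet releases (1.12 and up, as shipped with Spark 3.2 and later) configure column encryption through properties such as `parquet.encryption.column.keys` and `parquet.encryption.footer.key`. The sketch below builds those property values from a key-to-columns mapping; the column names and key IDs are hypothetical.

```python
def column_keys_property(columns_by_key: dict[str, list[str]]) -> str:
    """Format a key-to-columns mapping as 'keyA:col1,col2;keyB:col3'."""
    seen = set()
    parts = []
    for key_id, columns in columns_by_key.items():
        if not columns:
            raise ValueError(f"key {key_id!r} has no columns assigned")
        for column in columns:
            if column in seen:
                raise ValueError(f"column {column!r} is assigned to more than one key")
            seen.add(column)
        parts.append(f"{key_id}:{','.join(columns)}")
    return ";".join(parts)


def encryption_options(columns_by_key: dict[str, list[str]], footer_key_id: str) -> dict[str, str]:
    """Collect the writer options that select which columns get which key."""
    return {
        "parquet.encryption.column.keys": column_keys_property(columns_by_key),
        "parquet.encryption.footer.key": footer_key_id,
    }


# Hypothetical telematics columns and key IDs.
options = encryption_options(
    {"key_location": ["latitude", "longitude"], "key_speed": ["speed_kmh"]},
    footer_key_id="key_footer",
)
```

With a recent Spark release, each entry could be passed to `df.write.option(name, value)` before calling `.parquet(path)`, with a crypto factory class and a KMS client class set in the Hadoop configuration. Columns not listed in the mapping are left unencrypted.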
In the implementation shown, HashiCorp Vault is used to manage all key secrets. Additionally, Vault offers user authentication utilizing trusted sources of identity, such as Active Directory, LDAP and Kubernetes.
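To give a feel for what reading a key from Vault involves, here is a minimal sketch against Vault's HTTP API. It assumes the KV version 2 secrets engine, whose read endpoint is `/v1/<mount>/data/<path>` and whose response nests the secret under `data.data`. The mount name, secret path, field name `key`, base64 encoding and token are all hypothetical; the post does not describe how the prototype lays out its keys.

```python
import base64
import json
import urllib.request


def build_read_request(vault_addr: str, token: str, secret_path: str,
                       mount: str = "secret") -> urllib.request.Request:
    """Build the GET request for a KV v2 secret (assumed engine and mount)."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/data/{secret_path.strip('/')}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def extract_key(response_body: str, field: str = "key") -> bytes:
    """Pull a base64-encoded key out of a KV v2 read response."""
    payload = json.loads(response_body)
    return base64.b64decode(payload["data"]["data"][field])


def fetch_key(vault_addr: str, token: str, secret_path: str) -> bytes:
    """Perform the network call; requires a reachable Vault server."""
    request = build_read_request(vault_addr, token, secret_path)
    with urllib.request.urlopen(request) as response:
        return extract_key(response.read().decode("utf-8"))


# Offline demonstration with a hand-written response body.
request = build_read_request(
    "https://vault.example.internal:8200", "hypothetical-token", "parquet/keys/key_speed"
)
sample_body = json.dumps({
    "data": {
        "data": {"key": base64.b64encode(b"0123456789abcdef").decode("ascii")},
        "metadata": {"version": 1},
    }
})
key_bytes = extract_key(sample_body)
```

In practice the token would come from one of Vault's authentication methods rather than being hard-coded, and the key bytes should never be logged or written to disk.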
We emulate the generation of telematics data from a driven car through a simulator which sends a real-time flow of synthetic data to a Kafka message queue using a secure communication protocol such as TLS.
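A simulator of this kind can be quite small. The sketch below generates a reproducible stream of synthetic records and hands each one, JSON-encoded, to a `send` callable. The field names, starting coordinates and speed model are invented for illustration, and the Kafka client is deliberately left out so the example runs with the standard library alone.

```python
import json
import random
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable, Iterator


def synthetic_records(vehicle_id: str, start: datetime, count: int,
                      seed: int = 0) -> Iterator[dict]:
    """Yield one synthetic telematics record per second of simulated driving."""
    rng = random.Random(seed)              # seeded so runs are reproducible
    lat, lon, speed = 32.79, 34.99, 50.0   # arbitrary starting state
    for i in range(count):
        # Random walk on speed, clamped to a plausible range.
        speed = max(0.0, min(130.0, speed + rng.uniform(-5.0, 5.0)))
        lat += rng.uniform(-0.0005, 0.0005)
        lon += rng.uniform(-0.0005, 0.0005)
        yield {
            "vehicle_id": vehicle_id,
            "timestamp": (start + timedelta(seconds=i)).isoformat(),
            "latitude": round(lat, 6),
            "longitude": round(lon, 6),
            "speed_kmh": round(speed, 1),
        }


def publish(records: Iterable[dict], send: Callable[[bytes], None]) -> int:
    """Serialize each record as JSON and pass it to the transport callable."""
    sent = 0
    for record in records:
        send(json.dumps(record).encode("utf-8"))
        sent += 1
    return sent


# Stand-in transport: collect messages in a list instead of sending them.
outbox: list[bytes] = []
start = datetime(2019, 3, 1, 8, 0, tzinfo=timezone.utc)
sent = publish(synthetic_records("car-001", start, count=5), outbox.append)
```

In a deployment, `send` would wrap a Kafka producer configured for TLS. With the kafka-python client, for example, that means constructing `KafkaProducer` with `security_protocol="SSL"` and a CA file; which client the prototype used is not stated here.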
Using Spark Streaming, a Data Gateway component running in a SEV enclave receives the telematics data and writes it to encrypted Parquet files. Our current version of the Data Gateway is configured to specify not only a schema for the data, but also which columns should be encrypted. Encryption keys for the sensitive columns are pre-configured and stored in Vault, and our version of encrypted Parquet retrieves the keys from there for both encryption and decryption.
The telematics data has now been protected over the whole transmission-to-storage path — while in motion, data is secured by transmission protocols such as TLS; while in-use, data is encrypted by hardware enclaves; and the data at-rest is protected through Parquet encryption.
We have emulated the user side of insurance company applications, demonstrating how insured customers can opt in to the use of their data for telematics insurance, and how the insurance company can determine policy rates by running queries against customer data that respect the prescribed access rights.
In conformance with GDPR requirements, RestAssured has implemented Sticky Policies — data owner specified access rights which can be allocated on the individual parameter level, based on intent-of-use of the data.
All queries against the telematics data must be accompanied by an expression of the intended purpose-of-use of the data and are routed to the Query Gateway.
The Query Gateway will then pass the requests to the Sticky Policies Data Gatekeeper which will analyze the query, and either approve or disallow it for the given data owner. In fact, the Sticky Policies mechanism is much more powerful than this, and will allow for the selective return of records, and even portions of records for a multitude of users based on the specified access rights, but this — and the accompanying use case — would be the subject for another blog…
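The approve-or-disallow step can be sketched in a few lines. The example below is a deliberately simplified stand-in for the Sticky Policies Data Gatekeeper, not the RestAssured implementation: the class names, purpose strings and parameter names are all hypothetical, and it only covers the all-or-nothing decision, not the selective return of records.

```python
from dataclasses import dataclass, field


@dataclass
class StickyPolicy:
    owner_id: str
    # Maps a purpose-of-use to the parameters the data owner allows for it.
    allowed: dict[str, set[str]] = field(default_factory=dict)


@dataclass
class Query:
    owner_id: str
    purpose: str
    parameters: set[str]


def gatekeeper_decision(query: Query, policies: dict[str, StickyPolicy]) -> tuple[bool, str]:
    """Approve the query only if every requested parameter is allowed for its purpose."""
    policy = policies.get(query.owner_id)
    if policy is None:
        return False, "no policy on record for this data owner"
    permitted = policy.allowed.get(query.purpose)
    if permitted is None:
        return False, f"purpose {query.purpose!r} is not authorized"
    denied = query.parameters - permitted
    if denied:
        return False, f"parameters not permitted for this purpose: {sorted(denied)}"
    return True, "approved"


# A hypothetical driver who allows speed and time data for insurance pricing only.
policies = {
    "driver-42": StickyPolicy("driver-42", {"insurance_pricing": {"speed_kmh", "timestamp"}}),
}
pricing = gatekeeper_decision(Query("driver-42", "insurance_pricing", {"speed_kmh"}), policies)
marketing = gatekeeper_decision(Query("driver-42", "marketing", {"speed_kmh"}), policies)
location = gatekeeper_decision(
    Query("driver-42", "insurance_pricing", {"speed_kmh", "latitude"}), policies
)
```

Here the pricing query is approved, while the marketing query and the request for location data are both refused, each with a reason that can be reported back through the Query Gateway.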
A simple representation of how the price of an insurance policy could be set for a specific driver based on their driving habits is shown in Figure 2. In this illustration, the driver has authorized use of their telematics information for insurance pricing and, unfortunately, has a history of speeding, both during the day and at night. The application returns to the insurance agent a suggested policy price which reflects both of these unsafe driving habits.
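The pricing logic described above can be sketched as a base premium plus a surcharge for each unsafe habit. The base price and surcharge percentages are invented for this example and do not reflect the figures used in the prototype or by any insurer.

```python
# Hypothetical pricing parameters.
BASE_PREMIUM_EUR = 500
DAY_SPEEDING_SURCHARGE_PCT = 10
NIGHT_SPEEDING_SURCHARGE_PCT = 20


def suggested_premium(authorized: bool, speeds_by_day: bool, speeds_by_night: bool) -> float:
    """Return a suggested premium, adding one surcharge per observed habit."""
    if not authorized:
        # Without the driver's consent, telematics data must not drive the price.
        raise PermissionError("driver has not authorized telematics-based pricing")
    surcharge_pct = 0
    if speeds_by_day:
        surcharge_pct += DAY_SPEEDING_SURCHARGE_PCT
    if speeds_by_night:
        surcharge_pct += NIGHT_SPEEDING_SURCHARGE_PCT
    return BASE_PREMIUM_EUR * (100 + surcharge_pct) / 100


# The driver from the illustration: consent given, speeding by day and by night.
price = suggested_premium(authorized=True, speeds_by_day=True, speeds_by_night=True)
```

With these made-up numbers the two surcharges add up to 30%, so the suggested price is 650 instead of the base 500.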
History and Evolution of Telematics. (2015, August 6). Retrieved from Omni m2m: https://omnim2m.com/history-and-evolution-of-telematics/
Marr, B. (2018, June 11). GDPR: The Biggest Data Breaches And The Shocking Fines (That Would Have Been). Retrieved from Forbes: https://www.forbes.com/sites/bernardmarr/2018/06/11/gdpr-the-biggest-data-breaches-and-the-shocking-fines-that-would-have-been/#a5478406c109
What is Black Box Car Insurance. (2019). Retrieved from Insure The Box: https://www.insurethebox.com/telematics