Privacy Leakage Beyond GDPR

Oleg Kunitsyn
noleaks.eu
Feb 29, 2020

This article was written in September 2019 for the Annual Privacy Forum organized by the European Union Agency for Cybersecurity. The paper was rejected by one of the three reviewers, in my view without objective grounds. I publish the study here regardless of how unpleasant the findings may be.

Abstract

The General Data Protection Regulation (GDPR) has applied since May 2018, helping European Union citizens obtain more control over their personal data on the Web. How well is their privacy protected today? This paper presents an experiment that estimates privacy leakage in the EU by evaluating unauthorized tracking among national domain zones. The results are disappointing: unauthorized tracking varies from 11% to 38%, depending on the country. This paper demonstrates how the ubiquitous embedding of external resources and services on websites leads to privacy leakage. The experiment covers the most popular websites in 27 national domain zones.

Introduction

HTTP Request and its Context

The Hypertext Transfer Protocol (HTTP) is a textual, stateless application-layer protocol for the Web. Web browsers use HTTP to request hypertext and supporting content from remote resources on the Internet. The semantics of an HTTP request include the request method, request headers and body [https://tools.ietf.org/html/rfc7231]. The Internet Protocol (IP) is the network-layer protocol of the Internet that underlies HTTP. Each device on the Internet is uniquely identified by a numeric IP address. Thus, the originating IP address is necessarily delivered to a Web server as part of the context of each HTTP request.

Usually, the originating IP is not the address of the user’s device that initiated the browsing session, but the public IP address nearest to the user, assigned by the Internet provider. The whole IP address space is managed globally by the Internet Assigned Numbers Authority and the Regional Internet Registries. The global nature of the allocation makes it possible to associate a public IP address with a geographic location. For instance, the GeoIP2 City Database is able to determine the country, city, and postal code associated with IPv4 or IPv6 addresses worldwide with a high degree of accuracy [https://www.maxmind.com/en/geoip2-city]. Additionally, an HTTP extension header, known as Forwarded or X-Forwarded-For, can disclose the originating IP address of the request in proxied Internet connections [https://tools.ietf.org/html/rfc7239].

Therefore, each HTTP request, even one not made for the purpose of identification, discloses at least the originating IP address and its approximate geographic location. Overlay networks for anonymous communication, e.g. Tor, are subject to this fundamental flaw, too, at the point where they border the Internet.
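As an illustration of how the originating address reaches a server, here is a minimal sketch of how a Web application might pick it from the request context. The function name `originating_ip` and the example addresses (taken from the documentation ranges) are assumptions made for this illustration, not part of the study. The X-Forwarded-For value is supplied by the client side and can be forged, so the sketch falls back to the address of the TCP peer when the header is missing or malformed.

```python
import ipaddress
from typing import Optional


def originating_ip(peer_ip: str, x_forwarded_for: Optional[str] = None) -> str:
    """Return the address a server would treat as the origin of a request.

    peer_ip is the address of the TCP peer (often a proxy or a router);
    x_forwarded_for is the raw X-Forwarded-For header value, if present.
    """
    if x_forwarded_for:
        # By convention the left-most entry is the address of the client
        # as seen by the first proxy in the chain.
        first_hop = x_forwarded_for.split(",")[0].strip()
        try:
            return str(ipaddress.ip_address(first_hop))
        except ValueError:
            pass  # malformed header: fall back to the TCP peer
    return str(ipaddress.ip_address(peer_ip))
```

Either way, the server learns at least one public address for every request, which is the point made above.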

Direct Unique Identifiers

HTTP headers, context and body are capable of transferring state across multiple requests. Properties of HTTP semantics such as tokens, timestamps and stored objects persist across browsing sessions, which is the core feature needed to identify a user on the Web. Current methods of storing a unique identifier in browsers are:

  • HTTP Cookie is a key-value pair set by the Web server and transmitted to the browser in the form of an HTTP response header [https://tools.ietf.org/html/rfc6265]. The cookie is the most transparent and popular method of identification on the Web.
  • HTTP Strict Transport Security (HSTS) is a policy, set by the Web server in the form of an HTTP response header, that protects encrypted HTTP transmission against downgrade to an unencrypted one [https://tools.ietf.org/html/rfc6797]. The max-age property of the HSTS header makes it possible to tokenize the browser with an identifier, also known as an HSTS Super Cookie, which persists in all privacy modes. Although each HSTS header is able to store just a single bit, several resources requested from different hosts are capable of setting enough HSTS bits to assemble a unique identifier. For instance, a set of 30 transparent pixels is sufficient to distinguish about 1 billion (2^30) browsers.
  • HTTP ETag is a token identifying a specific version of the resource, set by the Web server in the form of an HTTP response header [https://tools.ietf.org/html/rfc7232]. The browser determines whether the stored resource is still the same by sending an HTTP request with the If-None-Match header.
  • HTTP Last-Modified is a timestamp of the resource, set by the Web server in the form of an HTTP response header [https://tools.ietf.org/html/rfc7232]. The browser determines whether the stored resource is still the same by transmitting a conditional HTTP request with the If-Modified-Since or If-Unmodified-Since header.
  • TLS Channel ID is a token of the encrypted HTTP connection that is based on a cryptographic key pair, which is reused in subsequent TLS connections associated with the browser [https://tools.ietf.org/html/draft-balfanz-tls-channelid-01].
  • HTML Web Storage is a JavaScript API for storing data in the browser in the form of key-value pairs [https://www.w3.org/TR/webstorage/].
  • HTML IndexedDB is another JavaScript API for storing data in the form of key-value collections [https://www.w3.org/TR/IndexedDB/].
  • Adobe Flash Local Shared Objects are pieces of data that Flash Player stores in the browser. Adobe has announced the end of support for the technology, effective December 2020 [https://theblog.adobe.com/adobe-flash-update/].
  • Silverlight Isolated Storage is a virtual file system that Silverlight creates in the browser. Microsoft has announced the end of support for the technology, effective October 2021 [https://support.microsoft.com/en-us/help/4511036/silverlight-end-of-support].
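The arithmetic behind the HSTS item in the list above can be sketched as follows. Each of 30 hosts stores one bit: a host that has sent an HSTS header is later upgraded to HTTPS by the browser (bit 1), and a host that has not is requested over plain HTTP (bit 0). The host naming scheme `bitN.tracker.example` and both function names are hypothetical, chosen only for this illustration.

```python
from typing import Iterable, List


def hosts_to_pin(identifier: int, bits: int = 30,
                 domain: str = "tracker.example") -> List[str]:
    """List the hosts that must send an HSTS header to store the identifier."""
    if not 0 <= identifier < 2 ** bits:
        raise ValueError("identifier does not fit into the available bits")
    # Host number i carries bit i of the identifier.
    return [f"bit{i}.{domain}" for i in range(bits) if (identifier >> i) & 1]


def identifier_from(upgraded_hosts: Iterable[str], bits: int = 30,
                    domain: str = "tracker.example") -> int:
    """Rebuild the identifier from the hosts the browser upgraded to HTTPS."""
    upgraded = set(upgraded_hosts)
    return sum(1 << i for i in range(bits) if f"bit{i}.{domain}" in upgraded)
```

With 30 hosts the scheme distinguishes 2^30 = 1 073 741 824 browsers, which matches the figure of about 1 billion given above.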

In 2010, Samy Kamkar published Evercookie, a JavaScript application that combines several storing methods to generate an identifier that is intentionally difficult to erase [https://samy.pl/evercookie/].

Embedded Resources

In terms of HTTP, an origin is the combination of the URI scheme, host name, and port number of a URL. The Same-Origin Policy (SOP) is a security restriction that prevents JavaScript on one webpage from obtaining access to another. Browsers complement this policy with the cross-origin resource sharing (CORS) mechanism [https://www.w3.org/TR/cors/], which allows restricted resources to be requested from outside same-origin locations. Currently, SOP does not apply to HTML tags, and webmasters are free to embed any external resources on a webpage.
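The definition of an origin can be made concrete with a short sketch that uses only the Python standard library. The helper names `origin` and `same_origin` are illustrative, and the default ports are filled in only for the http and https schemes.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that defines the origin of a URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # An omitted port means the default port of the scheme.
    port = parts.port or DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a: str, url_b: str) -> bool:
    """Two URLs share an origin only if scheme, host and port all match."""
    return origin(url_a) == origin(url_b)
```

For example, `https://www.google.com/` and `https://ssl.gstatic.com/` are different origins, and so are the http and https variants of the same host.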

Three of the top ten OWASP security risks relate to embedded resources [https://www.owasp.org/index.php/Top_10-2017_Top_10]:

  • A1 Injection, which is able to deliver malicious code to the browser as part of legitimate code.
  • A7 Cross-Site Scripting (XSS), which exploits improper validation of user-supplied data to execute malicious code.
  • A9 Using Components with Known Vulnerabilities, which broadens the attack surface for malicious code.

Since 2012, webmasters have been able to specify allowed and restricted origins to mitigate OWASP A7 risks by using Content-Security-Policy (CSP) [https://www.w3.org/TR/CSP/]. CSP is an HTTP response header, set by the Web server, that controls which resources the browser is allowed to embed. Nowadays CSP is supported by all major Web browsers.
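As a rough sketch of what such a policy looks like, the snippet below assembles the value of a Content-Security-Policy header from a mapping of directives to allowed sources. The helper `build_csp` and the example policy are assumptions for illustration; a real policy depends on the resources a site actually needs.

```python
from typing import Dict, List


def build_csp(policy: Dict[str, List[str]]) -> str:
    """Join directives and their source lists into a CSP header value."""
    return "; ".join(
        f"{directive} {' '.join(sources)}" for directive, sources in policy.items()
    )


# Hypothetical policy: everything from the site itself, scripts additionally
# from one explicitly named external host.
EXAMPLE_POLICY = {
    "default-src": ["'self'"],
    "script-src": ["'self'", "https://ssl.gstatic.com"],
    "img-src": ["'self'"],
}

header_value = build_csp(EXAMPLE_POLICY)
```

A browser that receives this value in the Content-Security-Policy response header refuses to load scripts or images from any origin that is not listed.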

By May 2018, Microsoft Edge, Mozilla Firefox, Apple Safari, Google Chrome and Opera had adopted a new security feature that mitigates OWASP A1 risks: Subresource Integrity (SRI) [https://www.w3.org/TR/SRI/]. Webmasters can specify a cryptographic hash of each resource in addition to its location. SRI enables browsers to verify that embedded resources were delivered without modifications. The resource is discarded if the hashes don’t match.
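The hash comparison behind SRI can be sketched with the Python standard library. The value of the integrity attribute has the form of an algorithm name, a hyphen, and the base64-encoded digest of the resource. The function names below are illustrative, and the sketch assumes a single hash per integrity value.

```python
import base64
import hashlib
import hmac

SUPPORTED = ("sha256", "sha384", "sha512")


def sri_hash(content: bytes, algorithm: str = "sha384") -> str:
    """Compute an integrity value such as 'sha384-...' for a resource."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, content).digest()
    return algorithm + "-" + base64.b64encode(digest).decode("ascii")


def matches(content: bytes, integrity: str) -> bool:
    """Check a delivered resource against the declared integrity value."""
    algorithm = integrity.partition("-")[0]
    if algorithm not in SUPPORTED:
        return False
    return hmac.compare_digest(sri_hash(content, algorithm), integrity)
```

A webmaster would publish the output of `sri_hash` in the integrity attribute of a script or link tag; the browser then performs the equivalent of `matches` before executing the resource.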

Google Safe Browsing (GSB), also known as Phishing and Malware Protection, is a blacklist service released in December 2005 [https://developers.google.com/safe-browsing/] that mitigates OWASP A9 risks. Once it is activated, the browser checks each URL against the GSB list, which contains resources with potential malware or phishing content. A resource found in the blacklist is considered unsafe and is escalated to the user for blocking. In September 2019, the Google Transparency Report listed more than 1.5 million websites deemed malicious [https://transparencyreport.google.com/safe-browsing/overview].

User-Defined Resource Controls

In 2005, Apple introduced the first Private Browsing mode in Safari [https://arxiv.org/ftp/arxiv/papers/1802/1802.10523.pdf]. Private browsing is a feature that creates an isolated temporary browsing session. Browsing history and stored data associated with the session are cleared when the session is closed. In 2008, Google Chrome followed with its own private browsing mode, and in 2009, so did Microsoft Internet Explorer and Mozilla Firefox.

In 2009, additional controls over identification and tracking were proposed by Christopher Soghoian [http://paranoia.dubfire.net/2011/01/history-of-do-not-track-header.html]. Do Not Track (DNT) is a browser feature that expresses a user’s preference regarding tracking. Once it is enabled, the browser includes the DNT header in all HTTP requests [https://www.w3.org/TR/tracking-dnt/]. However, DNT is neither a legal nor a technological requirement, and webmasters may either honour or ignore the header.
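For a webmaster who chooses to honour the header, the server-side check is small. The sketch below assumes the request headers are available as a plain dictionary; the function name `tracking_allowed` is hypothetical. A DNT value of 1 means the user prefers not to be tracked, while a value of 0 or a missing header does not express that preference.

```python
from typing import Dict


def tracking_allowed(headers: Dict[str, str]) -> bool:
    """Return False when the request carries 'DNT: 1', True otherwise."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("dnt") != "1"
```

Nothing forces a site to run such a check, which is why the header remains a request rather than a control.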

Furthermore, browsers have exposed an extension API that allows independent developers to enhance their functionality. An extension, also known as an add-on, is an application written in JavaScript, CSS and HTML that a user may install from the vendor’s repository or locally. Since extensions gain full access to the browsing history and webpage content, they are able to block unwanted requests, and users are increasingly turning to blockers to avoid ads and tracking on webpages, which are often perceived as annoying or as an invasion of privacy [https://research.mozilla.org/files/2018/04/The-Effect-of-Ad-Blocking-on-User-Engagement-with-the-Web.pdf].

Methodology

In this work, I evaluate the correlation between external resources and unauthorized tracking among the most popular websites in the national domain zones of the European Union.

Definition

According to the GDPR [https://eur-lex.europa.eu/eli/reg/2016/679/oj], personal data is any information relating to an identifiable person, who can be identified by reference to location data or an online identifier. Processing of personal data covers any operation, including disclosure by transmission and dissemination. Processing of personal data is lawful only if the person has given prior consent by a clear affirmative action.

Hereinafter, privacy leakage is an external (by a third-party) unauthorized (without prior consent) processing (at least, disclosure) of personal data (at least, online identifier).

Scope

In this work, the national domain zones of the European Union are: AT, BE, BG, CY, CZ, DE, DK, EE, ES, FI, FR, GR, HR, HU, IE, IT, LT, LU, LV, MT, NL, PL, PT, RO, SE, SI, SK. The most popular websites are provided by the Research-Oriented Top Sites Ranking Hardened Against Manipulation [https://tranco-list.eu] and are limited to the first 200 domains in each zone, whenever available.

Tracking is the practice of gathering information about a person. Each tracker requires a persistent unique identifier to assemble a behavioural dataset. The required identifier can be produced by the direct methods above or statistically [https://amiunique.org/]. Thus, external tracking is an ideal scope for measuring privacy leakage on the Web.

Step 1

Detection of embedded external resources is done by the open-source NoLeaks extension for Google Chrome, Mozilla Firefox and Opera [https://github.com/noleakseu/extension]. Once installed, NoLeaks examines the origin of each HTTP request on the webpage. All external resources found are reported to the NoLeaks server in the form of an anonymous affiliation graph: origin, e.g. www.google.com; external resource, e.g. ssl.gstatic.com; connection protocol, e.g. https; blocking status; and ID of the extension.
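One edge of such an affiliation graph could be modelled as shown below. This is a sketch under assumptions and not the code of the NoLeaks extension: the `Edge` type, the `report_edge` function, and the rule that a request is external whenever its host differs from the host of the page are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Edge:
    origin: str        # host of the visited page, e.g. www.google.com
    resource: str      # host of the external resource, e.g. ssl.gstatic.com
    protocol: str      # connection protocol, e.g. https
    blocked: bool      # blocking status of the request
    extension_id: str  # ID of the reporting extension


def report_edge(page_url: str, request_url: str, blocked: bool,
                extension_id: str) -> Optional[Edge]:
    """Return an edge for an external request, or None for a same-host one."""
    page, request = urlsplit(page_url), urlsplit(request_url)
    if not page.hostname or not request.hostname:
        return None
    if page.hostname == request.hostname:
        return None  # the resource belongs to the visited website
    return Edge(page.hostname, request.hostname, request.scheme,
                blocked, extension_id)
```

The record holds only host names, the protocol, the blocking status and the extension ID, which are the fields listed above.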

Step 2

The NoLeaks server examines each external resource against the tracking blacklist, a list of known trackers maintained by Mozilla [https://github.com/mozilla-services/shavar-prod-lists]. Because the trackers in the list are organized by their purpose, the server is able to tag each external resource with the name and the purpose of the tracker.
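The tagging step can be sketched as a lookup of the resource host in a blacklist grouped by purpose. The nested dictionary below is a simplified, hypothetical stand-in for the real list, with invented tracker names and domains; matching a host against a listed domain or any of its subdomains is likewise an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical blacklist: purpose -> tracker name -> domains.
BLACKLIST: Dict[str, Dict[str, List[str]]] = {
    "Advertising": {"ExampleAds": ["ads.example"]},
    "Analytics": {"ExampleMetrics": ["metrics.example", "stats.example"]},
}


def tag(host: str,
        blacklist: Dict[str, Dict[str, List[str]]] = BLACKLIST
        ) -> Optional[Tuple[str, str]]:
    """Return (tracker name, purpose) for a known tracker host, else None."""
    host = host.lower().rstrip(".")
    for purpose, trackers in blacklist.items():
        for name, domains in trackers.items():
            for domain in domains:
                # Match the listed domain itself or any of its subdomains.
                if host == domain or host.endswith("." + domain):
                    return (name, purpose)
    return None
```

Hosts that match no entry stay untagged and count as external resources without a known tracking purpose.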

Step 3

Unauthorized tracking is evaluated by emulating a user who interactively visits each website in the given scope.

Evaluation dataset, September 30, 2019.

The emulation is provided by the Selenium framework and WebDriver for Firefox v0.22.0 [https://github.com/mozilla/geckodriver], with the browser in its default privacy mode and the NoLeaks extension installed. All findings are reported, qualified and tagged as stated in Step 1 and Step 2. Tracking purposes other than social, fingerprinting, advertising and analytics are excluded from the datasets to conform to the goal. Thus, the automated emulation produced two datasets, grouped by national domain zones: external resources and unauthorized tracking [table]. The Pearson correlation coefficient (1) between external resources and unauthorized tracking equals 0.882.

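$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad (1) $$

Here x_i and y_i denote the external resources and the unauthorized tracking measured in national domain zone i, \bar{x} and \bar{y} are their means, and n is the number of zones.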
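The Pearson correlation coefficient used above can be computed with a few lines of standard-library Python. The function below follows the textbook definition. The numbers in the usage line are made-up toy values and not the evaluation dataset of this study.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / (spread_x * spread_y)


# Toy values only, one pair per hypothetical domain zone.
r = pearson([1, 2, 3, 4], [1, 3, 2, 4])
```

Values close to 1 indicate that zones with more external resources also show more unauthorized tracking.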

Social: a tracker which allows a social networking service to reach the user’s browsing session.
Fingerprinting: a tracker which abuses browser or device features in unintended ways to identify a user.
Advertising: a tracker which also displays ads or marketing offers.
Analytics: a tracker which builds a detailed profile based on online activity.

Conclusion

I find that unauthorized tracking in the national domain zones of the European Union varies from 11% to 38% [table]. Although the website owner and the external tracker act on a webpage as joint data controllers, GDPR enforcement has no effect. Continuous monitoring demonstrates that 68% of external resources on the websites are used for tracking [http://noleaks.eu/reports.html].

Modern Web engineering practices recommend modular architectures and cloud services that help decrease expenditures and deliver digital products quickly. Like DNT, the implementation of the CSP and SRI security layers is optional. A weak SOP and a lack of explicit user-defined runtime controls over content (such as a pop-up blocker) allow webmasters to freely embed any external resources on webpages.

According to the GDPR, online identifiers extend to any identifiers provided by devices, applications, tools and protocols. In contrast to a public IP address, which belongs to a router, a tracking identifier points at a browser and, in the case of a personal device, at a person. Pseudonymisation and anonymization of gathered data help data processors comply with the regulations, but do not substitute for the user’s consent to identification.

In this paper, I present an experiment that estimates privacy leakage in the European Union by evaluating unauthorized tracking among national domain zones. I find a strong positive correlation between external resources and unauthorized tracking. This empirical evidence supports the standpoint that the ubiquitous embedding of external resources and services on websites leads to privacy leakage.

Related Work

In 2017, Macbeth (Cliqz GmbH) published the Ghostery study of online tracking [https://www.ghostery.com/study/]. The study covered 21 million page visits to over 350 000 websites. Similar to NoLeaks, Ghostery analyzed a global tracking dataset gathered through the browser extension from 200 000 German users. Ghostery detected at least one tracker on around 77% of the tested webpages, before the GDPR era.

In March 2019, Cookiebot reported third-party ad tracking on 89% of government websites in the European Union [https://www.cookiebot.com/media/1121/cookiebot-report-2019-medium-size.pdf]. Cookiebot is a scanning service that aims to enable full cookie compliance on websites according to the GDPR.

PrivacyScore is another scanning service that benchmarks a wide range of potential privacy and security issues on websites [https://privacyscore.org]. PrivacyScore is the result of cooperation among many contributors from six German universities and was announced at the ENISA Annual Privacy Forum in June 2017.

Since April 2019, the European Data Protection Supervisor has maintained the Website Evidence Collector (WEC), which aims to automate privacy and personal data protection inspections [https://joinup.ec.europa.eu/solution/website-evidence-collector/about]. WEC is a command-line tool that collects evidence of personal data processing, such as cookies and requests to third-party websites, structured in a human- and machine-readable format.

The international community of browser extension developers has created a range of tools that help users control external resources on webpages. A predecessor of NoLeaks, Privacy Badger is an open-source browser extension authored by the Electronic Frontier Foundation [https://github.com/EFForg/privacybadger]. The extension automatically blocks invisible trackers and has inspired privacy researchers and software engineers since May 2014.

Disconnect is another browser extension that visualizes and blocks third-party trackers on websites [https://disconnect.me/disconnect]. The extension is trusted by over 50 million people worldwide. The blocker relies on a blacklist maintained by the community.

In September 2019, Mozilla released Firefox 69.0, which enables content blocking by default [https://blog.mozilla.org/blog/2019/09/03/todays-firefox-blocks-third-party-tracking-cookies-and-cryptomining-by-default/]. In Strict Mode, the browser blocks known tracking, cryptomining and fingerprinting resources, as well as third-party cookies. The content blocker is built on top of the Disconnect blacklist.

In contrast to community blacklists, Iqbal et al. presented a graph-based approach for detecting advertising and tracking resources on the Web [https://arxiv.org/pdf/1805.09155.pdf]. The tool analysed HTTP requests in combination with surrounding context. The evaluation on the Alexa top 10K websites replicated the human-generated blacklists with 95% accuracy.

Wu et al. presented another approach for detecting external Web trackers with high accuracy [Wu, Q., Liu, Q., Zhang, Y., Liu, P., & Wen, G. (2016). A machine learning approach for detecting third-party trackers on the web. In S. Katsikas, C. Meadows, I. Askoxylakis, & S. Ioannidis (Eds.), Computer Security - 21st European Symposium on Research in Computer Security, ESORICS 2016, Proceedings (pp. 238–258). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 9878 LNCS). Springer Verlag.]. Using structural hole theory to preserve first-party trackers, the authors detected third-party trackers with supervised machine learning. 98% of the trackers from the Ghostery blacklist were correctly replicated, and 35 previously unrevealed trackers were found.

I assume that bridging empirical and machine methods can significantly improve protection against existing and upcoming types of unauthorized tracking on the Web.

Acknowledgements

Although this work is independent, I am grateful for the feedback from the European Union Agency for Cybersecurity and the European Data Protection Supervisor. The institutions play a critical role in the field of online privacy.
