Freedom to Share: How Government’s Data Sharing Policies Concerning Publicly Available Data Impact Academic Research and Journalism in the Public Interest

Emine Ozge YILDIRIM
Creative Commons: We Like to Share
Jan 11, 2023

Our societies rely on quality academic and scientific research and journalism to thrive and prosper. They also depend on them to find solutions to address common challenges, including the climate emergency, global public health crises such as the COVID-19 pandemic, and ongoing armed conflicts and humanitarian crises. A necessary condition for public-interest research and journalism is access to data unencumbered by unnecessary legal barriers. And while the value of research and journalism is widely recognized as a principle and an important pillar of free and democratic societies, several recent legal developments have tended to undermine such principles and raise barriers to access to data, including publicly available data. Examples include copyright law (including sui generis database rights) as well as privacy and data protection laws. Acknowledging the critical need to protect privacy as a fundamental right, we nonetheless recall the need to uphold equally important rights of access to information, unconstrained research, and media freedom, in a fair and balanced manner, guided by the public interest. Commemorating the ten-year anniversary of the tragic death of data access activist and digital rights champion Aaron Swartz, this article examines the policy and legal landscapes in the European Union and in the United States, two jurisdictions with rapidly shifting norms and expectations regarding access to data. It offers recommendations to strike a necessary balance between privacy and data protection on the one hand, and research and journalism freedoms on the other, as a means to support better sharing of data in the public interest. It also calls for more examples from other jurisdictions, so as to enable a more complete understanding of the issue and a more global approach to supporting researchers and journalists on the international level.

European Union

In the European Union (EU), the value of using publicly available data for research purposes has been increasingly recognized under some current legislative and policy efforts, including a number of copyright exceptions allowing researchers and journalists to conduct text and data mining activities.[1] Nevertheless, these exceptions are limited, and they have a tendency to clash with pre-existing regimes such as data protection laws and copyright laws, including Sui Generis Database Right (SGDR) protection.

Data Protection

Where the extracted publicly available data constitutes personal data,[2] utilizing such data would be scrutinized under the General Data Protection Regulation (GDPR) regime.

Where personal data have not been obtained from the data subject, e.g. when a researcher gathers data from a publicly available website to which the data subject has submitted information, Art. 14 of the GDPR applies. Art. 14 imposes on data controllers the obligation to provide information to the data subject. However, along with Art. 14(5), Recital 62 provides an exception to the obligation to provide information “where the provision of information to the data subject proves to be impossible or would involve a disproportionate effort… in particular… where processing is carried out for archiving purposes in the public interest, scientific or historical research purposes or statistical purposes.” It is important to note that this is not a blanket exemption for research use of personal data. For example, in the Bisnode case, the Polish Data Protection Authority (DPA) fined a company under Art. 14 for scraping data from publicly available resources. The DPA rejected the company’s defense based on the exception of Art. 14(5), finding that the obligation to provide information did not require a disproportionate effort.[3]

Another relevant provision is Art. 85 of the GDPR, along with Recital 153, which allows exceptions for the processing of personal data solely for journalistic purposes or for academic, artistic, or literary expression. Although the exceptions seem very broad on paper, researchers or journalists are not entirely safeguarded from possible fines. For instance, the Belgian DPA has fined a researcher and the researcher’s organization, the NGO EU DisinfoLab, for publishing raw data, for politically profiling the authors of tweets, and for not conducting a Data Protection Impact Assessment in a disinformation analysis study on the possible political origin of tweets circulating about the Benalla affair in France. Additionally, when personal data contains sensitive data, which is a special category of personal data, processing requires the consent of the data subject. However, that data could still be processed on the basis of “legitimate interest” if it relates to “personal data which are manifestly made public by the data subject” (see Articles 6, 7, and 9 of the GDPR). It should also be noted that these exceptions do not remove the obligations and general principles imposed under the GDPR regarding any processing activity, such as data minimization and purpose limitation (see Art. 5 of the GDPR).[4]

In sum, publicly available data containing personal data, regardless of whether they were shared by users voluntarily, fall within the scope of the GDPR. Despite the GDPR providing some exceptions for researchers and journalists, there are also several restraints and obligations to be followed in order for researchers and journalists to conduct their activities.

Copyright and Database Right Protection

Pre-existing Regimes

If a researcher or a journalist is lucky enough to pass GDPR scrutiny, another crucial legal regime needs to be complied with: EU copyright and database protection regulations. In line with Art. 2(5) of the Berne Convention for the Protection of Literary and Artistic Works,[5] the Information Society (InfoSoc) Directive only extends copyright protection to works that are original. Content that is not original in and of itself, like individual words, can still attract copyright protection as a sequence or combination, as long as that arrangement meets the originality requirement.[6] In order for a database to be afforded such protection, the author’s own intellectual creation must be shown in the selection and arrangement. Thus, merely organizing a database in a non-original way, as is likely often the case with scientific databases organized using standard technical methods, would not meet the originality requirements necessary to warrant copyright protection.

Furthermore, one should also be aware of the existence of related (also called neighboring) rights, for which the requirement of originality does not apply. In some cases, copyright and related rights protection can overlap and create layers of protection over the same material. This is the case with the press publishers’ right, introduced under Article 15 of the Directive on Copyright in the Digital Single Market (CDSM), where journalistic publications can be the subject both of copyright protection and of a separate protection for ‘press publications’ that is unbound by the concept of originality.

Additionally, regarding the SGDR, the EU, along with a few other jurisdictions, recognizes a sui generis right that extends to qualifying databases that might not be otherwise protected under copyright due to not meeting the originality requirement. Art. 7 of the Database Directive states, “the maker of a database that shows that there has been qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents” will enjoy this protection. For instance, in Ryanair v. PR Aviation, a case concerning ‘screen scraping’, the court analyzed the issue at hand in terms of both copyright and database protection requirements. The Court concluded that Ryanair could not enjoy either, as computer-generated airline schedules meet neither copyright’s originality threshold nor the substantial investment requirement under SGDR protection.

On the EU level, there are several legacy copyright exceptions[7] that can be potentially used for the purposes of text and data mining (TDM). However, these pre-existing exceptions only cover some aspects of TDM for scientific purposes — where acts of reproduction are transient or incidental and have no independent economic significance, when the activity is performed for research and non-commercial purposes, or when it is effected by physical persons for personal use. Furthermore, the Database Directive provides a special exception allowing for the use and re-use of databases or a substantial part of the database’s content, for the purposes of scientific research.

Recent and Upcoming Legal and Policy Developments

In 2019, recognizing the need for a more consistent approach, the CDSM introduced two provisions dedicated to TDM.

Art. 3 of the CDSM provides a mandatory exception imposed on member states allowing research organizations and cultural heritage institutions to make reproductions and extractions, in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access. The exception under Art. 3 cannot be overridden by contract or by Technical Protection Measures (TPMs). Nevertheless, Art. 2 of the CDSM defines research organizations so narrowly that it could be read almost as a ‘non-commercial purposes’ clause, as SMEs, individual researchers, think tanks, and journalists would fall outside the scope of this exception. It is also worth mentioning that the possibility to include public-private partnerships amongst the beneficiaries of the exception is mentioned in Recital 11 of the Directive.

Additionally, Art. 4 of the CDSM brings an exception concerning both commercial and non-commercial uses by any users, but the exception can be withdrawn by an express reservation by rightsholders.[8][9] Therefore, Art. 4 is subject to ‘contracting-out’ by rightsholders, also allowing platforms’ terms of service to create barriers to data extraction or any other text-and-data mining activities. Under Recital 18 of the CDSM, the rightsholders should reserve the rights to make reproductions and extractions for text and data mining “in an appropriate manner.” In the case of content that has been made publicly available online, it should only be considered appropriate to reserve those rights by the use of machine-readable means, including metadata and terms and conditions of a website or a service. Notwithstanding the reservation regime, the application of this exception under Art. 4 is safeguarded against override by TPMs.

Both exceptions (Articles 3 and 4 of the CDSM) contain a requirement for the beneficiary to have “lawful access” to the respective materials as a prerequisite of the permitted use. The related concepts of “lawful use”[10] and “lawful source”[11] in the EU acquis are very complicated. For a use under an exception to be lawful, they require that the subject matter was made available with the consent of the rightsholder. Although there is no express legal definition of “lawful access” in the CDSM Directive, according to Recital 14 “lawful access should also cover access to content that is freely available online.”

Finally, the long-awaited Data Act Proposal was published in February 2022 with the objectives of harmonizing rules on fair access to and use of data and clarifying who can create value from data and under which conditions. (Also see CC’s Data Act Roundtable held in June 2022.) Art. 35 of the proposed Data Act, along with Recital 84, states that in order not to hinder the access rights set out in Art. 4 and Art. 5,[12] the SGDR set forth in Art. 7 of the Database Directive “does not apply to databases containing data obtained from or generated by the use of a product or a related service”. Nevertheless, the proposal does not contain any further clarification regarding the SGDR and the limitations imposed on text and data mining under the Database Directive. The EU seems to be missing an opportunity here, as it was expected that the ongoing controversy regarding the SGDR would be settled with the new data strategy. Overlooking this opportunity seems to be at odds with the objectives of the proposed Data Act as, according to the European Commission, the SGDR has proven not to contribute to the development and competitiveness of the EU database industry. As Margoni and Kretschmer point out, such database protection has also been rejected by many jurisdictions due to its “anti-competitive and anti-information effect.” While the clarification in Art. 35 of the Data Act proposal was welcomed by many, a proper review of the SGDR, especially concerning research and journalistic activities, should have been undertaken. However, it is not too late to take action, as there is still some hope that policymakers will consider such clarification in the near future, either in the Data Act or in other upcoming legislation.

To conclude, although the EU seems to be moving forward in a direction where the value of data is recognized as essential for innovation and academic research, there are still substantial barriers to access and utilize data that could potentially chill research and journalistic activities. Thus, a harmonized approach, both in the EU and at the member state level, on access to data and providing legal certainty for research and journalism for the public interest is critical.

United States

In the United States (U.S.), research using publicly available data is subject to restriction by several different sources of law, but sometimes benefits from legal safeguards as well.

Perhaps the most chilling source of potential liability in the U.S. is the Computer Fraud and Abuse Act (CFAA), which provides for both civil and criminal liability for unauthorized access to networked computers. Operators of internet platforms have argued, sometimes successfully, that the CFAA prohibits access even to publicly accessible information if that access violates a platform’s terms of service or continues in the face of a cease and desist letter. But the trend in recent cases in the United States has been toward a more limited reading of the CFAA, to prohibit access only to information that is behind a technological gate (e.g. password protection), not to publicly available information that is being used in a way the platform owner does not like. In Van Buren v. United States, 141 S.Ct. 1648, 1658 (2021), for example, the Supreme Court endorsed a “gates-up-or-down inquiry” under which authorization turns on whether someone “can or cannot access a computer system.” As Professor Orin Kerr has pointed out, the Supreme Court did not make it entirely clear what counts as a “gate.” But mere admonitions posted on publicly available websites are probably not enough. The Court of Appeals for the Ninth Circuit has observed that the Van Buren case “reinforces our conclusion that the concept of ‘without authorization’ does not apply to public websites,” hiQ Labs, Inc. v. LinkedIn Corporation, 31 F.4th 1180, 1199 (9th Cir. 2022), explaining that “giving companies like LinkedIn free rein to decide, on any basis, who can collect and use data — data that the companies do not own, that they otherwise make publicly available to viewers, and that the companies themselves collect and use — risks the possible creation of information monopolies that would disserve the public interest.” Id. at 1202.

Another source of potential liability under U.S. law for research access to publicly available data is the Copyright Act. Although extracting and reusing facts from copyrighted sources is not itself prohibited by copyright law, the activities involved in what is often referred to as text and data mining (TDM) can involve making and distributing copies of the underlying works themselves, thus raising the potential for copyright liability. Here too, the trend in U.S. law has been toward recognizing the value of research access to publicly available data and considering TDM activities to be fair use. As documented by Professor Matthew Sag, a series of fair use decisions in the United States has permitted copying that is necessary to use the information embedded in the copied works for “non-expressive purposes,” that is, not to supplant the works themselves but to generate insights about the works. Furthermore, as Professor (and CC Advisory Council member) Michael Carroll argues, transitory copies made in the process of TDM may not implicate copyright owners’ exclusive rights at all.

There remains uncertainty under U.S. law — about the interpretation of the CFAA and the Copyright Act (including provisions of the Copyright Act prohibiting circumvention of technological protection measures (TPMs)) and also about state law theories including breach of contract and trespass to chattels. Some observers worry that these uncertainties continue to chill scholarly and journalistic research using publicly available data, even under circumstances when such research raises no serious privacy or other public policy concerns. See, e.g., Thomas Kadri, Platforms as Blackacres.

In sum, the trend in the United States is toward recognition of the value of research uses of publicly available data and away from broad readings of potentially restrictive laws like the CFAA and the Copyright Act. But there remains a thicket of other potential causes of action that could chill research even under circumstances in which there are no legitimate concerns regarding, e.g., data subject privacy. A clearer affirmative legal recognition of the right to access data for research purposes could help to cut through this thicket and reduce uncertainty for researchers.

Our Recommendations

The mapping of the current legal landscape demonstrates that in both the EU and the U.S., researchers and journalists may benefit from a number of legal safeguards regarding the use of publicly available data. The trend also shows that the value of utilizing data for research purposes has started to be recognized by policymakers and the courts. Nevertheless, numerous legal barriers to the utilization of publicly available data remain in both jurisdictions. These barriers could expose researchers and journalists to sanctions or liabilities, with a negative impact on society as a whole. As mentioned, such restrictions and the lack of legal certainty could chill those public-interest activities and infringe upon the most essential fundamental rights, such as freedom to access information and freedom to share. We therefore offer the following recommendations to help give public-interest research and journalism unchilled access to data.

1. Citizens who are generating data should be empowered to control that data

Creative Commons has long stood for and facilitated authors’ autonomy over their creations, specifically the ability to generously license their works for subsequent use. This autonomy should be possible even when authors share their works via platforms owned by others. Similarly, individuals who share data via online platforms should not thereby cede all control over that data to platform owners. They should, for example, be able to freely share their data with third parties, as the Data Act partly recognizes.

2. Privacy concerns need to be taken into account in a balanced and proportionate way

The fundamental right to privacy and data protection should be balanced against the fundamental rights to freedom of expression, access to information, press freedom, and academic research free of constraints.[13] The same applies to journalism concerning access to information and press freedom since text and data mining has become one of the essential tools for gathering information. While the protection of the rights of others, such as privacy, could be deemed “necessary in a democratic society,” a proportionate balance must be established where academic research and journalism in the public interest could be granted an ‘exemption’ for their activities conducted in good faith. Thus, such an exemption should entail justifiable limits to access data, avoiding abuses of overbroad application of data protection laws that could hinder and chill research and journalistic activities.

3. In using citizen data, researchers and journalists who uphold a duty of due care should not face unnecessary barriers

Researchers and journalists should not be hindered from utilizing citizen data that are made publicly available by users themselves when they conduct their activities with due care. In line with the Guidance Note on potential misuse of research issued by the European Commission, such a duty of care could entail taking additional safety, security, and mitigation measures and adjusting the research design to avoid misuse of research that could potentially harm and have substantial negative impacts on humans, animals, or the environment. Thus, academic research and journalism for the public interest should not be curtailed unless it can be proven that reasonable steps under this duty of care were not taken. However, the duty of care requirement should be established under a standardized approach set forth by law, to avoid chilling research by leaving the door open for frivolous litigation.

4. Legal instruments should solidify the rights of researchers to use publicly available data, clarifying the affirmative right to conduct legitimate research free of constraint from any of the potentially applicable restrictions

Although several European legal instruments recognize the value of data-intensive research, piecemeal exceptions leave researchers vulnerable and confused, chilling the legitimate research such exceptions aim to promote. The Data Act represents an unprecedented opportunity to take on this challenge by building on and expanding its current limited exceptions. Although there are fewer layers of regulation in the United States, there is remaining uncertainty about the interaction of various state and federal legal regimes. As recently recognized in Science, an international instrument could provide a uniform baseline exception for research using publicly available data.

Call to action: Are you aware of any barriers to access to publicly available data for research or journalism? We want to hear from you! Contact info@creativecommons.org and tell us about the situation in your jurisdiction.

Emine Özge Yıldırım, CC Copyright Platform Working Group Lead on Digital Sharing Spaces

Molly Shaffer Van Houweling, Creative Commons Board Chair

Ana Lazarova, Creative Commons Bulgaria

Brigitte Vézina, Director of Policy, Open Culture and GLAM, Creative Commons

The Working Group on Digital Sharing Spaces would like to thank Dr. Luca Schirru and Koen Vranckaert from KU Leuven Center for IT and IP Law for their research support in the drafting of this position paper.

[1] See for example Article 3 CDSM.

[2] Under Article 4 of the GDPR, ‘personal data’ means any information that can be used to identify a person or is about an identifiable person or identified person.

[3] According to Zuzanna Gulczynska, who is a doctoral researcher at the University of Ghent, “This decision was later partially overturned by the Voivodeship Administrative Court because the DPA failed to check whether Bisnode was actually capable of receiving the up-to-date contact details of those data subjects who no longer conduct business activity. Despite this partial overruling, the core rationale of the judgment follows the assessment of the DPA. It is, therefore, safe to say that the right to information stemming from art.14 GDPR has been given a broad interpretation, putting the interests of data subjects before any possible commercial interests of data controllers.” Gulczynska, Z. (2020). Scraping personal data from internet pages: a comparative analysis of the Polish Bisnode decision and the US hiQ Labs v LinkedIn Corp judgment. European Law Review, 45(6), 857–869. Available at: https://biblio.ugent.be/publication/8684757.

[4] To be able to process any personal data contained within the mined data, one needs an appropriate legal basis. Consent must be free, specific, and informed, and must be as easy to withdraw as it is to give, which makes it very difficult to obtain from all data subjects on social platforms. It must also be noted that some personal data (such as biometric data, or data revealing one’s ethnic origin or political, religious, or philosophical views) can only be processed with the data subject’s prior consent or if the personal data has been manifestly made public. The latter is often the case on public platforms, which thus leaves many opportunities for text and data mining. Platforms requiring log-in may not be considered public. The question is currently pending before the European Court of Justice.

[5] Art. 2(5) of the Berne Convention states: “Collections of literary or artistic works such as encyclopaedias and anthologies which, by reason of the selection and arrangement of their contents, constitute intellectual creations shall be protected as such, without prejudice to the copyright in each of the works forming part of such collections.”

[6] In Infopaq I, the Court of Justice of the European Union (CJEU) ruled that “an act occurring during a data capture process, which consists of storing an extract of a protected work comprising 11 words and printing out that extract, is such as to come within the concept of reproduction” under Art. 2 of the InfoSoc Directive.

[7] Firstly, mining could fall under Art. 5.1.1. InfoSoc, when the activity implies temporary acts of reproduction, which are transient or incidental and an integral and essential part of a technological process and whose sole purpose is to enable a transmission in a network between third parties by an intermediary, or a lawful use, and which have no independent economic significance. This possible application of the Art. 5.1.1. InfoSoc exception is expressly mentioned in Recital 18 of the CDSM Directive. Secondly, in some cases Art. 5.3.a. InfoSoc can allow for mining when the activity is performed for research and non-commercial purposes. Thirdly, mining can be covered by Art. 5.2.b InfoSoc where it is effected by physical persons for personal use. This exception can possibly be combined with the application of Art. 5.3.n InfoSoc, which allows libraries to make protected subject matter available to individual members of the public for research or private study. Lastly, under Arts. 6.2.b and 9.b of the Database Directive, users can reproduce temporarily or permanently, translate, adapt, arrange, distribute and communicate, display or perform to the public, as well as extract or re-utilize a substantial part of the database’s content, for the purposes of illustration for teaching or scientific research.

[8] On the nature of the reservation in Art. 4, see Lazarova, A., Margoni, T., Matas, A., Pearson, S., Reda, F., Vézina, B., Walsh, K. and Wyber, S. (2021). ‘Creative Commons Statement on the Opt-Out Exception Regime / Rights Reservation Regime for Text and Data Mining under Article 4 of the EU Directive on Copyright in the Digital Single Market’. Available at: https://creativecommons.org/wp-content/uploads/2021/12/CC-Statement-on-the-TDM-Exception-Art-4-DSM-Final.pdf.

[9] Margoni, Thomas and Kretschmer, Martin, A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology (July 14, 2021). Available at SSRN: https://ssrn.com/abstract=3886695 or http://dx.doi.org/10.2139/ssrn.3886695.

[10] According to Recital 33 of the InfoSoc Directive, “A use should be considered lawful where it is authorised by the rightholder or not restricted by law.”

[11] The ‘lawful source’ concept was introduced by the CJEU. See Judgment of the Court (Second Chamber) of 26 April 2017 in the case C-527/15, Stichting Brein (Filmspeler) [2017] EU:C:2017:300, where the Court says that the use of hyperlinks to websites (that are freely accessible to the public) on which copyright-protected works have been made available without the consent of the right holders is unlawful. See also Judgment of the Court (Fourth Chamber) of 10 April 2014 in the case C‑435/12, ACI Adam BV and Others v Stichting de Thuiskopie, Stichting Onderhandelingen Thuiskopie vergoeding [2014] ECLI:EU:C:2014:254. In § 38 the Court says that “national legislation, such as that at issue in the main proceedings, which does not draw a distinction according to whether the source from which a reproduction for private use is made is lawful or unlawful, may infringe certain conditions laid down by Article 5(5) of Directive 2001/29.”

[12] The right of users to access and use IoT data or to share such data with third parties.

[13] Art. 13 of the EU Charter clearly recognizes a right to academic research free of constraints, which can be subject only to the limitations authorised by Art. 10 of the European Convention on Human Rights (ECHR). (See also Articles 1, 8, and 10 of the EU Charter.)
