Twitter’s Developer Policies for Researchers, Archivists, and Librarians

Justin Littman
Published in On Archivy
16 min read · Jan 7, 2019

I have long maintained that one of the most significant barriers to Twitter research and archiving is Twitter’s Developer Policies. This barrier takes the form not only of the restrictions contained in the policies, but also of the ambiguity of the documents themselves.

In addition to being poorly worded, the policies, as I read them, are written primarily for Twitter’s business partners. As such, it is unclear how or if Twitter intends them to apply to research and archiving.

In my work supporting research and archiving using the Twitter API, my approach is to try to suss out the spirit of the policies (assuming Twitter’s good intentions), and balance it against the best interests of Twitter, the societal value of the research and archiving, and the agency of the content creators. It is my experience that colleagues in the research, library, and archives community do the same, though as you will see below, the results can be quite varied.

In the rest of this post, I will unpack some key portions of Twitter’s Developer Policies that are relevant to research and archiving and offer my interpretation. In many cases, my interpretation is “I have no idea what this means.”

Agreeing to Twitter’s Developer Policies is required to get a developer account; a developer account is required to access the API. (Applying for a developer account recently switched from a pro forma registration to requiring an applicant to describe her application / use case for Twitter approval. However, we’ll leave that for another blog post.)

Just to be clear, I am not a lawyer and this is not a legal reading of the document. It is my recommendation that you be wary about sharing Twitter’s policies with your institution’s counsel, as their reading is likely to be problematic for your work.

Also, just to be clear, the ethical obligations and considerations of a social media researcher or archivist extend beyond any ethical considerations codified in Twitter policies.

Starting with the Developer Policy:

I.A.5. Remember, Twitter may suspend or revoke access to the Twitter API if we believe you are in violation of this Policy. Do not apply for or register additional API tokens if Twitter has suspended your account. Instead, contact us.

Losing access to the API is the primary penalty you will face for violating Twitter’s Developer Policies. This can be extremely frustrating because:

  • Based on complaints in the Twitter Developer Forum, it seems really difficult to get a non-automated response when you contact Twitter support.
  • If you are offering a service to a community (e.g., a university campus), this will disable your service for everyone.

Researchers and archivists are frequently concerned that they will face legal action if terms are violated; I’m not aware of Twitter suing anyone in academia or taking any similarly serious action. (If you know of any cases, let me know and I’ll add them here.)

While discussing the implications of breaching Twitter policies, I should point out that it doesn’t matter how we read the policies; all that matters is how Twitter reads its own policies. Thus, finding a “loophole” in the wording does not mean Twitter cannot take action. This is why I advocate a focus on the spirit of the policies, rather than a legalistic parsing of the wording.

I.B.3. Do not modify, translate or delete a portion of the Twitter Content.

I read this as applying when you are using the API to update content. (For archiving and research, we are typically only retrieving content.)

I.B.7. Do not (and do not allow others to) aggregate, cache, or store location data and other geographic information contained in the Twitter Content, except as part of a Tweet or Periscope Broadcast. Any use of location data or geographic information on a standalone basis is prohibited.

The meaning of this is unclear and is typically ignored for Twitter research with geolocation data.

I.C.1. Get the user’s express consent before you do any of the following:

b. Republish Twitter Content accessed by means other than via the Twitter API or other Twitter tools.

I’m not sure what this means. Regardless, where possible, it is recommended that researchers get the consent from a tweet’s author before republishing it in a scholarly publication. (See, for example, the guidance in this article.)

It’s worth pointing out that it is extremely difficult to contact users on Twitter without an existing relationship. If your account isn’t followed by a user, the best you can do is @ mention him. I worked with a research team that reached out to some Twitter users to participate in a study. The response rate: 0%.

It continues:

d. Store non-public Twitter Content such as Direct Messages or other private or confidential information.

e. Share or publish protected Twitter Content, private or confidential information.

For Twitter research and archiving, we don’t typically deal with non-public or protected Twitter content. With limited exceptions, this isn’t available from the API methods that are used to acquire tweets.

I.C.3. If Twitter Content is deleted, gains protected status, or is otherwise suspended, withheld, modified, or removed from the Twitter Service (including removal of location information), you will make all reasonable efforts to delete or modify such Twitter Content (as applicable) as soon as reasonably possible, and in any case within 24 hours after a request to do so by Twitter or by a Twitter user with regard to their Twitter Content, unless otherwise prohibited by applicable law or regulation, and with the express written permission of Twitter.

These terms provoke tremendous angst in researchers and archivists. I expect that this policy is written for Twitter’s business partners who have access to Enterprise API methods that provide notification when content is deleted or protected. Social media analytic platforms such as Crimson Hexagon do adhere to this policy. Those of us using the Basic API do not have access to a mechanism to deal with deleted tweets at scale, and even if we did, the technical infrastructure required to make use of it would be prohibitive.

The spirit of this term suggests that researchers and archivists have obligations to be mindful of deleted tweets; however, these terms do not spell out that obligation in a relevant way. I’m not aware of any non-commercial parties that collect Twitter data that actively purge deleted tweets from their datasets. As explained below, the prohibitions on sharing datasets of complete tweets offer a significant “right to be forgotten” for users deleting tweets.

I.C.4. If your Service will display Twitter Content to the public or to end users of your Service, and you do not use Twitter Kit or Twitter for Websites to do so, then you must use the Twitter API to retrieve the most current version of the Twitter Content for such display. If Twitter Content ceases to be available through the Twitter API, you may not display such Twitter Content and must remove it from non-display portions of your Service as soon as reasonably possible.

I take this term as not applying to datasets collected for research and archiving, as they are not typically displayed on a website. For those displaying tweets on a website, using oEmbed is an easy way to conform with this restriction, as it makes sure that the tweets are displayed according to Twitter rules and deleted tweets are not displayed. (This is the approach that I used for TweetSets to display sample tweets from a dataset.)
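As a rough illustration of what an oEmbed lookup involves, the sketch below only builds the request URL for a single tweet. The endpoint address and the tweet URL pattern are assumptions based on Twitter's public oEmbed service at the time of writing, and the screen name and tweet id are hypothetical placeholders. This is not necessarily how TweetSets does it.

```python
from urllib.parse import urlencode

# Assumption: Twitter's public oEmbed endpoint at the time of writing.
OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"


def oembed_request_url(screen_name, tweet_id):
    """Build the oEmbed request URL for one tweet.

    The response to this request (not fetched here) would contain the HTML
    snippet that lets Twitter render the tweet in the page.
    """
    # Assumption: the canonical tweet URL pattern.
    tweet_url = f"https://twitter.com/{screen_name}/status/{tweet_id}"
    return f"{OEMBED_ENDPOINT}?{urlencode({'url': tweet_url})}"


if __name__ == "__main__":
    # Hypothetical screen name and tweet id, for illustration only.
    print(oembed_request_url("example_user", "1234567890"))
```

Because the embed markup is supplied by Twitter when the tweet is displayed, a tweet that has since been deleted should simply fail to render rather than being shown from a stored copy.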

F. Be a Good Partner to Twitter

Agree!

1. Follow the guidelines for using Tweets in broadcast if you display Tweets offline and the guidelines for using Periscope Broadcasts in a broadcast if you display Periscope Broadcasts offline.

2. If you provide Twitter Content to third parties, including downloadable datasets of Twitter Content or an API that returns Twitter Content, you will only distribute or allow download of Tweet IDs, Direct Message IDs, and/or User IDs.

This is a key term that impacts the sharing of Twitter datasets for research and archiving. Each tweet has a unique id called a “tweet id.” Given a tweet id, a user can retrieve the complete tweet from Twitter’s API (called “hydrating”) unless that tweet has been deleted or the author’s account has been deleted, suspended, or protected.

This term requires that when sharing datasets with “third parties” (more on third parties shortly) only the tweet ids be exchanged (not the text of the tweet or other tweet metadata). (Here’s an example of GW Libraries sharing 280 million tweet ids for tweets related to the 2016 U.S. election.) The person who receives the tweet ids can then hydrate the dataset to get the complete tweets.
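To make the idea of a tweet-id-only dataset concrete, here is a minimal sketch of reducing a collected dataset to tweet ids before sharing it. It assumes the dataset is stored as one JSON tweet per line and that each tweet carries its id in an `id_str` field, as in Twitter's API responses. The function names and file layout are hypothetical.

```python
import json


def extract_tweet_ids(lines):
    """Yield the tweet id from each line of line-oriented tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the input
        tweet = json.loads(line)
        # Assumption: the id is stored as a string in "id_str".
        yield tweet["id_str"]


def write_tweet_ids(tweets_path, ids_path):
    """Write one tweet id per line to ids_path; return how many were written."""
    count = 0
    with open(tweets_path, encoding="utf-8") as src, \
            open(ids_path, "w", encoding="utf-8") as dst:
        for tweet_id in extract_tweet_ids(src):
            dst.write(tweet_id + "\n")
            count += 1
    return count
```

The resulting text file contains no tweet text or other metadata, which is the form of dataset this term appears to permit sharing with third parties.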

This has a number of key implications:

  • It adds friction to exchanging large Twitter datasets. Hydrating is limited to 360,000 tweets per hour.
  • Because tweets are regularly deleted and accounts are deleted, protected, or suspended, the hydrated dataset is likely to be missing tweets in the original dataset. While this provides a “right to be forgotten” for users, it is an impediment to reproducible research. It also complicates research on topics like bots or misinformation.
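Both implications can be put in rough numbers. The sketch below uses the 360,000 tweets per hour figure from above; the batch size of 100 ids per lookup request is an assumption about the API, and the helper names are hypothetical. It estimates how long hydration would take and measures how much of an original dataset survives hydration.

```python
HYDRATION_PER_HOUR = 360_000  # figure quoted in this post
IDS_PER_REQUEST = 100         # assumption: ids accepted per lookup request


def batches(ids, size=IDS_PER_REQUEST):
    """Split a list of tweet ids into request-sized batches."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hydration_hours(n_ids, per_hour=HYDRATION_PER_HOUR):
    """Estimate the hours needed to hydrate n_ids tweet ids."""
    return n_ids / per_hour


def surviving_fraction(original_ids, hydrated_ids):
    """Fraction of the original tweet ids that were still retrievable."""
    original = set(original_ids)
    if not original:
        return 1.0
    return len(original & set(hydrated_ids)) / len(original)


if __name__ == "__main__":
    days = hydration_hours(280_000_000) / 24
    print(f"Hydrating 280 million ids: about {days:.1f} days")
```

At that rate, a dataset the size of the 280 million tweet election collection mentioned above would take roughly a month of continuous hydration with a single set of credentials.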

Essential to this term is determining who is and is not a “third party” in the context of archiving and research, since third parties can only be given tweet ids and, presumably, first and second parties can be given the complete tweets. Twitter’s policies do not spell out what is a third (or other) party. Thus, this term has been interpreted in a variety of ways, ranging from the narrow (anyone who is not you is a third party) to the more liberal. At George Washington (GW) University Libraries, we (unofficially) interpreted this to allow sharing Twitter datasets that we collected with anyone affiliated with GW (including students, faculty, and other researchers) and their collaborators. (What constitutes a “collaborator” is, of course, ambiguous.) If someone from outside GW contacts the library about a dataset, only the tweet ids are shared.

[Clarification: Until summer of 2018, I was a member of the Social Feed Manager team at GW Libraries. I’m now at Stanford Libraries. Nothing in this post represents the official positions of either of these organizations.]

I’ve talked with others who have considered the same approach but applied to a larger organization such as a consortium. Thus, you can imagine an organization like ICPSR making Twitter datasets available to researchers at its member institutions. (I’m not suggesting this is their policy.) This would allow for broader sharing of Twitter datasets, while still limiting distribution to an academic research context.

In addition, one question that regularly arises and is not addressed by this term is whether the exchange of aggregated data or analytics derived from datasets is allowed (e.g., tweet counts by user or top hashtags).

One of my criticisms of this policy is that it treats all tweeters the same, regardless of whether they are a private citizen, bot, politician, celebrity, or government. I’d suggest that the protections that are afforded to private citizens should not be fully extended to the tweets of public figures and institutions. This should include both the handling of deleted tweets and the ability to share those tweets in an academic research context (and possibly beyond). As an example, at GW Libraries we collected the Twitter accounts of U.S. government agencies; a dataset containing the complete tweets should be publicly shareable. It is also unclear how Twitter policy interacts (collides?) with public records laws or the works of the US government being in the public domain.

Similarly, unaccounted for by this term is the ability for a Twitter user to donate her tweets to an archive and for that archive to distribute those tweets. This is a particularly important use case for community-based archives or archives of social movements. (For more on this, see the work of the Documenting the Now project.)

I.F.2.a. You may, however, provide export via non-automated means (e.g., download of spreadsheets or PDF files, or use of a “save as” button) of up to 50,000 public Tweet Objects and/or User Objects per user of your Service, per day.

This is further explained in the “Redistribution of Twitter Content” section of the Restricted Uses of the Twitter APIs policy:

We permit limited redistribution of hydrated Twitter content via non-automated means. If you choose to share hydrated Twitter content with another party in this way, you may only share up to 50,000 hydrated public Tweet Objects and/or User Objects per recipient, per day, and should not make this data publicly available (for example, as an attachment to a blog post or in a public Github repository).

I believe this term is primarily intended to limit the export of complete tweets from social media analytic services such as Crimson Hexagon or Sprout Social. However, it can also be read as allowing the sharing of small datasets, but not the public posting of those datasets.

I.F.2.b. Any Twitter Content provided to third parties remains subject to this Policy, and those third parties must agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy before receiving such downloads.

So make sure that anyone you share datasets with is aware of Twitter’s policies.

I.F.2.b.i. You may not distribute more than 1,500,000 Tweet IDs to any entity (inclusive of multiple individual users associated with a single entity) within any given 30 day period, unless you are doing so on behalf of an academic institution and for the sole purpose of non-commercial research or you have received the express written permission of Twitter.

When Twitter originally announced this policy, there was no exclusion for academic uses. However, in response to concerns raised by the library and academic communities, they added this exclusion.

I.F.2.b.ii. You may not distribute Tweet IDs for the purposes of (a) enabling any entity to store and analyze Tweets for a period exceeding 30 days unless you are doing so on behalf of an academic institution and for the sole purpose of non-commercial research or you have received the express written permission of Twitter, or (b) enabling any entity to circumvent any other limitations or restrictions on the distribution of Twitter Content as contained in this Policy, the Twitter Developer Agreement, or any other agreement with Twitter.

I’m not sure what this means but have comfortably ignored it because of the exclusion for academic uses.

We now turn to the Developer Agreement.

I.B. License from Twitter. Subject to the terms and conditions in this Agreement (as a condition to the grant below), Twitter hereby grants you and you accept a non-exclusive, royalty free, non-transferable, non-sublicensable, revocable license solely to:

1. Use the Twitter API to integrate Twitter Content into your Services or conduct analysis of such Twitter Content;

This seems to allow for academic research; archiving is a bit less clear.

II.B. Rate Limits. You will not attempt to exceed or circumvent limitations on access, calls and use of the Twitter API (“Rate Limits”), or otherwise use the Twitter API in a manner that exceeds reasonable request volume, constitutes excessive or abusive usage, or otherwise fails to comply or is inconsistent with any part of this Agreement. If you exceed or Twitter reasonably believes that you have attempted to circumvent Rate Limits, controls to limit use of the Twitter APIs or the terms and conditions of this Agreement, then your ability to use the Licensed Materials may be temporarily suspended or permanently blocked. Twitter may monitor your use of the Twitter API to improve the Twitter Services and to ensure your compliance with this Agreement and the Developer Terms.

Twitter imposes rate limits on how often you can make calls to the API with a specific set of credentials. In general, these are reasonably generous and allow collecting at scale, as long as your software is designed to take the limits into account and you have some patience. In our use of Social Feed Manager at GW Libraries, we encouraged each researcher to create her own credentials so that the collecting can be spread across them.
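One way software can "take the limits into account" is to pause once a window's allowance has been used up. The sketch below is a generic fixed-window limiter, not Social Feed Manager's implementation. The class name is hypothetical, and the example allowance of 900 calls per 15 minutes is an assumption that varies by API method.

```python
import time


class WindowRateLimiter:
    """Allow at most max_calls per fixed window, sleeping when exhausted."""

    def __init__(self, max_calls, window_seconds,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._window_start = None
        self._calls = 0

    def wait(self):
        """Call before each API request; blocks if the window is used up."""
        now = self._clock()
        expired = (self._window_start is None
                   or now - self._window_start >= self.window_seconds)
        if expired:
            # Start a fresh window.
            self._window_start = now
            self._calls = 0
        if self._calls >= self.max_calls:
            # Allowance spent: sleep until the current window ends.
            self._sleep(self.window_seconds - (now - self._window_start))
            self._window_start = self._clock()
            self._calls = 0
        self._calls += 1


if __name__ == "__main__":
    # Assumed allowance, for illustration: 900 calls per 15-minute window.
    limiter = WindowRateLimiter(max_calls=900, window_seconds=15 * 60)
    limiter.wait()  # returns immediately while the allowance lasts
```

Calling `wait()` before every request keeps a collector within its allowance, at the cost of the patience mentioned above.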

The rate limit that is particularly problematic for collecting at scale is the restriction that only one filter stream can run at a time for a set of credentials. To collect on a number of topics concurrently, it is desirable to run multiple filter streams, which would require multiple credentials.

It is tempting to create “dummy” Twitter accounts to circumvent rate limits. However, I would suggest that doing so at any significant scale violates the spirit of this term.

In my particular case, I have a handful of Twitter accounts that I use for various purposes. I do create a credential for each account and use them to collect multiple filter streams concurrently.

II.C. Geographic Data. Your license to use Twitter Content in this Agreement does not allow you to (and you will not allow others to) aggregate, cache, or store location data and other geographic information contained in the Twitter Content, except in conjunction with the Twitter Content to which it is attached. Your license only allows you to use such location data and geographic information to identify the location tagged by the Twitter Content. Any use of location data or geographic information on a standalone basis or beyond the license granted herein is a breach of this Agreement.

I’m unclear if and how this applies to academic research that involves geolocation.

IV.A. Ownership. The Licensed Material is licensed, not sold, and Twitter retains and reserves all rights not expressly granted in this Agreement. You expressly acknowledge that Twitter, its licensors and its end users retain all worldwide right, title and interest in and to the Licensed Material, including all rights in patents, trademarks, trade names, copyrights, trade secrets, know-how, data (including all applications therefor), and all proprietary rights under the laws of the United States, any other jurisdiction or any treaty (“IP Rights”). You agree not to do anything inconsistent with such ownership, including without limitation, challenging Twitter’s ownership of the Twitter Marks, challenging the validity of the licenses granted herein, or otherwise copying or exploiting the Twitter Marks during or after the termination of this Agreement, except as specifically authorized herein. If you acquire any rights in the Twitter Marks or any confusingly similar marks, by operation of law or otherwise, you will, at no expense to Twitter, immediately assign such rights to Twitter.

Content creators actually retain the copyright for their tweets and, presumably, related content such as images. I believe there was some legal ruling that 140 characters was too short to copyright, but I’m not sure if that changed when tweets went to 280 characters. Regardless, you should keep in mind that tweets are (usually) produced by people, whose agency should be considered. Untangling copyright considerations is one area that would definitely benefit from some legal expertise.

VII.A. User Protection. Twitter Content, and information derived from Twitter Content, may not be used by, or knowingly displayed, distributed, or otherwise made available to:

1. any public sector entity (or any entities providing services to such entities) for surveillance purposes, including but not limited to:

a. investigating or tracking Twitter’s users or their Twitter Content; and,

b. tracking, alerting, or other monitoring of sensitive events (including but not limited to protests, rallies, or community organizing meetings);

2. any public sector entity (or any entities providing services to such entities) whose primary function or mission includes conducting surveillance or gathering intelligence;

3. any entity for the purposes of conducting or providing surveillance, analyses or research that isolates a group of individuals or any single individual for any unlawful or discriminatory purpose or in a manner that would be inconsistent with our users’ reasonable expectations of privacy;

These are terms that Twitter actively enforces. Those doing research that is funded by a government or a government contractor, especially in the defense or homeland security area, should be mindful of these terms.

The list continues:

4. any entity to target, segment, or profile individuals based on health (including pregnancy), negative financial status or condition, political affiliation or beliefs, racial or ethnic origin, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership, data relating to any alleged or actual commission of a crime, or any other sensitive categories of personal information prohibited by law;

5. any entity that you reasonably believe will use such data to violate the Universal Declaration of Human Rights (located at http://www.un.org/en/documents/udhr/), including without limitation Articles 12, 18, or 19.

See also the “Sensitive Information” section of Twitter’s Restricted Use Cases policy.

It is unclear whether the prohibition on “any entity to target, segment, or profile individuals” based on a variety of protected classes or behaviors is intended to apply to research and archiving, though I expect that, in the wake of Cambridge Analytica, Twitter intends it to apply. There is a wide array of valuable research that is conducted in these areas, and an outright ban could have a negative impact on society and on those groups themselves. Typically, research ethics require heightened scrutiny and review of research in these areas, but not an outright ban.

Another problem with this term is that the case of Sifter gives some indication that Twitter expects a service to police its users in order to stop prohibited use cases. But again, this is unclear.

Weighing in on prohibited use cases is a relatively new part of Twitter’s policies. My general experience is that researchers either ignore these terms or interpret them as not applying to their work. Instead, they take guidance from the research ethics of their fields or policies of their institutions.

My observation has been that Twitter is aware of and sympathetic to the unique needs for research and archiving. That being said, it is notoriously difficult to get clarification on Twitter policies and, in general, the clarification is usually “no”. As a result, few bother to ask.

Because of the difficulty of getting clarity from Twitter, the ambiguity of Twitter’s policies, and our desire to do our work, most of us performing research and archiving with Twitter data make our own private interpretation. I say “private” because we are afraid to “say out loud” what that interpretation is for fear of Twitter clamping down. Thus, important discussions about research methodology, data sharing, and ethics do not occur and actual practices for Twitter research and archiving vary greatly.

In libraries and archives, it is almost impossible to have an official policy for Twitter data or for libraries to discuss and share model policies. As my colleague Laura Wrubel explains:

I see this when I go to conferences and there is a lot of discussion and interest in this topic, but still little in the way of shareable policies and practices. One of the effects is extreme caution, which leads to no involvement in social media work by librarians and archivists, which is a risk in itself.

The result is that much work with Twitter data occurs “in the shadows” where the interests of Twitter, researchers, archivists, librarians, and users would benefit from greater discussion and transparency. I’d take it even further and suggest that this is one of the factors that lead to research ethics failures. And it is clearly leading to gaps in the historical record.

I hope that Twitter takes steps to clarify its policies as they apply to research and archiving. And, I would further suggest that if they reached out to the relevant communities they would find good partners willing to discuss those policies.

And for those of you who are doing or considering research and archiving with Twitter data, but who find Twitter’s policies confusing and daunting, “you are not alone.” I would encourage you to assume good intentions on Twitter’s part, try to identify the spirit of Twitter’s policies until they provide adequate clarification, look to the ethical principles of your discipline, and start working to turn this into a public dialogue about how to be good partners.

Did I miss any parts of Twitter’s policies that you think are salient? Disagree with my reading? Have relevant examples? Let me know.

Thanks to Laura Wrubel for reviewing this post.

Justin Littman
On Archivy

Software dev at Stanford @DigitalLib. Previously @gelmanlibrary & @librarycongress. Otherwise, I’m @justin_littman.