This is a brief for the research paper An Empirical Study of GraphQL Schemas, presented at ICSOC 2019. Erik Wittern led the work, with help from Alan Cha (implementation), myself (experimental design), Guillaume Baudart (theoretical analysis), and Louis Mandel (theoretical analysis). Most of these authors are affiliated with IBM, and the work is part of IBM’s ongoing involvement with GraphQL through the GraphQL Foundation.
Since GraphQL is unfamiliar to many readers, I’ve included a bit more introductory material and illustrations than I usually do.
GraphQL is a query language for data that can be represented as a graph, and reportedly offers some advantages over REST-style APIs. GraphQL adoption is growing, but we know little about what GraphQL APIs look like in practice.
In this paper we built and studied a corpus of thousands of GraphQL schemas used in practice. Our study showed us both idioms and security problems. By identifying GraphQL idioms, service providers can learn how to structure their schemas to be easily understood by the community. By measuring the extent of GraphQL-related security problems, we can guide the design of API management tools to address real problems faced by practitioners.
Background and motivation
What is GraphQL?
GraphQL is a query language for graph-structured data. Facebook designed the language for internal use, and open-sourced it in 2015. Now it is maintained by the GraphQL Foundation under the aegis of the Linux Foundation.
GraphQL is designed for use in networked APIs. A GraphQL API has three components:
1. The GraphQL queries issued by clients, illustrated in this figure.
2. The GraphQL schema describing the types and relations available for querying, illustrated in this figure.
3. The GraphQL back-end that maps queries into the corresponding data.
Why would you GraphQL?
The following content is not critical to this post. I include it for readers unfamiliar with GraphQL’s value proposition.
When a company offers a “web API” they usually mean a REST-ful API. In the REST paradigm, the API provider publishes a set of endpoints corresponding to the data the client can obtain. Each endpoint returns a fixed set of information about the entity the client requested. If a client wants to collate information about several entities, they issue a request for each entity, supplemented by extra queries to explore related objects.
The REST-ful API paradigm can lead to performance inefficiencies and management headaches, motivating migration from the REST-ful paradigm to the GraphQL paradigm.
A GraphQL API can offer two performance benefits by avoiding two inefficiencies of REST-ful APIs:
- A client may need to issue many requests to collect the information they want, and may need to wait for network roundtrips to determine each additional query to issue.
- The client may be interested in a subset of the data returned by each endpoint. Unused data wastes resources as it is marshaled, transmitted, and parsed.
In a GraphQL API, the client can issue only one query that encodes precisely the data they request (see first figure). The client will get its data in a single response, and the server, client, and network operator all save work by eliminating unused data.
The provider of a REST-ful API must manage an ever-growing set of endpoints to return the data requested by clients. As an API matures, more endpoints are added, and endpoints may become versioned to accommodate older and newer views into the data. Left unchecked by a disciplined engineering process, the result can be an unmanageable spaghetti of endpoints.
In a GraphQL API, there is only one endpoint to manage.
What we know about GraphQL in principle
GraphQL’s expressiveness is a strength and a weakness. A REST-ful API limits the information that can be queried in a single request by limiting the expressiveness of an endpoint. A GraphQL API supports arbitrary queries over the data, and so a single query can request a great deal of data. A GraphQL API is analogous to publishing the back-end’s database schema and offering clients an SQL console:
SELECT * FROM table1 JOIN table2 JOIN ...
In this light, it is not surprising that the response to a GraphQL query can be exponentially larger than the query itself. Exponential is a bad word when applied to back-end behavior! The result can be performance problems or denial of service attacks, reminiscent of the ReDoS problem for regexes.
What we know about GraphQL in practice
- From a practical perspective, we know that GraphQL is seeing adoption. Dozens of companies are using GraphQL through public or private APIs.
- From a research perspective, we know a bit about what GraphQL schemas look like. Kim, Consens, and Hartig studied about 3,000 GraphQL schemas taken from open-source software. These schemas commonly support both queries and mutations, and about 50% of them contain cycles that might lead to exponential worst-case behavior.
Research questions — What we don’t yet know
- RQ1: What does a “typical” GraphQL schema look like? How large is it, what GraphQL features does it use, what naming conventions does it follow?
- RQ2: How serious is the risk of denial of service to GraphQL service providers?
Other authors have also considered these questions. We contribute a new schema collection methodology, a comparison to commercial schemas, and a few new analyses.
To answer these questions, we needed three things:
- Data: A corpus of GraphQL schemas for study.
- RQ1: An understanding of GraphQL schema norms, both induced from our corpus and compared to “official” GraphQL schema recommendations.
- RQ2: A GraphQL worst-case analysis that could differentiate between varying degrees of worst-case behavior.
First I’ll explain how we built a GraphQL schema corpus, and then I’ll talk through our methods and findings for RQ1 and RQ2.
Data: A GraphQL schema corpus
To give as full a perspective of GraphQL practices as possible, we obtained GraphQL schemas from two sources: (1) commercial API providers, and (2) open-source schemas defined in GitHub projects.
GraphQL is growing in popularity, but most of the 100+ companies that have adopted GraphQL only use it internally. We identified the 16 public commercial APIs by referencing the list maintained by APIs.guru as of May 1st, 2019, and downloaded their schemas using introspection.
The open-source community has also begun to experiment with GraphQL, and we found many GraphQL schemas in projects hosted on GitHub. This is how we obtained schemas from GitHub projects:
- Find GraphQL files in GitHub projects (first column of flowchart).
- Apply schema stitching to piece together valid GraphQL schemas from GraphQL schema fragments scattered across files in the same project (second column of flowchart).
- Filter out duplicates (third column of flowchart).
This yielded 8,399 complete, valid, unique GraphQL schemas.
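The post does not say how duplicates were detected. As a minimal sketch of one possible approach, the snippet below fingerprints each schema after collapsing whitespace, so that two copies differing only in formatting count as the same schema. The function names and the normalization rule are assumptions made for illustration, not the tooling used in the study.

```python
import hashlib


def fingerprint(schema_text):
    """Hash a schema after collapsing whitespace (assumed normalization rule)."""
    normalized = " ".join(schema_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(schemas):
    """Return schemas in their original order, keeping the first copy of each."""
    seen = set()
    unique = []
    for text in schemas:
        key = fingerprint(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Hypothetical inputs: the first two differ only in whitespace.
candidates = [
    "type A { id: ID }",
    "type A {\n  id: ID\n}",
    "type B { id: ID }",
]
unique_schemas = drop_duplicates(candidates)
```

A stricter variant could parse each schema and compare the parsed definitions, which would also ignore differences in comment text and definition order.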
RQ1: What a typical GraphQL schema looks like
Typical schema sizes
The next figure shows the size of the commercial and open-source schemas in terms of the number of distinct types they define.
We noted that the commercial schemas tended to define many more types than the open-source schemas, with a median of 122 definitions (commercial) vs. 9 definitions (GitHub). This suggested to us that some of the schemas from GitHub projects might be “toys”. To enable comparisons between the commercial and GitHub schemas, we identified the schemas in the GitHub corpus that were similar to the commercial schemas on this measure. This yielded three sub-corpuses:
- Commercial: The 16 schemas mined from GraphQL API providers.
- GitHub-full: The 8,399 schemas mined from GitHub projects.
- GitHub-large: A subset of the GitHub-full schemas, these are the 1,739 schemas that contained at least as many distinct types as the first quartile of the commercial corpus.
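The GitHub-large selection rule can be sketched as follows. This is an illustration under assumptions: the type counts are made up, the helper names are hypothetical, and the quartile is computed with Python's default method, which may differ from the one used in the study.

```python
import statistics


def first_quartile(values):
    """Return the first quartile of a list of numbers."""
    # With n=4, statistics.quantiles returns the three cut points Q1, Q2, Q3.
    return statistics.quantiles(values, n=4)[0]


def select_large(github_type_counts, commercial_type_counts):
    """Keep GitHub schemas with at least as many types as the commercial Q1."""
    threshold = first_quartile(commercial_type_counts)
    return {
        name: count
        for name, count in github_type_counts.items()
        if count >= threshold
    }


# Hypothetical numbers of distinct types per schema, for illustration only.
commercial = [40, 80, 122, 200, 500]
github = {"toy-blog": 5, "shop-api": 9, "big-crm": 150}
github_large = select_large(github, commercial)
```

With these made-up counts only the "big-crm" schema passes the threshold, mirroring how the small "toy" schemas drop out of the GitHub-large corpus.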
Summary statistics on these corpuses
The next table summarizes these schemas in terms of their size and the GraphQL features they use:
This table shows clear differences among all three corpuses.
- Not surprisingly, commercial and GitHub-large schemas are larger, containing more object types (types available to query) and more input object types (user arguments).
- On a per-object basis, however, objects have similar sizes in all corpuses (median of 3 fields).
- In terms of feature use, commercial schemas apply interface types, union types, and custom directives most frequently, followed by GitHub-large schemas and then GitHub schemas. Conversely, GitHub-large schemas more frequently offer mutation and subscription support, followed by GitHub schemas, and then commercial schemas.
Naming conventions help developers understand new interfaces quickly and create interfaces that are easily understandable. If a community follows conventions, everyone benefits. So what are the naming conventions for GraphQL schemas? Let’s take a look:
GraphQL experts have recommended a set of naming conventions through written guidelines as well as implicitly through the example schemas in the GraphQL documentation. These prescribed conventions are:
- Fields should be named in camelCase.
- Types should be named in PascalCase.
- Enums should be named in PascalCase.
- Enum values should be in ALL CAPS.
We tested for these conventions in our schema corpus. They are far from universally followed.
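A check of this kind might look like the sketch below. The regular expressions are assumptions about what counts as camelCase, PascalCase, and ALL CAPS; the exact rules used in the study are not given in this post. A schema "follows" a convention here only if every name in the relevant category matches.

```python
import re

# Assumed definitions of each style; the study's exact rules may differ.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")
PASCAL_CASE = re.compile(r"^[A-Z][a-zA-Z0-9]*$")
ALL_CAPS = re.compile(r"^[A-Z][A-Z0-9_]*$")


def all_follow(names, pattern):
    """True if every name matches the pattern (vacuously true for no names)."""
    return all(pattern.match(name) for name in names)


def any_follow(names, pattern):
    """True if at least one name matches the pattern."""
    return any(pattern.match(name) for name in names)


# Hypothetical names taken from an imagined schema.
field_names = ["name", "address", "first_name"]
type_names = ["Company", "FullTimeEmployee"]
enum_values = ["ACTIVE", "ON_LEAVE"]

fields_ok = all_follow(field_names, CAMEL_CASE)   # False: first_name has an underscore
types_ok = all_follow(type_names, PASCAL_CASE)    # True
values_ok = all_follow(enum_values, ALL_CAPS)     # True
```

Separating "all names follow" from "at least one name follows" matters for the results that come next, since a schema can use a style somewhere without using it consistently.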
The next table shows the naming conventions followed by the schemas in our corpus. In addition to the prescribed conventions (top half of table), we found several organic naming conventions upon which the open-source community seems to have agreed (bottom half of table). For example:
- Prescribed convention #1: In 8.2% of the GH-large schemas, all field names used camelCase.
- Organic convention #1: In 68.2% of the GH-large schemas, all input objects were named with the postfix
Many schemas follow these conventions inconsistently. For example, the snake_case convention for field names (which competes with the prescribed camelCase convention) is followed consistently by less than 1% of schemas. But around 37% of the schemas in the GitHub-full corpus use snake_case for at least one field, and 30% of all of the field names from the complete corpus are in
The clearest distinction in this table is between the commercial schemas and the GitHub schemas. It appears that the open-source community is moving towards agreement on these conventions, while the authors of the commercial schemas are less likely to adhere to them.
RQ2: Typical worst-case GraphQL behavior
The size of a GraphQL response can be expressed in terms of the number of objects that might be returned by the query.
I mentioned earlier that the response to a GraphQL query can be exponentially larger than the query itself. In fact, we can categorize schemas as having linear, polynomial, or exponential worst-case response size with respect to the size of a possible query.
This categorization may be easiest to understand using a few examples based on the schema shown next.
- The schema permits a Query for one Company based on its ID.
- Each Company has an address and a list of its Full-Time Employees (FTEs).
- Each FTE has a name, a list of interns they manage, and a list of coworkers.
- Each Intern has a name.
Now let’s think about the size of a response to a query for IBM’s address. The query in the next figure requests a primitive value from one company, so the response will contain at most the same number of objects as the query does.
Now let’s try a query for IBM’s address, a list of its employees, and each of their interns (next figure). Suppose that each of the D full-time employees has D interns. Then the response will contain O(D^2) objects. We’ve obtained a polynomial number of response objects by requesting nested lists of lists.
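To make the count concrete, the sum below tallies every object in that response. It assumes, as a simplification of the example, exactly one company, exactly D employees, and exactly D interns per employee.

```latex
\underbrace{1}_{\text{company}}
+ \underbrace{D}_{\text{employees}}
+ \underbrace{D \cdot D}_{\text{interns}}
= 1 + D + D^2 = O(D^2)
```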
Now that we’ve seen polynomial behavior, let’s take a look at exponential behavior. The D employees all work at IBM, so they are each other’s coworkers. In the next query we ask for the list of employees and obtain D objects. Then we ask for each of their coworkers, and get D^2 objects. When we ask for the coworkers of each of those coworkers, we get D^3 objects. More generally, if our query asks for n levels of coworkers, the response will contain D^n objects, an exponential number relative to the query size.
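The growth can be tallied with a few lines of code. This is a back-of-the-envelope sketch, not part of the paper's tooling: the function name is hypothetical, and it follows the post's simplification that every nested list returns exactly d objects per parent.

```python
def coworker_response_size(d, n):
    """Total objects returned when a query nests n levels of d-element lists."""
    total = 0
    objects_at_level = 1
    for _ in range(n):
        # Every object at the previous level expands into d objects.
        objects_at_level *= d
        total += objects_at_level
    return total


# With 10 people per list, three levels already return 10 + 100 + 1000 objects.
three_levels = coworker_response_size(10, 3)
# Each extra level in the query multiplies the deepest level by d again.
six_levels = coworker_response_size(10, 6)
```

The query grows by one nested selection per level, while the deepest level of the response grows by a factor of d, which is why the response is exponential in the query size.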
Achieving responses that are polynomially or exponentially larger than the query depends on having a schema with nested lists. The difference between polynomial and exponential behavior is the presence of a cycle in the schema graph.
- Polynomial: Nested queries for different lists
- Exponential: Nested queries for the same lists
Based on this intuition, we theorized and implemented a more formal analysis. Our analysis depends on the structure of the schema:
- Is it possible to query nested lists? (polynomial, degree dependent on the level of nesting)
- Among the nested lists, is there a type cycle such that we can nest a query for a list of the same type? (exponential)
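The two questions above can be turned into a small classifier over the schema's type graph. This is a simplified sketch rather than the analysis from the paper: the schema encoding (a dict mapping each type to its object-typed fields, with a flag for list fields), the function names, and the choice to treat a single level of lists as "linear" are all assumptions made for illustration.

```python
def reachable(schema, start):
    """Return every type reachable from `start` by following object-typed fields."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(target for target, _ in schema.get(current, []))
    return seen


def max_list_nesting(schema, types):
    """Most list fields on any path; only valid when no list field is on a cycle."""
    best = {t: 0 for t in types}
    for _ in range(len(types)):  # repeated relaxation, enough rounds to converge
        for source in types:
            for target, is_list in schema.get(source, []):
                best[target] = max(best[target], best[source] + int(is_list))
    return max(best.values())


def classify(schema, root="Query"):
    """Classify worst-case response size as linear, polynomial, or exponential."""
    types = reachable(schema, root)
    for source in types:
        for target, is_list in schema.get(source, []):
            # A list field whose target can get back to its source lies on a cycle.
            if is_list and source in reachable(schema, target):
                return "exponential"
    return "polynomial" if max_list_nesting(schema, types) >= 2 else "linear"


# The example schema from this post: (target type, is the field a list?).
EXAMPLE = {
    "Query": [("Company", False)],
    "Company": [("FTE", True)],
    "FTE": [("Intern", True), ("FTE", True)],  # interns and coworkers
    "Intern": [],
}
```

Under these assumptions `classify(EXAMPLE)` returns "exponential" because of the coworkers field, and removing that field leaves the two nested lists (employees, then interns) that make the schema "polynomial".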
Our analysis provides an upper bound on the worst-case response size. We may overestimate the response size based on two factors:
- The GraphQL server may limit the response size.
- The underlying data (i.e. the “graph” in GraphQL) may be sparse, providing a natural limit on the response size. From the example above, if IBM’s D employees had only one coworker each, we would obtain a response of size O(D), a far cry from the O(D^n) threatened by the schema’s structure.
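The gap between the schema-level bound and the actual response can be seen by counting objects over concrete data. The sketch below is illustrative only: the employee names and coworker relations are invented, and the function simply walks the data the way a nested coworkers query would.

```python
def response_objects(employees, coworkers, levels):
    """Count objects returned by `levels` nested lists over concrete data."""
    total = 0
    frontier = list(employees)  # level 1: the company's employee list
    for _ in range(levels):
        total += len(frontier)
        # Next level: the coworkers of every object at the current level.
        frontier = [peer for person in frontier for peer in coworkers[person]]
    return total


# Hypothetical data: four employees.
staff = ["ana", "bo", "cy", "di"]
# Sparse graph: each employee has exactly one coworker.
sparse = {"ana": ["bo"], "bo": ["ana"], "cy": ["di"], "di": ["cy"]}
# Dense graph: everyone is a coworker of everyone else.
dense = {p: [q for q in staff if q != p] for p in staff}

sparse_size = response_objects(staff, sparse, 3)
dense_size = response_objects(staff, dense, 3)
```

For the same three-level query, the sparse data returns the same number of objects at every level, while the dense data multiplies the count at each level.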
Raw worst-case response sizes
The next table shows that exponential worst-case response size is typical in large schemas, affecting over 80% of the commercial and GitHub-large schemas.
This finding is not too surprising. GraphQL is about describing relations between types, and relations that lead to super-linear behavior seem natural to have in a schema. However, this finding implies that GraphQL providers and middleware services should plan to handle super-linear queries.
The GraphQL documentation recommends pagination as a defense against over-large response sizes: using either slicing or the connections pattern.
We tested our schemas for pagination.
- The use of slicing is easy to observe, as it is embedded in the schema.
- To identify the connections pattern, we used heuristics on the name of the fields.
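Detection heuristics of this kind might look like the sketch below. The argument names treated as slicing arguments are a hypothetical list, and the connections check relies on the `edges` and `pageInfo` field names from the Relay cursor connections convention. The heuristics actually used in the study may differ.

```python
# Hypothetical set of argument names that suggest a slicing argument.
SLICING_ARGS = {"first", "last", "limit", "max"}


def uses_slicing(field_argument_names):
    """True if a list field accepts an argument that looks like a slice size."""
    return any(name in SLICING_ARGS for name in field_argument_names)


def looks_like_connection(type_field_names):
    """True if a type has the fields expected of a connection type."""
    return {"edges", "pageInfo"} <= set(type_field_names)


# Hypothetical schema fragments, for illustration only.
sliced_field_args = ["first", "after"]          # e.g. employees(first: 10, after: "...")
unsliced_field_args = ["department"]
connection_type_fields = ["edges", "pageInfo", "totalCount"]
plain_type_fields = ["name", "interns"]
```

Name-based heuristics like these can miss schemas that paginate under unusual names, and can flag arguments that merely share a name with a slicing argument.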
The next table shows what we found:
- No corpus consistently uses pagination patterns, raising the specter of worst-case response sizes.
- When pagination patterns are used, commercial and GitHub-large schemas tend to use the connections pattern, while slicing is not used consistently.
We urge practitioners to adopt pagination more widely.
GraphQL is an increasingly important technology. We provided an empirical assessment of the current state of GraphQL through our rich corpuses, novel schema reconstruction methodology, and novel analyses. Our key contributions are:
- RQ1: Our characterization of naming conventions can help developers adopt community standards to improve API usability.
- RQ2: We have confirmed the fears of practitioners and warnings of researchers about the risk of denial of service against GraphQL APIs. Most commercial and large open-source GraphQL APIs may be susceptible to queries with exponential-sized responses. We report that many schemas do not follow best practices and thus incompletely defend against such queries.
Our work motivates many avenues for future research, such as: refactoring tools to support naming conventions, coupled schema-query analyses to estimate response sizes in middleware (e.g. rate limiting), and data-driven backend design.
- The full paper is available here.
- The slides are available here.
- On Zenodo, you can find two artifacts associated with this project. We have packaged up a corpus of the schemas with permissive licenses as well as the tools we used to build our corpus.
Brito, Mombach, and Valente, 2019. Migrating to GraphQL: A Practical Assessment.
Shrock, 2015. GraphQL Introduction.
Wittern, Cha, and Laredo, 2018. Generating GraphQL-Wrappers for REST (-like) APIs.
Hartig and Perez, 2018. Semantics and Complexity of GraphQL.
GraphQL Foundation, 2019. GraphQL adopters.
Kim, Consens, and Hartig, 2019. An Empirical Analysis of GraphQL API Schemas in Open Code Repositories and Package Registries.
Apollo’s GraphQL style conventions.
GraphQL Foundation’s Introduction to GraphQL.