An Empirical Study of GraphQL Schemas

Published in The Startup, Nov 2, 2019

This is a brief for the research paper An Empirical Study of GraphQL Schemas, presented at ICSOC 2019. Erik Wittern led the work, with help from Alan Cha (implementation), myself (experimental design), Guillaume Baudart (theoretical analysis), and Louis Mandel (theoretical analysis). Most of these authors are affiliated with IBM, and the work reflects IBM’s ongoing involvement with GraphQL through the GraphQL Foundation.

Since GraphQL is unfamiliar to many readers, I’ve included a bit more introductory material and illustrations than I usually do.

Summary

GraphQL is a query language for data that can be represented as a graph, and reportedly offers some advantages over REST-style APIs. GraphQL adoption is growing, but we know little about what GraphQL APIs look like in practice.

In this paper we built and studied a corpus of thousands of GraphQL schemas used in practice. Our study showed us both idioms and security problems. By identifying GraphQL idioms, service providers can learn how to structure their schemas to be easily understood by the community. By measuring the extent of GraphQL-related security problems, we can guide the design of API management tools to address real problems faced by practitioners.

Background and motivation

What is GraphQL?

GraphQL is a query language for graph-structured data [1]. Facebook designed the language for internal use, and open-sourced it in 2015. Now it is maintained by the GraphQL Foundation under the aegis of the Linux Foundation.

GraphQL is designed for use in networked APIs. A GraphQL API has three components:

1. The GraphQL queries issued by clients, illustrated in this figure.

Sample GraphQL query (left) and response (right). The response contains exactly the data requested in the query, in the structure described by the query.

2. The GraphQL schema describing the types and relations available for querying, illustrated in this figure.

Sample GraphQL schema for queries and modifications (mutations) of data about companies and their offices.

3. The GraphQL back-end that maps queries into the corresponding data.

Why would you GraphQL?

The following content is not critical to this post. I include it for readers unfamiliar with GraphQL’s value proposition.

When a company offers a “web API” they usually mean a REST-ful API. In the REST paradigm, the API provider publishes a set of endpoints corresponding to the data the client can obtain. Each endpoint returns a fixed set of information about the entity the client requested. If a client wants to collate information about several entities, they issue a request for each entity, supplemented by extra queries to explore related objects.

The REST-ful API paradigm can lead to performance inefficiencies [2] and management headaches [3], motivating migration from the REST-ful paradigm to the GraphQL paradigm [4].

Performance

A GraphQL API can address two performance problems of REST-ful APIs [2].

A client may need to issue many requests to collect the information they want, and may need to wait for network roundtrips to determine each additional query to issue.
The client may be interested in a subset of the data returned by each endpoint. Unused data wastes resources as it is marshaled, transmitted, and parsed.

In a GraphQL API, the client can issue a single query that encodes precisely the data they want (see first figure). The client gets its data in a single response, and the server, client, and network operator all save work by eliminating unused data.
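To make the "exactly the data requested" idea concrete, here is a minimal Python sketch. It is not a GraphQL implementation: the `select` helper, the nested-dict selection format, and the sample company record are all assumptions made for illustration.

```python
from typing import Any, Dict

# A selection maps each requested field to None (a leaf value)
# or to a nested selection (a sub-object or a list of sub-objects).
Selection = Dict[str, Any]


def select(data: Any, selection: Selection) -> Any:
    """Return only the requested fields, mirroring the selection's shape."""
    if isinstance(data, list):
        return [select(item, selection) for item in data]
    result = {}
    for field, sub in selection.items():
        value = data[field]
        result[field] = value if sub is None else select(value, sub)
    return result


# Hypothetical back-end record; a fixed REST endpoint might return all of it.
company = {
    "id": "c1",
    "name": "Example Corp",
    "address": "1 Main St",
    "offices": [
        {"city": "Austin", "headcount": 120},
        {"city": "Zurich", "headcount": 45},
    ],
}

# One "query" asks for the name plus each office's city, and nothing else.
response = select(company, {"name": None, "offices": {"city": None}})
print(response)
```

The unused `id`, `address`, and `headcount` values never leave the server, which is the saving described above.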

Management

The provider of a REST-ful API must manage an ever-growing set of endpoints to return the data requested by clients. As an API matures, more endpoints are added, and endpoints may become versioned to accommodate older and newer views into the data. Left unchecked by a disciplined engineering process, the result can be an unmanageable spaghetti of endpoints [3].

In a GraphQL API, there is only one endpoint to manage.

What we know about GraphQL in principle

GraphQL’s expressiveness is a strength and a weakness. A REST-ful API limits the information that can be queried in a single request by limiting the expressiveness of an endpoint. A GraphQL API supports arbitrary queries over the data, and so a single query can request a great deal of data. A GraphQL API is analogous to publishing the back-end’s database schema and offering clients an SQL console: SELECT * FROM table1 JOIN table2 JOIN ...

In this light, it is not surprising that the response to a GraphQL query can be exponentially larger than the query itself [5]. Exponential is a bad word when applied to back-end behavior! The result can be performance problems or denial of service attacks, reminiscent of the ReDoS problem for regexes.

What we know about GraphQL in practice

From a practical perspective, we know that GraphQL is seeing adoption. Dozens of companies are using GraphQL through public or private APIs [6].
From a research perspective, we know a bit about what GraphQL schemas look like [7]. Kim, Consens, and Hartig studied about 3,000 GraphQL schemas taken from open-source software. These schemas commonly support both queries and mutations, and about 50% of them contain cycles that might lead to exponential worst-case behavior.

Research questions — What we don’t yet know

RQ1: What does a “typical” GraphQL schema look like? How large is it, what GraphQL features does it use, what naming conventions does it follow?
RQ2: How serious is the risk of denial of service to GraphQL service providers?

Other authors have also considered these questions [7]. We contribute a new schema collection methodology, a comparison to commercial schemas, and a few new analyses.

Research plan

To answer these questions, we needed three things:

Data: A corpus of GraphQL schemas for study.
RQ1: An understanding of GraphQL schema norms, both induced from our corpus and compared to “official” GraphQL schema recommendations.
RQ2: A GraphQL worst-case analysis that could differentiate between varying degrees of worst-case behavior.

First I’ll explain how we built a GraphQL schema corpus, and then I’ll talk through our methods and findings for RQ1 and RQ2.

Data: A GraphQL schema corpus

To give as full a perspective of GraphQL practices as possible, we obtained GraphQL schemas from two sources: (1) commercial API providers, and (2) open-source schemas defined in GitHub projects.

The schema corpuses used in this work. We obtained the commercial corpus from commercial GraphQL API providers. We mined the GitHub corpus from GitHub projects.

Commercial sub-corpus

GraphQL is growing in popularity, but most of the 100+ companies that have adopted GraphQL only use it internally. We identified 16 public commercial GraphQL APIs by referencing the list maintained by APIs.guru as of May 1st, 2019, and downloaded their schemas using introspection.
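For readers who have not seen introspection, the sketch below builds a small introspection request in Python. The `__schema` field is part of GraphQL itself, but the query here is far smaller than what a full schema download would need, and the endpoint URL and the `build_introspection_request` helper are assumptions for illustration, not the tooling we used.

```python
import json
import urllib.request

# __schema, types, and fields are standard introspection fields.
# This query is deliberately much smaller than a full introspection query.
INTROSPECTION_QUERY = """
{
  __schema {
    queryType { name }
    types {
      name
      kind
      fields { name }
    }
  }
}
"""


def build_introspection_request(endpoint: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the introspection query."""
    body = json.dumps({"query": INTROSPECTION_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint; substitute a real GraphQL API URL.
request = build_introspection_request("https://api.example.com/graphql")
# urllib.request.urlopen(request) would send it; the JSON reply describes the schema.
```

Many providers require an authentication header as well, which this sketch omits.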

GitHub sub-corpus

The open-source community has also begun to experiment with GraphQL, and we found many GraphQL schemas in projects hosted on GitHub. This is how we obtained schemas from GitHub projects:

Process by which we obtained GraphQL schemas from GitHub projects.

Find GraphQL files in GitHub projects (first column of flowchart).
Apply schema stitching to piece together valid GraphQL schemas from GraphQL schema fragments scattered across files in the same project (second column of flowchart).
Filter out duplicates (third column of flowchart).

This yielded 8,399 complete, valid, unique GraphQL schemas.
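The stitching and de-duplication steps can be pictured with a rough Python sketch. This is a simplification under stated assumptions, not our actual pipeline: `stitch` just concatenates fragments, `fingerprint` treats two schemas as duplicates when they differ only in whitespace, and no validation is performed. All function names and the sample projects are made up.

```python
import hashlib
from typing import Dict, List


def stitch(fragments: List[str]) -> str:
    """Naively combine a project's schema fragments into one document."""
    return "\n".join(fragment.strip() for fragment in fragments)


def fingerprint(schema: str) -> str:
    """Hash a schema after collapsing whitespace, so formatting is ignored."""
    normalized = " ".join(schema.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_schemas(projects: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each distinct fingerprint to the first stitched schema that produced it."""
    unique: Dict[str, str] = {}
    for name in sorted(projects):
        schema = stitch(projects[name])
        unique.setdefault(fingerprint(schema), schema)
    return unique


# Made-up projects: "b" is a reformatted copy of "a".
projects = {
    "a": ["type Query { company(id: ID!): Company }", "type Company { name: String }"],
    "b": ["type Query {  company(id: ID!): Company }\n", "type Company { name: String }"],
    "c": ["type Query { hello: String }"],
}
print(len(unique_schemas(projects)))
```

A real pipeline would also need to parse each stitched document and reject the ones that are not complete, valid schemas.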

RQ1: What a typical GraphQL schema looks like

Typical schema sizes

The next figure shows the size of the commercial and open-source schemas in terms of the number of distinct types they define.

Distributions of schema complexity (number of definitions) in the GitHub, commercial, and GitHub-large schema corpuses. Whiskers show min and max values and the boxes show the quartiles.

We noted that the commercial schemas tended to define many more types than the open-source schemas, with a median of 122 definitions (commercial) vs. 9 definitions (GitHub). This suggested to us that some of the schemas from GitHub projects might be “toys”. To enable comparisons between the commercial and GitHub schemas, we identified the schemas in the GitHub corpus that were similar to the commercial schemas on this measure. This yielded three sub-corpuses:

Commercial: The 16 schemas mined from GraphQL API providers.
GitHub-full: The 8,399 schemas mined from GitHub projects.
GitHub-large: A subset of the GitHub-full schemas, these are the 1,739 schemas that contained at least as many distinct types as the first quartile of the commercial corpus.
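The GitHub-large selection rule can be sketched in a few lines of Python. The type counts below are invented, the `large_subset` helper is hypothetical, and the choice of the "inclusive" quartile method is an assumption; different quartile definitions give slightly different thresholds.

```python
import statistics
from typing import Dict


def large_subset(commercial: Dict[str, int], github: Dict[str, int]) -> Dict[str, int]:
    """Keep GitHub schemas with at least as many types as the commercial first quartile."""
    first_quartile = statistics.quantiles(commercial.values(), n=4, method="inclusive")[0]
    return {name: count for name, count in github.items() if count >= first_quartile}


# Invented numbers of distinct types per schema.
commercial_counts = {"api1": 10, "api2": 50, "api3": 120, "api4": 200, "api5": 400}
github_counts = {"toy": 4, "blog": 9, "shop": 50, "crm": 130}

print(large_subset(commercial_counts, github_counts))
```

With these made-up numbers the threshold is 50 types, so only the two larger GitHub schemas survive the filter.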

Summary statistics on these corpuses

The next table summarizes these schemas in terms of their size and the GraphQL features they use:

Characteristics and features used in our schema corpuses.

This table shows clear differences among all three corpuses.

Not surprisingly, commercial and GitHub-large schemas are larger, containing more object types (types available to query) and more input object types (user arguments).
On a per-object basis, however, objects have similar sizes in all corpuses (median of 3 fields).
In terms of feature use, commercial schemas apply interface types, union types, and custom directives most frequently, followed by GitHub-large schemas and then GitHub schemas. Conversely, GitHub-large schemas more frequently offer mutation and subscription support, followed by GitHub schemas, and then commercial schemas.

Naming conventions

Naming conventions help developers understand new interfaces quickly and create interfaces that are easily understandable. If a community follows conventions, everyone benefits. So what are the naming conventions for GraphQL schemas? Let’s take a look:

“Official” recommendations

GraphQL experts have recommended a set of naming conventions through written guidelines [8] as well as implicitly through the example schemas in the GraphQL documentation [9]. These prescribed conventions are:

Fields should be named in camelCase.
Types should be named in PascalCase.
Enums should be named in PascalCase.
Enum values should be in ALL CAPS.

We tested for these conventions in our schema corpus. They are far from universally followed.
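A check of this kind can be approximated with regular expressions. The patterns below are my own permissive approximations of the four prescribed conventions, not the exact rules from the paper; for instance, they accept digits and treat a short name like ID as both PascalCase and ALL CAPS.

```python
import re
from typing import Iterable

# Approximate patterns; assumptions for illustration only.
CONVENTIONS = {
    "camelCase": re.compile(r"^[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*$"),
    "PascalCase": re.compile(r"^(?:[A-Z][a-z0-9]*)+$"),
    "ALL_CAPS": re.compile(r"^[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*$"),
}


def follows(name: str, convention: str) -> bool:
    """True if a single name matches the given convention."""
    return CONVENTIONS[convention].match(name) is not None


def schema_follows(names: Iterable[str], convention: str) -> bool:
    """True only if every name in the schema matches the convention."""
    return all(follows(name, convention) for name in names)


print(schema_follows(["id", "firstName", "officeCount"], "camelCase"))  # fields
print(schema_follows(["Company", "OfficeInput"], "PascalCase"))         # types
print(schema_follows(["FULL_TIME", "INTERN"], "ALL_CAPS"))              # enum values
print(schema_follows(["id", "first_name"], "camelCase"))                # mixed style
```

Note that `schema_follows` is an all-or-nothing test: one stray name is enough for a schema to count as not following the convention.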

Actual practices

The next table shows the naming conventions followed by the schemas in our corpus. In addition to the prescribed conventions (top half of table), we found several organic naming conventions upon which the open-source community seems to have agreed (bottom half of table). For example:

Prescribed convention #1: In 8.2% of the GitHub-large schemas, all field names used camelCase.
Organic convention #1: In 68.2% of the GitHub-large schemas, all input objects were named with the postfix Input.

The proportion of schemas that consistently adhere to prescribed (upper part) and organic (lower part) naming conventions. In rows marked with a † we report percentages from the subsets of schemas that use any enums, input object types, or mutations, respectively.

Many schemas follow these conventions inconsistently. For example, the snake_case convention for field names (which competes with the prescribed camelCase convention) is followed consistently by fewer than 1% of schemas. But around 37% of the schemas in the GitHub-full corpus use snake_case for at least one field, and 30% of all of the field names from the complete corpus are in snake_case.
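The gap between "consistently" and "at least once" is easy to see in a small Python sketch. The data is invented, and two choices here are assumptions of mine rather than the paper's definitions: a name counts as snake_case only if it contains an underscore, and single-word lowercase names are treated as neutral.

```python
import re
from typing import Dict, List

SNAKE = re.compile(r"^[a-z][a-z0-9]*(?:_[a-z0-9]+)+$")  # needs at least one underscore
SINGLE_WORD = re.compile(r"^[a-z][a-z0-9]*$")           # neutral: fits both styles


def snake_case_usage(schemas: Dict[str, List[str]]) -> Dict[str, float]:
    """Share of schemas using snake_case consistently vs. for at least one field."""
    consistent = 0
    at_least_one = 0
    for fields in schemas.values():
        multiword = [f for f in fields if not SINGLE_WORD.match(f)]
        hits = [f for f in multiword if SNAKE.match(f)]
        if hits:
            at_least_one += 1
            if len(hits) == len(multiword):
                consistent += 1
    total = len(schemas)
    return {"consistent": consistent / total, "at_least_one": at_least_one / total}


# Invented field names for four schemas.
schemas = {
    "a": ["id", "first_name", "last_name"],  # consistently snake_case
    "b": ["id", "firstName", "created_at"],  # mixed
    "c": ["id", "firstName"],                # camelCase only
    "d": ["name"],                           # nothing multi-word
}
print(snake_case_usage(schemas))
```

Here half of the toy schemas touch snake_case but only a quarter stick to it, the same kind of gap we saw in the real corpus.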

The clearest distinction in this table is between the commercial schemas and the GitHub schemas. It appears that the open-source community is moving towards agreement on these conventions, while the authors of the commercial schemas are less likely to adhere to them.

RQ2: Typical worst-case GraphQL behavior

Theory

The size of a GraphQL response can be expressed in terms of the number of objects that might be returned by the query.

I mentioned earlier that the response to a GraphQL query can be exponentially larger than the query itself [5]. In fact, we can categorize schemas as having linear, polynomial, or exponential worst-case response size with respect to the size of a possible query.

This categorization may be easiest to understand using a few examples based on the schema shown next.

The schema permits a Query for one Company based on its ID.
Each Company has an address and a list of its Full-Time Employees (FTEs).
Each FTE has a name, a list of interns they manage, and a list of coworkers.
Each Intern has a name.

Sample schema for a Company with Employees. The illustration shows the possible relations you can query.

Now let’s think about the size of a response to a query for IBM’s address. The query in the next figure requests a primitive value from one company, so the response will contain at most the same number of objects as the query does.

A query for IBM’s address yields a response of the same size as the query.

Now let’s try a query for IBM’s address, a list of its employees, and each of their interns (next figure). Suppose that IBM has D full-time employees and that each of them has D interns. Then the response will contain O(D^2) objects. We’ve obtained a polynomial number of response objects by requesting nested lists of lists.

A query that contains nested lists of lists can return a polynomial number of objects.

Now that we’ve seen polynomial behavior, let’s take a look at exponential behavior. The D employees all work at IBM, so they are each other’s coworkers. In the next query we ask for the list of employees and obtain D objects. Then we ask for each of their coworkers, and get D^2 objects. When we ask for the coworkers of each of those coworkers, we get D^3 objects. More generally, if our query asks for n levels of coworkers, the response will contain D^n objects, an exponential number relative to the query size.

A query that contains a cycle of lists can return an exponential number of objects.
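The following Python sketch puts the two growth rates side by side: the query text grows by one field per level while the response grows by a factor of D. The field names (`company`, `employees`, `coworkers`, `name`) and the `id` argument are assumptions based on the description above, since the exact schema lives in the figure.

```python
def nested_query(levels: int) -> str:
    """Build a query with `levels` nested lists: employees, then coworkers below it."""
    inner = "name"
    for _ in range(levels - 1):
        inner = f"coworkers {{ {inner} }}"
    return f'{{ company(id: "IBM") {{ employees {{ {inner} }} }} }}'


def deepest_objects(levels: int, d: int) -> int:
    """Objects at the innermost level if every list holds d entries."""
    return d ** levels


D = 10  # assumed list length
for levels in range(1, 5):
    query = nested_query(levels)
    print(levels, len(query), deepest_objects(levels, D))
```

Each extra level adds a fixed handful of characters to the query but multiplies the innermost object count by D.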

Achieving responses that are polynomially or exponentially larger than the query depends on having a schema with nested lists. The difference between polynomial and exponential behavior is the presence of a cycle in the schema graph.

Polynomial: Nested queries for different lists
Exponential: Nested queries for the same lists

Based on this intuition, we theorized and implemented a more formal analysis. Our analysis depends on the structure of the schema:

Is it possible to query nested lists? (polynomial, degree dependent on the level of nesting)
Among the nested lists, is there a type cycle such that we can nest a query for a list of the same type? (exponential)
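As a rough illustration of these two questions, here is a simplified Python classifier over a type graph. It is a sketch of the intuition only, not the formal analysis from the paper: the `TypeGraph` encoding, the function names, and the exact rules (a list edge on a cycle means exponential, two or more list edges on some path means polynomial) are my simplifications.

```python
from typing import Dict, List, Set, Tuple

# Type graph: type name -> list of (target type, field returns a list?).
# Scalar fields are omitted because they do not lead to further objects.
TypeGraph = Dict[str, List[Tuple[str, bool]]]


def reachable(graph: TypeGraph, start: str) -> Set[str]:
    """All types reachable from `start`, including `start` itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for target, _ in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def classify(graph: TypeGraph, root: str = "Query") -> str:
    """Classify worst-case response growth as linear, polynomial, or exponential."""
    from_root = reachable(graph, root)
    # Exponential: some list edge u -> v lies on a cycle (v can get back to u).
    for u in from_root:
        for v, is_list in graph.get(u, []):
            if is_list and u in reachable(graph, v):
                return "exponential"
    # No list edge is on a cycle, so this longest-path relaxation settles.
    depth = {root: 0}  # max number of list edges on a path from the root
    for _ in range(len(from_root)):
        for u in list(depth):
            for v, is_list in graph.get(u, []):
                candidate = depth[u] + (1 if is_list else 0)
                if candidate > depth.get(v, -1):
                    depth[v] = candidate
    return "polynomial" if max(depth.values()) >= 2 else "linear"


company_schema: TypeGraph = {
    "Query": [("Company", False)],
    "Company": [("FTE", True)],                # employees
    "FTE": [("Intern", True), ("FTE", True)],  # interns, coworkers
    "Intern": [],
}
print(classify(company_schema))
```

Dropping the coworkers edge from `FTE` leaves only the employees-then-interns nesting, and the same function then reports polynomial growth.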

Our analysis provides an upper bound on the worst-case response size. We may overestimate the actual response size because of two factors:

The GraphQL server may limit the response size.
The underlying data (i.e. the “graph” in GraphQL) may be sparse, providing a natural limit on the response size. From the example above, if IBM’s D employees had only one coworker each, we would obtain a response of size O(D), a far cry from the O(D^n) threatened by the schema’s structure.
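To see how much the data matters, the Python sketch below counts the objects actually returned at each nesting level for two made-up coworker graphs. The `objects_per_level` helper and the data are assumptions for illustration; in the dense case each employee lists all D employees (including themselves, to keep the arithmetic at D^n).

```python
from typing import Dict, List


def objects_per_level(coworkers: Dict[str, List[str]], levels: int) -> List[int]:
    """Count returned objects per level: employees, their coworkers, and so on."""
    counts = []
    # frontier[e] = how many times employee e appears at the current level
    frontier = {employee: 1 for employee in coworkers}
    for _ in range(levels):
        counts.append(sum(frontier.values()))
        next_frontier: Dict[str, int] = {}
        for employee, times in frontier.items():
            for coworker in coworkers[employee]:
                next_frontier[coworker] = next_frontier.get(coworker, 0) + times
        frontier = next_frontier
    return counts


names = ["ann", "bob", "cy", "dee"]  # D = 4
dense = {name: list(names) for name in names}  # everyone lists everyone
sparse = {name: [names[(i + 1) % 4]] for i, name in enumerate(names)}  # one coworker each

print(objects_per_level(dense, 3))
print(objects_per_level(sparse, 3))
```

The schema is identical in both cases; only the data differs, which is why a schema-level analysis can only give an upper bound.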

Findings

Raw worst-case response sizes

The next table shows that exponential worst-case response size is typical in large schemas, affecting over 80% of the commercial and GitHub-large schemas.

Worst-case response size based on type graph analysis, where n denotes the query size, and D the maximum length of the retrieved lists.

This finding is not too surprising. GraphQL is about describing relations between types, and relations that lead to super-linear behavior seem natural to have in a schema. However, this finding implies that GraphQL providers and middleware services should plan to handle super-linear queries.

Defensive posture

The GraphQL documentation recommends pagination as a defense against over-large response sizes: using either slicing or the connections pattern.

We tested our schemas for pagination.

The use of slicing is easy to observe, as it is embedded in the schema.
To identify the connections pattern, we used heuristics on the names of the fields.
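A name-based detector along these lines might look like the Python sketch below. The heuristics are my own guesses and may differ from the ones in the paper: slicing is assumed when a list field takes an argument such as `first` or `limit`, and the connections pattern is assumed when the return type name ends in Connection. The `FieldDef` record is a hypothetical stand-in for a parsed schema field.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed argument names that suggest a slicing limit.
SLICING_ARGS = {"first", "last", "limit", "max"}


@dataclass
class FieldDef:
    name: str
    type: str                   # e.g. "[Intern!]!" or "EmployeeConnection"
    args: Tuple[str, ...] = ()  # argument names only


def pagination_style(field: FieldDef) -> str:
    """Guess how (or whether) a field's list results are paginated."""
    type_name = field.type.rstrip("!")
    if type_name.endswith("Connection"):
        return "connections"
    if type_name.startswith("["):
        if SLICING_ARGS & set(field.args):
            return "slicing"
        return "unpaginated list"
    return "not a list"


fields = [
    FieldDef("employees", "EmployeeConnection!", ("first", "after")),
    FieldDef("interns", "[Intern!]!", ("limit",)),
    FieldDef("coworkers", "[Employee]"),
    FieldDef("name", "String"),
]
for f in fields:
    print(f.name, pagination_style(f))
```

Heuristics like these can misfire in both directions, for example on a type that happens to be named Connection for unrelated reasons, so the results are best read as estimates.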

The next table shows what we found:

No corpus consistently uses pagination patterns, raising the specter of worst-case response sizes.
When pagination patterns are used, commercial and GitHub-large schemas tend to use the connections pattern, while slicing is not used consistently.

Use of pagination, through slicing or the connections pattern.

We urge practitioners to adopt pagination more widely.

Conclusions

GraphQL is an increasingly important technology. We provided an empirical assessment of the current state of GraphQL through our rich corpuses, novel schema reconstruction methodology, and novel analyses. Our key contributions are:

RQ1: Our characterization of naming conventions can help developers adopt community standards to improve API usability.
RQ2: We have confirmed the fears of practitioners and warnings of researchers about the risk of denial of service against GraphQL APIs. Most commercial and large open-source GraphQL APIs may be susceptible to queries with exponential-sized responses. We report that many schemas do not follow best practices and thus incompletely defend against such queries.

Our work motivates many avenues for future research, such as: refactoring tools to support naming conventions, coupled schema-query analyses to estimate response sizes in middleware (e.g. rate limiting), and data-driven backend design.

More information

The full paper is available here.
The slides are available here.
On Zenodo, you can find two artifacts associated with this project. We have packaged up a corpus of the schemas with permissive licenses as well as the tools we used to build our corpus.
We published a follow-up paper discussing how to estimate the cost of a GraphQL query. You can read about our approach and findings in this post.