Using Github’s GraphQL to retrieve a list of repositories, their commits and some other stuff — part 1

Github + GraphQL

I have recently started working on collecting data from Github as part of some tasks related to a personal project. During the process I learned a lot of tricks and decided it would be nice to share them with you all. If you have been struggling with Github’s GraphQL API, this article could be helpful for you.

So, let’s get started. My main objective was to retrieve a list of all public repositories from Github and their respective latest commits. My first step was to go to Github’s API documentation and start from there. The first thing I faced was a decision: use the well-known REST API, or something I had never heard of called GraphQL.

You see, from the start I was planning to do dozens of sequential REST calls in order to get the data I wanted for each repository. One to discover the repository id, another to get the commits list URL, next a call to that endpoint, and so on and so forth. As we all know, REST calls can turn into a hell of HTTP requests. There is a nice explanation about this here.

So, after just reading a few paragraphs about GraphQL, I had no doubt I wanted to go with it. One HTTP call looked really attractive. But that also meant I had something new to learn.

I won’t go through the details of GraphQL here. Instead, I will just show the query I am using to retrieve a list of public repositories and the list of commits from each one’s default branch. That should work as a nice “hello world” project for those starting with Github’s API and GraphQL (like myself!).

Ok, let’s cut the crap and get to it.

WARNING: as I said, I am also a beginner with GraphQL. So, if you have recommendations or better approaches, I would gladly hear about them! If you need more info about Github’s API, look here; about GraphQL, look here.

The query looks like this:

query listRepos($queryString: String!){
  rateLimit{
    cost
    remaining
    resetAt
  }
  search(query:$queryString, type:REPOSITORY, first:20){
    repositoryCount
    pageInfo{
      endCursor
      startCursor
    }
    edges{
      node{
        ... on Repository{
          id
          name
          createdAt
          description
          isArchived
          isPrivate
          url
          owner{
            login
            id
            __typename
            url
          }
          assignableUsers{
            totalCount
          }
          licenseInfo{
            key
          }
          defaultBranchRef{
            target{
              ... on Commit{
                history(first:10){
                  totalCount
                  edges{
                    node{
                      ... on Commit{
                        committedDate
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now, let’s break it down piece by piece.

1. Query definition

query listRepos($queryString: String!){

In GraphQL, we have basically two main operations: query and mutation. Queries are used to read data (like most REST HTTP GET requests), whereas mutations are used to write data (like most REST HTTP POST requests). Since I want to read data, I am using a query. Then, listRepos is the name of my query and ($queryString: String!) represents a variable ($queryString) declaration with its type (String!, where the ! marks the value as required).

If you are not yet wondering where this variable comes from, you will be before hitting the send button. In GraphQL, variables are sent together with the query in the same request, as a separate JSON object.
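To make that concrete, here is a minimal Python sketch of how the query and its variables might travel in one request body. The endpoint URL and the bearer-token header follow Github's documentation as far as I understand it; the shortened query and the helper names build_payload and run_query are my own assumptions for illustration, not part of Github's API.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Shortened version of the query, just to keep the sketch small.
QUERY = """
query listRepos($queryString: String!) {
  search(query: $queryString, type: REPOSITORY, first: 20) {
    repositoryCount
  }
}
"""


def build_payload(query, variables):
    """Return the JSON body: the query text plus a separate 'variables' object."""
    return json.dumps({"query": query, "variables": variables}).encode("utf-8")


def run_query(token, query, variables):
    """POST the query to the API and return the decoded JSON response.

    `token` is assumed to be a personal access token.
    """
    request = urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=build_payload(query, variables),
        headers={
            "Authorization": "bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling run_query(my_token, QUERY, {"queryString": "is:public"}) would then send both pieces in a single HTTP call, where my_token is a placeholder for your own token.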

2. Rate limits

rateLimit{
  cost
  remaining
  resetAt
}

Each request you make to the API consumes part of a limited budget of resources. If you use it all up, you won’t be able to access the API for a certain time. rateLimit is a field which belongs to the query object itself. By requesting it, I receive in the response the cost of my request, how many “credits” I still have and when they will be renewed. Really nice information if you are aiming to automate your queries without getting your account banned for flooding the API endpoint.
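As a rough sketch of how an automated script could use these three values, the function below decides how long to pause before the next request. The function name seconds_until_reset is hypothetical, and I am assuming resetAt arrives as a UTC timestamp in the form "2018-01-10T12:00:00Z".

```python
from datetime import datetime, timezone


def seconds_until_reset(rate_limit, now=None):
    """Return how many seconds to wait before sending the next request.

    Returns 0.0 while the remaining credits still cover another request
    of the same cost; otherwise returns the time left until resetAt.
    """
    if rate_limit["remaining"] >= rate_limit["cost"]:
        return 0.0
    # Assumed format: ISO 8601 in UTC, e.g. "2018-01-10T12:00:00Z".
    reset_at = datetime.strptime(
        rate_limit["resetAt"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

A loop that runs many queries could call time.sleep() with this value between requests.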

3. The search field

A query object is made of many fields. One of them, rateLimit, was already mentioned. The next really interesting field is called search. This is a field you can use to query the API for a certain kind of resource. It is not always necessary, though! For example, if you wanted information about a repository whose name and owner you already know, you could query the API for it directly with the repository field instead of the search field:

repository(name:"hadoop", owner:"apache"){
  id
  name
  description
}

In my case I had to use the search field because I didn’t want to query a pre-defined repository; I wanted to list many of them.

search(query:$queryString, type:REPOSITORY, first:20){
  repositoryCount
  pageInfo{
    endCursor
    startCursor
  }
  edges{
    ...
  }
}

A search takes several arguments. You can find a list of all of them here. The ones I have used are query, type and first.

  • query: the search string to look for;
  • type: which kinds of items I want to search;
  • first: how many elements of the results I really want to retrieve;

The value of first is capped (as of today, at 100). That makes total sense. By querying the API I discovered Github has more than 37.6 million public repositories registered. Imagine the size of a response with all the data for that humongous number of repositories. So, yeah, there are limits.

Now, you see that the query argument of my search has a value of $queryString. In case you already forgot what I wrote above, this is the name of the variable I defined in the first line. The variable’s value is written in a separate JSON object, which I defined as below:

{
  "queryString": "is:public archived:false created:<2017-07-15 pushed:>2017-12-15",
  "refOrder": {
    "direction": "DESC",
    "field": "TAG_COMMIT_DATE"
  }
}

So, what I am really doing is querying the API for repositories using this string: “is:public archived:false created:<2017-07-15 pushed:>2017-12-15”. This, in turn, returns all public repositories which are not archived, were created before the 15th of July 2017 and have had a push after the 15th of December 2017. Note that the query shown above only declares $queryString, so the refOrder entry is not used by it.
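If you want to change the dates without editing the string by hand, the search string can be assembled from its qualifiers. This is only a small sketch; the helper name build_query_string is my own, and it covers just the four qualifiers used in this article.

```python
from datetime import date


def build_query_string(created_before, pushed_after):
    """Join the search qualifiers used in this article into one string."""
    qualifiers = [
        "is:public",
        "archived:false",
        "created:<" + created_before.isoformat(),
        "pushed:>" + pushed_after.isoformat(),
    ]
    return " ".join(qualifiers)


# The variables object that goes along with the query.
variables = {
    "queryString": build_query_string(date(2017, 7, 15), date(2017, 12, 15))
}
```

With these two dates, variables["queryString"] holds the same string as in the JSON object above.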

The search object has a few interesting fields:

  • repositoryCount: the number of repositories your search matched;
  • pageInfo: an object with data that helps you paginate the results;
  • edges: these are objects which act as bridges between different objects; in this case, edges are bridging the search object with the repository objects.
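To give an idea of how pageInfo helps with pagination, here is a hedged sketch of a paging loop. It relies on two things that are not in my query above: an after argument on search and a hasNextPage field inside pageInfo. As far as I know both exist in Github's API, but treat them as assumptions here. fetch_page is a hypothetical callable that runs the search starting after a given cursor and returns the search object of the response.

```python
def iter_repositories(fetch_page, max_pages=5):
    """Yield repository nodes page by page.

    fetch_page(cursor) is a hypothetical callable: it should run the search
    with `after: cursor` (None for the first page) and return the decoded
    `search` object. max_pages is a safety cap on the number of requests.
    """
    cursor = None
    for _ in range(max_pages):
        search = fetch_page(cursor)
        for edge in search["edges"]:
            yield edge["node"]
        page_info = search["pageInfo"]
        # Stop when the API reports that no further page exists.
        if not page_info.get("hasNextPage"):
            break
        cursor = page_info["endCursor"]
```

The cap on pages is there so that a mistake in the loop does not burn through the rate limit.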

If the concept of edges is not clear, don’t worry, that was a hard one for me as well. Some objects in Github’s GraphQL API have so-called connections. Think of these as lists of other objects related to the object in question. As Github’s documentation says, “It’s helpful to picture a graph: dots connected by lines. The dots are nodes, the lines are edges. A connection defines a relationship between nodes”.
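Seeing the shape of a response may help here. The sketch below walks from the search object through each edge to its node. The sample_response is made up for illustration (the repository and owner names are invented), but it follows the edges and node nesting of the query.

```python
# Invented sample data shaped like a response to the search query.
sample_response = {
    "data": {
        "search": {
            "repositoryCount": 2,
            "edges": [
                {"node": {"name": "repo-one", "owner": {"login": "alice"}}},
                {"node": {"name": "repo-two", "owner": {"login": "bob"}}},
            ],
        }
    }
}


def repository_names(response):
    """Follow search -> edges -> node and return 'owner/name' strings."""
    edges = response["data"]["search"]["edges"]
    return [
        edge["node"]["owner"]["login"] + "/" + edge["node"]["name"]
        for edge in edges
    ]
```

Each edge is just a wrapper, and the repository data itself sits one level deeper, in node.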

So, when I started writing this story, I wanted to do it in just one piece. But it seems I like to write, and this one has already gotten quite big. So, the inner parts of the edge object and the rest of the query will be left for a second story! Don’t worry, I will start writing it now and it should be ready in a few minutes.

See ya!
