Leveraging OpenAPI (Part 1) — Creating and maintaining your API documentation

Where we come from

First, a bit of context. The OpenClassrooms platform is built on a React stack on the frontend and a Symfony PHP stack on the backend. The frontend consumes the API that the backend exposes. The backend project also performs other tasks, but its main purpose is to expose the API.

We use the API internally, but we also have business customers who use it to integrate our catalog into their platforms, so clean documentation is vital.

We are also proponents of the “code documents itself” approach: we enforce clean coding practices and place a lot of importance on the readability and reusability of our code.

But at some point this project, started in 2012, needed some sort of API documentation. So Swagger documentation was created. It was manually maintained and regularly out of sync; some developers didn’t even know it existed, and we had no real process around it.

Back in 2020, we decided to go for the OpenAPI specification instead and planned to build a schema for our API, which was already large, with over a hundred endpoints. Manually creating the schema was out of the question. One of our senior engineers took on the task of coding a utility command (in PHP) to extract an OpenAPI schema from the code base. This was no mean feat, but once it was done, it unlocked many new possibilities.

Building a monolith

Photo by Karthik Sreenivas on Unsplash

So, there we were with a 7k-line OpenAPI file, mostly auto-generated. It was not perfect: auto-generation meant some fine-tuning was required to fix readability issues, but it was solid enough.

We now had documentation that our frontend engineers could refer to when we released new API endpoints, and that could easily be shared with quality engineers, managers, stakeholders, and our business clients.

A couple of years passed and the company scaled up: 40+ engineers, multiple squads and teams, and new features arriving on the platform at a high pace. By the end of 2022 we had about 240 endpoint definitions in our OpenAPI schema (and each endpoint often had several operations defined, as most CRUD endpoints do). It also contained 410 schemas, and in the end it materialized as a single file of 32k lines, weighing 928 KB: a YAML monstrosity. Once again, it became a pain to maintain, a pain to add endpoints to, a pain to do code reviews on, a pain to refactor, and even a pain to open in some editors, which struggled to parse it. But we kept going.

Overview of our initial OpenAPI definition with over 32k lines
Our initial OpenAPI definition, with over 32,000 lines of YAML and mostly inline schemas

Splitting the monolith

Photo by Nathalia Segato on Unsplash

So, our OpenAPI schema became the brain of our API (more on that in the next part of this series). But it had become unmaintainable. Errors and inconsistencies started appearing in the documentation, engineers were copy-pasting components, and things got messy.

We actually had only a very basic knowledge of the OpenAPI specification at first. Once a core team was dedicated to OpenAPI matters and started digging deeper into the specs, we could finally see everything that was wrong with our initial schema. We barely leveraged reusable models; we didn’t even use the requestBodies or responses components. Everything was schemas and paths, and the external reference mechanism was barely used.
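To make the difference concrete, here is a minimal sketch (the endpoint and field names are purely illustrative, not taken from our actual API) of the same operation written the old way, with everything inline, and then rewritten to leverage the reusable responses and schemas components:

```yaml
# Before: everything declared inline under the path
paths:
  /me:
    get:
      responses:
        '200':
          description: The authenticated user
          content:
            application/json:
              schema:
                type: object
                properties:
                  id: { type: integer }
                  displayName: { type: string }
---
# After: the operation only references reusable components
paths:
  /me:
    get:
      responses:
        '200':
          $ref: '#/components/responses/UserResponse'
components:
  responses:
    UserResponse:
      description: The authenticated user
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/User'
  schemas:
    User:
      type: object
      properties:
        id: { type: integer }
        displayName: { type: string }
```

The second form is longer for a single operation, but as soon as several endpoints return the same resource, the reusable components pay for themselves.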

The reference system in OpenAPI
Using references to files in the root OpenAPI definition

We had seen some nifty-looking OpenAPI schemas, and we wanted to do something similar: have a main root file and leverage the reference-to-file mechanism to split our massive file into hundreds of smaller, manageable definitions. This would open the path to better reusability and make code reviews much easier. Tools were available to reassemble it into a monolith if needed. And we were definitely not alone with an API that large; APIs like those of Digital Ocean or Box.com were a good source of inspiration.
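As a rough sketch of what we were aiming for (the file names below are illustrative, not our actual layout), the root file keeps only the high-level structure and delegates everything else to external files via $ref:

```yaml
# openapi.yaml - the root definition, delegating to external files
openapi: 3.0.3
info:
  title: OpenClassrooms API
  version: 1.0.0
paths:
  /courses:
    $ref: 'paths/courses/courses.yaml'
  /courses/{courseId}:
    $ref: 'paths/courses/courseId.yaml'
components:
  parameters:
    CourseId:
      $ref: 'components/parameters/CourseId.yaml'
  schemas:
    Course:
      $ref: 'components/schemas/Course.yaml'
```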

Now, how could we split the monolith? Refactoring it manually was out of the question. So, as we had done before, we coded a tool to automate it!

Trials and tribulations

Photo by Nik Shuliahin 💛💙 on Unsplash

The initial proof of concept materialized quickly. Splitting into components based on the existing file was relatively easy. But our goal was not only to split, but to leverage the OpenAPI specification as best we could. This meant converting all the schemas to requests and responses based on heuristics, processing paths to extract URL, header, and query parameters… After a few weeks we were getting close, but our splitting command had outgrown its initial scope and become much more complex, so we had to refactor it. Cleaning up the project allowed us to add new features to our OpenAPI splitter much faster, including detection of duplicate components, so that we could merge them into a single one, and of unused models, so that we could remove them.
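As an example of the kind of transformation the splitter performs (again with simplified, hypothetical names rather than our real schema), an inline query parameter is promoted to a named, reusable component, which then lives in its own file, and operations simply reference it:

```yaml
# Before splitting: the parameter is repeated inline in every paginated operation
paths:
  /courses:
    get:
      parameters:
        - name: page
          in: query
          required: false
          schema:
            type: integer
            minimum: 1
---
# After splitting: one reusable component, referenced wherever it is needed
paths:
  /courses:
    get:
      parameters:
        - $ref: '#/components/parameters/Page'
components:
  parameters:
    Page:
      name: page
      in: query
      required: false
      schema:
        type: integer
        minimum: 1
```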

An overview of the new structure
Our new OpenAPI definition and its folder structure

Early on, we wanted to be sure that the documentation we generated was valid OpenAPI 3.0, so we went searching for a linter. The PHP library we use (cebe/php-openapi) includes a CLI validation command, so we tried that first. It worked to some extent, but it was very limited: we soon found out that it missed some errors and also flagged declarations that were valid OpenAPI 3.0! We then tested probably every OpenAPI linter out there… and had an issue with every single one of them. They would not even report the same errors on the same schema.

As we already used Stoplight, we settled on Spectral as our linter. It was not perfect either, but we could start defining our own custom ruleset and hope that the remaining issues would be fixed in future releases. And then… we had a bad surprise. Some valid OpenAPI definitions again wouldn’t work with Spectral, which was probably related to our heavy use of references and definitions spread across multiple files. Browsing dozens of GitHub threads, we found that others had encountered basically the same issues and had found ways around the limitations, which forced us to add extra steps to our build and linting processes. In our case the workaround was simple: we bundle the files back into a monolith before passing it to third-party tools, and it works better.
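For reference, a Spectral ruleset is itself a small YAML file. The sketch below shows the general shape of such a ruleset; the custom rule is purely illustrative and not one of our actual rules:

```yaml
# .spectral.yaml - a minimal custom ruleset extending Spectral's built-in OpenAPI rules
extends: ["spectral:oas"]
rules:
  # Raise a built-in rule to error severity
  operation-description: error
  # Illustrative custom rule: path segments should be kebab-case or a {parameter}
  paths-kebab-case:
    description: Path segments must be kebab-case or a path parameter.
    given: "$.paths[*]~"
    severity: warn
    then:
      function: pattern
      functionOptions:
        match: "^(\\/[a-z0-9-]+|\\/\\{[a-zA-Z0-9]+\\})+$"
```

The bundling workaround itself can be handled by any OpenAPI bundler that resolves external references into a single document; which tool you pick matters less than making sure third-party tools only ever see the bundled output.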

As things stabilized, we had to tackle another big task that we never thought would take so much time: naming conventions (hey, Phil Karlton was right!):

There are only two hard things in Computer Science: cache invalidation and naming things.

— Phil Karlton

Since we were generating a completely new directory structure, with new files for components that did not exist before, how should we name everything so that the result would be clean and meaningful to our engineers?

Settling down

Photo by Katerina May on Unsplash

Well, it took us quite a while, but we ended up writing lengthy guidelines describing our naming conventions; it basically ended up being documentation on… how to write the API documentation. This file, like all our ADRs, was committed to the project’s main repository.
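To give an idea of what such guidelines cover (the names below are illustrative, not a verbatim excerpt from our conventions), they answer questions like how the schema, request body, and responses for the same resource should be named and where their files live:

```yaml
# Illustrative naming scheme for the components around a single "course" resource
components:
  schemas:
    Course:                  # bare resource name for the model itself
      $ref: 'components/schemas/Course.yaml'
  requestBodies:
    UpdateCourseRequest:     # <Verb><Resource>Request for request bodies
      $ref: 'components/requestBodies/UpdateCourseRequest.yaml'
  responses:
    CourseResponse:          # <Resource>Response for a single item
      $ref: 'components/responses/CourseResponse.yaml'
    CourseListResponse:      # <Resource>ListResponse for collections
      $ref: 'components/responses/CourseListResponse.yaml'
```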

Here we were, with multi-file OpenAPI documentation. It was clean, and not much extra work was needed to make it play well with other tools.

An overview of our new OpenAPI definition’s structure, only a couple of thousand lines long
Our new root OpenAPI definition structure: only a couple of thousand lines, with massive use of references to external files

We had a branch with our new documentation; now it was time to merge it. The biggest challenges here were communication and planning. It took some preparation not to cause too many conflicts in our colleagues’ branches. We had to coach them on how to use the new documentation format and make sure our guidelines were known and adopted. We also ended up writing a small generator CLI command for engineers to quickly create all the template files needed to define a new endpoint.
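To give an idea of what such a generator produces (this is a sketch; the actual command and its templates are internal), creating a new endpoint stubs out a path item file that already points at placeholder parameter and response files:

```yaml
# paths/courses/courseId.yaml - hypothetical skeleton produced by the generator
get:
  tags:
    - Course
  summary: TODO describe the operation
  operationId: getCourse
  parameters:
    - $ref: '../../components/parameters/CourseId.yaml'
  responses:
    '200':
      $ref: '../../components/responses/CourseResponse.yaml'
    '404':
      $ref: '../../components/responses/NotFoundResponse.yaml'
```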

The rollout went well. Of course, we had to support the engineers for a few sprints through pairing sessions, tutorial videos, and documentation, but overall it went smoothly, with no performance impact.

We now happily work with an OpenAPI Schema which defines, at the time of writing:

  • 324 paths
  • 199 reusable schemas
  • 173 request definitions
  • 419 response definitions
  • 212 parameters
  • 105 tags

and it’s all built from 1338 files in a clear directory hierarchy. Hurray! 👍

In the next part of this series about OpenAPI, we will talk about how we leveraged our OpenAPI schema to test our API and generate code dynamically.

TL;DR

Our OpenAPI story began with the need for API documentation, which we first addressed with manually maintained Swagger documentation. In 2020, we moved to the OpenAPI specification with the help of a custom utility command that generated the schema from our PHP code base. Over time, however, our initial 7k-line OpenAPI file became unmaintainable, growing to 32k lines and 240 endpoint definitions, which led us to split the monolith by leveraging the OpenAPI specification’s reference mechanism. This drastically improved reusability and eased code reviews. Although we faced many challenges along the way, including settling on naming conventions, patchy support from the ecosystem, and tools lagging behind the latest official specs, we have found the end result invaluable for API-first design and validation. We definitely recommend splitting your OpenAPI schema into multiple files before the number of endpoints becomes too large.
