Why we rewrote Pydio in Golang

Pydio Team has recently announced Pydio Cells, enterprise-ready filesharing software written in #Golang.

This is a major milestone in a long story that started ten years ago with AjaXplorer, an open source software I developed for easily sharing music files with my reggae bandmates, at a time when cloud storage solutions like Dropbox did not exist. Since then renamed Pydio and turned into a full-fledged sync-and-share platform, the application stuck to its initial technical stack: LAMP, as in Linux/Apache/MySQL/PHP. This choice led us to many creative tricks to make the most out of PHP, but at one point the developer team felt we were definitely hitting the limits.

This article will dive into the motivations for rewriting entirely in Go, introduce the new architecture and the choices behind it, and recount how that transition went (a.k.a. lessons learnt). This is the first of a series where we will expose the new architecture in more detail.

AjaXplorer / Pydio versions timeline

PHP: a love/hate relationship

Don’t get me wrong: this article is not about PHP-bashing. PHP was and still is a great language, but like any language, it’s a perfect fit for certain tasks and less so for others. The PHP community has been extremely active over the years, the latest version (7) brought big performance gains in execution time, great frameworks like Symfony help with code maintainability and productivity, and PHP can definitely be the language of choice for developing a website or an evolving backend for a rich web application.

Simple to learn, weakly typed, with a “save-and-refresh” scripting approach, PHP is perfect for easily on-boarding developers, designers and open source contributors on a common repository. But at some point, managing files for sync-and-share is no longer just about “save-and-refresh” coding.

Pydio is deployed on-premises for fleets of thousands of users, each of them accessing their files (which are getting bigger every day) from many devices (web, desktop, mobile), and these users expect a Dropbox-like level of usability and features: ubiquitous access to files, easy sharing with internal or external users, real-time notifications, metadata extraction, etc.
On the administrator side, it’s all about keeping control of the company data, complying with internal security policies, and making sure that data and access to it are always consistent. Not to mention #gdpr and #cloudact…

Implementing these expectations in PHP over the years left us with a kind of dependency-crippled software:

  • The initial stack is still LAMP, but if you want it full-fledged, you soon have to install many additional tools: specific PHP extensions, CRON or equivalent as a scheduler, command-line PHP for launching background tasks, NSQ for messaging, Redis for efficient caching, etc.
  • Scaling out the PHP application would basically mean replicating the whole platform on many nodes, configuring load balancing and a dedicated backend for sharing sessions. Workable, but not really optimized. It again requires a bunch of external tools for monitoring.
  • Not to mention the potential vulnerabilities that are so easy to introduce in the code…

So naturally came the time when we were just… fed up with PHP.

Looking at emerging technologies, we started small: in 2016, we introduced Pydio Booster, a dedicated Golang “companion” for the main platform, in charge of alleviating the burden on PHP’s shoulders. It handled downloads and uploads in a separate process, and provided an embedded messaging protocol and a WebSocket server.

But let’s face it, once you try Go, you can hardly look back, and we soon got excited about porting the whole codebase to it.

Rewrite goals

So we took a step back: if we were to abandon the tons of LoC written in PHP, what exactly were our objectives?

  • Reproduce the existing features of Pydio
    Our experience with filesharing gave us one important asset: we knew the specifications in terms of end-user features almost by heart. And, incidentally, we knew what we wanted to improve.
  • Make it easier to install
    As described above, most of our forum questions and support tickets concerned third-party dependencies, and rarely the code of Pydio itself. Even the expanding use of VMs and containers did not solve that.
  • Make it natively performant, and natively scalable
    For performance issues, our answer could not just be: throw more RAM and CPU at your VM! And however optimized, a new platform would at some point only be able to support a given load (# of users, # of files) and would have to be scaled out. This should be made much easier.
  • Make it interoperable with third-party software
    Our experience with Pydio showed that it was very rarely deployed alone in a vanilla environment: enterprise IT comes with many existing systems, from user directories to emailing platforms to even enterprise social networks. The new product should easily speak with the outside world.
  • Decouple actual storage from Pydio workspaces
    This one, more technical, was a limitation of Pydio’s internal design that could hardly be tackled in the existing code without major changes. Pydio initially focused on exposing your file system in real time (without indexation). But adding more and more layers of features showed the need for data virtualisation (to decouple data indexes from their actual storage location).

Golang Pros / Cons

With that spec in hand, we looked at Go with a fresh eye! The following aspects of the language were the key factors in our choice:

  • A Go program compiles to a dependency-free binary, ready to run on any platform (we compile for each platform). 
    This is a HUGE win for us, as software vendors. Remember that Pydio is not a SaaS-based solution; it’s open source software that people download and run on their own server. Which means a massively fragmented ensemble of users: different hardware (amd64, arm…), different OS’s (from all Linux flavors to Windows), different habits (“I like Nginx more than Apache...”). And this part constituted 80% of our support tickets. Now imagine a compiled binary that you download and start on your machine, and, well, that’s it? Basically, Go fulfills the old promise of Java, without the JVM.
  • Go language is strongly typed and strongly opinionated.
    This can be a constraint at the start, and even a barrier to entry for PHP developers used to loosely assigning their variables and applying their own preferred formatting to their code. But in the end, it’s a huge step for raising the quality standards of the code. If it does not compile, it will not run. Running Golang’s core tools for formatting code, organizing imports, etc. makes the code easily readable and auditable by any gopher on the planet. The integrated tools make unit testing a breeze, and we embraced automated testing as part of our development process from day one.
  • Go is a modern language and is very good at networking.
    Start a web server in 3 lines of code. Manage TLS certificates with the standard library. Serialize/deserialize data to and from strongly typed structures. Go’s concurrency model is just wonderful and is perfect for writing massively concurrent programs that can then be split up across CPU cores or over the network. And the list is far from exhaustive.

Of course, compared with other languages, Go still has some flaws:

  • It is still young; how will it be supported in the future?
  • Error handling is very verbose and has to be improved.
  • Third-party library imports are messy (basically pointing to git HEAD), and this requires a painful “vendoring” approach for these libs.
  • Switching from PHP: handling pointers is not that easy for developers used to scripting languages...

The Go community is very active, and in just the last couple of months it has introduced new versions or new specs for the next version to fix exactly the points listed above (Go modules for dependencies, error handling to be reworked in Go 2). So we guess the bet was a winning one!

Again, each language has its perfect usage, and I would currently not advise a web agency to fully switch to Go. But for writing a super-performant backend for a REST API, or an application to manage files over a network, Go does the job perfectly.

So after looking at other options as well (Rust — too low-level, NodeJS — too weakly typed unless you use a “transpiled” layer, Java — no, just kidding, …), Go was definitely our choice of heart.

Breaking the monolith

Once we settled on the language choice, we could go to the next step: designing an application that would meet our requirements!

While our PHP codebase was very decently organized, plugin-oriented, and had been refactored many times over the years, it was still monolithic: running any sub-feature of the application would require running the code as a whole. When such an application grows in features, its complexity will inevitably grow along, and after some time this can lead to two major issues:

  • Inter-dependency between components makes the global model harder and harder to apprehend, and implementing innovative features can prove very complex. Even with a very strict quality policy, side-effects will arise.
  • Horizontal scaling of the application is inefficient, as the whole code has to be replicated everywhere.

In recent years, multiple long-term trends in software engineering (SOA, Agile development, DevOps…) led to the concept of micro-services: instead of managing one huge project, all aspects of an application are split into many much smaller projects. Each brick is in charge of a very specific feature and is designed to run as an independent application: it implements its own persistence layer, its own API for communicating with the outside world, its own way of loading configurations, etc. Services can even be written in different languages, as long as the API contract is honored.

Communication via API’s strongly decouples the services definition from their actual implementation. Technical debt is under control, and code can easily evolve. By monitoring load on each service, bottlenecks are easily detected and horizontal scalability is performed on-demand.

Heavily promoted since 2015, the micro-services architecture is covered by plenty of articles out there. Amongst others, see the patterns bible Microservices.io. It is worth noting that the Microsoft Azure documentation provides very instructive articles about micro-services and cloud-oriented patterns. Finally, while working on the new architecture, we also decided to stick to the 12-Factor application methodology.

Pydio Cells architecture overview

Behold! The schema below shows how Pydio Cells is designed (click to make it bigger).

Cells General Overview

Although our final binary contains all the micro-services, each one can be run as an independent process (on its own server, VM or container). They communicate with each other through various channels: GRPC (a performant RPC protocol using Protobuf serialization and running on HTTP/2) for synchronous or streaming requests, an event bus for PUB/SUB messaging, and standard HTTP REST APIs. Starting from the top, we can distinguish 4 categories:

  • Gateway services are just proxies dispatching incoming queries to underlying services, depending on their nature. The highest-level proxy exposes all APIs on a unique HTTP endpoint.
  • Low-level services are ultra-specialized, in charge of basic CRUD operations for one specific object (e.g. “user”, “workspace”, “acl”, “metadata”, etc.). They have their own persistence layer and are accessed using GRPC. They are never accessed directly from outside, but by…
  • REST services: implementing more business logic than GRPC services, they are queried via REST APIs. This is called the Web Gateway pattern in a micro-service environment.
  • Generic GRPC services provide configuration management, log aggregation, etc. to all other services.

So we could now bring an answer to each of our requirements:

  • Make it easier to install
    After rewriting every existing layer in Go, we distribute a pre-compiled binary containing all features. Pydio Cells embeds an integrated web server (bye Apache!), all the micro-services (bye PHP!), a scheduler (bye CRON, command-line tricks, etc.!), a WebSocket server (bye Pydio Booster!), and much more. The only remaining dependency is an SQL database.
  • Make it natively performant, and natively scalable
    Each micro-service is fine-tuned for its own task, and GRPC communication between services is extremely fast. An automatic discovery mechanism makes it transparent to services whether they run on the same machine or not: scalability can now be done at a per-service level and is almost infinite.
  • Make it interoperable with third-party software
    Internal APIs between services are specified using Protobuf. But we also tried never to reinvent the wheel for APIs communicating with the outside world. We chose proven standards for implementations: OpenID Connect for authentication/authorization, the Amazon S3 protocol for data transfer, REST APIs described in OpenAPI (ex-Swagger), a format for easily generating SDKs in any language, etc.
  • Decouple actual storage from Pydio workspaces
    Each concrete file storage location is continuously indexed by a dedicated sync service; each location has its own index, and these indexes are stored in a DB using a super-performant SQL encoding (nested sets). Named “datasources”, these indexes are dynamically aggregated into a unique tree service that serves as a reference for the rest of the services.
  • Reproduce the existing features of Pydio
    We started rewriting each aspect of Pydio as micro-services. We ensured the features’ “iso-perimeter” would be respected by integrating these new services step by step directly inside the existing PHP application (using PHP SDK clients querying the new REST APIs). We finally got to the point where the web interface would speak directly to the micro-services, and removed PHP entirely.
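To make the nested-set encoding concrete, here is a toy in-memory sketch. The struct, field names and sample tree are invented for illustration (not Cells’ actual schema); the point is that subtree lookups need no recursion, which in the real index is a single SQL query:

```go
package main

import "fmt"

// In a nested-set encoding, each node of the tree stores a
// (Left, Right) interval, and a node's descendants are exactly the
// rows whose interval lies strictly inside its own.
type Node struct {
	Name        string
	Left, Right int
}

// Descendants is the in-memory equivalent of the one-shot SQL query
//   SELECT * FROM idx WHERE lft > ? AND rgt < ?;
// no tree traversal is needed, which is what makes the encoding fast.
func Descendants(index []Node, parent Node) []Node {
	var out []Node
	for _, n := range index {
		if n.Left > parent.Left && n.Right < parent.Right {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// A small tree: "/" contains "docs" (holding "report.pdf") and "images".
	index := []Node{
		{"/", 1, 8},
		{"docs", 2, 5},
		{"report.pdf", 3, 4},
		{"images", 6, 7},
	}
	for _, n := range Descendants(index, index[1]) {
		fmt.Println(n.Name) // prints "report.pdf"
	}
}
```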

Although this can look frightening at first sight, in the end all services can be started on one machine with one simple command line:

$ ./cells start
2018-10-14T13:40:14.966+0200 INFO nats started
2018-10-14T13:40:14.975+0200 INFO pydio.grpc.log started
2018-10-14T13:40:14.982+0200 INFO pydio.grpc.data.objects started
2018-10-14T13:40:14.993+0200 INFO pydio.grpc.user-key started
2018-10-14T13:40:14.996+0200 INFO pydio.grpc.policy started
2018-10-14T13:40:14.997+0200 INFO pydio.grpc.acl started
2018-10-14T13:40:14.999+0200 INFO pydio.grpc.config started
2018-10-14T13:40:15.047+0200 INFO pydio.grpc.meta started
2018-10-14T13:40:15.062+0200 INFO pydio.grpc.user-meta started
2018-10-14T13:40:15.924+0200 INFO pydio.grpc.update started

Assuming you installed it on https://cells.yourcompany.com, opening this URL in your browser gives you a working instance.

Cells login screen

Mission accomplished!

Lessons learnt

Of course, this transition was a long journey; we made beginner’s mistakes and fixed them along the way, but now the whole team is really proud of this new product. Along with continuous integration and test automation, we are pretty confident about the quality of the delivered code. Here are some lessons we learnt from this incredible adventure:

  • The functionality drives the architecture. Try hard to put yourself in your users’ shoes so you can define precisely what you want your product to be doing. This will pinpoint your current strength, what you’re aiming at, and what you need to do to achieve just that. In our case, our strength was having the I/O operations directly available to the users; what we were aiming at in addition was a virtualisation of the files hierarchy; and to get all that, we needed to switch to Golang and implement a micro-service architecture.
  • Micro-services architecture brings its own complexity. A standard Cells installation is already running ~60 services by default. Sending a request to any REST API will probably end up dispatching sub-requests all over the place to gather information from various services. This can become tricky to debug and to profile, and requires a very serious and dedicated tooling (tracing in particular).
  • It’s hard to find the right balance between agility and testing. QA is crucial for software development, and Go provides us with great integrated tools for unit testing. While developing a small dedicated lib would require a ~100% code coverage, such a platform can never be fully covered by unit tests, as interactions can prove complex between services. We tried to find this right balance between unit tests for low-level CRUD services, and integration tests for testing the platform as a whole.
  • Turning PHP developers into Gophers is not that hard. The concurrency model (goroutines/channels) is not obvious at first, but the whole team finally made the switch in a decent time, and is now playing with pointers without any difficulty. Although we are committed to maintaining the last PHP version of Pydio (Pydio 8) for security fixes, developing in Go is so pleasant that none of us could go back to PHP. Ever.
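For readers coming from a scripting language, here is the kind of goroutine/channel pattern that feels alien at first and then becomes second nature. The task (squaring integers) is a toy stand-in for real work such as hashing or thumbnailing files; the function name and worker count are invented for the example:

```go
package main

import (
	"fmt"
	"sync"
)

// SquareAll fans work out to three goroutines over channels:
// a feeder pushes jobs, workers compute, and a collector drains results.
func SquareAll(nums []int) []int {
	jobs := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup

	// Three workers read jobs until the jobs channel is closed.
	for w := 0; w < 3; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for n := range jobs {
				results <- n * n
			}
		}()
	}
	// Feed the jobs, then close the channel to release the workers.
	go func() {
		for _, n := range nums {
			jobs <- n
		}
		close(jobs)
	}()
	// Close results once every worker is done, ending the range below.
	go func() {
		wg.Wait()
		close(results)
	}()

	var out []int
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	fmt.Println(SquareAll([]int{1, 2, 3, 4})) // 1 4 9 16, in arbitrary order
}
```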

To be continued

In the next articles, I will try to go deeper into the architecture and show how we carefully designed each concern of Pydio Cells. If you are interested in reading the code and perhaps contributing, you’re welcome! It all starts on GitHub (https://github.com/pydio/cells) as well as in our developer’s doc (https://pydio.com/en/docs/developer-guide).

Thanks for reading!