Come and Gone: one year operating Big Blue Button at campus scale

Jörg Domaschka
omi-uulm
Mar 31, 2021

On 1 April 2021, Ulm University will turn off its 30-node BigBlueButton cluster and switch video conferencing for teaching and meetings to Zoom. Almost one year ago to the day, the pandemic and the resulting political decisions forced the country into a lockdown and gave lecturers and IT operations the task of running the semester entirely online. BigBlueButton was the software of choice at our university back then, and getting it up and running before the start of the semester was a tight race.

Introduction

An entirely virtual semester basically requires three types of software services: (i) an e-learning platform to organize teaching (for lecturers) and to serve as a single entry point for students; (ii) a platform to record, process, and ship teaching material produced asynchronously, typically screencasts of lectures; (iii) a video conferencing platform for synchronous teaching, be it for Q&A sessions, supervising Bachelor's, Master's, and PhD students, or running workshops. Needless to say, all three types of services need to scale to 10,000 students.

The approach followed at Ulm University was special in the sense that the CIO predicted that both the network operators and any cloud services available for video conferencing would be overloaded. Hence, the decision was taken to host all necessary software in-house, and several measures were taken to ensure the necessary network and compute capacities were available:

  • In-house network bandwidths in the data centre were increased by a factor of four leading to 40Gbps and 100Gbps server uplinks.
  • At BelWü, the federal state’s ISP for universities, which runs a 2×100 Gbit fiber backbone network across Baden-Württemberg, capacities for peering and transit to commercial ISPs were increased tremendously.
  • For each of the above-mentioned software services, task forces were established with members from the university’s Communication and Information Centre (KIZ) as well as dedicated experts from research institutes and further associated departments.

Being a member of the video conferencing task force rather quickly put me in the position of being responsible for preparing, scaling, and operating BigBlueButton as a “side task” to my regular position as a researcher; probably I got the job because I am good at fiddling with things I don’t have a clue about, and because in our research we claim to know how to run things. From OMI, I was supported by Simon Volpert and Georg Eisenhart. BBB support came from Steffen Moser from SAPS. First- and second-level support was taken on by KIZ.

Big Blue Button

BigBlueButton (BBB) is a client-server video conferencing software released as open source under the GNU Lesser General Public License (LGPL). It has been used in production at Ulm University since 2012 at the School of Advanced Professional Studies (SAPS). There, it is part of the students’ “virtual cloud-based desk” and provides a synchronous communication channel between lecturers and students.

BigBlueButton Architecture

Being a client-server application, the BBB software consists of an HTML5 client (available since version 2.2.0, released in November 2019) and a multi-component server side. A unique feature on the client side is that BBB does not require the installation of additional tools, but can be used natively from almost any recent browser.

The set of server-side components consists of custom components developed by the BBB team as well as existing open source tools. The most important of these are FreeSWITCH, used for audio mixing and streaming (most CPU-demanding), and the Kurento WebRTC server, used for video streaming and screen sharing (most network-demanding). Both have soft real-time requirements, and failing to meet them results in QoS degradation.

Scaling BBB

BBB’s distributed architecture in principle allows distributing components across multiple hosts. This, however, is not documented and is explicitly discouraged. Instead, the BBB developers and community recommend using the Ubuntu-based installation script.

The script depends on Ubuntu 16.04 and installs all required Ubuntu packages for all needed components, be they BBB-specific or external such as Node.js, Redis, MongoDB, FreeSWITCH, Kurento, … The dependencies between all of them are not documented, which makes moving to a different distribution difficult; even using a different Ubuntu version is considered non-trivial.
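For illustration, a single-node installation with that script boils down to a one-liner along the following lines. This is only a sketch: the exact flags and version label differ between script revisions, and the hostname and e-mail address are placeholders.

```bash
# Illustrative single-server install via the upstream script (not our cluster set-up).
# Flags and the version label are examples only; check the script's documentation.
wget -qO- https://ubuntu.bigbluebutton.org/bbb-install.sh | \
  bash -s -- -v <bbb-version> -s bbb.example.org -e admin@example.org
```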

Due to this single-server set-up, the capacity of the server running BBB limits the maximum size of meetings that can be hosted on that server as well as the number of parallel meetings. What is more, even with vertical scaling, the components used by BBB do not scale arbitrarily far.

Scaling horizontally is possible, as adding further servers increases the number of parallel meetings (though not the maximum size of a single meeting, which still has to fit on one server). Technically, the Scalelite load balancer provides the quickest way to set up a BBB cluster.
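For reference, BBB nodes are registered with Scalelite through its rake tasks, roughly as below. Host names, secrets, and the container name are placeholders, and the exact invocation depends on how Scalelite is deployed; consult the Scalelite documentation before copying this.

```bash
# Register a BBB node with Scalelite and enable it (illustrative values).
docker exec -i scalelite-api \
  bundle exec rake "servers:add[https://bbb-01.example.org/bigbluebutton/api,SECRET]"
docker exec -i scalelite-api bundle exec rake "servers:enable[SERVER_ID]"

# List all known servers together with their state and load.
docker exec -i scalelite-api bundle exec rake servers
```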

Release Management

BBB components (for the current stable version 2.2) are released exclusively as deb packages for Ubuntu 16.04. A BBB release then consists of a set of versioned packages guaranteed to be interoperable. There is very little freedom for sysadmins to change anything about that set-up.

While this is unpleasant but acceptable to some extent, the release management can only be described as chaotic. Particularly at the beginning of the pandemic, the number of releases exploded, leading to more than one release per week. These releases made no separation between (i) bug fixes and (ii) new features. This led to cases where a patch-level release fixed a critical bug but also introduced a new feature. More than once, that new feature contained a new security problem, requiring yet another release that again included new features …

Operations

From lockdown to the start of the semester, we had 12 days to provide a BBB cluster that would be used by an estimated 5,000 students in parallel, more or less throughout the day. The SAPS had a single-server BBB installation running, so we could re-use their configuration as a starting point. Yet they had no experience with running clusters.

Our initial estimate was that we would need roughly 20 servers (indeed, half a year later we had 30 of them). We also knew that we would need to change things frequently in order to try things out, and even to update software while in production (a terrible prospect for a scholar like me). The hardware for running the services was provided by bwCloud, a regional OpenStack-based cloud offering run collaboratively by several universities. Running BBB in this private cloud allowed us to support BBB’s real-time requirements using custom scheduling rules for the BBB VMs.

Operations Strategy

The overall goals of operations were the following:

  • Repeatably manage a large number of servers running in a cluster and providing a service that is in use for large portions of the day.
  • Not being experts in this software, we would have to change the set-up frequently: changes to the configuration, adding more servers, and introducing new software releases.
  • Again, not being experts in the software, we would make mistakes, and rolling back these mistakes had to be very easy; this is at odds with the release management offered by BBB.

With little time available, we decided to do what we know best. In this case, this meant adapting an operations strategy developed in the RECAP project and used to run our own infrastructure. This operations strategy is heavily based on immutability, as described in a previous blog entry. This, in turn, strongly depends on (reproducible) automation, containerization, and container orchestration.

Major steps for operating BBB following that strategy are:

  1. Containerize BBB, including the BBB binaries and the BBB configuration. There is no need to watch out for state in BBB, as there is none.
  2. Describe the BBB deployment for an orchestration tool
  3. Deploy virtual machines and orchestration tool
  4. Add BBB deployment to orchestration tool

Operations Tooling

We relied on tools that we had been using for some years (which is good), but that unfortunately had already run out of support or soon would (which is bad):

  • Docker containers as containerization technology;
  • Rancher 1.6 as container orchestrator (Kubernetes does not work well with the UDP-based WebRTC traffic demanded by BBB);
  • CoreOS’ Container Linux as the operating system for the virtual machines, configured via Ignition files; a Rancher agent runs on each of them, together forming the Rancher cluster;
  • Ansible for starting virtual machines and for gluing the different steps together.

This article will not go into details about the deployment and orchestration, but focuses exclusively on the repeatable and versioned configuration of BBB within its container.

BBB Containerization

Containerizing BBB was not straightforward, but also not too complicated. We share our early scripts and Dockerfile here, but did not find the time to update the pages on a regular basis.

The major problem with the BBB releases is that you cannot go back. Once BBB 2.2.21 is out, there is no way to install 2.2.20 on a server. Hence, our primary concern was to build a versioned container for every release that became available. This base container was kept stable and immutable all the time.
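In practice, this meant one tagged image per upstream release, built once and never modified afterwards. A minimal sketch of that workflow follows; the registry, image name, and build-arg are illustrative placeholders, not part of BBB itself.

```bash
# Build and freeze one image per BBB release (names and tags are illustrative).
BBB_VERSION=2.2.21
docker build \
  --build-arg BBB_VERSION="${BBB_VERSION}" \
  -t registry.example.org/bbb-base:"${BBB_VERSION}" \
  -f Dockerfile.bbb .
docker push registry.example.org/bbb-base:"${BBB_VERSION}"
# The tag is never reused or overwritten; rolling back means deploying an older tag.
```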

BBB Configuration

A further problem with BBB containers is their size of more than 2 GB, which makes working with them slow, in particular for configuration changes. Consequently, when building the BBB containers, we ensured that all configuration files end up in the upper layers of the image, so that the underlying layers remain cached whenever only the configuration is updated. In retrospect, it would have been better to use configuration side-car containers holding the configuration files specific to our deployment scenario; such a configuration container would consist exclusively of configuration files.
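The "configuration on top" idea can be sketched as a thin layer built on the frozen base image, so that a configuration change only rebuilds the last layer. Image names, tags, and paths below are assumptions for illustration, not our exact set-up.

```bash
# Sketch: thin configuration layer on top of the frozen BBB base image.
cat > Dockerfile.config <<'EOF'
FROM registry.example.org/bbb-base:2.2.21
# Configuration is copied last, so all BBB layers below stay cached.
COPY config/ /opt/bbb-config/
EOF

docker build -f Dockerfile.config -t registry.example.org/bbb:2.2.21-cfg42 .
```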

Most of the configuration files are adapted but static (welcome message, supported codecs, …). Yet, a small part of the configuration needs to adapt to the host running that BBB instance; this is mainly the case for the network configuration and dial-in support. For these cases, the files stored in the configuration container are templated and are instantiated from environment variables. The configuration is available here.
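A minimal sketch of how such templating could look at container start is shown below, using `envsubst` to render host-specific values. The paths, variable names, and start script are assumptions, not our exact implementation.

```bash
#!/usr/bin/env bash
# Entrypoint sketch: instantiate host-specific configuration from templates
# via environment variables before starting the BBB services.
set -euo pipefail

: "${EXTERNAL_IPV4:?must be set per host}"
: "${SERVER_DOMAIN:?must be set per host}"

mkdir -p /opt/bbb-config/rendered

# Render every *.tmpl file (paths are illustrative).
for tmpl in /opt/bbb-config/templates/*.tmpl; do
  target="/opt/bbb-config/rendered/$(basename "${tmpl%.tmpl}")"
  envsubst < "$tmpl" > "$target"
done

exec /usr/local/bin/start-bbb.sh   # hypothetical start script
```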

Updates and Changes

There are three basic types of changes to be applied to a BBB installation: (i) changes to the operating system; (ii) changes to the BBB configuration; (iii) changes to the BBB version.

Following the concept of immutability, changes to the operating system were done by simply throwing away the virtual machine and re-installing the operating system from scratch. This kind of update was only rarely necessary.
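In bwCloud terms, such an update is essentially "delete and recreate", driven in our case by Ansible. The equivalent OpenStack CLI calls look roughly like this; server name, image, flavor, network, and the Ignition file are placeholders.

```bash
# Replace rather than patch: throw the VM away and recreate it from a fresh image.
openstack server delete bbb-node-07
openstack server create \
  --image coreos-container-linux-stable \
  --flavor bbb-large \
  --network bbb-net \
  --user-data ignition/bbb-node-07.json \
  bbb-node-07
```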

Changes to the configuration were applied by changing the respective configuration files in the git repository, building a new versioned container, and updating the BBB container through the orchestrator. We applied such changes several times a day in the first eight weeks, and very seldom afterwards.

Changes to the BBB version required us to rebuild the BBB container from scratch, a task that takes roughly one hour to complete. Unfortunately, this new container could not be shipped immediately, as even a patch-level version can introduce new configuration options, change existing configuration options, and introduce new components that require new configuration files.

Because of this, after having built a container with a new version, we would scan the new container for configuration files and `diff` them against the configuration files of the previous version. As both are the default configuration files, this process reveals all changes the developers made to any configuration option. While this can be done automatically, migrating these changes into our custom BBB configuration files is a fully manual task, as is deciding which values to choose for new configuration options.
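A sketch of that comparison step is given below. The image names and the list of configuration paths are assumptions; BBB spreads its configuration over several components, so the real list is longer.

```bash
# Compare the default configuration shipped in two BBB image versions (illustrative).
OLD=registry.example.org/bbb-base:2.2.20
NEW=registry.example.org/bbb-base:2.2.21
docker create --name bbb-old "$OLD" true >/dev/null
docker create --name bbb-new "$NEW" true >/dev/null

for f in /usr/share/bbb-web/WEB-INF/classes/bigbluebutton.properties \
         /usr/share/meteor/bundle/programs/server/assets/app/config/settings.yml; do
  docker cp bbb-old:"$f" /tmp/old.cfg
  docker cp bbb-new:"$f" /tmp/new.cfg
  echo "=== $f ==="
  diff -u /tmp/old.cfg /tmp/new.cfg || true   # diff exits non-zero when files differ
done

docker rm bbb-old bbb-new >/dev/null
```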

When updating the production system, we usually did canary releases. That is, we removed one of the nodes from the load balancer, updated it, and ran some manual tests. In case of success, the node was put back into the load balancer and 2–3 more nodes were updated as well (the exact number depended on the time of day and how many nodes were without a meeting). If no increase in complaints and tickets at first-level support was observed, we updated up to 50% of all servers (automatically). The remaining 50% of the servers were updated at least 24 hours later.
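With Scalelite, the "drain one node, update, test, re-add" cycle can be scripted along the following lines. The server ID and container name are placeholders, and the rake task names should be checked against the Scalelite documentation for the deployed version.

```bash
# Canary-style update of one BBB node behind Scalelite (illustrative values).
SERVER_ID=<scalelite-server-id>

# 1. Drain the node: Scalelite stops placing new meetings on it.
docker exec -i scalelite-api bundle exec rake "servers:disable[${SERVER_ID}]"

# 2. Roll out the new BBB container on that node via the orchestrator,
#    then run manual smoke tests (join a meeting, share audio/video/screen).

# 3. Put the node back into rotation once the tests pass.
docker exec -i scalelite-api bundle exec rake "servers:enable[${SERVER_ID}]"
```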

Discussion

After being responsible for running BBB for one year, I know more about operations than I had aimed for, more about BBB (and WebRTC in general) than I ever wanted to know, and more about users than … I do understand now why first-level support exists and why people build up FAQs and knowledge bases.

Besides this, the last 12 months were successful in the sense that there was no major outage of BBB, and only minor problems showed up with the platform. The fact that we were able to roll out new versions step by step and to easily roll back broken versions helped a lot to minimize the number of people affected by a mistake on our side. Overall, we rolled out almost 100 versions of BBB during that time, 20 of which stemmed from new BBB releases; the others were changes to the configuration or came from an increased feature set (IPv6 support, dial-in support, …).

A major weakness of our set-up is the load balancer, which was originally developed for schools with their very uniform class sizes and hence meeting sizes. For university operation, Scalelite often failed to find good placements for the large spread of meeting sizes, ranging from 3 to 180+ participants. Here, better scheduling algorithms are needed.

Yet, overall, the realization based on the immutability concept worked fine and left us with roughly one day of work per week for BBB, at least after June 2020. Unfortunately, the tool chain we used has seen brighter days: all tools have run out of support or will do so within a short time. It would have been “exciting” to port all functionality to new tools with the new semester just a few weeks away (we have been there). In the end, this situation motivated the political decision to move to Zoom.

With a tear in the eye…

… we say goodbye to BigBlueButton. Overall, I think we were successful in bringing a 24/7 video conferencing installation to Ulm University’s lecturers and students. Thanks to all the students who participated in the early load testing. Kudos for the positive feedback we got here and there. And apologies to anyone unhappy with the service we offered. Be assured, we did our best.

It was a fun time (despite the strange pandemic circumstances), but also a massive disruption to all research work and research plans. Thanks to my group for covering much of my unavailability and to our research partners for their understanding.

After the shut-down, we will keep the software artefacts available at http://release.bbb.uni-ulm.de/ in the hope that they may be useful to someone else.

Note: An extended version of this blog entry will appear as a series of three articles in Red Stack, the magazine of the German Oracle User Group (DOAG).
