Insights on Re-Architecting to a Single Firmware Codebase Running on Different Hardware

Zhershkovitch
Machines talk, we tech.
9 min read · Feb 27, 2023

Our company, Augury, develops and supplies sensing devices that are installed on machines to detect developing malfunctions and raise an alert before the machine breaks down. This helps industries shift from breakdowns and unplanned maintenance to planned maintenance. The sensing and data collection, and some of the edge computing, are done on those edge devices, which then transfer the data through a node to the cloud for further AI analysis and fault detection. This post discusses the re-architecting of the edge devices' firmware.

Single code base - what it is, what it is not, and when it is a good fit for you
Let's say you have a portfolio of devices with some hardware differences but similar general functionality, so you have to maintain a separate code base repository for each device or device family. Let's also say that you are not pleased with some of those code bases, but redesigning, refactoring and updating them would be too tedious, or maybe even impossible, because of old architecture, lack of compatibility, etc.

At Augury, the new-generation devices' code base is built with Zephyr RTOS (Real-Time Operating System) on an nRF52840 SoC (System on Chip). Our older devices run on an nRF52832 SoC and had a bare-metal code base built with an old nRF SDK 11.0 that can no longer be updated. There are some hardware differences in the board designs and components of our devices, but all our devices are required to perform the same general functionality.

So, when we think of the concept of a Single Code Base (SCB), we think of a code architecture that enables the application code to be agnostic to the hardware it runs on. We chose to implement this with Zephyr RTOS, which uses a devicetree to describe the hardware, plus a bindings mechanism and an API layer, to achieve that hardware abstraction for the application.
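
To make the idea of a hardware-agnostic application concrete, here is a minimal plain-C sketch. It is not Zephyr code and not our actual firmware: the `accel_api` and `board` structures, the two fake drivers and the returned values are all hypothetical. They only imitate the role that a devicetree node and its bound driver play, so the application function never needs to know which board it runs on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver interface: the only thing the application sees. */
struct accel_api {
    int (*read_mg)(int16_t *out_mg); /* returns 0 on success */
};

/* Hypothetical board description, standing in for a devicetree node
 * together with the driver bound to it. */
struct board {
    const char *name;
    const struct accel_api *accel;
};

/* Two made-up drivers representing different sensor parts. */
static int old_accel_read(int16_t *out_mg) { *out_mg = 980;  return 0; }
static int new_accel_read(int16_t *out_mg) { *out_mg = 1000; return 0; }

static const struct accel_api old_accel = { .read_mg = old_accel_read };
static const struct accel_api new_accel = { .read_mg = new_accel_read };

static const struct board board_old = { "old_gen", &old_accel };
static const struct board board_new = { "new_gen", &new_accel };

/* Application code: identical for every board, with no #ifdef. */
int app_read_accel(const struct board *b, int16_t *out_mg)
{
    if (b == NULL || b->accel == NULL || out_mg == NULL) {
        return -1;
    }
    /* A driver table without a read function is a programming error. */
    assert(b->accel->read_mg != NULL);
    return b->accel->read_mg(out_mg);
}
```

Calling `app_read_accel(&board_old, &value)` or `app_read_accel(&board_new, &value)` goes through exactly the same application path; only the board description differs.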

For a more in-depth understanding of Zephyr's devicetree, see the Zephyr project's devicetree documentation.

Now, if you are reading this and thinking, "Hey, this really fits my needs," you first need to make sure it is actually feasible in your case. For example:
We had to make sure our deployed units around the world could be updated to the new code base OTA (Over The Air). This was not obvious, since the new code base has a different memory layout and bootloading mechanism. Our older devices have ¼ of the RAM and ½ of the CPU flash of our new devices. The new devices have NAND flash, while the older ones have NOR flash, which is slower to write and whose page program (write operation) accepts a buffer ⅛ the size of the NAND's, meaning 8 times as many write iterations as on the new-generation devices.
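
The write-iteration claim is simple ceiling division, sketched below. The page sizes are assumptions chosen only to reproduce the ⅛ ratio (256 bytes for NOR and 2048 bytes for NAND); check the datasheets of your own parts for the real values.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page-program sizes, for illustration only (ratio 1:8). */
enum {
    ASSUMED_NOR_PAGE_SIZE  = 256,
    ASSUMED_NAND_PAGE_SIZE = 2048
};

/* Number of page-program operations needed to write `len` bytes when
 * each operation accepts at most `page_size` bytes (ceiling division). */
size_t page_program_ops(size_t len, size_t page_size)
{
    assert(page_size > 0);
    return (len + page_size - 1) / page_size;
}
```

With these assumed sizes, a 16 KiB block takes `page_program_ops(16384, ASSUMED_NAND_PAGE_SIZE)` = 8 operations on the NAND and `page_program_ops(16384, ASSUMED_NOR_PAGE_SIZE)` = 64 operations on the NOR, which is the factor of 8 mentioned above.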

These questions about hardware limitations such as RAM, flash size and performance must be asked, and the risks must be taken into consideration. If there is doubt, it is better to start with a POC (Proof Of Concept) and either eliminate the risks or discover that the solution cannot work at the beginning, not after months of work.

How we did it at Augury - Road map and Architecture
As mentioned, our priority in building the road map was to deal with the risks first and to show that the project is feasible ASAP.

Following this guideline, we started with the POC to prove we could DFU (Device Firmware Update) the new code base OTA onto a device running the old code base. The next stage was to start building the SCB for the older devices, and this stage was also driven by the approach of handling the risks first. Zephyr OS (Operating System) has a lot of overhead compared to a bare-metal code base, so first we configured our Zephyr application to run on the older devices' hardware without any sensors or functionality. Next came functionality. Our devices carry out many operations, but the most crucial is the ability to sample the sensors at a very high sampling rate while streaming the collected data to external flash. These steps were at risk mainly because of the RAM shortage on the older devices (64 KB vs. 256 KB on the new devices), plus the differences in the external flash's maximum write buffer size and performance.

Once we got here, we were confident our devices could work, and we could move forward to complete the application. The previously mentioned POC took some steps that are not part of our regular OTA DFU scenario. The POC and the deployment stage have a lot of content and will be described in detail in a follow-up post, hopefully coming soon.

Challenges and decisions when re-architecting a working application
Redesigning a working code base of a continuously developing product carries the risk of breaking something that works. Let's look at three types of reasons for redesigning, or not:

  • Must redesign - functionality demands it:
    As mentioned, RAM size was a serious issue to tackle, given all the Zephyr OS overhead and an application that was initially written with no regard for RAM limitations. One solution for reducing stack use was to allocate buffers when needed instead of holding all of them on the stack all the time. We could use this approach because our BLE data transfer operations and our sampling-plus-streaming-to-flash operations do not happen at the same time. Both operations require relatively large buffers, so we created our own memory "pool", which is smaller than the sum of all the needed buffer sizes, and we allocate the buffers from it when needed.
    The sampling of one sensor has to occur at a highly accurate sampling frequency. We used a PicoScope to profile our running processes. Without going into too much detail, we found we could reduce the running time of the high-priority sampling thread by collecting several samples before calling a higher level of the application to handle them, instead of calling it for each individual sample (at a very high frequency). That freed some CPU time for the lower-priority thread that issues the flash commands.
  • Could, but should? Genericity vs. performance, and the project's time frame:
    You can redesign to make it work, and you can redesign to make it good and generic. But genericity has its costs, whether it is the added conditions, interfaces and extra function calls that may affect performance and code size, or the extra time a good redesign takes to consider everything, including backward compatibility and forward readiness. On the other hand, genericity can add a lot of value by enabling easy configuration, modification, removal or addition of features.

    For us, performance dictates. When performance is not an issue, it becomes a matter of how much time the redesign will take, which depends on how tangled the process to be redesigned is.

    When we decide not to redesign, we sometimes use #ifdefs as an interim solution to include or exclude code parts according to the configured board. This is obviously easier, but it is not really common code, because you actually hold several codes in the same place. Although only the relevant parts are compiled, this is neither the goal nor the right way to do it, but a compromise between time frames and necessity.

    My advice - stay focused on the project goal and scope
    - While wandering through the code, you may see things like unused functions or variables, functions that do the same basic things in different modules, unnecessarily exposed modules, spaghetti code, etc. It is like taking a walk in the park and picking up any trash you see on the way. At the end you will have a clean and beautiful park, but the walk will take much longer than you planned. Remember, any redesign has the potential to unexpectedly break something that works, especially if you already see "trash". Yet better organised and encapsulated code is easier to maintain and debug. So it is up to you to decide how much time to invest and at what scale to redesign things.
  • Can't / won't:
    When redesigning something forces you to make changes where you did not plan to. The BLE communication between nodes and edge devices has defined command and data packet structures. If we make a change on one side, we have to adjust the other side as well. We chose to avoid changes on the edge device side that would force us to make big changes on the node side, and maybe even in the cloud. This limitation was sometimes hard, as it forced us to keep some designs and workflows that were not so good.
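
The shared buffer pool described under "Must redesign" can be sketched as follows. This is a minimal illustration, not our actual implementation: it assumes a simple bump allocator that is reset between the two phases (sampling and BLE transfer), and the pool size, alignment and function names are hypothetical. It is also not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE  4096u /* hypothetical: smaller than all buffers combined */
#define POOL_ALIGN 8u    /* hypothetical alignment for every allocation */

static_assert(POOL_SIZE % POOL_ALIGN == 0, "pool size must be aligned");

static _Alignas(8) uint8_t pool_mem[POOL_SIZE];
static size_t pool_used;

/* Hands out the next free chunk of the pool, or NULL if it does not fit. */
void *pool_alloc(size_t size)
{
    /* Round the request up to the alignment. */
    size_t aligned = (size + (POOL_ALIGN - 1u)) & ~(size_t)(POOL_ALIGN - 1u);

    if (size == 0 || aligned < size || aligned > POOL_SIZE - pool_used) {
        return NULL; /* empty request, overflow, or not enough room */
    }
    void *p = &pool_mem[pool_used];
    pool_used += aligned;
    return p;
}

/* Releases everything at once, e.g. when one phase ends and the
 * other phase is about to start. */
void pool_reset(void)
{
    pool_used = 0;
}
```

In this sketch, a sampling phase that needs 3000 bytes and a BLE phase that needs 3000 bytes both fit in the 4096-byte pool, as long as `pool_reset()` is called between them. Holding both buffers at the same time would need 6000 bytes.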

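The sample batching described under "Must redesign" can be sketched like this. It is an illustration under assumptions, not our firmware: the batch length, the `sample_batcher` type and the handler signature are hypothetical, and the handler here runs in the caller's context, whereas a real RTOS design would more likely hand the full batch to a lower-priority thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_LEN 8u /* hypothetical number of samples per upcall */

typedef void (*batch_handler_t)(const int16_t *samples, size_t count);

struct sample_batcher {
    int16_t buf[BATCH_LEN];
    size_t count;
    batch_handler_t on_full;
};

void batcher_init(struct sample_batcher *b, batch_handler_t on_full)
{
    assert(b != NULL && on_full != NULL);
    b->count = 0;
    b->on_full = on_full;
}

/* Called on the high-priority sampling path for every sample. */
void batcher_push(struct sample_batcher *b, int16_t sample)
{
    b->buf[b->count++] = sample;
    if (b->count == BATCH_LEN) {
        b->on_full(b->buf, b->count); /* one upcall per BATCH_LEN samples */
        b->count = 0;
    }
}

/* Example handler that only counts how often it is called. */
static size_t handler_calls;

static void count_batch(const int16_t *samples, size_t count)
{
    (void)samples;
    (void)count;
    handler_calls++;
}

/* Feeds n samples through a batcher and returns the number of upcalls. */
size_t demo_upcalls(size_t n)
{
    struct sample_batcher b;

    handler_calls = 0;
    batcher_init(&b, count_batch);
    for (size_t i = 0; i < n; i++) {
        batcher_push(&b, (int16_t)i);
    }
    return handler_calls;
}
```

With a batch length of 8, `demo_upcalls(64)` makes 8 calls into the higher level instead of 64. The trade-off is added latency and a slightly larger buffer on the sampling side.
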
Pre-conclusion - some personal and high-level perspectives on such a project:
While working on an SCB, or any re-architecting of a constantly developing code base, you may experience the feeling of a constant chase. Each new feature, bug fix, or other addition or modification made to the code base, if not designed from the start to support all devices and not just the new one, is just another thing you will have to align (that is, redesign) later on.
To handle this, you should define an end point for the project.
We decided that the older products must, at least, do everything they did before. Yet this was not so easy, because we also wanted to stay aligned with the developing code base to avoid handling too many conflicts later on.
Align your team members, and don't forget to align them again.
We also added verification of the build process for all boards using CircleCI, which is a good way to maintain some alignment between the development of all products.

On an organisational level, you should think of the day after the SCB exists. If separate teams each work on a different product, and each has its own product-specific KPIs (Key Performance Indicators), it may become hard to maintain the code as an SCB, because each team will eventually care mostly about its own product's development. If the teams' agendas are not managed to preserve the SCB, the code might split into several code bases again after some time.

To conclude - what we have at the end of the SCB journey
We are now at the testing stage; the deployment procedure follows.
For this post, we conclude the first step in our road map:
We have an application running on all our boards from the same code base, and we manage it in one repository (once deployment is complete, we can dispose of the older code base repositories).
We gained additional features from the new-generation products, which now also work on the older ones, and that is a great win. Some of these new features are edge computations that reduce the amount of data we send, which lets us reduce OTA transfer volume and battery consumption. We also found some bugs and bottlenecks in the new-generation product's code on the way to making its application run well on all our boards, so a by-product of the SCB work was an improvement in the new-generation products' performance.
Finally, looking ahead, every new feature will first be considered for all our sensing products, not just the new ones, which is a great win for Augury as a machine health service provider with over 60k "old"-generation products deployed out there.
