A lesson from the past: building real-time systems is hard

Amar Mahmutbegovic
2 min read · Sep 27, 2022


In July 1997, the Mars Pathfinder mission successfully landed on Mars. It worked flawlessly in its early days: the airbag landing was successful, and so were the deployment of the Sojourner rover, the data collection, and the transmission of that data back to Earth.

But a few days into the mission, the lander started experiencing system resets, resulting in data loss. The cause of the resets turned out to be the most famous example of a priority inversion problem.

VxWorks is the real-time operating system (or kernel) that was used for the Pathfinder mission. It provides preemptive, priority-based scheduling of threads, and tasks on Pathfinder were assigned relative priorities.

The information bus management task was assigned a high priority. It ran frequently and moved data from one place to another. The meteorological data gathering task ran infrequently and was given a low priority.

The meteorological data gathering task would acquire a mutex to write data to the information bus. If the information bus management task entered the Ready state while the meteorological task was holding the mutex, it had to wait for the meteorological task to release it. But if a medium-priority task, one with a priority higher than the meteorological task’s but lower than the bus management task’s, became Ready in the meantime, it would preempt the meteorological task and run while the high-priority task was still waiting for the mutex. The low-priority task could not run to release the mutex, so the high-priority task stayed blocked. In this scenario, the watchdog timer would expire (as it wasn’t reset by the high-priority information bus management task) and cause the system to reset.
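To make the sequence of events easier to follow, here is a minimal sketch of the scenario as a tick-by-tick simulation of a fixed-priority preemptive scheduler. It is not Pathfinder or VxWorks code: the `simulate` function, the `Outcome` struct, and all tick counts are hypothetical and chosen only for illustration.

```cpp
// Hypothetical three-task model of the scenario. All names and tick counts
// are illustrative assumptions, not values from the Pathfinder software.
struct Outcome {
    bool watchdog_fired;  // true if the high-priority task was starved too long
    int high_done_tick;   // tick at which the high-priority task ran, or -1
};

// Simulates a fixed-priority preemptive scheduler, one tick at a time.
//   low:    holds the mutex from tick 0, needs `low_work` ticks to release it
//   high:   becomes ready at tick 1, needs the mutex, then 1 tick of CPU
//   medium: becomes ready at tick 2, needs `medium_work` ticks, no mutex
Outcome simulate(int low_work, int medium_work, int watchdog_limit) {
    const int high_ready_at = 1;
    const int medium_ready_at = 2;
    bool mutex_held_by_low = true;

    for (int tick = 0;; ++tick) {
        // The watchdog expires if the high-priority task has waited too long.
        if (tick - high_ready_at >= watchdog_limit) {
            return {true, -1};
        }

        bool high_runnable = tick >= high_ready_at && !mutex_held_by_low;
        bool medium_runnable = tick >= medium_ready_at && medium_work > 0;

        if (high_runnable) {
            return {false, tick};  // high runs and resets the watchdog
        } else if (medium_runnable) {
            --medium_work;         // medium preempts low: the inversion
        } else if (low_work > 0) {
            if (--low_work == 0) mutex_held_by_low = false;
        }
    }
}
```

With no medium-priority work, `simulate(3, 0, 5)` lets the low-priority task finish and the high-priority task runs at tick 3. With a long-running medium-priority task, `simulate(3, 10, 5)` starves the low-priority task, the mutex is never released, and the simulated watchdog fires.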

Photo: NASA

VxWorks mutex objects could be created with priority inheritance enabled, which temporarily raises the priority of the low-priority task holding the mutex to that of the high-priority task waiting for it, so that medium-priority tasks can no longer preempt it. This feature wasn’t enabled for the mutex shared by the information bus management task and the meteorological data task. Enabling it and deploying the patch to Pathfinder solved the issue.
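As an illustration of what opting in to priority inheritance looks like in code, here is a minimal sketch using the POSIX threads API rather than VxWorks (on VxWorks, to my knowledge, the equivalent is the `SEM_INVERSION_SAFE` option passed to `semMCreate`). The helper name `init_priority_inheritance_mutex` is hypothetical, and the sketch assumes a platform that supports `PTHREAD_PRIO_INHERIT`.

```cpp
#include <pthread.h>

// Hypothetical helper: initializes `mutex` with the priority inheritance
// protocol. Returns 0 on success or a POSIX error number on failure.
int init_priority_inheritance_mutex(pthread_mutex_t* mutex) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;

    // While a higher-priority thread is blocked on this mutex, the owner
    // temporarily runs at that thread's priority.
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0) rc = pthread_mutex_init(mutex, &attr);

    // The attribute object is no longer needed once the mutex is initialized.
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The point of the sketch is that inheritance is a per-mutex option chosen at creation time, which is easy to overlook. Note that it only has a visible effect when the threads involved run under a real-time scheduling policy with distinct priorities.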

You may wonder how it’s possible that NASA would send a multi-million-dollar mission to Mars without all bugs being solved. The answer is rather simple: we are all human. Engineers working on the Pathfinder mission were focused on the parts of the code responsible for landing, which were crucial for the mission’s success. The watchdog was put in place to reset the system if something suspicious happened, and it did its job. Real-time systems are complex, and it’s hard to test all possible cases. The good thing is that they had the option to upload patches from Earth, which is why remote access and firmware updates are essential for most devices on the market. You can quickly deploy a fix on a server or provide updates for a mobile app, but if your embedded device doesn’t support OTA or USB DFU, you are in big trouble.


Amar Mahmutbegovic

Head of Engineering at Semblie, Embedded Software Developer and author of Modern C++ in Embedded Development