Software Bugs in a Particulate Matter Sensor Library

A look at the care required to robustly parse a simple serial protocol.

Kevin J. Walters
Nerd For Tech
14 min read · Feb 5, 2021


First two bytes of serial transmission from a PMS5003.

This article looks at some of the bugs discovered in a library for the Plantower PMS5003 particulate matter sensor. This device uses a simple serial interface to send data at low speed to another device. The library sounds trivial to design and implement but it’s remarkably easy to produce a library with protocol parsing flaws.

The Pimoroni Enviro+ FeatherWing board has a connector for the PMS5003. The example code includes a plotting program which can graph the particulate data over time at different rates. The combined plotter program typically terminated with a ChecksumMismatchError exception after a minute or two when executed with the shortest interval of one second. A logic analyser confirmed there was no corruption of the data sent by the PMS5003. This bug was investigated together with additional ones that were postulated by reading about issues that other users had experienced. Further potential bugs were noted while considering different types of failure and developing unit tests.

This work originated after writing the article Instructables: Using the Pimoroni Enviro+ FeatherWing With the Adafruit Feather NRF52840 Express and fixing bugs in the pimoroni_pms5003 library. The library is written in CircuitPython, a fork of MicroPython, a version of Python 3 for microcontrollers. The unit tests are written in Python 3 and can be executed by CPython.

Four of the new unit tests are shown in this article together with animated visualisations of the serial FIFO buffer during execution of the revised code. These were created with a small Python program to instrument the serial object in the unit tests. It uses Graphviz to help create the frames for the animations.

Graphviz might be seen more often being used for graphing static data flows but it can be a powerful tool to aid the rapid development of dynamic visualisation of data flows. In this case, the state was extracted from executing unit tests but other sources can be processed to extract state over time, for example, detailed log files.

Disclosure: I write articles for Adafruit occasionally.

Plantower PMS5003 Particulate Matter Sensor

Plantower PMS5003 particulate matter sensor with a breadboard adapter. Copyright Adafruit.

The PMS5003 is an affordable, small sensor for measuring particulate matter in air based on diffraction analysis. It is commonly used by hobbyists interested in environmental monitoring and air pollution. It offers uncalibrated measurement of 2.5 µm (PM2.5) and 10 µm (PM10) particles and is the sensor used inside the PurpleAir IoT device.

It has a 9600 baud “TTL” serial interface to transmit data to another device. The serial interface only has TX (transmit) and RX (receive) pins; there are no additional pins for hardware flow control. This type of serial interface is inherently asynchronous. The bytes are encoded in the popular lsb-first 8n1 format.

PMS5003 Protocol

Plantower data frame definition from PMS5003 data sheet. Middle section omitted.

The PMS5003 sends data about particulates over the serial interface encoded in data frames described in the data sheet and shown above with some of the fields omitted for brevity. This form of human-readable table is very useful to describe the data and its meaning but an interface definition language (IDL) of some kind is a more precise way to describe the format of the data.

The first two characters form a header to recognise the start of a data frame. This allows the data frame to be found even if reading starts halfway through a data frame or it is intermingled with minor line noise. The “Check code” is a basic checksum to verify the payload is correct. These two parts of the data frame are useful for parsing and verifying the integrity of the transmission. The data frame size is cited as 32 bytes which is also useful. An open question is whether this specific frame size is a rigid part of the specification, i.e. might the data frame be extended in the future in a firmware revision or a revised product.
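As a minimal sketch of these two integrity checks, the functions below build and verify a complete frame. It assumes, following my reading of the data sheet, that the check code is the 16-bit sum of every byte that precedes it, sent high byte first. The names `make_frame` and `verify_frame` are illustrative and are not part of the library.

```python
import struct

HEADER = b"\x42\x4d"  # the two start characters

def checksum(data):
    """16-bit sum of all bytes (the assumed check code algorithm)."""
    return sum(data) & 0xFFFF

def make_frame(values):
    """Build a test frame from a list of 16-bit data words."""
    payload = struct.pack(">%dH" % len(values), *values)
    length = len(payload) + 2  # length field counts the data plus the check code
    body = HEADER + struct.pack(">H", length) + payload
    return body + struct.pack(">H", checksum(body))

def verify_frame(frame):
    """Return True if the header and check code of a whole frame are valid."""
    if len(frame) < 6 or frame[:2] != HEADER:
        return False
    (expected,) = struct.unpack(">H", frame[-2:])
    return checksum(frame[:-2]) == expected

frame = make_frame(list(range(13)))      # 13 data words gives a 32 byte frame
corrupted = bytearray(frame)
corrupted[10] ^= 0x01                    # flip one bit in the payload

print(len(frame), verify_frame(frame))
print(verify_frame(bytes(corrupted)))
```

A simple sum like this catches single corrupted bytes but not every combination of errors, which is one reason the header and length checks matter as well.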

Data Rate

Logic Analyser showing PMS5003 data frame intervals of 908.5ms and 838.8ms.

It is very easy to assume that the device emits data frames once a second from casual testing. It’s common to see libraries and programs which make this assumption.

A logic analyser, like the Ikalogic SQ25 used here with ScanaStudio software, shows this is not quite the case. The transmission of each data frame causes many low/high transitions resulting in the yellow “blocks” on the trace. Shortly after power-up, the PMS5003 will deliver data frames either every 908.5ms or every 838.8ms. This sounds very close to one second, but at the faster rate the sensor produces roughly 103,000 data frames per day rather than the 86,400 that a one second interval would give, i.e. over four and a half hours of extra data relative to the expected volume. This is problematic if a program (or library) assumes data is sent exactly every second. A typical simple program is shown below.
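A minimal sketch of such a program follows. The `FakeSensor` class is a hypothetical stand-in for the library's sensor object so that the sketch runs without hardware, and the loop count and interval are parameters only so that it can be tried quickly; a real program would loop forever with a one second sleep.

```python
import time

class FakeSensor:
    """Hypothetical stand-in for the sensor object (not the library's class)."""
    def read(self):
        time.sleep(0.002)  # pretend that reading and parsing takes a little time
        return {"pm2.5": 7, "pm10": 9}

def poll(sensor, count=3, interval=1.0):
    """Read and print `count` times; return the mean time per loop iteration."""
    start = time.monotonic()
    for _ in range(count):
        data = sensor.read()
        print(data)
        time.sleep(interval)
    return (time.monotonic() - start) / count

mean_period = poll(FakeSensor(), count=3, interval=0.01)
print("mean loop period: %.4f s" % mean_period)
```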

This code is probably intended to read the sensor every second. The execution time of the read() method and print() command is going to ensure it runs slightly less frequently. Even a while loop with just a time.sleep(1) will not run exactly once a second due to the small overhead of the loop and any background tasks in the interpreter. A simple, compiled language like C will be much closer but this will still be subject to interrupts. The accuracy and drift of the processor’s clock will also have an effect.

The logic analyser shows us the task of matching the read() rate (consumer of data) to the PMS5003 send rate (producer of data) is challenging. An empirical analysis is also not guaranteed to give a complete picture of the PMS5003 behaviour. The data sheet is useful in providing more detail on this.

There are two options for digital output: passive and active. Default mode is active after power up. In this mode sensor would send serial data to the host automatically.

The “active mode” is described in further detail.

The active mode is divided into two sub-modes: stable mode and fast mode. If the concentration change is small the sensor would run at stable mode with the real interval of 2.3s. And if the change is big the sensor would be changed to fast mode automatically with the interval of 200-800ms, the higher of [sic] the concentration, the shorter of [sic] the interval.

A program or library reading streaming data from this sensor in active mode must deal with this adaptive/variable rate somehow. If it consumes data at a lower rate than the production rate then data will be buffered and eventually discarded when the buffer fills. These types of buffers typically have a fixed maximum size and do not grow dynamically with demand. This will clearly happen very rapidly for a program which reads the data less frequently, e.g. once a minute.
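The effect of a slow consumer on a fixed-size buffer can be sketched with a toy model. `BoundedFifo` is a hypothetical class written for this illustration, and it assumes that incoming bytes are simply dropped once the buffer is full; a real UART driver may behave differently.

```python
class BoundedFifo:
    """Fixed-size receive buffer; assumed to drop new bytes when full."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()
        self.dropped = 0

    def write(self, incoming):
        room = self.capacity - len(self.data)
        self.data.extend(incoming[:room])
        self.dropped += max(0, len(incoming) - room)

    def read(self, count):
        out = bytes(self.data[:count])
        del self.data[:count]
        return out

FRAME_SIZE = 32
fifo = BoundedFifo(64)

# The producer sends three frames for every one the consumer reads.
for cycle in range(4):
    for _ in range(3):
        fifo.write(bytes(FRAME_SIZE))
    fifo.read(FRAME_SIZE)
    print("cycle", cycle, "buffered", len(fifo.data), "dropped", fifo.dropped)
```

The buffered count never exceeds the capacity, so every surplus byte is lost and the dropped count grows on each cycle.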

A common tactic with problematic serial communication is to “make everything bigger”. A naive approach here is to do a quick test with a substantially increased receive buffer size in CircuitPython on the Feather board. A very large buffer may well appear to eliminate the aforementioned, mysterious ChecksumMismatchError exceptions for a quick test run. However, it will cause a new subtle problem — the data will no longer be current as it will gradually fall behind as data is read from the growing pile of stale data frames in the buffer.

Reading the Header

There were no bugs observed with parsing the header but it’s very easy to unintentionally create them. The frame header is two bytes so why not read two bytes and then verify them?

This test with a rogue 5 bytes of zeros injected between data frames shows why reading/parsing the bytes one at a time is essential.

Unit test checking the correctness of parsing for two byte header.

A tempting implementation might be one that reads two bytes and then checks them both to see if they match the 424d header. That would work sometimes but if an odd number of bytes is read before the header then it would fail as neither ??42 nor 4d?? matches 424d despite it being present in the data. Another good test would be parsing 42 followed by a good data frame.
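The difference between the two approaches can be sketched as follows, with `io.BytesIO` standing in for the serial port. Both function names are hypothetical and this is not the library's code, only an illustration of reading the header one byte at a time versus two at a time.

```python
import io

def find_header(stream):
    """Consume bytes one at a time until 0x42 0x4d has been read.

    Returns the number of bytes discarded before the header, or None if
    the stream runs out first.
    """
    discarded = 0
    previous = b""
    while True:
        current = stream.read(1)
        if not current:
            return None
        if previous == b"\x42" and current == b"\x4d":
            return discarded - 1  # the 0x42 was counted but belongs to the header
        discarded += 1
        previous = current

def find_header_pairs(stream):
    """Flawed version: reads two bytes at a time, so it can straddle the header."""
    while True:
        pair = stream.read(2)
        if len(pair) < 2:
            return False
        if pair == b"\x42\x4d":
            return True

data = bytes(5) + b"\x42\x4d\x00\x1c"   # five rogue zero bytes, then a frame start

stream = io.BytesIO(data)
print(find_header(stream), stream.read(2))
print(find_header_pairs(io.BytesIO(data)))
```

The byte-at-a-time version also copes with a stray 42 immediately before a good frame because the second 42 is remembered as a possible start character.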

This test is relatively simple and it’s easy to understand its execution and purpose. The visualisation helps a little to emphasise the rigid size of the receive buffer.

This bug did exist in the partner library intended for use with Python on the Raspberry Pi, see GitHub: pimoroni/pms5003-python: Read SOF byte-by-byte to allow resync #5.

Reading the Length Field

The length field occurs after the two bytes which indicate the start of a data frame. This value appears to be immediately usable as a value to determine how many more bytes need to be read.

Unit test checking a large, bogus length field is handled properly.

This test shows one problem with trusting that the length field is genuine and hasn’t been corrupted by line noise. In this case, the fixed buffer size has helped cause a truncation of a data frame and a subsequent data frame has combined with that fragment to result in a junk data frame in the buffer.

The buffer length of 34 has been intentionally chosen for this test to easily reproduce this bug but it can occur even with a buffer length that is a multiple of the data frame length. The exception below is from testing the original example code enhanced to catch any checksum exceptions. The number 16973 is distinctive and may be familiar: it’s the start of frame bytes (424d) accidentally parsed as a (big endian) length field!

This exception occurred in CircuitPython, which has a default receive buffer size of 64 bytes on UART objects, exactly twice the size of a data frame. The processors and boards CircuitPython runs on have limited memory and the MemoryError exception is probably occurring due to a mixture of a large program and some degree of memory fragmentation. If the library did try to read 16973 bytes it would potentially be trying to read 530 data frames; at the lowest data frame rate this would take over 20 minutes to complete.

An improvement here is to sanity check the length field and look for values which are outside the specification. The revised code raises a new exception, FrameLengthError, for the case of the length being longer (or shorter!) than the expected value. An implementation which retains some flexibility if the data frame size changes in the future is useful. In general, it’s fairly common for data fields to change over time, often growing in size.
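A minimal sketch of such a check is shown here. The exception class is a stand-in that only borrows the name used by the revised library, and `parse_length` with its `expected` parameter is a hypothetical helper; the value 28 follows from a 32 byte frame minus the two header bytes and the two length bytes.

```python
import struct

EXPECTED_LENGTH = 28  # 32 byte frame minus the 2 byte header and 2 byte length field

class FrameLengthError(RuntimeError):
    """Raised in this sketch when the length field is not the expected value."""

def parse_length(raw, expected=EXPECTED_LENGTH):
    """Decode a big endian length field and reject anything unexpected."""
    (length,) = struct.unpack(">H", raw)
    if length != expected:
        raise FrameLengthError("unexpected frame length %d" % length)
    return length

print(parse_length(b"\x00\x1c"))
try:
    parse_length(b"\x42\x4d")  # header bytes misread as a length field
except FrameLengthError as err:
    print("rejected:", err)
```

Keeping the expected value as a parameter rather than a literal buried in the parser is one way to leave room for a longer frame in a future sensor revision.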

One might question how this bad data frame can occur if the receive buffer size is 64 and the data frame size is 32. It looks like the buffer will hold a maximum of precisely 2 data frames, precluding partial truncated data frames. This is explored in the next section.

Serial Buffers

16550AFN UART — an enhanced version of the 16450 from the IBM PC/AT. Photograph by hmvh, Wikipedia.

In an asynchronous protocol, data can be sent at any moment and the receiver needs to store this somewhere. The main processor typically gets some assistance from a universal asynchronous receiver-transmitter (UART). The IBM PC range originally used the 8250 and 16450 chips for this which could buffer a single byte. The 16550AFN shown above has a 16 byte buffer which reduces the chance of the main processor missing the interrupt signifying new data. If the interrupt is not serviced in time then data loss will occur as subsequent bytes overwrite the data in the UART. Microcontrollers tend to have integrated UARTs; the Nordic Semiconductor nRF52840 used on the Adafruit Feather nRF52840 Express board has one with a 6 byte buffer.

A hardware buffer is supplemented by an additional buffer in the operating system or in this case in the CircuitPython interpreter. As mentioned previously, the receive buffer defaults to a size of 64 bytes. Since this is a multiple of the data frame size it’s likely to reduce the frequency of ChecksumMismatchError exceptions, but manual testing will show they still occur unpredictably. The hardware buffer size could also have an effect here depending on how it’s configured and read by the CircuitPython interpreter.

A 32 byte data frame sent at 8n1 equates to 320 bits because each byte is sent as one start bit, eight data bits and one stop bit. At the rate of 9600 baud this takes 33.3ms which means there’s a significant chance of reading data from the buffer during a transmission. If the program is reading data just after it’s sent by the PMS5003 then there shouldn’t be a problem. If it’s reading at a lower rate then this is guaranteed to eventually cause problems and these will have the appearance of occurring randomly.

Logic analyser showing the program reading data frames at a lower rate than the PMS5003 send rate. The program’s second read starts during the second data frame transmission.

The example above shows two library reads indicated by the falling and rising edges of the Red Toggle (red) line. This trace was captured at a fairly low sample rate (10kHz) in order to extend the capture time. This low sample rate prevents decoding of the bytes in the data but the overall length of each data frame transmission can still clearly be seen.

The first read (marker 1) occurs well before the streaming data frame transmission. The second read (marker 2) occurs during a transmission. The production (transmission) rate here is 1.10Hz and the consumption rate is 0.752Hz which will clearly lead to a default CircuitPython UART receive buffer being full after a few seconds. The second read could easily free up some buffer space but not an entire data frame’s worth. This can then cause the partial end part of the second data frame to end up in the buffer.
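The figures quoted in this section can be checked with a few lines of arithmetic. This is only a rough estimate: it assumes the buffer starts empty and that both rates stay constant, neither of which is guaranteed in practice.

```python
BAUD = 9600
BITS_PER_BYTE = 10      # 8n1: 1 start bit + 8 data bits + 1 stop bit
FRAME_BYTES = 32
BUFFER_BYTES = 64       # default CircuitPython UART receive buffer cited above

# Time on the wire for one data frame.
frame_time_ms = FRAME_BYTES * BITS_PER_BYTE / BAUD * 1000

produce_hz = 1.10       # frames per second sent by the sensor in the trace
consume_hz = 0.752      # frames per second read by the program in the trace

# Bytes accumulating in the buffer each second, and the time until it is full.
surplus_bytes_per_s = (produce_hz - consume_hz) * FRAME_BYTES
seconds_to_fill = BUFFER_BYTES / surplus_bytes_per_s

print("frame transmission time: %.1f ms" % frame_time_ms)
print("buffer full after about %.1f s" % seconds_to_fill)
```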

The essence of this issue is the unsynchronised concurrency in sending and receiving the data. This is the nature of this form of communication, therefore the program or library needs to cope with this. This is a challenging area to test with conventional unit tests as they tend to have a sequential nature to them. Even if true concurrency is achievable it may be unwise to use that in a unit test as the behaviour is not likely to be precisely predictable. This could lead to a test that occasionally mysteriously fails.

A library read starting in the middle of a transmission could also theoretically occur when the library first reads from the sensor. The chance of this is almost nil if the sensor is reset as the library can clear the receive buffer and then wait for the sensor to initialise and start sending data.

Partial Read of a Data Frame

This test shows three data frames totalling 96 bytes trying to squeeze into an 80 byte buffer.

Unit test checking a truncated partial data frame is handled properly.

This is an easy way to test the code for a truncation and one where the remaining data has not appeared. The library in this case would be expected to time out, which is what the unit test looks for in the form of the SerialTimeoutError exception. A library which patiently waits for an eternity sounds broken!
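One way to bound the wait is a deadline around the read loop, sketched below. The exception class is a stand-in that borrows the name mentioned above, `read_exact` is a hypothetical helper, and `io.BytesIO` plays the part of a serial port that has stopped delivering data.

```python
import io
import time

class SerialTimeoutError(RuntimeError):
    """Raised in this sketch when the expected bytes do not arrive in time."""

def read_exact(stream, count, timeout=0.05):
    """Read exactly `count` bytes or raise once `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    received = bytearray()
    while len(received) < count:
        chunk = stream.read(count - len(received))
        if chunk:
            received.extend(chunk)
        elif time.monotonic() >= deadline:
            raise SerialTimeoutError(
                "got %d of %d bytes" % (len(received), count))
    return bytes(received)

truncated = io.BytesIO(bytes(16))   # only half of a 32 byte frame is present
try:
    read_exact(truncated, 32)
    outcome = "complete"
except SerialTimeoutError as err:
    outcome = "timed out: %s" % err
print(outcome)
```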

Retries

The library could pass errors back to the program but could also offer a simple retry mechanism to skip over any errors. This functionality was added to the library and the test below shows how it skips a corrupted data frame.

The animation of the execution doesn’t show much and has been omitted. The key part of this test is the second read() succeeding without an exception and returning the expected data. A final call to data_available() is a nice addition to verify that nothing’s left.

A few unit tests need to check that exceptions are generated where expected — these pass retries=0 to the constructor to disable the feature.
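The retry idea can be sketched independently of the library. Everything here is hypothetical: the function names, the default of five retries, and the use of a stand-in exception class. Only the behaviour of `retries=0` meaning "no second attempt" is taken from the description above.

```python
class ChecksumMismatchError(RuntimeError):
    """Stand-in for a parsing error raised by a single read attempt."""

def read_with_retries(read_once, retries=5):
    """Call read_once() until it succeeds, allowing up to `retries` extra attempts."""
    attempt = 0
    while True:
        try:
            return read_once()
        except ChecksumMismatchError:
            if attempt >= retries:
                raise  # out of retries: pass the error back to the caller
            attempt += 1

def make_fake_reader(outcomes):
    """Return a function that yields each outcome in turn, raising any exceptions."""
    pending = iter(outcomes)
    def read_once():
        item = next(pending)
        if isinstance(item, Exception):
            raise item
        return item
    return read_once

# The first frame is corrupt and the second is good.
reader = make_fake_reader([ChecksumMismatchError("bad frame"), {"pm2.5": 7}])
reading = read_with_retries(reader)
print(reading)
```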

Reading the Data Fields

No bugs were observed from testing or code inspection but typical problems would be:

  • the precise message structure, which is defined only in the data sheet;
  • care over endianness, a common source of problems;
  • getting the units correct for each field.
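The endianness point is easy to illustrate with Python's `struct` module. The payload bytes below are invented for the example; the only assumption about the sensor is that each 16-bit field is sent high byte first, as with the length field discussed earlier.

```python
import struct

payload = bytes([0x00, 0x05, 0x00, 0x07, 0x01, 0x02])  # three 16-bit words

big = struct.unpack(">3H", payload)      # high byte first
little = struct.unpack("<3H", payload)   # low byte first (wrong for this data)

print("big endian:   ", big)
print("little endian:", little)
```

Small values decoded with the wrong byte order become implausibly large, which makes this mistake fairly easy to spot in testing as long as the readings are compared against a sensible range.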

Requesting Data in Passive Mode

Plantower command definition from PMS5003 data sheet.

The PMS5003’s default mode is to stream data continuously in “active” mode. A command frame can be transmitted to the PMS5003 to switch it to “passive” mode — in this mode a read command must be transmitted to request a data frame. This can be seen in the logic analyser trace below.

Logic Analyser showing three reads from PMS5003 in active mode then three in passive mode.

The reset (green) line is briefly pulsed low at 5.4 seconds (marker 1) to reset the device. At about 7.3 seconds the PMS5003 raises the RX (yellow) line and at 8.7 seconds the PMS5003 sends the first data frame in active mode. The program is toggling the value of an output (red line) to show when the three read() statements occur. The data frames on the RX line are clearly not synchronised with the first three reads.

The reset line pulses at 12.0 seconds (marker 2) to reset the device again. At 14.4 seconds the first data frame is sent indicating a healthy PMS5003 and the library code sends a command at 14.6 seconds to change to passive mode, observable on the TX (blue) line. Each read() statement now causes a read command to be sent and a data frame to be sent back immediately in response. This dramatically simplifies the buffering and correct parsing of the response in the library code.

Some caution is still required as the command to switch to passive mode gets a response but a data frame with unfortunate timing could sneak in ahead of that, confusing a simple implementation. This can occur even if the serial receive buffer is cleared immediately before the sending of the switch mode command.
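The command frames themselves are short. The sketch below builds them under assumptions taken from my reading of the command table in the data sheet: the same two start characters, a command byte (0xe1 to change mode, 0xe2 to request a read), a 16-bit data value, and a 16-bit sum of the preceding bytes. The function name and constants are hypothetical and should be checked against the data sheet before use.

```python
import struct

HEADER = b"\x42\x4d"
CMD_MODE = 0xE1   # assumed: data 0 selects passive mode, 1 selects active mode
CMD_READ = 0xE2   # assumed: request one data frame while in passive mode

def build_command(cmd, data=0):
    """Build a 7 byte command frame with a 16-bit additive check code."""
    body = HEADER + struct.pack(">BH", cmd, data)
    return body + struct.pack(">H", sum(body) & 0xFFFF)

print(build_command(CMD_MODE, 0).hex())
print(build_command(CMD_READ).hex())
```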

Adafruit have their own CircuitPython library for the PMS5003 which has a pending PR to add support for passive mode. This was used as the basis for an enhancement to the Pimoroni library. The default mode was left as active to avoid changing the behaviour for existing programs but passive mode is likely to be the better choice for the majority of programs.

Conclusion

  • Parsing a serial protocol correctly and robustly may be harder than it first appears, especially for streaming data at variable rates.
  • A logic analyser is a useful tool for capturing TTL-style serial traffic. This not only allows inspection of the behaviour and precise timing but also decoding of the data.
  • Testing to observe the behaviour is useful but reference information like the data sheet must also be consulted. Any discrepancies or omissions should be discussed with the manufacturer.
  • Surveying existing issues and discussions on problematic behaviour is useful research to aid bug hunting, the development of unit tests and guide manual testing.
  • While fixing bugs or adding features, serious consideration is required around library interface changes. This includes adding exceptions in places where none was previously raised.

Resources
