Watch Those Toes!
CAN vs SPI: Because reliable data transmission matters when you live with a robot (part 2)
It may seem surprising, but something as low-level as a communication protocol can have extremely high-level effects. With the right protocol for the environment, data is transmitted accurately and in a timely fashion. And in the case of you and your robot, you can interact in a way that feels natural and rewarding.
With the wrong protocol, you and your robot might not end up on such good terms. Lost or untimely data can lead to anything from an unresponsive robot to squashed toes from a robot that has failed to get the message from its sensors that your foot is in the way of its treads.
In part 1 of this post, we talked about how tricky it is to reliably transmit data over the relatively large and electrically noisy distances inside a robot. We did the world’s quickest introduction to communication protocols, discussed the basic requirements for choosing a protocol, compared some possible candidates for use on a robot, then narrowed them down to two contenders. Now it’s time to dig into the details on our two prospective communication protocols and select one.
Our possible choices are SPI and CAN. They have little in common aside from both being serial communication protocols, meaning each is designed to send data one bit after another rather than many bits at once. A preference for serial protocols is not surprising: in the space-constrained environment of a robot they have a clear advantage, because they require far fewer wires than parallel protocols.
I Spy SPI
First off, let’s tackle the Serial Peripheral Interface or SPI. Developed in the 1980s by Motorola, SPI is widely used in embedded systems. For our robot, we’re most interested in the following aspects of SPI:
- Synchronous. As with other synchronous serial communication protocols, data sent via SPI is synchronized by a clock signal. This avoids the overhead typically required by asynchronous protocols, where some of the transmitted bits exist purely for control purposes. As a result, a synchronous protocol can generally send more useful data in the same amount of time than a comparable asynchronous protocol, while maintaining an acceptable level of reliability. It’s not unheard of to see data rates of 40 Mbps with SPI, although distance is a big factor in how fast the communication can actually run.
- Master-Slave. The SPI protocol uses a master/slave communication model and is designed to have only one master controller and only one slave peripheral communicating at any given time. It’s possible to add multiple peripherals to the configuration, but because the master controller can only communicate with a single peripheral at a time, adding peripherals creates additional hardware requirements, as we’ll see below.
- 4-Wire. SPI requires a minimum of 4 wires for a typical single slave peripheral implementation: Master-Out-Slave-In (MOSI), Master-In-Slave-Out (MISO), Serial Clock (SCLK), and Chip (or Slave) Select (CS or SS). It’s possible to have more than one slave peripheral, but each peripheral needs its own CS/SS line to know when to pay attention to the SCLK and MOSI lines. This requirement to add a wire for each additional slave peripheral means the total number of wires for SPI-based designs can quickly get out of hand.
- Bit In Bit Out Buffer. When the master controller sends a clock pulse, the slave peripheral shifts a bit out of its buffer, onto the MISO line. Once the controller has sent 8 pulses on the clock, the byte that the peripheral had loaded into its buffer is completely shifted out. Simultaneously, the controller is sending bits to the peripheral, filling its buffer again.
- Physical layer only. Because SPI is only defined on the physical layer, each SPI peripheral is required to have its own protocol for anything above the physical layer. That means that the information contained in each byte, and the order in which the bytes are sent, are determined by the engineer making the peripheral. This is true even when the SPI slave is itself a microcontroller, as sometimes occurs. Very often the engineer must define any overhead as well, including header information such as data type/opcode and sequence numbers, and footer information such as checksums. Bottom line: Better allocate some time for added work.
- Master control over data transmission timing. With SPI, the master controller sending a clock pulse determines when the slave peripheral sends data. When a microcontroller is used as a SPI slave, it’s important to ensure that data is loaded into the buffer before the master device requests it. This type of synchronizing requires extra bytes to be sent from the master (or an extra GPIO between the two chips), so that the slave microcontroller can tell the master controller when data is ready to be retrieved.
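To make the "bit in, bit out" behavior concrete, here is a minimal sketch in Python that simulates one 8-clock SPI transfer between two shift registers. It is only a model of the idea described above, not firmware from the robot; the function name and the example byte values are made up for illustration, and it assumes the most significant bit is shifted first.

```python
def spi_exchange(master_byte: int, slave_byte: int) -> tuple:
    """Simulate one 8-clock SPI transfer, most significant bit first.

    Returns (byte the master ends up with, byte the slave ends up with).
    """
    master_reg = master_byte & 0xFF
    slave_reg = slave_byte & 0xFF
    for _ in range(8):                       # one iteration per clock pulse
        mosi = (master_reg >> 7) & 1         # master shifts its top bit out
        miso = (slave_reg >> 7) & 1          # slave shifts its top bit out
        master_reg = ((master_reg << 1) | miso) & 0xFF   # master shifts MISO in
        slave_reg = ((slave_reg << 1) | mosi) & 0xFF     # slave shifts MOSI in
    return master_reg, slave_reg


# Hypothetical values: the master clocks out a command byte while the
# slave already has a reading preloaded in its buffer.
from_slave, from_master = spi_exchange(0xA5, 0x3C)
print(hex(from_slave), hex(from_master))     # prints: 0x3c 0xa5
```

After eight clock pulses the two bytes have simply swapped places. That is also why a slave microcontroller has to have its buffer loaded before the master starts clocking: whatever happens to be in the register is what gets sent.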
The physical constraints of our robot do make SPI a bit of a challenge. As shown above, we have three boards (one master, two slaves) that need to communicate, but we have physical pinch points that really limit the number of wires we can run through the robot. So, to give SPI a try, rather than adding a CS/SS line for each of the slave boards, we daisy-chain the two slave boards, so that the master talks to the first slave only. The first slave then forwards any commands intended for the second slave over a second SPI link that runs between the two slaves. Thus, the first slave is both a slave to the master and a master to the second slave. A bit awkward, but it works.
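Because SPI leaves everything above the physical layer to the engineer, a daisy-chained setup like this needs a home-grown frame format so the first slave can tell which messages to keep and which to pass along. The sketch below shows one way that could look. Everything in it is an assumption made for illustration: the node IDs, the field layout (destination, opcode, sequence number, length, payload, checksum) and the simple additive checksum are not the robot’s actual format.

```python
SENSOR_NODE = 1   # hypothetical ID for the first slave in the chain
MOTOR_NODE = 2    # hypothetical ID for the second slave


def checksum(data: bytes) -> int:
    """8-bit additive checksum (an illustrative choice, not from the article)."""
    return sum(data) & 0xFF


def build_frame(dest: int, opcode: int, seq: int, payload: bytes) -> bytes:
    """Header (dest, opcode, seq, length) + payload + one checksum byte."""
    body = bytes([dest, opcode, seq & 0xFF, len(payload)]) + payload
    return body + bytes([checksum(body)])


def parse_frame(frame: bytes):
    """Return (dest, opcode, seq, payload), or None if the frame is corrupt."""
    if len(frame) < 5:
        return None
    body, footer = frame[:-1], frame[-1]
    if checksum(body) != footer or body[3] != len(body) - 4:
        return None
    return body[0], body[1], body[2], body[4:]


def route_on_first_slave(frame: bytes) -> str:
    """Decide what the first slave does with a frame from the master."""
    parsed = parse_frame(frame)
    if parsed is None:
        return "drop"        # corrupted in transit, nothing sensible to forward
    dest = parsed[0]
    if dest == SENSOR_NODE:
        return "handle"      # meant for this board
    if dest == MOTOR_NODE:
        return "forward"     # re-send on the second SPI link
    return "drop"            # unknown destination


frame = build_frame(MOTOR_NODE, opcode=0x10, seq=7, payload=b"\x01\x02")
print(route_on_first_slave(frame))   # prints: forward
```

Note how much of this is bookkeeping that the engineer has to design, write and test by hand, on both boards, for both directions.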
CAN do?
Another creation of the 1980s, the Controller Area Network, or CAN, was originally developed for use in the automotive industry, where reliable communication among distributed subsystems was essential. That said, current users range from FIRST Robotics teams to Johns Hopkins University, which uses it in sensor-intensive prosthetics, so we’re hardly alone in considering CAN for a robot. Let’s take a look at what makes CAN of interest.
- Asynchronous. As an asynchronous serial communication protocol, CAN has no clock signal to keep transmitted data synchronized, which means its maximum data rate will be lower than SPI’s. In addition, some amount of data must be sent simply for control purposes. Having to send this “overhead” data lowers the protocol’s effective data rate. The CAN 2.0B standard (which is currently employed in our test setup) is limited to 1 Mbps, which includes the control bits; this is still plenty for Misty. However, newer modules using CAN FD (Flexible Data-rate) can go as high as 8 Mbps. In addition to the faster data rate, CAN FD allows larger, flexible payloads (up to 64 bytes instead of 8), thus reducing the percentage of overhead bits per packet.
- 2-wire differential pair. The CAN bus has two wires to pass the signal along: CAN HI and CAN LO, which work in unison. A logic HIGH bit, which is defined as a “recessive” bit in the CAN protocol, will have both HI and LO at about ½ of VCC. A logic LOW bit, which is defined as a “dominant” bit in the CAN protocol, will have the HI line driven toward VCC and the LO line driven toward GND. Rather than ‘1’s and ‘0’s being defined by an absolute voltage as in most other protocols, in CAN they are defined by a relative voltage. So a ‘1’ is defined by HI and LO being close to the same voltage, and a ‘0’ is defined by HI and LO being at least 1.5 V apart. Induced noise (such as crosstalk from a motor line) hits both wires of the pair and makes both lines spike in the same direction, so the difference between them stays the same. Thus this kind of noise is ignored by the nodes on the CAN network, which is critical in a noisy environment like a robot.
- Any node can transmit or receive. In a SPI implementation, the slave device cannot initiate a transmission, which creates the unfortunate need to have buffers pre-loaded and ready whenever the master starts sending. In a CAN implementation, there are no masters or slaves. Every node can start transmitting whenever the bus is idle. So our lower-end nodes can send data they collect when they collect it and don’t have to coordinate with the head. To handle the case where multiple nodes start transmitting at the same time, the CAN protocol has built-in arbitration: the message with the lowest ID wins, and the other nodes back off and retry once the bus is free, without corrupting the winning message.
- Multi-layer architecture. Unlike SPI, in terms of the OSI Model, the CAN bus protocol is defined not just on the lowest, physical layer but also on the next-level-up, data-link layer. This makes CAN more complex, but also more robust, as a bunch of very smart engineers figured out how to build solid fault tolerance into the protocol.
- Built-in CRC and ACK. CAN controllers compute and check a CRC in hardware, which frees the software from having to make that calculation. The hardware ACK, however, only tells the sender that at least one node on the bus received the frame intact, so a software acknowledgement is still required if the sending node needs to be certain that the intended receiving node correctly received the data. (Running a robot, you absolutely want to make sure that if you tell the robot to stop, the robot will stop.) Software ACK packets can be shorter than the command packets they’re acknowledging, though, so they don’t necessarily double the amount of data being sent.
- Built-in hardware filters. CAN is a “mailbox” structured protocol. Message IDs are based on the type of data within the message, not on the ID of the transmitting or receiving node. In a complex system, most of the nodes will want to ignore most of the messages being passed on the bus. Rather than requiring the firmware of each node to parse every message to determine whether it’s one in which it is interested, the hardware has built-in filters. These filters can be set up to pass individual message IDs, or a range of IDs. In either case, it can significantly reduce the amount of time each node must spend processing messages. For real-time systems such as motor controllers, this allows the node to spend more time running the motors and less time parsing messages on the bus.
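Three of the mechanisms in this list are easier to picture with a little code: reading a bit from the differential pair, bitwise arbitration between nodes, and mask-based message filtering. The Python sketch below models each one. It is a simplified illustration under stated assumptions, not how any particular CAN controller is implemented: the voltages are typical example values for a 5 V bus, the IDs are assumed to be 11 bits wide, and the mask/ID filter is one common scheme among several.

```python
def decode_bit(can_hi: float, can_lo: float, threshold: float = 1.5) -> int:
    """Return 0 (dominant) if the two lines are far apart, else 1 (recessive)."""
    return 0 if (can_hi - can_lo) >= threshold else 1


def arbitrate(ids, id_bits: int = 11) -> int:
    """Return the message ID that wins arbitration among simultaneous senders.

    Expects at least one ID. A dominant 0 on the bus overrides a recessive 1.
    """
    contenders = set(ids)
    for bit in range(id_bits - 1, -1, -1):                 # IDs go out MSB first
        bus = min((i >> bit) & 1 for i in contenders)      # any 0 pulls the bus to 0
        # A node that sent recessive (1) but reads dominant (0) backs off.
        contenders = {i for i in contenders if ((i >> bit) & 1) == bus}
    (winner,) = contenders
    return winner


def accepts(msg_id: int, filter_id: int, mask: int) -> bool:
    """Mask/ID acceptance filter: only bits set in the mask are compared."""
    return (msg_id & mask) == (filter_id & mask)


# Assumed example voltages: recessive at 2.5 V / 2.5 V, dominant at 3.5 V / 1.5 V.
noise = 0.8   # the same spike lands on both wires
print(decode_bit(2.5 + noise, 2.5 + noise), decode_bit(3.5 + noise, 1.5 + noise))

print(hex(arbitrate([0x200, 0x123, 0x1FF])))   # lowest ID wins: 0x123

# One filter entry passing the whole range 0x120 to 0x12F:
print(accepts(0x12A, 0x120, 0x7F0), accepts(0x130, 0x120, 0x7F0))
```

The arbitration loop is the reason message IDs double as priorities: an urgent message such as a stop command would normally be given a low ID so that it wins the bus against routine sensor traffic.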
In terms of the architecture of the system, CAN allows some flexibility, and has the following advantages:
- Reduced wire count: The CAN implementation can have many nodes on the bus and use only two wires (three if you include the GND line). SPI requires at least 2 more wires, and that number increases linearly as nodes get added to the bus.
- No daisy chaining: A SPI implementation can only keep our wire count low (4 wires) if daisy-chaining is employed. CAN avoids daisy-chaining.
- Ease of adding nodes: One feature of CAN that’s really helpful is the ability to inject new nodes onto the bus without any significant effort. Combined with CAN’s immunity to noise over longer distances, we can take advantage of this and put nodes right next to the devices they’re meant to control. This allows us to reduce the length of sensitive communication wires, reducing crosstalk, and making the system more reliable as a whole.
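The wire-count argument is simple arithmetic, sketched here in Python. The helper names are made up for this example, and ground is left out of both counts so that the comparison is like for like.

```python
def spi_wires(slaves: int) -> int:
    """Shared MOSI, MISO and SCLK, plus one CS/SS line per slave."""
    return 3 + slaves


def can_wires(nodes: int) -> int:
    """CAN HI and CAN LO, no matter how many nodes share the bus."""
    return 2


for slaves in (1, 2, 5):
    nodes = slaves + 1   # the CAN bus also counts the main board as a node
    print(slaves, spi_wires(slaves), can_wires(nodes))
```

With a single peripheral, SPI already needs two more wires than CAN (4 versus 2), and the gap widens by one for every board added without daisy-chaining.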
Testing the Protocols
Below are the testing procedures and the results.
Test 1: Sending commands from the main node to the sensor and motor control nodes
Every 30 ms, the main processor sent commands to the 5 torso motors (3 head and 2 arm), the drive system, and the chest NeoPixel LED. Consecutive commands were spaced 1 ms apart, which ensured the receiving nodes had time to process the prior command.
Over the course of the test, a total of 200 commands were sent to each system: 200 for each torso motor, 200 drive, and 200 NeoPixel. Each of these commands was followed by a software acknowledgment back up to the main processor.
In addition, one “Start Data” message at the beginning and one “Stop Data” message at the end were issued. These messages instruct the sensor and motor control nodes to begin and to stop transmitting data messages to the main node. The start and stop data messages do not have a software acknowledgment, as the data itself serves that purpose.
At the end of the test a “halt” command was issued, stopping every motor.
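To get a feel for the size of this test, the numbers above can be tallied with a few lines of Python. This is back-of-the-envelope arithmetic only. It assumes each 30 ms round contains exactly one command per target, and that the 1 ms spacing is measured from the start of one command to the start of the next.

```python
PERIOD_MS = 30        # a new round of commands starts every 30 ms
SPACING_MS = 1        # gap between consecutive commands in a round
ROUNDS = 200          # each target received 200 commands

TORSO_MOTORS = 5      # 3 head + 2 arm
targets_per_round = TORSO_MOTORS + 1 + 1    # plus the drive system and the NeoPixel

total_commands = targets_per_round * ROUNDS
total_acks = total_commands                 # one software ACK per command
burst_ms = (targets_per_round - 1) * SPACING_MS   # first to last command in a round
idle_ms = PERIOD_MS - burst_ms              # quiet time left in each round
duration_s = ROUNDS * PERIOD_MS / 1000      # approximate length of one run

print(total_commands, total_acks, burst_ms, idle_ms, duration_s)
```

Under those assumptions a run comes to 1,400 acknowledged commands over roughly 6 seconds, plus the unacknowledged “Start Data” and “Stop Data” messages and the final “halt”.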
Test 2: Sending data from sensor and motor control nodes to the main node
Our second test evaluated the reliability of data transmission in the opposite direction. Transmissions were sent from the time-of-flight sensors, accelerometer, IMU (for the composite Euler data), and the distance encoders.
The distance encoder data is the one area in which the testing varied between SPI and CAN. For the CAN test, “Distance Encoder” data was transmitted from the motor control board directly up to the main processor. With the SPI test, however, data from the motor control board needed to be passed through the sensor node in order to get to the main processor, because the nodes were daisy-chained. That pass-through was never written into the firmware, so on the SPI implementation the “Distance Encoder” data did not reach the main processor. This isn’t a failure of SPI in general, so it shouldn’t be considered a negative against SPI.
All data from the sensor board was transmitted to the main processor in both cases.
Results
The SPI and CAN tests were each run at least half a dozen times; the 3 best runs were selected and their results averaged. Note that because both torso and drive motors are commanded from the same node, those two message types are not distinguished by that node and thus the received amounts are combined. Results are displayed in the tables below.
Based on the results above, it’s pretty clear that CAN has terrific reliability. CAN successfully received 100% of the messages sent in either direction. With SPI, only 87.7% of the command messages sent from the main processor to the sensor and motor control nodes were successful. Additionally, with SPI, the data going in the other direction (from the sensor node to the main processor) had an even lower success rate: approximately 80% of the time-of-flight sensor data packets and 50% of the accelerometer and gyroscope data packets were received.
It’s beyond the scope of this article to talk about why data being sent to the main node performed so much worse than data being sent from the main node, other than to say that it had much to do with the problem of synchronizing the microcontroller SPI slave with the master. Hardening the firmware would have bumped the success rate closer to that observed with the downward-flowing commands.
We should also mention that SPI had much closer to 100% reliability without the motors running. However, we wanted the tests to use running motors, as this was much more indicative of real-world operation conditions.
Finally, we should note that with the implementation of software ACKs and retransmits, SPI could be made to be 100% reliable. It would be doable, but some effort would be required, and there’s still a risk of getting into a state in which several commands in a row are dropped and numerous retries are required before the robot responds.
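For a sense of what "software ACKs and retransmits" involves, here is a minimal Python sketch of the sending side. It is not the robot’s firmware. `ScriptedLink` is a made-up stand-in for a lossy link whose outcomes are scripted so the example is repeatable, and the retry limit is an arbitrary choice.

```python
class ScriptedLink:
    """Stand-in for a lossy link: each entry says whether one attempt is ACKed."""

    def __init__(self, outcomes):
        self._outcomes = iter(outcomes)
        self.attempts = 0

    def send_and_wait_for_ack(self, command: bytes) -> bool:
        self.attempts += 1
        return next(self._outcomes, False)   # script exhausted: treat as lost


def send_reliably(link, command: bytes, max_retries: int = 3) -> bool:
    """Send a command, retransmitting until an ACK arrives or retries run out."""
    for _ in range(1 + max_retries):
        if link.send_and_wait_for_ack(command):
            return True
    return False   # the caller still has to decide what to do, e.g. stop the motors


def chance_all_fail(loss_rate: float, attempts: int) -> float:
    """Probability that every attempt fails, ASSUMING losses are independent."""
    return loss_rate ** attempts


link = ScriptedLink([False, False, True])        # two drops, then success
print(send_reliably(link, b"stop"), link.attempts)   # prints: True 3

# 87.7% command success means roughly a 12.3% loss per attempt.
print(chance_all_fail(1 - 0.877, attempts=4))
```

The last line is optimistic on two counts: it ignores ACKs that get lost on the way back, and it treats losses as independent. The second assumption is exactly what breaks down when motor noise causes several drops in a row, which is the risk described above.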
Conclusion
The results are in. SPI has greater bandwidth, but higher bandwidth isn’t required for our situation. Native support also means SPI is cheaper. However, the problem of synchronizing the master and slave nodes created significant reliability issues that might be solvable, but could continue to cause problems and extra work.
CAN, while being more costly due to the additional hardware requirements, has superior results in reliability, can handle the data rate required for the amount of data that needs to be passed around, is built for noisy environments and long wire runs, keeps the number of wires passing through the robot to a minimum, and reduces the amount of time the firmware must spend on the communication process.
CAN hits the sweet spot, being both state of the art and proven/reliable. For a long time CAN was limited to expensive industrial use, but now Misty gets to be on the leading edge of bringing this robust protocol into consumer electronics.