I make thousands of mistakes a day, mistakes typing, mistakes coding software, mistakes driving, mistakes walking, forgetting to order my sandwich without mayo, etc. Most of the time they are immediately obvious – a red squiggly line under a word I mistyped, a compiler spewing an error message on line #42, a stubbed toe, my gps suggesting a u-turn at the next intersection, etc.
But what happens when the mistake isn’t obvious, isn’t noticed immediately, and doesn’t cause everything around me to immediately fail? Often these mistakes can have a long lifespan. Often we discover them when we are looking for something else.
Mistakes from the Trenches.
I wanted to write about a few subtle unnoticed mistakes that lurked in the AuraUAS code for quite some time.
Temperature Calibration #Fail
AuraUAS has a really cool capability where it can estimate the bias (error) of the accelerometers during flight. The 15-state EKF does this as part of it’s larger task of estimating the aircraft’s attitude, location, and velocity. These bias estimates along with the corresponding IMU temperature can be used to build up a temperature calibration fit for each specific IMU based on flight data over time. The more you fly in different temperature conditions, the better your temperature calibration becomes. Sweet! Calibrated accelerometers are important because accel calibration errors directly translate to errors in initial roll and pitch estimates (like during launch or take off where these values can be critical.) Ok, the EKF will sort them out once in the air, because that is a cool feature of the EKF, but it can’t work out the errors until after flying a bit.
The bias estimates and temperature calibration fit are handled by post-flight python scripts that work with the logged flight data. Question: should I log the raw accel values or should I log the calibrated accel values. I decided I should log the calibrated values and then use the inverse calibration fit function to derive the original raw values after the flight. Then I use these raw values to estimate the bias (errors), add the new data to the total collection of data for this particular IMU, and revise the calibration fit. The most straightforward path is to log calibrated values on board during flight (in real time) and push the complicated stuff off into post processing.
However, I made a slight typo in the property name of the temperature range limits for the fit (we only fit within the range of temperatures we have flight data for.) This means the on-board accel correction was forcing the temperature to 27C (ignoring the actual IMU temperature.) However, when backing out the raw values in post processing, I was using the correct IMU temperature and thus arriving at a wrong raw value. What a mess. That means a year of calibration flight data is basically useless and I have to start all my IMU calibration learning over from scratch. So I fixed the problem and we go forward from here with future flights producing a correct calibration.
Integer Mapping #Fail
This one is subtle. It didn’t produce incorrect values, it simply reduced the resolution of the IMU gyros by a factor of 4 and the accels by a factor of 2.
Years ago when I first created the apm2-sensor firmware – that converts a stock APM2 (atmega2560) board into a pure sensor head – I decide to change the configured range of the gyros and accels. Instead of +/-2000 degrees per second, I set the gyros for +/-500 degrees per second. Instead of +/-8 g’s on the accels, I set them for +/- 4 g’s. The sensed values get mapped to a 16 bit integer, so using a smaller range results in more resolution.
The APM2 reads the raw 16 bit integer values from the IMU and converts this to radians per second. However, when the APM2 sends these values to the host, it re-encodes them from a 4-byte float to a 2-byte (16-bit) integer to conserve bandwidth. Essentially this undoes the original decoding operation to efficiently transmit the values to the host system. The host reads the encoded integer value and reconverts it into radians per second for the gyros (or mps^2 for the accels.)
The problem was that for encoding and decoding between the APM2 and the host, I used the original scaling factor for +/-2000 dps and +/-8g, not the correct scaling factor for the new range I had configured. This mistake caused me to lose all the resolution I intended to gain. Because the system produced the correct values on the other end, I didn’t notice this problem until someone asked me exactly what resolution the system produced, which sent me digging under the hood to refresh my memory.
This is now fixed in apm2-sensors v2.52, but requires a change to the host software as well so the encoding and decoding math agrees. Now the IMU reports the gyro rates with a resolution of 0.015 degrees per second where as previously the resolution was 0.061 degrees per second. Both are actually pretty good, but it pained me to discover I was throwing away resolution needlessly.
This one is also very subtle; timing issues often are. In the architecture of the AuraUAS flight controller there is an APM2 spitting out new sensor data at precisely 100 hz. The host is a beaglebone (or any linux computer) running it’s own precise 100 hz main loop. The whole system runs at 100 hz throughput and life is great – or so I thought.
I had been logging flight data at 25hz which has always been fine for my own needs. But recently I had a request to log the flight data at the full 100 hz rate. Could the beaglebone handle this? The answer is yes, of course, and without any trouble at all.
A question came up about logging high rate data on the well known PX4, so we had a student configure the PX4 for different rates and then plot out the time slice for each sample. We were surprised at the huge variations in the data intervals, ranging from way too fast, to way too slow, and rarely exactly what we asked for.
I know that the AuraUAS system runs at exactly 100hz because I’ve been very careful to design it that way. Somewhat smugly I pulled up a 100hz data set and plotted out the time intervals for each IMU record. The plot surprised me – my timings were all over the map and not much better than the PX4. What was going on?
I took a closer look at the IMU records and noticed something interesting. Even though my main loop was running precisely and consistently at 100 hz, it appeared that my system was often skipping every other IMU record. AuraUAS is designed to read whatever sensor data is available at the start of each main loop iteration and then jump into the remaining processing steps. Because the APM2 runs it’s own loop timing separate from the host linux system, the timing between sending and receiving (and uart transferring) can be misalligned so that when the host is ready to read sensor data, there might not be any yet, and next time there may be 2 records waiting. It is subtle, but communication between to free running processor loops can lead to issues like this. The end result is usually still ok, the EKF handles variable dt just fine, the average processing rate maybe drops to 50hz, and that’s still just fine for flying an airplane around the sky … no big deal right? And it’s really not that big of a deal for getting the airplane from point A to point B, but if you want to do some analysis of the flight data and want high resolution, then you do have a big problem.
What is the fix? There are many ways to handle timing issues in threaded and distributed systems. But you have to be very careful, often what you get out of your system is not what you expected or intended. In this case I have amended my host system’s main loop structure to throw away it’s own free running main loop. I have modified the APM2 data output routine to send the IMU packet at the end of each frame’s output to mark the end of data. Now the main loop on the host system reads sensor data until it receives an IMU packet. Then and only then does it drop through to the remaining processing steps. This way the timing of the system is controlled precisely by the APM2, the host system’s main loop logic is greatly simplified, and the per frame timing is far more consistent … but not consistent enough.
The second thing I did was to include the APM2 timestamp with each IMU record. This is a very stable, consistent, and accurate timestamp, but it counts up from a different starting point than the host. On the host side I can measure the difference between host clock and APM2 clock, low pass filter the difference and add this filtered difference back to the APM2 timestamp. The result is a pretty consistent value in the host’s frame of (time) reference.
Here is a before and after plot. The before plot is terrible! (But flies fine.) The after plot isn’t perfect, but might be about as good as it gets on a linux system. Notice the difference in Y-scale between the two plots. If you think your system is better than mine, log your IMU data at 100hz and plot the dt between samples and see for yourself. In the following plots, the Y axis is dt time in seconds. The X axis is elapsed run time in seconds.
dt using host's timestamp when imu packet received.
dt when using a hybrid of host and apm2 timestamps.
Even with this fix, I see the host system’s main loop timing vary between 0.008 and 0.012 seconds per frame, occasionally even worse (100hz should ideally equal exactly 0.010 seconds.) This is now far better than the system was doing previously, and far, far better than the PX4 does … but still not perfect. There is always more work to do!
These mistakes (when finally discoveded) all led to important improvements with the AuraUAS system: better accelerometer calibration, better gyro resolution, better time step consistency with no dropped frames. Will it help airplanes get from point A to point B more smoothly and more precisely? Probably not in any externally visible way. Mistakes? I still make them 1000’s of times a day. Lurking hidden mistakes? Yes, those too. My hope is that no matter what stage of life I find myself in, I’m always working for improvements, always vigilant to spot issues, and always focused on addressing issues when they are discovered.
2016-07-23 12:38:59 -0500 - Written by curt