GNU Radio / OS X / ZigBee

Yesterday, I started off fixing the unit tests of the GNU Radio ZigBee module, which had been broken since the CSS PHY got merged. However, when I opened the flow graphs in GNU Radio Companion, I soon realized that the module didn’t work at all on my Mac. (Whoot?!?)

I immediately checked on a Linux machine and—luckily—it worked perfectly fine. The symptoms on OS X were rather strange: I started the flow graph in GRC and absolutely nothing happened. It looked like it would run, only that no window appeared and no error message was shown.

At first, I thought that GRC was hiding some errors, so I started the flow graph from a terminal, but it showed the exact same behaviour.

I went on with trying the unit tests and—strangely—some of them worked. Therefore, the problem seemed to be related to specific blocks.

I spent some hours in GRC enabling and disabling blocks, but without success. At some point, I started a new flow graph and realized that it worked as soon as I got rid of WX (one of the GUI toolkits that GNU Radio uses).

OK, so one step back. The module works on Linux, but causes a deadlock on OS X. Moreover, the deadlock occurs when the module is loaded, not when a certain block is instantiated, and only in conjunction with WX. (Funnily enough, the module doesn’t contain any graphical blocks.)

For me, that was a clear indicator that I had hit one of those nasty OS X bugs where an application is linked against incompatible libraries, i.e., I linked against something that did not work well with WX. In the past, I found that this can be hard to get right, since there are native libraries, libraries that are compiled with homebrew, and libraries that I compiled myself; when they are all linked into one binary, strange things may happen.

Since I hadn’t used the module for about 2 months, I had no idea where to start looking, so I gave gdb a try. The backtraces were, however, not really helpful.

(gdb) info threads
  Id   Target Id         Frame
  7    Thread 0x1813 of process 18215 0x00007fff8f6fc386 in mach_msg_trap () from /usr/lib/system/libsystem_kernel.dylib
  6    Thread 0x1703 of process 18215 0x00007fff8f7026de in __workq_kernreturn () from /usr/lib/system/libsystem_kernel.dylib
  5    Thread 0x1603 of process 18215 0x00007fff8f7026de in __workq_kernreturn () from /usr/lib/system/libsystem_kernel.dylib
  4    Thread 0x1503 of process 18215 0x00007fff8f7026de in __workq_kernreturn () from /usr/lib/system/libsystem_kernel.dylib
  3    Thread 0x1403 of process 18215 0x00007fff8f702ff6 in kevent_qos () from /usr/lib/system/libsystem_kernel.dylib
  2    Thread 0x1247 of process 18215 0x00007fff8f7026de in __workq_kernreturn () from /usr/lib/system/libsystem_kernel.dylib
* 1    Thread 0x1103 of process 18215 0x00007fff8f6fc386 in mach_msg_trap () from /usr/lib/system/libsystem_kernel.dylib

(gdb) bt
#0  0x00007fff8f6fc386 in mach_msg_trap () from /usr/lib/system/libsystem_kernel.dylib
#1  0x00007fff8f6fb7c7 in mach_msg () from /usr/lib/system/libsystem_kernel.dylib
#2  0x00007fff9b881624 in __CFRunLoopServiceMachPort () from /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
#3  0x00007fff9b880aec in __CFRunLoopRun () from /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
#4  0x00007fff9b880338 in CFRunLoopRunSpecific () from /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
#5  0x00007fff9331b935 in RunCurrentEventLoopInMode () from /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox
#6  0x00007fff9331b76f in ReceiveNextEventCommon () from /System/Library/Frameworks/Carbon.framework/Versions/A/Frameworks/HIToolbox.framework/Versions/A/HIToolbox
#7  0x00007fff9331b5af in _BlockUntilNextEventMatchingListInModeWithFilter ()
[...]

According to Google, this call stack is perfectly normal: the main thread is sitting in its event loop, waiting for something to happen. Since nothing ever happens, it is a classic deadlock.

OK, so maybe I linked against the wrong libraries… My WiFi module worked, so I compared the two with otool:

basti@tronn ~/usr/lib >> otool -L libgnuradio-ieee802_11.dylib
libgnuradio-ieee802_11.dylib:
        /Users/basti/usr/lib/libgnuradio-ieee802_11.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_filesystem-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_system-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-runtime.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-pmt.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-digital.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-fft.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-filter.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/homebrew/lib/libitpp.8.dylib (compatibility version 8.0.0, current version 8.2.1)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1226.10.1)
basti@tronn ~/usr/lib >> otool -L libgnuradio-ieee802_15_4.dylib
libgnuradio-ieee802_15_4.dylib:
        /Users/basti/usr/lib/libgnuradio-ieee802_15_4.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_filesystem-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_system-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_thread-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-runtime.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-pmt.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-blocks.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libvolk.1.1git.dylib (compatibility version 1.1.0, current version 0.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1226.10.1)

Looks good so far. However, since ZigBee depends on Boost threads and VOLK, I suspected these libraries and removed the blocks that depended on them. I recompiled and checked, but no luck.
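The comparison of the two otool listings can also be done mechanically. Here is a minimal sketch in Python, assuming the `otool -L` output has been captured as text. The two samples are abbreviated from the listings above, and `linked_libs` is a helper name made up for this illustration.

```python
import posixpath

# Abbreviated samples taken from the otool listings in this post.
WIFI = """
libgnuradio-ieee802_11.dylib:
        /Users/basti/usr/lib/libgnuradio-ieee802_11.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_system-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-runtime.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
"""

ZIGBEE = """
libgnuradio-ieee802_15_4.dylib:
        /Users/basti/usr/lib/libgnuradio-ieee802_15_4.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_system-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/homebrew/opt/boost/lib/libboost_thread-mt.dylib (compatibility version 0.0.0, current version 0.0.0)
        /Users/basti/usr/lib/libgnuradio-runtime.3.8git.dylib (compatibility version 3.8.0, current version 0.0.0)
        /Users/basti/usr/lib/libvolk.1.1git.dylib (compatibility version 1.1.0, current version 0.0.0)
        /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 120.1.0)
"""


def linked_libs(otool_output):
    """Return the base names of the dependencies listed by `otool -L`."""
    lines = otool_output.strip().splitlines()
    libs = set()
    # lines[0] is "<file>:" and lines[1] is the library's own install name,
    # so the actual dependencies start at index 2.
    for line in lines[2:]:
        path = line.strip().split(" (compatibility", 1)[0]
        libs.add(posixpath.basename(path))
    return libs


# Libraries that only the ZigBee module links against.
only_zigbee = linked_libs(ZIGBEE) - linked_libs(WIFI)
print(sorted(only_zigbee))
```

A difference like this only shows which libraries differ, not which one misbehaves, so it is a starting point for suspicion at best.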

Next, I copied the cmake modules from the working WiFi module and diffed the CMakeCache.txt files to make sure that both modules use the same compiler with the very same configuration. No luck.
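For the cache comparison, a plain diff works, but the files are long and noisy. A small sketch of a key-by-key comparison could look like this. It assumes the usual `KEY:TYPE=VALUE` layout of CMakeCache.txt; the two cache snippets and the helper names are invented for illustration and are not taken from my actual build.

```python
def parse_cmake_cache(text):
    """Parse CMakeCache.txt content into a {key: value} dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the two comment styles used in the cache.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        key = key_and_type.split(":", 1)[0]
        entries[key] = value
    return entries


def diff_caches(cache_a, cache_b):
    """Return {key: (value_a, value_b)} for every entry that differs."""
    keys = sorted(set(cache_a) | set(cache_b))
    return {
        k: (cache_a.get(k), cache_b.get(k))
        for k in keys
        if cache_a.get(k) != cache_b.get(k)
    }


# Made-up example content; real caches contain hundreds of entries.
CACHE_WIFI = """
// CXX compiler
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
CMAKE_BUILD_TYPE:STRING=Release
"""

CACHE_ZIGBEE = """
# hypothetical second cache
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
CMAKE_BUILD_TYPE:STRING=Debug
ENABLE_DOXYGEN:BOOL=ON
"""

differences = diff_caches(parse_cmake_cache(CACHE_WIFI),
                          parse_cmake_cache(CACHE_ZIGBEE))
for key, (a, b) in differences.items():
    print(key, a, b)
```

An entry that is missing on one side shows up as `None`, which makes it easy to spot options that were only set for one of the modules.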

Next escalation step: I diffed the whole module directories. meld is a great tool here.

There were some differences and I adapted the ZigBee module accordingly, but—again—no luck.

[beer, sleep, smashing my head against the wall…]

Final escalation step: I deleted all blocks except for one, removed all dependencies from the cmake files, and deleted all apps, examples, utils, and Python files. After resetting it to a module skeleton, it finally worked.

Step by step, I started to put the parts back. It deadlocked again when I added the Python files that implement the hierarchical blocks for the CSS PHY. Wooot!?!?!

Looking at those files, I found that they start with

import css_constants
import css_phy
import numpy as np
import matplotlib.pyplot as plt

Wait. Matplotlib? WTF?!?!

Of course, it was not used; it was apparently a leftover from when the block was implemented. I deleted the useless import and, tatatadaaa, it worked again.
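A leftover import like this could have been caught automatically. The following is a rough sketch using only Python's `ast` module; it is not what I actually ran. The sample source mimics the import lines above, while the two lines that use the imports (`n_sub`, `modulator`) are hypothetical and only there to make the example complete.

```python
import ast


def unused_imports(source):
    """Return the names bound by import statements that are never referenced."""
    tree = ast.parse(source)
    imported = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports cannot be checked this way
                # "import a.b" binds "a"; "import a.b as c" binds "c".
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)


# The imports are the ones from the post; the usage lines are hypothetical.
SAMPLE = """
import css_constants
import css_phy
import numpy as np
import matplotlib.pyplot as plt

samples = np.zeros(css_constants.n_sub)
mod = css_phy.modulator()
"""

print(unused_imports(SAMPLE))
```

This is deliberately naive (it ignores scoping and `__all__`), so a proper linter is the better tool for real projects.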

The fix is already online.

OK, so what happened? It looks like my default matplotlib backend on OS X is based on WX. So when the ZigBee module was loaded, it imported matplotlib (for no reason), which blocked WX and, thus, the GUI of the flow graph, resulting in a deadlock.
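To see which backend matplotlib picks on a given machine, something like the following sketch can help. It relies on `matplotlib.get_backend()` and `matplotlib.use()`; the helper names are made up, and whether "Agg" is a suitable replacement depends on the setup, so treat this as an assumption rather than a recommendation.

```python
def current_backend():
    """Return the name of matplotlib's configured backend, or None if
    matplotlib is not installed."""
    try:
        import matplotlib
    except ImportError:
        return None
    return matplotlib.get_backend()


def use_headless_backend():
    """Select the non-GUI "Agg" backend; call this before importing pyplot.

    Returns False if matplotlib is not installed.
    """
    try:
        import matplotlib
    except ImportError:
        return False
    matplotlib.use("Agg")
    return True


backend = current_backend()
print("matplotlib backend:", backend)
```

Depending on the matplotlib version, querying the backend may itself trigger the backend selection, so it is best to run this in a separate interpreter and not inside the flow graph. For code that really needs plotting next to a WX GUI, choosing a non-GUI backend first might avoid the clash; in my case, simply removing the import was enough.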

36 hours later; another lesson learned :-/

Ah, btw, I also fixed the unit tests.