In the last post, we set up an environment to conduct stable and reproducible performance measurements. This time, we’ll run some first benchmarks. (A more verbose description of the experiments is available in . The code and evaluation scripts of the measurements are on GitHub.)
The simplest and most accurate thing we can do is to measure the run time of a flowgraph when processing a given workload. It is the most accurate measure, since we do not have to modify the flowgraph or GNU Radio to introduce measurement probes.
The problem with benchmarking a real-time signal processing system is that monitoring inevitably changes the system and, therefore, its behavior. It’s like the system behaves differently when you look at it. This is not an academic problem. Just recording a timestamp can introduce considerable overhead and change scheduling behavior.
In the first experiment, we consider flowgraph topologies like this.
They allow us to scale the number of pipes (i.e., parallel streams) and stages (i.e., blocks per pipe). All flowgraphs are created programmatically with a C++ application to avoid any possible impact from Python bindings. My system is based on Ubuntu 19.04 and runs GNU Radio 3.8, which was compiled with GCC 8.3 in release mode.
GNU Radio supports two connection types between blocks: a message passing interface and a ring buffer-based interface.
Message Passing Performance
We start with the message passing interface. Since we want to focus on the scheduler and not on DSP performance, we use PDU Filter blocks, which just forward messages through the flowgraph. Our messages, of course, do not contain the key that would be filtered out. (Update: And this is, of course, where I messed up in the initial version of the post.) I, therefore, created a custom block that just forwards messages through the flowgraph and added a debug switch to make sure that this doesn’t happen again.
Before starting the flowgraph, we create and enqueue a given number of messages in the first block.
Finally, we also enqueue a done message, which signals the scheduler to shut down the block.
We used 500-byte blobs, but since messages are forwarded through pointers, the type of the message does not have a sizeable impact.
The run time of top_block::run() looks like this:
Note that we scaled the number of pipes and stages jointly, i.e., an x-axis value of 100 corresponds to 10 pipes and 10 stages. The error bars indicate confidence intervals of the mean for a confidence level of 95%.
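For readers who want to reproduce the error bars, here is a rough sketch of how a 95% confidence interval of the mean can be computed from repeated run-time measurements. It assumes a normal approximation (z = 1.96); the post does not say which method the evaluation scripts use, and for few repetitions a Student's t quantile would be more appropriate. The names `MeanCi` and `mean_ci95` are made up for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical result type: sample mean and half-width of the interval,
// i.e., the interval is [mean - half_width, mean + half_width].
struct MeanCi {
    double mean;
    double half_width;
};

// 95% confidence interval of the mean under a normal approximation.
// Assumption: z = 1.96 is used instead of a Student's t quantile.
MeanCi mean_ci95(const std::vector<double>& samples)
{
    const std::size_t n = samples.size();
    if (n < 2) {
        throw std::invalid_argument("need at least two samples");
    }

    double sum = 0.0;
    for (double x : samples) {
        sum += x;
    }
    const double mean = sum / static_cast<double>(n);

    // Unbiased sample variance (divides by n - 1).
    double squared_dev = 0.0;
    for (double x : samples) {
        squared_dev += (x - mean) * (x - mean);
    }
    const double stddev = std::sqrt(squared_dev / static_cast<double>(n - 1));

    const double z = 1.96;
    return { mean, z * stddev / std::sqrt(static_cast<double>(n)) };
}
```

The half-width shrinks with the square root of the number of repetitions, so halving the error bars takes roughly four times as many runs.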
The good news is: GNU Radio scales rather well, even for a large number of threads; in this case, up to 400 threads on 4 CPU cores. While scaling is worse than linear, we were not able to reproduce the horrendous results presented at FOSDEM. These measurements were probably conducted with an earlier version of GNU Radio, containing a bug that caused several message passing blocks to busy-wait. With this bug, it is plausible that performance collapsed once the number of blocks exceeded the number of CPUs. We will investigate it further in the following posts.
We, furthermore, can see that the difference between normal and real-time priority is only marginal.
Not so for the buffer interface. Here, we used a Null Source followed by a Head block to pipe 100e6 4-byte floats into the flowgraph. To focus on the scheduler and not on DSP, we used Copy blocks in the pipes, which just copy the samples from the input to the output buffer. Each pipe was, furthermore, terminated by a Null Sink.
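To make the structure of this flowgraph concrete, the sketch below enumerates its connections as pairs of block names for a given number of pipes and stages. The exact wiring is an assumption (a single Head output fanned out to every pipe), and the function and block names are made up for illustration; this is not the benchmark code itself.

```cpp
#include <string>
#include <utility>
#include <vector>

// One connection from an upstream block to a downstream block.
using Edge = std::pair<std::string, std::string>;

// Hypothetical helper: list the connections of a flowgraph with `pipes`
// parallel streams of `stages` Copy blocks each.
// Assumed wiring: null_source -> head -> (copy x stages) -> null_sink,
// with the Head output feeding the first Copy block of every pipe.
std::vector<Edge> build_topology(int pipes, int stages)
{
    std::vector<Edge> edges;
    edges.emplace_back("null_source", "head");

    for (int p = 0; p < pipes; ++p) {
        std::string prev = "head";
        for (int s = 0; s < stages; ++s) {
            const std::string cur =
                "copy_" + std::to_string(p) + "_" + std::to_string(s);
            edges.emplace_back(prev, cur);
            prev = cur;
        }
        // Each pipe is terminated by its own sink.
        edges.emplace_back(prev, "null_sink_" + std::to_string(p));
    }
    return edges;
}
```

In an actual GNU Radio application, each edge would correspond to one `connect` call on the top block, and every block in the list would run in its own thread.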
As in the previous experiment, we scale the number of pipes and stages jointly and measure the run time.
Again, good news: the run time of the buffer interface scales linearly with the number of blocks. (Even though we will later see that there is quite some overhead from thread synchronization.)
What’s not visible from the plot is that, even with the performance CPU governor, the CPU might reduce its frequency due to thermal issues. So while the confidence intervals indicate that we have a good estimate of the mean, the underlying individual measurements were bimodal (or multimodal), depending on the CPU frequency.
What might be surprising is that real-time priority performs so much better. For 100 blocks (10 pipes and 10 stages), it is about 50% faster. Remember, we created a dedicated CPU set for the measurements to avoid any interference from the operating system. So there are no priority issues.
To understand what’s going on, we have to look into the Linux process scheduler. But that’s a topic for a later post.