In the last post, we set up an environment to conduct stable and reproducible performance measurements. This time, we’ll run some first benchmarks.
The simplest and most accurate thing we can do is to measure the run time of a flow graph when processing a given workload. It is the most accurate measure, since we do not have to modify the flow graph or GNU Radio to introduce measurement probes.
The problem with benchmarking a real-time signal processing system is that monitoring inevitably changes the system and, therefore, its behavior. It’s like the system behaves differently when you look at it. This is not an academic problem. Just recording a timestamp can introduce considerable overhead and change scheduling behavior.
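Measuring the total run time needs only two timestamps, both taken outside the flow graph: one before it starts and one after it finishes. A minimal sketch of that idea follows, using only the C++ standard library. The helper name time_run_seconds is made up for this example and is not part of GNU Radio.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not a GNU Radio API): returns the wall-clock duration
// of a single call to `work`, in seconds.
double time_run_seconds(const std::function<void()>& work)
{
    assert(static_cast<bool>(work)); // an empty std::function cannot be called

    // steady_clock is monotonic, so system clock adjustments cannot skew
    // the measured interval.
    const auto start = std::chrono::steady_clock::now();
    work(); // the workload runs untouched; no probes are added inside it
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

In the benchmarks, the callable would wrap the blocking run call of the top block, for example a lambda like [&] { tb->run(); } for a top block tb.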
In the first experiment, we consider flow graph topologies like this.
They let us scale the number of pipes (i.e., parallel streams) and stages (i.e., blocks per pipe). All flow graphs are created programmatically with a C++ application to avoid any possible impact from Python bindings. My system is based on Ubuntu 19.04 and runs GNU Radio 3.8, which was compiled with GCC 8.3 in release mode.
GNU Radio supports two connection types between blocks: a message passing interface and a ring buffer-based interface.
Message Passing Performance
We start with the message passing interface. Since we want to focus on the scheduler and not on DSP performance, we use PDU Filter blocks, which just forward messages through the flow graph. (Our messages, of course, do not contain the key that would be filtered out.)
Before starting the flow graph, we create and enqueue a burst of messages in the first block.
Finally, we also enqueue a done message, which signals the scheduler to shut down the block.
A burst size of zero, therefore, means that we only enqueue the shutdown message and measure the time it takes to propagate it through the flow graph.
We used 500-byte blobs, but since messages are forwarded by pointer, the type of the message does not have a sizeable impact.
The run time of top_block::run() looks like this:
Note that we scaled the number of pipes and stages jointly, i.e., an x-axis value of 100 corresponds to 10 pipes and 10 stages. The error bars indicate confidence intervals of the mean for a confidence level of 95%.
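For readers who want to compute similar error bars from their own run times, here is a minimal sketch. It assumes a normal approximation (z = 1.96) for the 95% level. That is an assumption of this example, not necessarily the method used for the plots; with only a few repetitions, a Student-t quantile would give a wider and more honest interval. The names MeanCI and mean_confidence_interval_95 are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct MeanCI {
    double mean;       // sample mean of the run times
    double half_width; // the interval is mean +/- half_width
};

// Hypothetical helper: 95% confidence interval of the mean, assuming a
// normal approximation (z = 1.96).
MeanCI mean_confidence_interval_95(const std::vector<double>& samples)
{
    assert(samples.size() >= 2); // the sample variance needs two values

    const double n = static_cast<double>(samples.size());

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / n;

    double squared_dev = 0.0;
    for (double s : samples) squared_dev += (s - mean) * (s - mean);

    // Unbiased sample standard deviation (divides by n - 1).
    const double stddev = std::sqrt(squared_dev / (n - 1.0));

    const double z = 1.96; // assumption: normal quantile for a 95% level
    return { mean, z * stddev / std::sqrt(n) };
}
```

Note that such an interval only describes the uncertainty of the mean. It says nothing about the shape of the underlying distribution of individual runs.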
The good news: GNU Radio scales linearly, even for a large number of threads; in this case, up to 400 threads on 4 CPU cores. In particular, we were not able to reproduce the horrendous results presented at FOSDEM. These measurements were probably conducted with an earlier version of GNU Radio, containing a bug that caused several message passing blocks to busy-wait. With this bug, it is plausible that performance collapsed once the number of blocks exceeded the number of CPUs.
The not so good news: The fact that the burst size has a relatively small impact on the result suggests that the overhead from thread synchronization dominates over the actual message processing. However, since we are just forwarding messages, this is also not unreasonable. We will investigate it further in the following posts.
While these measurements were conducted with real-time priority, we also tested normal priority but did not see a large difference.
Not so for the buffer interface. Here, we used a Null Source followed by a Head block to pipe 100e6 4-byte floats into the flow graph. To focus on the scheduler and not on DSP, we used Copy blocks in the pipes, which just copy the samples from the input to the output buffer. Each pipe was, furthermore, terminated by a Null Sink.
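To make the shape of this flow graph concrete, the following sketch lists its connections as plain name pairs. It does not use GNU Radio at all. The block names are invented labels, and it assumes that the output of the Head block fans out to the first Copy block of every pipe, which is one reading of the description above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using Edge = std::pair<std::string, std::string>; // (upstream, downstream)

// Hypothetical helper: enumerates the connections of a graph with `pipes`
// parallel chains of `stages` copy blocks each.
std::vector<Edge> buffer_topology(int pipes, int stages)
{
    assert(pipes >= 1 && stages >= 1);

    std::vector<Edge> edges;
    edges.emplace_back("null_source", "head");

    for (int p = 0; p < pipes; ++p) {
        // Assumption: the head block feeds the first block of every pipe.
        std::string prev = "head";
        for (int s = 0; s < stages; ++s) {
            const std::string cur =
                "copy_" + std::to_string(p) + "_" + std::to_string(s);
            edges.emplace_back(prev, cur);
            prev = cur;
        }
        // Each pipe is terminated by its own sink.
        edges.emplace_back(prev, "null_sink_" + std::to_string(p));
    }
    return edges;
}
```

With this layout, a graph has 1 + pipes * (stages + 1) connections, so 10 pipes and 10 stages give 111.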
As in the previous experiment, we scale the number of pipes and stages jointly and measure the run time of top_block::run().
Again, good news: the performance of the buffer interface also scales linearly with the number of blocks. (Even though we will later see that there is quite some overhead from thread synchronization.)
What’s not visible from the plot is that even with the CPU governor set to performance, the CPU might reduce its frequency due to heat issues. So while the confidence intervals indicate that we have a good estimate of the mean, the underlying individual measurements were bimodal (or multi-modal), depending on the CPU frequency.
What might be surprising is that real-time priority performs so much better. For 100 blocks (10 pipes and 10 stages), it is about 50% faster. Remember, we created a dedicated CPU set for the measurements to avoid any interference from the operating system. So there are no priority issues.
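For context, this is roughly what requesting real-time priority looks like at the operating system level. It is a plain Linux/POSIX sketch and not the mechanism GNU Radio uses internally. The helper name try_enable_realtime and the choice of the SCHED_FIFO policy are assumptions of this example.

```cpp
#include <cassert>
#include <sched.h>

// Hypothetical helper: tries to switch the caller to the SCHED_FIFO
// real-time policy with the given priority. Returns false if the kernel
// refuses, which typically happens when the process lacks the required
// privileges.
bool try_enable_realtime(int priority)
{
    const int lo = sched_get_priority_min(SCHED_FIFO);
    const int hi = sched_get_priority_max(SCHED_FIFO);
    assert(lo != -1 && hi != -1);             // policy must be supported
    assert(priority >= lo && priority <= hi); // priority must be in range

    sched_param param{};
    param.sched_priority = priority;

    // A pid of 0 selects the caller; on Linux this is the calling thread.
    return sched_setscheduler(0, SCHED_FIFO, &param) == 0;
}
```

A benchmark would call this once at startup and record whether it succeeded, since a silently failed request would make the real-time and normal-priority runs identical.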
To understand what’s going on, we have to look into the Linux process scheduler. But that’s a topic for a later post.