Not in Love with the GNU Radio Scheduler (Part 2)

06 June 2016 GNU Radio / Scheduler

This is the second post about my current work on the GNU Radio scheduler. The first post was mainly about shutting down run to completion flow graphs, often used in simulations and unit tests. The patch set, which also includes the changes proposed in this post, is still under review.

Blocking Wait is Always Bad

The most important change is the removal of the delete_head_blocking() function that could be used to block while waiting for messages. At least since the introduction of the system port, blocking wait is always a bad idea. If a block is waiting for a message on one port, it is deaf to any other notifications and it’s not possible to, for example, shut it down by sending a done message to its system port.

The only reason this function was introduced, was to work around shortcomings of the scheduler. The problem is that the scheduler does not support blocks with message inputs and stream outputs. Currently, it’s not possible to schedule these blocks only when a message has arrived.

Instead, busy-waiting has to be applied, wasting one CPU core for every block that has to check for messages in the work function; like PDU to Tagged Stream block, for example. See also this discussion on the GNU Radio mailing list from 2013.

To minimize the effects of busy-waiting, delete_head_blocking() was extended with a timeout. It doesn’t solve the problem, but allowed something like throttled busy-waiting.

Also the code acknowledges the problem in the pdu_to_tagged_stream block, for example.

/* FIXME: This blocking call is far from ideal but is the best we
 *        can do at the moment
 */
pmt::pmt_t msg(delete_head_blocking(PDU_PORT_ID, 100));
  if (msg.get() == NULL) {
    return 0;
}

This problem is what is supposed to be addressed by the current patch set. It removes the delete_head_blocking() function and extends the scheduler to call blocks only if samples or messages are available. That means that the block’s work function checks for messages and either processes them directly or returns immediately without blocking.

The only place to block is the scheduler, since it allows to wait for samples, output buffer space, and messages all at once.

Thread Safety

Another problem is thread safety. The tpb_thread_body often checks whether new messages are available using code like:

// wait for input or message
if(!d->d_tpb.input_changed && block->empty_handled_p()) {
  d->d_tpb.input_cond.wait(guard);
}

The point is that, as far as I see, neither the tpb_thread_body nor empty_handled_p() acquire the lock of the block, which is supposed to serialize access to the queues. This might cause problems that happen rarely and are, therefore, hard to reproduce and debug. See the PMT Smasher bug, for example.

The patches remove this check and use the readily available input_changed variable to signal that something new is available. This can now be either samples or message.

set_done – Shut Down Just a Bit

set_done() is a function that I find really confusing. At first, I thought it is to shutdown a flow graph, but that involves:

Marking input and output buffers as done
Notifying stream neighbors to check their buffers
Notifying message neighbors by sending done to their system ports

set_done does just the first step; but, in addition, sets the block’s member variable done to true and hopes the scheduler will perform the remaining two steps. The signaling through the variable is even duplicate/redundant, as the block_executor also returns done and, thus, already informs the scheduler to shut down.

 were_done:
   LOG(*d_log << "  were_done\n");
   d->set_done (true);
   return DONE;

It looks like this code already led to confusion, as the first thing that the block_executor does is check if done is already set.

if(d->done()){
  assert(0);
  return DONE;
}

Misusing set_done

Currently, set_done can also be exploited to terminate flow graphs in message port handlers using

detail().get()->set_done(true);

This is supported by the scheduler through:

case block_executor::BLKD_IN:        // Wait for input.
{
  gr::thread::scoped_lock guard(d->d_tpb.mutex);
  while(!d->d_tpb.input_changed) {

    [...]

    // handle all pending messages
    BOOST_FOREACH(basic_block::msg_queue_map_t::value_type &i, block->msg_queue) {
    [...]
    }
    if (d->done()) {
      return;
    }
  }
}

Note that this also does not shutdown message neighbors and, therefore, is a bad idea either way. The possibility to shut down block like this is removed with the proposed patches. If a block wants to terminate, it can send itself a done message:

post(pmt::mp("system"), pmt::cons(pmt::mp("done"), pmt::PMT_T));

Fun Fact: Lisp Nostalgia

The impression that the scheduler is a special piece of code is underlined by its naming conventions. It is full of functions like cons, car, cdr, sink_p, empty_handled_p, which all are Lisp-style names that are—let’s say—uncommon nowadays.

I wonder how many of developers know where cons and car come from; or that sink_p returns the predicate that describes whether the block is a sink. I guess most would simply call it is_sink.