Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures.  RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone.  Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task).  In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL).  Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
sections of code with "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started **after** the beginning
of the wait do not extend the wait, despite running concurrently with
the updater.  This is the reason why RCU is more scalable than,
for example, reader-writer locks.  It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible?  The basic idea is to split updates in two phases,
"removal" and "reclamation".  During removal, we ensure that subsequent
readers will not be able to get a reference to the old data.  After
removal has completed, a critical section will not be able to access
the old data.  Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and these can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture::

        thread 1                  thread 2                  thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                    enter RCU crit.sec.
    exit RCU crit.sec             |                           |
                            complete wait                     |
                            begin reclamation phase           |
                                                       exit RCU crit.sec.


Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data.  This is possible because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.


RCU API
-------

The core RCU API is small:

``void rcu_read_lock(void);``
        Used by a reader to inform the reclaimer that the reader is
        entering an RCU read-side critical section.

``void rcu_read_unlock(void);``
        Used by a reader to inform the reclaimer that the reader is
        exiting an RCU read-side critical section.  Note that RCU
        read-side critical sections may be nested and/or overlapping.

``void synchronize_rcu(void);``
        Blocks until all pre-existing RCU read-side critical sections
        on all threads have completed.  This marks the end of the removal
        phase and the beginning of the reclamation phase.

        Note that it would be valid for another update to come while
        ``synchronize_rcu`` is running.  Because of this, it is better that
        the updater releases any locks it may hold before calling
        ``synchronize_rcu``.  If this is not possible (for example, because
        the updater is protected by the BQL), you can use ``call_rcu``.

``void call_rcu1(struct rcu_head *head, void (*func)(struct rcu_head *head));``
        This function invokes ``func(head)`` after all pre-existing RCU
        read-side critical sections on all threads have completed.  This
        marks the end of the removal phase, with ``func`` taking care of the
        reclamation phase asynchronously.

        The ``foo`` struct needs to have an ``rcu_head`` structure added,
        perhaps as follows::

            struct foo {
                struct rcu_head rcu;
                int a;
                char b;
                long c;
            };

        so that the reclaimer function can fetch the ``struct foo`` address
        and free it::

            call_rcu1(&foo.rcu, foo_reclaim);

            void foo_reclaim(struct rcu_head *rp)
            {
                struct foo *fp = container_of(rp, struct foo, rcu);
                g_free(fp);
            }

        ``call_rcu1`` is typically used via either the ``call_rcu`` or
        ``g_free_rcu`` macros, which handle the common case where the
        ``rcu_head`` member is the first field of the struct.

``void call_rcu(T *p, void (*func)(T *p), field-name);``
        If the ``struct rcu_head`` is the first field in the struct, you can
        use this macro instead of ``call_rcu1``.

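        For example, assuming the ``struct foo`` shown above (where ``rcu``
        is the first field), the ``call_rcu1`` call could equivalently be
        written as::

            call_rcu(&foo, foo_reclaim, rcu);
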
``void g_free_rcu(T *p, field-name);``
        This is a special-case version of ``call_rcu`` where the callback
        function is ``g_free``.
        In the example given in ``call_rcu1``, one could have written simply::

            g_free_rcu(&foo, rcu);

``typeof(*p) qatomic_rcu_read(p);``
        ``qatomic_rcu_read()`` is similar to ``qatomic_load_acquire()``, but
        it makes some assumptions on the code that calls it.  This allows a
        more optimized implementation.

        ``qatomic_rcu_read`` assumes that whenever a single RCU critical
        section reads multiple shared data, these reads are either
        data-dependent or need no ordering.  This is almost always the
        case when using RCU, because read-side critical sections typically
        navigate one or more pointers (the pointers that are changed on
        every update) until reaching a data structure of interest,
        and then read from there.

        RCU read-side critical sections must use ``qatomic_rcu_read()`` to
        read data, unless concurrent writes are prevented by another
        synchronization mechanism.

        Furthermore, RCU read-side critical sections should traverse the
        data structure in a single direction, opposite to the direction
        in which the updater initializes it.

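        For example, a reader might navigate an RCU-protected pointer like
        this (a sketch; ``struct cfg``, ``global_cfg`` and ``val`` are
        hypothetical names, not an existing API)::

            rcu_read_lock();
            struct cfg *c = qatomic_rcu_read(&global_cfg);
            val = c ? c->value : -1;   /* data-dependent load, ordered after
                                        * the pointer read above */
            rcu_read_unlock();
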
``void qatomic_rcu_set(p, typeof(*p) v);``
        ``qatomic_rcu_set()`` is similar to ``qatomic_store_release()``,
        though it also makes assumptions on the code that calls it in
        order to allow a more optimized implementation.

        In particular, ``qatomic_rcu_set()`` suffices for synchronization
        with readers, if the updater never mutates a field within a
        data item that is already accessible to readers.  This is the
        case when initializing a new copy of the RCU-protected data
        structure; just ensure that initialization of ``*p`` is carried out
        before ``qatomic_rcu_set()`` makes the data item visible to readers.
        If this rule is observed, writes will happen in the opposite
        order as reads in the RCU read-side critical sections (or if
        there is just one update), and there will be no need for another
        synchronization mechanism to coordinate the accesses.

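        For example, publishing a new copy of a data structure might look
        like this (a sketch; ``struct cfg`` and ``global_cfg`` are the same
        hypothetical names used above)::

            struct cfg *new_cfg = g_new0(struct cfg, 1);
            new_cfg->value = 42;                    /* initialize the copy... */
            qatomic_rcu_set(&global_cfg, new_cfg);  /* ...then publish it */
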
The following APIs must be used before RCU is used in a thread:

``void rcu_register_thread(void);``
        Mark a thread as taking part in the RCU mechanism.  Such a thread
        will have to report quiescent points regularly, either manually
        or through the ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

``void rcu_unregister_thread(void);``
        Mark a thread as not taking part anymore in the RCU mechanism.
        It is not a problem if such a thread reports quiescent points,
        either manually or by using the
        ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

Note that these APIs are relatively heavyweight, and should **not** be
nested.

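For example, a thread function might register itself on entry and
unregister before exiting (a minimal sketch; the thread body is
hypothetical)::

    static void *my_rcu_aware_thread(void *opaque)
    {
        rcu_register_thread();

        /* ... work that may enter RCU read-side critical sections ... */

        rcu_unregister_thread();
        return NULL;
    }
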
Convenience macros
------------------

Two macros are provided that automatically release the read lock at the
end of the scope.

``RCU_READ_LOCK_GUARD()``
        Takes the lock and will release it at the end of the block it's
        used in.

``WITH_RCU_READ_LOCK_GUARD() { code }``
        Is used at the head of a block to protect the code within the block.

Note that a ``goto`` out of the guarded block will also drop the lock.

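For example (a sketch; ``find_foo`` and the ``value`` field are hypothetical,
not an existing API)::

    int get_foo_value(int key)
    {
        RCU_READ_LOCK_GUARD();
        struct foo *p = find_foo(key);
        return p ? p->value : -1;   /* lock dropped when the function returns */
    }

or, to protect only part of a function::

    WITH_RCU_READ_LOCK_GUARD() {
        struct foo *p = find_foo(key);
        val = p ? p->value : -1;    /* lock dropped at the end of this block */
    }
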
Differences with Linux
----------------------

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section.  This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- ``qatomic_rcu_read`` and ``qatomic_rcu_set`` replace ``rcu_dereference`` and
  ``rcu_assign_pointer``.  They take a **pointer** to the variable being
  accessed.

- ``call_rcu`` is a macro that has an extra argument (the name of the first
  field in the struct, which must be a ``struct rcu_head``), and expects the
  type of the callback's argument to be the type of the first argument.
  ``call_rcu1`` is the same as Linux's ``call_rcu``.


RCU Patterns
------------

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs.  This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step".  See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap.  For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list; see the
  sketch below.

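As a sketch of that last point (``struct node``, ``prev`` and the update-side
lock are hypothetical, not an existing QEMU API), removing the node after
``prev`` only requires publishing one pointer and deferring the free::

    struct node {
        struct rcu_head rcu;    /* first field, so g_free_rcu() can be used */
        struct node *next;
        int value;
    };

    /* removal; runs under the update-side lock */
    struct node *old = prev->next;
    qatomic_rcu_set(&prev->next, old->next);  /* new readers cannot reach old */
    g_free_rcu(old, rcu);                     /* reclaim after a grace period */
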
Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
^^^^^^^^^^^^^^^^^^^

TBD (not yet used in QEMU)


RCU reference counting
^^^^^^^^^^^^^^^^^^^^^^

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism.  For example,
consider the following code fragment::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of ``p`` remains
valid until after the ``rcu_read_unlock()``.  In some sense, it is acquiring
a reference to ``p`` that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking)::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with ``call_rcu``::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal.  Reclamation
happens when the last reference to a ``foo`` object is dropped.
Using ``synchronize_rcu()`` is undesirably expensive, because the
last reference may be dropped on the read side.  Hence you can
use ``call_rcu()`` instead::

    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }


Note that the same idioms would be possible with reader/writer
locks::

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

``foo_unref`` could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.


RCU resizable arrays
^^^^^^^^^^^^^^^^^^^^

Resizable arrays can be used with RCU.  The expensive RCU synchronization
(or ``call_rcu``) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using ``realloc``.  Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct::

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T   *data;
        ...
    };

Instead, we store the size of the array with the array itself::

    struct arr {
        int size;
        int alloc;
        T   data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = qatomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version.  */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase.  */
            old_array = global_array;
            qatomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase.  */
            free(old_array);
        }


References
----------

* The `Linux kernel RCU documentation <https://docs.kernel.org/RCU/>`__