Using RCU (Read-Copy-Update) for synchronization
================================================

Read-copy update (RCU) is a synchronization mechanism that is used to
protect read-mostly data structures.  RCU is very efficient and scalable
on the read side (it is wait-free), and thus can make the read paths
extremely fast.

RCU supports concurrency between a single writer and multiple readers,
thus it is not used alone.  Typically, the write-side will use a lock to
serialize multiple updates, but other approaches are possible (e.g.,
restricting updates to a single task).  In QEMU, when a lock is used,
this will often be the "iothread mutex", also known as the "big QEMU
lock" (BQL).  Also, restricting updates to a single task is done in
QEMU using the "bottom half" API.

RCU is fundamentally a "wait-to-finish" mechanism.  The read side marks
sections of code with "critical sections", and the update side will wait
for the execution of all *currently running* critical sections before
proceeding, or before asynchronously executing a callback.

The key point here is that only the currently running critical sections
are waited for; critical sections that are started **after** the beginning
of the wait do not extend the wait, despite running concurrently with
the updater.  This is the reason why RCU is more scalable than,
for example, reader-writer locks.  It is so much more scalable that
the system will have a single instance of the RCU mechanism; a single
mechanism can be used for an arbitrary number of "things", without
having to worry about things such as contention or deadlocks.

How is this possible?  The basic idea is to split updates in two phases,
"removal" and "reclamation".  During removal, we ensure that subsequent
readers will not be able to get a reference to the old data.  After
removal has completed, a critical section will not be able to access
the old data.  Therefore, critical sections that begin after removal
do not matter; as soon as all previous critical sections have finished,
there cannot be any readers who hold references to the data structure,
and the old data can now be safely reclaimed (e.g., freed or unref'ed).

Here is a picture::

        thread 1                  thread 2                  thread 3
    -------------------    ------------------------    -------------------
    enter RCU crit.sec.
           |                finish removal phase
           |                begin wait
           |                      |                    enter RCU crit.sec.
    exit RCU crit.sec.            |                           |
                            complete wait                     |
                            begin reclamation phase           |
                                                       exit RCU crit.sec.

Note how thread 3 is still executing its critical section when thread 2
starts reclaiming data.  This is possible, because the old version of the
data structure was not accessible at the time thread 3 began executing
that critical section.
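
In code, an update therefore has the rough shape below (a sketch using
the APIs described in the next section; ``Config`` and ``global_cfg``
are hypothetical)::

    Config *old = global_cfg;
    qatomic_rcu_set(&global_cfg, new_cfg);  /* removal: subsequent readers
                                               cannot reach the old data */
    synchronize_rcu();                      /* wait for pre-existing
                                               critical sections */
    g_free(old);                            /* reclamation */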


RCU API
-------

The core RCU API is small:

``void rcu_read_lock(void);``
        Used by a reader to inform the reclaimer that the reader is
        entering an RCU read-side critical section.

``void rcu_read_unlock(void);``
        Used by a reader to inform the reclaimer that the reader is
        exiting an RCU read-side critical section.  Note that RCU
        read-side critical sections may be nested and/or overlapping.

``void synchronize_rcu(void);``
        Blocks until all pre-existing RCU read-side critical sections
        on all threads have completed.  This marks the end of the removal
        phase and the beginning of the reclamation phase.

        Note that another update may begin while ``synchronize_rcu``
        is running.  Because of this, it is better for the updater to
        release any locks it may hold before calling
        ``synchronize_rcu``.  If this is not possible (for example, because
        the updater is protected by the BQL), you can use ``call_rcu``.

``void call_rcu1(struct rcu_head *head, void (*func)(struct rcu_head *head));``
        This function invokes ``func(head)`` after all pre-existing RCU
        read-side critical sections on all threads have completed.  This
        marks the end of the removal phase, with ``func`` taking care of
        the reclamation phase asynchronously.

        The ``foo`` struct needs to have an ``rcu_head`` structure added,
        perhaps as follows::

            struct foo {
                struct rcu_head rcu;
                int a;
                char b;
                long c;
            };

        so that the reclaimer function can fetch the ``struct foo`` address
        and free it::

            call_rcu1(&foo.rcu, foo_reclaim);

            void foo_reclaim(struct rcu_head *rp)
            {
                struct foo *fp = container_of(rp, struct foo, rcu);
                g_free(fp);
            }

        ``call_rcu1`` is typically used via either the ``call_rcu`` or
        ``g_free_rcu`` macros, which handle the common case where the
        ``rcu_head`` member is the first field of the struct.

``void call_rcu(T *p, void (*func)(T *p), field-name);``
        If the ``struct rcu_head`` is the first field in the struct, you can
        use this macro instead of ``call_rcu1``.
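
        For instance, the ``foo_reclaim`` example above could be written
        as follows (a sketch, assuming a ``struct foo *fp`` whose first
        field is ``rcu``)::

            call_rcu(fp, foo_reclaim, rcu);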

``void g_free_rcu(T *p, field-name);``
        This is a special-case version of ``call_rcu`` where the callback
        function is ``g_free``.
        In the example given in ``call_rcu1``, one could have written simply::

            g_free_rcu(&foo, rcu);

``typeof(*p) qatomic_rcu_read(p);``
        ``qatomic_rcu_read()`` is similar to ``qatomic_load_acquire()``, but
        it makes some assumptions about the code that calls it.  This allows
        a more optimized implementation.

        ``qatomic_rcu_read`` assumes that whenever a single RCU critical
        section reads multiple shared data items, these reads are either
        data-dependent or need no ordering.  This is almost always the
        case when using RCU, because read-side critical sections typically
        navigate one or more pointers (the pointers that are changed on
        every update) until reaching a data structure of interest,
        and then read from there.

        RCU read-side critical sections must use ``qatomic_rcu_read()`` to
        read data, unless concurrent writes are prevented by another
        synchronization mechanism.

        Furthermore, RCU read-side critical sections should traverse the
        data structure in a single direction, opposite to the direction
        in which the updater initializes it.
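
        As an illustration, here is a minimal read-side sketch, where
        ``global_foo`` is a hypothetical RCU-protected ``struct foo *``
        (not part of the API)::

            rcu_read_lock();
            struct foo *fp = qatomic_rcu_read(&global_foo);
            /* These reads are data-dependent on the pointer read
             * above, so they need no additional ordering.  */
            int a = fp->a;
            long c = fp->c;
            rcu_read_unlock();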

``void qatomic_rcu_set(p, typeof(*p) v);``
        ``qatomic_rcu_set()`` is similar to ``qatomic_store_release()``,
        though it also makes assumptions about the code that calls it in
        order to allow a more optimized implementation.

        In particular, ``qatomic_rcu_set()`` suffices for synchronization
        with readers, if the updater never mutates a field within a
        data item that is already accessible to readers.  This is the
        case when initializing a new copy of the RCU-protected data
        structure; just ensure that initialization of ``*p`` is carried out
        before ``qatomic_rcu_set()`` makes the data item visible to readers.
        If this rule is observed, writes will happen in the opposite
        order to reads in the RCU read-side critical sections (or there
        will be just one update), and there will be no need for another
        synchronization mechanism to coordinate the accesses.
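
        For example, publishing a new copy might look like this (a
        sketch, reusing the hypothetical ``global_foo`` from above)::

            struct foo *new_fp = g_new0(struct foo, 1);

            new_fp->a = 42;    /* initialize the new copy first... */
            /* ...then make it reachable to readers in a single store.  */
            qatomic_rcu_set(&global_foo, new_fp);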

The following APIs must be used before RCU is used in a thread:

``void rcu_register_thread(void);``
        Mark a thread as taking part in the RCU mechanism.  Such a thread
        will have to report quiescent points regularly, either manually
        or through the ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

``void rcu_unregister_thread(void);``
        Mark a thread as not taking part anymore in the RCU mechanism.
        It is not a problem if such a thread reports quiescent points,
        either manually or by using the
        ``QemuCond``/``QemuSemaphore``/``QemuEvent`` APIs.

Note that these APIs are relatively heavyweight, and should **not** be
nested.
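
As a sketch, a thread that takes part in RCU brackets its whole body
with these calls (the thread function itself is hypothetical)::

    static void *worker_thread_fn(void *opaque)
    {
        rcu_register_thread();

        /* ... work that enters RCU read-side critical sections ... */

        rcu_unregister_thread();
        return NULL;
    }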

Convenience macros
------------------

Two macros are provided that automatically release the read lock at the
end of the scope.

``RCU_READ_LOCK_GUARD()``
         Takes the lock and will release it at the end of the block it's
         used in.

``WITH_RCU_READ_LOCK_GUARD()  { code }``
         Is used at the head of a block to protect the code within the block.

Note that a ``goto`` out of the guarded block will also drop the lock.
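
For example (a sketch, reusing the hypothetical ``global_foo`` from
above)::

    static int get_a(void)
    {
        RCU_READ_LOCK_GUARD();
        /* The read lock is dropped automatically when we return.  */
        return qatomic_rcu_read(&global_foo)->a;
    }

    static void print_a(void)
    {
        WITH_RCU_READ_LOCK_GUARD() {
            struct foo *fp = qatomic_rcu_read(&global_foo);
            printf("a = %d\n", fp->a);
        }   /* the read lock is dropped when the block is left */
    }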

Differences with Linux
----------------------

- Waiting on a mutex is possible, though discouraged, within an RCU critical
  section.  This is because spinlocks are rarely (if ever) used in userspace
  programming; not allowing this would prevent upgrading an RCU read-side
  critical section to become an updater.

- ``qatomic_rcu_read`` and ``qatomic_rcu_set`` replace ``rcu_dereference`` and
  ``rcu_assign_pointer``.  They take a **pointer** to the variable being accessed.

- ``call_rcu`` is a macro that has an extra argument (the name of the first
  field in the struct, which must be a struct ``rcu_head``), and expects the
  type of the callback's argument to be the type of the first argument.
  ``call_rcu1`` is the same as Linux's ``call_rcu``.


RCU Patterns
------------

Many patterns using reader-writer locks translate directly to RCU, with
the advantages of higher scalability and deadlock immunity.

In general, RCU can be used whenever it is possible to create a new
"version" of a data structure every time the updater runs.  This may
sound like a very strict restriction, however:

- the updater does not mean "everything that writes to a data structure",
  but rather "everything that involves a reclamation step".  See the
  array example below.

- in some cases, creating a new version of a data structure may actually
  be very cheap.  For example, modifying the "next" pointer of a singly
  linked list is effectively creating a new version of the list; see
  the sketch after this list.
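
As an illustration of the second point, inserting into a singly linked
list might look like this (a sketch; ``node`` and ``prev`` are
hypothetical list elements)::

    node->data = ...;                      /* initialize the element... */
    node->next = prev->next;
    /* ...then a single store creates the new "version" of the list.  */
    qatomic_rcu_set(&prev->next, node);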

Here are some frequently-used RCU idioms that are worth noting.


RCU list processing
^^^^^^^^^^^^^^^^^^^

TBD (not yet used in QEMU)


RCU reference counting
^^^^^^^^^^^^^^^^^^^^^^

Because grace periods are not allowed to complete while there is an RCU
read-side critical section in progress, the RCU read-side primitives
may be used as a restricted reference-counting mechanism.  For example,
consider the following code fragment::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    /* do something with p. */
    rcu_read_unlock();

The RCU read-side critical section ensures that the value of ``p`` remains
valid until after the ``rcu_read_unlock()``.  In some sense, it is acquiring
a reference to ``p`` that is later released when the critical section ends.
The write side looks simply like this (with appropriate locking)::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    free(old);

If the processing cannot be done purely within the critical section, it
is possible to combine this idiom with a "real" reference count::

    rcu_read_lock();
    p = qatomic_rcu_read(&foo);
    foo_ref(p);
    rcu_read_unlock();
    /* do something with p. */
    foo_unref(p);

The write side can be like this::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    synchronize_rcu();
    foo_unref(old);

or with ``call_rcu``::

    qemu_mutex_lock(&foo_mutex);
    old = foo;
    qatomic_rcu_set(&foo, new);
    qemu_mutex_unlock(&foo_mutex);
    call_rcu(old, foo_unref, rcu);

In both cases, the write side only performs removal.  Reclamation
happens when the last reference to a ``foo`` object is dropped.
Using ``synchronize_rcu()`` is undesirably expensive, because the
last reference may be dropped on the read side.  Hence you can
use ``call_rcu()`` instead::

    void foo_unref(struct foo *p)
    {
        if (qatomic_fetch_dec(&p->refcount) == 1) {
            call_rcu(p, foo_destroy, rcu);
        }
    }
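
For completeness, the matching ``foo_ref`` could be as simple as this
(a sketch; ``refcount`` is an assumed field of ``struct foo``)::

    void foo_ref(struct foo *p)
    {
        qatomic_inc(&p->refcount);
    }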


Note that the same idioms would be possible with reader/writer
locks::

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        p = foo;
    /* do something with p. */      foo = new;
    read_unlock(&foo_rwlock);       free(p);
                                    write_mutex_unlock(&foo_rwlock);

    ------------------------------------------------------------------

    read_lock(&foo_rwlock);         write_mutex_lock(&foo_rwlock);
    p = foo;                        old = foo;
    foo_ref(p);                     foo = new;
    read_unlock(&foo_rwlock);       foo_unref(old);
    /* do something with p. */      write_mutex_unlock(&foo_rwlock);
    read_lock(&foo_rwlock);
    foo_unref(p);
    read_unlock(&foo_rwlock);

``foo_unref`` could use a mechanism such as bottom halves to move deallocation
out of the write-side critical section.


RCU resizable arrays
^^^^^^^^^^^^^^^^^^^^

Resizable arrays can be used with RCU.  The expensive RCU synchronization
(or ``call_rcu``) only needs to take place when the array is resized.
The two items to take care of are:

- ensuring that the old version of the array is available between removal
  and reclamation;

- avoiding mismatches in the read side between the array data and the
  array size.

The first problem is avoided simply by not using ``realloc``.  Instead,
each resize will allocate a new array and copy the old data into it.
The second problem would arise if the size and the data pointers were
two members of a larger struct::

    struct mystuff {
        ...
        int data_size;
        int data_alloc;
        T   *data;
        ...
    };

Instead, we store the size of the array with the array itself::

    struct arr {
        int size;
        int alloc;
        T   data[];
    };
    struct arr *global_array;

    read side:
        rcu_read_lock();
        struct arr *array = qatomic_rcu_read(&global_array);
        x = i < array->size ? array->data[i] : -1;
        rcu_read_unlock();
        return x;

    write side (running under a lock):
        if (global_array->size == global_array->alloc) {
            /* Creating a new version.  */
            new_array = g_malloc(sizeof(struct arr) +
                                 global_array->alloc * 2 * sizeof(T));
            new_array->size = global_array->size;
            new_array->alloc = global_array->alloc * 2;
            memcpy(new_array->data, global_array->data,
                   global_array->alloc * sizeof(T));

            /* Removal phase.  */
            old_array = global_array;
            qatomic_rcu_set(&global_array, new_array);
            synchronize_rcu();

            /* Reclamation phase.  */
            free(old_array);
        }


References
----------

* The `Linux kernel RCU documentation <https://docs.kernel.org/RCU/>`__