Rust - Race Condition vs Data Race

TL;DR - “Race Condition” and “Data Race” are not the same thing, though they are somewhat related. Rust prevents “Data Races” in safe code, it does not, however, absolutely prevent “Race Conditions”.

Rust prevents “Data Races” at compile time (in the absence of “unsafe” code that is incorrect). It does not prevent (all) “Race Conditions”. To prevent (all) race conditions, a language would need to have the ability to solve the halting problem (IIRC). “Data Races” and “Race Conditions” are two different things.

What is a “Race Condition”?

A “Race Condition” is when two (or more) different “threads” of execution attempt to do things in a logical order without performing appropriate synchronization to enforce the order with respect to one another of the interleaved operations. For example, if you acquire a read-lock on some memory location, check the value, and if the value is some particular range of values (say even), you perform an operation, then release the lock. You then acquire a write-lock on the location and increment it. This is not a “Data Race”, but, it is a “Race Condition”. Rust will “encourage” you through the API’s available and the idiomatic patterns not to do this, but, it won’t prevent it - it will just make it “awkward” to do. You can do this entirely in “safe” code (though, the “locks” you acquire will be implemented in “unsafe” code and wrapped in a “safe” abstraction which is the lock guard, like Mutex that Rust provides.

What is a “Data Race”?

A “Data Race” is when two or more “threads” (not necessarily OS or hardware threads, but, this includes “Tasks” and other types of concurrency) attempt to read a “Composite Value” or “Restricted Atomic Value” from a memory location while at least one other “thread” attempts to modify that same “Composite” or “Restricted Atomic” value, such that, depending upon the interleaving of operations of the threads, could result in the reading of a value, during or after the update, that is not valid for the type or does not correspond to either the value that existed before the write operations began or the value written by the writing thread. Here “Composite Value” is a memory location and extent (range of bytes, words, etc) that contains multiple “Atomic”, “Restricted Atomic”, and/or “Composite” values that have functional dependencies on one another such that the overall composite value only has a defined meaning if the components of that composite value maintain their invariants with respect to one another and the restrictions of any “Restricted Atomic” components are also maintained. A “Restricted Atomic” value is an “Atomic” value where the values permitted to make the value valid are restricted to a subset of the values of the atomic type that is being restricted (for example, a “Restricted Atomic Value” may consist of an unsigned byte, which is “Atomic” and permits and represents the values 0 through 255, that is restricted to a subset of the possible values such that the byte must only contain the values 1 through 100). An “Atomic Value” is a memory location and extent that is fundamental on the underlying computer architecture and will always be updated (by the CPU), “atomically” (not to be confused with “Atomic Operations” or “Synchronization” which is the ability to read the existing value then update it atomically such that the value you read is still the value it contains when you write the new value) such as an “int” that is less than or equal to the “write size” (word size or cache line size) of the CPU architecture and will always (also dependent on CPU architectural considerations that may have exceptions for alignment and cache coherency) be written by the hardware to memory in a single operation AND where all values possible for the fundamental, hardware level type are not restricted. Updating or reading a true “Atomic Value” will never be a “Data Race”, but it may be a “Race Condition”; however, what is an “Atomic Value” varies from CPU architecture to CPU architecture (and can even vary with alignment and other factors) and does not necessarily correspond to the language’s high-level notion of what is an “Atomic Value” (as we’ll see). Also, this definition of “Data Race” is not as restrictive as the definition of “Data Race” that Rust actually uses (as we’ll see).

Example of a “Data-Race” with a “Composite” Type/Value

A “Date” value (represented as a struct) could consist of: a “Type” called “Year” that is a 32-bit hardware level integer where negative values represent years BCE and positive values represent years CE making all integer values of the hardware valid for the type “Year”; a “Type” called “Month”, an unsigned byte, with only the values 1 through 12 permitted; and a “Typed” called “DayOfMonth”, also a byte with only values of either 1 through 28, 1 through 29, 1 through 30, or 1 through 31 permitted (depending on the month or year and month). Under this definition, the int representing the year, is an “Atomic Value” (on most CPU architectures, but, this isn’t guaranteed) and the month and year bytes are each “Restricted Atomic Values” (because they have a restricted range of values independent of the overall composite date value’s invariants). This “Composite” date value is only valid if the constraints on the day of month with respect to the year and month are maintained. If the value is “2000/02/29”, that is not a valid value because the year 1900 was not a leap year and so there is no such date as the 29th of February of the year 1900. Similarly, the value “2020/04/31” is not valid because there is never, in any year, a 31st of April. Also, any value other than 1 through 12, regardless of the year, would always be invalid for the month and any value other than 1 through 31, regardless of the year or month, for the day of the month would always be invalid because both the month and the day of month are “Restricted Atomic Values” that must maintain those invariants to be valid.

If the components of the “Date” type were not part of the overall composite date value, they would simply be the TYPES: “Year” to represent year values, which is a hardware level “int”, with an unrestricted range (making it an “Atomic Type” on most hardware) ; “Month” to represent month values, which is a hardware level “byte” restricted to the range of 1 through 12 (making it a “Restricted Atomic Type”); and “DayOfMonth” for the day of the month, which is a hardware level “byte” restricted to the range 1 through 31 (which makes it also a “Restricted Atomic Type”). Notice, that the “DayOfMonth” type, by itself, has no restrictions beyond being one of the values 1 through 31. This is because, further restrictions, such as 31 not being valid for certain months, and 29 not being valid for certain combinations of year and month, is not a restriction on “DayOfMonth”, but only a restriction on the composite values of composite type “Year/Month/DayOfMonth”.

A “Data Race” would occur if one thread reads the composite value such that the individual atomic components do not represent the value of the composite either before any writing began in another thread or after all the writing of the atomic components was complete, regardless of whether the invariants are met. For example, if the original value was “2020/01/31” and thread A starts to write a new value, one component at a time, such that the new composite value should be “2020/02/15” and thread B reads the composite value after thread A updates the 31 to 15, but before thread A updates the 01 to 02, then, even though the composite invariants hold (the value 2020/01/15 is a valid date after all) it is a “Data Race” because the value seen by Thread B is neither the value that was originally present, nor the value that is present after Thread A completes the update of the composite value; however, it is not “Undefined Behavior”, because the value read is valid for the type, but it is indeterminate and incorrect behavior of the program because whatever action taken based on the value by Thread B will not likely be correct. On the other hand, if Thread A changes the month to 02 first, then Thread B reads the value, Thread B will read the value as, “2020/02/31”, which is not a valid/defined value for the composite type “YearMonthDayOfMonth”. This can, and likely will, lead to “Undefined Behavior” because other code relies on the invariants of the type and when the invariants aren’t met what happens is undefined.

“Data Race” with non-composite values

In addition to “Data Races” from updating “Composite Type Values”, a “Data Race” will occur if at least two “threads” attempt to read and then modify the same “Restricted Atomic” value in such a way that the resulting value would not be valid for the type depending on the interleaving of operations of the threads. So, if Thread A reads an independent (not part of a composite “Date” value) “Month” type value, then increments it, and then performs a modulus operation to clamp it to 1 through 12 (so 13 becomes 1, 14 becomes 2, etc.) while another Thread B does the same, it will NOT be a “Data Race”, but the resulting value will be indeterminate and it will be a “Race Condition” (it could end up 1 less than what it should be depending on the interleaving of the operations of Thread A and B), and it won’t be “Undefined Behavior” because the ending value, even if not “Correct” for the semantics of the program, will be valid for the type. On the other hand, if Thread A reads the value, increments it, and if the value is 13 it changes it to 1 AND Thread B does the same, the resulting value could end up as 14 which is invalid for the type and so this is a “Data Race” and will be “Undefined Behavior” if the value ends up outside the range of 1 through 12. Rust will prevent a “Data Race” by preventing (in “Safe” code which is the default) update of the atomic components of a composite value or a restricted atomic value while it is possible for other “threads” to read those same values or any other component value of the composite type. Rust goes further and even prevents update of ANY value (including “Atomic” values) while another thread may be able to read them and also disallows the POSSIBILITY (sans “unsafe” code) of multiple writers at the same time. Furthermore, it requires, but does not enforce, that any “unsafe” code written MUST maintain the invariants of all types (atomic, restricted atomic, and composite) whenever the possibility exists for another thread to read the value. In other words, all “unsafe” code must either be operating on a mutable variable, which Rust then enforces at compile time no other readers/writers, or the “unsafe” code MUST use synchronization and privacy to ensure that no other readers/writers may read or modify the value at the same time at runtime. The reason that Rust restricts even atomic values is two-fold: 1) it makes understanding the language and implementation of the language more tractable, and 2) more importantly, what is a true “Atomic Value” on the hardware is not always clear and consistent even on a single architecture, much less across multiple hardware architectures. So, Rust being conservative on what constitutes a “Data Race” is more consistently correct than trying to differentiate true “Atomic” vs “Restricted Atomic” vs “Composite”. In fact, the notion of “Atomic Value” does not exist at the language level of Rust, but is instead something implemented in the standard library or other crates that ensure that “Atomic Value” are actually truly “Atomic” on the hardware and the valid operations on them are actually “Atomic” as well (on the actual hardware).

Representation and Defintion of “Composite”, “Atomic Restricted”, and “Atomic” Values by Rust

“Composite” values (like the “Date” type I mentioned above) are Structs. In Rust, Enums, can be, and often are, composite values. Tuples are composite values. Unions are composite values. Everything in any computer language that are not “primitives” that map directly to “atomic” values of the underlying CPU architecture are composite. For the most part, the only “values” that are almost guaranteed (and in practice mostly are) to be “atomic” (in the sense I’m speaking about here) are bytes (and even this is not necessarily true for ALL architectures that may exist now or in the future). All other values, may, on particular architectures, be composite values from the perspective of the hardware even if the language considers them fundamental/primitive. Moreover, even true “Atomic Values” on hardware may only have a limited number of operations that can be performed on them that is truly, 100% guaranteed to be “Atomic”. So, basically, any “Type” in Rust can be actually, and is always treated by default, as not atomic. Only special types, which rely on usage of “unsafe”, because Rust does not have a language notion of truly “Atomic” values, and are implemented in libraries, are “Atomic” (example, Mutex). In Rust, even all the primitive types are treated as non-atomic, including: i8, i16, i32, u32, f32, f64, isize, usize, etc.

How “True Atomics” and “Synchroniztion” Relate to Data Races and Race Conditions

You can’t have a “Data Race” when reading or updating a truly “Atomic” value via a truly atomic operation that that type supports. The underlying hardware will always either read the whole value as it was at some point, or write the new value to memory. You’ll never see a value that is part of the old value and part of the new value combined. You can have a “Race Condition” with an atomic value. For example, if you read the value with synchronization, release the “lock”, take some action based on that value, and then, re-acquire the “lock” and update the value.

Why “Primitive Values/Types” are not “Atomic”

“Primitive” values being non-atomic may at first seem incorrect, but at the hardware level, they are often composite. For example, depending on the hardware, a “long int” or a “float” or a “double” may be composite rather than atomic. That is, when writing a single instance of those values, the CPU may actually do it in 2 or more steps and so, during the write, the memory location can have undefined values for some amount of time. For example, on some architectures, writing a double will be broken into more than 1 operation. It might write the most significant 32-bits to memory first, then, write the least significant 32-bits to memory (or a different order depending on the hardware). This means, that between these two points, the value in memory of the double is a completely meaningless and incorrect value. If another thread reads that location in memory, they may read the most significant 32-bits after those were written, then, read the least significant 32-bits (of the previous value) before the CPU has written the least significant bits of the new value. A “Double” (floating point) value read in such a way will have a completely meaningless value and may actually be an “Undefined/Impossible” value from the perspective of the definition of a floating point value.

Why doesn’t the Rust compiler automatically add synchronization?

It may seem like the compiler could simply automatically insert the necessary synchronization to ensure the above does not occur; however, this would mean that the cost of synchronization (which is not cheap at the hardware level and can, and often will, result in significant performance penalty even in the absence of actual concurrency) would have to be paid at runtime regardless of whether it was needed. It would not be needed if the memory location/value was known to be updateable by only a single writer and that it was not possible for other readers to exist at the same time. Rust can “prove” this restriction at compile-time through the enforcement of ownership, shared borrows/references vs. exclusive borrows/references, and lifetimes. The compiler knows, for sure, that it does not need to insert any sort of synchronization CPU instructions because of this. If it finds, that the usage the programmer is attempting, violates the restrictions, and thus would require some sort of synchronization to be well-defined, it will refuse to compile and will produce an error rather than silently adding potentially expensive synchronization. Even if the compiler wanted to insert the necessary synchronization automatically, it would require whole program analysis, including flow analysis of the entire program, instead of relying on local information (definition of types, traits, functions signature) in order to know when it needed to insert synchronization for sure and when it could elide it. Doing whole program flow-analysis at compile-time would be prohibitively expensive in terms of clock time, memory, cpu time etc. used by the compiler at compile-time. Of course, the compiler could just always insert synchronization, but, that would be too costly in terms of CPU overhead for a language that needs to have maximum efficiency when most of the time it wouldn’t be required in actuality.

What does this all mean?

So, “Data Races”, are defined in Rust as any attempt to update a value (whether effectively atomic or composite, and whether restricted or not) while other readers or writers read or write the same value. To prevent “Data Races”, Rust enforces, at compile-time, that no readers may exist and no other writers may exist if something exists that has a mutable reference to the value/memory and that nothing may have a mutable reference while anything has an immutable reference. So, there can be 1 or more readers OR 1 writer, but, never multiple writers or readers and a writer at the same time. This does not prevent all “Race Conditions”.

“Race Conditions” in Rust are when synchronization is incorrectly used in a way that the semantics of the program are incorrect, though “safe” and “well-defined” (that is, not “Undefined Behavior”). In other words, all of the values could be read and updated in a way that does not cause “Data Races”, but there could still be a “Race Condition” because the usage of “True Atomics” and/or “Synchroniztion Types” was incorrect despite the design of the API’s exposed as “Safe” abstractions of these things making it difficult and awkward to do it wrong.

To attempt to prevent “Race Conditions” Rust libraries and types leverage the ability of Rust to prevent “Data Races” to create types, API’s, and libraries that make it easy for the user to avoid “Race Conditions” and difficult or awkward to cause “Race Conditions”. It does not, though, absolutely prevent them. The programmer can create many kinds of race conditions, in safe code using safe abstractions only, by abusing or misusing API’s or using API’s. It is up to the creators of the API’s (types, modules, crates), to create API’s that leverage the capabilities of the Rust type system, and its ability to prevent “Data Races” efficiently to create efficient, idiomatic, and ergonomic API’s that make it easy (low friction) for the programmer to do things correctly and avoid “Race Conditions” and make it hard (high friction and awkward) to cause them.

So, you skipped all that?

TL;DR - “Race Condition” and “Data Race” are not the same thing, though they are somewhat related. Rust prevents “Data Races” in safe code, it does not, however, absolutely prevent “Race Conditions”.

blog

And Then?