The JSR-133 Cookbook
适用于编译器开发的 JSR-133 指南
“The JSR-133 Cookbook for Compiler Writers”
Original: http://g.oswego.edu/dl/jmm/cookbook.html, by Doug Lea, with help from members of the JMM mailing list.
Chinese translation by 崔新, available at https://yellowstar5.cn/direct/The%20JSR-133%20Cookbook-chinese.html
Preface: Over the 10+ years since this was initially written, many processor and language memory model specifications and issues have become clearer and better understood. And many have not. While this guide is maintained to remain accurate, it is incomplete about some of these evolving details. For more extensive coverage, see especially the work of Peter Sewell and the Cambridge Relaxed Memory Concurrency Group.
前言: 自此指南最初编写以来已有十多年,许多处理器和语言内存模型规范和问题已经变得更加清晰和更好地被理解。 然而许多还没有。 尽管本指南一直被维护以保持准确性,但关于一些不断发展的细节,此指南给出的内容并不完整。 有关更全面的论述,尤其要参见 Peter Sewell 和 Cambridge Relaxed Memory Concurrency Group 的工作。
This is an unofficial guide to implementing the new Java Memory Model (JMM) specified by JSR-133. It provides at most brief background about why various rules exist, instead concentrating on their consequences for compilers and JVMs with respect to instruction reorderings, multiprocessor barrier instructions, and atomic operations. It includes a set of recommended recipes for complying with JSR-133. This guide is “unofficial” because it includes interpretations of particular processor properties and specifications. We cannot guarantee that the interpretations are correct. Also, processor specifications and implementations may change over time.
这是实现由 JSR-133 规范的新 Java Memory Model (JMM) 的非官方指南。 它至多只简要介绍各种规则存在的原因,而主要关注这些规则在指令重排序、多处理器屏障指令和原子操作方面对编译器和 JVM 的影响。 它包括一组遵循 JSR-133 的推荐食谱。 本指南是“非官方的”,因为它包含对特定处理器属性和规范的解释。 我们不能保证这些解释是正确的。 此外,处理器规范和实现可能会随时间而变化。
Reorderings (重排序)
For a compiler writer, the JMM mainly consists of rules disallowing reorderings of certain instructions that access fields (where “fields” include array elements) as well as monitors (locks).
对于一个编译器编写者来说,JMM主要由禁止对访问字段(其中“字段”包括数组元素)和监视器(锁)的某些指令进行重排序的规则组成。
Volatiles and Monitors (volatile 和监视器)
The main JMM rules for volatiles and monitors can be viewed as a matrix with cells indicating that you cannot reorder instructions associated with particular sequences of bytecodes. This table is not itself the JMM specification; it is just a useful way of viewing its main consequences for compilers and runtime systems.
可以将 volatile 和监视器的主要 JMM 规则视为带有单元格的矩阵,其中单元格指示了你无法对与特定字节码序列相关的指令进行重排序。 这个表格本身不是 JMM 规范; 它只是查看其对编译器和运行时系统的主要影响的一种有用方法。
Can Reorder (rows: 1st operation; columns: 2nd operation) | Normal Load / Normal Store | Volatile Load / MonitorEnter | Volatile Store / MonitorExit |
Normal Load / Normal Store | | | No |
Volatile Load / MonitorEnter | No | No | No |
Volatile Store / MonitorExit | | No | No |
Where:
- Normal Loads are getfield, getstatic, array load of non-volatile fields.
- Normal Stores are putfield, putstatic, array store of non-volatile fields.
- Volatile Loads are getfield, getstatic of volatile fields that are accessible by multiple threads
- Volatile Stores are putfield, putstatic of volatile fields that are accessible by multiple threads
- MonitorEnters (including entry to synchronized methods) are for lock objects accessible by multiple threads.
- MonitorExits (including exit from synchronized methods) are for lock objects accessible by multiple threads.
其中:
- 普通加载(Normal Loads)是 非volatile字段的 getfield,getstatic,数组加载
- 普通存储(Normal Stores)是 非volatile字段的 putfield,putstatic,数组存储
- Volatile加载(Volatile Loads)是 volatile字段(该字段被多线程访问)的 getfield,getstatic加载
- Volatile存储(Volatile Stores)是 volatile字段(该字段被多线程访问)的 putfield,putstatic存储
- MonitorEnters(包括同步方法的开始)用于可由多个线程访问的锁对象。
- MonitorExits(包括同步方法的退出)用于可由多个线程访问的锁对象。
The cells for Normal Loads are the same as for Normal Stores, those for Volatile Loads are the same as for MonitorEnter, and those for Volatile Stores are the same as for MonitorExit, so they are collapsed together here (but are expanded out as needed in subsequent tables). We consider here only variables that are readable and writable as an atomic unit: that is, no bit fields, unaligned accesses, or accesses larger than word sizes available on a platform.
Normal Loads 的单元格与 Normal Stores 的单元格相同, Volatile Loads 的单元格与 MonitorEnter 相同, 而 Volatile Stores 的单元格与 MonitorExit 相同,因此它们在此处折叠在一起(但根据需要在后续表格被展开)。 在这里,我们仅考虑以原子单位可读写的变量 —— 即没有位字段,未对齐的访问或大于平台上可用字长的访问。
Any number of other operations might be present between the indicated 1st and 2nd operations in the table. So, for example, the “No” in cell [Normal Store, Volatile Store] says that a non-volatile store cannot be reordered with ANY subsequent volatile store; at least any that can make a difference in multithreaded program semantics.
表中指示的 1st 和 2nd 操作之间可能存在任意数量的其他操作。 因此,例如,[Normal Store, Volatile Store] 单元格中的 “No” 表示, 一个 非volatile存储 不能与任何后续的 volatile存储 重排序; 至少是任何在多线程程序语义上有影响的重排序。
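As a compact illustration, the “Can Reorder” matrix can be encoded as a lookup table. This is a minimal Java sketch; the class, enum, and method names (`ReorderMatrix`, `Kind`, `canReorder`) are hypothetical and not part of any JVM. A `true` entry stands for a blank cell, meaning the table itself does not forbid the reordering; ordinary JLS dependencies still apply, and “No” holds no matter how many other operations lie between the two accesses.

```java
// Sketch: the collapsed "Can Reorder" matrix as a lookup table (hypothetical names).
public class ReorderMatrix {

    /** Access kinds, collapsed into the three rows/columns of the table. */
    enum Kind {
        NORMAL,          // Normal Load or Normal Store
        VOLATILE_LOAD,   // Volatile Load or MonitorEnter
        VOLATILE_STORE   // Volatile Store or MonitorExit
    }

    // CAN_REORDER[first][second]: true = blank cell, false = "No".
    private static final boolean[][] CAN_REORDER = {
        //               NORMAL  VOL_LOAD  VOL_STORE   <- 2nd operation
        /* NORMAL    */ { true,  true,     false },
        /* VOL_LOAD  */ { false, false,    false },
        /* VOL_STORE */ { true,  false,    false },
    };

    static boolean canReorder(Kind first, Kind second) {
        return CAN_REORDER[first.ordinal()][second.ordinal()];
    }

    public static void main(String[] args) {
        // A normal store followed (at any distance) by a volatile store must stay in order.
        System.out.println(canReorder(Kind.NORMAL, Kind.VOLATILE_STORE)); // false
        // A volatile store followed by a normal access may be reordered.
        System.out.println(canReorder(Kind.VOLATILE_STORE, Kind.NORMAL)); // true
    }
}
```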
The JSR-133 specification is worded such that the rules for both volatiles and monitors apply only to those that may be accessed by multiple threads. If a compiler can somehow (usually only with great effort) prove that a lock is only accessible from a single thread, it may be eliminated. Similarly, a volatile field provably accessible from only a single thread acts as a normal field. More fine-grained analyses and optimizations are also possible, for example, those relying on provable inaccessibility from multiple threads only during certain intervals.
JSR-133 规范的措辞使得 volatile 和监视器的规则仅适用于那些可由多个线程访问的 volatile 和监视器。 如果编译器可以用某种方式(通常要花费很大的精力)证明一个锁仅对单个线程可访问,那么该锁可能会被消除。 类似地,可证明仅对单个线程可访问的 volatile 字段可以当成普通字段。 更细粒度的分析和优化也是可能的,例如,那些依赖于仅在特定时间间隔内对多线程可证明不可访问的分析和优化。
Blank cells in the table mean that the reordering is allowed if the accesses aren’t otherwise dependent with respect to basic Java semantics (as specified in the JLS). For example even though the table doesn’t say so, you can’t reorder a load with a subsequent store to the same location. But you can reorder a load and store to two distinct locations, and may wish to do so in the course of various compiler transformations and optimizations. This includes cases that aren’t usually thought of as reorderings; for example reusing a computed value based on a loaded field rather than reloading and recomputing the value acts as a reordering. However, the JMM spec permits transformations that eliminate avoidable dependencies, and in turn allow reorderings.
表中的空白单元格表示,如果这些访问在基本 Java 语义(如 JLS 所规范的)下不存在其他依赖,那么重排序是允许的。 例如,即使表中没有这样说,你也不能将一个加载与一个后续对同一位置的存储重排序。 但是你可以将对两个不同位置的加载和存储重排序,并且可能希望在各种编译器转换和优化过程中这样做。 这包括通常不认为是重排序的情况; 例如,重用基于一个已加载字段计算出的值,而不是重新加载并重新计算该值,其效果相当于一次重排序。 但是,JMM 规范允许进行消除可避免依赖关系的转换,进而允许重排序。
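A small sketch of a transformation that does not look like a reordering but acts as one: redundant-load elimination on a non-volatile field. The names are hypothetical. In the rewritten method, if `p` and `q` refer to the same object and another thread writes `x` concurrently, `r2` may observe the new value while `r3` still holds the old one, which is exactly the outcome of moving the third load above the second. Because these are normal loads, the blank cells of the table permit this; with a volatile `x` the rewrite would not be allowed.

```java
// Sketch: reusing a loaded value instead of reloading it behaves like a reordering.
public class RedundantLoad {

    static final class Box { int x; }   // non-volatile field

    // As written: the program loads p.x, then q.x, then p.x again.
    static int[] original(Box p, Box q) {
        int r1 = p.x;
        int r2 = q.x;
        int r3 = p.x;   // reload of p.x
        return new int[] { r1, r2, r3 };
    }

    // A compiler rewrite that is legal for a non-volatile field: reuse r1 for r3.
    // The result is as if the third load had been moved before the load of q.x.
    static int[] transformed(Box p, Box q) {
        int r1 = p.x;
        int r2 = q.x;
        int r3 = r1;    // no reload
        return new int[] { r1, r2, r3 };
    }

    public static void main(String[] args) {
        Box b = new Box();
        b.x = 5;
        // Single-threaded, both versions agree (here p and q alias the same object).
        System.out.println(java.util.Arrays.toString(original(b, b)));    // [5, 5, 5]
        System.out.println(java.util.Arrays.toString(transformed(b, b))); // [5, 5, 5]
    }
}
```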
In all cases, permitted reorderings must maintain minimal Java safety properties even when accesses are incorrectly synchronized by programmers: All observed field values must be either the default zero/null “pre-construction” values, or those written by some thread. This usually entails zeroing all heap memory holding objects before it is used in constructors and never reordering other loads with the zeroing stores. A good way to do this is to zero out reclaimed memory within the garbage collector. See the JSR-133 spec for rules dealing with other corner cases surrounding safety guarantees.
在所有情况下,允许的重排序必须保持最小的 Java 安全属性,即使当那些访问被程序员不正确地同步的时候: 所有观察到的字段值都必须是默认的 zero/null “pre-construction”值,或者是由某个线程写入的值。 这通常需要在构造函数使用堆内存之前将持有对象的所有堆内存清零,并且永远不将其他加载与这些清零存储(zeroing stores)重排序。 实现上述要求的一个好方法是在垃圾回收器中将回收的内存清零。 处理围绕安全保证的其他特殊情况的相关规则,请参见 JSR-133 规范。
The rules and properties described here are for accesses to Java-level fields. In practice, these will additionally interact with accesses to internal bookkeeping fields and data, for example object headers, GC tables, and dynamically generated code.
此处描述的规则和属性用于访问 Java-level 的字段。 实际上,它们还将与对内部记录字段和数据(例如对象头,GC表和动态生成的代码)的访问进行交互。
Final Fields (final字段)
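Loads and Stores of final fields act as “normal” accesses with respect to locks and volatiles, but impose two additional reordering rules:
- A store of a final field (inside a constructor) and, if the field is a reference, any store that this final can reference, cannot be reordered with a subsequent store (outside that constructor) of the reference to the object holding that field into a variable accessible to other threads. For example, you cannot reorder
x.finalField = v; … ; sharedRef = x;
- The initial load (i.e., the very first encounter by a thread) of a final field cannot be reordered with the initial load of the reference to the object containing the final field. This comes into play in:
x = sharedRef; … ; i = x.finalField;
final字段的加载和存储就锁和volatile而言是“普通”访问,但是强加了两个附加的重排序规则:
- 对一个 final 字段的存储(在构造函数内),以及当该字段是引用时,对该 final 字段所能引用到的任何对象的存储,都不能与后续(在该构造函数之外)将持有该字段的对象的引用存入一个可被其他线程访问的变量的存储重排序。 例如,你不能对下面的语句重排序:
x.finalField = v; … ; sharedRef = x;
- 对一个 final 字段的初始加载(即某个线程第一次遇到它)不能与对包含该 final 字段的对象的引用的初始加载重排序。 这在以下情况中起作用:
x = sharedRef; … ; i = x.finalField;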
These rules imply that reliable use of final fields by Java programmers requires that the load of a shared reference to an object with a final field itself be synchronized, volatile, or final, or derived from such a load, thus ultimately ordering the initializing stores in constructors with subsequent uses outside constructors.
这些规则暗示: Java 程序员对 final 字段的可靠使用存在要求, 该要求是对带有 final 字段的对象的共享引用的加载本身必须是 synchronized,volatile 或 final,或者是从此类加载派生来的, 因而最终将构造函数中的初始化存储与构造函数外的后续使用排序。
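A minimal Java sketch of the usage pattern described above: an object whose final field is set in its constructor, published through a `volatile` shared reference, and read through a load derived from that reference. `Holder`, `sharedRef`, `writer`, and `reader` are hypothetical names chosen for illustration.

```java
// Sketch: a final field published through a volatile shared reference.
public class FinalPublication {

    static final class Holder {
        final int value;                        // final field, initialized in the constructor
        Holder(int value) { this.value = value; }
    }

    // The load of this shared reference is volatile, as the paragraph above recommends.
    static volatile Holder sharedRef;

    static void writer(int v) {
        sharedRef = new Holder(v);              // constructor stores are ordered before publication
    }

    static int reader() {
        Holder h = sharedRef;                   // load of the shared reference
        return (h == null) ? -1 : h.value;      // final-field load derived from that reference
    }

    public static void main(String[] args) throws InterruptedException {
        Thread w = new Thread(() -> writer(42));
        w.start();
        int seen = reader();                    // -1 (not yet published) or 42, never 0
        w.join();
        System.out.println("observed " + seen + ", after join " + reader());
    }
}
```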
Memory Barriers (内存屏障)
Compilers and processors must both obey reordering rules. No particular effort is required to ensure that uniprocessors maintain proper ordering, since they all guarantee “as-if-sequential” consistency. But on multiprocessors, guaranteeing conformance often requires emitting barrier instructions. Even if a compiler optimizes away a field access (for example because a loaded value is not used), barriers must still be generated as if the access were still present. (Although see below about independently optimizing away barriers.)
编译器和处理器都必须遵守重排序规则。 不需要特别的努力来确保单处理器保持适当的排序,因为它们都保证 “as-if-sequential” 一致性。 但是在多处理器上,要保证符合这些规则,通常需要发出屏障指令。 即使编译器优化掉了一个字段访问(例如,因为一个加载的值未被使用),屏障也必须仍然被生成,就像访问仍然存在一样。 (但是可参阅下面有关独立地优化掉屏障的信息。)
Memory barriers are only indirectly related to higher-level notions described in memory models such as “acquire” and “release”. And memory barriers are not themselves “synchronization barriers”. And memory barriers are unrelated to the kinds of “write barriers” used in some garbage collectors. Memory barrier instructions directly control only the interaction of a CPU with its cache, with its write-buffer that holds stores waiting to be flushed to memory, and/or its buffer of waiting loads or speculatively executed instructions. These effects may lead to further interaction among caches, main memory and other processors. But there is nothing in the JMM that mandates any particular form of communication across processors so long as stores eventually become globally performed; i.e., visible across all processors, and that loads retrieve them when they are visible.
内存屏障仅与内存模型中描述的更高级概念(例如 “acquire” 和 “release”)间接相关。 并且内存屏障本身并不是“同步屏障”(“synchronization barriers”)。 并且内存屏障与某些垃圾收集器中使用的“写屏障”(“write barriers”)无关。 内存屏障指令仅直接控制 CPU 与其高速缓存、其写缓冲区(保存等待刷新到内存的存储),和/或其等待加载的缓冲区或推测执行的指令之间的交互。 这些影响可能导致多个高速缓存、主存和其他处理器之间的进一步交互。 但是,只要存储最终被全局执行(即在所有处理器上均可见),并且加载在存储可见时能获取到它们,JMM 就不要求处理器之间采用任何特定形式的通信。
Categories (分类)
Nearly all processors support at least a coarse-grained barrier instruction, often just called a Fence, that guarantees that all loads and stores initiated before the fence will be strictly ordered before any load or store initiated after the fence. This is usually among the most time-consuming instructions on any given processor (often nearly as expensive as, or even more expensive than, atomic instructions). Most processors additionally support more fine-grained barriers.
几乎所有处理器都至少支持一个粗粒度的屏障指令,通常称为栅栏(Fence),该栅栏可确保在它之前发起的所有加载和存储都被严格排序在它之后发起的任何加载或存储之前。 这通常是在任何给定处理器上最耗时的指令之一(通常与原子指令几乎一样昂贵,甚至比原子指令更昂贵)。 大多数处理器还支持更细粒度的屏障。
A property of memory barriers that takes some getting used to is that they apply BETWEEN memory accesses. Despite the names given for barrier instructions on some processors, the right/best barrier to use depends on the kinds of accesses it separates. Here’s a common categorization of barrier types that maps pretty well to specific instructions (sometimes no-ops) on existing processors:
内存屏障有一个需要花些时间才能习惯的特性:它们作用于内存访问之间。 尽管某些处理器为屏障指令起了特定的名称,但应使用的正确/最佳屏障取决于它所分隔的访问类型。 下面是屏障类型的一个常见分类,该分类可以很好地映射到现有处理器上的特定指令(有时是 no-ops):
LoadLoad Barriers
The sequence: Load1; LoadLoad; Load2
ensures that Load1’s data are loaded before data accessed by Load2 and all subsequent load instructions are loaded. In general, explicit LoadLoad barriers are needed on processors that perform speculative loads and/or out-of-order processing in which waiting load instructions can bypass waiting stores. On processors that guarantee to always preserve load ordering, the barriers amount to no-ops.
序列: Load1; LoadLoad; Load2
确保在加载由 Load2 和所有后续加载指令访问的数据之前,先加载 Load1 的数据。 通常,显式的 LoadLoad 屏障在这样的处理器上被需要,该处理器执行推测性加载和/或乱序处理(其中等待中的加载指令可以绕过等待中的存储)。 在保证始终保持加载排序的处理器上,屏障等于no-ops。
StoreStore Barriers
The sequence: Store1; StoreStore; Store2
ensures that Store1’s data are visible to other processors (i.e., flushed to memory) before the data associated with Store2 and all subsequent store instructions. In general, StoreStore barriers are needed on processors that do not otherwise guarantee strict ordering of flushes from write buffers and/or caches to other processors or main memory.
序列: Store1; StoreStore; Store2
确保在与 Store2 和所有后续存储指令关联的数据之前,Store1 的数据对其他处理器可见(即已刷新到内存)。 通常,StoreStore 屏障在这样的处理器上被需要: 该处理器本身不能保证从写缓冲区和/或高速缓存到其他处理器或主存的刷新严格按序进行。
LoadStore Barriers
The sequence: Load1; LoadStore; Store2
ensures that Load1’s data are loaded before all data associated with Store2 and subsequent store instructions are flushed. LoadStore barriers are needed only on those out-of-order processors in which waiting store instructions can bypass loads.
序列: Load1; LoadStore; Store2
确保在与 Store2 和后续存储指令相关的所有数据被刷新之前,Load1 的数据先被加载。 仅在那些等待中的存储指令可以绕过加载的乱序处理器上才需要 LoadStore 屏障。
StoreLoad Barriers
The sequence: Store1; StoreLoad; Load2
ensures that Store1’s data are made visible to other processors (i.e., flushed to main memory) before data accessed by Load2 and all subsequent load instructions are loaded. StoreLoad barriers protect against a subsequent load incorrectly using Store1’s data value rather than that from a more recent store to the same location performed by a different processor. Because of this, on the processors discussed below, a StoreLoad is strictly necessary only for separating stores from subsequent loads of the same location(s) as were stored before the barrier. StoreLoad barriers are needed on nearly all recent multiprocessors, and are usually the most expensive kind. Part of the reason they are expensive is that they must disable mechanisms that ordinarily bypass cache to satisfy loads from write-buffers. This might be implemented by letting the buffer fully flush, among other possible stalls.
序列: Store1; StoreLoad; Load2
确保在加载 Load2 和所有后续加载指令所访问的数据之前,使 Store1 的数据对其他处理器可见(即已刷新到主存储器)。 StoreLoad 屏障可以防止一个后续加载错误地使用 Store1 的数据值,而不是使用由不同处理器执行的对相同位置的更新的存储。 因此,在下面讨论的处理器上,只有为了将存储与该屏障之后访问同一位置的后续加载分开时,一个 StoreLoad 才严格需要。 StoreLoad 屏障在几乎所有最新的多处理器中都是必需的,并且通常是最昂贵的屏障。 它们之所以昂贵的部分原因是它们必须禁用通常绕过高速缓存的机制来满足来自写缓冲区的加载。 这可以通过让缓冲区完全刷新以及其他可能的停顿来实现。
On all processors discussed below, it turns out that instructions that perform StoreLoad also obtain the other three barrier effects, so StoreLoad can serve as a general-purpose (but usually expensive) Fence. (This is an empirical fact, not a necessity.) The opposite doesn’t hold though. It is NOT usually the case that issuing any combination of other barriers gives the equivalent of a StoreLoad.
在下面讨论的所有处理器上,事实证明执行 StoreLoad 的指令同时也获得了其他三种屏障效果, 因此 StoreLoad 可用作通用(但通常很昂贵)的 Fence。 (这是一个经验事实,而不是必然的。) 但反过来并不成立: 发出其他屏障的任意组合通常并不等价于一个 StoreLoad。
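Java code does not normally issue these barriers directly, but since Java 9 the `java.lang.invoke.VarHandle` class exposes static fence methods that correspond roughly to the categories above: `loadLoadFence()` (LoadLoad), `storeStoreFence()` (StoreStore), `acquireFence()` (LoadLoad + LoadStore), `releaseFence()` (LoadStore + StoreStore), and `fullFence()` (all four, including StoreLoad). This API postdates JSR-133, so the sketch below is only an assumed mapping for illustration; the class, field, and method names in it are hypothetical.

```java
import java.lang.invoke.VarHandle;

// Sketch (Java 9+): VarHandle static fences used as explicit barriers between plain accesses.
public class FenceSketch {

    static int data;   // plain fields; any ordering comes only from the fences
    static int ready;
    static int x, y;

    // Store1; StoreStore; Store2
    static void publish(int value) {
        data = value;
        VarHandle.storeStoreFence();
        ready = 1;
    }

    // Load1; LoadLoad; Load2
    static int consume() {
        int r = ready;
        VarHandle.loadLoadFence();
        return (r == 1) ? data : -1;
    }

    // Store1; StoreLoad; Load2. Only fullFence() provides StoreLoad.
    // In a Dekker-style protocol, another thread would run the mirror image (y = 1; fence; read x).
    static int storeThenLoad() {
        x = 1;
        VarHandle.fullFence();
        return y;
    }

    public static void main(String[] args) {
        publish(7);
        System.out.println(consume());       // 7 (single-threaded)
        System.out.println(storeThenLoad()); // 0 (no other thread writes y here)
    }
}
```

There is no stand-alone LoadStore fence in this API; as with the processors discussed above, the only way to obtain StoreLoad is the full fence.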
The following table shows how these barriers correspond to JSR-133 ordering rules.
下表显示了这些屏障如何与 JSR-133 排序规则相对应。
Required barriers (rows: 1st operation; columns: 2nd operation) | Normal Load | Normal Store | Volatile Load / MonitorEnter | Volatile Store / MonitorExit |
Normal Load | | | | LoadStore |
Normal Store | | | | StoreStore |
Volatile Load / MonitorEnter | LoadLoad | LoadStore | LoadLoad | LoadStore |
Volatile Store / MonitorExit | | | StoreLoad | StoreStore |
Plus the special final-field rule requiring a StoreStore barrier in
x.finalField = v; StoreStore; sharedRef = x;
Here’s an example showing placements.
加上特殊的final字段规则,该规则要求在下面语句中需要一个StoreStore屏障
x.finalField = v; StoreStore; sharedRef = x;
下面是展示屏障放置位置的示例。
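Java:
class X {
  int a, b;
  volatile int v, u;
  void f() {
    int i, j;
    i = a;
    j = b;
    i = v;
    j = u;
    a = i;
    b = j;
    v = i;
    u = j;
    i = u;
    j = b;
    a = i;
  }
}
Instructions:
load a
load b
load v
  LoadLoad
load u
  LoadStore
store a
store b
  StoreStore
store v
  StoreStore
store u
  StoreLoad
load u
  LoadLoad
  LoadStore
load b
store a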
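The collapsed “Required barriers” table can likewise be written down as data that a code generator might consult. This is a minimal Java sketch with hypothetical names; it only answers which barrier a single (1st, 2nd) pair requires, and does not attempt the placement choices visible in the listing for `f()`, where, for example, one LoadStore after `load u` covers several pairs at once.

```java
import java.util.Optional;

// Sketch: the collapsed "Required barriers" table as a lookup (hypothetical names).
public class RequiredBarriers {

    enum Kind { NORMAL_LOAD, NORMAL_STORE, VOLATILE_LOAD_OR_ENTER, VOLATILE_STORE_OR_EXIT }
    enum Barrier { LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }

    // TABLE[first][second]; null means the table requires no barrier for that pair.
    private static final Barrier[][] TABLE = {
        // 2nd: NORMAL_LOAD       NORMAL_STORE        VOL_LOAD/ENTER      VOL_STORE/EXIT
        { null,              null,               null,               Barrier.LOAD_STORE  }, // Normal Load
        { null,              null,               null,               Barrier.STORE_STORE }, // Normal Store
        { Barrier.LOAD_LOAD, Barrier.LOAD_STORE, Barrier.LOAD_LOAD,  Barrier.LOAD_STORE  }, // Volatile Load / MonitorEnter
        { null,              null,               Barrier.STORE_LOAD, Barrier.STORE_STORE }, // Volatile Store / MonitorExit
    };

    static Optional<Barrier> required(Kind first, Kind second) {
        return Optional.ofNullable(TABLE[first.ordinal()][second.ordinal()]);
    }

    public static void main(String[] args) {
        // "store b; store v" in f(): a normal store followed by a volatile store.
        System.out.println(required(Kind.NORMAL_STORE, Kind.VOLATILE_STORE_OR_EXIT));           // Optional[STORE_STORE]
        // "store u; load u": a volatile store followed by a volatile load.
        System.out.println(required(Kind.VOLATILE_STORE_OR_EXIT, Kind.VOLATILE_LOAD_OR_ENTER)); // Optional[STORE_LOAD]
        // "load a; load b": two normal loads need nothing.
        System.out.println(required(Kind.NORMAL_LOAD, Kind.NORMAL_LOAD));                      // Optional.empty
    }
}
```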
Data Dependency and Barriers (数据依赖性和屏障)
The need for LoadLoad and LoadStore barriers on some processors interacts with their ordering guarantees for dependent instructions. On some (most) processors, a load or store that is dependent on the value of a previous load is ordered by the processor without need for an explicit barrier. This commonly arises in two kinds of cases, indirection:
Load x; Load x.field
and control
Load x; if (predicate(x)) Load or Store y;
一些处理器上对 LoadLoad 和 LoadStore 屏障的需求与其对相关指令的排序保证相互影响。 在某些(大多数)处理器上,一个加载或存储(该操作依赖于之前加载的值)被处理器排序时并不需要一个显式的屏障。 这通常在两种情况下出现, 间接:
Load x; Load x.field
和控制:
Load x; if (predicate(x)) Load or Store y;
Processors that do NOT respect indirection ordering in particular require barriers for final field access for references initially obtained through shared references:
x = sharedRef; … ; LoadLoad; i = x.finalField;
不考虑间接排序的处理器尤其需要对引用(该引用最初通过共享引用获得)进行 final 字段访问的屏障:
x = sharedRef; … ; LoadLoad; i = x.finalField;
Conversely, as discussed below, processors that DO respect data dependencies provide several opportunities to optimize away LoadLoad and LoadStore barrier instructions that would otherwise need to be issued. (However, dependency does NOT automatically remove the need for StoreLoad barriers on any processor.)
相反,如下所述,尊重数据依赖性的处理器提供了若干机会,可以优化掉原本需要发出的 LoadLoad 和 LoadStore 屏障指令。 (但是,在任何处理器上,依赖关系都不会自动消除对 StoreLoad 屏障的需求。)
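A hypothetical helper deciding whether a LoadLoad or LoadStore barrier that would follow a load can be dropped because the next access depends on the loaded value. It is a conservative reading of this section and of the “Data dependency orders loads?” column in the processor table below (“yes”, “indirection only”, “no”); all names are illustrative, and real processors may permit more (or fewer) cases than this sketch assumes.

```java
// Sketch (hypothetical): may a barrier after a load be omitted because of a dependency?
public class DependencyElision {

    enum Barrier { LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }

    /** How the next access depends on the value of the preceding load. */
    enum Dependency { NONE, INDIRECTION, CONTROL }   // Load x; Load x.field  /  if (p(x)) ...

    /** Values of the "Data dependency orders loads?" column. */
    enum CpuOrdering { YES, INDIRECTION_ONLY, NO }

    static boolean mayOmit(Barrier b, Dependency dep, CpuOrdering cpu) {
        // Dependencies never remove the need for StoreLoad (or StoreStore).
        if (b != Barrier.LOAD_LOAD && b != Barrier.LOAD_STORE) return false;
        switch (cpu) {
            case YES:              return dep != Dependency.NONE;
            case INDIRECTION_ONLY: return dep == Dependency.INDIRECTION;
            default:               return false;   // e.g. alpha: always emit the barrier
        }
    }

    public static void main(String[] args) {
        // ppc/arm style: x = sharedRef; i = x.finalField needs no LoadLoad.
        System.out.println(mayOmit(Barrier.LOAD_LOAD, Dependency.INDIRECTION, CpuOrdering.INDIRECTION_ONLY)); // true
        // ...but a control dependency alone is not enough there.
        System.out.println(mayOmit(Barrier.LOAD_LOAD, Dependency.CONTROL, CpuOrdering.INDIRECTION_ONLY));     // false
        // StoreLoad is never removed by a dependency.
        System.out.println(mayOmit(Barrier.STORE_LOAD, Dependency.INDIRECTION, CpuOrdering.YES));             // false
    }
}
```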
Interactions with Atomic Instructions (与原子指令的交互)
The kinds of barriers needed on different processors further interact with implementation of MonitorEnter and MonitorExit. Locking and/or unlocking usually entail the use of atomic conditional update operations CompareAndSwap (CAS) or LoadLinked/StoreConditional (LL/SC) that have the semantics of performing a volatile load followed by a volatile store. While CAS or LL/SC minimally suffice, some processors also support other atomic instructions (for example, an unconditional exchange) that can sometimes be used instead of or in conjunction with atomic conditional updates.
在不同处理器上需要的各种屏障进一步与 MonitorEnter 和 MonitorExit 的实现交互。 锁定和/或解锁通常需要使用原子条件更新操作 CompareAndSwap(CAS) 或 LoadLinked/StoreConditional(LL/SC),这些操作具有一个 volatile 加载,然后跟着一个 volatile 存储的语义。 尽管 CAS或 LL/SC 最低限度地满足了使用,但某些处理器还支持其他原子指令(例如,无条件交换),这些指令有时可以用来代替原子条件更新或与原子条件更新结合使用。
On all processors, atomic operations protect against read-after-write problems for the locations being read/updated. (Otherwise standard loop-until-success constructions wouldn’t work in the desired way.) But processors differ in whether atomic instructions provide more general barrier properties than the implicit StoreLoad for their target locations. On some processors these instructions also intrinsically perform barriers that would otherwise be needed for MonitorEnter/Exit; on others some or all of these barriers must be specifically issued.
在所有处理器上,原子操作可以防止正在被读取/更新的位置的 read-after-write 问题。 (否则,标准的 loop-until-success 结构无法按预期的方式工作。) 但是处理器的区别在于原子指令是否为其目标地址提供比隐式 StoreLoad 更通用的屏障属性。 在某些处理器上,这些指令还从根本上执行了 MonitorEnter/Exit 所需的屏障。 在其他处理器上,所有或部分这些屏障必须明确调用。
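The “loop-until-success” construction mentioned above can be sketched in Java with the JSR-166 method `java.util.concurrent.atomic.AtomicInteger.compareAndSet`, which is specified to have the memory effects of both reading and writing a volatile variable. How a JVM maps it onto CAS or LL/SC, and which extra barriers it must add, depends on the processor, as described in this section. `CasLoop` and its `incrementAndGet` helper are illustrative names (AtomicInteger already provides its own `incrementAndGet`).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the standard loop-until-success pattern on top of compareAndSet.
public class CasLoop {

    static int incrementAndGet(AtomicInteger counter) {
        while (true) {
            int current = counter.get();                  // read the current value
            int next = current + 1;
            if (counter.compareAndSet(current, next)) {   // succeeds only if still unchanged
                return next;
            }
            // Another thread updated the value in between; retry with a fresh read.
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger();
        Thread[] threads = new Thread[4];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 10_000; i++) incrementAndGet(counter);
            });
            threads[t].start();
        }
        for (Thread th : threads) th.join();
        System.out.println(counter.get()); // 40000
    }
}
```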
Volatiles and Monitors have to be separated to disentangle these effects, giving:
volatile 和监视器必须分开才能消除这些影响,从而得到:
Required Barriers (rows: 1st operation; columns: 2nd operation) | Normal Load | Normal Store | Volatile Load | Volatile Store | MonitorEnter | MonitorExit |
Normal Load | | | | LoadStore | | LoadStore |
Normal Store | | | | StoreStore | | StoreExit |
Volatile Load | LoadLoad | LoadStore | LoadLoad | LoadStore | LoadEnter | LoadExit |
Volatile Store | | | StoreLoad | StoreStore | StoreEnter | StoreExit |
MonitorEnter | EnterLoad | EnterStore | EnterLoad | EnterStore | EnterEnter | EnterExit |
MonitorExit | | | ExitLoad | ExitStore | ExitEnter | ExitExit |
Plus the special final-field rule requiring a StoreStore barrier in:
x.finalField = v; StoreStore; sharedRef = x;
加上特殊的 final 字段规则,该规则要求在以下位置添加 StoreStore 屏障:
x.finalField = v; StoreStore; sharedRef = x;
In this table, “Enter” is the same as “Load” and “Exit” is the same as “Store”, unless overridden by the use and nature of atomic instructions. In particular:
在此表中, “Enter” 与 “Load” 相同,”Exit” 与 “Store” 相同,除非被原子指令的使用和性质所覆盖。 特别是:
- EnterLoad is needed on entry to any synchronized block/method that performs a load. It is the same as LoadLoad unless an atomic instruction is used in MonitorEnter and itself provides a barrier with at least the properties of LoadLoad, in which case it is a no-op.
- EnterLoad 在进入任何执行加载的同步块/方法时被需要。 EnterLoad 与 LoadLoad 相同, 除非在 MonitorEnter 中使用了原子指令,并且该原子指令本身提供了至少具有 LoadLoad 属性的屏障, 在这种情况下,它是一个 no-op。
- StoreExit is needed on exit of any synchronized block/method that performs a store. It is the same as StoreStore unless an atomic instruction is used in MonitorExit and itself provides a barrier with at least the properties of StoreStore, in which case it is a no-op.
- StoreExit 在退出任何执行存储的同步块/方法时被需要。 StoreExit 与 StoreStore 相同, 除非在 MonitorExit 中使用了原子指令,并且该原子指令本身提供了至少具有 StoreStore 属性的屏障, 在这种情况下,它是一个 no-op。
- ExitEnter is the same as StoreLoad unless atomic instructions are used in MonitorExit and/or MonitorEnter and at least one of these provide a barrier with at least the properties of StoreLoad, in which case it is a no-op.
- ExitEnter 与 StoreLoad 相同,除非在 MonitorExit 和/或 MonitorEnter 中使用了原子指令,并且其中至少有一个提供了至少具有 StoreLoad 属性的屏障,在这种情况下,它是一个 no-op。
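The three rules above can be sketched as a small resolution function. The names are hypothetical, and an atomic instruction's barrier properties are modeled, as an assumption, simply as the set of barrier kinds it is known to provide (for example all four for CAS on sparc-TSO and x86, or none beyond its target location for LL/SC on ppc, per the processor table below).

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch (hypothetical): resolve EnterLoad, StoreExit and ExitEnter to a barrier or a no-op.
public class MonitorBarriers {

    enum Barrier { LOAD_LOAD, LOAD_STORE, STORE_STORE, STORE_LOAD }
    enum MonitorBarrier { ENTER_LOAD, STORE_EXIT, EXIT_ENTER }

    /**
     * @param enterProvides barrier properties of the atomic instruction used in MonitorEnter
     * @param exitProvides  barrier properties of the atomic instruction used in MonitorExit
     * @return the barrier to emit, or null for a no-op
     */
    static Barrier resolve(MonitorBarrier mb, Set<Barrier> enterProvides, Set<Barrier> exitProvides) {
        switch (mb) {
            case ENTER_LOAD:
                return enterProvides.contains(Barrier.LOAD_LOAD) ? null : Barrier.LOAD_LOAD;
            case STORE_EXIT:
                return exitProvides.contains(Barrier.STORE_STORE) ? null : Barrier.STORE_STORE;
            case EXIT_ENTER:
                // A no-op if either side's atomic instruction already acts as StoreLoad.
                boolean covered = enterProvides.contains(Barrier.STORE_LOAD)
                        || exitProvides.contains(Barrier.STORE_LOAD);
                return covered ? null : Barrier.STORE_LOAD;
            default:
                throw new IllegalArgumentException(mb.toString());
        }
    }

    public static void main(String[] args) {
        Set<Barrier> full = EnumSet.allOf(Barrier.class);   // e.g. CAS on x86 or sparc-TSO
        Set<Barrier> none = EnumSet.noneOf(Barrier.class);  // e.g. LL/SC ordering only its target
        for (MonitorBarrier mb : MonitorBarrier.values()) {
            System.out.println(mb + ": full=" + resolve(mb, full, full)
                    + ", targetOnly=" + resolve(mb, none, none));
        }
    }
}
```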
The other types are specializations that are unlikely to play a role in compilation (see below) and/or reduce to no-ops on current processors. For example, EnterEnter is needed to separate nested MonitorEnters when there are no intervening loads or stores. Here’s an example showing placements of most types:
其他类型是专门化,它们不太可能在编译中起作用(请参阅下文)和/或在当前处理器上简化为 no-ops。 例如,当没有中间加载或存储时,需要 EnterEnter 来分隔嵌套的 MonitorEnters。 下面是一个显示大多数类型位置的示例:
Java:
class X {
  int a;
  volatile int v;
  void f() {
    int i;
    synchronized(this) {
      i = a;
      a = i;
    }
    synchronized(this) {
      i = v;
      v = i;
      …
Instructions:
enter
…
Java-level access to atomic conditional update operations will be available in JDK1.5 via JSR-166 (concurrency utilities) so compilers will need to issue associated code, using a variant of the above table that collapses MonitorEnter and MonitorExit — semantically, and sometimes in practice, these Java-level atomic updates act as if they are surrounded by locks.
在 Java1.5 中 通过 JSR-166 (concurrency utilities)对原子条件更新操作进行 Java-level 访问将是可用的, 因此编译器将需要使用上表的变体来调用关联的代码,该变体将 MonitorEnter 和 MonitorExit 折叠起来 —— 从语义上讲,有时在实践中,这些 Java-level 原子更新的行为就像被锁包围一样。
Multiprocessors (多处理器)
Here’s a listing of processors that are commonly used in MPs, along with links to documents providing information about them. (Some require some clicking around from the linked site and/or free registration to access manuals.) This isn’t an exhaustive list, but it includes processors used in all current and near-future multiprocessor Java implementations I know of. The list and the properties of processors described below are not definitive. In some cases I’m just reporting what I read, and could have misread. Several reference manuals are not very clear about some properties relevant to the JMM. Please help make it definitive.
下面是 MPs 中常用处理器的列表,以及提供有关它们的信息的文档的链接。 (有些需要在链接的站点上单击一下和/或免费注册才能访问手册)。 这并不是一个详尽的清单,但它包括了在我所知道的所有当前和不久将来的多处理器 Java 实现中使用的处理器。 下面描述的处理器的列表和属性不是明确的。 在某些情况下,我只是在报告自己所读的内容,并且可能会误读。 一些参考手册对于与 JMM 相关的某些属性不是很清楚。请帮助使其明确。
Good sources of hardware-specific information about barriers and related properties of machines not listed here are Hans Boehm’s atomic_ops library, the Linux Kernel Source, and Linux Scalability Effort. Barriers needed in the linux kernel correspond in straightforward ways to those discussed here, and have been ported to most processors. For descriptions of the underlying models supported on different processors, see Sarita Adve et al, Recent Advances in Memory Consistency Models for Hardware Shared-Memory Systems and Sarita Adve and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial.
此处未列出的有关机器的屏障和相关属性的特定硬件信息的良好来源是 Hans Boehm’s atomic_ops library,Linux Kernel Source 和 Linux Scalability Effort。 linux 内核中所需的屏障以直接的方式对应于此处讨论的屏障,并且已移植到大多数处理器中。 有关不同处理器支持的基础模型的说明,请参见 Sarita Adve et al, Recent Advances in Memory Consistency Models for Hardware Shared-Memory Systems 和 Sarita Adve and Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial。
sparc-TSO
Ultrasparc 1, 2, 3 (sparcv9) in TSO (Total Store Order) mode. Ultra3s only support TSO mode. (RMO mode in Ultra1/2 is never used so can be ignored.) See UltraSPARC III Cu User’s Manual and The SPARC Architecture Manual, Version 9 .
x86 (and x64)
Intel 486+, as well as AMD and apparently others. There was a flurry of re-specs in 2005-2009, but the current specs are nearly identical to TSO, differing mainly only in supporting different cache modes, and dealing with corner cases such as unaligned accesses and special forms of instructions. See The IA-32 Intel Architecture Software Developers Manuals: System Programming Guide and AMD Architecture Programmer’s Manual Programming.
ia64
Itanium. See Intel Itanium Architecture Software Developer’s Manual, Volume 2: System Architecture
ppc (POWER)
All versions have the same basic memory model, but the names and definition of some memory barrier instructions changed over time. The listed versions have been current since Power4; see architecture manuals for details. See MPC603e RISC Microprocessor Users Manual, MPC7410/MPC7400 RISC Microprocessor Users Manual , Book II of PowerPC Architecture Book, PowerPC Microprocessor Family: Software reference manual, Book E- Enhanced PowerPC Architecture, EREF: A Reference for Motorola Book E and the e500 Core. For discussion of barriers see IBM article on power4 barriers, and IBM article on powerpc barriers.
arm
Version 7+. See ARM processor specifications
alpha
21264x and I think all others. See Alpha Architecture Handbook
pa-risc
HP pa-risc implementations. See the pa-risc 2.0 Architecture manual.
Here’s how these processors support barriers and atomics:
以下是这些处理器如何支持屏障和原子的:
Processor | LoadStore | LoadLoad | StoreStore | StoreLoad | Data dependency orders loads? | Atomic Conditional | Other Atomics | Atomics provide barrier? |
sparc-TSO | no-op | no-op | no-op | membar (StoreLoad) | yes | CAS: casa | swap, ldstub | full |
x86 | no-op | no-op | no-op | mfence or cpuid or locked insn | yes | CAS: cmpxchg | xchg, locked insn | full |
ia64 | combine with st.rel or ld.acq | ld.acq | st.rel | mf | yes | CAS: cmpxchg | xchg, fetchadd | target + acq/rel |
arm | dmb (see below) | dmb (see below) | dmb-st | dmb | indirection only | LL/SC: ldrex/strex | | target only |
ppc | lwsync (see below) | hwsync (see below) | lwsync | hwsync | indirection only | LL/SC: ldarx/stwcx | | target only |
alpha | mb | mb | wmb | mb | no | LL/SC: ldx_l/stx_c | | target only |
pa-risc | no-op | no-op | no-op | no-op | yes | build from ldcw | ldcw | (NA) |
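A code generator could keep part of this table as data. The minimal Java sketch below copies the four barrier columns for the rows whose entries are single instructions or no-ops (ia64 is omitted because its barriers are folded into ld.acq/st.rel). Where the table lists alternatives, as for x86 StoreLoad, the sketch picks `mfence` as an assumption; mnemonics are copied as the table writes them, and all Java names are hypothetical.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch: barrier columns of the processor table as a lookup (hypothetical names).
public class BarrierInstructions {

    enum Barrier { LOAD_STORE, LOAD_LOAD, STORE_STORE, STORE_LOAD }   // table column order
    enum Cpu { SPARC_TSO, X86, ARM, PPC, ALPHA, PA_RISC }

    private static final Map<Cpu, String[]> TABLE = new EnumMap<>(Cpu.class);
    static {
        //                          LoadStore   LoadLoad    StoreStore  StoreLoad
        TABLE.put(Cpu.SPARC_TSO, new String[] { "no-op",  "no-op",  "no-op",  "membar (StoreLoad)" });
        TABLE.put(Cpu.X86,       new String[] { "no-op",  "no-op",  "no-op",  "mfence" }); // or cpuid / locked insn
        TABLE.put(Cpu.ARM,       new String[] { "dmb",    "dmb",    "dmb-st", "dmb" });
        TABLE.put(Cpu.PPC,       new String[] { "lwsync", "hwsync", "lwsync", "hwsync" });
        TABLE.put(Cpu.ALPHA,     new String[] { "mb",     "mb",     "wmb",    "mb" });
        TABLE.put(Cpu.PA_RISC,   new String[] { "no-op",  "no-op",  "no-op",  "no-op" });
    }

    static String instructionFor(Cpu cpu, Barrier b) {
        return TABLE.get(cpu)[b.ordinal()];
    }

    static boolean isNoOp(Cpu cpu, Barrier b) {
        return instructionFor(cpu, b).equals("no-op");
    }

    public static void main(String[] args) {
        for (Cpu cpu : Cpu.values()) {
            System.out.println(cpu + " StoreLoad -> " + instructionFor(cpu, Barrier.STORE_LOAD));
        }
    }
}
```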
Notes (备注)
- Some of the listed barrier instructions have stronger properties than actually needed in the indicated cells, but seem to be the cheapest way to get desired effects.
- 列出的一些屏障指令具有比所示单元格实际所需的属性更强的属性,但似乎是获得所需效果的最便宜方法。
- The listed barrier instructions are those designed for use with normal program memory, but not necessarily other special forms/modes of caching and memory used for IO and system tasks. For example, on x86-SPO, StoreStore barriers (“sfence”) are needed with WriteCombining (WC) caching mode, which is designed for use in system-level bulk transfers etc. OSes use Writeback mode for programs and data, which doesn’t require StoreStore barriers.
- 列出的屏障指令是为普通程序内存设计的,但不一定适用于 IO 和系统任务所使用的其他特殊形式/模式的缓存和内存。 例如,在 x86-SPO 上,WriteCombining(WC) 缓存模式需要 StoreStore 屏障(“sfence”),该缓存模式被设计用于 system-level 批量传输等。 操作系统对程序和数据使用 Writeback 模式,该模式不需要 StoreStore 屏障。
- On x86, any lock-prefixed instruction can be used as a StoreLoad barrier. (The form used in linux kernels is the no-op lock; addl $0,0(%%esp).) Versions supporting the “SSE2” extensions (Pentium4 and later) support the mfence instruction which seems preferable unless a lock-prefixed instruction like CAS is needed anyway. The cpuid instruction also works but is slower.
- 在 x86 上,任何带 lock 前缀的指令都可以用作一个 StoreLoad 屏障。 (在 Linux 内核中使用的形式是 no-op lock; addl $0,0(%%esp)。) 支持 “SSE2” 扩展的版本(Pentium4 和更高版本)支持 mfence 指令, 该指令似乎是更好的,除非无论如何都需要像 CAS 这样的带 lock 前缀的指令。 cpuid 指令也可以,但是速度较慢。
- On ia64, LoadStore, LoadLoad and StoreStore barriers are folded into special forms of load and store instructions — there aren’t separate instructions. ld.acq acts as (load; LoadLoad+LoadStore) and st.rel acts as (LoadStore+StoreStore; store). Neither of these provide a StoreLoad barrier — you need a separate mf barrier instruction for that.
- 在 ia64 上,LoadStore、LoadLoad 和 StoreStore 屏障被折叠进特殊形式的加载和存储指令中,没有单独的屏障指令。 ld.acq 充当 (load; LoadLoad+LoadStore),而 st.rel 充当 (LoadStore+StoreStore; store)。 它们都不提供 StoreLoad 屏障;为此你需要单独的 mf 屏障指令。
- On both ARM and ppc, there may be opportunities to replace load fences in the presence of data dependencies with non-fence-based instruction sequences. Sequences and cases in which they apply are described in work by the Cambridge Relaxed Memory Concurrency Group.
- 在 ARM 和 ppc 上,如果存在数据依赖性,则可能有机会使用 non-fence-based 的指令序列来替换 load 栅栏。 Cambridge Relaxed Memory Concurrency Group 在成果中描述了它们适用的序列和情况。
- The sparc membar instruction supports all four barrier modes, as well as combinations of modes. But only the StoreLoad mode is ever needed in TSO. On some UltraSparcs, any membar instruction produces the effects of a StoreLoad, regardless of mode.
- sparc membar 指令支持所有四种屏障模式以及模式组合。 但是在 TSO 中只需要 StoreLoad 模式。 在某些 UltraSparcs 上,任何模式的任何 membar 指令都会产生 StoreLoad 的效果。
- The x86 processors supporting “streaming SIMD” SSE2 extensions require LoadLoad “lfence” only in connection with these streaming instructions.
- 支持 “streaming SIMD” SSE2扩展的 x86 处理器仅在与这些流式指令结合使用时才需要 LoadLoad “lfence”。
- Although the pa-risc specification does not mandate it, all HP pa-risc implementations are sequentially consistent, so have no memory barrier instructions.
- 尽管 pa-risc 规范没有强制要求,但所有 HP pa-risc 实现都是顺序一致的,因此没有内存屏障指令。
- The only atomic primitive on pa-risc is ldcw, a form of test-and-set, from which you would need to build up atomic conditional updates using techniques such as those in the HP white paper on spinlocks.
- pa-risc 上唯一的原子原语是 ldcw,它是一种 test-and-set 的形式,你将需要使用诸如 HP white paper on spinlocks 中的技术来实现原子条件更新。
- CAS and LL/SC take multiple forms on different processors, differing only with respect to field width, minimally including 4 and 8 byte versions.
- CAS 和 LL/SC 在不同的处理器上采用多种形式, 仅在字段宽度方面有所不同,至少包括 4 和 8 字节版本。
- On sparc and x86, CAS has implicit preceding and trailing full StoreLoad barriers. The sparcv9 architecture manual says CAS need not have post-StoreLoad barrier property, but the chip manuals indicate that it does on ultrasparcs.
- 在 sparc 和 x86 上,CAS 隐式地带有前置和后置的完整 StoreLoad 屏障。 sparcv9 体系结构手册说 CAS 不必具有后置 StoreLoad 屏障属性,但芯片手册表明在 ultrasparcs 上它确实具有该属性。
- On ppc and alpha, LL/SC have implicit barriers only with respect to the locations being loaded/stored, but don’t have more general barrier properties.
- 在 ppc 和 alpha 上,LL/SC 仅对于要加载/存储的位置具有隐式屏障,但没有更一般的屏障属性。
- The ia64 cmpxchg instruction also has implicit barriers with respect to the locations being loaded/stored, but additionally takes an optional .acq (post-LoadLoad+LoadStore) or .rel (pre-StoreStore+LoadStore) modifier. The form cmpxchg.acq can be used for MonitorEnter, and cmpxchg.rel for MonitorExit. In those cases where exits and enters are not guaranteed to be matched, an ExitEnter (StoreLoad) barrier may also be needed.
- ia64 cmpxchg 指令相对于要加载/存储的位置也具有隐式屏障,但是另外需要一个可选的 .acq (post-LoadLoad+LoadStore) 或 .rel (pre-StoreStore+LoadStore) 修饰符。 cmpxchg.acq 的格式可用于 MonitorEnter,而 cmpxchg.rel 的格式可用于 MonitorExit。 在无法保证退出和进入匹配的情况下,可能还需要一个 ExitEnter(StoreLoad) 屏障。
- Sparc, x86 and ia64 support unconditional-exchange (swap, xchg). Sparc ldstub is a one-byte test-and-set. ia64 fetchadd returns previous value and adds to it. On x86, several instructions (for example add-to-memory) can be lock-prefixed, causing them to act atomically.
- Sparc、x86 和 ia64 支持 unconditional-exchange (swap, xchg)。 Sparc ldstub 是单字节的 test-and-set。 ia64 fetchadd 返回先前的值并对其执行加法。 在 x86 上,多条指令(例如 add-to-memory)可以加上 lock 前缀,从而使其原子地执行。
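The pa-risc note above says atomic conditional updates must be built from a test-and-set. As a minimal sketch, assuming `AtomicBoolean.getAndSet(true)` stands in for a hardware test-and-set such as `ldcw` (a JVM would emit the instruction directly rather than call a library), a spinlock built from test-and-set can guard an emulated compare-and-set on an ordinary field. `TasEmulatedCas` is a hypothetical name, and `Thread.onSpinWait` requires Java 9 or later.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: emulating compare-and-set when only test-and-set exists.
// AtomicBoolean.getAndSet(true) stands in for a hardware test-and-set (e.g. pa-risc ldcw).
final class TasEmulatedCas {
    private final AtomicBoolean lock = new AtomicBoolean(false); // false means free
    private long value; // guarded by lock: only touched while the spinlock is held

    TasEmulatedCas(long initial) {
        value = initial;
    }

    private void acquire() {
        // Spin until test-and-set reports that the lock was previously free.
        while (lock.getAndSet(true)) {
            Thread.onSpinWait();
        }
    }

    private void release() {
        lock.set(false); // volatile-mode store publishes the guarded update
    }

    // Atomically: if value == expected, set it to update and return true.
    boolean compareAndSet(long expected, long update) {
        acquire();
        try {
            if (value != expected) {
                return false;
            }
            value = update;
            return true;
        } finally {
            release();
        }
    }

    long get() {
        acquire();
        try {
            return value;
        } finally {
            release();
        }
    }
}
```

A retry loop such as `do { v = c.get(); } while (!c.compareAndSet(v, v + 1));` then yields an atomic increment, which is the usual way other atomic read-modify-write operations are layered on top of a conditional update.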
Recipes (食谱)
Uniprocessors (单处理器)
If you are generating code that is guaranteed to only run on a uniprocessor, then you can probably skip the rest of this section. Because uniprocessors preserve apparent sequential consistency, you never need to issue barriers unless object memory is somehow shared with asynchronously accessible IO memory. This might occur with specially mapped java.nio buffers, but probably only in ways that affect internal JVM support code, not Java code. Also, it is conceivable that some special barriers would be needed if context switching doesn’t entail sufficient synchronization.
如果你生成的代码保证只在单处理器上运行,那么你大概可以跳过本节的其余部分。 因为单处理器保持了表面上的顺序一致性,所以除非对象内存以某种方式与可异步访问的 IO 内存共享,否则你永远不需要发出屏障。 这可能发生在特殊映射的 java.nio 缓冲区中,但可能仅以影响 JVM 内部支持代码而非 Java 代码的方式发生。 此外,可以想象,如果上下文切换没有带来足够的同步,那么可能需要一些特殊的屏障。
Inserting Barriers (插入屏障)
Barrier instructions apply between different kinds of accesses as they occur during execution of a program. Finding an “optimal” placement that minimizes the total number of executed barriers is all but impossible. Compilers often cannot tell if a given load or store will be preceded or followed by another that requires a barrier; for example, when a volatile store is followed by a return. The easiest conservative strategy is to assume that the kind of access requiring the “heaviest” kind of barrier will occur when generating code for any given load, store, lock, or unlock:
屏障指令适用于在程序执行期间发生的各种访问之间。 几乎不可能找到一个“最优的”布局,以使要执行的屏障的总数最少。 编译器通常无法确定给定的加载或存储是在需要屏障的加载或存储之前还是之后;例如,当一个 volatile 存储后面跟着一个 return 时。 最简单的保守策略是假设在为任何给定的加载,存储,锁定或解锁生成代码时,需要“最重”类型屏障的那种访问将会发生:
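As a minimal sketch of this conservative strategy for volatile fields only, assuming the commonly stated JSR-133 rules (a StoreStore barrier before and a StoreLoad barrier after each volatile store, and LoadLoad plus LoadStore barriers after each volatile load; monitor and final-field rules are omitted here), a code generator can wrap each volatile access without examining its neighbours. The `Op` enum and `ConservativeBarriers` class are hypothetical names for a toy IR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy IR: plain and volatile accesses plus the four barrier kinds.
enum Op {
    LOAD, STORE, VOLATILE_LOAD, VOLATILE_STORE,
    LOADLOAD, LOADSTORE, STORESTORE, STORELOAD
}

final class ConservativeBarriers {
    // Wrap every volatile access with the barriers that the "heaviest" possible
    // neighbouring access would require, without looking at the neighbours.
    static List<Op> insert(List<Op> code) {
        List<Op> out = new ArrayList<>();
        for (Op op : code) {
            switch (op) {
                case VOLATILE_STORE:
                    out.add(Op.STORESTORE); // earlier stores complete before the volatile store
                    out.add(op);
                    out.add(Op.STORELOAD);  // volatile store completes before any later load
                    break;
                case VOLATILE_LOAD:
                    out.add(op);
                    out.add(Op.LOADLOAD);   // volatile load completes before later loads
                    out.add(Op.LOADSTORE);  // volatile load completes before later stores
                    break;
                default:
                    out.add(op);            // plain accesses are left untouched
            }
        }
        return out;
    }
}
```

Because every volatile access receives its full set of barriers, the output is correct regardless of what precedes or follows it, at the cost of many barriers that later passes or the target processor may render redundant.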
Many of these barriers usually reduce to no-ops. In fact, most of them reduce to no-ops, but in different ways under different processors and locking schemes. For the simplest examples, basic conformance to JSR-133 on x86 or sparc-TSO using CAS for locking amounts only to placing a StoreLoad barrier after volatile stores.
许多这些屏障通常简化为 no-ops。 实际上,它们大多数都简化为 no-ops,但在不同的处理器和锁定方案下以不同的方式进行。 举最简单的例子,在使用 CAS 实现锁的 x86 或 sparc-TSO 上,要基本符合 JSR-133,只需在 volatile 存储之后放置一个 StoreLoad 屏障。
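The reduction to no-ops can be pictured as a per-processor lowering table. This minimal sketch assumes that on x86 and sparc-TSO only StoreLoad requires a real instruction, shown here as `mfence` and `membar #StoreLoad` respectively (x86 code generators may use a locked instruction instead); `Barrier`, `Target`, and `BarrierLowering` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

enum Barrier { LOADLOAD, LOADSTORE, STORESTORE, STORELOAD }

enum Target { X86, SPARC_TSO }

final class BarrierLowering {
    // Instruction emitted for each barrier kind; a missing entry means the barrier
    // is a no-op on that target because the hardware already preserves the ordering.
    private static Map<Barrier, String> table(Target t) {
        Map<Barrier, String> m = new EnumMap<>(Barrier.class);
        switch (t) {
            case X86:
                m.put(Barrier.STORELOAD, "mfence");
                break;
            case SPARC_TSO:
                m.put(Barrier.STORELOAD, "membar #StoreLoad");
                break;
        }
        return m;
    }

    // Lower a sequence of barriers, dropping those that are no-ops on the target.
    static List<String> lower(Target t, List<Barrier> barriers) {
        Map<Barrier, String> m = table(t);
        List<String> out = new ArrayList<>();
        for (Barrier b : barriers) {
            String insn = m.get(b);
            if (insn != null) {
                out.add(insn);
            }
        }
        return out;
    }
}
```

Lowering all four barrier kinds for x86 yields only `mfence`, which matches the observation that on these processors only the StoreLoad barrier after volatile stores survives.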
Removing Barriers (消除屏障)
The conservative strategy above is likely to perform acceptably for many programs. The main performance issues surrounding volatiles occur for the StoreLoad barriers associated with stores. These ought to be relatively rare — the main reason for using volatiles in concurrent programs is to avoid the need to use locks around reads, which is only an issue when reads greatly overwhelm writes. But this strategy can be improved in at least the following ways:
上面的保守策略可能在许多程序中都能令人满意地执行。 围绕 volatiles 的主要性能问题会发生在与存储关联的 StoreLoad 屏障。 这些应该相对较少 —— 在并发程序中使用 volatile 的主要原因是避免需要在读取周围使用锁,这仅在读取大大超过写入的情况下才是问题。 但是,至少可以通过以下方式改进此策略:
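The read-mostly usage described above can be sketched as follows, assuming a counter whose reads are frequent and whose writes are rare. It is the familiar pattern of a volatile field read without locking and updated under a monitor; `ReadMostlyCounter` is a hypothetical name. Reads compile to plain volatile loads, whose LoadLoad and LoadStore barriers are no-ops on TSO processors such as x86, while the costly StoreLoad barrier is paid only on the rare increments.

```java
// Sketch of the read-mostly pattern: readers never lock, writers serialize on a monitor.
final class ReadMostlyCounter {
    private volatile long value; // each read is a volatile load, with no lock

    long get() {
        return value; // volatile load: its trailing barriers are no-ops on TSO hardware
    }

    synchronized long increment() {
        // The monitor makes the read-modify-write atomic; the volatile store then
        // pays the StoreLoad cost, which is acceptable because writes are rare.
        long next = value + 1;
        value = next;
        return next;
    }
}
```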
- Removing redundant barriers. The above tables indicate that barriers can be eliminated as follows:
- 消除多余的屏障。 上表表明可以按以下方式消除屏障:
| 1st (original) | ops | 2nd (original) | => | 1st (transformed) | ops | 2nd (transformed) |
|---|---|---|---|---|---|---|
| LoadLoad | [no loads] | LoadLoad | => | | [no loads] | LoadLoad |
| LoadLoad | [no loads] | StoreLoad | => | | [no loads] | StoreLoad |
| StoreStore | [no stores] | StoreStore | => | | [no stores] | StoreStore |
| StoreStore | [no stores] | StoreLoad | => | | [no stores] | StoreLoad |
| StoreLoad | [no loads] | LoadLoad | => | StoreLoad | [no loads] | |
| StoreLoad | [no stores] | StoreStore | => | StoreLoad | [no stores] | |
| StoreLoad | [no volatile loads] | StoreLoad | => | | [no volatile loads] | StoreLoad |
- Similar eliminations can be used for interactions with locks, but depend on how locks are implemented. Doing all this in the presence of loops, calls, and branches is left as an exercise for the reader.
- 类似的消除方法可用于与锁的交互,但取决于锁的实现方式。 在存在循环、调用和分支的情况下完成所有这些工作,留给读者作为练习。
- Rearranging code (within the allowed constraints) to further enable removing LoadLoad and LoadStore barriers that are not needed because of data dependencies on processors that preserve such orderings.
- 在允许的约束范围内重排代码,以便进一步消除那些在保留此类顺序的处理器上因数据依赖而不再需要的 LoadLoad 和 LoadStore 屏障。
- Moving the point in the instruction stream at which the barriers are issued, to improve scheduling, so long as they still occur somewhere in the interval where they are required.
- 移动屏障在指令流中的发出位置以改善调度,只要它们仍然出现在所要求的区间内的某处即可。
- Removing barriers that aren’t needed because there is no possibility that multiple threads could rely on them; for example volatiles that are provably visible only from a single thread. Also, removing some barriers when it can be proven that threads can only store or only load certain fields. All this usually requires a fair amount of analysis.
- 消除不需要的屏障,因为不可能有多个线程依赖它们;例如,可证明仅对单个线程可见的 volatile 变量。 另外,在可以证明线程对某些字段只进行存储或只进行加载时,消除一些屏障。 所有这些通常需要进行大量分析。
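The table-driven eliminations can be sketched as a peephole pass over straight-line code. This is a minimal sketch under simplifying assumptions: instructions are a hypothetical `Insn` enum, only consecutive barriers (with no other barrier between them) are paired, and loops, calls, branches, and lock interactions are ignored. `RedundantBarrierElimination` is likewise a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy instructions. STORE covers plain and volatile stores;
// VOLATILE_LOAD is a load that is also volatile.
enum Insn {
    LOAD, VOLATILE_LOAD, STORE,
    LOADLOAD, LOADSTORE, STORESTORE, STORELOAD;

    boolean isBarrier() {
        return this == LOADLOAD || this == LOADSTORE || this == STORESTORE || this == STORELOAD;
    }

    boolean isLoad() {
        return this == LOAD || this == VOLATILE_LOAD;
    }
}

final class RedundantBarrierElimination {
    // Applies the pairwise rules of the table until no more barriers can be removed.
    static List<Insn> eliminate(List<Insn> input) {
        List<Insn> code = new ArrayList<>(input);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < code.size() && !changed; i++) {
                if (!code.get(i).isBarrier()) continue;
                int j = i + 1; // find the next barrier after position i
                while (j < code.size() && !code.get(j).isBarrier()) j++;
                if (j == code.size()) break; // no later barrier to pair with
                int victim = victim(code.get(i), code.get(j), code.subList(i + 1, j));
                if (victim >= 0) {
                    code.remove(victim == 0 ? i : j); // remove by index
                    changed = true;
                }
            }
        }
        return code;
    }

    // Returns 0 to drop the first barrier, 1 to drop the second, -1 to keep both.
    private static int victim(Insn first, Insn second, List<Insn> between) {
        boolean noLoads = between.stream().noneMatch(Insn::isLoad);
        boolean noStores = !between.contains(Insn.STORE);
        boolean noVolatileLoads = !between.contains(Insn.VOLATILE_LOAD);
        switch (first) {
            case LOADLOAD:
                return noLoads && (second == Insn.LOADLOAD || second == Insn.STORELOAD) ? 0 : -1;
            case STORESTORE:
                return noStores && (second == Insn.STORESTORE || second == Insn.STORELOAD) ? 0 : -1;
            case STORELOAD:
                if (second == Insn.LOADLOAD && noLoads) return 1;
                if (second == Insn.STORESTORE && noStores) return 1;
                if (second == Insn.STORELOAD && noVolatileLoads) return 0;
                return -1;
            default:
                return -1;
        }
    }
}
```

Applied to the conservative barriers for two back-to-back volatile stores (StoreStore, store, StoreLoad, StoreStore, store, StoreLoad), the pass first drops the second StoreStore and then the first StoreLoad, leaving a single StoreLoad after the second store. Pairing only consecutive barriers may miss some eliminations, but it never removes a barrier the table does not allow.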
Miscellany (杂记)
JSR-133 also addresses a few other issues that may entail barriers in more specialized cases:
JSR-133 还解决了一些其他问题,这些问题在更特殊的情况下可能会带来屏障:
- Thread.start() requires barriers ensuring that the started thread sees all stores visible to the caller at the call point. Conversely, Thread.join() requires barriers ensuring that the caller sees all stores by the terminating thread. These are normally generated by the synchronization entailed in implementations of these constructs.
- Thread.start() 要求设置屏障,以确保被启动的线程能看到调用者在调用点可见的所有存储。 相反,Thread.join() 要求设置屏障,以确保调用者能看到终止线程执行的所有存储。 这些屏障通常由这些构造的实现中所包含的同步生成。
- Static final initialization requires StoreStore barriers that are normally entailed in mechanics needed to obey Java class loading and initialization rules.
- 静态 final 初始化需要 StoreStore 屏障,而遵循 Java 类加载和初始化规则所需的机制通常已经包含了这些屏障。
- Ensuring default zero/null initial field values normally entails barriers, synchronization, and/or low-level cache control within garbage collectors.
- 确保字段默认的 zero/null 初始值,通常需要在垃圾收集器中使用屏障、同步和/或低级缓存控制。
- JVM-private routines that “magically” set System.in, System.out, and System.err outside of constructors or static initializers need special attention since they are special legacy exceptions to JMM rules for final fields.
- 在构造函数或静态初始化器之外“神奇地”设置 System.in、System.out 和 System.err 的 JVM 私有例程需要特别注意,因为它们是 final 字段 JMM 规则的特殊遗留例外。
- Similarly, internal JVM deserialization code that sets final fields normally requires a StoreStore barrier.
- 同样,设置 final 字段的 JVM 内部反序列化代码通常需要一个 StoreStore 屏障。
- Finalization support may require barriers (within garbage collectors) to ensure that Object.finalize code sees all stores to all fields prior to the objects becoming unreferenced. This is usually ensured via the synchronization used to add and remove references in reference queues.
- Finalization 支持可能需要(在垃圾收集器内的)屏障,以确保 Object.finalize 代码能看到对象变为不可引用之前对其所有字段的所有存储。 这通常通过在引用队列中添加和删除引用时所用的同步来保证。
- Calls to and returns from JNI routines may require barriers, although this seems to be a quality-of-implementation issue.
- 对 JNI 例程的调用和从其返回可能需要屏障,尽管这似乎是实现质量的问题。
- Most processors have other synchronizing instructions designed primarily for use with IO and OS actions. These don’t impact JMM issues directly, but may be involved in IO, class loading, and dynamic code generation.
- 大多数处理器还有其他主要设计用于 IO 和 OS 操作的同步指令。 这些指令不会直接影响 JMM 问题,但可能涉及 IO、类加载和动态代码生成。
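At the Java level, the barriers that implementations attach to `Thread.start()` and `Thread.join()` are what make the following sketch correct even though neither field is volatile. The class name `StartJoinVisibility` is hypothetical, and the code only illustrates the guarantee a Java programmer relies on, not how a JVM emits the barriers.

```java
// Neither field is volatile: visibility comes only from the happens-before
// edges that Thread.start() and Thread.join() provide.
final class StartJoinVisibility {
    private int input;   // written by the caller before start()
    private int output;  // written by the worker before it terminates

    int run(int value) throws InterruptedException {
        input = value;                                   // visible to the worker after start()
        Thread worker = new Thread(() -> output = input * 2);
        worker.start();                                  // caller's prior stores happen-before the worker's actions
        worker.join();                                   // worker's stores happen-before join() returning
        return output;
    }
}
```

Without those guarantees the worker could read a stale `input`, or the caller could read a stale `output`, since no other synchronization connects the two threads.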
Acknowledgments (致谢)
Thanks to Bill Pugh, Dave Dice, Jeremy Manson, Kourosh Gharachorloo, Tim Harris, Cliff Click, Allan Kielstra, Yue Yang, Hans Boehm, Kevin Normoyle, Juergen Kreileder, Alexander Terekhov, Tom Deneau, Clark Verbrugge, Peter Kessler, Peter Sewell, Jan Vitek, and Richard Grisenthwaite for corrections and suggestions.
Last modified: Tue Mar 22 07:11:36 EDT 2011