Inter Thread Latency

原文地址:http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html (移到墙内)

Message rates between threads are fundamentally determined by the latency of memory exchange between CPU cores.   The minimum unit of transfer will be a cache line exchanged via shared caches or socket interconnects.  In a previous article I explained Memory Barriers and why they are important to concurrent programming between threads.  These are the instructions that cause a CPU to make memory visible to other cores in an ordered and timely manner.

Lately I’ve been asked a lot about how much faster the Disruptor would be if C++ was used instead of Java.  For sure C++ would give more control for memory alignment and potential access to underlying CPU instructions such as memory barriers and lock instructions.  In this article I’ll directly compare C++ and Java to measure the cost of signalling a change between threads.

For the test we’ll use two counters each updated by their own thread.  A simple ping-pong algorithm will be used to signal from one to the other and back again.  The exchange will be repeated millions of times to measure the average latency between cores.  This measurement will give us the latency of exchanging a cache line between cores in a serial manner.

For Java we’ll use volatile counters which the JVM will kindly insert a lock instruction for the update giving us an effective memory barrier.

public final class InterThreadLatency
implements Runnable
{
public static final long ITERATIONS = 500L * 1000L * 1000L;

public static volatile long s1;
public static volatile long s2;

public static void main(final String[] args)
{
Thread t = new Thread(new InterThreadLatency());
t.setDaemon(true);
t.start();

long start = System.nanoTime();

long value = s1;
while (s1 < ITERATIONS)
{
while (s2 != value)
{
// busy spin
}
value = ++s1;
}

long duration = System.nanoTime() - start;

System.out.println("duration = " + duration);
System.out.println("ns per op = " + duration / (ITERATIONS * 2));
System.out.println("op/sec = " +
(ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
System.out.println("s1 = " + s1 + ", s2 = " + s2);
}

public void run()
{
long value = s2;
while (true)
{
while (value == s1)
{
// busy spin
}
value = ++s2;
}
}
}

For C++ we’ll use the <a href=”http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html”>GNU Atomic Builtins</a> which give us a similar lock instruction insertion to that which the JVM uses.

#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld/n", duration);
    printf("ns per op = %lld/n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld/n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld/n", s1, s2);

    return 0;
}

Results

$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000

The C++ version is slightly faster on my Intel Sandybridge laptop.  So what does this tell us?  Well, that the latency between 2 cores on a 2.2 GHz machine is ~45ns and that you can exchange 22m messages per second in a serial fashion.  On an Intel CPU this is fundamentally the cost of the lock instruction enforcing total order and forcing the store buffer and write combining buffers to drain, followed by the resulting cache coherency traffic between the cores.   Note that each core has a 96GB/s port onto the L3 cache ring bus, yet 22m * 64-bytes is only 1.4 GB/s.  This is because we have measured latency and not throughput.  We could easily fit some nice fat messages between those memory barriers as part of the exchange if the data has been written before the lock instruction was executed.

So what does this all mean for the Disruptor?  Basically, the latency of the Disruptor is about as low as we can get from Java.  It would be possible to get a ~10% latency improvement by moving to C++.  I’d expect a similar improvement in throughput for C++.  The main win with C++ would be the control, and therefore, the predictability that comes with it if used correctly.  The JVM gives us nice safety features like garbage collection in complex applications but we pay a little for that with the extra instructions it inserts that can be seen if you get Hotspot to dump the assembler instructions it is generating.

How does the Disruptor achieve more than 25m messages per second I hear you say???   Well that is one of the neat parts of its design.  The “waitFor” semantics on the SequenceBarrier enables a very efficient form of batching, which allows the BatchEventProcessor to process a series of events that occurred since it last checked in with theRingBuffer, all without incurring a memory barrier.  For real world applications this batching effect is really significant.  For micro benchmarks it only makes the results more random,  especially when there is little work done other than accepting the message.

Conclusion

So when processing events in series, the measurements tell us that the current generation of processors can do between 20-30 million exchanges per second at a latency less than 50ns.  The Disruptor design allows us to get greater throughput without explicit batching on the publisher side.  In addition the Disruptor has an explicit batching API on the publisher side that can give over 100 million messages per second.

原创文章,作者:kepupublish,如若转载,请注明出处:https://blog.ytso.com/141007.html

(0)
上一篇 2021年9月5日
下一篇 2021年9月5日

相关推荐

发表回复

登录后才能评论