# Korinsky's Atomic

Low-latency atomic operations for multi-threaded environments on the JVM.

It has an API similar to `java.util.concurrent.atomic`, but it is much faster under multi-threaded contention: you can just update your imports and enjoy it.

Each time a CAS loop fails to update the value, it suspends the current thread for the smallest possible time, to avoid pointless spinning while another thread may release the value.
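To illustrate the drop-in claim, here is a minimal sketch of a shared counter. Note that the package name `ky.korins.atomic` is my assumption based on the Maven coordinates, not something this README confirms; the sketch imports the JDK class so it compiles anywhere, and only the import would change.

```java
// Hypothetical drop-in usage: only the import should change relative to the JDK.
// import ky.korins.atomic.AtomicLong;  // assumed package name, check the jar
import java.util.concurrent.atomic.AtomicLong; // JDK class, so the sketch compiles anywhere

public class CounterDemo {
    // Increment a shared counter from several threads; with either
    // implementation the final value is threads * increments.
    static long countWith(AtomicLong counter, int threads, int increments) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < increments; j++) counter.incrementAndGet();
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(countWith(new AtomicLong(), 8, 100_000)); // prints 800000
    }
}
```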
## How to use it?
It is in Maven Central and can be added to a Maven project as follows:

```xml
<dependency>
    <groupId>ky.korins</groupId>
    <artifactId>atomic</artifactId>
    <version>1.1</version>
</dependency>
```
## How fast is it?
It is on par with `j.u.c.atomic` for a single thread, and it has about 2x lower latency at higher concurrency.
Full benchmark:
| Benchmark    | Threads | p0.00 | p0.50 | p0.90 | p0.95 | p0.99 | p0.999 | p0.9999 | p1.00   |
|--------------|---------|-------|-------|-------|-------|-------|--------|---------|---------|
| J.u.c.atomic | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 138     | 479     |
| Korinsky     | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 135     | 560     |
| J.u.c.atomic | 2       | 8     | 74    | 108   | 113   | 123   | 206    | 234     | 1716    |
| Korinsky     | 2       | 8     | 14    | 64    | 92    | 164   | 300    | 536     | 2068    |
| J.u.c.atomic | 4       | 9     | 188   | 579   | 721   | 973   | 1292   | 1476    | 20576   |
| Korinsky     | 4       | 9     | 137   | 340   | 490   | 1036  | 2368   | 4136    | 10384   |
| J.u.c.atomic | 8       | 9     | 550   | 868   | 1294  | 2976  | 5472   | 8095    | 131840  |
| Korinsky     | 8       | 9     | 402   | 919   | 1140  | 1612  | 2168   | 2596    | 130432  |
| J.u.c.atomic | 16      | 9     | 767   | 921   | 1224  | 4424  | 130816 | 220672  | 1161216 |
| Korinsky     | 16      | 9     | 772   | 1170  | 1334  | 1706  | 130816 | 220672  | 1290240 |
| J.u.c.atomic | 32      | 9     | 773   | 924   | 1202  | 4824  | 330752 | 893952  | 3313664 |
| Korinsky     | 32      | 9     | 782   | 1190  | 1360  | 1834  | 330240 | 840704  | 2838528 |
| J.u.c.atomic | 64      | 9     | 772   | 926   | 1146  | 4848  | 700416 | 1812480 | 7282688 |
| Korinsky     | 64      | 9     | 785   | 1204  | 1384  | 1952  | 660480 | 1730560 | 6684672 |
Measured on an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz with:

```
openjdk version "9.0.1"
OpenJDK Runtime Environment (build 9.0.1+11-Debian-1)
OpenJDK 64-Bit Server VM (build 9.0.1+11-Debian-1, mixed mode)
```
## Which operations does it support?
Right now it supports `AtomicLong`, `AtomicInteger`, `AtomicBoolean`, `AtomicLongArray` and `AtomicIntegerArray`, with a Java 8-compatible API.
## Why is `j.u.c.atomic`'s increment faster without concurrent threads?

This library implements all operations, including `getAndAdd*` and `addAndGet*`, over a CAS loop. `j.u.c.atomic`'s `getAndAdd*` and `addAndGet*` use the `lock addq` CPU instruction, or just `i++` on a volatile variable, which is much faster without concurrent threads.
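A CAS-loop version of `getAndAdd` can be sketched as follows; this is the general technique described above, not the library's actual source:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CasLoopAdd {
    // getAndAdd implemented over a CAS loop: read the current value, compute
    // the new one, and retry if another thread changed it in between.
    // j.u.c.atomic instead compiles this to a single atomic add instruction.
    static long getAndAddViaCas(AtomicLong value, long delta) {
        while (true) {
            long current = value.get();
            long next = current + delta;
            if (value.compareAndSet(current, next)) {
                return current; // getAndAdd returns the previous value
            }
            // a contention-aware implementation would back off here
        }
    }

    public static void main(String[] args) {
        AtomicLong v = new AtomicLong(40);
        System.out.println(getAndAddViaCas(v, 2)); // prints 40
        System.out.println(v.get());               // prints 42
    }
}
```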
## Why does it add an unspecified backoff timeout?

The JVM has no way to sleep for a very small, precisely specified period. It offers a few different ways to pause a thread:

- `Thread.onSpinWait()` (since Java 9)
- `Thread.yield()`
- `Thread.sleep(long millis)` or `Thread.sleep(long millis, int nanos)`
- `Object.wait(long timeout)` or `Object.wait(long timeout, int nanos)`
- `LockSupport.parkNanos(long nanos)`
- `TimeUnit.sleep(x)`
If we check the JDK source code, we will see that:

- `Object.wait(0, 1)` is `Object.wait(1)`
- `Thread.sleep(0, 1)` is `Thread.sleep(0)` or `Thread.sleep(1)`, depending on the number of nanoseconds
- `TimeUnit.sleep(x)` is `Thread.sleep(ms, ns)`
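The rounding behind those equivalences can be mirrored as pure functions; the conditions below follow the JDK's own implementations of `Thread.sleep(long, int)` and `Object.wait(long, int)` (as found in recent JDKs; older versions may differ slightly):

```java
public class SleepRounding {
    // Mirrors JDK Thread.sleep(long millis, int nanos): the nanosecond
    // argument never sleeps sub-millisecond, it only ever bumps millis by one.
    static long effectiveMillis(long millis, int nanos) {
        if (nanos >= 500_000 || (nanos != 0 && millis == 0)) {
            millis++;
        }
        return millis;
    }

    // Mirrors JDK Object.wait(long timeout, int nanos): any non-zero nanos
    // just adds one whole millisecond to the timeout.
    static long effectiveWaitMillis(long timeout, int nanos) {
        if (nanos > 0) {
            timeout++;
        }
        return timeout;
    }

    public static void main(String[] args) {
        System.out.println(effectiveMillis(0, 1));       // prints 1: sleep(0, 1) is sleep(1)
        System.out.println(effectiveMillis(5, 499_999)); // prints 5: nanos dropped
        System.out.println(effectiveWaitMillis(0, 1));   // prints 1: wait(0, 1) is wait(1)
    }
}
```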
And a simple benchmark confirms it; on my laptop with macOS I get:
```
Benchmark                    Mode  Cnt        Score      Error  Units
LockSupport.parkNanos(1)     avgt   20    10553.701 ±  309.104  ns/op
LockSupport.parkNanos(10)    avgt   20    13006.281 ±   73.393  ns/op
LockSupport.parkNanos(100)   avgt   20    12976.135 ±  139.048  ns/op
LockSupport.parkNanos(1000)  avgt   20     5795.405 ±  132.921  ns/op
LockSupport.parkNanos(10000) avgt   20     6127.251 ±  203.732  ns/op
Object.wait(0, 1)            avgt   20  1351918.387 ± 7285.491  ns/op
Thread.onSpinWait()          avgt   20        3.271 ±    0.047  ns/op
Thread.sleep(0)              avgt   20      258.652 ±    3.411  ns/op
Thread.yield()               avgt   20      225.066 ±    3.808  ns/op
```
while on Debian sid I get:
```
Benchmark                    Mode  Cnt        Score      Error  Units
LockSupport.parkNanos(1)     avgt   20      288.112 ±    1.606  ns/op
LockSupport.parkNanos(10)    avgt   20      285.572 ±    0.784  ns/op
LockSupport.parkNanos(100)   avgt   20     4674.412 ±  237.567  ns/op
LockSupport.parkNanos(1000)  avgt   20    55064.592 ±  245.676  ns/op
LockSupport.parkNanos(10000) avgt   20    54860.746 ±  593.290  ns/op
Object.wait(0, 1)            avgt   20  1120201.743 ± 2152.094  ns/op
Thread.onSpinWait()          avgt   20        2.496 ±    0.003  ns/op
Thread.sleep(0)              avgt   20      183.340 ±    2.365  ns/op
Thread.yield()               avgt   20      170.595 ±    2.303  ns/op
```
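You can get a rough feel for this unpredictability yourself with plain `System.nanoTime()`, though a serious measurement should use JMH as in the tables above:

```java
import java.util.concurrent.locks.LockSupport;

public class ParkCost {
    // Rough, unscientific measurement of how long LockSupport.parkNanos(n)
    // actually blocks: time each call and average over a number of rounds.
    static long averageParkNanos(long requestedNanos, int rounds) {
        long total = 0;
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(requestedNanos);
            total += System.nanoTime() - start;
        }
        return total / rounds;
    }

    public static void main(String[] args) {
        // On most platforms the measured time is far above the single
        // nanosecond that was requested.
        System.out.println(averageParkNanos(1, 1000) + " ns/op");
    }
}
```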
So, the JVM really has no way to make a small specified timeout, and:

- `Object.wait(0, 1)` takes about 1 ms
- `LockSupport.parkNanos(long nanos)` is absolutely unpredictable
- `Thread.sleep(0)` and `Thread.yield()` are very close (~13% apart), but platform-dependent
- `Thread.onSpinWait()` is faster, but platform-dependent and only available since Java 9
`Thread.onSpinWait()` is compiled to the `pause` CPU instruction if the target system supports it (both the CPU and the JVM), and to nothing when it doesn't.

Honestly, we don't need just `pause`: the point of this low-latency atomic is not to burn CPU cycles in a CAS loop while other threads may be finishing the same CAS loop. Anyway, this code uses `LockSupport.parkNanos(1)`, because tests have shown that it is the best option.
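Putting the two ideas together, the backoff described above can be sketched like this (a minimal illustration of the technique, not the library's actual code):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

public class BackoffCas {
    // CAS loop with backoff: when the CAS fails because another thread won
    // the race, park for the shortest practical time instead of spinning
    // and burning CPU cycles while the winner finishes its update.
    static long addAndGetWithBackoff(AtomicLong value, long delta) {
        while (true) {
            long current = value.get();
            long next = current + delta;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            LockSupport.parkNanos(1); // give the contending thread a chance to finish
        }
    }

    // Hammer the counter from several threads to exercise the contended path.
    static long demo(int threads, int perThread) {
        AtomicLong v = new AtomicLong();
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) addAndGetWithBackoff(v, 1);
            });
            ts[i].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return v.get();
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 10_000)); // prints 40000
    }
}
```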
Very short summary (only p0.9999):
Full data:
| Benchmark                                    | Threads | p0.00 | p0.50 | p0.90 | p0.95 | p0.99 | p0.999 | p0.9999 | p1.00 |
|----------------------------------------------|---------|-------|-------|-------|-------|-------|--------|---------|-------|
| LockSupport.parkNanos(1)                     | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 135     | 560   |
| LockSupport.parkNanos(10)                    | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 137     | 533   |
| LockSupport.parkNanos(100)                   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 140     | 458   |
| LockSupport.parkNanos(1000)                  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 137     | 1560  |
| LockSupport.parkNanos(10000)                 | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 136     | 473   |
| LockSupport.parkNanos(25)                    | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 139     | 460   |
| LockSupport.parkNanos(250)                   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 130     | 1542  |
| LockSupport.parkNanos(2500)                  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 138     | 422   |
| LockSupport.parkNanos(5)                     | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 132     | 1394  |
| LockSupport.parkNanos(50)                    | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 138     | 462   |
| LockSupport.parkNanos(500)                   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 124     | 551   |
| LockSupport.parkNanos(5000)                  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 136     | 536   |
| Plain()                                      | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 138     | 479   |
| Thread.onSpinWait_LockSupport_parkNanos(1)   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 133     | 1386  |
| Thread.onSpinWait_LockSupport_parkNanos(128) | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 136     | 5624  |
| Thread.onSpinWait_LockSupport_parkNanos(16)  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 132     | 1390  |
| Thread.onSpinWait_LockSupport_parkNanos(2)   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 143     | 6544  |
| Thread.onSpinWait_LockSupport_parkNanos(32)  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 135     | 462   |
| Thread.onSpinWait_LockSupport_parkNanos(4)   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 133     | 489   |
| Thread.onSpinWait_LockSupport_parkNanos(64)  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 135     | 327   |
| Thread.onSpinWait_LockSupport_parkNanos(8)   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 137     | 485   |
| Thread.onSpinWait_yield(1)                   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 133     | 439   |
| Thread.onSpinWait_yield(128)                 | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 135     | 1544  |
| Thread.onSpinWait_yield(16)                  | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 136     | 670   |
| Thread.onSpinWait_yield(2)                   | 1       | 8     | 8     | 8     | 8     | 9     | 18     | 136     | 489   |
| Thread.onSpinWait_yiel
