SIMDVectorizationOpt
Blink`团队 (https://github.com/WANG-lp and https://github.com/CheYulin ), Vectorization for Tokenizing and Base64 Encoding/Decoding
Install / Use
/learn @RapidsBlink/SIMDVectorizationOptREADME
Disclaimer
此项目仅供学习与交流使用,请遵循MIT协议,如果您在任何项目中使用相关代码,请保留此项目的LICENSE文件。
SIMD Vectorization Demo
This demo is made by Wang Lipeng and CHE Yulin from Hong Kong University of Science and Technology. Feel free to give this repo a star if you like it.
intel intrinsics guide: see https://software.intel.com/sites/landingpage/IntrinsicsGuide/
bechmarking CPU info: see src/playground, using single-thread
Tokenizing
/ | type --- | --- input | form body output | dubbo package
algorithms | elapsed time --- | --- serial transform time | 380 ms sse4 transform time | 240 ms avx2 transform time | 90 ms
- workaround for
__tzcnt_u32(requiring bmi) in sse4 (handle 16bits currently, else modify zero if-sentence)
inline int __tzcnt_u32_using_popcnt_cmpeq(unsigned int x) {
int ret_cnt = 0;
int half_bits = x & (0xff);
if (half_bits == 0) {
half_bits = (x >> 8) & (0xff);
ret_cnt = 8;
}
__m128i pivot_u = _mm_set1_epi16(half_bits);
__m128i inspected_ele = _mm_set_epi16(
0xff, 0x7f, 0x3f, 0x1f, 0xf, 0x7, 0x3, 0x1);
__m128i trunc_pivot_u = _mm_and_si128(pivot_u, inspected_ele);
__m128i pivot_new = _mm_set1_epi16(0);
__m128i cmp_res = _mm_cmpeq_epi16(trunc_pivot_u, pivot_new);
int mask = _mm_movemask_epi8(cmp_res); // 16 bits
ret_cnt += _mm_popcnt_u32(mask) >> 1;
return ret_cnt;
}
- idea: advance with simd cmp and tzcnt
// 1st: advance fo find first '='
while (true) {
__m128i pivot_u = _mm_set1_epi8('=');
__m128i inspected_ele = _mm_loadu_si128((__m128i *) (str + i));
__m128i cmp_res = _mm_cmpeq_epi8(pivot_u, inspected_ele);
int mask = _mm_movemask_epi8(cmp_res); // 16 bits
int advance = (mask == 0 ?
16 : __tzcnt_u32_using_popcnt_cmpeq(mask));
i += advance;
if (advance < 16) { break; }
}
off[next] = i;
next++;
Base64 Decoding/Encoding (i.e, Serialization/Deserialization)
-
base64 encoding includes 1) unpacking and 2) ascii translation (to)
-
base64 decoding includes 1) ascii translation (from) and 2) packing
-
the translation table
base64 | ascii-dec | ascii-hex | offset (diff) --- | --- | --- | --- 0-25 | 65-90 | 0x41-0x5a | 65 26-51 | 97-122 | 0x61-0x7a | 71 52-61 | 48-57 | 0x30-0x39 | -4 62 | 43, i.e., '+' | 0x2b | -19 63 | 47, i.e., '/' | 0x2f | -16 others | invalid | invalid | invalid
Base64 Decoding Vectorization
ascii code to base64 translation vectorization (simplified)
// attention: each SIMD lane is 128-bits
const __m256i lut_roll = _mm256_setr_epi8(
0, 16, 19, 4, -65, -65, -71, -71,
0, 0, 0, 0, 0, 0, 0, 0,
0, 16, 19, 4, -65, -65, -71, -71,
0, 0, 0, 0, 0, 0, 0, 0
);
const __m256i mask_2F = _mm256_set1_epi8(0x2f);
// 1st: shift right for 4 32bit-wise-elements in each SIMD lane
// shift to get high 4bits
__m256i hi_nibbles = _mm256_srli_epi32(str, 4);
// 2nd: cmpeq to distinguish 0x2f ('/') from 0x2b ('+')
// when the byte is 0x2f, the result byte in eq_2F is 0xff (-1)
const __m256i eq_2F = _mm256_cmpeq_epi8(str, mask_2F);
// 3rd: add -1 when the byte is 0x2f,
// lookup lut_roll for offsets to add, shuffling in each SIMD Lane
// shuffling using the previous high 4-bits in the byte (see 1st step)
const __m256i roll = _mm256_shuffle_epi8(lut_roll,
_mm256_add_epi8(eq_2F, hi_nibbles));
// 4th: add offsets
str = _mm256_add_epi8(str, roll);
Benchmark
- implementations
benchmark main file: src/playground/test_serialization.cpp
algorithms | link --- | --- naive-comp | src/naive_base64/naive_base64.cpp naive-lookup | src/naive_base64/naive_base64.cpp chromium | src/fast_base64/chromiumbase64.c fast-avx2 | src/fast_base64/fastavxbase64.c
- measure: we report TPS (throughput including both serialization and deserialization time)
algorithms | idea --- | --- naive-comp | ascii-base64 translation using computations naive-lookup | ascii-base64 translation using two look-up tables chromium | use separate lookup tables for saving some computations fast-avx2 | apply avx2, and serial fallback using chromium
- message len 200
algorithms | TPS-total | TPS-serializtion | TPS-deserializtion --- | --- | --- | --- naive_comp | 201.448 MB/s | 467.150 MB/s | 354.179 MB/s naive_lookup | 490.444 MB/s | 1080.221 MB/s | 898.286 MB/s chromium | 493.397 MB/s | 1039.237 MB/s | 939.390 MB/s avx2_chromium | 887.260 MB/s | 1838.769 MB/s | 1714.608 MB/s
- message len 80
algorithms | TPS-total | TPS-serializtion | TPS-deserializtion --- | --- | --- | --- naive_comp | 149.235 MB/s | 288.609 MB/s | 309.029 MB/s naive_lookup | 394.224 MB/s | 740.423 MB/s | 843.135 MB/s chromium | 409.404 MB/s | 725.954 MB/s | 938.900 MB/s avx2_chromium | 556.984 MB/s | 974.215 MB/s | 1300.531 MB/s
- message len 50
algorithms | TPS-total | TPS-serializtion | TPS-deserializtion --- | --- | --- | --- naive_comp | 115.787 MB/s | 218.357 MB/s | 246.493 MB/s naive_lookup | 330.488 MB/s | 598.695 MB/s | 737.719 MB/s chromium | 335.035 MB/s | 595.675 MB/s | 765.698 MB/s avx2_chromium | 385.963 MB/s | 656.360 MB/s | 936.887 MB/s
Related Skills
node-connect
352.2kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
111.1kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
352.2kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
352.2kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
