Swar

Low level, branch-free functions for number and string conversion and other utils

Generate Convert Improve

Install / Use

/learn @yb303/Swar

About this skill

Quality Score

0/100

README

SWAR

Low level, branch-free functions for number and string conversion and other utils.

SWAR stands for SIMD within a register. This means treating parts of a uint64_t as individual uint8_t's, uint16_t's, or uint32_t's.

This library is header-only. There is nothing to build.

Most functions have different variants optimized for limited lengths (8 and 4), and printable input (ascii < 128), and more

Language and build

I'm used to C++17 so used if constexpr, but the rest of the code is C++03 copatible, and can easily be converted to C. Include swar.h and build with -std=c++17 For forward declarations only, include swar_fwd.h instead.

Test and benchmark

The test dir includes:

swar_test.cpp that is using google-test for unit testing. (TODO create a build: passing badge)
swar_bench.cpp that produces the numbers for the graph below.

Performance

Comparison of various atoi implementations (actually a-to-ull) on my home machine: <img src="https://lh3.googleusercontent.com/KeHaQM5RM2_hS6Gf8MEtRlV4EgVwnJBqLnxMinczB67XUaR8wXyriYNrqY4ukYBE0aGNQ4TDA31f1MIP57j4r78zmeYk3OAqFySu2yfHQClDhxMnm86ACtUVI6EQSGX7aF0HgTYD3Cg=w2400"> Machine and env spec: i5-3470, Cygwin on Windows 10 64 bit, g++ 9.3.0

The not-perfectly-straight lines, in SWAR's performance, are just measurement artifacts. In swar8, for example, it's the same instructions executed for every input, so there should be no difference. You can see how SWAR is faster than the naive impl and is fixed cost per word. The stock implementation is surprisingly slow. I don't know why as I didn't read its code yet.

Functions with 8, or 4, suffix are branchless and faster (see swar8 vs swar). Functions with longer input must have a branch per word. An SSE implementation can follow the same ideas as here for longer inputs. However, using SSE instruction may switch some processors to a different P-state, if the BIOS allows, and the switching itself can take a few hundred cycles.

Branchless code is not always faster than branched code. Benchmarks are typically less impacted by branch miss-predictions, then real world applications. This applies also in my benchmark. I did not take special care to litter the BP caches before each function call as this would make each call harder to measure. Loops may be fully predicted, especially if BP caches are all working for the benchmark. However, in a real world app, using branchless low level code means that BP caches have more room for the application logic so the app as a whole may become faster. The only way to know for sure is to test within the app.

These performance characteristics are the same for strlen and similar functions.

Functions

All functions come in a few variants:

memchr and memrchr
strlen
atoi, htoi (hex string to int), atod
itoa
hasbyte - does word include a certain byte?

Supported operating systems

Linux
Cygwin

Supported programming languages

Supported compilers

Supported architectures

x86_64

Related Skills

node-connect

349.9k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

109.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

349.9k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

349.9k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。