Popcount

popcount algorithms in C and Rust

Generate Convert Improve

Install / Use

/learn @BartMassey/Popcount

About this skill

Quality Score

0/100

README

Popcount

Here's some implementations of bit population count in C and Rust, with benchmarks.

The writeup in this README is based on work from 2007 and later, as far as I can tell today. It consists of status, and then a chronological log including benchmark results.

Background

Everybody on the net does bit population count. I needed to do it for XCB. My criteria were:

Should be fast even on cruddy machines (so limit operators to integer ones and try to avoid multiply and especially divide).
Should be deterministically fast in the worst case.
Should be comprehensible / maintainable.

X has always used the venerable HAKMEM 169 algorithm, but this didn't seem very good for the above criteria. Keith Packard, who inserted the HAKMEM code, tells me he selected it mainly because it was the first reasonable thing he came across when he was looking for an algorithm.

I wrote this code to benchmark some tries at a replacement. The Wikipedia <a href="http://en.wikipedia.org/wiki/Hamming_weight">page</a> was a fairly good starting point.

Hacker's Delight was a useful reference. I'd lost my copy of the book, but I've now replaced it and I can say that there's nothing better in there than the stuff in here (as of the second edition), which is a pretty strong statement.

I think my code is also a reasonable tutorial for micro-benchmarking frameworks, illustrating some reasonable techniques. There's still some things I would fix, though: see below.

Build and Run

You can build the C code with both GCC and Clang and the Rust code with Cargo by typing make, but you should first examine the Makefile and set what you need to tune. Depending on your environment, you might just want to build specific binaries.

Once you've built popcount_gcc or whichever, run it with a number of iterations to bench: 100000 is good to start. It will spit out some explanatory numbers.

Advice

If you need predictable speed, use __builtin_popcount() (GCC, Clang) or count_ones() (Rust). Your machine might have a popcount instruction (most modern ones finally do), and this will use it if present. C code will be portable only to GCC-compatible C compilers. Rust code will currently only get a speedup on x86 when compiled with --cpu-type native. All compilers bench reasonably fast on my box without a popcount instruction available; looks like they are using popcount_2 or similar, but generating slightly better code. (I haven't investigated what they are actually doing.)
If you want true machine-independence and don't care too much about speed, use the famous popcount_2. It's fast enough, familiar and completely portable.
Use popcount_tabular_8 if you're willing to burn some cache (not much) and want a slight but portable speedup over popcount_2. Also if you're on some machine without a barrel shifter, this can be adapted to not do any shifts without too much work at the expense of hitting memory. (Do byte reads from the input variable.)

Future Improvements

The op is called in a tight loop, so self-pipelining is in play. This isn't very realistic. However, it's hard to build a reasonable set of calling environments, and hard to calibrate out their effects.
I only report elapsed time. This is probably OK for this application, but it would be easy to add user and system time reporting.
The user gives the number of iterations of a fixed-size block. I should do the x11perf thing and have the user supply a desired number of seconds instead, and use the preheat loop to calibrate.

For now, I have a set of manually-chosen loop divisors to speed up the slower algorithms. All algorithms take about the same amount of time per "iteration" specified on my machine, although probably not on yours.

Talk

I have included slides from a PDXByte talk I gave in March 2014. Enjoy.

License

This work is made available under the "MIT License". Please see the file LICENSE in this distribution for license terms.

Notes and Results

This is a log of work on the project. Most recent numbers, current entry, then older entries in chronological order.

Benchmark Results

2024-06-27

Got some more self-measured results on a non-x86 machine. Note that the checksum differs from the x86 one, so that will need to be investigated.

The problems with preventing optimization of popcount_mult into a popcount instruction continue. Need a better plan here.

Performance on desktop with Ryzen-9 3900X 4.6GHz processor. GCC 13.2.0-25, Clang 16.0.6, rustc 1.79.0, Debian Bookwork/Trixie/sid. <blockquote>

popcount_naive: 3.125e+07 iters in 356 msecs for 11.39 nsecs/iter
popcount_8: 1.25e+08 iters in 468 msecs for 3.74 nsecs/iter
popcount_6: 1.25e+08 iters in 492 msecs for 3.94 nsecs/iter
popcount_hakmem: 1.25e+08 iters in 612 msecs for 4.90 nsecs/iter
popcount_keane: 1.25e+08 iters in 637 msecs for 5.10 nsecs/iter
popcount_anderson: 8.3333e+07 iters in 425 msecs for 5.10 nsecs/iter
popcount_3: 1.25e+08 iters in 453 msecs for 3.62 nsecs/iter
popcount_4: 1.25e+08 iters in 446 msecs for 3.57 nsecs/iter
popcount_2: 1.25e+08 iters in 463 msecs for 3.70 nsecs/iter
popcount_mult: 1.25e+08 iters in 437 msecs for 3.50 nsecs/iter
popcount_tabular_8: 1.25e+08 iters in 350 msecs for 2.80 nsecs/iter
popcount_tabular_16: 1.25e+08 iters in 553 msecs for 4.42 nsecs/iter
popcount_cc: 5e+08 iters in 352 msecs for 0.70 nsecs/iter
popcount_x86: 5e+08 iters in 353 msecs for 0.71 nsecs/iter

popcount_clang
popcount_naive: 3.125e+07 iters in 442 msecs for 14.14 nsecs/iter
popcount_8: 1.25e+08 iters in 639 msecs for 5.11 nsecs/iter
popcount_6: 1.25e+08 iters in 696 msecs for 5.57 nsecs/iter
popcount_hakmem: 1.25e+08 iters in 530 msecs for 4.24 nsecs/iter
popcount_keane: 1.25e+08 iters in 588 msecs for 4.70 nsecs/iter
popcount_anderson: 8.3333e+07 iters in 391 msecs for 4.69 nsecs/iter
popcount_3: 1.25e+08 iters in 421 msecs for 3.37 nsecs/iter
popcount_4: 1.25e+08 iters in 431 msecs for 3.45 nsecs/iter
popcount_2: 1.25e+08 iters in 443 msecs for 3.54 nsecs/iter
popcount_mult: 1.25e+08 iters in 417 msecs for 3.34 nsecs/iter
popcount_tabular_8: 1.25e+08 iters in 308 msecs for 2.46 nsecs/iter
popcount_tabular_16: 1.25e+08 iters in 495 msecs for 3.96 nsecs/iter
popcount_cc: 5e+08 iters in 332 msecs for 0.66 nsecs/iter
popcount_x86: 5e+08 iters in 332 msecs for 0.66 nsecs/iter

popcount_rs
popcount_naive: 3.125e7 iters in 574 msecs for 18.36 nsecs/iter
popcount_8: 1.25e8 iters in 669 msecs for 5.35 nsecs/iter
popcount_6: 1.25e8 iters in 729 msecs for 5.83 nsecs/iter
popcount_hakmem: 1.25e8 iters in 561 msecs for 4.49 nsecs/iter
popcount_keane: 1.25e8 iters in 620 msecs for 4.96 nsecs/iter
popcount_anderson: 8.3333e7 iters in 356 msecs for 4.28 nsecs/iter
popcount_3: 1.25e8 iters in 446 msecs for 3.57 nsecs/iter
popcount_4: 1.25e8 iters in 448 msecs for 3.59 nsecs/iter
popcount_2: 1.25e8 iters in 469 msecs for 3.75 nsecs/iter
popcount_mult: 1.25e8 iters in 88 msecs for 0.70 nsecs/iter
popcount_tabular_8: 1.25e8 iters in 360 msecs for 2.88 nsecs/iter
popcount_tabular_16: 1.25e8 iters in 547 msecs for 4.38 nsecs/iter
popcount_rs: 5e8 iters in 352 msecs for 0.70 nsecs/iter
popcount_x86: 5e8 iters in 352 msecs for 0.70 nsecs/iter

</blockquote>

Performance on Orange Pi Zero 3 1GB with Allwinner H618 1GHz processor. GCC 13.3.0-1, Clang 16.0.6, rustc 1.79.0, Debian Bookworm/Trixie. <blockquote>

popcount_gcc popcount_naive: 3.125e+07 iters in 1680 msecs for 53.76 nsecs/iter popcount_8: 1.25e+08 iters in 1574 msecs for 12.59 nsecs/iter popcount_6: 1.25e+08 iters in 1492 msecs for 11.94 nsecs/iter popcount_hakmem: 1.25e+08 iters in 1989 msecs for 15.91 nsecs/iter popcount_keane: 1.25e+08 iters in 1823 msecs for 14.58 nsecs/iter popcount_anderson: 8.3333e+07 iters in 1215 msecs for 14.58 nsecs/iter popcount_3: 1.25e+08 iters in 1574 msecs for 12.59 nsecs/iter popcount_4: 1.25e+08 iters in 1491 msecs for 11.93 nsecs/iter popcount_2: 1.25e+08 iters in 1574 msecs for 12.59 nsecs/iter popcount_mult: 1.25e+08 iters in 1491 msecs for 11.93 nsecs/iter popcount_tabular_8: 1.25e+08 iters in 1160 msecs for 9.28 nsecs/iter popcount_tabular_16: 1.25e+08 iters in 2589 msecs for 20.71 nsecs/iter popcount_cc: 5e+08 iters in 3647 msecs for 7.29 nsecs/iter

popcount_clang popcount_naive: 3.125e+07 iters in 2936 msecs for 93.95 nsecs/iter popcount_8: 1.25e+08 iters in 1822 msecs for 14.58 nsecs/iter popcount_6: 1.25e+08 iters in 1905 msecs for 15.24 nsecs/iter popcount_hakmem: 1.25e+08 iters in 1905 msecs for 15.24 nsecs/iter popcount_keane: 1.25e+08 iters in 1905 msecs for 15.24 nsecs/iter popcount_anderson: 8.3333e+07 iters in 1270 msecs for 15.24 nsecs/iter popcount_3: 1.25e+08 iters in 1491 msecs for 11.93 nsecs/iter popcount_4: 1.25e+08 iters in 1491 msecs for 11.93 nsecs/iter popcount_2: 1.25e+08 iters in 1574 msecs for 12.59 nsecs/iter popcount_mult: 1.25e+08 iters in 1491 msecs for 11.93 nsecs/iter popcount_tabular_8: 1.25e+08 iters in 1077 msecs for 8.62 nsecs/iter popcount_tabular_16: 1.25e+08 iters in 2505 msecs for 20.04 nsecs/iter popcount_cc: 5e+08 iters in 3647 msecs for 7.29 nsecs/iter

popcount_rs popcount_naive: 3.125e7 iters in 2411 msecs for 77.14 nsecs/iter popcount_8: 1.25e8 iters in 1822 msecs for 14.58 nsecs/iter popcount_6: 1.25e8 iters in 1905 msecs for 15.24 nsecs/iter popcount_hakmem: 1.25e8 iters in 1789 msecs for 14.31 nsecs/iter popcount_keane: 1.25e8 iters in 1839 msecs for 14.71 nsecs/iter popcount_anders

Related Skills

node-connect

344.1k

Diagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps

frontend-design

96.8k

Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.

openai-whisper-api

344.1k

Transcribe audio via OpenAI Audio Transcriptions API (Whisper).

qqbot-media

344.1k

QQBot 富媒体收发能力。使用 <qqmedia> 标签，系统根据文件扩展名自动识别类型（图片/语音/视频/文件）。