<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
<title>nedalloc Readme</title>
<style type="text/css">
<!--
body {
text-align: justify;
}
h1, h2, h3, h4, h5, h6 {
margin-bottom: -0.5em;
}
h1 {
text-align: center;
}
h2 {
text-decoration: underline;
margin-bottom: -0.25em;
}
p {
margin-top: 0.5em;
margin-bottom: 0.5em;
}
ul li, ol li {
margin-top: 0.2em;
margin-bottom: 0.2em;
}
dl {
margin-left: 2em;
}
dl dt {
font-weight: bold;
}
dt + dd {
margin-bottom: 1em;
}
.gitcommit {
font-family: "Courier New", Courier, monospace;
font-size: smaller;
}
-->
</style>
</head>
<body>
<div style="text-align: center">
<h1 style="text-decoration: underline">nedalloc v1.10 beta 4 (?)</h1>
<h2 style="text-decoration: none;">by Niall Douglas</h2>
<p>Web site: <a href="http://www.nedprod.com/programs/portable/nedmalloc/">http://www.nedprod.com/programs/portable/nedmalloc/</a></p>
<p>Trunk build status: <a href="https://travis-ci.org/ned14/nedmalloc"><img style="vertical-align:middle;border:none" src="https://travis-ci.org/ned14/nedmalloc.png?branch=master"/></a></p>
<hr /></div>
<p>Enclosed is nedalloc, an alternative malloc implementation for multiple threads
without lock contention based on <a href="http://g.oswego.edu/" target="_blank">
dlmalloc</a> v2.8.4 and a specialised user mode page allocator (Windows Vista or
later only). It has the following features:</p>
<ol>
<li>A per-thread small block cache for maximum CPU scalability.</li>
<li>A per-thread arena to minimise lock contention.</li>
<li>The ability to patch Windows binaries to replace the C memory allocation
API (malloc(), realloc(), free() et al.), so that simply inserting nedmalloc.dll
into a process yields performance improvements without recompilation.</li>
<li>On POSIX, it knows how to talk to valgrind so you can track memory
corruption and/or memory leaks.</li>
<li>A unique user mode page allocator implementation which delivers O(1) scaling
for blocks of any size, including a very fast O(1) realloc(). It improves medium
sized block (~1 MB) allocation speeds by about 25 times on current hardware.
It requires Windows Vista or later, and Administrator privileges together with
either UAC disabled or a UAC prompt at the start of each program run.</li>
<li>A malloc v2 API which enables considerable improvements in efficiency by
allowing client code to better inform the allocator on what (not) to do.</li>
<li>An enhanced C++ STL allocator implementation to enable super-fast std::vector<>
<strong>[unfinished]</strong></li>
</ol>
<p>It is licensed under the
<a href="http://www.boost.org/LICENSE_1_0.txt" target="_blank">Boost Software License</a>
which basically means you can do anything you like with it. This does not apply
to the malloc.c.h file which remains copyright to others. Commercial support is
available from <a href="http://www.nedproductions.biz/" target="_blank">ned Productions
Limited</a>.</p>
<p>It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) and
Apple Mac OS X (x86). It works very well on all of these and is very significantly
faster than the system allocator on Windows XP and on FreeBSD versions before v7.
If you are using Apple Mac OS X 10.6 or later, or Windows 7 or later, then you
probably won't see much improvement without modifying your source to use the v2
malloc API (and kudos to Apple and Microsoft for adopting excellent allocators).</p>
<p>The user mode page allocator returns jaw dropping real world performance improvements
but requires running the process as the superuser. Without it, nedalloc still offers
sizeable gains on all older operating systems and, through the v2 malloc API, modest
gains on all very recent operating systems, especially in these situations:</p>
<ol>
<li>If you are repeatedly extending large vector arrays, you will see a LARGE
improvement if you use the address space reservation features.</li>
<li>If you do a lot of work with 16 byte aligned vectors e.g. SSE or AVX vector
arrays, you will find the v2 malloc API a godsend.</li>
</ol>
<p style="text-decoration: underline"><strong>Table of Contents: </strong></p>
<ol style="list-style-type: upper-alpha; position: relative; margin-top: -0.5em;">
<li><a href="#touse">How to use</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
<li>A1. <a href="#CPPAPI">The C++ API</a></li>
<li>A2. <a href="#v2mallocAPI">The v2 malloc C API</a></li>
</ul>
</li>
<li><a href="#notes">Notes</a><ul style="list-style-type: none; margin-left: 0; padding-left: 0">
<li>B1. <a href="#memorybloat">Memory Bloating</a></li>
<li>B2. <a href="#memoryleaks">Memory Leakage</a></li>
<li>B3. <a href="#threadcache">The Threadcache</a></li>
<li>B4. <a href="#largepages">Large Page support</a></li>
<li>B5. <a href="#logger">Memory operation logging</a></li>
<li>B6. <a href="#windowsonly">Windows-only features</a></li>
</ul>
</li>
<li><a href="#speedcomparisons">Speed Comparisons</a></li>
<li><a href="#troubleshooting">Troubleshooting</a></li>
<li><a href="#changelog">Changelog</a></li>
</ol>
<h2><a name="touse">A. To use:</a></h2>
<p>The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your
project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your
normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if
you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache()
for every pool you use on thread exit, and don't forget neddisablethreadcache(0)
for the system pool if necessary. Run and enjoy!</p>
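<p>The drop-in usage described above can be sketched roughly as follows. This is
an illustration only: <code>USE_NEDMALLOC</code>, <code>make_greeting()</code> and
<code>on_thread_exit()</code> are hypothetical names invented for this sketch, and
when <code>USE_NEDMALLOC</code> is not defined the nedalloc names are simply mapped
onto the C library so that the sketch builds without nedmalloc.h present.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: USE_NEDMALLOC is a hypothetical build switch for this sketch.
   When it is defined, the real nedalloc functions are used; otherwise the
   same names are mapped onto the C library so the sketch builds anywhere. */
#ifdef USE_NEDMALLOC
#include "nedmalloc.h"
#else
#define nedmalloc(size)             malloc(size)
#define nedfree(mem)                free(mem)
#define neddisablethreadcache(pool) ((void)(pool))
#endif

/* Hypothetical helper: builds "hello, <name>" in heap memory obtained from
   nedmalloc(). The caller releases the result with nedfree(). */
char *make_greeting(const char *name)
{
    static const char prefix[] = "hello, ";
    size_t prefix_len = sizeof(prefix) - 1;
    size_t name_len = strlen(name);
    char *buf = nedmalloc(prefix_len + name_len + 1);
    if (!buf)
        return NULL;
    memcpy(buf, prefix, prefix_len);
    memcpy(buf + prefix_len, name, name_len + 1); /* copies the terminator too */
    return buf;
}

/* Hypothetical helper: call at the end of every thread that allocated from
   the system pool, as recommended above. */
void on_thread_exit(void)
{
    neddisablethreadcache(0);
}
```

<p>A caller would typically pair each <code>make_greeting()</code> with a
<code>nedfree()</code> and call <code>on_thread_exit()</code> just before the
thread terminates.</p>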
<p>To test, compile <a href="test.c">test.c</a> (C) and <a href="test.cpp">test.cpp</a>
(C++). Both will run a comparison between your system allocator and nedalloc and
tell you how much faster nedalloc is. They also serve as examples of usage.</p>
<p>If you'd like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest
thing to do is to use <a href="http://www.scons.org/" target="_blank">scons</a>,
which comes with a myriad of build options listed using scons -h. <b>If you want
to build some MSVC project files for use with Microsoft Visual Studio</b> then follow
these steps:</p>
<ol>
<li>Install <a href="http://www.python.org/" target="_blank">python</a>.</li>
<li>Install <a href="http://www.scons.org/" target="_blank">scons</a>.</li>
<li>Open a Visual Studio Command Prompt for the Visual Studio you wish to use via
Start Menu => Programs => Microsoft Visual Studio XXXX => Visual Studio Tools =>
Visual Studio XXXX Command Prompt.</li>
<li>Change directory to the nedmalloc directory (e.g. by dragging in its folder).</li>
<li>Type "!MakeMSVCProjs" and hit Return.</li>
</ol>
<p>Note that for Visual Studio 2008 and later support you need scons v2.1 or later.</p>
<p>nedalloc comes with two new memory allocator APIs: one is for C++, and the other
is for C. <strong>Full documentation</strong> for all of nedalloc's APIs and features
is provided in the enclosed <a href="nedalloc.chm">nedalloc.chm</a>, which is in
Microsoft HTML Help format (viewers for this format are also available on Linux and
Apple Mac OS X). If you don't want to use the CHM documentation, <a href="nedmalloc.h">nedmalloc.h</a>
is extensively commented with <a href="http://www.doxygen.org/" target="_blank">
doxygen markup</a>.</p>
<h3><a name="CPPAPI">A1: The C++ API:</a></h3>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>,
a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc
features to remedy many of the long standing problems and inefficiencies caused
by C++'s traditional over-fondness for copying things. While its implementation
is complex, usage is extremely easy - simply supply nedallocator<> as the custom
allocator to STL container classes.</p>
<p>As nedmalloc can do even better for vector extension, nedmalloc.h also contains
a nedvector<> implementation which is the standard STL vector<> implementation except
that it makes use of the non-relocating facilities of realloc2() (see below). This
means nedvector<> does not need to overallocate memory (most STL vector<> implementations
overallocate by 50% or more), which saves a lot of memory, and it <strong>completely
avoids the array copy construction</strong> which makes std::vector<>::resize() so
very, very slow.</p>
<p>Even without nedalloc's major speed improvements as a simple C style allocator,
the improvements to the C++ memory infrastructure alone can generate huge performance
gains.</p>
<h3><a name="v2mallocAPI">A2: The v2 malloc C API:</a></h3>
<p><strong>[Note: This API will be completely replaced in v1.2]</strong></p>
<p>For the v1.10 release which was generously sponsored by
<a href="http://www.ara.com/" target="_blank">Applied Research Associates (USA)</a>,
a new general purpose allocator API was designed which is intended to remedy many
of the long standing problems and inefficiencies introduced by the ISO C allocator
API. Internally nedalloc's implementations of nedmalloc(), nedcalloc(), nedmemalign()
and nedrealloc() all call into this API:</p>
<ul>
<li><code>void* malloc2(size_t bytes, size_t alignment, unsigned flags)</code></li>
<li><code>void* realloc2(void* mem, size_t bytes, size_t alignment, unsigned
flags)</code></li>
</ul>
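<p>To illustrate the calling convention of the signatures listed above, here is a
minimal sketch, assuming a C11 library that provides <code>aligned_alloc()</code>.
<code>demo_malloc2()</code> and <code>is_aligned()</code> are hypothetical stand-ins
written for this example; they are not nedalloc's implementation, and the flags
parameter is ignored because its values are not described in this section.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in with the same parameter list as malloc2() above.
   It is NOT nedalloc's implementation: it forwards to the C11 library and
   ignores flags, purely to show how bytes and alignment are passed. */
void *demo_malloc2(size_t bytes, size_t alignment, unsigned flags)
{
    (void)flags; /* flag values are not covered by this sketch */
    if (alignment == 0)
        return malloc(bytes); /* zero alignment: behave like plain malloc() */
    if (bytes > SIZE_MAX - (alignment - 1))
        return NULL; /* rounding up below would overflow */
    /* aligned_alloc() expects the size to be a multiple of the alignment */
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}

/* Hypothetical helper: returns 1 if ptr lies on an alignment-byte boundary. */
int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

<p>A request for a 16 byte aligned array of 100 floats would then read
<code>demo_malloc2(100 * sizeof(float), 16, 0)</code>, and memory obtained from
this stand-in is released with the C library's <code>free()</code>.</p>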
<p>If nedmalloc.h is being included by C++ code, the alignment and flags parameters
default to zero which makes the new API identical to the old API (roll on the introduction
of defaul