NanoCSV, Faster C++11 multithreaded header-only CSV parser

NanoCSV is a faster, multithreaded, header-only CSV parser written in C++11 whose only dependency is the STL. It is designed for CSV data with numeric values.

Status

In development. Using NanoCSV in production is not recommended at the moment.

Requirements

A C++11 compiler (with thread support)

Usage


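#include <cstdlib>   // std::atoi, EXIT_SUCCESS, EXIT_FAILURE
#include <iostream>  // std::cout
#include <string>

// Define this in only **one** C++ file.
#define NANOCSV_IMPLEMENTATION
#include "nanocsv.h"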

int main(int argc, char **argv)
{
  if (argc < 2) {
    std::cout << "csv_parser_example input.csv (num_threads) (delimiter)\n";
  }

  std::string filename("./data/array-4-5.csv");
  int num_threads = -1; // -1 = use all system threads
  char delimiter = ' '; // delimiter character.

  if (argc > 1) {
    filename = argv[1];
  }

  if (argc > 2) {
    num_threads = std::atoi(argv[2]);
  }

  if (argc > 3) {
    delimiter = argv[3][0];
  }

  nanocsv::ParseOption<float> option;
  option.delimiter = delimiter;
  option.req_num_threads = num_threads;
  option.verbose = true; // verbose messages will be stored in `warn`.
  option.ignore_header = true; // Ignore the header (the first line); default = true.

  std::string warn;
  std::string err;

  nanocsv::CSV<float> csv;

  bool ret = nanocsv::ParseCSVFromFile(filename, option, &csv, &warn, &err);

  if (!warn.empty()) {
    std::cout << "WARN: " << warn << "\n";
  }


  if (!ret) {

    if (!err.empty()) {
      std::cout << "ERROR: " << err << "\n";
    }

    return EXIT_FAILURE;
  }

  std::cout << "num records(rows) = " << csv.num_records << "\n";
  std::cout << "num fields(columns) = " << csv.num_fields << "\n";

  // values are stored in a 1D array of length [num_records * num_fields]
  // std::cout << csv.values[4 * csv.num_fields + 3] << "\n";

  // header string is stored in `csv.header`
  if (!option.ignore_header) {
    for (size_t i = 0; i < csv.header.size(); i++) {
      std::cout << csv.header[i] << "\n";
    }
  }


  return EXIT_SUCCESS;
}
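
The flat `values` array mentioned in the example can be addressed with a small helper. The sketch below assumes row-major (record-major) order, which the commented-out expression `4 * num_fields + 3` suggests but the text does not state outright. `FlatTable` and `At` are hypothetical names used for illustration and are not part of nanocsv.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a parsed result: a flat array of
// num_records * num_fields values, assumed to be in row-major order.
struct FlatTable {
  std::size_t num_records;    // rows
  std::size_t num_fields;     // columns
  std::vector<float> values;  // length = num_records * num_fields
};

// Returns the value at (record, field) under the row-major assumption.
// No bounds checking is done; the caller must pass valid indices.
inline float At(const FlatTable &t, std::size_t record, std::size_t field) {
  return t.values[record * t.num_fields + field];
}
```

With this layout, the element in the fifth record and fourth field is `At(t, 4, 3)`, which matches the index `4 * num_fields + 3` used in the example.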

NaN, Inf

nanocsv supports parsing

nan, -nan as NaN, -NaN
inf, -inf as Inf, -Inf
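
To illustrate these spellings, the standard `std::strtof` function accepts the same tokens. The sketch below is independent of how nanocsv parses numbers internally; `FieldKind` and `ClassifyField` are hypothetical names introduced only for this example.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

enum class FieldKind { Finite, NaN, PosInf, NegInf, Invalid };

// Classifies a single field using std::strtof.
// A field counts as Invalid when nothing could be converted or when
// trailing characters remain after the number.
inline FieldKind ClassifyField(const std::string &s) {
  char *end = nullptr;
  const float v = std::strtof(s.c_str(), &end);
  if (end == s.c_str() || *end != '\0') return FieldKind::Invalid;
  if (std::isnan(v)) return FieldKind::NaN;
  if (std::isinf(v)) return v > 0.0f ? FieldKind::PosInf : FieldKind::NegInf;
  return FieldKind::Finite;
}
```

The sketch does not distinguish `nan` from `-nan`, because the sign carried by a NaN produced by `std::strtof` can vary between implementations.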

Support for N/A and null value

By default, missing values (e.g. N/A, including invalid numeric strings, and NaN) are replaced by nan, and null (empty) values (e.g. "") are also replaced by nan.

You can control this behavior with the following parameters in ParseOption.

replace_na : Replace N/A, NaN value?
- na_value : The value to be replaced for N/A, NaN value
replace_null : Replace null(empty) value?
- null_value : The value to be replaced for null value
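
The replacement rules can be sketched for a single field as follows. This is not nanocsv's code: `SketchOption` only mirrors the four parameter names listed above, their types (`bool` and `float`) are assumptions, and `ConvertField` is a hypothetical helper. What nanocsv does when replacement is disabled is not described here, so the sketch simply reports failure in that case.

```cpp
#include <cmath>
#include <cstdlib>
#include <string>

// Hypothetical mirror of the four ParseOption members listed above.
// The member types are assumptions made for this sketch.
struct SketchOption {
  bool replace_na;
  float na_value;
  bool replace_null;
  float null_value;
};

// Converts one field to float and applies the replacement rules.
// Returns false when the field is null or N/A and the corresponding
// replacement is disabled (an assumption of this sketch).
inline bool ConvertField(const std::string &field, const SketchOption &opt,
                         float *out) {
  if (field.empty()) {  // null (empty) value
    if (!opt.replace_null) return false;
    *out = opt.null_value;
    return true;
  }
  char *end = nullptr;
  const float v = std::strtof(field.c_str(), &end);
  const bool invalid = (end == field.c_str()) || (*end != '\0');
  if (invalid || std::isnan(v)) {  // N/A, invalid numeric string, or NaN
    if (!opt.replace_na) return false;
    *out = opt.na_value;
    return true;
  }
  *out = v;
  return true;
}
```

For example, with `na_value = -1.0f` and `null_value = 0.0f`, the field `N/A` converts to `-1.0f` and an empty field converts to `0.0f`, while `2.5` is passed through unchanged.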

Parse Text CSV

Parsing text CSV (each field is just a string) is also supported. It uses a different API; see the source code for details.

Compiler options

NANOCSV_NO_IO : Disable I/O(file access, stdio, mmap).
NANOCSV_WITH_RYU : Use the Ryu library (https://github.com/ulfjack/ryu) to parse floating-point strings. This gives precise handling of floating-point values.
- NANOCSV_WITH_RYU_NOINCLUDE: Do not include Ryu header files in nanocsv.h. This is useful when you want to include Ryu header files outside of nanocsv.h.

TODO

[ ] Support UTF-8
- [x] Detect BOM header
- [ ] Validate UTF-8 string
[ ] Support UTF-16 and UTF-32?
[ ] mmap based API
[ ] Reduce memory usage. Currently nanocsv allocates some memory for intermediate buffer.
[ ] Robust error handling.
[x] Support header.
[x] Support comment lines (a line starting with #)
[ ] Support different numbers of fields among records.
[ ] Parse complex value(e.g. 3.0 + 4.2j)
[ ] Parse special value like #INF, #NAN.
- https://docs.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=vs-2019
[ ] Use floaxie https://github.com/aclex/floaxie for better floating point string parsing.
[ ] CSV writer.
[ ] Write tests.
[ ] Remove libm(pow) dependency.

Performance

Dataset is 8192 x 4096, 800 MB in file size(generated by tools/gencsv/gen.py)

Threadripper 1950X
DDR4 2666 64 GB memory

1 thread.

total parsing time: 3833.33 ms
  line detection : 1264.99 ms
  alloc buf      : 0.016351 ms
  parse          : 2508.83 ms
  construct      : 55.726 ms

16 threads.

total parsing time: 545.646 ms
  line detection : 159.078 ms
  alloc buf      : 0.077979 ms
  parse          : 337.207 ms
  construct      : 46.7815 ms

23 threads

23 threads are used here because they are faster than 32 threads on the 1950X.

total parsing time: 494.849 ms
  line detection : 127.176 ms
  alloc buf      : 0.050988 ms
  parse          : 314.287 ms
  construct      : 50.7568 ms

Roughly 7.7 times faster than single-thread parsing.
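
The 7.7x figure follows from the total parsing times listed above (3833.33 ms with 1 thread, 494.849 ms with 23 threads). A minimal sketch of the arithmetic; `Speedup` and `Efficiency` are hypothetical helper names used only for illustration.

```cpp
// Speedup of a parallel run relative to a baseline run (same time unit).
inline double Speedup(double baseline_ms, double parallel_ms) {
  return baseline_ms / parallel_ms;
}

// Parallel efficiency: the fraction of ideal linear scaling achieved.
inline double Efficiency(double speedup, int num_threads) {
  return speedup / static_cast<double>(num_threads);
}
```

With the numbers above, `Speedup(3833.33, 494.849)` is about 7.75 for the 23-thread run and `Speedup(3833.33, 545.646)` is about 7.03 for the 16-thread run. Dividing by the thread count gives a parallel efficiency of roughly 0.34 at 23 threads and 0.44 at 16 threads.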

Note on memory consumption

Not measured, but it should not exceed 3 * filesize, so roughly 2.4 GB for this 800 MB dataset.

In Python

Using numpy.loadtxt to load the same data takes 23.4 secs.

23-thread nanocsv parsing is more than 40 times faster than numpy.loadtxt (23.4 s vs. roughly 0.49 s).

References

RFC 4180 https://www.ietf.org/rfc/rfc4180.txt

License

MIT License

Third-party license

stack_container : Copyright (c) 2006-2008 The Chromium Authors. BSD-style license.
acutest : MIT license. Used for unit tests.
ryu : Apache 2.0 or Boost 1.0 dual license.
