simplecsv

A reboot of the OpenCSV parser for Java.

Origins and Philosophy

In early 2013, I forked the OpenCSV project from Sourceforge, since it hadn't been touched in over two years. Earlier, I had needed to add some functionality, which I posted back to the Sourceforge project's forum. But that patch, along with many others there, was left untouched. There were also a couple of key bugs that had been reported that looked serious enough to me that I didn't want to use the OpenCSV library until they were fixed.

After trying unsuccessfully to fix some of the key bugs in OpenCSV, I concluded that the core of the library -- the CSVParser -- was too complicated a patchwork to salvage. I decided to rewrite it. That effort led to forking the project entirely, with the primary intent of simplifying the parser code, but keeping it fast and generally in the spirit of the OpenCSV library.

Thanks to footloosejava, version 2 of simplecsv can now read CSV records from a file when quoted fields contain newlines, and it properly parses RFC4180 quoted quotes in a quoted field (see below for details if that is confusing).

I toyed with keeping the name "OpenCSV" or even calling the library "ReOpenCSV", but in the end I believe the behavior is just different enough that that would be misleading. My goal has been to simplify, so I call this "simplecsv".

Release Status

Version 2.0 was tagged in July 2014 and is now available in Maven Central: http://search.maven.org/#search|ga|1|simplecsv.

<a name="opencsv"></a>
## Similarities to OpenCSV

All the supporting classes from OpenCSV, such as the CsvWriter, CsvReader, CsvIterator, BeanToCsv, CsvToBean and ResultSetHelper were copied over more or less intact.

Almost all of the differences are in the parsers. As described below, simplecsv has a pluggable parser model and two parsers are currently available: a "simple" one and a "multiline" one.

Pluggable Parser Model

With simplecsv-2.0, CsvParser is now an interface with two methods:

List<String> parse(String s);
List<String> parseNext(Reader reader) throws IOException;

Thus, simplecsv has a pluggable parser model. The original CsvParser (from simplecsv-1.x) is now called SimpleCsvParser and a new parser, MultiLineCsvParser, has been added (by contributor footloosejava).
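As a rough illustration of what "pluggable" means here, the sketch below implements the two-method interface with a deliberately trivial parser. The interface is redeclared locally so the example compiles on its own. `TabSplitParser` is a hypothetical class, not part of simplecsv. Returning `null` at end of input is an assumption, not documented library behavior.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Local copy of the two-method interface shown above, so this sketch is self-contained.
interface CsvParser {
    List<String> parse(String s);
    List<String> parseNext(Reader reader) throws IOException;
}

// Hypothetical plug-in parser: splits on tabs, with no quote or escape handling.
class TabSplitParser implements CsvParser {

    @Override
    public List<String> parse(String s) {
        // A limit of -1 keeps trailing empty fields.
        return new ArrayList<>(Arrays.asList(s.split("\t", -1)));
    }

    @Override
    public List<String> parseNext(Reader reader) throws IOException {
        // Read one character at a time so nothing past the current record is consumed.
        StringBuilder line = new StringBuilder();
        boolean sawAny = false;
        int ch;
        while ((ch = reader.read()) != -1) {
            sawAny = true;
            if (ch == '\n') {
                break;
            }
            if (ch != '\r') { // carriage returns are simply dropped in this sketch
                line.append((char) ch);
            }
        }
        // Assumption: null signals end of input.
        return sawAny ? parse(line.toString()) : null;
    }

    public static void main(String[] args) throws IOException {
        CsvParser parser = new TabSplitParser();
        System.out.println(parser.parse("a\tb\tc")); // prints [a, b, c]

        Reader in = new StringReader("x\ty\nz\n");
        List<String> record;
        while ((record = parser.parseNext(in)) != null) {
            System.out.println(record);
        }
    }
}
```

Any class that honors these two methods could, in principle, stand in for the bundled parsers.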

<a name="multiline"></a>
### MultiLine Parser and RFC4180 compliance

The MultiLineCsvParser adds three features not found in the SimpleCsvParser:

* when reading from a file that has a newline in a quoted field, it will parse it as a single csv "record", whereas the SimpleCsvParser will treat the newline as the end of the record
* it can follow RFC4180 in allowing quotes to escape quotes in a quoted field. This is described in more detail below if you aren't familiar with this (somewhat peculiar) RFC "standard".
* it is threadsafe

Based on a series of benchmarks I ran, the SimpleCsvParser is typically 20 to 30% faster than the MultiLineCsvParser, so SimpleCsvParser is used as the default parser. The CsvParserBuilder returns a MultiLineCsvParser only if you ask for at least one of the following options:

* multiLine (no surprise!)
* supportRfc4180QuotedQuotes - allow quotes inside a quoted field if they are doubled (a quoted quote), à la RFC 4180. See the Options to the CsvParser section for more details on this.
* threadSafe
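To make the "quoted quote" rule concrete, here is a minimal standalone sketch of RFC 4180 style parsing. Inside a quoted field a doubled quote becomes one literal quote, and separators and newlines inside quotes belong to the field. `Rfc4180QuotedQuotesSketch` is a hypothetical name and this is not the MultiLineCsvParser's code. It hard-codes comma and double quote and ignores escape characters.

```java
import java.util.ArrayList;
import java.util.List;

class Rfc4180QuotedQuotesSketch {

    // Parses one record. Newlines inside quoted fields are kept as field content.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote -> one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // includes commas and newlines
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote, not part of the field
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Input as it would appear in a file: one,"say ""hi""",three
        System.out.println(parseLine("one,\"say \"\"hi\"\"\",three")); // prints [one, say "hi", three]
    }
}
```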

<a name="options"></a>
## Options to the CsvParser

With many Java CSV libraries, you have to use a Reader to get anything done. In simplecsv (as with OpenCSV), the CsvParser is a first-class citizen and can be used on its own. A CsvReader is really just a convenience wrapper around the CsvParser.

As with OpenCSV, the separator or delimiter, the escape char and the quote char are configurable.

simplecsv also preserves many of the nice configurable options that OpenCSV provided, changes a few to be more consistent or sensible and adds a number of new ones. Here is a full listing of the options and their defaults:

|----------------------------+---------|
| Option                     | Default |
|----------------------------+---------|
| separator                  | `,`     |
| quoteChar                  | `"`     |
| escapeChar                 | `\`     |
| trimWhitespace             | false   |
| allowUnbalancedQuotes      | false   |
| retainOuterQuotes          | false   |
| alwaysQuoteOutput          | false   |
| strictQuotes               | false   |
| retainEscapeChars          | true    |
| multiLine                  | false   |
| supportRfc4180QuotedQuotes | false   |
| threadSafe                 | false   |
|----------------------------+---------|

Details on CsvParser Options

**Default behavior**

By default, fields do not have whitespace trimmed, unbalanced quotes will cause an exception to be thrown, the outer quotes, if present and not escaped, will be removed, and the entire string between the separators will be returned, including whitespace, escape characters and things outside of quotes if quotes are present.

The default separator is comma; the default escape char is backslash; the default quote char is double quote.

If any newlines are present, even inside quoted strings, they will be interpreted as the end of the csv record.

In the examples below, the >> and << characters are not part of the string - they just indicate its start and end, so whitespace can be "seen" in the input. Also, these examples are shown as if in a text file - not as they would appear in a Java string. The outputs are shown with brackets to indicate that the output is a List<String>. The spaces between words in the output are significant.

_Input_                                  _Output_
>>"one","two","3 3",\"four\"<<       =>  [one,two,3 3,\"four\"]
>>"one", " two " , "3 3",\"four\"<<  =>  [one,  two  , 3 3,\"four\"]

Important Notes:

* The CsvParser is fastest when using the default settings. Changing some settings can reduce overall throughput by 10 to 30%, based on a series of (unpublished) benchmarks I have done. So use as many of the default settings as you can if you are concerned about overall parsing throughput.
* Each of the options below is described in isolation. If you combine options you may get different results, and some option combinations are not allowed. Disallowed combinations are detected at construction time and an error will be thrown.

**TrimWhitespace=true**

This changes the default behavior to trim all outer whitespace. Java's Character.isWhitespace() method is used to define "whitespace", so it includes CR and LF characters. Outer quotes are still removed if not escaped.

CsvParser p = new CsvParserBuilder().
  trimWhitespace(true).
  build();

_Input_                                  _Output_
>>"one", " two " , "3 3",\"four\"<<  =>  [one,two,3 3,\"four\"]
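Since "whitespace" here means whatever `Character.isWhitespace()` accepts, the following sketch shows what trimming with that definition looks like on a single field. `OuterWhitespaceTrimSketch` and `trimOuter` are hypothetical names used for illustration, and this is not the library's implementation. Note that `Character.isWhitespace()` returns false for the non-breaking space (`\u00A0`), so that character would not be trimmed.

```java
class OuterWhitespaceTrimSketch {

    // Removes leading and trailing characters for which Character.isWhitespace() is true.
    static String trimOuter(String field) {
        int start = 0;
        int end = field.length();
        while (start < end && Character.isWhitespace(field.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(field.charAt(end - 1))) {
            end--;
        }
        return field.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("[" + trimOuter(" two \r\n") + "]"); // prints [two]
        System.out.println("[" + trimOuter("3 3") + "]");       // prints [3 3], inner space untouched
    }
}
```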

**AllowUnbalancedQuotes=true**

This changes the default behavior to accept unbalanced quotes and pass them on to the output, rather than throw an Exception.

If AllowUnbalancedQuotes=false (the default), you will get:

_Input_          _Output_
>>one,"""<<  =>  java.lang.IllegalArgumentException: Un-terminated quoted field at end of CSV line

If AllowUnbalancedQuotes=true, you will get:

CsvParser p = new CsvParserBuilder().
  allowUnbalancedQuotes(true).
  build();

_Input_          _Output_
>>one,"""<<  =>  [one,"]

The key meaning of "allow unbalanced quotes" is that no Exception is thrown if the quotes are unbalanced when the parser gets to the end of the line/tuple. The first quote that is seen is still considered the start of a quoted field.

Here's an example (many thanks to Patricia Goldweic):

CsvParser p = new CsvParserBuilder().
  separator('|').
  allowUnbalancedQuotes(true).
  build();

input:

blah|this is a long name for this" record|blah2

The first | seen is interpreted as a field separator, but the second (between "record" and "blah2") is not because it is inside a quoted section. It turns out that the quoted section doesn't have a close quote, but that is allowed since "allow unbalanced quotes" was set to true.

Thus the expected output will be:

tok0: blah
tok1: this is a long name for this" record|blah2
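The behavior described above can be modeled in a few lines. The sketch below is a simplified stand-in, not the SimpleCsvParser itself. It toggles an "in quotes" flag on every quote character, ignores separators while the flag is set, and only complains at the end of the line if the flag is still set and unbalanced quotes are not allowed. It ignores escape characters entirely. `UnbalancedQuotesSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class UnbalancedQuotesSketch {

    static List<String> parse(String line, char separator, char quote,
                              boolean allowUnbalancedQuotes) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes;   // every quote opens or closes a quoted section
                current.append(c);
            } else if (c == separator && !inQuotes) {
                fields.add(stripOuterQuotes(current.toString(), quote));
                current.setLength(0);
            } else {
                current.append(c);      // separators inside quotes land here
            }
        }
        // Still inside a quoted section at end of line: the quotes are unbalanced.
        if (inQuotes && !allowUnbalancedQuotes) {
            throw new IllegalArgumentException("Un-terminated quoted field at end of CSV line");
        }
        fields.add(stripOuterQuotes(current.toString(), quote));
        return fields;
    }

    // Removes one leading and one trailing quote when both are present.
    static String stripOuterQuotes(String field, char quote) {
        int n = field.length();
        if (n >= 2 && field.charAt(0) == quote && field.charAt(n - 1) == quote) {
            return field.substring(1, n - 1);
        }
        return field;
    }

    public static void main(String[] args) {
        String input = "blah|this is a long name for this\" record|blah2";
        // prints [blah, this is a long name for this" record|blah2]
        System.out.println(parse(input, '|', '"', true));
    }
}
```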

**RetainEscapeChars=false**

By default, escape chars are retained, like so:

CsvParser p = new CsvParserBuilder().
  quoteChar('\'').
  build();

_Input_             _Output_
>>one,'\'\''<<  =>  [one,\'\']  (The escapes are in the string.)

But with

CsvParser p = new CsvParserBuilder().
  quoteChar('\'').
  retainEscapeChars(false).
  build();

_Input_             _Output_
>>one,'\'\''<<  =>  [one,'']

Here it kept the inner quotes. The outer quotes are removed as normal and the escape chars are removed.
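For readers who want to see the escape-removal step in isolation, here is a minimal sketch that drops each escape character and keeps the character it protects. `EscapeCharSketch` and `dropEscapeChars` are hypothetical names. How the real parser treats a trailing escape character or a doubled escape is not covered here, so the handling below is an assumption.

```java
class EscapeCharSketch {

    // Drops each escape character and keeps the character that follows it.
    static String dropEscapeChars(String field, char escapeChar) {
        StringBuilder out = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c == escapeChar && i + 1 < field.length()) {
                out.append(field.charAt(++i)); // keep the escaped char, skip the escape
            } else {
                out.append(c);                 // assumption: a trailing lone escape is kept
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Field content after the outer quotes are removed: \'\'
        System.out.println(dropEscapeChars("\\'\\'", '\\')); // prints ''
    }
}
```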

**RetainOuterQuotes=true**

If you want to retain outer quotes that are present in the input, but not add them where they were not present, use this setting.

CsvParser p = new Csv
