SkillAgentSearch skills...

CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

Install / Use

/learn @alan-turing-institute/CleverCSV

README

<p align="center"> <img width="60%" src="https://raw.githubusercontent.com/alan-turing-institute/CleverCSV/eea72549195e37bd4347d87fd82bc98be2f1383d/.logo.png"> <br> <a href="https://github.com/alan-turing-institute/CleverCSV/actions"> <img src="https://github.com/alan-turing-institute/CleverCSV/workflows/build/badge.svg" alt="Github Actions Build Status"> </a> <a href="https://pypi.org/project/clevercsv/"> <img src="https://badge.fury.io/py/clevercsv.svg" alt="PyPI version"> </a> <a href="https://clevercsv.readthedocs.io/en/latest/?badge=latest"> <img src="https://readthedocs.org/projects/clevercsv/badge/?version=latest" alt="Documentation Status"> </a> <a href="https://pepy.tech/project/clevercsv"> <img src="https://pepy.tech/badge/clevercsv" alt="Downloads"> </a> <a href="https://mybinder.org/v2/gh/alan-turing-institute/CleverCSVDemo/master?filepath=CSV_dialect_detection_with_CleverCSV.ipynb"> <img src="https://mybinder.org/badge_logo.svg" alt="Binder"> </a> <a href="https://rdcu.be/bLVur"> <img src="https://img.shields.io/badge/DOI-10.1007%2Fs10618--019--00646--y-blue"> </a> </p>

CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.

Useful links:


Contents: <a href="#quick-start"><b>Quick Start</b></a> | <a href="#introduction"><b>Introduction</b></a> | <a href="#installation"><b>Installation</b></a> | <a href="#usage"><b>Usage</b></a> | <a href="#python-library">Python Library</a> | <a href="#command-line-tool">Command-Line Tool</a> | <a href="#version-control-integration">Version Control Integration</a> | <a href="#contributing"><b>Contributing</b></a> | <a href="#notes"><b>Notes</b></a>


Quick Start

Click here to go to the introduction with more details about CleverCSV. If you're in a hurry, below is a quick overview of how to get started with the CleverCSV Python package and the command line interface.

For the Python package:

# Import the package
>>> import clevercsv

# Load the file as a list of rows
# This uses the imdb.csv file in the examples directory
>>> rows = clevercsv.read_table('./imdb.csv')

# Load the file as a Pandas Dataframe
# Note that df = pd.read_csv('./imdb.csv') would fail here
>>> df = clevercsv.read_dataframe('./imdb.csv')

# Use CleverCSV as drop-in replacement for the Python CSV module
# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer
# Note that csv.Sniffer would fail here
>>> with open('./imdb.csv', newline='') as csvfile:
...     dialect = clevercsv.Sniffer().sniff(csvfile.read())
...     csvfile.seek(0)
...     reader = clevercsv.reader(csvfile, dialect)
...     rows = list(reader)

And for the command line interface:

# Install the full version of CleverCSV (this includes the command line interface)
$ pip install clevercsv[full]

# Detect the dialect
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')

# Generate code to import the file
$ clevercsv code ./imdb.csv

import clevercsv

with open("./imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

# Explore the CSV file as a Pandas dataframe
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>> df

Introduction

  • CSV files are awesome! They are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
  • CSV files are terrible! They can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for recording metadata!

CleverCSV is a Python package that aims to solve some of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of row lengths of the parsed file and the data type of the resulting cells. With our method we achieve 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files compared to the Python standard library.

We think this kind of work can be very valuable for working data scientists and programmers and we hope that you find CleverCSV useful (if there's a problem, please open an issue!) Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        volume = {33},
        number = {6},
        pages = {1799--1820},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

CleverCSV is available on PyPI. You can install either the full version, which includes the command line interface and all optional dependencies, using

$ pip install clevercsv[full]

or you can install a lighter, core version of CleverCSV with

$ pip install clevercsv

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Python Library

We designed CleverCSV to provide a drop-in replacement for the built-in CSV module, with some useful functionality added to it. Therefore, if you simply want to replace the builtin CSV module with CleverCSV, you can import CleverCSV as follows, and use it as you would use the builtin csv module.

import clevercsv

CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect
  • read_table: automatically detects the dialect and encoding of the file, and returns the data as a list of rows. A version that returns a generator is also available: stream_table
  • read_dataframe: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame. Note that this function requires Pandas to be installed.
  • read_dicts: detect the dialect and return the rows of the file as dictionaries, assuming the first row contains the headers. A streaming version called stream_dicts is also available.
  • write_table: write a table (a list of lists) to a file using the RFC-4180 dialect.
  • write_dicts: write a list of dictionaries to a file using the RFC-4180 dialect.

Of course, you can also use the traditional way of loading a C

Related Skills

View on GitHub
GitHub Stars1.3k
CategoryData
Updated27d ago
Forks79

Languages

Python

Security Score

100/100

Audited on Feb 23, 2026

No findings