meza: A Python toolkit for processing tabular data
==================================================

|GHA| |versions| |pypi|
Index
-----

Introduction_ | Requirements_ | Motivation_ | `Hello World`_ | Usage_ |
Interoperability_ | Installation_ | `Project Structure`_ |
`Design Principles`_ | Scripts_ | Contributing_ | Credits_ |
`More Info`_ | License_
Introduction
------------
meza is a Python library_ for reading and processing tabular data.
It has a functional programming style API, excels at reading/writing large files,
and can process 10+ file types.
With meza, you can
- Read csv/xls/xlsx/mdb/dbf files, and more!
- Type cast records (date, float, text...)
- Process Uñicôdë text
- Lazily stream files by default
- and much more...
Requirements
------------
meza has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.
Optional Dependencies
^^^^^^^^^^^^^^^^^^^^^
=============================== ============== ============================== =======================
Function                        Dependency     Installation                   File type / extension
=============================== ============== ============================== =======================
meza.io.read_mdb                mdbtools_      sudo port install mdbtools     Microsoft Access / mdb
meza.io.read_html               lxml_ [#]_     pip install lxml               HTML / html
meza.convert.records2array      NumPy_ [#]_    pip install numpy              n/a
meza.convert.records2df         pandas_        pip install pandas             n/a
=============================== ============== ============================== =======================
Notes
^^^^^
.. [#] If lxml isn't present, read_html will default to the builtin Python html reader
.. [#] records2array can be used without numpy by passing native=True in the function call. This will convert records into a list of native array.array objects.
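As the footnote above notes, the ``native=True`` path represents each column as an ``array.array``. Conceptually, the conversion resembles the following stdlib-only sketch (a hypothetical ``records2native`` helper, not meza's actual implementation):

```python
from array import array

def typecode(value):
    # Simplified type mapping: 'i' for ints, 'd' for floats
    return 'i' if isinstance(value, int) else 'd'

def records2native(records):
    """Hypothetical sketch of a numpy-free conversion: one typed
    array.array per column (not meza's actual implementation)."""
    rows = list(records)
    return [
        array(typecode(rows[0][col]), (row[col] for row in rows))
        for col in rows[0]
    ]

records = [{'col1': 1, 'col2': 2.5}, {'col1': 3, 'col2': 4.5}]
print(records2native(records))
# [array('i', [1, 3]), array('d', [2.5, 4.5])]
```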
Motivation
----------
Why I built meza
^^^^^^^^^^^^^^^^
pandas is great, but installing it isn't exactly a walk in the park, and it
doesn't play nice with PyPy. I designed meza to be a lightweight, easy-to-install,
less featureful alternative to pandas. I also optimized meza for low memory
usage, PyPy compatibility, and functional programming best practices.
Why you should use meza
^^^^^^^^^^^^^^^^^^^^^^^
meza provides a number of benefits / differences from similar libraries such
as pandas. Namely:
- a functional programming (instead of object oriented) API
- iterators by default_ (reading/writing)
- PyPy compatibility_
- geojson support_ (reading/writing)
- seamless integration_ with sqlalchemy (and other libs that work with
  iterators of dicts)
For more detailed information, please check out the FAQ_.
Hello World
-----------
A simple data processing example is shown below:
First create a simple csv file (in bash)
.. code-block:: bash
printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv
Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.
.. code-block:: python
>>> from meza import io, process as pr, convert as cv
>>> from io import open
>>> # Load the csv file
>>> records = io.read_csv('data.csv')
>>> # `records` are iterators over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}
>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)
>>> # Guess column types. Note: `detect_types` returns a new `records`
>>> # generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}
>>> # Now type cast the records. Note: most `meza.process` functions return
>>> # generators, so let's wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}
>>> # Cut out the first column of data and merge the rows to get the max value
>>> # of the remaining columns. Note: since `merge` (by definition) will always
>>> # contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}
>>> # Now write the merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))
>>> # View the result
>>> with open('out.csv', encoding='utf-8') as f:
...     f.read()
'col2,col3\n2015-01-01,3\n'
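The date-casting step above can be mimicked with the standard library alone. The sketch below only illustrates the idea (``detect_types`` and ``type_cast`` do the real work in meza; ``cast_date`` is a hypothetical helper):

```python
from datetime import datetime

def cast_date(value, fmt='%m/%d/%y'):
    # Parse 'm/d/yy' strings such as '5/4/82' into date objects,
    # mirroring what `type_cast` does for columns detected as dates
    return datetime.strptime(value, fmt).date()

row = {'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}
casted = {
    'col1': row['col1'],             # text: left as-is
    'col2': cast_date(row['col2']),  # date: parsed
    'col3': int(row['col3']),        # int: converted
}
print(casted)
# {'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}
```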
Usage
-----
meza is intended to be used directly as a Python library.
Usage Index
^^^^^^^^^^^
- `Reading data`_
- `Processing data`_

  - `Numerical analysis (à la pandas)`_
  - `Text processing (à la csvkit)`_
  - `Geo processing (à la mapbox)`_

- `Writing data`_
- `Cookbook`_
Reading data
^^^^^^^^^^^^
meza can read both filepaths and file-like objects. Additionally, all readers
return equivalent records iterators, i.e., a generator of dictionaries with
keys corresponding to the column names.
.. code-block:: python
>>> from io import open, StringIO
>>> from meza import io
"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')
"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)
"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}
"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, santize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', 'utf-8') as f:
... for row in io.read_xls(f, sanitize=True):
... # do something with the `row`
... pass
"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx') as f:
... records = io.read_xls(f, encoding='utf-8', sheet=1)
... first_row = next(records)
... # do something with the `first_row`
"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)
>>> records = io.read(f, ext='csv', dedupe=True)
Please see readers_ for a complete list of available readers and recognized
file types.
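To make the "equivalent records iterators" claim concrete, here is a minimal stdlib-only reader that yields the same shape of records. This is a simplified sketch for illustration (``read_csv_records`` is a hypothetical name, not meza's implementation):

```python
import csv
from io import StringIO

def read_csv_records(f):
    """Yield one dict per row, keyed by the header row -- the shared
    `records` shape that all meza readers produce (simplified sketch)."""
    yield from csv.DictReader(f)

f = StringIO('col1,col2\nhello,world\n')
records = read_csv_records(f)
print(next(records))
# {'col1': 'hello', 'col2': 'world'}
```

Because every reader emits this same shape, downstream ``meza.process`` functions work identically regardless of the source file type.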
Processing data
^^^^^^^^^^^^^^^
Numerical analysis (à la pandas) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the following example, ``pandas`` equivalent methods are preceded by ``-->``.
.. code-block:: python
>>> import itertools as it
>>> import random
>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats
>>> # Create some data in the same structure as what the various `read...`
>>> # functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}
"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}
"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}
"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3
"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}
>>> # Note: since `aggregate` and `merge` (by definition) return just one row,
>>> # they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938
"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}
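The column-wise ``merge`` used above can be expressed with plain Python. This is a simplified sketch of the pattern, not meza's actual implementation (``merge_records`` is a hypothetical name, and the ``pred`` argument is omitted for brevity):

```python
def merge_records(records, op=sum):
    """Collapse an iterable of dicts into one dict by combining each
    column's values with `op` (simplified sketch of meza's `merge`)."""
    merged = {}
    for record in records:
        for key, value in record.items():
            merged[key] = op([merged[key], value]) if key in merged else value
    return merged

records = [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}, {'A': 5}]
print(merge_records(records, op=sum))  # {'A': 9, 'B': 6}
print(merge_records(records, op=max))  # {'A': 5, 'B': 4}
```

Note how missing keys are simply skipped, which is why `merge` pairs naturally with a predicate for filtering which columns participate.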
Text processing (à la csvkit) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the following example, ``csvkit`` equivalent commands are preceded by ``-->``.
First create a few simple csv files (in bash)
.. code-block:: bash
printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv
Now we can read the files, manipulate the data, convert the manipulated data to
json, and write the json back to a new file. Also, note that since all readers
return equivalent `records` iterators, you can use them interchangeably (in
place of ``read_csv``) to open any supported file. E.g., ``read_xls``,
``read_sqlite``, etc.
.. code-block:: python
>>> import itertools as it
>>> from meza import io, process as pr, convert as cv
"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}
>>> # Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))
"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}
"""Select column `col_2` --> csvcut -c col_2 file1
