meza: A Python toolkit for processing tabular data
==================================================

|GHA| |versions| |pypi|
Index
-----

Introduction_ | Requirements_ | Motivation_ | `Hello World`_ | Usage_ |
Interoperability_ | Installation_ | `Project Structure`_ |
`Design Principles`_ | Scripts_ | Contributing_ | Credits_ |
`More Info`_ | License_
Introduction
------------
meza is a Python library_ for reading and processing tabular data.
It has a functional programming style API, excels at reading/writing large files,
and can process 10+ file types.
With meza, you can
- Read csv/xls/xlsx/mdb/dbf files, and more!
- Type cast records (date, float, text...)
- Process Uñicôdë text
- Lazily stream files by default
- and much more...
Requirements
------------
meza has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.
Optional Dependencies
^^^^^^^^^^^^^^^^^^^^^
=============================== ============== ============================== =======================
Function                        Dependency     Installation                   File type / extension
=============================== ============== ============================== =======================
meza.io.read_mdb                mdbtools_      sudo port install mdbtools     Microsoft Access / mdb
meza.io.read_html               lxml_ [#]_     pip install lxml               HTML / html
meza.convert.records2array      NumPy_ [#]_    pip install numpy              n/a
meza.convert.records2df         pandas_        pip install pandas             n/a
=============================== ============== ============================== =======================
Notes
^^^^^
.. [#] If lxml isn't present, read_html will default to the builtin Python html reader
.. [#] records2array can be used without numpy by passing native=True in the function call. This will convert records into a list of native array.array objects.
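As the footnote above notes, the ``native=True`` path represents each column as an ``array.array``. Conceptually, the conversion resembles the following stdlib-only sketch (a hypothetical ``records2native`` helper, not meza's actual implementation):

```python
from array import array

def typecode(value):
    # Simplified type mapping: 'i' for ints, 'd' for floats
    return 'i' if isinstance(value, int) else 'd'

def records2native(records):
    """Hypothetical sketch of a numpy-free conversion: one typed
    array.array per column (not meza's actual implementation)."""
    rows = list(records)
    return [
        array(typecode(rows[0][col]), (row[col] for row in rows))
        for col in rows[0]
    ]

records = [{'col1': 1, 'col2': 2.5}, {'col1': 3, 'col2': 4.5}]
print(records2native(records))
# [array('i', [1, 3]), array('d', [2.5, 4.5])]
```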
Motivation
----------
Why I built meza
^^^^^^^^^^^^^^^^
pandas is great, but installing it isn't exactly a walk in the park, and it
doesn't play nice with PyPy. I designed meza to be a lightweight, easy-to-install,
less featureful alternative to pandas. I also optimized meza for low memory
usage, PyPy compatibility, and functional programming best practices.
Why you should use meza
^^^^^^^^^^^^^^^^^^^^^^^
meza provides a number of benefits / differences from similar libraries such
as pandas. Namely:
- a functional programming (instead of object oriented) API
- iterators by default_ (reading/writing)
- PyPy compatibility_
- geojson support_ (reading/writing)
- seamless integration_ with sqlalchemy (and other libs that work with
  iterators of dicts)
For more detailed information, please check out the FAQ_.
Hello World
-----------
A simple data processing example is shown below:
First create a simple csv file (in bash)
.. code-block:: bash
printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv
Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.
.. code-block:: python
>>> from meza import io, process as pr, convert as cv
>>> from io import open
>>> # Load the csv file
>>> records = io.read_csv('data.csv')
>>> # `records` are iterators over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}
>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)
>>> # Guess column types. Note: `detect_types` returns a new `records`
>>> # generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}
>>> # Now type cast the records. Note: most `meza.process` functions return
>>> # generators, so let's wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}
>>> # Cut out the first column of data and merge the rows to get the max value
>>> # of the remaining columns. Note: since `merge` (by definition) will always
>>> # contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}
>>> # Now write the merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))
>>> # View the result
>>> with open('out.csv', encoding='utf-8') as f:
...     f.read()
'col2,col3\n2015-01-01,3\n'
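The date-casting step above can be mimicked with the standard library alone. The sketch below only illustrates the idea (``detect_types`` and ``type_cast`` do the real work in meza; ``cast_date`` is a hypothetical helper):

```python
from datetime import datetime

def cast_date(value, fmt='%m/%d/%y'):
    # Parse 'm/d/yy' strings such as '5/4/82' into date objects,
    # mirroring what `type_cast` does for columns detected as dates
    return datetime.strptime(value, fmt).date()

row = {'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}
casted = {
    'col1': row['col1'],             # text: left as-is
    'col2': cast_date(row['col2']),  # date: parsed
    'col3': int(row['col3']),        # int: converted
}
print(casted)
# {'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}
```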
Usage
-----
meza is intended to be used directly as a Python library.
Usage Index
^^^^^^^^^^^
- `Reading data`_
- `Processing data`_

  - `Numerical analysis (à la pandas)`_
  - `Text processing (à la csvkit)`_
  - `Geo processing (à la mapbox)`_

- `Writing data`_
- `Cookbook`_
Reading data
^^^^^^^^^^^^
meza can read both filepaths and file-like objects. Additionally, all readers
return equivalent records iterators, i.e., a generator of dictionaries with
keys corresponding to the column names.
.. code-block:: python
>>> from io import open, StringIO
>>> from meza import io
"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')
"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)
"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}
"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, santize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', 'utf-8') as f:
... for row in io.read_xls(f, sanitize=True):
... # do something with the `row`
... pass
"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx') as f:
... records = io.read_xls(f, encoding='utf-8', sheet=1)
... first_row = next(records)
... # do something with the `first_row`
"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)
>>> records = io.read(f, ext='csv', dedupe=True)
Please see readers_ for a complete list of available readers and recognized
file types.
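To make the "equivalent records iterators" claim concrete, here is a minimal stdlib-only reader that yields the same shape of records. This is a simplified sketch for illustration (``read_csv_records`` is a hypothetical name, not meza's implementation):

```python
import csv
from io import StringIO

def read_csv_records(f):
    """Yield one dict per row, keyed by the header row -- the shared
    `records` shape that all meza readers produce (simplified sketch)."""
    yield from csv.DictReader(f)

f = StringIO('col1,col2\nhello,world\n')
records = read_csv_records(f)
print(next(records))
# {'col1': 'hello', 'col2': 'world'}
```

Because every reader emits this same shape, downstream ``meza.process`` functions work identically regardless of the source file type.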
Processing data
^^^^^^^^^^^^^^^
Numerical analysis (à la pandas) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the following example, ``pandas`` equivalent methods are preceded by ``-->``.
.. code-block:: python
>>> import itertools as it
>>> import random
>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats
>>> # Create some data in the same structure as what the various `read...`
>>> # functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}
"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}
"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}
"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3
"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}
>>> # Note: since `aggregate` and `merge` (by definition) return just one row,
>>> # they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938
"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}
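The column-wise ``merge`` used above can be expressed with plain Python. This is a simplified sketch of the pattern, not meza's actual implementation (``merge_records`` is a hypothetical name, and the ``pred`` argument is omitted for brevity):

```python
def merge_records(records, op=sum):
    """Collapse an iterable of dicts into one dict by combining each
    column's values with `op` (simplified sketch of meza's `merge`)."""
    merged = {}
    for record in records:
        for key, value in record.items():
            merged[key] = op([merged[key], value]) if key in merged else value
    return merged

records = [{'A': 1, 'B': 2}, {'A': 3, 'B': 4}, {'A': 5}]
print(merge_records(records, op=sum))  # {'A': 9, 'B': 6}
print(merge_records(records, op=max))  # {'A': 5, 'B': 4}
```

Note how missing keys are simply skipped, which is why `merge` pairs naturally with a predicate for filtering which columns participate.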
Text processing (à la csvkit) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the following example, ``csvkit`` equivalent commands are preceded by ``-->``.
First create a few simple csv files (in bash)
.. code-block:: bash
printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv
Now we can read the files, manipulate the data, convert the manipulated data to
json, and write the json back to a new file. Also, note that since all readers
return equivalent `records` iterators, you can use them interchangeably (in
place of ``read_csv``) to open any supported file. E.g., ``read_xls``,
``read_sqlite``, etc.
.. code-block:: python
>>> import itertools as it
>>> from meza import io, process as pr, convert as cv
"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}
>>> # Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))
"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}
"""Select column `col_2` --> csvcut -c col_2 file1
