
meza: A Python toolkit for processing tabular data
==================================================

|GHA| |versions| |pypi|

Index
-----

Introduction_ | Requirements_ | Motivation_ | `Hello World`_ | Usage_ | Interoperability_ | Installation_ | `Project Structure`_ | `Design Principles`_ | Scripts_ | Contributing_ | Credits_ | `More Info`_ | License_

Introduction
------------

meza is a Python library_ for reading and processing tabular data. It has a functional-programming-style API, excels at reading and writing large files, and can process 10+ file types.

With meza, you can

- Read csv/xls/xlsx/mdb/dbf files, and more!
- Type cast records (date, float, text...)
- Process Uñicôdë text
- Lazily stream files by default
- and much more...

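The lazy-streaming behavior can be pictured with a stdlib-only sketch (an illustration of the pattern, not meza's actual implementation): each row is yielded as a dict on demand, so even very large files never need to fit in memory.

.. code-block:: python

    import csv
    from io import StringIO

    def read_csv_lazily(f):
        """Yield one dict per row, like meza's `records` iterators (toy sketch)."""
        for row in csv.DictReader(f):
            yield dict(row)

    f = StringIO('col1,col2\nhello,1\nworld,2\n')
    records = read_csv_lazily(f)
    print(next(records))  # {'col1': 'hello', 'col2': '1'}
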
Requirements
------------

meza has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.

Optional Dependencies
^^^^^^^^^^^^^^^^^^^^^

=============================== ============== ============================== =======================
Function                        Dependency     Installation                   File type / extension
=============================== ============== ============================== =======================
meza.io.read_mdb                mdbtools_      sudo port install mdbtools     Microsoft Access / mdb
meza.io.read_html               lxml_ [#]_     pip install lxml               HTML / html
meza.convert.records2array      NumPy_ [#]_    pip install numpy              n/a
meza.convert.records2df         pandas_        pip install pandas             n/a
=============================== ============== ============================== =======================

Notes
^^^^^

.. [#] If ``lxml`` isn't present, ``read_html`` will fall back to the built-in Python HTML parser.

.. [#] ``records2array`` can be used without NumPy by passing ``native=True`` in the function call. This will convert the records into a list of native ``array.array`` objects.
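For illustration only (not meza's actual code), the ``native=True`` behavior described in the note amounts to building one ``array.array`` per column from the records:

.. code-block:: python

    from array import array

    records = [{'a': 1.0, 'b': 2.0}, {'a': 3.0, 'b': 4.0}]

    # one array.array of doubles per column; 'd' is the C-double typecode
    native = [array('d', (row[key] for row in records)) for key in records[0]]
    print(native)  # [array('d', [1.0, 3.0]), array('d', [2.0, 4.0])]
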

Motivation
----------

Why I built meza
^^^^^^^^^^^^^^^^

pandas is great, but installing it isn't exactly a walk in the park, and it doesn't play nicely with PyPy. I designed meza to be a lightweight, easy-to-install, less featureful alternative to pandas. I also optimized meza for low memory usage, PyPy compatibility, and functional programming best practices.

Why you should use meza
^^^^^^^^^^^^^^^^^^^^^^^

meza provides a number of benefits / differences from similar libraries such as pandas. Namely:

- a functional programming (instead of object-oriented) API
- iterators by default_ (reading/writing)
- PyPy compatibility_
- geojson support_ (reading/writing)
- seamless integration_ with SQLAlchemy (and other libraries that work with iterators of dicts)

For more detailed information, please check out the FAQ_.
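That integration works because a ``records`` iterator is just an iterable of dicts. Here is a minimal sketch using the stdlib ``sqlite3`` module in place of SQLAlchemy (the table and column names are made up for illustration):

.. code-block:: python

    import sqlite3

    # stand-in for what any meza reader returns: an iterator of dicts
    records = iter([
        {'col1': 'hello', 'col2': 1},
        {'col1': 'world', 'col2': 2},
    ])

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE data (col1 TEXT, col2 INTEGER)')

    # executemany consumes the iterator lazily, binding each dict to the
    # named placeholders, so the records are never materialized as a list
    conn.executemany('INSERT INTO data VALUES (:col1, :col2)', records)
    print(conn.execute('SELECT COUNT(*) FROM data').fetchone()[0])  # 2
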

Hello World
-----------

A simple data processing example is shown below:

First create a simple csv file (in bash):

.. code-block:: bash

    printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv

Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.

.. code-block:: python

    >>> from meza import io, process as pr, convert as cv
    >>> from io import open

    >>> # Load the csv file
    >>> records = io.read_csv('data.csv')

    >>> # `records` is an iterator over the rows
    >>> row = next(records)
    >>> row
    {'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}

    >>> # Let's replace the first row so as not to lose any data
    >>> records = pr.prepend(records, row)

    # Guess column types. Note: `detect_types` returns a new `records`
    # generator since it consumes rows during type detection
    >>> records, result = pr.detect_types(records)
    >>> {t['id']: t['type'] for t in result['types']}
    {'col1': 'text', 'col2': 'date', 'col3': 'int'}

    # Now type cast the records. Note: most `meza.process` functions return
    # generators, so let's wrap the result in a list to view the data
    >>> casted = list(pr.type_cast(records, result['types']))
    >>> casted[0]
    {'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}

    # Cut out the first column of data and merge the rows to get the max value
    # of the remaining columns. Note: since `merge` (by definition) will always
    # contain just one row, it is returned as is (not wrapped in a generator)
    >>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
    >>> merged = pr.merge(cut_recs, pred=bool, op=max)
    >>> merged
    {'col2': datetime.date(2015, 1, 1), 'col3': 3}

    # Now write the merged data back to a new csv file
    >>> io.write('out.csv', cv.records2csv(merged))

    # View the result
    >>> with open('out.csv', encoding='utf-8') as f:
    ...     f.read()
    'col2,col3\n2015-01-01,3\n'

Usage
-----

meza is intended to be used directly as a Python library.

Usage Index
^^^^^^^^^^^

- `Reading data`_
- `Processing data`_

  - `Numerical analysis (à la pandas)`_
  - `Text processing (à la csvkit)`_
  - `Geo processing (à la mapbox)`_

- `Writing data`_
- Cookbook_

Reading data
^^^^^^^^^^^^

meza can read both filepaths and file-like objects. Additionally, all readers return equivalent records iterators, i.e., a generator of dictionaries with keys corresponding to the column names.

.. code-block:: python

    >>> from io import open, StringIO
    >>> from meza import io

    """Read a filepath"""
    >>> records = io.read_json('path/to/file.json')

    """Read a file-like object and de-duplicate the header"""
    >>> f = StringIO('col,col\nhello,world\n')
    >>> records = io.read_csv(f, dedupe=True)

    """View the first row"""
    >>> next(records)
    {'col': 'hello', 'col_2': 'world'}

    """Read the 1st sheet of an xls file object opened in text mode."""
    # Also, sanitize the header names by converting them to lowercase and
    # replacing whitespace and invalid characters with `_`.
    >>> with open('path/to/file.xls', encoding='utf-8') as f:
    ...     for row in io.read_xls(f, sanitize=True):
    ...         # do something with the `row`
    ...         pass

    """Read the 2nd sheet of an xlsx file object opened in binary mode"""
    # Note: sheets are zero indexed
    >>> with open('path/to/file.xlsx', 'rb') as f:
    ...     records = io.read_xls(f, encoding='utf-8', sheet=1)
    ...     first_row = next(records)
    ...     # do something with the `first_row`

    """Read any recognized file"""
    >>> records = io.read('path/to/file.geojson')
    >>> f.seek(0)
    >>> records = io.read(f, ext='csv', dedupe=True)

Please see readers_ for a complete list of available readers and recognized file types.
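Presumably ``io.read`` picks the right reader from the file extension; that dispatch pattern looks something like this toy sketch (the function and mapping names here are invented for illustration, not meza's internals):

.. code-block:: python

    import os

    # map extensions to reader callables (stand-ins for io.read_csv et al.)
    READERS = {
        'csv': lambda path: f'csv records from {path}',
        'json': lambda path: f'json records from {path}',
    }

    def read_any(filepath):
        """Dispatch to a reader based on the file extension (toy sketch)."""
        ext = os.path.splitext(filepath)[1].lstrip('.')
        if ext not in READERS:
            raise ValueError(f'unrecognized extension: {ext!r}')
        return READERS[ext](filepath)

    print(read_any('data.csv'))  # csv records from data.csv
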

Processing data
^^^^^^^^^^^^^^^

Numerical analysis (à la pandas) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the following example, ``pandas`` equivalent methods are preceded by ``-->``.

.. code-block:: python

    >>> import itertools as it
    >>> import random

    >>> from io import StringIO
    >>> from meza import io, process as pr, convert as cv, stats

    # Create some data in the same structure as what the various `read...`
    # functions output
    >>> header = ['A', 'B', 'C', 'D']
    >>> data = [(random.random() for _ in range(4)) for x in range(7)]
    >>> df = [dict(zip(header, d)) for d in data]
    >>> df[0]
    {'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}

    """Sort records by the value of column `B` --> df.sort_values(by='B')"""
    >>> next(pr.sort(df, 'B'))
    {'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}

    """Select column `A` --> df['A']"""
    >>> next(pr.cut(df, ['A']))
    {'A': 0.53908170489952006}

    """Select the first three rows of data --> df[0:3]"""
    >>> len(list(it.islice(df, 3)))
    3

    """Select all data whose value for column `A` is less than 0.5
    --> df[df.A < 0.5]
    """
    >>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
    {'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}

    # Note: since `aggregate` and `merge` (by definition) return just one row,
    # they return them as is (not wrapped in a generator).
    """Calculate the mean of column `A` across all data --> df.mean()['A']"""
    >>> pr.aggregate(df, 'A', stats.mean)['A']
    0.5410437473067938

    """Calculate the sum of each column across all data --> df.sum()"""
    >>> pr.merge(df, pred=bool, op=sum)
    {'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}
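To make the last example concrete, the column-wise reduction that ``pr.merge`` performs can be sketched in plain Python (an illustration of the idea, not meza's implementation):

.. code-block:: python

    records = [{'A': 1, 'B': 10}, {'A': 2, 'B': 20}, {'A': 3, 'B': 30}]

    # reduce each column across all rows with `sum`, yielding a single row,
    # analogous to pr.merge(records, pred=bool, op=sum)
    merged = {key: sum(row[key] for row in records) for key in records[0]}
    print(merged)  # {'A': 6, 'B': 60}
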

Text processing (à la csvkit) [#]_
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the following example, ``csvkit`` equivalent commands are preceded by ``-->``.

First create a few simple csv files (in bash)

.. code-block:: bash

    printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
    printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv

Now we can read the files, manipulate the data, convert the manipulated data to
json, and write the json back to a new file. Also, note that since all readers
return equivalent `records` iterators, you can use them interchangeably (in
place of ``read_csv``) to open any supported file. E.g., ``read_xls``,
``read_sqlite``, etc.

.. code-block:: python

    >>> import itertools as it

    >>> from meza import io, process as pr, convert as cv

    """Combine the files into one iterator
    --> csvstack file1.csv file2.csv
    """
    >>> records = io.join('file1.csv', 'file2.csv')
    >>> next(records)
    {'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
    >>> next(it.islice(records, 4, None))
    {'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}

    # Now let's create a persistent records list
    >>> records = list(io.read_csv('file1.csv'))

    """Sort records by the value of column `col_2`
    --> csvsort -c col_2 file1.csv
    """
    >>> next(pr.sort(records, 'col_2'))
    {'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}

    """Select column `col_2` --> csvcut -c col_2 file1