Daffodil
Python Daffodil: 2-D data arrays with mixed types, lightweight package, can be faster than Pandas
Python Daffodil
The Python Daffodil (DAtaFrames For Optimized Data Inspection and Logical processing) package provides lightweight, simple, and flexible 2-D dataframes built on native Python data types, with a list-of-lists array as the core datatype. Daffodil is similar to other dataframe packages, such as Pandas, NumPy, Polars, Vaex, Dask, PyArrow, SQLite, PySpark, etc., but is simpler and can be faster because it avoids conversion overhead. Daffodil excels at row-based appends, apply operations, complex embedded types, and similar tasks.
STATUS: Daffodil is largely working well, but some design tradeoffs are still being investigated. Some methods of the class, and some assumptions that can be made about the state of the data, may change slightly as these tradeoffs are evaluated and the initial design is finalized. Please see the GitHub issues to weigh in on the design.
Extensions are also planned to support treating SQL tables as Daffodil data tables, improved support for AI and ML embeddings, and rapid conversion of specified columns to a dict-of-NumPy-arrays form via .to_dnpa(), a lightweight Pandas-like structure that supports column-oriented array operations.
Data Model
The Daffodil data model is very simple. The core array is a list-of-lists (lol), optionally with one or two associated dictionaries: one for the column names and one for row keys.
Daffodil uses standard python data types, and can mix data types in rows and columns, and can store any type within a cell, even another Daffodil instance.
It works well in traditional Pythonic processing paradigms, such as loops, allowing fast row appends, insertions, and other operations that column-oriented packages like Pandas handle poorly or don't offer at all. Selecting, inserting, and appending rows does not copy the data; it uses references the way Python normally does, leveraging the inherent power of Python rather than replacing it.
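Conceptually, this data model can be sketched with plain Python types. The names lol, hd, and kd follow the terminology used here; this is only an illustrative model, not Daffodil's actual implementation:

```python
# Illustrative sketch of the Daffodil data model using plain Python types.
# 'lol', 'hd', and 'kd' follow the README's terminology; the real class adds
# much more, so treat this only as a conceptual model.

# Core array: a list-of-lists, one inner list per row; types may be mixed.
lol = [
    ["alice", 32, 3.9],
    ["bob",   27, 2.5],
]

# hd (header dict): column name -> column index.
hd = {"name": 0, "age": 1, "gpa": 2}

# kd (key dict): row key -> row index.
kd = {"alice": 0, "bob": 1}

# Fast cell access by name: two dict lookups plus one list index.
age_of_bob = lol[kd["bob"]][hd["age"]]

# Appending a row is O(1) and copies nothing; Python stores a reference.
new_row = ["carol", 41, 3.2]
lol.append(new_row)
kd["carol"] = len(lol) - 1
```

Note that the appended row is the same object, not a copy, which is exactly why row-based mutation is cheap in this layout.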
Daffodil offers row-based apply and reduce functions, including support for chunked large data sets, where a Daffodil table operates as a manifest of the chunks. This is useful for delegation in parallel processing, where each delegation can handle a number of chunks.
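The row-oriented apply/reduce pattern can be sketched in plain Python. The helper names and signatures below are illustrative; Daffodil's actual .apply() and .reduce() methods may differ in their interfaces:

```python
# Sketch of row-based apply and reduce over a list-of-lists, mirroring the
# style of Daffodil's .apply() / .reduce(). Helper names and signatures are
# assumptions for illustration, not Daffodil's real API.

lol = [[1, 2], [3, 4], [5, 6]]
hd = {"a": 0, "b": 1}

def apply_rows(lol, fn):
    """Return a new list-of-lists with fn applied to each row."""
    return [fn(row) for row in lol]

def reduce_rows(lol, fn, init):
    """Fold fn over the rows, one row at a time."""
    acc = init
    for row in lol:
        acc = fn(acc, row)
    return acc

doubled = apply_rows(lol, lambda row: [v * 2 for v in row])
col_b_sum = reduce_rows(lol, lambda acc, row: acc + row[hd["b"]], 0)
```

Because each step sees a whole row, mixed types and row-level logic are natural here, unlike in column-at-a-time engines.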
Daffodil is a very simple 'bare metal' class that is well suited to situations where pure number crunching is not the main objective. It is also very compatible with other dataframe packages and provides a great way to build and clean data before handing it to those packages for number crunching.
Tabular data is commonly built record by record, while popular analysis and manipulation tools are oriented toward working on data columns once the data is fully assembled. If only a few column operations (such as sums, stdev, etc.) are performed, say fewer than about 30, it is frequently more performant to leave the data in row format rather than reshaping it into columns and enduring the delays of porting and converting it to those other packages.
Spreadsheet-like operations are also provided. These are useful for processing the entire array with the same formula template: Python equations in the formula pane operate on the data pane, and calculations from spreadsheet programs can easily be ported in, avoiding ad hoc glue code for many transformations.
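The spreadsheet-like idea can be sketched as reusing one formula template across cells. This is only a conceptual illustration; Daffodil's actual formula-pane mechanism is richer than this:

```python
# Sketch of a spreadsheet-like pass: one formula template, reused for every
# cell of a target column, reading other cells of the same row. Illustrative
# only; Daffodil's actual formula-pane mechanism is richer than this.

data = [
    [1, 2, 0],
    [3, 4, 0],
]

def row_total(data, r, c):
    """Formula template: sum of the cells to the left of (r, c)."""
    return sum(data[r][:c])

# Apply the same formula down the last column, like filling a spreadsheet.
for r in range(len(data)):
    data[r][2] = row_total(data, r, 2)
```

This is the kind of whole-array transformation that would otherwise require one-off glue code per column.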
Good for general data operations
We were surprised to find that Pandas is very slow at importing Python data. Pandas uses a NumPy array for each column, which must be allocated in memory as one contiguous block. Converting a row-oriented list-of-dicts (lod) array to a Pandas DataFrame using the simple pd.DataFrame(lod) method takes about 45x longer than converting the same data to a Daffodil instance.
The Daffodil class is simple: the data array is a list-of-lists (lol), with a dictionary for column names (hd -- header dict) and one for row keys (kd -- key dict), making column and row indexing extremely fast while avoiding the requirement for contiguous data allocation. Python uses dynamic arrays to store references to each data item in the lol structure. Daffodil also provides a simple UI that makes it easy to select rows, columns, and slices by number or name.
For numerical operations such as sums, max, min, stdev, etc., Daffodil is not as performant as Pandas or NumPy when the data is uniform within columns or across the entire array. Daffodil also does not offer whole-array operations like C = A + B, where A and B are large arrays of the same shape and C is their elementwise sum. This type of functionality, as well as matrix operations, is already available in NumPy, and NumPy can fill that role.
Appending rows in Pandas is slow because each column is stored as a separate NumPy array, and appending a row involves creating a new array for each column with the added row. This can lead to significant overhead, especially with large DataFrames. In fact, it is so costly that the append operation is now deprecated in Pandas. That means you have to turn to some other method of building your data, and then the question arises: should I export the data to Pandas, or just work on it in Python?
When is it worth porting to Pandas? That will depend on the size of the array and what the calculations are. But a convenient rule of thumb from our testing is that Pandas can be more performant than Daffodil if column-oriented manipulations (similar to summing) are repeated on the same data at least ~30 times.
In other words, if you have an array and need to do just a few column-based operations (fewer than about 30), it will probably be faster to do them using Daffodil's .apply() or .reduce() operations rather than exporting the array to Pandas, performing the calculations, and transferring the result back. (You can see our benchmarks and other tests linked below.)
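As a concrete illustration, a handful of column statistics can be computed directly on the row-oriented data with the standard library, with no export step at all. This sketch uses plain lol/hd structures rather than the Daffodil API:

```python
import statistics

# A few column-based calculations done directly on row-oriented data,
# without exporting to Pandas. Extracting one column is a single pass
# over the rows. Plain lol/hd structures stand in for a Daffodil table.
lol = [[10.0, 1], [12.0, 2], [11.0, 3]]
hd = {"price": 0, "qty": 1}

prices = [row[hd["price"]] for row in lol]   # materialize one column

price_sum = sum(prices)
price_mean = statistics.mean(prices)
price_stdev = statistics.stdev(prices)       # sample standard deviation
```

For one or two such passes, the cost of the column extraction is trivial next to the cost of converting the whole table to another package's format.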
In addition, Daffodil is a good partner with Pandas and NumPy when only some number crunching and array-based operations are needed. Use Daffodil to build the array incrementally using row-based operations, then export the data to NumPy or Pandas. NumPy is recommended if the data is uniform enough because it is faster and has a smaller memory footprint than Pandas.
Daffodil is pure Python and can run with no (or few) other packages installed. It is relatively tiny. If Pandas is not used, startup time improves dramatically. This can be very important in cloud-based parallel processing, where every millisecond is billed, or in embedded systems that want to use tables but can't afford the overhead. If conversions to or from Pandas are not required, that package is not needed at all.
Daffodil record lookups by key are extremely fast because they use the Python dictionary for looking up rows. It is about 10x faster than Pandas in this regard. It will be hard to beat this, as long as the data table can fit in memory. And with any string data, Daffodil tables are smaller than Pandas tables.
Memory Footprint
A Daffodil object (usually daf) is about 1/3 the size of a Python lod (list-of-dicts) structure, where each record repeats the column names. The Daffodil array has only one (optional) dictionary for column names and one (optional) dictionary for row keys.
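The reason the lod form is heavier is easy to see by counting stored column-name references. The sketch below counts name references rather than bytes, since exact byte counts depend on the Python build:

```python
# Why a list-of-dicts (lod) is heavier than the Daffodil-style layout:
# every lod record carries its own set of column-name keys, while the
# lol + header-dict layout stores each column name exactly once.

cols = ["name", "age", "gpa"]
records = [("alice", 32, 3.9), ("bob", 27, 2.5), ("carol", 41, 3.2)]

# lod: one dict per record, repeating the column names in every row.
lod = [dict(zip(cols, rec)) for rec in records]

# Daffodil-style: bare rows plus a single header dict.
lol = [list(rec) for rec in records]
hd = {c: i for i, c in enumerate(cols)}

lod_name_refs = sum(len(d) for d in lod)   # one key reference per cell
daf_name_refs = len(hd)                    # one reference per column, total
```

With three columns and three rows, the lod carries nine name references against the header dict's three, and the gap grows linearly with row count.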
With only numeric data, Daffodil takes about 4x more memory than a minimal Pandas DataFrame and 10x more than a single NumPy array. Yet Pandas can be much larger when strings are included in the data: adding one string column to be used for indexed selections in Pandas consumes 10x more memory than the same data without that column. Daffodil does not expand appreciably and will be about 1/3 the size of Pandas in that case.
Thus, Daffodil is a compromise. It is not as wasteful as commonly used lod for such tables, and is a good choice when rapid appends, inserts, row-based operations, and other mutation is required. It also provides row and column operations using [row, col] indexes, where each can be slices, or lists of indices or names. This type of indexing is syntactically similar to what is offered by Pandas and Polars, but Daffodil has almost no constraints on the data in the array, including mixed types in columns, other objects, and even entire Daffodil arrays in one cell. We believe the UI is simpler than what is offered by Pandas.
Supports incremental appends
Daffodil can append or insert one or more rows, or concatenate with another Daffodil object, extremely quickly, because it leverages Python's data-referencing approach: when data can be used without copying, overhead is minimized. Concatenating, dropping, and inserting columns is provided but not recommended; instead of explicitly dropping columns, simply provide a list of the columns to include in calculations.
Column names
As in Pandas and other dataframe packages, Daffodil has a separate, optional set of column names. This is organized internally as a Python dictionary (hd -- header dict) for fast column lookups by name.
Column names must be hashable and unique, and other than that, there are no firm restrictions.
(However, to use the interface with SQLite, avoid using the double underscore "__" in the names, which is used to allow arbitrary names in SQLite.)
When reading CSV files, the header is normally taken from the first (non-comment) line. If "user_format" is specified when reading CSV files, the data is pre-processed and comment lines starting with # are removed.
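The comment-stripping and header-from-first-line behavior can be sketched with the standard csv module. This only models the behavior described above; Daffodil's actual "user_format" preprocessing may differ in its details:

```python
import csv
import io

# Sketch of the CSV handling described above: drop '#' comment lines, take
# the first remaining line as the header, and build hd (header dict).
# Daffodil's actual "user_format" preprocessing may differ in details.

raw = """# produced by export tool
# settings: default
name,age
alice,32
bob,27
"""

lines = [ln for ln in raw.splitlines() if not ln.startswith("#")]
reader = csv.reader(io.StringIO("\n".join(lines)))

header = next(reader)                          # first non-comment line
hd = {name: i for i, name in enumerate(header)}
lol = [row for row in reader]                  # remaining rows as the array
```

Note that csv.reader yields all cells as strings; any per-column type conversion would be a separate pass.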