NimData

DataFrame API written in Nim, enabling fast out-of-core data processing

Overview

NimData is a data manipulation and analysis library for the Nim programming language. It combines Pandas-like syntax with the type-safe, lazy APIs of distributed frameworks like Spark/Flink/Thrill. Although NimData is currently non-distributed, it harnesses the power of Nim to perform out-of-core processing at native speed.

NimData's core data type is the generic DataFrame[T]. All DataFrame methods are based on the MapReduce paradigm and fall into two categories:

Transformations: Operations like map or filter transform one DataFrame into another. Transformations are lazy, meaning that they are not executed until an action is called. They can also be chained.
Actions: Operations like count, min, max, sum, reduce, fold, collect, or show perform an aggregation on a DataFrame. Calling an action triggers the processing pipeline.
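As a minimal sketch of the difference between the two categories, the snippet below chains two transformations and then calls two actions. It uses only operations named above plus DF.fromRange from the Quickstart. It assumes the `=>` lambda syntax comes from Nim's sugar module (called future in older Nim releases).

```nim
import sugar    # assumption: provides the `=>` lambda syntax
import nimdata

# Transformations only describe the pipeline; nothing is computed here.
let pipeline = DF.fromRange(0, 10)
                 .map(x => x * 2)            # 0, 2, 4, ..., 18
                 .filter(x => x mod 3 == 0)  # keep multiples of 3

# Each action below triggers its own pass over the pipeline.
echo pipeline.count()    # => 4
echo pipeline.collect()  # => @[0, 6, 12, 18]
```

Because no caching is involved, the second action re-runs the whole chain from the start rather than reusing the first result.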

For a complete list of NimData's supported operations, see the module docs.

Installation

Install Nim and ensure that both Nim and Nimble (Nim's package manager) are added to your PATH.
From the command line, run $ nimble install NimData (this will download NimData's source from GitHub to ~/.nimble/pkgs).

Quickstart

Hello, World!

Once NimData is installed, we'll write a simple program to test it. Create a new file named test.nim with the following contents:

import nimdata

echo DF.fromRange(0, 10).collect()

From the command line, use $ nim c -r test.nim to compile and run the program (c for compile, and -r to run directly after compilation). It should print this sequence:

# => @[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Pandas users: This is roughly equivalent to print(pd.DataFrame(range(10))[0].values)

Reading raw text data

Next we'll use this German soccer data set to explore NimData's main functionality.

To create a DataFrame which simply iterates over the raw text content of a file, we can use DF.fromFile():

let dfRawText = DF.fromFile("examples/Bundesliga.csv")

Note that fromFile is a lazy operation, meaning that NimData doesn't actually read the contents of the file yet. To read the file, we need to call an action on our dataframe. Calling count, for example, triggers a line-by-line reading of the file and returns the number of rows:

echo dfRawText.count()
# => 14018

We can chain multiple operations on dfRawText. For example, we can use take to filter the file down to its first five rows, and show to print the result:

dfRawText.take(5).show()
# =>
# "1","Werder Bremen","Borussia Dortmund",3,2,1,1963,1963-08-24 09:30:00
# "2","Hertha BSC Berlin","1. FC Nuernberg",1,1,1,1963,1963-08-24 09:30:00
# "3","Preussen Muenster","Hamburger SV",1,1,1,1963,1963-08-24 09:30:00
# "4","Eintracht Frankfurt","1. FC Kaiserslautern",1,1,1,1963,1963-08-24 09:30:00
# "5","Karlsruher SC","Meidericher SV",1,4,1,1963,1963-08-24 09:30:00

Pandas users: This is equivalent to print(dfRawText.head(5)).

Note, however, that every time an action is called, the file is read from scratch, which is inefficient. We'll improve on that in a moment.

Type-safe schema parsing

At this stage, dfRawText's data type is a plain DataFrame[string]. It also doesn't have any column headers, and the first field isn't a proper index, but rather contains string literals. Let's transform our dataframe into something more useful for analysis:

const schema = [
  strCol("index"),
  strCol("homeTeam"),
  strCol("awayTeam"),
  intCol("homeGoals"),
  intCol("awayGoals"),
  intCol("round"),
  intCol("year"),
  dateCol("date", format="yyyy-MM-dd hh:mm:ss")
]
let df = dfRawText.map(schemaParser(schema, ','))
                  .map(record => record.projectAway(index))
                  .cache()

This code does three things:

The schemaParser macro constructs a specialized parsing function for each field, which takes a string as input and returns a type-safe named tuple corresponding to the type definition in schema. For instance, dateCol("date") tells the parser that the last column is named "date" and contains datetime values. We can even specify the datetime format by passing a format string to dateCol() as a named parameter. A key benefit of defining the schema at compile time is that the parser produces highly optimized machine code, resulting in very fast performance.
The projectAway macro transforms the results of schemaParser into a new dataframe with the "index" column removed (Pandas users: this is roughly equivalent to dfRawText.drop(columns=['index'])). See also projectTo, which instead keeps only the specified fields, and addFields, which extends the schema with new fields.
The cache method stores the parsing result in memory. This allows us to perform multiple actions on the data without having to re-read the file contents every time. Spark users: In contrast to Spark, cache is currently implemented as an action.
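To make the first step more concrete, here is a hand-written, standard-library-only analogue of what a parser for this schema does conceptually: it turns one raw line into a named tuple with typed fields. This is not the code that schemaParser generates. The GameRecord type and the parseLine proc are hypothetical names for illustration. The date is left as a plain string, and the naive split on ',' would break on quoted fields that contain commas.

```nim
import strutils

# Hypothetical record type mirroring the schema above.
type GameRecord = tuple[
  index, homeTeam, awayTeam: string,
  homeGoals, awayGoals, round, year: int,
  date: string
]

proc parseLine(line: string): GameRecord =
  ## Splits one CSV line and converts each field to its declared type.
  let fields = line.split(',')
  doAssert fields.len == 8, "unexpected number of fields"
  result = (
    index: fields[0],
    homeTeam: fields[1],
    awayTeam: fields[2],
    homeGoals: parseInt(fields[3]),  # raises ValueError on malformed input
    awayGoals: parseInt(fields[4]),
    round: parseInt(fields[5]),
    year: parseInt(fields[6]),
    date: fields[7]
  )

let game = parseLine(
  "\"1\",\"Werder Bremen\",\"Borussia Dortmund\",3,2,1,1963,1963-08-24 09:30:00")
echo game.homeGoals + game.awayGoals  # => 5
```

Once a record has this shape, field access such as game.homeGoals is checked by the compiler, which is the property the rest of the pipeline relies on.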

Now we can perform the same operations as before, but this time our dataframe contains the parsed tuples:

echo df.count()
# => 14018

df.take(5).show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Werder B… | "Borussia… |          3 |          2 |          1 |       1963 | 1963-08-2… |
# | "Hertha B… | "1. FC Nu… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Preussen… | "Hamburge… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Eintrach… | "1. FC Ka… |          1 |          1 |          1 |       1963 | 1963-08-2… |
# | "Karlsruh… | "Meideric… |          1 |          4 |          1 |       1963 | 1963-08-2… |
# +------------+------------+------------+------------+------------+------------+------------+

Note that instead of starting the pipeline from dfRawText and using caching, we could always write the pipeline from scratch:

DF.fromFile("examples/Bundesliga.csv")
  .map(schemaParser(schema, ','))
  .map(record => record.projectAway(index))
  .take(5)
  .show()

Filter

Use filter to keep only the records that satisfy a predicate. For instance, we can restrict the data to the games of a particular team:

import strutils

df.filter(record =>
    record.homeTeam.contains("Freiburg") or
    record.awayTeam.contains("Freiburg")
  )
  .take(5)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Bayern M… | "SC Freib… |          3 |          1 |          1 |       1993 | 1993-08-0… |
# | "SC Freib… | "Wattensc… |          4 |          1 |          2 |       1993 | 1993-08-1… |
# | "Borussia… | "SC Freib… |          3 |          2 |          3 |       1993 | 1993-08-2… |
# | "SC Freib… | "Hamburge… |          0 |          1 |          4 |       1993 | 1993-08-2… |
# | "1. FC Ko… | "SC Freib… |          2 |          0 |          5 |       1993 | 1993-09-0… |
# +------------+------------+------------+------------+------------+------------+------------+

Note: Without importing the strutils module, the call to contains fails to compile with a type error.

Or search for games with many home goals:

df.filter(record => record.homeGoals >= 10)
  .show()
# =>
# +------------+------------+------------+------------+------------+------------+------------+
# | homeTeam   | awayTeam   |  homeGoals |  awayGoals |      round |       year | date       |
# +------------+------------+------------+------------+------------+------------+------------+
# | "Borussia… | "Schalke … |         11 |          0 |         18 |       1966 | 1967-01-0… |
# | "Borussia… | "Borussia… |         10 |          0 |         12 |       1967 | 1967-11-0… |
# | "Bayern M… | "Borussia… |         11 |          1 |         16 |       1971 | 1971-11-2… |
# | "Borussia… | "Borussia… |         12 |          0 |         34 |       1977 | 1978-04-2… |
# | "Borussia… | "Arminia … |         11 |          1 |         12 |       1982 | 1982-11-0… |
# | "Borussia… | "Eintrach… |         10 |          0 |          8 |       1984 | 1984-10-1… |
# +------------+------------+------------+------------+------------+------------+------------+

Note that we can now fully benefit from type safety: the compiler knows the exact fields and types of a record, so no dynamic field lookup or type casting is required. Assumptions about the data structure are moved to the earliest possible step in the pipeline, so the program fails early if they are wrong. Once the data is in the type-safe domain, the compiler helps verify the correctness of even long processing pipelines, reducing the risk of runtime errors.
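The following standard-library-only sketch illustrates the kind of mistake the compiler catches once records are typed. The Game tuple is a hypothetical stand-in for the records produced by the parser, and Nim's built-in compiles is used to probe whether an expression would type-check.

```nim
# Hypothetical record type standing in for a parsed record.
type Game = tuple[homeTeam: string, homeGoals: int]

let game: Game = (homeTeam: "SC Freiburg", homeGoals: 4)

# A correctly typed field access compiles.
doAssert compiles(game.homeGoals >= 10)

# A misspelled field name is rejected at compile time.
doAssert not compiles(game.homeGoal >= 10)

# Comparing a string field with an integer is rejected as well.
doAssert not compiles(game.homeTeam >= 10)

echo "all compile-time checks behaved as expected"
```

In a real pipeline the two rejected expressions would simply be compile errors inside the filter lambda, so the program would never reach the point of reading the file.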

Other filter-like transformations are:

take, which takes the first N records, as already seen.
drop, which discards the first N records.
filterWithIndex, which allows to define a filter function th
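As a small sketch of how take and drop combine, the snippet below skips the first three records of a range and keeps the next two. It relies only on DF.fromRange from the Quickstart and on the take and drop behavior described above.

```nim
import nimdata

# Skip the first three records, then keep the next two.
let window = DF.fromRange(0, 10)
               .drop(3)
               .take(2)

echo window.collect()  # => @[3, 4]
```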
