Geni
A Clojure dataframe library that runs on Spark
Geni (/gɜni/ or "gurney" without the r) is a Clojure dataframe library that runs on Apache Spark. The name means "fire" in Javanese.
Overview
Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's -> threading macro as the main way to compose Spark's Dataset and Column operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on Geni semantics for more details.
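As a small sketch of that dynamism (a hypothetical snippet, assuming a dataset bound to `dataframe` with `population` and `ocean_proximity` columns, as in the examples below):

```clojure
(require '[zero-one.geni.core :as g])

;; Keywords, strings and Column objects may be mixed freely as column references:
(-> dataframe
    (g/filter (g/< :population 1000))               ; keyword
    (g/select "ocean_proximity"                     ; string
              (g/as (g/* (g/col :population) 2)     ; Column expression
                    :population-x2))
    g/show)
```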
Resources
<table> <tbody> <tr> <th align="center" width="441"> Docs </th> <th align="center" width="441"> Cookbook </th> </tr> <tr> <td> <ul> <li><a href="docs/simple_performance_benchmark.md">A Simple Performance Benchmark</a></li> <li><a href="CODE_OF_CONDUCT.md">Code of Conduct</a></li> <li><a href="CONTRIBUTING.md">Contributing Guide</a></li> <li><a href="docs/creating_spark_schemas.md">Creating Spark Schemas</a></li> <li><a href="docs/examples.md">Examples</a></li> <li><a href="docs/design_goals.md">Design Goals</a></li> <li><a href="docs/semantics.md">Geni Semantics</a></li> <li><a href="docs/manual_dataset_creation.md">Manual Dataset Creation</a></li> <li><a href="docs/xgboost.md">Optional XGBoost Support</a></li> <li><a href="docs/pandas_numpy_and_other_idioms.md">Pandas, NumPy and Other Idioms</a></li> <li><a href="docs/dataproc.md">Using Dataproc</a></li> <li><a href="docs/kubernetes_basic.md">Using Kubernetes</a></li> <li><a href="docs/spark_session.md">Where's The Spark Session</a></li> <li><a href="docs/why.md">Why?</a></li> <li><a href="docs/sql_maps.md">Working with SQL Maps</a></li> <li><a href="docs/collect.md">Collecting Data from Spark Datasets</a></li> </ul> </td> <td> <ol start="0"> <li><a href="docs/cookbook/part_00_getting_started_with_clojure_geni_and_spark.md"> Getting Started with Clojure, Geni and Spark </a></li> <li><a href="docs/cookbook/part_01_reading_and_writing_datasets.md"> Reading and Writing Datasets </a></li> <li><a href="docs/cookbook/part_02_selecting_rows_and_columns.md"> Selecting Rows and Columns </a></li> <li><a href="docs/cookbook/part_03_grouping_and_aggregating.md"> Grouping and Aggregating </a></li> <li><a href="docs/cookbook/part_04_combining_datasets_with_joins_and_unions.md"> Combining Datasets with Joins and Unions </a></li> <li><a href="docs/cookbook/part_05_string_operations.md"> String Operations </a></li> <li><a href="docs/cookbook/part_06_cleaning_up_messy_data.md"> Cleaning up Messy Data </a></li> <li><a 
href="docs/cookbook/part_07_timestamps_and_dates.md"> Timestamps and Dates </a></li> <li><a href="docs/cookbook/part_08_window_functions.md"> Window Functions </a></li> <li><a href="docs/cookbook/part_09_reading_from_and_writing_to_sql_databases.md"> Reading from and Writing to SQL Databases </a></li> <li><a href="docs/cookbook/part_10_avoiding_repeated_computations_with_caching.md"> Avoiding Repeated Computations with Caching </a></li> <li><a href="docs/cookbook/part_11_basic_ml_pipelines.md"> Basic ML Pipelines </a></li> <li><a href="docs/cookbook/part_12_customer_segmentation_with_nmf.md"> Customer Segmentation with NMF </a></li> </ol> </td> </tr> </tbody> </table>

Basic Examples
All examples below use the Statlib California housing prices data available for free on Kaggle.
Spark SQL API for data wrangling:
(require '[zero-one.geni.core :as g])
(def dataframe (g/read-parquet! "test/resources/housing.parquet"))
(g/count dataframe)
=> 5000
(g/print-schema dataframe)
; root
; |-- longitude: double (nullable = true)
; |-- latitude: double (nullable = true)
; |-- housing_median_age: double (nullable = true)
; |-- total_rooms: double (nullable = true)
; |-- total_bedrooms: double (nullable = true)
; |-- population: double (nullable = true)
; |-- households: double (nullable = true)
; |-- median_income: double (nullable = true)
; |-- median_house_value: double (nullable = true)
; |-- ocean_proximity: string (nullable = true)
(-> dataframe (g/limit 5) g/show)
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|ocean_proximity|
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
; |-122.23  |37.88   |41.0              |880.0      |129.0         |322.0     |126.0     |8.3252       |452600.0          |NEAR BAY       |
; |-122.22  |37.86   |21.0              |7099.0     |1106.0        |2401.0    |1138.0    |8.3014       |358500.0          |NEAR BAY       |
; |-122.24  |37.85   |52.0              |1467.0     |190.0         |496.0     |177.0     |7.2574       |352100.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1274.0     |235.0         |558.0     |219.0     |5.6431       |341300.0          |NEAR BAY       |
; |-122.25  |37.85   |52.0              |1627.0     |280.0         |565.0     |259.0     |3.8462       |342200.0          |NEAR BAY       |
; +---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+---------------+
(-> dataframe (g/describe :housing_median_age :total_rooms :population) g/show)
; +-------+------------------+------------------+-----------------+
; |summary|housing_median_age|total_rooms       |population       |
; +-------+------------------+------------------+-----------------+
; |count  |5000              |5000              |5000             |
; |mean   |30.9842           |2393.2132         |1334.9684        |
; |stddev |12.969656616832669|1812.4457510408017|954.0206427949117|
; |min    |1.0               |1000.0            |100.0            |
; |max    |9.0               |999.0             |999.0            |
; +-------+------------------+------------------+-----------------+
(-> dataframe
    (g/group-by :ocean_proximity)
    (g/agg {:count        (g/count "*")
            :mean-rooms   (g/mean :total_rooms)
            :distinct-lat (g/count-distinct (g/int :latitude))})
    (g/order-by (g/desc :count))
    g/show)
; +---------------+-----+------------------+------------+
; |ocean_proximity|count|mean-rooms        |distinct-lat|
; +---------------+-----+------------------+------------+
; |INLAND         |1823 |2358.181020296215 |10          |
; |<1H OCEAN      |1783 |2467.5361749859785|7           |
; |NEAR BAY       |1287 |2368.72027972028  |2           |
; |NEAR OCEAN     |107  |2046.1869158878505|2           |
; +---------------+-----+------------------+------------+
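Aggregated columns behave like any other column, so groups can be filtered after aggregation. A minimal sketch reusing `dataframe` from above:

```clojure
;; Keep only the ocean-proximity groups with more than 1000 rows:
(-> dataframe
    (g/group-by :ocean_proximity)
    (g/agg {:count (g/count "*")})
    (g/filter (g/> :count 1000))
    g/show)
```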
(-> dataframe
    (g/select {:ocean :ocean_proximity
               :house (g/struct {:rooms (g/struct :total_rooms :total_bedrooms)
                                 :age   :housing_median_age})
               :coord (g/struct {:lat :latitude :long :longitude})})
    (g/limit 3)
    g/collect)
=> ({:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 880.0, :total_bedrooms 129.0},
             :age   41.0},
     :coord {:lat 37.88, :long -122.23}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 7099.0, :total_bedrooms 1106.0},
             :age   21.0},
     :coord {:lat 37.86, :long -122.22}}
    {:ocean "NEAR BAY",
     :house {:rooms {:total_rooms 1467.0, :total_bedrooms 190.0},
             :age   52.0},
     :coord {:lat 37.85, :long -122.24}})
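When row maps are not needed, Geni's collection functions (see the Collecting Data from Spark Datasets doc linked above) include `g/collect-vals`, which returns plain value vectors instead; a sketch using the same dataset:

```clojure
;; Collect values only, without column-name keys:
(-> dataframe
    (g/select :ocean_proximity :median_house_value)
    (g/limit 2)
    g/collect-vals)
```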
Spark ML example translated from Spark's programming guide:
(require '[zero-one.geni.core :as g])
(require '[zero-one.geni.ml :as ml])
(def training-set
  (g/table->dataset
    [[0 "a b c d e spark"  1.0]
     [1 "b d"              0.0]
     [2 "spark f g h"      1.0]
     [3 "hadoop mapreduce" 0.0]]
    [:id :text :label]))
(def pipeline
  (ml/pipeline
    (ml/tokenizer {:input-col  :text
                   :output-col :words})
    (ml/hashing-tf {:num-features 1000
                    :input-col    :words
                    :output-col   :features})
    (ml/logistic-regression {:max-iter  10
                             :reg-param 0.001})))
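The pipeline can then be fitted and used to score new rows, following the same Spark programming-guide example (a sketch; the test rows are unlabelled analogues of the training set):

```clojure
(def model (ml/fit training-set pipeline))

(def test-set
  (g/table->dataset
    [[4 "spark i j k"]
     [5 "l m n"]
     [6 "spark hadoop spark"]
     [7 "apache hadoop"]]
    [:id :text]))

;; Apply the fitted pipeline and inspect the predictions:
(-> test-set
    (ml/transform model)
    (g/select :id :text :probability :prediction)
    g/show)
```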