Fileprep
struct-tag preprocessing and validation for CSV/TSV/LTSV, JSON/JSONL, Parquet, Excel.
Install / Use
/learn @nao1215/FileprepREADME
fileprep
日本語 | Español | Français | 한국어 | Русский | 中文

fileprep is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 streams.
Why fileprep?
I developed nao1215/filesql, which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. I also created nao1215/csv for CSV file validation.
While studying machine learning, I realized: "If I extend nao1215/csv to support the same file formats as nao1215/filesql, I could combine them to perform ETL-like operations." This idea led to the creation of fileprep—a library that bridges data preprocessing/validation with SQL-based file querying.
Features
- Multiple file format support: CSV, TSV, LTSV, JSON (.json), JSONL (.jsonl), Parquet, Excel (.xlsx)
- Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst), zlib (.z), snappy (.snappy), s2 (.s2), lz4 (.lz4)
- Name-based column binding: Fields auto-match
snake_casecolumn names, customizable vianametag - Struct tag-based preprocessing (
preptag): trim, lowercase, uppercase, default values - Struct tag-based validation (
validatetag): required, omitempty, and more - Processor options:
WithStrictTagParsing()for catching tag misconfigurations,WithValidRowsOnly()for filtering output - Seamless filesql integration: Returns
io.Readerfor direct use with filesql - Detailed error reporting: Row and column information for each error
Installation
go get github.com/nao1215/fileprep
Requirements
- Go Version: 1.25 or later
- Operating Systems:
- Linux
- macOS
- Windows
Quick Start
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// User represents a user record with preprocessing and validation
type User struct {
Name string `prep:"trim" validate:"required"`
Email string `prep:"trim,lowercase"`
Age string
}
func main() {
csvData := `name,email,age
John Doe ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var users []User
reader, result, err := processor.Process(strings.NewReader(csvData), &users)
if err != nil {
fmt.Printf("Error: %v\n", err)
return
}
fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)
for _, user := range users {
fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
}
// reader can be passed directly to filesql
_ = reader
}
Output:
Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"
Gotchas
A few things worth knowing before you start.
JSON/JSONL → single "data" column. fileparser flattens each JSON array element or JSONL line into one column called "data". Your struct needs a field that maps to it:
type JSONRecord struct {
Data string `name:"data" prep:"trim" validate:"required"`
}
Output is always compact JSONL. A prep tag that breaks JSON structure causes ErrInvalidJSONAfterPrep; all-empty output causes ErrEmptyJSONOutput.
Column matching is case-sensitive. Field UserName auto-converts to user_name. Headers spelled differently (User_Name, USERNAME, userName) won't match. Override with the name tag:
type Record struct {
UserName string // matches "user_name" only
Email string `name:"EMAIL"` // matches "EMAIL" exactly
}
Duplicate headers → first column wins. Given id,id,name, only the first id binds.
Missing columns → empty string. If a column is absent, the field gets "". Use validate:"required" to catch this.
Excel → first sheet only. Additional sheets in .xlsx are silently skipped.
Saving output memory → use ProcessToWriter. Process buffers the entire output in memory. ProcessToWriter skips that buffer and writes directly to any io.Writer. Note that input records are still loaded into memory for preprocessing; this only eliminates the output copy:
f, _ := os.Create("output.csv")
defer f.Close()
result, err := processor.ProcessToWriter(input, &records, f)
Advanced Examples
Complex Data Preprocessing and Validation
This example demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// Employee represents employee data with comprehensive preprocessing and validation
type Employee struct {
// ID: pad to 6 digits, must be numeric
EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`
// Name: clean whitespace, required alphabetic with spaces
FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`
// Email: normalize to lowercase, validate format
Email string `prep:"trim,lowercase" validate:"required,email"`
// Department: normalize case, must be one of allowed values
Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`
// Salary: keep only digits, validate range
Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`
// Phone: extract digits, validate E.164 format after adding country code
Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`
// Start date: validate datetime format
StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`
// Manager ID: required only if department is not HR
ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`
// Website: fix missing scheme, validate URL
Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
}
func main() {
// Messy real-world CSV data
csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
42, John Doe ,JOHN.DOE@COMPANY.COM,engineering,"$75,000",555-123-4567,2023-01-15,000001,company.com/john
7,Jane Smith,jane@COMPANY.com, Sales ,"$120,000",(555) 987-6543,2022-06-01,000002,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,555.111.2222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,$88500,555-444-3333,2023-09-10,000003,https://alice.dev
`
processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
var employees []Employee
_, result, err := processor.Process(strings.NewReader(csvData), &employees)
if err != nil {
fmt.Printf("Fatal error: %v\n", err)
return
}
fmt.Printf("=== Processing Result ===\n")
fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)
for i, emp := range employees {
fmt.Printf("Employee %d:\n", i+1)
fmt.Printf(" ID: %s\n", emp.EmployeeID)
fmt.Printf(" Name: %s\n", emp.FullName)
fmt.Printf(" Email: %s\n", emp.Email)
fmt.Printf(" Department: %s\n", emp.Department)
fmt.Printf(" Salary: %s\n", emp.Salary)
fmt.Printf(" Phone: %s\n", emp.Phone)
fmt.Printf(" Start Date: %s\n", emp.StartDate)
fmt.Printf(" Manager ID: %s\n", emp.ManagerID)
fmt.Printf(" Website: %s\n\n", emp.Website)
}
}
Output:
=== Processing Result ===
Total rows: 4, Valid rows: 4
Employee 1:
ID: 000042
Name: John Doe
Email: john.doe@company.com
Department: ENGINEERING
Salary: 75000
Phone: +15551234567
Start Date: 2023-01-15
Manager ID: 000001
Website: https://company.com/john
Employee 2:
ID: 000007
Name: Jane Smith
Email: jane@company.com
Department: SALES
Salary: 120000
Phone: +15559876543
Start Date: 2022-06-01
Manager ID: 000002
Website: https://www.linkedin.com/in/jane
Employee 3:
ID: 000123
Name: Bob Wilson
Email: bob.wilson@company.com
Department: HR
Salary: 45000
Phone: +15551112222
Start Date: 2024-03-20
Manager ID: 000000
Website:
Employee 4:
ID: 000099
Name: Alice Brown
Email: alice@company.com
Department: MARKETING
Salary: 88500
Phone: +15554443333
Start Date: 2023-09-10
Manager ID: 000003
Website: https://alice.dev
Detailed Error Reporting
When validation fails, fileprep provides precise error information including row number, column name, and specific validation failure reason.
package main
import (
"fmt"
"strings"
"github.com/nao1215/fileprep"
)
// Order represents an order with strict validation rules
type Order struct {
OrderID string `name:"order_id" validate:"required,uuid4"`
CustomerID string `name:"customer_id" val
Related Skills
node-connect
337.3kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
xurl
337.3kA CLI tool for making authenticated requests to the X (Twitter) API. Use this skill when you need to post tweets, reply, quote, search, read posts, manage followers, send DMs, upload media, or interact with any X API v2 endpoint.
frontend-design
83.2kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
337.3kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
