SkillAgentSearch skills...

Fileprep

struct-tag preprocessing and validation for CSV/TSV/LTSV, JSON/JSONL, Parquet, Excel.

Install / Use

/learn @nao1215/Fileprep
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

fileprep

Go Reference Go Report Card MultiPlatformUnitTest Coverage

日本語 | Español | Français | 한국어 | Русский | 中文

fileprep-logo

fileprep is a Go library for cleaning, normalizing, and validating structured data—CSV, TSV, LTSV, JSON, JSONL, Parquet, and Excel—through lightweight struct-tag rules, with seamless support for gzip, bzip2, xz, zstd, zlib, snappy, s2, and lz4 streams.

Why fileprep?

I developed nao1215/filesql, which allows you to execute SQL queries on files like CSV, TSV, LTSV, Parquet, and Excel. I also created nao1215/csv for CSV file validation.

While studying machine learning, I realized: "If I extend nao1215/csv to support the same file formats as nao1215/filesql, I could combine them to perform ETL-like operations." This idea led to the creation of fileprep—a library that bridges data preprocessing/validation with SQL-based file querying.

Features

  • Multiple file format support: CSV, TSV, LTSV, JSON (.json), JSONL (.jsonl), Parquet, Excel (.xlsx)
  • Compression support: gzip (.gz), bzip2 (.bz2), xz (.xz), zstd (.zst), zlib (.z), snappy (.snappy), s2 (.s2), lz4 (.lz4)
  • Name-based column binding: Fields auto-match snake_case column names, customizable via name tag
  • Struct tag-based preprocessing (prep tag): trim, lowercase, uppercase, default values
  • Struct tag-based validation (validate tag): required, omitempty, and more
  • Processor options: WithStrictTagParsing() for catching tag misconfigurations, WithValidRowsOnly() for filtering output
  • Seamless filesql integration: Returns io.Reader for direct use with filesql
  • Detailed error reporting: Row and column information for each error

Installation

go get github.com/nao1215/fileprep

Requirements

  • Go Version: 1.25 or later
  • Operating Systems:
    • Linux
    • macOS
    • Windows

Quick Start

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// User represents a user record with preprocessing and validation
type User struct {
    Name  string `prep:"trim" validate:"required"`
    Email string `prep:"trim,lowercase"`
    Age   string
}

func main() {
    csvData := `name,email,age
  John Doe  ,JOHN@EXAMPLE.COM,30
Jane Smith,jane@example.com,25
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var users []User

    reader, result, err := processor.Process(strings.NewReader(csvData), &users)
    if err != nil {
        fmt.Printf("Error: %v\n", err)
        return
    }

    fmt.Printf("Processed %d rows, %d valid\n", result.RowCount, result.ValidRowCount)

    for _, user := range users {
        fmt.Printf("Name: %q, Email: %q\n", user.Name, user.Email)
    }

    // reader can be passed directly to filesql
    _ = reader
}

Output:

Processed 2 rows, 2 valid
Name: "John Doe", Email: "john@example.com"
Name: "Jane Smith", Email: "jane@example.com"

Gotchas

A few things worth knowing before you start.

JSON/JSONL → single "data" column. fileparser flattens each JSON array element or JSONL line into one column called "data". Your struct needs a field that maps to it:

type JSONRecord struct {
    Data string `name:"data" prep:"trim" validate:"required"`
}

Output is always compact JSONL. A prep tag that breaks JSON structure causes ErrInvalidJSONAfterPrep; all-empty output causes ErrEmptyJSONOutput.

Column matching is case-sensitive. Field UserName auto-converts to user_name. Headers spelled differently (User_Name, USERNAME, userName) won't match. Override with the name tag:

type Record struct {
    UserName string                 // matches "user_name" only
    Email    string `name:"EMAIL"`  // matches "EMAIL" exactly
}

Duplicate headers → first column wins. Given id,id,name, only the first id binds.

Missing columns → empty string. If a column is absent, the field gets "". Use validate:"required" to catch this.

Excel → first sheet only. Additional sheets in .xlsx are silently skipped.

Saving output memory → use ProcessToWriter. Process buffers the entire output in memory. ProcessToWriter skips that buffer and writes directly to any io.Writer. Note that input records are still loaded into memory for preprocessing; this only eliminates the output copy:

f, _ := os.Create("output.csv")
defer f.Close()

result, err := processor.ProcessToWriter(input, &records, f)

Advanced Examples

Complex Data Preprocessing and Validation

This example demonstrates the full power of fileprep: combining multiple preprocessors and validators to clean and validate real-world messy data.

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// Employee represents employee data with comprehensive preprocessing and validation
type Employee struct {
    // ID: pad to 6 digits, must be numeric
    EmployeeID string `name:"id" prep:"trim,pad_left=6:0" validate:"required,numeric,len=6"`

    // Name: clean whitespace, required alphabetic with spaces
    FullName string `name:"name" prep:"trim,collapse_space" validate:"required,alphaspace"`

    // Email: normalize to lowercase, validate format
    Email string `prep:"trim,lowercase" validate:"required,email"`

    // Department: normalize case, must be one of allowed values
    Department string `prep:"trim,uppercase" validate:"required,oneof=ENGINEERING SALES MARKETING HR"`

    // Salary: keep only digits, validate range
    Salary string `prep:"trim,keep_digits" validate:"required,numeric,gte=30000,lte=500000"`

    // Phone: extract digits, validate E.164 format after adding country code
    Phone string `prep:"trim,keep_digits,prefix=+1" validate:"e164"`

    // Start date: validate datetime format
    StartDate string `name:"start_date" prep:"trim" validate:"required,datetime=2006-01-02"`

    // Manager ID: required only if department is not HR
    ManagerID string `name:"manager_id" prep:"trim,pad_left=6:0" validate:"required_unless=Department HR"`

    // Website: fix missing scheme, validate URL
    Website string `prep:"trim,lowercase,fix_scheme=https" validate:"url"`
}

func main() {
    // Messy real-world CSV data
    csvData := `id,name,email,department,salary,phone,start_date,manager_id,website
  42,  John   Doe  ,JOHN.DOE@COMPANY.COM,engineering,"$75,000",555-123-4567,2023-01-15,000001,company.com/john
7,Jane Smith,jane@COMPANY.com,  Sales  ,"$120,000",(555) 987-6543,2022-06-01,000002,WWW.LINKEDIN.COM/in/jane
123,Bob Wilson,bob.wilson@company.com,HR,45000,555.111.2222,2024-03-20,,
99,Alice Brown,alice@company.com,Marketing,$88500,555-444-3333,2023-09-10,000003,https://alice.dev
`

    processor := fileprep.NewProcessor(fileprep.FileTypeCSV)
    var employees []Employee

    _, result, err := processor.Process(strings.NewReader(csvData), &employees)
    if err != nil {
        fmt.Printf("Fatal error: %v\n", err)
        return
    }

    fmt.Printf("=== Processing Result ===\n")
    fmt.Printf("Total rows: %d, Valid rows: %d\n\n", result.RowCount, result.ValidRowCount)

    for i, emp := range employees {
        fmt.Printf("Employee %d:\n", i+1)
        fmt.Printf("  ID:         %s\n", emp.EmployeeID)
        fmt.Printf("  Name:       %s\n", emp.FullName)
        fmt.Printf("  Email:      %s\n", emp.Email)
        fmt.Printf("  Department: %s\n", emp.Department)
        fmt.Printf("  Salary:     %s\n", emp.Salary)
        fmt.Printf("  Phone:      %s\n", emp.Phone)
        fmt.Printf("  Start Date: %s\n", emp.StartDate)
        fmt.Printf("  Manager ID: %s\n", emp.ManagerID)
        fmt.Printf("  Website:    %s\n\n", emp.Website)
    }
}

Output:

=== Processing Result ===
Total rows: 4, Valid rows: 4

Employee 1:
  ID:         000042
  Name:       John Doe
  Email:      john.doe@company.com
  Department: ENGINEERING
  Salary:     75000
  Phone:      +15551234567
  Start Date: 2023-01-15
  Manager ID: 000001
  Website:    https://company.com/john

Employee 2:
  ID:         000007
  Name:       Jane Smith
  Email:      jane@company.com
  Department: SALES
  Salary:     120000
  Phone:      +15559876543
  Start Date: 2022-06-01
  Manager ID: 000002
  Website:    https://www.linkedin.com/in/jane

Employee 3:
  ID:         000123
  Name:       Bob Wilson
  Email:      bob.wilson@company.com
  Department: HR
  Salary:     45000
  Phone:      +15551112222
  Start Date: 2024-03-20
  Manager ID: 000000
  Website:

Employee 4:
  ID:         000099
  Name:       Alice Brown
  Email:      alice@company.com
  Department: MARKETING
  Salary:     88500
  Phone:      +15554443333
  Start Date: 2023-09-10
  Manager ID: 000003
  Website:    https://alice.dev

Detailed Error Reporting

When validation fails, fileprep provides precise error information including row number, column name, and specific validation failure reason.

package main

import (
    "fmt"
    "strings"

    "github.com/nao1215/fileprep"
)

// Order represents an order with strict validation rules
type Order struct {
    OrderID    string `name:"order_id" validate:"required,uuid4"`
    CustomerID string `name:"customer_id" val

Related Skills

View on GitHub
GitHub Stars17
CategoryDevelopment
Updated4d ago
Forks0

Languages

Go

Security Score

95/100

Audited on Mar 22, 2026

No findings