Sergeant
:guardsman: Tools to Transform and Query Data with 'Apache' 'Drill'
Install / Use
/learn @hrbrmstr/SergeantREADME
💂 sergeant
Tools to Transform and Query Data with ‘Apache’ ‘Drill’
** IMPORTANT **
Version 0.7.0+ (a.k.a. the main branch) splits off the JDBC interface
into a separate package sergeant.caffeinated
(GitHub).
I# Description
Drill + sergeant is (IMO) a streamlined alternative to Spark +
sparklyr if you don’t need the ML components of Spark (i.e. just need
to query “big data” sources, need to interface with parquet, need to
combine disparate data source types — json, csv, parquet, rdbms - for
aggregation, etc). Drill also has support for spatial queries.
Using Drill SQL queries that reference parquet files on a local linux or macOS workstation can often be more performant than doing the same data ingestion & wrangling work with R (especially for large or disperate data sets). Drill can often help further streamline workflows that involve wrangling many tiny JSON files on a daily basis.
Drill can be obtained from https://drill.apache.org/download/ (use
“Direct File Download”). Drill can also be installed via
Docker. For
local installs on Unix-like systems, a common/suggestion location for
the Drill directory is /usr/local/drill as the install directory.
Drill embedded (started using the $DRILL_BASE_DIR/bin/drill-embedded
script) is a super-easy way to get started playing with Drill on a
single workstation and most of many workflows can “get by” using Drill
this way.
There are a few convenience wrappers for various informational SQL
queries (like drill_version()). Please file an PR if you add more.
Some of the more “controlling vs data ops” REST API functions aren’t implemented. Please file a PR if you need those.
The following functions are implemented:
DBI (REST)
- A “just enough” feature complete R
DBIdriver has been implemented using the Drill REST API, mostly to facilitate thedplyrinterface. Use theRJDBCdriver interface if you need moreDBIfunctionality. - This also means that SQL functions unique to Drill have also been
“implemented” (i.e. made accessible to the
dplyrinterface). If you have custom Drill SQL functions that need to be implemented please file an issue on GitHub. Many should work without it, but some may require a custom interface.
dplyr: (REST)
src_drill: Connect to Drill (usingdplyr) + supporting functions
Note that a number of Drill SQL functions have been mapped to R
functions (e.g. grepl) to make it easier to transition from
non-database-backed SQL ops to Drill. See the help on
drill_custom_functions for more info on these helper Drill custom
function mappings.
Drill APIs:
drill_connection: Setup parameters for a Drill server/cluster connectiondrill_active: Test whether Drill HTTP REST API server is updrill_cancel: Cancel the query that has the given queryiddrill_functions: Show all the available Drill built-in functions & UDFs (Apache Drill 1.15.0+ required)drill_jdbc: Connect to Drill using JDBCdrill_metrics: Get the current memory metricsdrill_options: List the name, default, and data type of the system and session optionsdrill_popts: Show all the available Drill options (1.15.0+)drill_profile: Get the profile of the query that has the given query iddrill_profiles: Get the profiles of running and completed queriesdrill_query: Submit a query and return resultsdrill_set: Set Drill SYSTEM or SESSION optionsdrill_settings_reset: Changes (optionally, all) session settings back to system defaultsdrill_show_files: Show files in a file system schema.drill_show_schemas: Returns a list of available schemas.drill_stats: Get Drillbit information, such as ports numbersdrill_status: Get the status of Drilldrill_storage: Get the list of storage plugin names and configurationsdrill_system_reset: Changes (optionally, all) system settings back to system defaultsdrill_threads: Get information about threadsdrill_uplift: Turn a columnar query results into a type-converted tbldrill_use: Change to a particular schema.drill_version: Identify the version of Drill running
Helpers
ctas_profile: Generate a Drill CTAS Statement from a Querydrill_up: sart a Dockerized Drill Instance #sdrill_down: stop a Dockerized Drill Instance by container idhowall_drill: Show all dead and running Drill Docker containersstopall_drill: Prune all dead and running Drill Docker containers
Installation
install.packages("sergeant", repos = "https://cinc.rud.is")
# or
devtools::install_git("https://git.rud.is/hrbrmstr/sergeant.git")
# or
devtools::install_git("https://git.sr.ht/~hrbrmstr/sergeant")
# or
devtools::install_gitlab("hrbrmstr/sergeant")
# or
devtools::install_bitbucket("hrbrmstr/sergeant")
# or
devtools::install_github("hrbrmstr/sergeant")
Usage
dplyr interface
library(sergeant)
library(tidyverse)
# use localhost if running standalone on same system otherwise the host or IP of your Drill server
ds <- src_drill("localhost") #ds
db <- tbl(ds, "cp.`employee.json`")
# without `collect()`:
count(db, gender, marital_status)
## # Source: lazy query [?? x 3]
## # Database: DrillConnection
## # Groups: gender
## gender marital_status n
## <chr> <chr> <dbl>
## 1 F S 297
## 2 M M 278
## 3 M S 276
## 4 F M 304
count(db, gender, marital_status) %>% collect()
## # A tibble: 4 x 3
## # Groups: gender [2]
## gender marital_status n
## <chr> <chr> <dbl>
## 1 F S 297
## 2 M M 278
## 3 M S 276
## 4 F M 304
group_by(db, position_title) %>%
count(gender) -> tmp2
group_by(db, position_title) %>%
count(gender) %>%
ungroup() %>%
mutate(full_desc = ifelse(gender == "F", "Female", "Male")) %>%
collect() %>%
select(Title = position_title, Gender = full_desc, Count = n)
## # A tibble: 30 x 3
## Title Gender Count
## <chr> <chr> <dbl>
## 1 President Female 1
## 2 VP Country Manager Male 3
## 3 VP Country Manager Female 3
## 4 VP Information Systems Female 1
## 5 VP Human Resources Female 1
## 6 Store Manager Female 13
## 7 VP Finance Male 1
## 8 Store Manager Male 11
## 9 HQ Marketing Female 2
## 10 HQ Information Systems Female 4
## # … with 20 more rows
arrange(db, desc(employee_id)) %>% print(n = 20)
## # Source: table<cp.`employee.json`> [?? x 20]
## # Database: DrillConnection
## # Ordered by: desc(employee_id)
## employee_id full_name first_name last_name position_id position_title store_id department_id birth_date hire_date
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 999 Beverly … Beverly Dittmar 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 2 998 Elizabet… Elizabeth Jantzer 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 3 997 John Swe… John Sweet 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 4 996 William … William Murphy 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 5 995 Carol Li… Carol Lindsay 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 6 994 Richard … Richard Burke 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 7 993 Ethan Bu… Ethan Bunosky 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 8 992 Claudett… Claudette Cabrera 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 9 991 Maria Te… Maria Terry 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 10 990 Stacey C… Stacey Case 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 11 99 Elizabet… Elizabeth Horne 18 Store Tempora… 6 18 1976-10-05 1997-01-…
## 12 989 Dominick… Dominick Nutter 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 13 988 Brian Wi… Brian Willeford 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 14 987 Margaret… Margaret Clendenen 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 15 986 Maeve Wa… Maeve Wall 17 Store Permane… 8 17 1914-02-02 1998-01-…
## 16 985 Mildred … Mildred Morrow 16 Store Tempora… 8 16 1914-02-02 1998-01-…
## 17 984 French W… French Wilson 16 Store Tempora… 8 16 1914-02-02 1998-01-…
## 18 983 Elisabet… Elisabeth Duncan 16 Store Tempora… 8 16 1914-02-02 1998-01-…
## 19 982 Linda An… Linda Anderson 16 Store Tempora… 8 16 1914-02-02 1998-01-…
## 20 981 Selene W… Selene Watson 16 Store Tempora… 8 16 1914-02-02 1998-01-…
## # … with more rows, and 6 more varia
Related Skills
oracle
351.2kBest practices for using the oracle CLI (prompt + file bundling, engines, sessions, and file attachment patterns).
prose
351.2kOpenProse VM skill pack. Activate on any `prose` command, .prose files, or OpenProse mentions; orchestrates multi-agent workflows.
Command Development
110.6kThis skill should be used when the user asks to "create a slash command", "add a command", "write a custom command", "define command arguments", "use command frontmatter", "organize commands", "create command with file references", "interactive command", "use AskUserQuestion in command", or needs guidance on slash command structure, YAML frontmatter fields, dynamic arguments, bash execution in commands, user interaction patterns, or command development best practices for Claude Code.
Plugin Structure
110.6kThis skill should be used when the user asks to "create a plugin", "scaffold a plugin", "understand plugin structure", "organize plugin components", "set up plugin.json", "use ${CLAUDE_PLUGIN_ROOT}", "add commands/agents/skills/hooks", "configure auto-discovery", or needs guidance on plugin directory layout, manifest configuration, component organization, file naming conventions, or Claude Code plugin architecture best practices.
