Dsbulk
DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra and DataStax Enterprise (DSE).
DataStax Bulk Loader Overview
The DataStax Bulk Loader tool (DSBulk) is a unified tool for loading into and unloading from Cassandra-compatible storage engines, such as OSS Apache Cassandra®, DataStax Astra and DataStax Enterprise (DSE).
Out of the box, DSBulk provides the ability to:
- Load (import) large amounts of data into the database efficiently and reliably;
- Unload (export) large amounts of data from the database efficiently and reliably;
- Count elements in a database table: how many rows in total, how many rows per replica and per token range, and how many rows in the top N largest partitions.
Currently, CSV and JSON formats are supported for both loading and unloading data.
Installation
DSBulk can be downloaded from several locations:
- From DataStax Downloads.
  - Available formats: zip, tar.gz.
- From GitHub.
  - Available formats: zip, tar.gz and executable jar.
- From Maven Central: download the artifact dsbulk-distribution.
  - Available formats: zip, tar.gz and executable jar.
Please note: only the zip and tar.gz formats are considered production-ready. The executable jar is provided as a convenience for users who want to try DSBulk, but it should not be deployed in production environments.
To install DSBulk, simply unpack the zip or tar.gz archives.
The executable jar can be executed with a command like java -jar dsbulk-distribution.jar [subcommand] [options]. See below for command line options.
Documentation
The most up-to-date documentation is available online.
We also recommend reading the series of blog posts made by Brian Hess; they target a somewhat older version of DSBulk, but most of the contents are still valid and very useful:
- [DataStax Bulk Loader Pt. 1 — Introduction and Loading]
- [DataStax Bulk Loader Pt. 2 — More Loading]
- [DataStax Bulk Loader Pt. 3 — Common Settings]
- [DataStax Bulk Loader Pt. 4 — Unloading]
- [DataStax Bulk Loader Pt. 5 — Counting]
- [DataStax Bulk Loader: Examples for Loading From Other Locations]
Developers and contributors: please read our Contribution Guidelines.
Basic Usage
Launch the tool with the appropriate script in the bin directory of your distribution. The help text of the tool provides summaries of all supported settings.
The dsbulk command takes a subcommand argument followed by options:
# Load data
dsbulk load <options>
# Unload data
dsbulk unload <options>
# Count rows
dsbulk count <options>
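To make the three subcommands concrete, the sketch below assembles one invocation of each. The keyspace ks1, the table table1 and the file paths are placeholders, and the -t shortcut for the table name is an assumption here (this README only shows -k and -url). The commands are printed rather than executed, so the block is safe to run without a cluster; paste a printed line into a terminal to run it for real.

```shell
# Placeholder names (assumptions): replace with your own keyspace, table and paths.
KEYSPACE="ks1"
TABLE="table1"

# Load rows from a CSV file into the table.
LOAD_CMD="dsbulk load -k $KEYSPACE -t $TABLE -url /tmp/export.csv"
# Unload the table to CSV under a directory.
UNLOAD_CMD="dsbulk unload -k $KEYSPACE -t $TABLE -url /tmp/unloaded"
# Count the rows in the table.
COUNT_CMD="dsbulk count -k $KEYSPACE -t $TABLE"

# Dry run: print the commands instead of executing them.
printf '%s\n' "$LOAD_CMD" "$UNLOAD_CMD" "$COUNT_CMD"
```

No connector is named because the sketch relies on CSV being the connector in use; pass -c json to switch to the JSON connector.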
Long options
Any DSBulk or Java Driver setting can be entered on the command line as a long-option argument of the following general form:
--full.path.of.setting "some-value"
DSBulk settings always start with dsbulk; for convenience, this prefix can be omitted in a long
option argument, so the following two options are equivalent and both map to DSBulk's
dsbulk.batch.mode setting:
--dsbulk.batch.mode PARTITION_KEY
--batch.mode PARTITION_KEY
Java Driver settings always start with datastax-java-driver; for convenience, this prefix can be
shortened to driver in a long option argument, so the following two options are equivalent and
both map to the driver's datastax-java-driver.basic.cloud.secure-connect-bundle setting:
--datastax-java-driver.basic.cloud.secure-connect-bundle /path/to/bundle
--driver.basic.cloud.secure-connect-bundle /path/to/bundle
Most settings have default values, or values that can be inferred from the input data. However, sometimes the default value is not suitable for you, in which case you will have to specify the desired value either in the application configuration file (see below), or on the command line.
For example, the default value for connector.csv.url is to read from standard input or write to
standard output; if that does not work for you, you need to override this value and specify the
path or URL of the CSV data to load (or the path or URL where unloaded data should be written).
See the Settings page or DSBulk's template configuration file for details.
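As an illustration of overriding a default, the sketch below contrasts two ways of unloading with the CSV connector: relying on the default (standard output, redirected by the shell) and setting connector.csv.url explicitly. The keyspace, table and paths are placeholders, and -k and -t are assumed to be the shortcuts for the keyspace and table settings. The commands are printed, not executed.

```shell
# Default: connector.csv.url is standard output, so the shell does the redirection.
TO_STDOUT='dsbulk unload -k ks1 -t table1 > /tmp/table1.csv'
# Override: DSBulk itself writes to the given location.
TO_PATH='dsbulk unload -k ks1 -t table1 --connector.csv.url /tmp/table1_export'

# Dry run: print both variants for comparison.
printf '%s\n' "$TO_STDOUT" "$TO_PATH"
```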
Short options (Shortcuts)
For convenience, many options (prefaced with --) have shortcut variants (prefaced with -).
For example, --dsbulk.schema.keyspace has an equivalent short option -k.
Connector-specific options also have shortcut variants, but they are only available when
the appropriate connector is chosen. This allows multiple connectors to share the same shortcut
options. For example, the JSON connector has a --connector.json.url
setting with a -url shortcut. The CSV connector uses the same -url shortcut, where it
maps to --connector.csv.url. In a given invocation of dsbulk,
only the shortcut of the selected connector is active.
Run the tool with --help and specify the connector to see its short options:
dsbulk -c csv --help
Configuration Files vs Command Line Options
All DSBulk options can be passed as command line arguments, or in a configuration file.
Using one or more configuration files is sometimes easier than passing all configuration options via the command line.
By default, the configuration files are located under DSBulk's conf directory; the main
configuration file is named application.conf. This location can be modified via the -f
switch. See examples below.
DSBulk ships with a default, empty application.conf file that users can customize to their
needs; it also has a template configuration file that can
serve as a starting point for further customization.
Configuration files must comply with the HOCON syntax. This syntax is very flexible and allows sections to be grouped together in blocks, e.g.:
dsbulk {
  connector {
    name = "csv"
    csv {
      url = "C:\\Users\\My Folder"
      delimiter = "\t"
    }
  }
}
The above is equivalent to the following snippet using dotted notation instead of blocks:
dsbulk.connector.name = "csv"
dsbulk.connector.csv.url = "C:\\Users\\My Folder"
dsbulk.connector.csv.delimiter = "\t"
You can split your configuration in more than one file using file inclusions; see the HOCON
documentation for details. The default configuration file includes another file called
driver.conf, also located in the conf directory. This file should be used to configure
the Java Driver for DSBulk. This file is empty as well; users can customize it to their needs.
A driver template configuration file can serve as a starting
point for further customization.
Important caveats:
- In configuration files, it is not possible to omit the prefix dsbulk. For example, to select the connector to use in a configuration file, use dsbulk.connector.name = csv, as in the example above; on the command line, however, you can use --dsbulk.connector.name csv or --connector.name csv to achieve the same effect, as stated above.
- In configuration files, it is not possible to abbreviate the prefix datastax-java-driver to driver. For example, to select the consistency level to use in a configuration file, use datastax-java-driver.basic.request.consistency = QUORUM; on the command line, however, you can use either --datastax-java-driver.basic.request.consistency QUORUM or --driver.basic.request.consistency QUORUM to achieve the same effect.
- Options specified through the command line override options specified in configuration files. See examples for details.
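The caveats above can be sketched as follows: a small configuration file is written with the full dsbulk and datastax-java-driver prefixes, then referenced with -f while one of its values is overridden on the command line. The file name, keyspace, table and CSV path are placeholders chosen for illustration, and the final command is printed rather than executed.

```shell
# Write a small HOCON configuration file (all names and paths are placeholders).
CONF_FILE="${TMPDIR:-/tmp}/my-dsbulk.conf"
cat > "$CONF_FILE" <<'EOF'
# Prefixes must be written in full inside configuration files.
dsbulk.connector.name = "csv"
dsbulk.connector.csv.url = "/tmp/export.csv"
dsbulk.schema.keyspace = "ks1"
dsbulk.schema.table = "table1"
datastax-java-driver.basic.request.consistency = QUORUM
EOF

# -f points DSBulk at the file; -k on the command line takes precedence
# over dsbulk.schema.keyspace from the file, so the load would target ks2.
CMD="dsbulk load -f $CONF_FILE -k ks2"

# Dry run: print the command instead of executing it.
echo "$CMD"
```

Keeping stable settings in the file and passing only the values that change between runs on the command line keeps invocations short.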
Escaping and Quoting Command Line Arguments
Regardless of whether they are supplied via the command line or in a configuration file, all option values are expected to be in valid HOCON syntax: control characters, the backslash character, and the double-quote character all need to be properly escaped.
For example, \t is the escape sequence that corresponds to the tab character:
dsbulk load -delim '\t'
In general, string values containing special characters (such as a colon or whitespace) also need to be properly quoted with double-quotes, as required by the HOCON syntax:
dsbulk load -h '"host.com:9042"'
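Two layers of quoting are at play in the examples above: the shell processes its own quotes first, and only what remains reaches DSBulk to be parsed as HOCON. The sketch below makes this visible with a hypothetical helper, show_args (not part of DSBulk), which prints each argument exactly as a program would receive it.

```shell
# Print each argument exactly as the invoked program would receive it.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

show_args -delim '\t'            # prints [-delim] then [\t]: a backslash and a t, not a tab
show_args -h '"host.com:9042"'   # prints [-h] then ["host.com:9042"]: inner quotes kept
show_args -h "host.com:9042"     # prints [-h] then [host.com:9042]: the shell removed the quotes
```

The single quotes are what protect the backslash and the inner double-quotes from the shell, so that the value arrives intact.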
File paths on Windows systems usually contain backslashes; \\ is the escape sequence for the
backslash character, and since Windows paths also contain special characters, the whole path
needs to be double-quoted:
dsbulk load -url '"C:\\Users\\My Folder"'
However, when the expected type of an option is a string, it is possible to omit the surrounding double-quotes, for convenience:
dsbulk load -url 'C:\\Users\\My Folder'
Similarly, when an argument is a list, it is possible to omit the surrounding square brackets, making the following two lines equivalent:
dsbulk load --codec.nullStrings 'NIL, NULL'
dsbulk load --codec.nullStrings '[NIL, NULL]'
The same applies for arguments of type map: it is possible to omit the surrounding curly braces, making the following two lines equivalent:
dsbulk load --connector.json.deserializationFeatures '{ USE_BIG_DECIMAL_FOR_FLOATS : true }'
dsbulk load --connector.json.deserializationFeatures 'USE_BIG_DECIMAL_FOR_FLOATS : true'
This syntactic sugar is only available for command line arguments of type string, list or map; all other option types, as well as all options specified in a configuration file, must be fully compliant with HOCON syntax, and it is the user's responsibility to ensure that such options are properly escaped and quoted.
Also, note that this syntactic sugar is not capable of quoting single elements inside a list or a m
