Xbow
range-v3 views and actions for Arrow C++
xbow - range-v3 views and actions for Arrow C++
Why xbow?
Because I want to specify arrow types for the rows of an Arrow table like this:
// Below: define a "record", which is just a value type which in addition wraps
// Boost.Hana's BOOST_HANA_DEFINE_STRUCT macro. This macro generates machinery
// which can be used to retrieve field names and types, as well as accessors that
// can read fields based on a compile-time integer (or string) representing the
// index (or name, respectively) of the field. Note that this dereferencing takes
// place in constant time. It's the same as accessing a struct member - it is *not*
// like accessing an item in a hash map (say). This reflective metadata is sufficient
// to do the (ugly) Arrow C++ API heavy lifting. For example, for the first field,
// an arrow::Int32Builder will be used for writing data (what we call the *action*
// side, since creating an Arrow object is a side effect), and an
// arrow::Int32Array will be used for reading (the *views* side: here we take a regular
// Arrow array or table, and wrap it with a range-v3 compatible range). For the
// "name" field, an arrow::StringBuilder and arrow::StringArray will be used, respectively.
// And so on.
def_record(suspect,
(int32_t, id),
(string, name),
(double, salary)
);
... instead of like this. And I want to use it like this:
auto suspects = vector<suspect>{
{1, "Keyser Söze"s, 1000.0}, {2, "Kobayashi"s, 500.0}, {3, "Fred Fenster"s, 500.0},
{4, "Jack Baer"s, 100.0}, {5, "Dean Keaton"s, 800.0}, {6, "Michael McManus"s, 100.0},
};
print("input rows: {}\n", suspects);
// below: traverse the rows, changing name to upper case, skipping every other element,
// cycling over rows so that they repeat and taking exactly 20 of these rows, and finally
// this range-v3 range is converted to a regular arrow table.
// This code shows that we can take a bog-standard range-v3 pipeline and convert it to
// an arrow object. This could later, for example, be written to a parquet file (WIP).
const auto table = suspects
    | views::transform([](auto&& p) -> suspect& {
          boost::to_upper(p.name);
          return p;
      })
    | views::stride(2)
    | views::cycle
    | views::take(20)
    | xb::arrow::actions::to_table;
// below: note that to_range<suspect>(table) returns a range consisting of chunks, each of which
// is also a range. These chunks correspond exactly to the actual low-level chunks in the
// Arrow table. We views::join this range to produce a single, collated range, which we then
// convert to a std::vector<suspect> for the sole reason of printing. Note how easily we
// taped together the chunks! Normally this would be a two-level for loop involving laborious
// extraction of each field, type-casting, urgh!
print("round-tripped rows: {}\n",
xb::arrow::views::to_range<suspect>(table) | views::join | to<vector<suspect>>);
...which should produce output like this:
input rows: {person[id: 1, name: Keyser Söze, salary: 1000], person[id: 2, name: Kobayashi, salary: 500], person[id: 3, name: Fred Fenster, salary: 500], person[id: 4, name: Jack Baer, salary: 100], person[id: 5, name: Dean Keaton, salary: 800], person[id: 6, name: Michael McManus, salary: 100]}
round-tripped rows: {person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500], person[id: 5, name: DEAN KEATON, salary: 800], person[id: 1, name: KEYSER SöZE, salary: 1000], person[id: 3, name: FRED FENSTER, salary: 500]}
(It does.)
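To see why the round-tripped output contains ids 1, 3, 5 repeating for exactly 20 rows, here is a rough plain-loop sketch of what stride(2) | cycle | take(20) computes. It uses only the standard library; stride_cycle_take is a hypothetical helper written for this illustration, not part of xbow or range-v3. Note that it is eager, whereas the range-v3 views are lazy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (illustration only): eager equivalent of
// views::stride(stride) | views::cycle | views::take(n) over a vector.
template <typename T>
std::vector<T> stride_cycle_take(const std::vector<T>& in,
                                 std::size_t stride,
                                 std::size_t n) {
    assert(stride > 0);

    // stride: keep elements 0, stride, 2*stride, ...
    std::vector<T> strided;
    for (std::size_t i = 0; i < in.size(); i += stride) {
        strided.push_back(in[i]);
    }

    // cycle + take: walk the strided elements round-robin until n are emitted.
    std::vector<T> out;
    if (strided.empty()) {
        return out;  // cycling an empty range yields nothing
    }
    out.reserve(n);
    for (std::size_t k = 0; k < n; ++k) {
        out.push_back(strided[k % strided.size()]);
    }
    return out;
}
```

Applied to the six ids 1..6 with a stride of 2 and n = 20, this yields 1, 3, 5, 1, 3, 5, ... and the twentieth element is 3, matching the last row printed above.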
Usage
Usage examples can be found in the tests, and in the examples directory.
Defining Custom Tables
Use the def_record macro to define the types of rows of tables you want to manipulate in xbow. This leverages Boost.Hana to create static introspection machinery, which is used by xbow to figure out the right concrete types for the Arrow array objects.
Example:
def_record(point,
(double, x),
(double, y)
);
def_record(person,
(int64_t, id),
(xb::date, dob),
(string, name),
(double, cost),
(array<double, 3>, cost_components),
(point, p)
);
Notes:
- In principle, the record types can be nested as in the example above. This will work for the purpose of Arrow schema generation. Aggregates such as arrays and vectors are also supported. (Again, in principle you can use arbitrary element types, with the caveat below.)
- However, round-tripping (actual range use) currently doesn't support structures within structures. It's planned, and it's certainly possible using the reflective setup.
- Currently only numeric and string types are supported, with partial support for boolean arrays (stored as bitmasks).
- Dates and timestamps are partially supported, with more planned. They map from their Arrow concrete types to suitable modern C++ counterparts: Howard Hinnant's date library and std::chrono, respectively.
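As a rough illustration of how static introspection can pick concrete Arrow types, the sketch below maps each field type of a record to the name of an Arrow builder class at compile time. It is not xbow's implementation: builder_name, builders_for and suspect_fields are hypothetical names, and a std::tuple of field types stands in for the reflection data that def_record obtains through Boost.Hana. The builder class names themselves (arrow::Int32Builder and so on) are real Arrow classes, but they appear here only as strings.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <tuple>
#include <type_traits>
#include <utility>

// Hypothetical trait: C++ field type -> name of the Arrow builder class that
// would write a column of that type.
template <typename T>
constexpr std::string_view builder_name() {
    if constexpr (std::is_same_v<T, std::int32_t>)      return "arrow::Int32Builder";
    else if constexpr (std::is_same_v<T, std::int64_t>) return "arrow::Int64Builder";
    else if constexpr (std::is_same_v<T, double>)       return "arrow::DoubleBuilder";
    else if constexpr (std::is_same_v<T, std::string>)  return "arrow::StringBuilder";
    else static_assert(!sizeof(T), "unsupported field type");
}

// Stand-in for the reflected field list of the `suspect` record.
using suspect_fields = std::tuple<std::int32_t, std::string, double>;

template <typename Tuple, std::size_t... I>
constexpr auto builders_impl(std::index_sequence<I...>) {
    // Expand the pack: one builder name per field, in declaration order.
    return std::array<std::string_view, sizeof...(I)>{
        builder_name<std::tuple_element_t<I, Tuple>>()...};
}

// Returns one builder name per field of the record described by Tuple.
template <typename Tuple>
constexpr auto builders_for() {
    return builders_impl<Tuple>(
        std::make_index_sequence<std::tuple_size_v<Tuple>>{});
}
```

Calling builders_for<suspect_fields>() gives an array of three names, one per column. Because the mapping is resolved at compile time, an unsupported field type is rejected by the static_assert rather than failing at run time.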
Actions: Ranges To Arrow Objects
One use case is that we already have a range or container with objects of our (appropriately defined - see above) row type, and our goal is to create the corresponding Arrow objects. This is accomplished using xbow's actions, so called because they are side-effecting: they create new Arrow objects, which involves memory allocation.
// convert a scalar range (i.e. a single column) into the appropriate Arrow array
// object.
template <ranges::range R>
auto to_arrow_array(R&& input) -> meta::array_obj<xb::meta::element_t<R>>;
// convert a range over an appropriately defined record type (see above) into an
// Arrow table.
template <ranges::range Range>
auto to_table(Range&& rows);
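Conceptually, going from a range of rows to a table means transposing row-oriented data into one column per field. The sketch below shows that transposition with plain standard-library containers. It is only an illustration under stated assumptions: suspect_columns and to_columns are hypothetical stand-ins for an Arrow table and for to_table, and std::vector plays the role of the Arrow builders.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Row type mirroring the `suspect` record from the introduction.
struct suspect {
    std::int32_t id;
    std::string name;
    double salary;
};

// Hypothetical column-oriented container, a stand-in for an Arrow table:
// one contiguous vector per field instead of one struct per row.
struct suspect_columns {
    std::vector<std::int32_t> id;
    std::vector<std::string> name;
    std::vector<double> salary;
};

// Transpose any iterable of suspect rows into columns. Appending to the
// vectors allocates memory, which is why this direction is an "action".
template <typename Rows>
suspect_columns to_columns(const Rows& rows) {
    suspect_columns out;
    for (const auto& row : rows) {
        out.id.push_back(row.id);
        out.name.push_back(row.name);
        out.salary.push_back(row.salary);
    }
    // Every column must end up with the same length, as in an Arrow table.
    assert(out.id.size() == out.name.size());
    assert(out.name.size() == out.salary.size());
    return out;
}
```

In xbow the per-field appends are generated from the record's reflection data rather than written by hand, which is the whole point of def_record.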
Views: Views on Arrow Objects
In the other direction, we start with an Arrow array or table, and want to query or manipulate the data using range-v3 ranges. In this case, no allocation is necessary, and so we adopt the view terminology.
// given a suitably defined record type (see above) convert an Arrow Table to a range
// over items of the row type.
template <xb::meta::record R>
auto to_range(const std::shared_ptr<::arrow::Table>& table);
// convert a concrete Arrow array object to a view; no memory is allocated (e.g.
// strings are represented as std::string_views whose pointer points to the Arrow
// memory itself, rather than creating copies). The range produced by this version
// does not allow null elements. If you have missing elements, use
// optional_array_view, below.
template <typename A>
auto array_view(const std::shared_ptr<A>& array) -> decltype(auto);
// Same as above, but allowing missing elements. Say A has element type
// "StringType". Then optional_array_view will produce a range of
// std::optional<std::string>. That is to say, nullity/non-nullity are represented
// using std::optional.
template <typename A>
auto optional_array_view(const std::shared_ptr<A>& array) -> decltype(auto);
// Arrow tables are created out of so-called chunked arrays. They're usually pretty
// horrible to deal with. This helper function creates a range of ranges; the outer
// range obviously representing the chunks.
template <ranges::semiregular T>
auto chunked_array_view(const ::arrow::ChunkedArray& chunks) -> unspecified;
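To make the "no allocation" and "nullity via std::optional" ideas concrete, here is a minimal standard-library sketch of reading a nullable string column in place. It follows the Arrow columnar layout for variable-length strings (a character buffer, an offsets buffer with one more entry than there are slots, and a validity bitmap with least-significant-bit numbering), but string_column, value_at and optional_value_at are hypothetical names and this is not how xbow implements its views. As a simplification, the nullable accessor returns std::optional<std::string_view> so that it stays copy-free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the buffers of an Arrow string array.
struct string_column {
    std::string data;                    // all characters, back to back
    std::vector<std::int32_t> offsets;   // slot i spans [offsets[i], offsets[i + 1])
    std::vector<std::uint8_t> validity;  // bit i set (LSB first) when slot i is non-null

    std::size_t size() const { return offsets.empty() ? 0 : offsets.size() - 1; }
};

// Non-owning read of slot i: the string_view points into `data`, no copy is made.
inline std::string_view value_at(const string_column& col, std::size_t i) {
    assert(i < col.size());
    const auto begin = static_cast<std::size_t>(col.offsets[i]);
    const auto end = static_cast<std::size_t>(col.offsets[i + 1]);
    return std::string_view(col.data.data() + begin, end - begin);
}

// Nullable read: std::nullopt when the validity bit for slot i is clear.
inline std::optional<std::string_view> optional_value_at(const string_column& col,
                                                         std::size_t i) {
    assert(i < col.size());
    assert(i / 8 < col.validity.size());
    const bool valid = ((col.validity[i / 8] >> (i % 8)) & 1u) != 0;
    if (!valid) {
        return std::nullopt;
    }
    return value_at(col, i);
}
```

For a column with data "abcde", offsets {0, 2, 2, 5} and validity byte 0b00000101, slot 0 reads as "ab", slot 1 is null, and slot 2 reads as "cde". The returned views are only valid while the column's buffers are alive, which is the usual trade-off of a view.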
Pre-requisites
- Python 3.9.1
- Clang 11 or higher (will probably all work with recent gcc)
Note: installing clang is not enough. You need to use update-alternatives on Ubuntu to make sure that c++
and cc call clang++-11 and clang-11, respectively.
Installation
- Install conan; this is used to download C++ libraries such as boost and arrow:
# install conan
pip install conan
# install third-party libs in debug mode using conan...
conan install . -if build/debug --profile:host=conan/profile/tools --profile:build=conan/profile/debug --build missing
# ...same for release mode.
conan install . -if build/release --profile:host=conan/profile/tools --profile:build=conan/profile/release --build missing
conan export conan/libs/arrow.py xbow/stable
# for CLion conan plugin.
ln conan/profile/debug ~/.conan/profiles/debug
ln conan/profile/release ~/.conan/profiles/release
- Install C++ build tools (CMake, Ninja, and [nasm](https://www.na
