Getting Started with HDTree C++
Installing the HDTree C++ API
CMake and a C++17 compatible C++ compiler are required. Both are readily available via your system software repositories.
- On Ubuntu derivatives:
sudo apt update && sudo apt install cmake gcc g++
The HDF5 library exists in many Unix repositories, so look there first when installing it. As always, you can fall back to building the latest release from source.
- On Ubuntu derivatives:
sudo apt update && sudo apt install libhdf5-dev
- On MacOS:
brew install hdf5
The HighFive C++ Wrapper is used by HDTree so it is also required.
- Download a HighFive release
wget https://github.com/BlueBrain/HighFive/archive/refs/tags/v2.7.0.tar.gz
- Unpack the source
tar xzf v2.7.0.tar.gz
cd HighFive-v2.7.0
- Configure the build. Use CMAKE_INSTALL_PREFIX if you wish to install HighFive somewhere besides /usr/local.
cmake -DHIGHFIVE_EXAMPLES=OFF -DHIGHFIVE_UNIT_TESTS=OFF -B build -S .
- Install the interface. May require administrative (sudo) privileges if installing to /usr/local.
cd build
make install
Building HDTree is similar to building HighFive, but it needs a separate compilation step since, unlike HighFive, HDTree is not a header-only C++ library.
- Download a release.
wget https://github.com/tomeichlersmith/hdtree/archive/refs/tags/cpp/v0.4.5.tar.gz
- Unpack the source
tar xzf v0.4.5.tar.gz
cd hdtree-v0.4.5/cpp
- Configure the build (again, use CMAKE_INSTALL_PREFIX if you wish to change the install location).
cmake -B build -S .
- Build the library.
cd build
make
- Install
make install
First Steps
There are four ways to access an HDTree with the C++ API.
They are mainly separated by different stages of processing the data.
We start with save since you first need to write an HDF5 file
with an HDTree in it before you can go any further.
The code below is copied in from the examples directory within the C++ API source. This means the code "snippets" are pretty long, but I've tried to include explanatory comments within them. The examples are compiled alongside the HDTree C++ API so you can try it out immediately after building it.
write-only (save)
First, we are just going to write some example data to a file. This shows an example of a write-only process. After compilation, run by providing a name for the file and the tree.
hdtree-eg-save my-first-hdtree.h5 the-tree
/**
* @file save.cxx
* Example of saving a new HDTree into a file
*/
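// for printing error messages
#include <iostream>
// for generating random data
#include <random>
// for the file/tree names and the branch types
#include <string>
#include <vector>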
// for interacting with HDTrees
#include "hdtree/Tree.h"
// utility functions for example programs
#include "examples.h"
int main(int argc, char** argv) try {
/**
* parse command line for arguments
*/
std::string filename, treename;
int rc = hdtree::examples::parse_single_file_args(argc, argv, filename, treename);
if (rc != 0) return rc;
/**
* Create a tree by defining what file it is in
* and where it resides within that file
*/
auto tree = hdtree::Tree::save(filename, treename);
/**
* Create branches to define what type of information will
* go into the HDTree. The hdtree::Tree::branch function
* returns a handle to the created hdtree::Branch object.
* This object can (and should) be used to interact with
* the values that will be stored in the HDTree on disk
* in order to reduce the number of in-memory copies that
* need to happen. Here, we use `auto&` to avoid typing
* out all the C++ template nonsense that hdtree::Branch
* does under-the-hood.
*
* Each branch handle can be treated as a pointer
* to the underlying type.
*
* **Note**: Branch handles are invalid after the tree they
* were created from is deleted.
*/
auto& i_entry = tree.branch<std::size_t>("i_entry");
auto& rand_nums = tree.branch<std::vector<double>>("rand_nums");
/**
* Initialization of random number generation.
* Not really applicable to HDTree, just used here to
* show that varying length vectors can be serialized
* with ease
*/
std::mt19937 rng; // no argument -> default seed, so every run generates the same sequence
std::uniform_real_distribution<double> norm(0., 1.);
std::uniform_int_distribution<std::size_t> uniform(1, 100);
/**
* Actual update and filling of the HDTree.
*
* You can see here how we can treat `i_entry`
* as if it was a properly initialized `std::size_t *`
* and `rand_nums` as if it was a properly
* initialized `std::vector<double> *`.
*/
for (std::size_t i{0}; i < 100; ++i) {
*i_entry = i;
std::size_t size = uniform(rng);
for (std::size_t j{0}; j < size; j++) {
rand_nums->push_back(norm(rng));
}
/**
* We choose to save each value of the loop into the tree.
*/
tree.save();
}
/**
* The final flushing of the data to disk as well as handle
* cleanup procedures will all be handled automatically by
* destruction.
*/
return 0;
} catch (const hdtree::HDTreeException& e) {
std::cerr << "ERROR " << e << std::endl;
return 1;
}
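The pointer-like handle described in the comments of the listing above can be illustrated without HDTree at all. The sketch below is only a toy: `ToyHandle` and `fill_three_entries` are hypothetical names, not part of the HDTree API, and the reset after each stored entry is an assumption suggested by the `clear` requirement on user classes rather than something the example states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a branch handle; NOT the hdtree::Branch class.
template <typename T>
class ToyHandle {
  T value_{};  // the in-memory buffer that would be written to disk

 public:
  // pointer-like access, so `*h = x` and `h->f()` both work
  T& operator*() { return value_; }
  T* operator->() { return &value_; }
  // assumed behavior: reset the buffer once an entry has been stored
  void clear() { value_ = T{}; }
};

// Fill three entries in the same style as the save example and
// return how many values each stored entry held.
inline std::vector<std::size_t> fill_three_entries() {
  ToyHandle<std::size_t> i_entry;
  ToyHandle<std::vector<double>> nums;
  std::vector<std::size_t> sizes;
  for (std::size_t i{0}; i < 3; ++i) {
    *i_entry = i;
    assert(*i_entry == i);  // reads go through the same handle
    for (std::size_t j{0}; j <= i; ++j) nums->push_back(0.5);
    sizes.push_back(nums->size());  // stand-in for tree.save()
    i_entry.clear();
    nums.clear();
  }
  return sizes;
}
```

With the reset in place the stored sizes are 1, 2 and 3. Without it the vector would keep growing from entry to entry, which is why a per-entry reset matters for variable-length data.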
read and write (transform or inplace)
Another common task is to perform calculations on some input data
and save those calculations into the tree as well. This leaves open
the question of what should be done with the original data. Should we
(a) copy the original data and write it to a new file with the new data
or (b) write the new data into the input file alongside the original data?
In the HDTree C++ API, option (a) is achieved with transform and option
(b) is done with inplace. Both can be run from the same executable and
the choice depends on whether you give a new file and tree name or not.
# this will use hdtree::Tree::transform
hdtree-eg-transform my-first-hdtree.h5 the-tree my-second-hdtree.h5 the-second-tree
# this will use hdtree::Tree::inplace
hdtree-eg-transform my-first-hdtree.h5 the-tree
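How the two invocations above could be told apart is sketched below. This is not the code in examples.h: `FileTree`, `parse_args` and `wants_inplace` are hypothetical names, and the only behavior taken from the example is that an empty destination file name selects the in-place mode.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

using FileTree = std::pair<std::string, std::string>;  // {file name, tree name}

// Hypothetical parser: fills src (and dest when given) from the
// positional arguments. Returns 0 on success and 1 when the number
// of arguments is neither two nor four.
inline int parse_args(const std::vector<std::string>& args, FileTree& src,
                      FileTree& dest) {
  if (args.size() != 2 && args.size() != 4) return 1;
  src = {args[0], args[1]};
  dest = (args.size() == 4) ? FileTree{args[2], args[3]} : FileTree{};
  return 0;
}

// An empty destination file name selects the in-place mode.
inline bool wants_inplace(const FileTree& dest) { return dest.first.empty(); }
```

Two arguments leave the destination empty, so `wants_inplace` is true. Four arguments fill it in, so the copy-to-a-new-file mode is chosen.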
/**
* @file transform.cxx
* Example of transforming an HDTree by adding more branches
*
* This example determines whether a tree should be copied into
* a new file or simply transformed in its current file by what
* arguments are provided to the program. We assume the input
* tree was generated by the hdtree-eg-save example program
* defined in @ref save.cxx (i.e. we look for specific branches).
*/
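// for printing error messages
#include <iostream>
// for std::reduce
#include <numeric>
// for the file and tree names
#include <string>
#include <utility>
// for the branch types
#include <vector>
// for interacting with HDTrees
#include "hdtree/Tree.h"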
// utility functions for example programs
#include "examples.h"
int main(int argc, char** argv) try {
/**
* parse command line for arguments
*/
std::pair<std::string,std::string> src, dest;
int rc = hdtree::examples::parse_two_file_args(argc, argv, src, dest);
if (rc != 0) return rc;
/**
* Wrap an existing on-disk HDTree
*
* Here is where we make the decision on whether to copy a tree
* into a new file or not. We choose to copy the tree into
* a new file if a destination file and tree are provided on
* the command line. We use the slightly-ugly ternary operator
* in order to avoid unnecessary copying from an if-else tree.
*/
auto tree = dest.first.empty() ?
hdtree::Tree::inplace(src.first, src.second) :
hdtree::Tree::transform(src, dest);
/**
* We are going to calculate the average of the random
* numbers within each tree entry, so we create a new
* branch to store that result as well as retrieve
* the branch with the numbers we will use.
*/
auto& rand_nums = tree.get<std::vector<double>>("rand_nums");
auto& avg = tree.branch<double>("avg");
/**
* Actual update and filling of the HDTree.
*
* We use a tree helper that will make sure we go through
* each entry in the tree, calling the hdtree::Tree::load
* at the beginning and hdtree::Tree::save at the end of
* each run in the loop. This code is essentially equivalent to
* ```cpp
* for (std::size_t i{0}; i < tree.entries(); ++i) {
* tree.load();
* // the code inside the lambda function below
* if (rand_nums->size() > 0) {
* *avg = (std::reduce(rand_nums->begin(), rand_nums->end()))/rand_nums->size();
* } else {
* *avg = -1;
* }
* //
* tree.save();
* }
* ```
* Just using this example to show off some potentially-helpful
* features - if lambda functions are causing you difficulty,
* feel free to avoid them. Just make sure to remember to call
* the load and save functions!
*/
tree.for_each([&]() {
if (rand_nums->size() > 0) {
*avg = (std::reduce(rand_nums->begin(), rand_nums->end()))/rand_nums->size();
} else {
*avg = -1;
}
});
/**
* The final flushing of the data to disk as well as handle
* cleanup procedures will all be handled automatically by
* destruction.
*/
return 0;
} catch (const hdtree::HDTreeException& e) {
std::cerr << "ERROR " << e << std::endl;
return 1;
}
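The per-entry calculation from the listing above can be pulled out into a small free function and checked on its own. In this sketch `average_or_sentinel` is a hypothetical name; the logic (std::reduce over the values, -1 for an empty entry) follows the example.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Mean of the values, or -1 as a sentinel when there is nothing to average.
// std::reduce lives in <numeric> and needs C++17.
inline double average_or_sentinel(const std::vector<double>& values) {
  if (values.empty()) return -1.;
  return std::reduce(values.begin(), values.end()) / values.size();
}
```

Since the generated numbers are never negative, -1 cannot be confused with a real average, which is what lets a later reader of the file recognize the empty entries.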
read-only (load)
Finally, the last common task is reading in the data from the tree and using
it to do some other task (e.g. making a plot or fitting the data with some
model). In this API, that is called loading and the example program included
prints a simple histogram of the averages of the original data generated earlier.
Fun Fact: This is an example of the central limit theorem!
# this will error-out if you didn't run the transform step above!
hdtree-eg-load my-first-hdtree.h5 the-tree
# the below is example output, it may change since the random data may change!
0.X | Num Entries
< 0 |
0.0 |
0.1 |*
0.2 |
0.3 |***
0.4 |********************************************
0.5 |*************************************************
0.6 |**
0.7 |
0.8 |*
0.9 |
> 1 |
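The clustering around 0.5 in the output above can be reproduced without any file I/O. The sketch below is not part of the HDTree examples: `simulate_averages` and `fraction_near_half` are hypothetical helpers that mimic the shape of the generated data (1 to 100 uniform values per entry).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Averages of n_entries groups of uniform(0,1) samples, each group
// holding between 1 and 100 samples.
inline std::vector<double> simulate_averages(std::size_t n_entries) {
  std::mt19937 rng;  // default seed, so the run is repeatable
  std::uniform_real_distribution<double> value(0., 1.);
  std::uniform_int_distribution<std::size_t> length(1, 100);
  std::vector<double> averages;
  averages.reserve(n_entries);
  for (std::size_t i{0}; i < n_entries; ++i) {
    const std::size_t n = length(rng);
    double sum = 0.;
    for (std::size_t j{0}; j < n; ++j) sum += value(rng);
    averages.push_back(sum / n);
  }
  return averages;
}

// Fraction of the averages that fall within `width` of 0.5.
inline double fraction_near_half(const std::vector<double>& averages,
                                 double width) {
  assert(!averages.empty());
  std::size_t count = 0;
  for (double a : averages) {
    if (std::fabs(a - 0.5) <= width) ++count;
  }
  return static_cast<double>(count) / averages.size();
}
```

A single uniform draw falls within 0.1 of 0.5 only 20% of the time, while the averages do so far more often. That narrowing is the central limit theorem at work.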
/**
* @file load.cxx
* Example of loading an existing HDTree and reading its branches
*
* We assume the input tree was generated by the hdtree-eg-save
* example program defined in @ref save.cxx and then updated by
* the hdtree-eg-transform example program defined in
* @ref transform.cxx (i.e. we look for specific branches).
*/
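// for floor
#include <cmath>
// for printf
#include <cstdio>
// for printing messages
#include <iostream>
// for the file/tree names and the histogram bins
#include <string>
#include <vector>
// for interacting with HDTrees
#include "hdtree/Tree.h"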
// utility functions for example programs
#include "examples.h"
int main(int argc, char** argv) try {
/**
* parse command line for arguments
*/
std::string file_name, tree_name;
int rc = hdtree::examples::parse_single_file_args(argc, argv, file_name, tree_name);
if (rc != 0) return rc;
/**
* Wrap an existing on-disk HDTree
*/
auto tree = hdtree::Tree::load(file_name, tree_name);
std::cout << "This is what a missing branch exception looks like:" << std::endl;
try {
tree.get<double>("dne");
} catch (const hdtree::HDTreeException& e) {
// demonstrate what exceptions look like.
std::cout << e << std::endl;
}
std::cout << "--- end of example exception ---" << std::endl;
/**
* We want to study the average of the random data
* in each entry. This average was calculated in
* the examples/transform.cxx program so this part
* will fail if running on a file that wasn't updated
* by transform!
*/
const auto& avg = tree.get<double>("avg");
/**
* Our very simple histogram is going to be 10 bins plus
* an underflow bin (everything below 0) and an overflow bin
* (everything above 1).
*
* Since the random data is between 0 and 1, we can calculate
* the bin index very quickly
*
* floor(avg * 10)+1
*
* An average of exactly 1 gives index 11 with this formula and
* so lands in the overflow bin, while the entries without any
* data from which to calculate an average (stored as -1) land
* in the underflow bin.
*/
std::vector<unsigned int> hist_bins(12, 0);
/**
* Actual loop over the tree.
*
* We use a tree helper that will make sure we go through
* each entry in the tree, calling the hdtree::Tree::load
* at the beginning of each run in the loop.
* This code is essentially equivalent to
* ```cpp
* for (std::size_t i{0}; i < tree.entries(); ++i) {
* tree.load();
* // the code in the lambda function below
* }
* ```
* Just using this example to show off some potentially-helpful
* features - if lambda functions are causing you difficulty,
* feel free to avoid them. Just make sure to remember to call
* the load function!
*/
tree.for_each([&]() {
std::size_t i_bin{0};
if (*avg < 0) {
i_bin = 0;
} else if (*avg > 1) {
i_bin = 11;
} else {
i_bin = static_cast<std::size_t>(std::floor(*avg * 10)) + 1;
}
++hist_bins[i_bin];
});
printf("0.X | Num Entries\n");
for (std::size_t i_bin{0}; i_bin < 12; ++i_bin) {
std::string x;
if (i_bin == 0) {
x = "< 0";
} else if (i_bin == 11) {
x = "> 1";
} else {
x = "0."+std::to_string(i_bin-1);
}
printf("%s |", x.c_str());
for (std::size_t c{0}; c < hist_bins.at(i_bin); ++c) printf("*");
printf("\n");
}
/**
* Nothing is written in this read-only example, so only the
* handle cleanup procedures remain and they will all be handled
* automatically by destruction.
*/
return 0;
} catch (const hdtree::HDTreeException& e) {
std::cerr << "ERROR " << e << std::endl;
return 1;
}
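The binning rule used in the listing above is easy to check in isolation. Here `bin_index` is a hypothetical free function with the same branches as the lambda in the example; it is a sketch for experimenting with edge cases, not code from the HDTree sources.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Index into a 12-slot histogram: 0 is underflow, 1..10 cover [0, 1),
// and 11 is overflow.
inline std::size_t bin_index(double avg) {
  if (avg < 0) return 0;   // underflow, including the -1 sentinel
  if (avg > 1) return 11;  // overflow
  return static_cast<std::size_t>(std::floor(avg * 10)) + 1;
}
```

An average of 0.45 goes to index 5 (the row labelled 0.4), the -1 sentinel goes to index 0, and an average of exactly 1 goes to index 11 together with everything above 1.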
User-Defined Data Structures
User-defined objects can also be serialized within HDTree. Simplified
schema evolution (a la ROOT's ClassDef macro) is also available; however,
this example merely shows the required boiler-plate.
HDTree's C++ API has chosen to avoid automatically deducing the on-disk naming from the in-memory class member names. This introduces more boilerplate, but, in my opinion, is helpful for essentially documenting how on-disk data was generated.
/**
* @file user_classes.cxx
* Example of saving and loading user-defined C++ classes
*/
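// for sqrt
#include <cmath>
// for printing error messages
#include <iostream>
// for generating random data
#include <random>
// for the file/tree names and the branch types
#include <string>
#include <vector>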
// for interacting with HDTrees
#include "hdtree/Tree.h"
// utility functions for example programs
#include "examples.h"
/**
* Example user class
*/
class MyData {
float x_, y_, z_;
// grant hdtree access so we can keep the `attach` method private
friend class hdtree::access;
// this is where the name of data on disk is assigned to the
// variable name of data in memory
template <typename Branch>
void attach(Branch& b) {
b.attach("x", x_);
b.attach("y", y_);
b.attach("z", z_);
}
public:
MyData() = default;
MyData(float x, float y, float z)
: x_{x}, y_{y}, z_{z} {}
// HDTree also requires classes to have a `clear` method
// for resetting the instance to a "non-assigned" state
void clear() {
x_ = 0.;
y_ = 0.;
z_ = 0.;
}
// helper function since we know what this data means
float mag() const {
return std::sqrt(x_*x_+y_*y_+z_*z_);
}
};
int main(int argc, char** argv) try {
/**
* parse command line for arguments
*/
std::string filename, treename;
int rc = hdtree::examples::parse_single_file_args(argc, argv, filename, treename);
if (rc != 0) return rc;
{ // write a simple file with some random data points
auto tree = hdtree::Tree::save(filename, treename);
/**
* Once the MyData::attach method is written, it can be put
* into STL containers (or as a member of other user classes)
* like any other serializable class
*/
auto& my_data = tree.branch<std::vector<MyData>>("my_data");
// initialization of random number generator
std::mt19937 rng; // no argument -> default seed, so every run generates the same sequence
std::uniform_real_distribution<double> norm(0., 1.);
std::uniform_int_distribution<std::size_t> uniform(1, 100);
for (std::size_t i{0}; i < 100; ++i) {
std::size_t size = uniform(rng);
for (std::size_t j{0}; j < size; ++j) {
my_data->emplace_back(norm(rng), norm(rng), norm(rng));
}
tree.save();
}
// final flushing accomplished when tree and its branches
// go out of scope and are destructed
}
{ // load back from same file and write the average mag as a new branch
auto tree = hdtree::Tree::inplace(filename, treename);
auto& my_data = tree.get<std::vector<MyData>>("my_data");
auto& avg_mag = tree.branch<float>("avg_mag");
tree.for_each([&]() {
if (my_data->size() > 0) {
float tot_mag = 0.;
for (const MyData& d : *my_data) {
tot_mag += d.mag();
}
*avg_mag = tot_mag/my_data->size();
} else {
*avg_mag = -1;
}
});
// final flushing accomplished when tree and its branches
// go out of scope and are destructed
}
return 0;
} catch (const hdtree::HDTreeException& e) {
std::cerr << "ERROR " << e << std::endl;
return 1;
}
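To see what the explicit naming in `attach` buys you, the templated method can be called with any object that offers a matching `attach(name, member)` function. The sketch below uses a hypothetical `NameRecorder` in place of a real branch, and a hypothetical `Point3` class whose `attach` is public (so no `hdtree::access` friend is needed), to list the on-disk names a class would produce.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the object passed to `attach`;
// it only records the names it is given.
struct NameRecorder {
  std::vector<std::string> names;
  template <typename T>
  void attach(const std::string& name, T& /*member*/) {
    names.push_back(name);
  }
};

class Point3 {
  float x_{0.f}, y_{0.f}, z_{0.f};

 public:
  // same shape as MyData::attach: on-disk name first, member second
  template <typename Branch>
  void attach(Branch& b) {
    b.attach("x", x_);
    b.attach("y", y_);
    b.attach("z", z_);
  }
};

// Collect the on-disk names that Point3 declares.
inline std::vector<std::string> on_disk_names() {
  NameRecorder recorder;
  Point3 p;
  p.attach(recorder);
  return recorder.names;
}
```

Because the names are string literals rather than something derived from the member names, renaming `x_` in memory would not change what is written to disk.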
More Intense Use Case
The C++ HDTree API is mainly implemented through its various Branch classes.
The Tree class is mainly there to be a helpful interface for handling a set of Branches.
I point this out because if you are interested in building
a larger data processing framework around the C++ HDTree API,
I would suggest focusing on writing your own version of Tree
to accommodate your needs rather than attempting to use the Tree
that is a part of this repository.
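As a rough picture of what "your own version of Tree" could mean, the sketch below shows a container that owns a set of branches and forwards calls to them. Every name here (`ToyBranchBase`, `ToyBranch`, `MiniTree`, `demo_entries`) is hypothetical and none of it is the HDTree interface; the on-disk dataset is replaced by an in-memory vector so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical interface; the real HDTree branch classes may look different.
struct ToyBranchBase {
  virtual ~ToyBranchBase() = default;
  virtual void save() = 0;   // store the current value as one more entry
  virtual void clear() = 0;  // reset the in-memory value
  virtual std::size_t entries() const = 0;
};

template <typename T>
class ToyBranch : public ToyBranchBase {
  T value_{};
  std::vector<T> stored_;  // stands in for the on-disk dataset

 public:
  T& operator*() { return value_; }
  void save() override { stored_.push_back(value_); }
  void clear() override { value_ = T{}; }
  std::size_t entries() const override { return stored_.size(); }
};

// A framework-specific "tree": it owns branches and forwards calls to them.
class MiniTree {
  std::vector<std::unique_ptr<ToyBranchBase>> branches_;

 public:
  // The returned reference stays valid only as long as the tree lives.
  template <typename T>
  ToyBranch<T>& branch() {
    auto owned = std::make_unique<ToyBranch<T>>();
    ToyBranch<T>& ref = *owned;
    branches_.push_back(std::move(owned));
    return ref;
  }
  void save() {
    for (auto& b : branches_) {
      b->save();
      b->clear();
    }
  }
  std::size_t entries() const {
    return branches_.empty() ? 0 : branches_.front()->entries();
  }
};

// Write four entries into two branches and report the entry count.
inline std::size_t demo_entries() {
  MiniTree tree;
  auto& number = tree.branch<int>();
  auto& weight = tree.branch<double>();
  for (int i = 0; i < 4; ++i) {
    *number = i;
    *weight = 0.5 * i;
    tree.save();
  }
  assert(tree.entries() == 4);
  return tree.entries();
}
```

A framework would add its own scheduling, bookkeeping or event model on top of such a container while leaving the per-type serialization to the branch classes.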