Getting Started with HDTree C++

Installing the HDTree C++ API

CMake and a C++17-compatible compiler are required, both of which are readily available via your system's software repositories.

  • On Ubuntu derivatives: sudo apt update && sudo apt install cmake gcc g++

The HDF5 library is available in many Unix package repositories, so look there first when installing it. As always, you can fall back to building the latest release from source.

  • On Ubuntu derivatives: sudo apt update && sudo apt install libhdf5-dev
  • On MacOS: brew install hdf5

The HighFive C++ Wrapper is used by HDTree so it is also required.

  • Download the source
wget https://github.com/BlueBrain/HighFive/archive/refs/tags/v2.7.0.tar.gz
  • Unpack the source
tar xzf v2.7.0.tar.gz
cd HighFive-v2.7.0
  • Configure the build. Set CMAKE_INSTALL_PREFIX if you wish to install HighFive somewhere other than /usr/local.
cmake -DHIGHFIVE_EXAMPLES=OFF -DHIGHFIVE_UNIT_TESTS=OFF -B build -S .
  • Install the interface. May require administrative (sudo) privileges if installing to /usr/local.
cmake --build build --target install

Building HDTree is similar to building HighFive, but it includes a separate compilation step since, unlike HighFive, HDTree is not a header-only C++ library.

  • Download the source
wget https://github.com/tomeichlersmith/hdtree/archive/refs/tags/cpp/v0.4.5.tar.gz
  • Unpack the source
tar xzf v0.4.5.tar.gz
cd hdtree-v0.4.5/cpp
  • Configure the build (again, use CMAKE_INSTALL_PREFIX if you wish to change the install location).
cmake -B build -S .
  • Build the library.
cd build
make
  • Install
make install

First Steps

There are four ways to access an HDTree with the C++ API. They correspond to different stages of processing the data. We start with save since you first need to write an HDF5 file with an HDTree in it before you can go any further.

The code below is copied in from the examples directory within the C++ API source. This means the code "snippets" are pretty long, but I've tried to include explanatory comments within them. The examples are compiled alongside the HDTree C++ API so you can try it out immediately after building it.

write-only (save)

First, we are just going to write some example data to a file. This shows an example of a write-only process. After compilation, run it by providing a name for the file and the tree.

hdtree-eg-save my-first-hdtree.h5 the-tree
/**
 * @file save.cxx
 * Example of saving a new HDTree into a file
 */

// for generating random data
#include <random>

// for interacting with HDTrees
#include "hdtree/Tree.h"

// utility functions for example programs
#include "examples.h"

int main(int argc, char** argv) try {
  /**
   * parse command line for arguments
   */
  std::string filename, treename;
  int rc = hdtree::examples::parse_single_file_args(argc, argv, filename, treename);
  if (rc != 0) return rc;

  /**
   * Create a tree by defining what file it is in
   * and where it resides within that file
   */
  auto tree = hdtree::Tree::save(filename, treename);

  /**
   * Create branches to define what type of information will
   * go into the HDTree. The hdtree::Tree::branch function
   * returns a handle to the created hdtree::Branch object.
   * This object can (and should) be used to interact with
   * the values that will be stored in the HDTree on disk
   * in order to reduce the number of in-memory copies that
   * need to happen. Here, we use `auto&` to avoid typing
   * out all the C++ template nonsense that hdtree::Branch
   * does under-the-hood.
   *
   * Each branch handle can be treated as a pointer
   * to the underlying type.
   *
   * **Note**: Branch handles are invalid after the tree they
   * were created from is deleted.
   */
  auto& i_entry = tree.branch<std::size_t>("i_entry");
  auto& rand_nums = tree.branch<std::vector<double>>("rand_nums");

  /**
   * Initialization of random number generation.
   * Not really applicable to HDTree, just used here to
   * show that varying length vectors can be serialized
   * with ease
   */
  std::mt19937 rng;  // no argument -> default seed (same sequence every run)
  std::uniform_real_distribution<double> norm(0., 1.);
  std::uniform_int_distribution<std::size_t> uniform(1, 100);

  /**
   * Actual update and filling of the HDTree.
   *
   * You can see here how we can treat `i_entry`
   * as if it was a properly initialized `std::size_t *`
   * and `rand_nums` as if it was a properly
   * initialized `std::vector<double> *`.
   */
  for (std::size_t i{0}; i < 100; ++i) {
    *i_entry = i;
    std::size_t size = uniform(rng);
    for (std::size_t j{0}; j < size; j++) {
      rand_nums->push_back(norm(rng));
    }

    /**
     * We choose to save each value of the loop into the tree.
     */
    tree.save();
  }

  /**
   * The final flushing of the data to disk as well as handle
   * cleanup procedures will all be handled automatically by
   * destruction.
   */
  return 0;
} catch (const hdtree::HDTreeException& e) {
  std::cerr << "ERROR " << e << std::endl;
  return 1;
}

read and write (transform or inplace)

Another common task is to perform calculations on some input data and save the results into the tree as well. This raises the question of what should be done with the original data: should we (a) copy the original data into a new file alongside the new data, or (b) write the new data into the input file alongside the original data? In the HDTree C++ API, option (a) is achieved with transform and option (b) with inplace. Both can be run from the same executable; the choice depends on whether you provide a new file and tree name.

# this will use hdtree::Tree::transform
hdtree-eg-transform my-first-hdtree.h5 the-tree my-second-hdtree.h5 the-second-tree
# this will use hdtree::Tree::inplace
hdtree-eg-transform my-first-hdtree.h5 the-tree
/**
 * @file transform.cxx
 * Example of transforming an HDTree by adding more branches
 * 
 * This example determines whether a tree should be copied into
 * a new file or simply transformed in its current file by what
 * arguments are provided to the program. We assume the input
 * tree was generated by the hdtree-eg-save example program
 * defined in @ref save.cxx (i.e. we look for specific branches).
 */

// for std::reduce
#include <numeric>

// for interacting with HDTrees
#include "hdtree/Tree.h"

// utility functions for example programs
#include "examples.h"

int main(int argc, char** argv) try {
  /**
   * parse command line for arguments
   */
  std::pair<std::string,std::string> src, dest;
  int rc = hdtree::examples::parse_two_file_args(argc, argv, src, dest);
  if (rc != 0) return rc;

  /**
   * Wrap an existing on-disk HDTree
   *
   * Here is where we make the decision on whether to copy a tree
   * into a new file or not. We choose to copy the tree into
   * a new file if a destination file and tree are provided on
   * the command line. We use the slightly-ugly ternary operator
   * in order to avoid unnecessary copying from an if-else tree.
   */
  auto tree = dest.first.empty() ?
    hdtree::Tree::inplace(src.first, src.second) :
    hdtree::Tree::transform(src, dest);

  /**
   * We are going to calculate the average of the random
   * numbers within each tree entry, so we create a new
   * branch to store that result as well as retrieve
   * the branch with the numbers we will use.
   */
  auto& rand_nums = tree.get<std::vector<double>>("rand_nums");
  auto& avg = tree.branch<double>("avg");

  /**
   * Actual update and filling of the HDTree.
   *
   * We use a tree helper that will make sure we go through
   * each entry in the tree, calling the hdtree::Tree::load
   * at the beginning and hdtree::Tree::save at the end of
   * each run in the loop. This code is essentially equivalent to
   * ```cpp
   * for (std::size_t i{0}; i < tree.entries(); ++i) {
   *   tree.load();
   *   // the code inside the lambda function below
   *   if (rand_nums->size() > 0) {
   *     *avg = (std::reduce(rand_nums->begin(), rand_nums->end()))/rand_nums->size();
   *   } else {
   *     *avg = -1;
   *   }
   *   // 
   *   tree.save();
   * }
   * ```
   * Just using this example to show off some potentially-helpful
   * features - if lambda functions are causing you difficulty, 
   * feel free to avoid them. Just make sure to remember to call
   * the load and save functions!
   */
  tree.for_each([&]() {
        if (rand_nums->size() > 0) {
          *avg = (std::reduce(rand_nums->begin(), rand_nums->end()))/rand_nums->size();
        } else {
          *avg = -1;
        }
      });

  /**
   * The final flushing of the data to disk as well as handle
   * cleanup procedures will all be handled automatically by
   * destruction.
   */
  return 0;
} catch (const hdtree::HDTreeException& e) {
  std::cerr << "ERROR " << e << std::endl;
  return 1;
}

read-only (load)

The last common task is reading in the data from the tree and using it to do some other task (e.g. making a plot or fitting the data with some model). In this API, that is called loading, and the included example program prints a simple histogram of the averages of the original data generated earlier.

Fun Fact: This is an example of the central limit theorem!

# this will error-out if you didn't run the transform example first!
hdtree-eg-load my-first-hdtree.h5 the-tree
# the below is example output, it may change since the random data may change!
0.X | Num Entries
< 0 |
0.0 |
0.1 |*
0.2 |
0.3 |***
0.4 |********************************************
0.5 |*************************************************
0.6 |**
0.7 |
0.8 |*
0.9 |
> 1 |
/**
 * @file load.cxx
 * Example of loading an HDTree and reading its branches
 *
 * This example reads a tree without modifying it. We assume the
 * input tree was generated by the hdtree-eg-save example program
 * defined in @ref save.cxx and then updated by the
 * hdtree-eg-transform example program defined in
 * @ref transform.cxx (i.e. we look for specific branches).
 */

// for floor and printf
#include <cmath>
#include <cstdio>

// for interacting with HDTrees
#include "hdtree/Tree.h"

// utility functions for example programs
#include "examples.h"

int main(int argc, char** argv) try {
  /**
   * parse command line for arguments
   */
  std::string file_name, tree_name;
  int rc = hdtree::examples::parse_single_file_args(argc, argv, file_name, tree_name);
  if (rc != 0) return rc;

  /**
   * Wrap an existing on-disk HDTree
   */
  auto tree = hdtree::Tree::load(file_name, tree_name);

  std::cout << "This is what a missing branch exception looks like:" << std::endl;
  try {
    tree.get<double>("dne");
  } catch (const hdtree::HDTreeException& e) {
    // demonstrate what exceptions look like.
    std::cout << e << std::endl;
  }
  std::cout << "--- end of example exception ---" << std::endl;

  /**
   * We want to study the average of the random data
   * in each entry. This average was calculated in
   * the examples/transform.cxx program so this part
   * will fail if running on a file that wasn't updated
   * by transform!
   */
  const auto& avg = tree.get<double>("avg");

  /**
   * Our very simple histogram is going to be 10 bins plus
   * an underflow bin (everything below 0) and an overflow
   * bin (everything above 1).
   *
   * Since the random data is between 0 and 1, we can calculate
   * the bin index very quickly
   *
   *  floor(avg * 10)+1
   *
   * With this formula, an average of exactly 1 lands in the
   * overflow bin, and entries without any data from which to
   * calculate an average (stored as -1) land in the underflow bin.
   */
  std::vector<unsigned int> hist_bins(12, 0);

  /**
   * Actual loop over the tree.
   *
   * We use a tree helper that will make sure we go through
   * each entry in the tree, calling the hdtree::Tree::load
   * at the beginning of each run in the loop.
   * This code is essentially equivalent to
   * ```cpp
   * for (std::size_t i{0}; i < tree.entries(); ++i) {
   *   tree.load();
   *   // the code in the lambda function below
   * }
   * ```
   * Just using this example to show off some potentially-helpful
   * features - if lambda functions are causing you difficulty,
   * feel free to avoid them. Just make sure to remember to call
   * the load function!
   */
  tree.for_each([&]() {
        std::size_t i_bin{0};
        if (*avg < 0) {
          i_bin = 0; 
        } else if (*avg > 1) {
          i_bin = 11;
        } else {
          i_bin = floor(*avg * 10) + 1;
        }
        ++hist_bins[i_bin];
      });

  printf("0.X | Num Entries\n");
  for (std::size_t i_bin{0}; i_bin < 12; ++i_bin) {
    std::string x;
    if (i_bin == 0) {
      x = "< 0";
    } else if (i_bin == 11) {
      x = "> 1";
    } else {
      x = "0."+std::to_string(i_bin-1);
    }
    printf("%s |", x.c_str());
    for (std::size_t c{0}; c < hist_bins.at(i_bin); ++c) printf("*");
    printf("\n");
  }

  /**
   * Handle cleanup procedures will all be handled automatically
   * by destruction. Nothing needs to be flushed to disk since
   * this program only reads the tree.
   */
  return 0;
} catch (const hdtree::HDTreeException& e) {
  std::cerr << "ERROR " << e << std::endl;
  return 1;
}

User-Defined Data Structures

User-defined objects can also be serialized within HDTree. Simplified schema evolution (a la ROOT's ClassDef macro) is also available; however, this example merely shows the required boiler-plate.

HDTree's C++ API has chosen to avoid automatically deducing the on-disk naming from the in-memory class member names. This introduces more boilerplate, but, in my opinion, is helpful for essentially documenting how on-disk data was generated.

/**
 * @file user_classes.cxx
 * Example of saving and loading user-defined C++ classes 
 */

// for sqrt
#include <cmath>

// for generating random data
#include <random>

// for interacting with HDTrees
#include "hdtree/Tree.h"

// utility functions for example programs
#include "examples.h"

/**
 * Example user class
 */
class MyData {
  float x_, y_, z_;
  // grant hdtree access so we can keep the `attach` method private
  friend class hdtree::access;
  // this is where the name of data on disk is assigned to the
  // variable name of data in memory
  template <typename Branch>
  void attach(Branch& b) {
    b.attach("x", x_);
    b.attach("y", y_);
    b.attach("z", z_);
  }
 public:
  MyData() = default;
  MyData(float x, float y, float z)
    : x_{x}, y_{y}, z_{z} {}
  // HDTree also requires classes to have a `clear` method
  // for resetting the instance to a "non-assigned" state
  void clear() {
    x_ = 0.;
    y_ = 0.;
    z_ = 0.;
  }
  // helper function since we know what this data means
  float mag() const {
    return sqrt(x_*x_+y_*y_+z_*z_);
  }
};

int main(int argc, char** argv) try {
  /**
   * parse command line for arguments
   */
  std::string filename, treename;
  int rc = hdtree::examples::parse_single_file_args(argc, argv, filename, treename);
  if (rc != 0) return rc;

  { // write a simple file with some random data points
    auto tree = hdtree::Tree::save(filename, treename);
    /**
     * Once the MyData::attach method is written, it can be put
     * into STL containers (or as a member of other user classes)
     * like any other serializable class
     */
    auto& my_data = tree.branch<std::vector<MyData>>("my_data");
    // initialization of random number generator
    std::mt19937 rng;  // no argument -> default seed (same sequence every run)
    std::uniform_real_distribution<double> norm(0., 1.);
    std::uniform_int_distribution<std::size_t> uniform(1, 100);

    for (std::size_t i{0}; i < 100; ++i) {
      std::size_t size = uniform(rng);
      for (std::size_t j{0}; j < size; ++j) {
        my_data->emplace_back(norm(rng), norm(rng), norm(rng));
      }

      tree.save();
    }

    // final flushing accomplished when tree and its branches
    // go out of scope and are destructed
  }

  { // load back from same file and write the average mag as a new branch
    auto tree = hdtree::Tree::inplace(filename, treename);
    auto& my_data = tree.get<std::vector<MyData>>("my_data");
    auto& avg_mag = tree.branch<float>("avg_mag");
    tree.for_each([&]() {
        if (my_data->size() > 0) {
          float tot_mag = 0.;
          for (const MyData& d : *my_data) {
            tot_mag += d.mag();
          }
          *avg_mag = tot_mag/my_data->size();
        } else {
          *avg_mag = -1;
        }
    });

    // final flushing accomplished when tree and its branches
    // go out of scope and are destructed
  }

  return 0;
} catch (const hdtree::HDTreeException& e) {
  std::cerr << "ERROR " << e << std::endl;
  return 1;
}

More Intense Use Case

The C++ HDTree API is mainly implemented through its various Branch classes; the Tree class mainly serves as a helpful interface for handling a set of Branches. I point this out because, if you are interested in building a larger data processing framework around the C++ HDTree API, I would suggest writing your own version of Tree to accommodate your needs rather than attempting to use the Tree that is a part of this repository.