array_hash_map implementation

The array_hash_map is an optimized cache-conscious key-value pair data structure for storing and retrieving frequency lists. It is based an a simple hash implementation, where collisions are resolved in a continuous, length-coded array. This array_hash_map implementation is largely based on an array_hash implementation from Chris Vaszauskas and Tyler Richard. This implementation was extended to store key-value pairs. The main approaches for this work will be introduced now.

Length-coded array in detail

A so called slot consists of following information:

The hash table has a size of 2^19 slots (524288).

References

The main approach of building cache-conscious data structures can be found in "Cache-conscious collision resolution for string hash tables" - an excellent work from Nikolas Askitis and Justin Zobel, see it here.

Another excellent paper is "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings" from Nikolas Askitis and Ranjan Sinja. It can be found here.

Another excellent overview on cache-conscious data structures, can be found in "Efficient Data Structures For Cache Architectures". This PhD thesis from Nikolas Askitis can be obtained from here.

The array_hash implementation from Chris Vaszauskas and Tyler Richard was initially released on GitHub, but is no longer available.

Licences

The array_hash implementation from Chris Vaszauskas and Tyler Richard was released under the GPLv3.

The basic_unittest framework from Stefan Schweter is used to build up a nice test suite. The basic_unittest framework is licensed under the GPLv3 and can be found here, see basic_unittest.

The array_hash_map implementation is then released under the GPLv3.

For more information see COPYING file.

Compiling the array_hash_map

Requirements

It was tested with GCC in version 6.0 and clang++ in version 3.6.1.

Configuring

Just run make. Now the array_hash_map example program and all unittests will be compiled.

For more information see GNU Makefile targets section.

GNU make variables

It is possible to (re-) define the following variables for the Makefile:

Variable name Description Default value
SHELL Defines what shell will be used /bin/bash
MAKESHELL Defines what (make) shell will be used /bin/bash
EXEC Name of example program array_hash_map_example
TEST_EXEC Name of uniitests program array_hash_map_tests
CXX Defines what compiler will be used g++
CXXFLAGS Compiler flags for C++ -std=c++11 -Ofast -Wall -Wextra -Wpedantic
GIT Command for Git git
BASIC_UNITTEST_GIT_ADDRESS Repository address for basic_unittest git://git.schweter.eu/array_hash_map.git
SILENT Silent output of targets @
QUIET Quiet output of commands 2>&1 >/dev/null

To use another compiler, e.g. clang++ just execute:

make CXX=clang++

GNU Makefile targets

The following Makefile targets are available:

Target name Description
all Builds the array hash map example program and executes all test cases
bin Builds the array hash map example program
tests Checks out the basic_unittest repository and builds all test cases
clean Cleans up all created files like .o or executables
distclean Same as clean target and additionally does a git clean

The default target is all - so just simple run make all or make.

Unittest

To execute all unittests just run:

./array_hash_map_tests

Notice: You can change the unittest program name by setting the EXEC variable during the compilation step:

make EXEC=array_hash_map_unittests

Example program

To run the example program just run:

./array_hash_map_example

Notice: You can change the example program name by setting the TEST_EXEC variable during the compilation step:

make EXEC=example

Using the array_hash_map

Using the array_hash_map is very simple, just include the header file and you can start adding some values to the data structure:

#include <iostream>
#include <algorithm>
#include "array_hash_map.hpp"

auto main() -> int {
  array_hash_map *array = new array_hash_map();
  array->insert("Testvalue", 1);
  array->insert("Anotherone", 2);

  std::for_each(array->begin(), array->end(),
                [](const std::pair<const char *, long unsigned int> i) {
                  std::cout << i.first << " => " << i.second << std::endl;
                });
  return 0;
}

For the complete array_hash_map interface, just have a look at array_hash_map.hpp.

array_hash_map benchmarks

As test dataset the distinct string dataset from the Nikolas Askitis website was used; it can be found here.

This dataset contains of approx. 28 million words with an approx. uncompressed file size of 300 MB.

Test machine was an AMD FX-8150, 3.11.6 as kernel version and gcc version 4.8.2.

Data structure Exec. time
std::map < std::string, uint64_t > 1m2.607s
std::unordered_map < std::string, uint64_t > 0m24.689s
std::vector < std::string > 0m5.600s
array_hash_map < std::string, uint64_t > 0m14.563s
Data structure Max. memory
std::map < std::string, uint64_t > 2.83 GB
std::unordered_map < std::string, uint64_t > 2.66 GB
std::vector < std::string > 1.57 GB
array_hash_map < std::string, uint64_t > 589.5 MB

Contributing

Feel free to send your feedback, suggestions, error reports to the author. Contact information is Stefan Schweter <stefan@schweter.it>.

All comments are very welcome!

TODO

Version 1 of the array_hash_map is just a very simple and short proof-of-concept. Therefore a 100 % STL like interface would be the next step to implement.

Another great feature would be a template/trait like interface, e.g. for using a smaller value size.

History