Data structure: The array_hash_map

The array_hash_map is an optimized cache-conscious key-value pair in-memory data structure for storing and retrieving frequency lists.

It is based an a simple hash implementation, where collisions are resolved in a continuous, length-coded array. This array_hash_map implementation is largely based on an array_hash implementation from Chris Vaszauskas and Tyler Richard.

We extend this implementation to store key-value pairs. The main approaches for this work will be introduced now.

Length-coded array in detail

A so called *slot* consists of following information:

References

The main approach of building cache-conscious data structures can be found in "Cache-conscious collision resolution for string hash tables" - an excellent work from Nikolas Askitis and Justin Zobel, see here.

Another excellent paper is "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings", from Nikolas Askitis and Ranjan Sinja, see here.

Another excellent overview on cache-conscious data structures, can be found in "Efficient Data Structures For Cache Architectures", PhD thesis from Nikolas Askitis, see here.

The array_hash implementation from Chris Vaszauskas and Tyler Richard was initially released on GitHub, but is no longer available.

Licences

The array_hash implementation from Chris Vaszauskas and Tyler Richard was released under the LGPLv3.

Our array_hash_map_implementation uses methods from Felix von Leitner's excellent dietlibc, which is released under the GPLv2.

The array_hash_map implementation is then released under the GPLv2.

Building/Compiling the array_hash_map

Requirements

Configuring

We are using the cmake build system:

mkdir build && cd build
cmake ..

Then we can compile the array_hash_map and execute some test cases:

make
make test

It is also possible to use

make fast

That command will automatically detect the number of processors and will then pass it to the *-j* switch of make.

Using the array_hash_map

Using the array_hash_map is very simple, just include the header file and you can start adding some values to the data structure:

#include "array_hash_map.hpp"
auto main() -> int
{
 array_hash_map * array = new array_hash_map();
 array->insert("Testvalue", 1);
 array->insert("Anotherone", 2);
 return 0;
}

For the complete array_hash_map interface, just have a look at *array_hash_map.hpp*!

Serialization

Due to the great Boost serialization feature, it is also possible to serialize and deserialize an array_hash_map:

#include "array_hash_map.hpp"
auto main() -> int
{
 array_hash_map array;
 array.insert("Stefan", 1338);
 array.freeze("out.bin");
 array_hash_map array2;
 array2.thaw("out.bin");
 return 0;
}

Benchmarks

As test dataset we use the *distinct string* dataset from the Nikolas Askitis website; it can be found here.

This dataset contains of approx. 28 million words with an approx. uncompressed file size of 300 MB.

Test machine was an AMD FX-8150, 3.11.6 as kernel version and the latest GCC 4.8.2.

Data structure Exec. time
std::map < std::string, uint64_t > 1m2.607s
std::unordered_map < std::string, uint64_t > 0m24.689s
std::vector < std::string > 0m5.600s
array_hash_map < std::string, uint64_t > 0m14.563s

Data structure Max. memory
std::map < std::string, uint64_t > 2.83 GB
std::unordered_map < std::string, uint64_t > 2.66 GB
std::vector < std::string > 1.57 GB
array_hash_map < std::string, uint64_t > 589.5 MB

Documentation

A doxygen documentation is available, just execute:

make fast

In the build folder. A folder called *doc* will be created. Notice: doxygen needs to be installed!

An online documentation is also available, see here.

TODO

Version 1 of our array_hash_map is just a very simple and short *proof-of-concept*. Therefore a 100 % *STL* like interface would be the next step to implement.

Another great feature would be a template/trait like interface, e.g. for using a smaller *value* size.

Contact

Comments, patches, questions, development are highly welcome! Feel free to contact me: stefan at schweter dot eu

Notice: If you are a student at the LMU and you got a GitLab account from the RBG, I can give you access to the repository.

Source

History