The array_hash_map is an optimized cache-conscious key-value pair in-memory data structure for storing and retrieving frequency lists.
It is based an a simple hash implementation, where collisions are resolved in a continuous, length-coded array. This array_hash_map implementation is largely based on an array_hash implementation from Chris Vaszauskas and Tyler Richard.
We extend this implementation to store key-value pairs. The main approaches for this work will be introduced now.
A so called *slot* consists of following information:
The main approach of building cache-conscious data structures can be found in "Cache-conscious collision resolution for string hash tables" - an excellent work from Nikolas Askitis and Justin Zobel, see here.
Another excellent paper is "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings", from Nikolas Askitis and Ranjan Sinja, see here.
Another excellent overview on cache-conscious data structures, can be found in "Efficient Data Structures For Cache Architectures", PhD thesis from Nikolas Askitis, see here.
The array_hash implementation from Chris Vaszauskas and Tyler Richard was initially released on GitHub, but is no longer available.
The array_hash implementation from Chris Vaszauskas and Tyler Richard was released under the LGPLv3.
Our array_hash_map_implementation uses methods from Felix von Leitner's excellent dietlibc, which is released under the GPLv2.
The array_hash_map implementation is then released under the GPLv2.
We are using the cmake build system:
Then we can compile the array_hash_map and execute some test cases:
It is also possible to use
That command will automatically detect the number of processors and will then pass it to the *-j* switch of make.
Using the array_hash_map is very simple, just include the header file and you can start adding some values to the data structure:
For the complete array_hash_map interface, just have a look at *array_hash_map.hpp*!
Due to the great Boost serialization feature, it is also possible to serialize and deserialize an array_hash_map:
As test dataset we use the *distinct string* dataset from the Nikolas Askitis website; it can be found here.
This dataset contains of approx. 28 million words with an approx. uncompressed file size of 300 MB.
Test machine was an AMD FX-8150, 3.11.6 as kernel version and the latest GCC 4.8.2.
Data structure | Exec. time |
---|---|
std::map < std::string, uint64_t > | 1m2.607s |
std::unordered_map < std::string, uint64_t > | 0m24.689s |
std::vector < std::string > | 0m5.600s |
array_hash_map < std::string, uint64_t > | 0m14.563s |
Data structure | Max. memory |
---|---|
std::map < std::string, uint64_t > | 2.83 GB |
std::unordered_map < std::string, uint64_t > | 2.66 GB |
std::vector < std::string > | 1.57 GB |
array_hash_map < std::string, uint64_t > | 589.5 MB |
A doxygen documentation is available, just execute:
In the build folder. A folder called *doc* will be created. Notice: doxygen needs to be installed!
An online documentation is also available, see here.
Version 1 of our array_hash_map is just a very simple and short *proof-of-concept*. Therefore a 100 % *STL* like interface would be the next step to implement.
Another great feature would be a template/trait like interface, e.g. for using a smaller *value* size.
Comments, patches, questions, development are highly welcome! Feel free to contact me: stefan at schweter dot eu
Notice: If you are a student at the LMU and you got a GitLab account from the RBG, I can give you access to the repository.