array_hash_map
implementationThe array_hash_map
is an optimized cache-conscious key-value pair data
structure for storing and retrieving frequency lists. It is based an a simple
hash implementation, where collisions are resolved in a continuous, length-coded
array. This array_hash_map
implementation is largely based on an array_hash
implementation from Chris Vaszauskas and Tyler Richard. This implementation was
extended to store key-value pairs. The main approaches for this work will be
introduced now.
A so called slot consists of following information:
uint32_t
- Slot size field, with is a multiple of the cache-line (in our
implementation it is a multiple of 64). This page-size approach was
chosen in order to reduce the amount of slot resizing function calls
(like in an exact-fit approach).
uint16_t
- Length field, which determines the length of the key.
char
- Char array, includes the string terminator \0.
uint64_t
- Length field, which determines the length of the value.
The hash table has a size of 2^19 slots (524288).
The main approach of building cache-conscious data structures can be found in "Cache-conscious collision resolution for string hash tables" - an excellent work from Nikolas Askitis and Justin Zobel, see it here.
Another excellent paper is "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings" from Nikolas Askitis and Ranjan Sinja. It can be found here.
Another excellent overview on cache-conscious data structures, can be found in "Efficient Data Structures For Cache Architectures". This PhD thesis from Nikolas Askitis can be obtained from here.
The array_hash implementation from Chris Vaszauskas and Tyler Richard was initially released on GitHub, but is no longer available.
The array_hash implementation from Chris Vaszauskas and Tyler Richard was released under the GPLv3.
The basic_unittest
framework from Stefan Schweter is used to build up a nice
test suite. The basic_unittest
framework is licensed under the
GPLv3 and can be found here,
see basic_unittest.
The array_hash_map
implementation is then released under the
GPLv3.
For more information see COPYING
file.
array_hash_map
make
g++
>= 4.7 or clang++
basic_unittest
(optionally, will be cloned when make tests
is executed)It was tested with GCC in version 6.0 and clang++
in version 3.6.1.
Just run make
. Now the array_hash_map
example program and all unittests
will be compiled.
For more information see GNU Makefile
targets section.
make
variablesIt is possible to (re-) define the following variables for the Makefile
:
Variable name | Description | Default value |
---|---|---|
SHELL |
Defines what shell will be used | /bin/bash |
MAKESHELL |
Defines what (make) shell will be used | /bin/bash |
EXEC |
Name of example program | array_hash_map_example |
TEST_EXEC |
Name of uniitests program | array_hash_map_tests |
CXX |
Defines what compiler will be used | g++ |
CXXFLAGS |
Compiler flags for C++ | -std=c++11 -Ofast -Wall -Wextra -Wpedantic |
GIT |
Command for Git | git |
BASIC_UNITTEST_GIT_ADDRESS |
Repository address for basic_unittest |
git://git.schweter.eu/array_hash_map.git |
SILENT |
Silent output of targets | @ |
QUIET |
Quiet output of commands | 2>&1 >/dev/null |
To use another compiler, e.g. clang++
just execute:
make CXX=clang++
Makefile
targetsThe following Makefile
targets are available:
Target name | Description |
---|---|
all |
Builds the array hash map example program and executes all test cases |
bin |
Builds the array hash map example program |
tests |
Checks out the basic_unittest repository and builds all test cases |
clean |
Cleans up all created files like .o or executables |
distclean |
Same as clean target and additionally does a git clean |
The default target is all
- so just simple run make all
or make
.
To execute all unittests just run:
./array_hash_map_tests
Notice: You can change the unittest program name by setting the EXEC
variable during the compilation step:
make EXEC=array_hash_map_unittests
To run the example program just run:
./array_hash_map_example
Notice: You can change the example program name by setting the TEST_EXEC
variable during the compilation step:
make EXEC=example
array_hash_map
Using the array_hash_map
is very simple, just include the header file and you
can start adding some values to the data structure:
#include <iostream>
#include <algorithm>
#include "array_hash_map.hpp"
auto main() -> int {
array_hash_map *array = new array_hash_map();
array->insert("Testvalue", 1);
array->insert("Anotherone", 2);
std::for_each(array->begin(), array->end(),
[](const std::pair<const char *, long unsigned int> i) {
std::cout << i.first << " => " << i.second << std::endl;
});
return 0;
}
For the complete array_hash_map
interface, just have a look at
array_hash_map.hpp
.
array_hash_map
benchmarksAs test dataset the distinct string dataset from the Nikolas Askitis website was used; it can be found here.
This dataset contains of approx. 28 million words with an approx. uncompressed file size of 300 MB.
Test machine was an AMD FX-8150, 3.11.6 as kernel version and gcc
version
4.8.2.
Data structure | Exec. time |
---|---|
std::map < std::string, uint64_t > | 1m2.607s |
std::unordered_map < std::string, uint64_t > | 0m24.689s |
std::vector < std::string > | 0m5.600s |
array_hash_map < std::string, uint64_t > | 0m14.563s |
Data structure | Max. memory |
---|---|
std::map < std::string, uint64_t > | 2.83 GB |
std::unordered_map < std::string, uint64_t > | 2.66 GB |
std::vector < std::string > | 1.57 GB |
array_hash_map < std::string, uint64_t > | 589.5 MB |
Feel free to send your feedback, suggestions, error reports to the author.
Contact information is Stefan Schweter <stefan@schweter.it>
.
All comments are very welcome!
Version 1 of the array_hash_map
is just a very simple and short
proof-of-concept. Therefore a 100 % STL like interface would be the next
step to implement.
Another great feature would be a template/trait like interface, e.g. for using a smaller value size.