NAHT: Not A Hash Table 
Joshua Reich 2006
spam@i2pi.com

SUMMARY
-------
(from naht.h)

/*
** Unlike a hash table, naht has the following properties:
**
** 1. Predefined storage size - no malloc overhead as you grow chains
** 2. No need for linked list traversal (as everything is an array)
** 3. No guarantee that things you insert in the naht will be found later
**
** Because of 1 & 2, there is no way to ensure that we won't over-write
** already inserted data with new data. If you care about the data you
** are storing in the naht, then please use the discard() function to
** deal with this situation.
**
** We use a LRU policy for discarding old data to make room for new.
** We also search most recently added items when searching.
** This makes for a nice DB table cache. Especially nice for a read-
** cache. I also use it for a write cache, using discard() to write
** stuff out as needed.
*/

PRETTY PICTURES
---------------
Here is a fairly unfair test of the speed of NAHT versus a typical hash table
implementation (libghthash). And to accompany this test, please enjoy the
following (relatively meaningless) chart:



(NB: Green means good -- that is, NAHT is faster)

The following chart, in addition to having drop-shadows, has the line in a lovely pinkish hue. While the
graph itself was generated using Excel on a fairly fast Mac, the data was generated by NAHT with a 5MB
data area, on a slow Sun X1.



DESIGN
------
Hash tables are pretty cool. They let you store data, without really knowing the distribution
of the data in advance. Given a good hashing function, and a large enough table, searching 
should be O(1). However, in most cases it approaches O(x/2) where x is the average length of 
the chain which dangles off the end of each hash bucket. To dynamically reduce this length,
a good implementation will keep the bucket fill rate below 70%, or so. This should ensure that
x is close to 1. However, to achieve this, hash table libraries need to periodically re-hash
their contents. This can be slow. Additionally, chains are usually maintained as linked lists.
This adds both space and computation overhead when traversing a chain. Not to mention the
problems they cause for CPU cache hit rates. 

Not A Hash Table is not a hash table implementation, but is very similar. It has some advantages,
and drawbacks. Your milage may vary. 

The key difference in this data structure is that we use a 2D array to store our data. It is 
fair to think of each row in this array as a hash bucket, with the chains extending along each
row. From this point, the operation is quite similar to a hash table. When looking for a piece
of data, you first hash the contents. NAHT does not provide hashing functions, but I personally
use Bob Jenkin's page as a reference (http://burtleburtle.net/bob/hash/doobs.html).

Once the hash is calculated, naht converts this unsigned long to a 'bucket' number. This is done
by bitwise ANDing against one minus the total number of buckets. This total number of buckets
is always of the form 2^b. This avoids an expensive MOD operation. 

Now that we have located our 'bucket', we search for our data item, starting with the item that
was most recently added. We do this as typical access patterns will often search for recently
added items. 

To insert, we have 2 routes. Firstly, if we have yet to use the entire space allocated for the 
chain, we simply add the item to the end of the chain. However, if we have already used all the
space available (the width of the 2D array), we must replace some item with the new one that we
wish to add. In this case, we over-write the least recently added item. This means that in such
cases, if were then to do a search for that replaced item, it would not be found, even though it
was previously stored. 

If you can't afford to lose items that were previously stored, the discard() function is provided.
This function is called whenever we are about to replace an item with a new item, with the to-be-
replaced item as a parameter. You can then use this callback to store the item elsewhere, perhaps
in a database.

This is the major downside to this algorithm. To get optimal performance, you need to understand
the trade off between naht array size (as specified in the mb parameter to naht_init), and the
cost of offline storage.

COMPILATION
-----------
The Makefile provided is purely for demonstration, running 'make' will build some test code. 
To use naht in your code, build a library, or simply copy naht.c and naht.h to your source
directory. 

The only thing to note is for thread saftey, you should add the -DTHREAD_SAFE flag.

If you are not using naht in a multi-threaded environment, its best to leave this flag out.

naht has been tested on:
- Mac OS X (PPC)
- Linux (AMD 64)
- Solaris (SPARC)

USAGE
-----
Read test.c for an example.

OBTAINING
---------
http://i2pi.com/rez/naht/download/

QUOTE
-----
<astrobe> but this is almost exactly how we implemented "my_first_hash_table.cpp" in 2nd year data structures ;)