Tuesday 14 July 2015

TRIE data structure

1.    As a replacement for other data structures
As discussed below, a trie has a number of advantages over binary search trees.A trie can also be used to replace a hash table, over which it has the following advantages:
  • Looking up      data in a trie is faster in the worst case, O(m) time (where m is the      length of a search string), compared to an imperfect hash table. An      imperfect hash table can have key collisions. A key collision is the hash      function mapping of different keys to the same position in a hash table.      The worst-case lookup speed in an imperfect hash table is O(N) time, but      far more typically is O(1), with O(m) time spent evaluating the hash.
  • There are      no collisions of different keys in a trie.
  • Buckets in      a trie, which are analogous to hash table buckets that store key      collisions, are necessary only if a single key is associated with more      than one value.
  • There is      no need to provide a hash function or to change hash functions as more      keys are added to a trie.
  • A trie can      provide an alphabetical ordering of the entries by key.
2.    Dictionary representation
A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries; however, if storing dictionary words is all that is required (i.e. storage of information auxiliary to each word is not required), a minimal deterministic acyclic finite state automaton would use less space than a trie. This is because an acyclic deterministic finite automaton can compress identical branches from the trie which correspond to the same suffixes (or parts) of different words being stored.
Tries are also well suited for implementing approximate matching algorithms, including those used in spell checking and hyphenation software.
3.     Algorithms
We can describe lookup (and membership) easily. Given a recursive trie type, storing an optional value at each node, and a list of children tries, indexed by the next character (here, represented as a Haskell data type):
 import Prelude hiding (lookup) import Data.Map (Map, lookup)  data Trie a = Trie { value    :: Maybe a,                      children :: Map Char (Trie a) }
We can look up a value in the trie as follows:
 find :: String -> Trie a -> Maybe a find []     t = value t find (k:ks) t = do   ct <- lookup k (children t)   find ks ct
In an imperative style, and assuming an appropriate data type in place, we can describe the same algorithm in Python (here, specifically for testing membership). Note that childrenis a map of a node's children; and we say that a "terminal" node is one which contains a valid word.
def find(node, key):    for char in key:        if char not in node.children:            return None        else:            node = node.children[char]    return node.value
Insertion proceeds by walking the trie according to the string to be inserted, then appending new nodes for the suffix of the string that is not contained in the trie. In imperative pseudocode,
algorithm insert(root : node, s : string, value : any):    node = root    i    = 0    n    = length(s)     while i < n:        if node.child(s[i]) != nil:            node = node.child(s[i])            i = i + 1        else:            break     (* append new nodes, if necessary *)    while i < n:        node.child(s[i]) = new node        node = node.child(s[i])        i = i + 1     node.value = value4.     Sorting
Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows:
  • Insert all      keys in a trie.
  • Output all      keys in the trie by means of pre-order traversal, which results in output      that is in lexicographically increasing order. Pre-order traversal is a      kind of depth-first traversal.
This algorithm is a form of radix sort.
A trie forms the fundamental data structure of Burstsort, which (in 2007) was the fastest known string sorting algorithm.[  However, now there are faster string sorting algorithms. 
5.     Full text search
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
6.     Bitwise tries
Bitwise tries are much the same as a normal character based trie except that individual bits are used to traverse what effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly find the first set bit in a fixed length key (e.g. GCC's __builtin_clz() intrinsic). This value is then used to index a 32- or 64-entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search then proceeds by testing each subsequent bit in the key and choosing child[0] or child[1] appropriately until the item is found.
Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register dependencies and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black tree for example performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and TLB stalls on modern CPUs which makes that algorithm bound by memory latency rather than CPU speed. In comparison, a bitwise trie rarely accesses memory and when it does it does so only to read, thus avoiding SMP cache coherency overhead, and hence is becoming increasingly the algorithm of choice for code which does a lot of insertions and deletions such as memory allocators (e.g. recent versions of the famous Doug Lea's allocator (dlmalloc) and its descendents).
7.     Compressing tries
When the trie is mostly static, i.e. all insertions or deletions of keys from a prefilled trie are disabled and only lookups are needed, and when the trie nodes are not keyed by node specific data (or if the node's data is common) it is possible to compress the trie representation by merging the common branches. This application is typically used for compressing lookup tables when the total set of stored keys is very sparse within their representation space.
For example it may be used to represent sparse bitsets (i.e. subsets of a much larger fixed enumerable set) using a trie keyed by the bit element position within the full set, with the key created from the string of bits needed to encode the integral position of each element. The trie will then have a very degenerate form with many missing branches, and compression becomes possible by storing the leaf nodes (set segments with fixed length) and combining them after detecting the repetition of common patterns or by filling the unused gaps.
Such compression is also typically used in the implementation of the various fast lookup tables needed to retrieve Unicode character properties (for example to represent case mapping tables, or lookup tables containing the combination of base and combining characters needed to support Unicode normalization). For such application, the representation is similar to transforming a very large unidimensional sparse table into a multidimensional matrix, and then using the coordinates in the hyper-matrix as the string key of an uncompressed trie. The compression will then consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the key; each dimension of the hypermatrix stores the start position within a storage vector of the next dimension for each coordinate value, and the resulting vector is itself compressible when it is also sparse, so each dimension (associated to a layer level in the trie) is compressed separately.
Some implementations do support such data compression within dynamic sparse tries and allow insertions and deletions in compressed tries, but generally this has a significant cost when compressed segments need to be split or merged, and some tradeoff has to be made between the smallest size of the compressed trie and the speed of updates, by limiting the range of global lookups for comparing the common branches in the sparse trie.
The result of such compression may look similar to trying to transform the trie into a directed acyclic graph (DAG), because the reverse transform from a DAG to a trie is obvious and always possible, however it is constrained by the form of the key chosen to index the nodes.
Another compression approach is to "unravel" the data structure into a single byte array.[ This approach eliminates the need for node pointers which reduces the memory requirements substantially and makes memory mapping possible which allows the virtual memory manager to load the data into memory very efficiently.
Another compression approach is to "pack" the trie. Liang describes a space-efficient implementation of a sparse packed trie applied to hyphenation, in which the descendants of each node may be interleaved in memo