Reliable Software Logo

C++ In Action

String Table

A string table maps integers into strings and strings into integers. The first mapping is easily implemented using an array of strings (pointers, or even better, offsets into a buffer). The second one is more difficult to implement efficiently. Our goal is to have constant time access both ways. It means that, independently of how many strings there are in the table (within some reasonable limits), the mappings of an int into a string and a string into an int take, on average, constant time (in the programmer speak O(1)). Notice that in the simplest implementation—going through the list of strings and comparing each with the string to be mapped—the execution takes longer and longer as the list grows (we say, it grows like O(N)). However, there is a data structure called the hash table that can accomplish constant time mapping. We will use it to implement an efficient string table.

With this performance goal in mind, we are ready to design the string table. Since we want to be able to add strings to it, we will need a large buffer to store them. The StringBuffer object stores a string and returns an offset at which the string can be found. This offset, in turn, is stored in another data structure, an array of offsets, that maps integer id’s to string offsets. Each string is thus assigned an id, which is simply the index into the array of offsets.

When adding a new string, the appropriate entry is also added to the hash table. (For the moment let’s treat the hash table as a black box that maps strings into short lists of id’s.)

const int idNotFound = -1;

const int maxStrings = 100;
// String table maps strings to ints
// and ints to strings

class StringTable
      StringTable ();
      int ForceAdd (char const * str);
      int Find (char const * str) const;
      char const * GetString (int id) const;
      HTable         _htab;  // string -> short list of id’s
      int            _offStr [maxStrings]; // id -> offset
      int            _curId;
      StringBuffer   _strBuf; // offset -> string

The translation from an id into a string is done in the GetString method, the translation from the string into an id is done in Find, and the addition of new strings without looking for duplicates is done in ForceAdd.

example 1

Figure 11. The hash table stores indexes into the offset array. The offset array stores indexes into the string buffer. The string buffer stores strings.

StringTable::StringTable ()
    : _curId (0)

int StringTable::ForceAdd (char const * str)
      int len = strlen (str);
      // is there enough space?
      if (_curId == maxStrings || !_strBuf.WillFit (len))
            return idNotFound;
      // point to the place where the string will be stored
      _offStr [_curId] = _strBuf.GetOffset ();
      _strBuf.Add (str);
      // add mapping to hash table
      _htab.Add (str, _curId);
      return _curId - 1;

The hash table (still a black box for us) finds a short list of entries, one of them containing the id of the given string. We search this list in order to find the correct id.

int StringTable::Find (char const * str) const
      // Get a short list from hash table
      List const & list = _htab.Find (str);
      // Iterate over this list
      for (Link const * pLink = list.GetHead ();
           pLink != 0; 
           pLink = pLink->Next ())
            int id = pLink->Id ();
            int offStr = _offStr [id];
            if (_strBuf.IsEqual (offStr, str))
                  return id;
      return idNotFound;

// map integer into string. Must be valid id

char const * StringTable::GetString (int id) const
      assert (id >= 0);
      assert (id < _curId);
      int offStr = _offStr [id];
      return _strBuf.GetString (offStr);

String Buffer

Our string buffer is implemented as an array of characters. We keep copying null terminated strings into this array until we run out of space. The variable _curOffset is an index into the array and it always points at the next free area where a string can be copied. It is initialized to zero, thus pointing at the beginning of the buffer. Before adding a string we make sure that it will fit in the remaining space. Adding a string means copying it to the place where _curOffset points to, and moving _curOffset past the string’s terminating null. GetOffset returns the current offset, which can be later used to access the string about to be copied. IsEqual compares the string at a given offset with a given string, and GetString returns a const string given the offset.

const int maxBufSize=500;

class StringBuffer
      StringBuffer () : _curOffset (0) {}
      bool WillFit (int len) const
            return _curOffset + len + 1 < maxBufSize;
      void Add (char const * str)
            // strcpy (_buf + _curOffset, str);
            strcpy (&_buf [_curOffset], str);
            _curOffset += strlen (str) + 1;
      int GetOffset () const
            return _curOffset;
      bool IsEqual (int offStr, char const * str) const
            // char const * strStored = _buf + offStr;
            char const * strStored = &_buf [offStr];
            // strcmp returns 0 when strings are equal
            return strcmp (str, strStored) == 0;
      char const * GetString (int offStr) const
            //return _buf + offStr;
            return &_buf [offStr];
      char    _buf [maxBufSize];
      int     _curOffset;

example 2

Figure 12. Current offset in the string buffer.

There are two ways of creating a pointer that points to a particular entry in an array. One can take the address of the nth element

p = &arr [n];  // address of the n’th element
or one can use the fact that one can add an integer to a pointer in order to move it n entries ahead
p = arr + n;  // pointer plus integer

Both are correct and produce the same code-- in the example above, pointer equivalents are shown as comments. The array notation seems less confusing.

I also used a generalized assignment operator += that adds the right hand side expression to the left hand side variable. For instance,

n += i;
is equivalent to
n = n + i;

The code produced is of course the same (although originally the += operator was introduced in C as an optimization), but the shorthand remains useful, once you get used to it

Finally, I used some of the string functions from the standard library. The header file that contains their declaration is called cstring. You can find a detailed description of these functions in your compiler’s help.

NextNext: Hash Table