Notes on Collisions in a Common String Hashing Function

Consider the following simple string hashing function suitable for use with open hashing structures that incorporate some other method for collision resolution, such as chaining. In C:

    
      enum
      {
              MULTIPLIER = 31         /* Note: 31 is prime. */
      };

      unsigned int
      hash(char *str, size_t tablesize)
      {
              unsigned int h;
              char *p;

              p = str;
              if (p == NULL)
                      p = "";
              h = 0;
              while (*p != '\0')
                      h = h * MULTIPLIER + *p++;

              return (h % tablesize);
      }
    
  

It is known that, for performance reasons, it is best to require that MULTIPLIER and the table size be relatively prime; but why?

The reason is distribution of entries in the table. In particular, one would like to spread them as evenly possible in order to minimize the collisions. Ideally, one would like a situation where, assuming "sufficiently random" data, entries are evenly distributed across every "slot" in the table.

The easiest way to do this is to make the multiplier and hash table size relatively prime (meaning they share no common factors, or that if $n$ is the table size is a positive integer greater than 1 and $m$ is the multiplier is likewise a positive integer greater than 1, then $\gcd(n, m) = 1$, where $\gcd(n, m)$ is the positive greatest common divisor of $n$ and $m$. And the easiest way to achieve that is is to make one or both of $n$ and $m$ prime numbers.

But why does the relative primality of $n$ and $m$ matter?

Suppose that we are using open hashing with the hash function at the top of this document coupled with some other method for collision resolution (e.g., chaining). Let us further assume that $n > m$. Next, recall that $a \pmod n = r$ (for $a$, $n$ positive integers) means that for some integer $s$, there exists an integer $r$ so that $a = ns + r$. Now, let $d = \gcd(n, m) \geq 1$. Given an input datum consisting of the string $a_1 a_2 a_3 \dots a_k$, where $0 \leq a_k \leq B$ for some constant positive integer $B$, the hash becomes the nonnegative integer,

    
      \[
          r = \left( \sum^k_{i = 1} m a_i \right) \mod n
      \]
    
  

Let $u$ be the sum over $i = 1$ to $k$ of $a_i$; then the above becomes:

    
      \[
          r = (mu) \mod n
      \]
    
  

since $m$ is constant with respect to the sum, and thus can be factored out.

But, since $m$ has $d$ as a factor, we know there exists some positive integer $q$ such that $m = dq$ and thus the hash is equivalent to $r = (dqu) \mod n$ which implies that

    
      \[
          dqu = sn + r \implies dqu - r = sn \implies n \mid (dqu - r)
      \]
    
  

where $a \mid b$ stands for "$a$ divides $b$" (or $b$ has $a$ as a factor), and $s$ is as before.

But, recalling that $n$ also has $d$ as a factor, we see that there exists some positive integer $t$ such that $n = dt$. Then, $n \mid (dqu - r) \implies d \mid (dqu - r)$, and,

    
      \[
          \frac{sn}{d} = \frac{sdt}{d} = st = \frac{(dqu - r)}{d} =
              \frac{dqu}{d} - \frac{r}{d}
      \]
    
  

is an integer by closure of multiplication (since $s$ and $t$ are integers), cancellation, substitution and distribution in the integers. But, clearly $dqu/d = qu$ is an integer by cancellation, and since $st$ is an integer, then by addition we see that $r/d = du - st$ is also an integer and hence $d \mid r$. Thus, every hash value must be a multiple of $d$.

Thus, when $n$ and $m$ are not relatively prime and $d > 1$ only every $d$th slot in the table will be used. It follows that every slot in the table that has index not an integer multiple of $d$ is unused, decreasing the effectiveness of the hash. For example, consider the case where $n = 8$ and $m = 4$. Here, $\gcd(n, m) = 4$ and only the $0$th and $4$th slots in the table will be used, and thus the table is only as effective as one with two slots, regardless of the data we store.

On the other hand, if $n$ and $m$ are relatively prime, then by definition, $\gcd(n, m) = 1$ and every slot in the table will be used (for sufficiently random input).

The parenthetical in the last paragraph implies that we assume the data we are hashing cooperates and is not adversarial or pathological. For instance, if the hashed data is always some multiple of the modulus when shifted and summed, we would end up with the previously described behavior. One can try and ameliorate the effects of such behavior by careful choice of collision resolution techniques, using double hashing with different algorithms, etc. It gets complicated quickly.

Note however that just setting the multiplier and table size to be relatively prime is not enough to ensure good performance. Suppose that multiplier is a power of two, and the table size is a prime number. Clearly these are relatively prime, but then the multiplication is equivalent to an arithmetic left shift and, for sufficiently long input strings, the bits contributed by the characters at the beginning of the string would be shifted out of the result. Now suppose some set of inputs shares a common suffix that is sufficiently long to induce this behavior: every input value from that set would be assigned to the same slot in the table, regardless of prefix. Choosing the multiplier carefully is critical to producing well-distributed hash values regardless of input string length.