Hashing Language

How do you build a language model with a million dimensions?

The so-called “hashing trick” is a programming technique frequently used in statistical natural language processing for dimensionality reduction. The trick is so elegant and powerful that it would have warranted a Turing Award if the first person to use it had understood its power. John Langford cites a 2008 paper by George Forman & Evan Kirshenbaum that uses the hashing trick, but it may have been discovered even earlier.[1] [2] Surprisingly, most online tutorials and explanations of the hashing trick gloss over the main insights or get buried in notation. At the time of this writing, the Wikipedia entry on the hashing trick contains blatant errors.[3] Hence this post.

Hash, Man

A hash function is a programming routine that translates arbitrary data into a numeric representation. Hash functions are convenient and useful for a variety of purposes, such as lookup tables (dictionaries) and cryptography, in addition to our hashing trick. An example of a (poor) hash function would map the letter “a” to 1, “b” to 2, “c” to 3 and so on, up to “z” being 26, and then sum up the numbers for all the letters in the input. For the Benjamin Franklin quote “beware the hobby that eats” we get the following hash function output:

(beware) 2 + 5 + 23 + 1 + 18 + 5 +
(the) 20 + 8 + 5 +
(hobby) 8 + 15 + 2 + 2 + 25 +
(that) 20 + 8 + 1 + 20 +
(eats) 5 + 1 + 20 + 19
= 233

Any serious hashing function will limit the range of numbers it outputs. The hashing function we used on Benjamin Franklin could simply keep the last two digits of its sum, the “modulo 100” in programming terms, and provide that lower number as its output. So in this case, the leading digit of 233 would be lopped off, and the hash function would return just 33. We have a blunt quantitative representation or mapping of the input that is hopefully useful in a statistical model. The range of this hashing function is therefore 100 values, 0 to 99.
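Written out as Python, that toy hash might look like this (a minimal sketch; the function name and the modulo-100 range simply follow the example above):

```python
import string

def toy_hash(text, buckets=100):
    """Sum letter positions (a=1 ... z=26), then keep only the last two digits."""
    total = sum(string.ascii_lowercase.index(ch) + 1
                for ch in text.lower()
                if ch in string.ascii_lowercase)
    return total % buckets

print(toy_hash("beware the hobby that eats"))  # 233 % 100 -> 33
```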

Now a big reason to choose one hashing function over another is the statistical distribution of the output across the function’s range, or uniformity. If you imagine feeding a random quote, music lyric, blog post or tweet into a good hashing function, the chance of the output being any specific value in the range should be the same as every other possible output. For our hashing function with a 0-99 range, the number 15 should be output about 1% of the time, just like every other number between 0 and 99. Note that our letter-summing hash function above does not have good uniformity, and so you should not use it in the wild. As an aside, keep in mind that certain hash functions are more uniform on bigger input data, or vice-versa.

Another reason to favor one hashing function over another is whether or not a small change in the input produces a big change in the output. I call this concept cascading. If we tweak the Benjamin Franklin quote a little bit and feed “beware the hobby that bats” into our silly hash function, the sum is now 230, which gets lopped off to 30 within the hash’s output range. This modest change in output, from 33 to 30, is another sign that our toy hash function is indeed just a toy. A small change in the input data did not cascade into a big change in the output number.
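For contrast, here is the same 0-99 bucketing on top of a hash with good cascading behavior. MD5 from Python’s standard hashlib is used purely as a familiar stand-in; any decent hash would do:

```python
import hashlib

def md5_bucket(text, buckets=100):
    """Interpret the MD5 digest as a big integer, then reduce it to 0..buckets-1."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % buckets

print(md5_bucket("beware the hobby that eats"))
print(md5_bucket("beware the hobby that bats"))
# The one-letter change almost certainly sends the two quotes to unrelated buckets,
# unlike the toy hash above, which only drifted from 33 to 30.
```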

The important point here is that a good hashing function will translate your input into each number in its output range with the same probability (uniformity), and a small change in your input data will cause a big change in the output (cascading).

That’s Just Zipf-y

In human languages, very few words are used very frequently while very many words are very rare. For example, the word “very” turns up more than the word “rosebud” in this post. This relationship between a word’s rank and its frequency is highly non-linear. The 25th most common word in the English language (“from”) is not used just a little more frequently than the 26th most common word (“they”); it is used much more frequently.

This distribution of words is called Zipf’s Law. If you choose a random word from a random page in the Oxford English Dictionary, chances are that word will be used very rarely in your data. Similarly if you were to choose two words from the OED, chances are both of those words will not be common.
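If you want to see this shape for yourself, a few lines of Python will do. The file name below is a placeholder for whatever large English text you have on hand:

```python
from collections import Counter

# "corpus.txt" is a placeholder: any large English text will show the same shape.
with open("corpus.txt", encoding="utf-8") as f:
    counts = Counter(f.read().lower().split())

for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>2}  {word:<12} {freq}")
# Zipf's Law predicts frequency falling off roughly as 1/rank:
# the top word appears about twice as often as the second, three times the third, ...
```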

The Trick

If you are doing “bag-of-words” statistical modeling on a large corpus of English documents, it is easy to find yourself accommodating thousands or millions of distinct words or ngrams. For example, the classic 20 newsgroup corpus from Ken Lang contains over 61,000 different single words, and far more two-word bigrams. Training a traditional statistical model with 61,000 independent variables or dimensions is computationally expensive, to say the least. We can slash the dimensionality of a bag-of-words model by applying Zipf’s Law and using a decent hashing function.

First we identify a hashing function with an output range that matches the dimensionality we wish the data had. Our silly hashing function above outputs a number from 0 to 99, so its range is 100. Using this function with the hashing trick means our statistical bag-of-words model will have a dimensionality of 100. Practically speaking, we usually sit atop an existing high-quality hashing function and use just a few of the least significant bits of its output. And for computational reasons, we usually choose a power of two as our hash function output range and desired dimensionality, so lopping off the most significant bits can be done with a fast bitwise AND.
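As a sketch of that adaptation, here is one way to keep only the least significant bits of an existing hash. MD5 again stands in for whatever high-quality hash you already trust; real systems typically reach for a much faster non-cryptographic hash:

```python
import hashlib

BITS = 8                 # desired dimensionality = 2^8 = 256
MASK = (1 << BITS) - 1   # 255, i.e. eight one-bits

def bucket(ngram):
    """Keep only the BITS least significant bits of a full-width hash."""
    full = int.from_bytes(hashlib.md5(ngram.encode("utf-8")).digest(), "big")
    return full & MASK   # fast bitwise AND instead of a general modulo

print(bucket("rosebud"))  # some bucket in 0..255
```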

Then we run every word or ngram in the training data through our adapted hashing function. The output of the hash becomes our feature, a column index or dimension number. So if we choose 2^8 (two to the power of eight, or 256) as our hashing function’s range and the next ngram has a hash of 23, then we set our 23rd independent variable to the frequency count (or whatever) of that word. If the next hash is the number 258, the bitwise AND with 255 keeps only the low eight bits and maps it to bucket 2, or 258 MOD 256 = 2 more mathematically. Our statistical NLP model of the 20 newsgroup corpus suddenly goes from 61,000 to only 256 dimensions.
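Putting the pieces together, a rough sketch of a 256-dimension hashed bag-of-words vectorizer might look like this (the tokenization is deliberately naive and the names are just for illustration):

```python
import hashlib

BITS = 8
DIM = 1 << BITS          # 256 buckets

def hashed_bag_of_words(document):
    """Map a document to a DIM-dimensional term-frequency vector via the hashing trick."""
    vector = [0] * DIM
    for token in document.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        vector[h & (DIM - 1)] += 1   # bucket index = low eight bits of the hash
    return vector

v = hashed_bag_of_words("beware the hobby that eats beware the bats")
print(len(v), sum(v))    # 256 dimensions, 8 tokens counted
```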

Wait a Sec’…!

Hold on, that cannot possibly work… If we use the numeric hash of a word, phrase or ngram as an index into our training data matrix, we are going to run into too many dangerous hash collisions, right?

A hash collision occurs when two different inputs hash to the same output number. But remember that since we are using a good hashing function, the uniformity and cascading properties make the chance of a hash collision between any two words independent of how frequently those words are used. Read that last sentence again, because it is a big one.

The pairs of words “from” & “rosebud” and “from” & “they” each have the same chance of hash collision, even though the frequency with which the four words turn up in English varies widely. Any pair of words chosen at random from the OED has the same chance of hash collision. However, Zipf’s Law says that if you choose any two words randomly from the OED, chances are one of the words will be very rare in any corpus of English language documents. Actually, both words will probably be infrequent. Therefore, if a collision in our hash function’s output occurs, the two colliding words are probably oddballs.
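A quick simulation makes this concrete. The vocabulary below is synthetic, with made-up Zipf-like frequencies, so the exact numbers mean nothing; the point is how few colliding pairs involve two common words:

```python
import hashlib
from collections import defaultdict

DIM = 256

# Synthetic Zipfian vocabulary: the i-th word has frequency proportional to 1/i.
vocab = {f"word{i}": 1_000_000 // i for i in range(1, 5001)}

buckets = defaultdict(list)
for word in vocab:
    h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest(), "big")
    buckets[h & (DIM - 1)].append(word)

common = set(sorted(vocab, key=vocab.get, reverse=True)[:100])  # the 100 most frequent words
pairs = sum(len(ws) * (len(ws) - 1) // 2 for ws in buckets.values())
both_common = sum(1
                  for ws in buckets.values()
                  for i, a in enumerate(ws)
                  for b in ws[i + 1:]
                  if a in common and b in common)
print(f"{pairs} colliding pairs, of which only {both_common} involve two common words")
```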

Two Reasons it Still Works

Statistical NLP bag-of-words models that use the hashing trick have roughly the same accuracy as models that operate on the full bag-of-words dimensionality. There are two reasons why hash collisions in the low-dimensional space of the hash function’s output range do not trash our models. First, any collisions that do occur probably occur between two rare words. In many models, rare words do not improve the model’s regression / classification accuracy and robustness. Rare words and ngrams are said to be non-discriminatory. Second, even if rare words are discriminatory in your problem domain, probability suggests the two words will not co-occur in the same document. For this reason, the two rare words can be thought of as “sharing” the same representation in the model, whether that is a decision tree sub-tree or a coefficient in a linear model. The Forman & Kirshenbaum paper says “a colliding hash is either unlikely to be selected as a feature (since both words are infrequent) or will almost always represent the word that led the classifier to select it.”

We cannot use the hashing trick for dimensionality reduction in every statistical model; it relies on sparsity. Zipf’s Law means most features or independent variables in a bag-of-words representation equal zero. In other words, a point in the dimensional space of the bag-of-words (a “word vector”) is generally sparse. Along these lines, John Langford says the hashing trick “preserves sparsity.” For any specific word, the chance of two random examples both having a non-zero value for that feature is low. Again, this is because most words are rare.
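To see the sparsity argument in code, count the non-zero features of a single document before and after hashing; the count barely changes, only the number of columns does:

```python
import hashlib

DIM = 256
doc = "beware the hobby that eats"
tokens = set(doc.lower().split())

# Before hashing: one non-zero feature per distinct token,
# in a vocabulary-sized space (61,000 columns for the 20 newsgroup corpus).
print(len(tokens), "non-zero features in the full bag-of-words space")

# After hashing: at most one non-zero bucket per distinct token, in a 256-column space.
hashed = {int.from_bytes(hashlib.md5(t.encode("utf-8")).digest(), "big") & (DIM - 1)
          for t in tokens}
print(len(hashed), "non-zero features in the", DIM, "dimensional hashed space")
```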

The hashing trick is Zipf’s Law coupled with the uniformity & cascading properties of a good hash function, and using these to reduce the dimensionality of a sparse bag-of-words NLP model.

Notes

[1] Actually, the first public use of the hashing trick that John Langford knew of was in the first release of Vowpal Wabbit, back in 2007. He also points out that the hashing trick enables very efficient quadratic features to be added to a model.
[2] Jeshua Bratman pointed out that the Sutton & Barto classic textbook on reinforcement learning mentions the hashing trick way back in 1998. This is the earliest reference I have yet found.
[3] “the number of buckets [hashing function output range] M usually exceeds the vocabulary size significantly” from the “Hashing-Trick” Wikipedia entry, retrieved on 2013-01-29.

Sequential Learning Book

Things have been quiet around here since the winter because I have been focusing my modest writing and research skills on a new book for O’Reilly. We signed the contract a few days ago, so now I get to embrace a draconian authorship schedule over the next year. The book is titled Sequential Machine Learning and will focus on data mining techniques that train on petabytes of data. That is, far more training data than can fit in the memory of your entire Hadoop cluster.

Sequential machine learning algorithms do this by guaranteeing a constant memory footprint and constant processing requirements. This ends up being an eyes-wide-open compromise: some accuracy and non-linearity traded for serious scalability. Another perk is trivial out-of-sample model validation. The Vowpal Wabbit project by John Langford will be the book’s featured technology, and he has graciously offered to help out with a foreword. Therefore the book will also serve as a detailed tutorial on using Vowpal Wabbit, for those of us who are more coder or hacker than statistician or academic.

The academic literature often uses the term “online learning” for this approach, but I find that term way too confusing given what startups like Coursera and Khan Academy are doing. (Note the terminology at the end of Hilary Mason’s excellent Bacon talk back in April.) So, resident O’Reilly geekery-evangelist Mike Loukides and I are going to do a bit of trailblazing with the more descriptive term “sequential.” Bear with us.

From the most basic principles, I will build up the foundation for statistical modeling. Several chapters assume statistical natural language processing as a typical use case, so sentiment analysis experts and Twitter miners should also have fun. My readers should know computers but need not be mathematicians. Although I have insisted that the O’Reilly toolstack support a few of my old-school LaTeX formulas…