How do you build a language model with a million dimensions?
The so-called “hashing trick” is a programming technique frequently used in statistical natural language processing for dimensionality reduction. The trick is so elegant and powerful that it would have warranted a Turing Award, had the first person to use it understood its power. John Langford cites a paper by George Forman & Evan Kirshenbaum from 2008 that uses the hashing trick, but it may have been discovered even earlier.[1] [2] Surprisingly, most online tutorials and explanations of the hashing trick gloss over the main insights or get buried in notation. At the time of this writing, the Wikipedia entry on the hashing trick contains blatant errors.[3] Hence this post.
Hash, Man
A hash function is a programming routine that translates arbitrary data into a numeric representation. Hash functions are convenient and useful for a variety of purposes, such as lookup tables (dictionaries) and cryptography, in addition to our hashing trick. An example of a (poor) hash function would map the letter “a” to 1, “b” to 2, “c” to 3 and so on, up to “z” being 26, and then sum up the numbers represented by the letters. For the Benjamin Franklin quote “beware the hobby that eats” we get the following hash function output:
(beware) 2 + 5 + 23 + 1 + 18 + 5 +
(the) 20 + 8 + 5 +
(hobby) 8 + 15 + 2 + 2 + 25 +
(that) 20 + 8 + 1 + 20 +
(eats) 5 + 1 + 20 + 19
= 233
Any serious hashing function will limit the range of numbers it outputs. The hashing function we used on Benjamin Franklin could simply take the last two digits of its sum (“modulo 100” in programming terms) and provide that smaller number as its output. So in this case, the number 233 would be lopped-off, and the hash function would return just 33. We have a blunt quantitative representation or mapping of the input that is hopefully useful in a statistical model. The range of this hashing function is therefore 100 values, 0 to 99.
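Here is the toy letter-summing hash as a few lines of Python, just to make the arithmetic above concrete (a sketch only; as noted below, you would never use this hash in practice):

def toy_hash(text, modulo=100):
    # Sum a=1 ... z=26 over every letter, ignore spaces and punctuation,
    # then lop off everything but the last two digits (modulo 100).
    total = sum(ord(c) - ord('a') + 1 for c in text.lower() if c.isalpha())
    return total % modulo

print(toy_hash("beware the hobby that eats"))  # 233 % 100 = 33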
Now a big reason to choose one hashing function over another is the statistical distribution of the output across the function’s range, or uniformity. If you imagine feeding in a random quote, music lyric, blog post or tweet into a good hashing function, the chance of the output being any specific value in the range should be the same as every other possible output. For our hashing function with a 0-99 range, the number 15 should be output about 1% of the time, just like every other number between 0 and 99. Note that our letter-summing hash function above does not have good uniformity, and so you should not use it in the wild. As an aside, keep in mind that certain hash functions are more uniform on bigger input data, or vice-versa.
Another reason to favor one hashing function over another is whether or not a small change in the input produces a big change in the output. I call this concept cascading. If we tweak the Benjamin Franklin quote a little bit and feed “beware the hobby that bats” into our silly hash function, the sum is now 230, which gets lopped-off to 30 within the hash’s output range. This modest change in output from 33 to 30 is another sign that our toy hash function is indeed just a toy. A small change in the input data did not cascade into a big change in the output number.
The important point here is that a good hashing function will translate your input into each number in its output range with the same probability (uniformity), and a small change in your input data will cause a big change in the output (cascading).
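If you want to see both properties for yourself, a quick sanity check along these lines works, here using Python’s built-in MD5 as a stand-in for a “good” hash (fine for this purpose, even though MD5 is no longer suitable for cryptography):

import hashlib, random, string
from collections import Counter

def md5_bucket(text, buckets=100):
    # Interpret the MD5 digest as a big integer and fold it into 0..buckets-1.
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % buckets

# Uniformity: hash many random strings and check the 100 buckets fill evenly.
counts = Counter(md5_bucket("".join(random.choices(string.ascii_lowercase, k=20)))
                 for _ in range(100000))
print(min(counts.values()), max(counts.values()))  # both should sit near 1000

# Cascading: one changed letter flips roughly half of the 128 digest bits.
a = int(hashlib.md5(b"beware the hobby that eats").hexdigest(), 16)
b = int(hashlib.md5(b"beware the hobby that bats").hexdigest(), 16)
print(bin(a ^ b).count("1"))  # typically around 64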
That’s Just Zipf-y
In human languages, very few words are used very frequently while very many words are very rare. For example, the word “very” turns up more than the word “rosebud” in this post. The relationship between a word’s rank and its frequency is strongly non-linear, or convex. This means that the 25th most common word in the English language (“from”) is not used just a little more frequently than the 26th most common word (“they”), but much more frequently.
This distribution of words is called Zipf’s Law. If you choose a random word from a random page in the Oxford English Dictionary, chances are that word will be used very rarely in your data. Similarly, if you were to choose two words from the OED, chances are neither of those words will be common.
The Trick
If you are doing “bag-of-words” statistical modeling on a large corpus of English documents, it is easy to find yourself accommodating thousands or millions of distinct words or ngrams. For example, the classic 20 Newsgroups corpus from Ken Lang contains over 61,000 distinct single words, and many times more two-word bigrams. Training a traditional statistical model with 61,000 independent variables or dimensions is computationally expensive, to say the least. We can slash the dimensionality of a bag-of-words model by applying Zipf’s Law and using a decent hashing function.
First we identify a hashing function with an output range that matches the dimensionality we wish the data had. Our silly hashing function above outputs a number from 0 to 99, so its range is 100. Using this function with the hashing trick means our statistical bag-of-words model will have a dimensionality of 100. Practically speaking, we usually sit atop an existing high-quality hashing function and use just a few of the least significant bits of its output. And for computational reasons, we usually choose a power of two as our hash function output range and desired dimensionality, so lopping-off the most significant bits can be done with a fast bitwise AND.
Then we run every word or ngram in the training data through our adapted hashing function. The output of the hash becomes our feature, a column index or dimension number. So if we choose 2^8 (two to the power of eight, i.e. 256) as our hashing function’s range and the next ngram has a hash of 23, then we set our 23rd independent variable to the frequency count (or whatever) of that ngram. If the next hash is the number 258, keeping only the low eight bits maps it to the 2nd dimension, since 258 AND 255 = 258 MOD 256 = 2. Our statistical NLP model of the 20 Newsgroups corpus suddenly goes from 61,000 to only 256 dimensions.
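In code, the whole trick is only a few lines. A minimal sketch, assuming whitespace tokenization and single words rather than ngrams, and again leaning on MD5 purely because it is built into Python:

import hashlib

DIMENSIONS = 2 ** 8  # 256, a power of two so the lop-off is a fast bitwise AND

def bucket(token):
    digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
    return digest & (DIMENSIONS - 1)  # keep the low 8 bits, i.e. digest mod 256

def hashed_bag_of_words(document):
    # Map a document straight to a 256-dimensional count vector --
    # no vocabulary, no word-to-index dictionary, no second pass over the data.
    vector = [0] * DIMENSIONS
    for token in document.lower().split():
        vector[bucket(token)] += 1
    return vector

features = hashed_bag_of_words("beware the hobby that eats the hobby")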
Wait a Sec’…!
Hold on, that cannot possibly work… If we use the numeric hash of a word, phrase or ngram as an index into our training data matrix, we are going to run into too many dangerous hash collisions, right?
A hash collision occurs when two different inputs hash to the same output number. Remember, though, that since we are using a good hashing function, the uniformity and cascading properties make the chance of a hash collision between any two words independent of how frequently those words are used. Read that last sentence again, because it is a big one.
The pair of words “from” & “rosebud” and the pair “from” & “they” each have the same chance of hash collision, even though the frequency with which the four words turn up in English varies widely. Any pair of words chosen at random from the OED has the same chance of hash collision. However, Zipf’s Law says that if you choose any two words randomly from the OED, chances are one of the words will be very rare in any corpus of English language documents. Actually, both words will probably be infrequent. Therefore if a collision in our hash function’s output occurs, the two colliding words are probably oddballs.
Two Reasons it Still Works
Statistical NLP bag-of-words models that use the hashing trick have roughly the same accuracy as models that operate on the full bag-of-words dimensionality. There are two reasons why hash collisions in the low-dimensional space of the hash function’s output range do not trash our models. First, any collisions that do occur probably occur between two rare words. In many models, rare words do not improve the model’s regression / classification accuracy and robustness. Rare words and ngrams are said to be non-discriminatory. Even if rare words are discriminatory in your problem domain, probability suggests the rare words do not co-occur in the same document. For this reason, the two rare words can be thought of as “sharing” the same representation in the model, whether that is a decision tree sub-tree or a coefficient in a linear model. The Forman & Kirshenbaum paper says “a colliding hash is either unlikely to be selected as a feature (since both words are infrequent) or will almost always represent the word that led the classifier to select it.”
The second reason is sparsity, which is also why we cannot use the hashing trick for dimensionality reduction in every statistical model. Zipf’s Law means most features or independent variables in a bag-of-words representation equal zero. In other words, a point in the dimensional space of the bag-of-words (a “word vector”) is generally sparse. Along these lines, John Langford says the hashing trick “preserves sparsity.” For any specific word, the chance of two random examples both having a non-zero value for that feature is low. Again, this is because most words are rare.
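To see the sparsity point concretely, the same sketch with a dictionary instead of a dense list shows that the hashed vector never has more non-zero entries than the document has distinct words (this reuses the hypothetical bucket() helper from the sketch above):

def hashed_bag_of_words_sparse(document):
    # Store only the non-zero buckets; collisions can merge entries,
    # but they can never create new ones, so sparsity is preserved.
    counts = {}
    for token in document.lower().split():
        b = bucket(token)
        counts[b] = counts.get(b, 0) + 1
    return counts

doc = "beware the hobby that eats"
print(len(set(doc.split())), len(hashed_bag_of_words_sparse(doc)))  # 5 and at most 5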
The hashing trick couples Zipf’s Law with the uniformity & cascading properties of a good hash function, and uses them to reduce the dimensionality of a sparse bag-of-words NLP model.
Notes
I doubt there is ever a use case where a hashing function does not want uniformity, but hash functions that deliberately produce similar outputs for similar inputs (“locality-sensitive hashing”) are often useful, for example in duplicate detection or similarity-based retrieval. Not true for the “hashing trick,” of course.
In fact, any function produces similar outputs for similar inputs (this is practically a definition of a function); I think you meant “that produce similar outputs for distinct inputs”.
Another sensible way to choose a good hashing function is to pick it from a strongly universal hash family. There are hash families with this property that are ridiculously fast over strings (e.g., the Multilinear family).
Good point, Will. I suppose cascading is still necessary for sparsity preservation, right? Or is uniformity really “the” main requirement?
Great writeup, thank you! One question remains, though: how does the hashing trick compare to using feature selection (FS)? I.e., how different is the hashing trick from using IG or chi2, etc., to find the “top” 256 terms? I’d say the hypothesis is that the hashing trick does worse – but the question is, how much worse?
Hi Florian — well I would not assume that the hashing trick does worse than traditional dimensionality reduction (feature selection) methods for NLP problems, because Zipf’s Law makes NLP models “different” in terms of sparsity. Also I think information gain and chi-squared dimensionality reduction usually use the entire set of training data. In other words, they are not online or sequential techniques. Therefore they will have trouble scaling to big datasets.
Just a thought on this. I think you could consider hashing an unsupervised feature selection technique. Consider all of the low-frequency feature indices. These will essentially get washed out in the model training due to overlaps with other features, and only the weightings on the common features will contain any reliable statistics. So, in this way, hashing lets you automatically choose the most prevalent features from the dataset without ever actually counting them manually, which could be expensive with, say, a million features and a million data points. This is in contrast to normal feature selection, which measures the effect of different features on the ultimate model (thus you could call those supervised feature selection methods).
Very interesting idea! So those feature indices that end up discriminative are probably where important features “got hashed.” Though we would still need to train some sort of learner to identify the discriminative features. Since we are thinking unsupervised, we could do (online) sequential k-means and then look for features in the hashing space with large magnitude weights. Right?
Thanks for the talk today Ben, and interesting post.
Here’s a link to what I mentioned to you earlier about the use of feature hashing in reinforcement learning:
Sutton & Barto
One nice thing about hashing in the sequential decision making scenario is that you may not have a large data set a priori, and thus you cannot make decisions about the feature representation before observing data. When hashing a large multi-dimensional discretization (tile coding), you end up sequentially building a smaller discretization that reflects the distribution of observed features.
The Sutton & Barto book is indeed the earliest mention I have found of the hashing-trick. Well spotted!
Don’t we need to choose the size of the hashing range (i.e. bit mask) as an a priori model complexity parameter?
Thanks for the wonderfully clear exposition!
Do you have any comments on Kilian Weinberger’s use [*] of a second hash function with single-bit output ξ, which chooses whether to increment or decrement the feature count (rather than always incrementing)? I understand this serves to eliminate bias, but I cannot qualitatively or intuitively understand how this helps in the event of collisions – surely a result of 0 is as incorrect as a result of 2.
[*] Weinberger, Kilian, et al. “Feature hashing for large scale multitask learning.” Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
Hi Apu — a good question! I think the unbiased variance derivation in the proof of Lemma 2 (in the appendix) is not satisfied if the ξ passed over the sum is unsigned, but I am not entirely sure.
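For readers who want to experiment with Apu’s question, a rough sketch of the signed variant looks like this; deriving the sign from a different bit of the same digest is a shortcut on my part, the paper uses an independent hash ξ:

import hashlib

def signed_hashed_features(tokens, dimensions=256):
    # Each token lands in a bucket as before, but a one-bit hash decides
    # whether it adds +1 or -1, so colliding tokens cancel in expectation
    # instead of always piling up in the same direction.
    vector = [0.0] * dimensions
    for token in tokens:
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        index = digest % dimensions
        sign = 1.0 if (digest >> 64) & 1 else -1.0  # stand-in for xi(token)
        vector[index] += sign
    return vector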
This is the best clarification of the hashing trick concept.
Thanks very much
I’ve been using the hashing trick with huge ad-click prediction models and very aggressive regularization. Interestingly, Zipf’s Law seems to hold for every model I’ve trained in practice, and also for a large number of randomly generated models, although the domain is not natural language (there are lots of entity ids and things like that). I’ve even used the trick to build a two-line cache with about a 98% hit rate!
Nevertheless, I think one point deserves more attention. While it’s true that rare features are rare by definition, and that a collision between some specific rare feature and any specific useful feature (of which there are relatively few) has very low probability, this is not true of a collision between any rare feature and any useful feature. Say the useful features occupy about 5% of the hash table. Now every incoming observation probably contains a rare feature, and thus there is a 5% chance of hashing it to a position a relevant feature already occupies; this position is not “washed out” by constant rare-feature competition or by regularization. So I usually keep track of the “owner” feature for every hash position by incrementing a count when that feature is seen again and decrementing it when a colliding feature arrives. If the colliding feature brings the count to zero, it becomes the new owner of that position. Usually relevant features become owners of their positions once and forever, and I can discard rare features colliding with them (say 5% of the time).
Excellent point, and I like that your per-bucket owner & counter is still linear in processing & memory use.
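For anyone curious, here is a minimal sketch of how I read the per-bucket owner & counter idea (essentially a majority-vote counter per hash position; the names are mine, not from the comment above):

class BucketOwner:
    # One of these per hash position: remember a candidate "owner" feature
    # and a count. Frequent features keep ownership; colliding rare features
    # only chip away at the count and can be discarded while they do not own it.
    def __init__(self):
        self.owner = None
        self.count = 0

    def observe(self, feature):
        if feature == self.owner:
            self.count += 1
        elif self.count == 0:
            self.owner, self.count = feature, 1
        else:
            self.count -= 1
        return feature == self.owner  # True when this feature currently owns the bucket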
Dear Ben,
This is a fantastic post. With your permission I will use the examples you used in my consulting sessions.
Thanks a bunch
Thanks, Rupayan. Go ahead, but please cite me!