What is a Promise Worth?

How do you prevent hyperinflation without destroying the economy? The answer ain’t Bitcoin.

A virtual currency like Bitcoin uses a decentralized proof-of-work ledger (the block chain) to solve the double-spending problem. “Satoshi Nakamoto” deserves serious accolades for this clever architecture, but Bitcoin has a few serious problems. The first is its lack of security. The infrastructure around the currency is shoddy and fragile. The website where 80% of Bitcoin trading currently occurs is called the Magic: The Gathering Online Exchange (a.k.a. Mt. Gox). Mt. Gox has recently crashed and been cracked, and it does not support easy shorting. More importantly, the Bitcoin system may never mature without a central authority spending a lot of (traditional) money to build out the infrastructure, with negligible or negative financial return-on-investment. Without a social program, in other words.

Even if Bitcoins did have the infrastructure and liquidity of a traditional currency like U.S. dollars or Japanese yen, there is another more fundamental problem with Bitcoin becoming the money of the future. Bitcoins are intrinsically deflationary.

The future will always be in one of two states. Either Bitcoin miners are running up against the limits of Moore’s Law and are unable to profitably mine new Bitcoins, or some bullshit singularity has occurred, giving us all access to infinite computational power. In that second state, we would run up against the Bitcoin architecture’s hard-coded monetary supply cap of twenty-one million Bitcoins.

If human desire is infinite, then people will always want more money for goods and services. (All else equal, of course!) So we have an intrinsically fixed supply of a fungible good along with increasing demand. Therefore a Bitcoin is guaranteed to increase in value over time. Any fraction of a Bitcoin is guaranteed to increase in value over time. This may sound good if you happen to have a lot of BTC (Bitcoin) in your wallet. However at a macroeconomic level deflation is catastrophic, as I will explain.

A Hamburger on Tuesday
Would you trade away something today that is certain to be worth more tomorrow? What if the “something” is a currency, a good that has no intrinsic value other than being money? (You cannot heat your house with the digital dollars in your checking account. Gotta pay the utility company first.) In an emergency you might spend your deflating currency, but in general you should hold onto your BTC as long as possible. And since there is uncertainty about the degree to which Bitcoin will deflate, the market will not instantly price BTC correctly. The BTC price of goods and services will not instantly adjust to match the level of computational power available to miners.

Some Bitcoin proponents think we can instantly discount the BTC price of all goods and services to sync-up with systematic BTC deflation, but this would need a seriously high-tech payment infrastructure. Square and Stripe are trying, but does anyone seriously believe the prices of all goods and services can be discounted in real-time by a macroeconomic indicator? We can’t even ditch the wasteful dollar bill!

The Bitcoin bulls also emphasize a currency’s dual role as a means of transaction and a store-of-value, but intrinsic deflation trashes both roles simultaneously. As a means of transaction, deflation makes allocating capital (money) across projects and activities difficult, and again requires that perfect payment infrastructure. And since systematic deflation destroys every asset’s value and discourages economic activity, deflationary currencies do badly as stores-of-value. Less economic activity means GDP contraction and decreased livelihoods. Yes, despite what Professor von Nimby may have spewed in your Postmodern Marxist Studies class, GDP is a very strong indicator of overall human happiness. Perpetual economic contraction makes your savings account irrelevant. You might have a zillion super-valuable BTC in your digital wallet, but you have nothing to spend them on. In other words, if you think (hyper-) inflation is bad, deflation is even worse…

Passing Notes
Let us go back to a few of the original Bitcoin goals. Bitcoin proponents want an efficient, liquid currency immune from the distortion caused by a government or central bank’s monetary policy. This is reasonable, since inflationary monetary policy has a sad history of trashing people’s savings accounts, in places like the Weimar Republic or more recently Argentina. So how can we build the decentralized, non-deflationary currency of the future?

Notes are an ancient monetary concept desperate for rethinking in the Internet age. At its most basic level, a note is a promise to exchange money, goods or services at some point in the future. However a note is not quite a futures contract, because the promise need not ever be exercised. And a note is not really an options contract, because a note need not ever expire. The most obvious form of a note is what a U.S. dollar bill used to represent when we were on the gold standard. It was a promise that the holder of the note (dollar bill) could exchange the note for a dollar’s worth of physical gold at any time. Notes are a lot easier to store and deal with than gold, and so they make a lot of sense for getting work done efficiently. We could also talk about the fungibility of notes, but that is less important at this point. And notes are definitely easier to move around than loaves of bread, head of cattle, barrels of oil, or other physical stuff with intrinsic value.

A hoard of notes would also be a decent store-of-value in your savings account, as long as the writer of the notes remains solvent and trusted. For example, a million dollars worth of U.S. gold-convertible notes is a great retirement nest-egg, since most normal people expect the U.S. government to honor its promises for a long time.

When the entity writing the note is trusted by just about everyone (that is, expected to honor its contract), the writer can declare the notes to be unconvertible, all at once. The notes become fiat currency: currency that is not explicitly backed by anything but the trust that the note writer will not issue too many notes and inflate away people’s savings.

Why does most global economic activity happen using a handful of fiat currencies, like the U.S. dollar or Euro? Nations have traditionally supported their (fiat) currencies through policy and war, because before the Internet trust did not scale. Imagine a small town. Mel and Stannis are neighbors in this town. Mel trusts Stannis to honor his promises, and accepts a note from Stannis in return for mowing Stannis’s lawn for the next year. The note Stannis writes for Mel says something like “Stannis promises to give the bearer of this note 100 loaves of bread, anytime.” Mel’s landlord Dave also trusts Stannis, and so he has no problem taking Mel’s note as rent. Stannis has essentially printed his own money, which is a lot more convenient than baking 100 loaves of bread. Now in the next town over, no one really knows Stannis. Therefore Dave will have a hard time making use of Stannis’s note when he visits there to spend time with his grandparents. Dave and Mel trust Stannis, but the people living in the next town over do not.

In this parochial example, trust has not scaled across the network of transactions and relationships. The money Stannis created, the note he wrote, is only useful close to home. Mel could instead insist on being compensated with a note from an entity more trusted the world over, say the First Bank of Lannister, which has a branch in both towns. Mel, Stannis, Dave and his grandparents all probably trust the First Bank of Lannister to pay its debts.

If Dave wants to spend the note from Stannis in the next town over, he can ask a third party to guarantee or sign off on it. This can be done by exchanging Stannis’s promise for a promise by the First Bank of Lannister, which is more trusted throughout the realm. The First Bank of Lannister would be compensated for extending its trust by taking a cut of the promise from Stannis.

So before he leaves on his trip, Dave takes his rent check (note) from Mel into the First Bank of Lannister. They write a new note saying “The First Bank of Lannister promises to give the bearer of this note 95 loaves of bread, anytime” and give this note to Dave in exchange for the note written by Stannis. The bank has decided to take responsibility for chasing down Stannis if he turns out to renege on his promise, and in return it is compensated with the value of five loaves of bread. Here the First Bank of Lannister has also issued its own currency, but more as a middle-man than as someone doing economic activity like Mel’s lawnmowing or Dave’s landlording.

This middle-man role is very important but also difficult to scale across a physical economy. Eventually someone refuses to trust the First Bank of Lannister, and then the chain of economic activity halts. This is why the world’s global economy has consolidated onto a few currencies, for reasons of both efficiency and trust.

The Internets
In the age of the Internet and pervasive social networks like Facebook and LinkedIn, everyone is connected in a global network. This is the famous degrees-of-Kevin-Bacon or Erdős number concept. Any two people are connected by just a few steps along the network. Most of Stannis’s friends on Facebook would be willing to accept a note or promise from Stannis, and the same holds true for Dave, Mel and the First Bank of Lannister’s social networks. Since the whole of humanity is probably connected in a trust network, software can automatically write those middle-man notes along the chain of connections. Therefore any two people can automatically find a chain of trust for spending money.

Back to our example, but in the age of the Internet. Mel, Dave and Stannis all trust each other, since they are LinkedIn contacts. Peter reneged on a note a few months ago, so no one really trusts Peter except Stannis. Everyone but Stannis unfriended Peter, so Peter has a very isolated social network. This time around we do not need to care about geography and small towns, since everyone is connected via the Internet and social networks. Let’s say Peter wants to buy an old iPad from Dave, and Dave thinks the iPad is worth about a hundred loaves of bread. Peter could try to write a note promising a hundred loaves of bread, but Dave would not accept this note since he does not trust Peter. Now for the cool part.

Peter goes to a notes exchange website (NoteEx), and asks for a hundred-loaf note that Dave will trust. The website knows that Stannis trusts Peter, and that Dave trusts Stannis. (See the triangle?) Through the website, Stannis writes Peter a note for one hundred loaves of bread that Peter gives to Dave in exchange for the iPad. Dave has a note he trusts in exchange for his good, at the price he wanted. Similarly Stannis receives a note written by Peter, whom he trusts. This note might be for 105 loaves of bread, giving Stannis a little cut in exchange for trusting the dodgy Peter. This five-loaf interest, cut or edge is Stannis’s compensation as a middle-man.

This can all be done automatically by the NoteEx server with a list of middle-man volunteers. People volunteer to be middle-men up to a maximum amount of exposure or risk (e.g. one thousand loaves of bread total). Or middle-men could even offer to guarantee up to two degrees of Kevin Bacon away, for a much higher cut. After a bunch of people volunteer to be middle-men in the NoteEx process, all economic activity could be subsumed by the system, with social networks ensuring that you only ever receive payment (promises) from people you trust. A NoteEx transaction could have more than one middle-man, up to the six degrees of Kevin Bacon maximum that we assume connects all people.
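To make the middle-man matching concrete, here is a minimal sketch of the chain-of-trust search in Python. The trust graph, the breadth-first search, and the flat five-percent cut per hop are illustrative assumptions, not a spec for how a real NoteEx would work.

```python
from collections import deque

# Directed trust graph: trusts["Dave"] lists the people whose notes Dave will accept.
trusts = {
    "Dave":    ["Stannis", "Mel"],
    "Mel":     ["Stannis", "Dave"],
    "Stannis": ["Peter", "Dave", "Mel"],
    "Peter":   ["Stannis"],
}

def chain_of_trust(payee, payer, trusts, max_hops=6):
    """Breadth-first search from the payee toward the payer.

    Returns the shortest list of note writers, e.g. ["Dave", "Stannis", "Peter"]:
    Dave accepts a note from Stannis, who accepts a note from Peter.
    """
    queue = deque([[payee]])
    seen = {payee}
    while queue:
        path = queue.popleft()
        if path[-1] == payer:
            return path
        if len(path) > max_hops:
            continue
        for trusted in trusts.get(path[-1], []):
            if trusted not in seen:
                seen.add(trusted)
                queue.append(path + [trusted])
    return None  # no chain of trust within max_hops

path = chain_of_trust("Dave", "Peter", trusts)
print(path)  # ['Dave', 'Stannis', 'Peter']

# Each middle-man on the path takes a cut, so Peter's note must be written
# for a bit more than the hundred loaves Dave receives.
cut_per_hop = 0.05                    # assumed five-percent fee per middle-man
middle_men = len(path) - 2            # just Stannis, in this example
print(round(100 * (1 + cut_per_hop) ** middle_men, 2))   # 105.0 loaves
```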

Ironically, the good or service underlying the notes is not all that important, since notes are very rarely redeemed. In the same way that powerful governments can support fiat currencies backed by nothing, fiat notes backed by loaves of bread will not actually turn everyone into a baker. Usually notes are exchanged with their value being the trusted promise, but not necessarily the realization. Heavy stuff here.

Decentralized Bakery
The NoteEx website would be built atop an open and standard protocol, and competing notes exchanges could borrow from the Bitcoin architecture to be decentralized (i.e. the shared ledger). More importantly, there would be a natural level of inflation in the system as the cuts or interest that middle-men demand increase the total value of all promises across the economy. And of course, notes are an excellent store-of-value because who would you trust more to support you in an emergency or retirement than your tightest friends & family?

So! We have a theoretical monetary system free from government interference, and one that encourages economic activity through modest and natural inflation.

Posted in market-microstructure, politics, quant, quantitative-analysis, trading | 2 Comments

Hashing Language

How do you build a language model with a million dimensions?

The so-called “hashing trick” is a programming technique frequently used in statistical natural language processing for dimensionality reduction. The trick is so elegant and powerful that it would have warranted a Turing Award, had the first person to use it understood its power. John Langford cites a paper by George Forman & Evan Kirshenbaum from 2008 that uses the hashing trick, but it may have been discovered even earlier.[1] [2] Surprisingly, most online tutorials and explanations of the hashing trick gloss over the main insights or get buried in notation. At the time of this writing, the Wikipedia entry on the hashing trick contains blatant errors.[3] Hence this post.

Hash, Man

A hash function is a programming routine that translates arbitrary data into a numeric representation. Hash functions are convenient, and useful for a variety of different purposes such as lookup tables (dictionaries) and cryptography, in addition to our hashing trick. An example of a (poor) hash function would map the letter “a” to 1, “b” to 2, “c” to 3 and so on, up to “z” being 26 — and then sum up the numbers represented by the letters. For the Benjamin Franklin quote “beware the hobby that eats” we get the following hash function output:

(beware) 2 + 5 + 23 + 1 + 18 + 5 +
(the) 20 + 8 + 5 +
(hobby) 8 + 15 + 2 + 2 + 25 +
(that) 20 + 8 + 1 + 20 +
(eats) 5 + 1 + 20 + 19
= 233

Any serious hashing function will limit the range of numbers it outputs. The hashing function we used on Benjamin Franklin could simply keep the last two digits of its sum (“modulo 100” in programming terms) and provide that lower number as its output. So in this case, the number 233 would be lopped off, and the hash function would return just 33. We have a blunt quantitative representation or mapping of the input that is hopefully useful in a statistical model. The range of this hashing function is therefore 100 values, 0 to 99.
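Here is the toy hash as a few lines of Python, reproducing the 233 and the lopped-off 33 from above. The function name and the letters-only cleanup are my own choices for the sketch:

```python
def toy_hash(text, buckets=100):
    """Sum the letter positions (a=1 ... z=26) and keep the last two digits."""
    total = sum(ord(ch) - ord("a") + 1 for ch in text.lower() if ch.isalpha())
    return total % buckets

quote = "beware the hobby that eats"
print(sum(ord(c) - ord("a") + 1 for c in quote if c.isalpha()))  # 233
print(toy_hash(quote))                                           # 33
```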

Now a big reason to choose one hashing function over another is the statistical distribution of the output across the function’s range, or uniformity. If you imagine feeding a random quote, music lyric, blog post or tweet into a good hashing function, the chance of the output being any specific value in the range should be the same as every other possible output. For our hashing function with a 0-99 range, the number 15 should be output about 1% of the time, just like every other number between 0 and 99. Note that our letter-summing hash function above does not have good uniformity, and so you should not use it in the wild. As an aside, keep in mind that certain hash functions are more uniform on bigger input data, or vice-versa.

Another reason to favor one hashing function over another is whether or not a small change in the input produces a big change in the output. I call this concept cascading. If we tweak the Benjamin Franklin quote a little bit and feed “beware the hobby that bats” into our silly hash function, the sum is now 230, which gets lopped off to 30 within the hash’s output range. This modest change in output from 33 to 30 is another sign that our toy hash function is indeed just a toy. A small change in the input data did not cascade into a big change in the output number.

Here the important point is that a good hashing function will translate your input into each number in its output range with the same probability (uniformity), and a small change in your input data will cause a big change in the output (cascading).
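To see the contrast, here is a quick sketch comparing the toy hash with a hash that does cascade. MD5 from Python’s hashlib is just a convenient stand-in for any well-behaved hash function, not a recommendation:

```python
import hashlib

def toy_hash(text, buckets=100):
    """The letter-summing hash from above, for comparison."""
    return sum(ord(c) - ord("a") + 1 for c in text.lower() if c.isalpha()) % buckets

def good_hash(text, buckets=100):
    """MD5 digest interpreted as an integer, folded into the same 0-99 range."""
    return int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16) % buckets

for sentence in ("beware the hobby that eats", "beware the hobby that bats"):
    print(sentence, toy_hash(sentence), good_hash(sentence))
# The toy hash barely moves (33 to 30); the MD5-based hash almost certainly
# lands in a completely different bucket, because the one-letter change cascades.
```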

That’s Just Zipf-y

In human languages, very few words are used very frequently while very many words are very rare. For example, the word “very” turns up more than the word “rosebud” in this post. This relationship between word rank and frequency is very convex, non-linear or curved. This means that the 25th most common word in the English language (“from”) is not just used a little more frequently than the 26th most common word (“they”), but much more frequently.

This distribution of words is called Zipf’s Law. If you choose a random word from a random page in the Oxford English Dictionary, chances are that word will be used very rarely in your data. Similarly if you were to choose two words from the OED, chances are both of those words will not be common.
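If you want to see the skew on your own data, a few lines of Python will do it. The corpus file name below is a placeholder for whatever text you have lying around:

```python
import re
from collections import Counter

# Any large-ish plain text file will do; the path is a placeholder.
with open("corpus.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
total = len(words)
for rank, (word, count) in enumerate(counts.most_common(10), start=1):
    print(f"{rank:>2}. {word:<12} {count:>8}  ({100 * count / total:.2f}% of all tokens)")

# A handful of words at the top account for a huge share of the tokens,
# while the long tail of the vocabulary appears once or twice each.
print("words appearing exactly once:", sum(1 for c in counts.values() if c == 1))
```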

The Trick

If you are doing “bag-of-words” statistical modeling on a large corpus of English documents, it is easy to find yourself accommodating thousands or millions of distinct words or ngrams. For example the classic 20 newsgroup corpus from Ken Lang contains over 61,000 different single words, and exponentially more two-word bigrams. Training a traditional statistical model with 61,000 independent variables or dimensions is computationally expensive, to say the least. We can slash the dimensionality of a bag-of-words model by applying Zipf’s Law and using a decent hashing function.

First we identify a hashing function with an output range that matches the dimensionality we wish the data had. Our silly hashing function above outputs a number from 0 to 99, so its range is 100. Using this function with the hashing trick means our statistical bag-of-words model will have a dimensionality of 100. Practically speaking we usually sit atop an existing high-quality hashing function, and use just a few of the least significant bits of the output. And for computational reasons, we usually choose a power of two as our hash function output range and desired dimensionality, so lopping off the most significant bits can be done with a fast bitwise AND.

Then we run every word or ngram in the training data through our adapted hashing function. The output of the hash becomes our feature, a column index or dimension number. So if we choose 2^8 (two to the power of eight, or 256) as our hashing function’s range and the next ngram has a hash of 23, then we set our 23rd independent variable to the frequency count (or whatever) of that word. If the next hash is the number 258, the bitwise AND with 255 keeps only the eight least significant bits, so 258 AND 255 = 2 (equivalently, 258 mod 256 = 2) and that ngram’s count lands in dimension 2. Our statistical NLP model of the 20 newsgroup corpus suddenly goes from 61,000 to only 256 dimensions.
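A bare-bones sketch of the trick as just described, with MD5 again standing in for your favorite fast hash. Scikit-learn’s HashingVectorizer packages the same idea if you would rather not roll your own:

```python
import hashlib

NUM_BUCKETS = 2 ** 8      # desired dimensionality, a power of two
MASK = NUM_BUCKETS - 1    # 255, so (hash & MASK) keeps the 8 low bits

def hashed_bag_of_words(tokens):
    """Map a list of tokens to a fixed-length frequency vector."""
    vector = [0] * NUM_BUCKETS
    for token in tokens:
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vector[h & MASK] += 1   # bump the dimension this token hashes to
    return vector

doc = "beware the hobby that eats beware the hobby".split()
vec = hashed_bag_of_words(doc)
print(len(vec), sum(vec))     # 256 dimensions, 8 tokens counted
```

The same loop, run over the 20 newsgroup corpus, is what takes the bag-of-words from roughly 61,000 columns down to 256.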

Wait a Sec’…!

Hold on, that cannot possibly work… If we use the numeric hash of a word, phrase or ngram as an index into our training data matrix, we are going to run into too many dangerous hash collisions, right?

A hash collision occurs when two different inputs hash to the same output number. Though remember that since we are using a good hashing function, the uniformity and cascading properties make the chance of a hash collision between any two words independent of how frequently those words are used. Read that last sentence again, because it is a big one.

The pair of words “from” & “rosebud” and the pair “from” & “they” each have the same chance of hash collision, even though the frequency with which the four words turn up in English varies widely. Any pair of words chosen at random from the OED has the same chance of hash collision. However Zipf’s Law says that if you choose any two words randomly from the OED, chances are one of the words will be very rare in any corpus of English language documents. Actually both words will probably be infrequent. Therefore if a collision in our hash function’s output occurs, the two colliding words are probably oddballs.

Two Reasons it Still Works

Statistical NLP bag-of-words models that use the hashing trick have roughly the same accuracy as models that operate on the full bag-of-words dimensionality. There are two reasons why hash collisions in the low-dimensional space of the hash function’s output range do not trash our models. First, any collisions that do occur probably occur between two rare words. In many models, rare words do not improve the model’s regression / classification accuracy and robustness. Rare words and ngrams are said to be non-discriminatory. Now even if rare words are discriminatory in your problem domain, probability suggests the rare words do not co-occur in the same document. For this reason, the two rare words can be thought of as “sharing” the same representation in the model, whether that is a decision tree sub-tree or a coefficient in a linear model. The Forman & Kirshenbaum paper says “a colliding hash is either unlikely to be selected as a feature (since both words are infrequent) or will almost always represent the word that led the classifier to select it.”

We cannot use the hashing trick for dimensionality reduction in every statistical model, because the trick relies on sparse data. Zipf’s Law means most features or independent variables in a bag-of-words representation equal zero. In other words, a point in the dimensional space of the bag-of-words (a “word vector”) is generally sparse. Along these lines, John Langford says the hashing trick “preserves sparsity.” For a specific random word, the chance of two random examples both having a non-zero value for that feature is low. Again this is because most words are rare.

The hashing trick is Zipf’s Law coupled with the uniformity & cascading properties of a good hash function, and using these to reduce the dimensionality of a sparse bag-of-words NLP model.

Notes

[1] Actually the first public version of the hashing trick John Langford knew of was in the first release of Vowpal Wabbit back in 2007. He also points out that the hashing trick enables very efficient quadratic features to be added to a model.
[2] Jeshua Bratman pointed out that the Sutton & Barto classic textbook on reinforcement learning mentions the hashing-trick way back in 1998. This is the earliest reference I have yet found.
[3] “the number of buckets [hashing function output range] M usually exceeds the vocabulary size significantly” from the “Hashing-Trick” Wikipedia entry, retrieved on 2013-01-29.
Posted in machine-learning, natural-language-processing, vowpal-wabbit | 17 Comments

Under the Hood of Buying & Selling Predictions

How do those futures markets like Betfair and Intrade work?

The managers of a prediction market decide upon a finite number of prediction contracts. A contract is essentially a description of some hypothetical future event. For example, a contract might be “Fadebook trades at least $50 per share in 2012.” Another important aspect of the contract specification is the anticipation and preemptive resolution of ambiguity. If Fadebook does a 2:1 split, does the event become “…at least $25 per share?” What if the asking price for Fadebook shares reaches $50, but the bid price does not? Contracts also specify an expiration date, a time by which the event must occur. In the case of the Fadebook contract, the obvious expiration date would be January 1st of 2013. For other contracts, the expiration will be arbitrary to allow for a final decision on the event’s occurrence or nonoccurrence.

Virtual Currency
A prediction is a single user’s opinion on the likelihood of the contract’s event occurring. The prediction market encourages users to form opinions about contract events, and encourages users to wager virtual currency on a contract. Users purchase virtual currency (henceforth “credits”) with real money, or perhaps are granted an amount of virtual currency in a freemium offering.

Contract Size
Each contract also specifies a size, which is both the minimum number of credits a user can wager on a contract, and the marginal increase in the size of a wager. If the prediction market chooses a size of “5 credits” for the Fadebook contract, then an individual user can wager only 5, 10, 15… credits on the contract. Contract sizes are necessary in order to better match both sides of the wager. Somewhat confusingly, experienced traders refer to each increment of the contract size as “a contract.” If I wager 15 credits on the Fadebook contract with a size of 5 credits, then I am said to be wagering “3 contracts.”

Direction
When making a wager on a contract, the user specifies the number of credits she will risk, in contract size units, and the direction of the wager. If a user believes the event is certain to occur and the user is eventually proven correct, she will profit from buying a contract whenever its likelihood is less than 100%. If a user believes an event is certain not to occur and turns out to be right, the user will profit from selling the contract when its likelihood is greater than 0%. The buyer of a contract is said to be long, and the seller of a contract is said to be short.

Payoff
When the contract’s event occurs, the long side earns the contract size in credits for each contract, less the likelihood level at which she initially made the wager. So if Fadebook’s stock hits $50 in October of 2012 while I am long 3 contracts bought at 25% likelihood, then the prediction market would immediately close the contract and credit my account with 5 credits (size), times 3 contracts, less 25% of the total, or 11.25 credits. The opposite also occurs for the short side. In this case, the 3 contracts the short side sold are settled at the full 100%, and the short’s account is reduced by 5 credits size, times 3 contracts, times the remaining 75% of the total. Again this is 11.25 credits, but deducted from the short side of the wager.

If the contract expires as the year 2012 comes to a close, and I shorted 2 contracts at 60% back in February of 2012, then I will earn a profit of 5 credits size, times 2 contracts, times 60% or 6 credits. This is exactly what the long side of those 2 contracts would lose. The long side may be 1 user, or 2 different users each long 1 contract.
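Here is a small sketch of that payoff arithmetic. The function name and the sign convention (positive for a gain, negative for a loss) are mine, not part of any particular prediction market’s API:

```python
def payoff(size, contracts, entry_likelihood, direction, event_occurred):
    """Credits gained (positive) or lost (negative) when a contract resolves.

    entry_likelihood is the price at which the wager was made, as a fraction;
    direction is +1 for the long (buyer) and -1 for the short (seller).
    """
    settle = 1.0 if event_occurred else 0.0
    return direction * size * contracts * (settle - entry_likelihood)

# Long 3 Fadebook contracts (size 5) at 25%, and the event occurs:
print(payoff(5, 3, 0.25, +1, True))    # +11.25 credits
# The short side of the same trade loses the mirror image:
print(payoff(5, 3, 0.25, -1, True))    # -11.25 credits
# Short 2 contracts at 60%, and the contract expires without the event:
print(payoff(5, 2, 0.60, -1, False))   # +6.0 credits
```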

By definition, the long and short sides of a contract will always balance. A user is not able to be long a contract unless another user is short. This would be similar to a real futures contract traded in Chicago or New York, but where the long side is committed to “buying” the event for 100% if it occurs.

Orders
The current likelihood (henceforth “price”) of a contract is determined by looking at the contract’s order book on the prediction market. Order books are how buyers and sellers of contracts are matched, and indicate a prediction contract’s current price or likelihood. An order book is an ordered list of buying and selling prices. If there is no market in a contract and a user wishes to make a wager, her estimated likelihood becomes the best buying or selling price for the contract. From now on we drop the price or likelihood’s percent sign for brevity.

Say the Fadebook contract has just been listed and publicized on the prediction market. If I believe the event is very likely to occur, then I might offer 99 to anyone willing to sell at 99. If I am correct, then I will eventually earn 1 credit per contract for my trouble. I will have paid 99 for something that earns me 100. Though perhaps I want to leave myself more room to profit, and instead offer 25 to anyone willing to sell at 25. This would mean a profit of 75 per contract when it expires. The prices at which every user is willing to buy form one side of the order book, the buying prices or bids.

A similar process happens on the selling side. I want to be short a contract when I do not believe the event will occur. So I would sell to anyone for more than 0 likelihood or price. If I end up correct in my prediction and the event does not occur, having gone short at 90 likelihood, then I earn 90% of the contract size for each contract. The collected prices at which all users are willing to sell form the other side of the order book, the selling prices or asks.

Market Orders
When a wager occurs at a price matching one of the buy or sell orders currently available in the market, the order is instantly completed by matching a long and a short side of a wager. If I want to buy the Fadebook contract at 35 and there is already a user with a selling price or ask of 32, my wager will be immediately matched. In this sense, my order never actually appears in the prediction market order book for the contract, since the wagers are instantly matched.
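For the curious, here is a toy order book showing how a crossing limit order gets matched instantly. It is a sketch of the mechanics described above, not how Betfair or Intrade actually implement matching:

```python
import heapq

class OrderBook:
    """Toy order book for one prediction contract; prices are whole percents."""

    def __init__(self):
        self.bids = []   # max-heap of resting buy orders, stored as (-price, contracts)
        self.asks = []   # min-heap of resting sell orders, stored as (price, contracts)

    def limit_order(self, side, price, contracts):
        """Fill against the opposite side while prices cross, then rest any remainder."""
        fills = []
        if side == "buy":
            while contracts and self.asks and self.asks[0][0] <= price:
                ask_price, ask_qty = heapq.heappop(self.asks)
                traded = min(contracts, ask_qty)
                fills.append((ask_price, traded))   # trades happen at the resting ask
                contracts -= traded
                if ask_qty > traded:
                    heapq.heappush(self.asks, (ask_price, ask_qty - traded))
            if contracts:
                heapq.heappush(self.bids, (-price, contracts))
        else:
            while contracts and self.bids and -self.bids[0][0] >= price:
                neg_bid, bid_qty = heapq.heappop(self.bids)
                traded = min(contracts, bid_qty)
                fills.append((-neg_bid, traded))    # trades happen at the resting bid
                contracts -= traded
                if bid_qty > traded:
                    heapq.heappush(self.bids, (neg_bid, bid_qty - traded))
            if contracts:
                heapq.heappush(self.asks, (price, contracts))
        return fills

book = OrderBook()
book.limit_order("sell", 32, 1)         # a resting ask at 32
print(book.limit_order("buy", 35, 1))   # [(32, 1)]: my bid at 35 matches instantly at 32
```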

Often a user will want to buy or sell at whatever likelihood or price the market is currently offering. With this market order, the user takes whatever happens to be the best price available at the time. Market orders are risky because the user may not know the exact price at which she commits to the wager. Market orders are also said to reduce liquidity in the market, and may be considered less healthy for the prediction market than regular limit orders.

Trading Out Early
The fundamental advantage of prediction markets over traditional oddsmaking is the option for a user to exit a profitable or losing wager early, before the contract event’s expiration. A user who is currently long or short a contract is free to trade the position to another party at any point, even if the user has only held the contract for a few minutes. This is advantageous for a user who wants to cut their losses early on a losing contract, or wants to take their profit early on a contract that becomes profitable.

If I went long 5 of the Fadebook contracts at 35 a few months ago, but that contract now has an order book centered around 50 bid/ask, then I can sell the 5 contracts for a 15 profit per contract right now. I do not need to hold my Fadebook contracts until 2013. This is not going short the contract, but selling to a willing buyer in order to net out my position with a profit.

Liquidity & Accurate Predictions
The more users are actively participating in wagering on a contract, the more accurate the collective estimate of likelihood. The most recent trade price or likelihood of a very busy and popular contract is an excellent estimate of the real likelihood of the contract event occurrence. If a single user believes a contract has a 35% likelihood, but a thousand other users are trading that contract around 75% likelihood, chances are the first user is wrong!

Prizes & Incentives
Prediction markets are more powerful when users are incentivized with real compensation. Therefore even if the prediction market’s virtual currency simplifies the regulatory aspects of the project, credits need to be closely tied to real money or prizes, so users assume actual risk when making wagers. Also any prizes need to incentivize users to make careful wagers and not necessarily “swing for the fences” on each wager. In other words, each bit of virtual currency must contribute to winning prizes.

Users who risk nothing of value when making wagers will not turn out to form a particularly accurate prediction market.

Posted in forecasting, market-microstructure, trading | Leave a comment

Sequential Learning Book

Things have been quiet around here since the winter because I have been focusing my modest writing and research skills on a new book for O’Reilly. We signed the contract a few days ago, so now I get to embrace a draconian authorship schedule over the next year. The book is titled Sequential Machine Learning and will focus on data mining techniques that train on petabytes of data. That is, far more training data than can fit in the memory of your entire Hadoop cluster.

Sequential machine learning algorithms do this by guaranteeing constant memory footprint and processing requirements. This ends up being an eyes-wide-open compromise in accuracy and non-linearity for serious scalability. Another perk is trivial out-of-sample model validation. The Vowpal Wabbit project by John Langford will be the book’s featured technology, and John Langford has graciously offered to help out with a foreword. Therefore the book will also serve as a detailed tutorial on using Vowpal Wabbit, for those of us who are more coder or hacker than statistician or academic.

The academic literature often uses the term “online learning” for this approach, but I find that term way too confusing given what startups like Coursera and Khan Academy are doing. (Note the terminology at the end of Hilary Mason’s excellent Bacon talk back in April.) So, resident O’Reilly geekery-evangelist Mike Loukides and I are going to do a bit of trailblazing with the more descriptive term “sequential.” Bear with us.

From the most basic principles, I will build up the foundation for statistical modeling. Several chapters assume statistical natural language processing as a typical use case, so sentiment analysis experts and Twitter miners should also have fun. My readers should know computers but need not be mathematicians. Although I have insisted that the O’Reilly toolstack support a few of my old-school LaTeX formulas…

Posted in book, machine-learning, sequential-machine-learning, vowpal-wabbit | 8 Comments

What is There to Eat Around Here?

Or, why clams are bourgeois — the presence of clams on menus is indicative of a place where people spend a lot of their money on housing. This is how I found out.

We have all played the proportional rent affordability game. How much of my income should I spend on where I live? One rule of thumb is “a third,” so if you take home $2,400 per month you aim to spend about $800 on rent or a mortgage payment. Some play the hypothetical budgeting version of the game. We might pay more of our income for housing if it means being able to live in a particularly desirable area.

Expensive Housing
Here is a map of income normalized by housing expense, for a bunch of Bay Area neighborhoods. This information is from our Altos Research active market real estate data. More technically, each dot on the map represents the ratio of a zipcode’s household income to the weighted average of single family home list prices and multi-family home list prices. I used median numbers, to minimize the impact of foreclosures or extremely wealthy households. Single and multi-family home prices were weighted by listing inventory, so urban condos matter as much as those McMansions in the ‘burbs. The green dots are areas where proportionally more income is spent on housing, and blue dots are the opposite.

Bay Area Proportional Housing Expense
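Roughly, the ratio behind each dot could be computed like this. The column names and numbers below are illustrative placeholders, not the actual Altos Research schema or data:

```python
import pandas as pd

# Hypothetical per-zipcode inputs.
zips = pd.DataFrame({
    "zipcode":         ["94110", "94043", "95112"],
    "median_income":   [85_000, 105_000, 78_000],
    "sfh_median_list": [1_150_000, 990_000, 520_000],
    "mfh_median_list": [780_000, 640_000, 310_000],
    "sfh_inventory":   [120, 210, 430],
    "mfh_inventory":   [340, 90, 110],
})

# Weight single- and multi-family list prices by their listing inventory,
# then normalize household income by that blended price.
total_inventory = zips[["sfh_inventory", "mfh_inventory"]].sum(axis=1)
blended_price = (zips["sfh_median_list"] * zips["sfh_inventory"]
                 + zips["mfh_median_list"] * zips["mfh_inventory"]) / total_inventory
zips["income_to_price"] = zips["median_income"] / blended_price
print(zips[["zipcode", "income_to_price"]])
# Lower ratios (more of a household's income eaten by housing) correspond
# to the greener dots on the map above.
```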

The data shows that people living in the city of San Francisco spend a much larger proportion of their income on housing than Oaklanders or those in San Jose. If we assume that the real estate market is somewhat efficient, then those who choose to live in certain neighborhoods forgo savings and disposable income. Why is it that housing expenses for living in San Francisco are so much higher than San Jose, even when we control for income disparity?

The Real Estate Menu
Like a proper hack economist, I am going to gloss over the obvious driving factors of proportionally expensive housing, such as poor labor mobility, lack of job opportunities, and a history of minority disenfranchisement. I am a chef by training — culinary arts degree from CHIC, the Le Cordon Bleu school in Chicago — and remain fascinated by the hospitality industry. So instead of diving into big social problems, I focused on something flippant and easy to measure: Where people go out to eat, across areas with different levels of proportional housing expense.

I analyzed the menus of a random selection of 5,400 sit-down and so-called “fast casual” restaurants across the United States. This menu population is hopefully large and diverse enough to represent dining out in general, though it is obviously biased toward those restaurants with the money and gumption to post their menus online. However there is not a disproportionate number of national chain restaurants, since even the most common restaurant, T.G.I. Friday’s, is only about 2.5% of the population:

Restaurant Histogram

Menu Words
The next step in my analysis was counting the common words and phrases across the menus. Here are the top fifty:

1. sauce, 2. chicken, 3. cheese, 4. salad, 5. grilled, 6. served, 7. fresh, 8. tomato, 9. shrimp, 10. roasted, 11. served-with, 12. garlic, 13. cream, 14. red, 15. fried, 16. onions, 17. tomatoes, 18. beef, 19. rice, 20. onion, 21. bacon, 22. topped, 23. mushrooms, 24. topped-with, 25. steak, 26. vinaigrette, 27. spinach, 28. lettuce, 29. pork, 30. green, 31. potatoes, 32. spicy, 33. white, 34. salmon, 35. in-a, 36. soup, 37. peppers, 38. mozzarella, 39. lemon, 40. sweet, 41. with-a, 42. menu, 43. beans, 44. dressing, 45. fries, 46. tuna, 47. black, 48. greens, 49. chocolate, 50. basil

Pervasive ingredients like “chicken” turn up, as do common preparation and plating terms like “sauce” and “topped-with”. Perhaps my next project will be looking at how this list changes over time. For example, words like “fried” were taboo in the 90’s, but more common during this post-9/11 renaissance of honest comfort food. Nowadays chicken can be “fried” again, not necessarily “crispy” or “crunchy”.

A Tasty Model
Next I trained a statistical model using the menu words and phrases as independent variables. My dependent variable was the proportional housing expense in the restaurant’s zipcode. The model was not meant to be predictive per se, but instead to identify the characteristics of restaurant menus in more desirable areas. The model covers over five thousand restaurants, so menu idiosyncrasy and anecdote should average out. The algorithm used was our bespoke version of least-angle regression with the lasso modification. It trains well on even hundreds of independent variables, and highlights which are most informative. In this case, which of our many menu words and phrases are correlated with proportional housing expense?
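Our production model is bespoke, but the same analysis can be sketched with off-the-shelf tools. Here scikit-learn’s LassoLars (least-angle regression with the lasso modification) stands in for it, and the file name, column names and alpha are placeholders:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LassoLars

# Hypothetical input: one row per restaurant with its full menu text and the
# proportional housing expense of its zipcode. Not the actual menu data set.
menus = pd.read_csv("restaurant_menus.csv")   # columns: menu_text, housing_expense

# Unigrams and bigrams, which is where phrases like "topped-with" come from.
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=25, stop_words="english")
X = vectorizer.fit_transform(menus["menu_text"])
y = menus["housing_expense"]

# Least-angle regression with the lasso modification; the alpha is a guess.
model = LassoLars(alpha=0.001)
model.fit(X.toarray(), y)

# The nonzero coefficients are the menu words most associated with high
# (positive) or low (negative) proportional housing expense.
terms = vectorizer.get_feature_names_out()
ranked = sorted(zip(model.coef_, terms))
print("low-expense terms: ", [t for c, t in ranked if c < 0][:20])
print("high-expense terms:", [t for c, t in reversed(ranked) if c > 0][:20])
```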

Why Clams are Bourgeois

The twenty menu words and phrases most correlated with low proportional housing expense areas (the bluer dots):

1. tortilla, 2. cream-sauce, 3. red-onion, 4. thai, 5. your-choice, 6. jumbo, 7. crisp, 8. sauce-and, 9. salads, 10. oz, 11. italian, 12. crusted, 13. stuffed, 14. marinara, 15. broccoli, 16. egg, 17. scallops, 18. roast, 19. lemon, 20. bean

Several of these words or phrases are associated with ethnic cuisines (e.g. “thai” and “tortilla”), and others emphasize portion size (e.g. “jumbo” and “oz” for ounce). Restaurants in high proportional housing expense areas (greener dots) tend to include the following words and phrases on their menus:

1. clams, 2. con, 3. organic, 4. mango, 5. tofu, 6. spices, 7. eggplant, 8. tomato-sauce, 9. cooked, 10. artichoke, 11. eggs, 12. toast, 13. roll, 14. day, 15. french-fries, 16. duck, 17. seasonal, 18. oil, 19. steamed, 20. lunch, 21. chips, 22. salsa, 23. baby, 24. arugula, 25. red, 26. braised, 27. grilled, 28. chocolate, 29. avocado, 30. dressing

These words reflect healthier or more expensive food preparation (e.g. “grilled” or “steamed”), as well as more exotic ingredients (e.g. “mango” and “clams”). Also, seasonal and organic menus are associated with high proportional housing expense. The word “con” turns up as a counter-example for Latin American cuisine, as in “con huevos” or “chili con queso”.

Food Crystal Ball
This sort of model for restaurant menus could also be used for forecasting, to statistically predict the sort of food that will be more successful in a particular neighborhood. This predictive power would be bolstered by the fact that the population of menus has a survivorship bias, because failed or struggling restaurants are less likely to post their menus online.

This confirms my suspicion that housing expense is counter-intuitive when it comes to dining out. People who spend more of their income on housing in order to live in a desirable location have less disposable income, but these are the people who pay more for exotic ingredients and more expensive food preparation. Maybe these folks can’t afford to eat in their own neighborhood?

Posted in altos-research, natural-language-processing, politics, real-estate, restaurant-menus, restaurants | 1 Comment

Redots

Dorkbot is a semi-monthly meeting of “people doing strange things with electricity.” They have been chugging along in several cities for a decade-or-so. Back in 2005 I presented at a Dorkbot in London, so I have an enduring soft spot for these quirky gatherings. At this month’s Dorkbot in San Francisco, a meteorologist named Tim Dye presented a brilliant visualization called WeatherDots. It summarizes the weather data he collects near his home in wine country.

Inspired by how much time-series information Dye was able to squeeze onto a few pretty circles, I spent the plane ride to ABS East in Miami throwing together a “dot” visualization of the Altos Research weekly active market data. Here is a visualization of a year’s worth of real estate data:

Redots Screenshot
http://www.altosresearch.com/customer/labs/redots.html

My Redots updates every week, and can be pointed at any of the Altos Research local markets by entering a city, state, and zipcode. Your web browser needs to play nicely with the amazing Raphaël visualization library, or you will just get a blank screen. I recommend using Google Chrome.

The Legend, or What Is It?
Each dot of color represents a week in a local residential real estate market, so each column is a month. The main color of the dot shows the week-on-week change in the median price of single family homes in a particular zipcode. A red dot means house prices have decreased since the previous week (or dot), while green dots are increasing weeks. The summer seasonality effect is pretty clear in our Mountain View, CA example.

The “halo” of a dot is the ratio of new listings to listings in general. If the newest listings coming onto a market are priced higher than the typical listing, then the halo will be green. This suggests a seller’s market, when new listings are asking for a premium. The price of these new listings will be absorbed into the market the following week, so you might imagine a dot’s halo merging with the main color.

A dot’s angle is the year-on-year change in market prices. Aiming northeastward means prices have increased since the year before, while southeast is a decrease. These angles strip away seasonality and show the secular real estate trend. Our Silicon Valley example is a bit down year-on-year. The thickness of a weekly dot represents the week-on-week change in the number of listings, or more simply, the inventory. Thinner, more elliptical dots mark a shrinking market, where fewer listings are available for sale at any price.

A Thousand Words
Information visualization is a buzzy field with smart people doing striking work. For me the line between the big data and infovis communities blurs when a pretty picture enables statistical inference without necessarily running the numbers.

Posted in altos-research, information-visualization, real-estate | 2 Comments

A Different House Hedge

Where do stock market winners buy houses?

There are many ways to predict how the price of an asset will change in the future. For stocks, one approach is based on fundamental analysis and another approach uses portfolio diversification theory. A third approach to predicting stock movement is so-called “technical analysis,” which is too silly for more than a mention. There are also statistical arbitrageurs in the high-frequency market-making and trading arms race, who make minute predictions thousands of times per day. If we pretend real estate acts as a stock, we can stretch the analogy into a new mathematical tool for hedging house prices.

Fundamentalism

Fundamental analysis is usually what people think about when picking stocks. This is the Benjamin Graham philosophy of digging into a company’s internals and financial statements, and then guessing whether or not the current stock price is correct. The successful stock picker can also profit from an overpriced share by temporarily borrowing the stock, selling it, and then later buying it back on the cheap. This is your classic “short,” which may or may not be unethical depending on your politics. Do short trades profit from misery, or reallocate wasted capital?

Fundamental analysis is notoriously difficult and time-consuming, yet it is the most obvious way to make money in the stock market. Fundamental analysis is also what private equity and venture capitalists do, but perhaps covering an unlisted company or even two guys in a garage in Menlo Park. When you overhear bankers talking about a “long/short equity fund” they probably mean fundamental analysis done across many stocks and then managing (trading) a portfolio that is short one dollar for every dollar it is long. This gives some insulation against moves in a whole sector, or even moves in the overall economy. If you are long $100 of Chevron and short $100 of BP, the discovery of cheap cold fusion will not trash your portfolio since that BP short will do quite well. However for conservative investors like insurance companies and pension funds, government policy restricts how much capital can be used to sell assets short. These investors are less concerned about fundamental analysis, and more about portfolio diversification and the business cycle.

Highly Sensitive Stuff

If a long-only fund holds just automobile company stocks, the fund should be very concerned about the automobile sector failing as a whole. The fund is toast if the world stops driving, even if their money is invested in the slickest, most profitable car companies today. Perfect diversification could occur if an investor bought a small stake in every asset in the world. Though huge international indices try to get close, with so many illiquid assets around, perfect diversification remains just a theory. How can an investor buy a small piece of every condominium in the world? How could I buy a slice of a brand like Starbucks? Even worse, as time goes by companies recognize more types of illiquid assets on their balance sheets. Modern companies value intellectual property and human capital, but these assets are difficult to measure and highly illiquid. What currently unaccounted-for asset will turn up on balance sheets in 2050?

Smart fund managers understand that perfect diversification is impossible, and so they think in terms of a benchmark. A fund benchmark is usually a published blend of asset prices, like MSCI’s agricultural indices. The fund manager’s clients may not even want broad diversification, and may be happy to pay fund management fees for partial diversification across a single industry or country. Thinking back to our auto sector fund, they are concerned with how the fortunes of one car company are impacted by the automobile industry as a whole. An edgy upstart like Tesla Motors is more sensitive to the automobile industry than a stalwart like Ford, which does more tangential business like auto loans and servicing.

Mathematically we calculate the sensitivity of a company to a benchmark by running a simple linear regression of historic stock returns against changes in the benchmark. If a company’s sensitivity to the benchmark is 2.5, then a $10 stock will increase to $12.50 when the benchmark goes up by one point. A sensitivity of 0.25 means the stock would just edge up to $10.25 in the same scenario. A company can have negative sensitivity, especially against a benchmark in another related industry. Tesla probably has a negative sensitivity to changes in an electricity price index, since more expensive electricity would hurt Tesla’s business. No sensitivity (zero) would turn up against a totally unrelated benchmark. Sensitivity has a lot in common with correlation, another mathematical measure of co-movement.

One type of sensitivity is talked about more than any other. “Beta” is the sensitivity of a stock to the theoretical benchmark containing every asset in the world. Data providers like Bloomberg and Reuters probably estimate beta by regressing stock returns against one of those huge, international asset indices. An important model in finance and economics is called the Capital Asset Pricing Model, which earned a Nobel Prize for theorizing that higher beta means higher returns, since sensitivity to the world portfolio is the only sort of risk that cannot be diversified away. Though the CAPM beta is a poor model for real-life inefficient markets, sensitivities in general are a simple way to think about how a portfolio behaves over time. For instance, it turns out that sensitivities are additive. So $100 in a 0.25 sensitive stock and $50 in two different -0.25 sensitive stocks should be hedged against moves in the index and in the industry the index measures.

Back to Real Estate

Prices in certain local real estate markets are bolstered by a rally in the stock market. The recent murmurings of another IPO bubble suggest that newly minted paper millionaires will soon be shopping for homes in Los Altos Hills and Cupertino. We can put numbers behind this story by calculating real estate price sensitivity to a stock market benchmark. If we choose the S&P 500 as the benchmark, the sensitivity number will be a sort of real estate beta. Since real estate is far less liquid than most stocks, I regressed quarterly changes in our Altos Research median ask price against the previous quarter’s change in the S&P 500. Historically speaking, those real estate markets with a high beta have gotten a boost in prices after a good quarter in the stock market. Those markets with a low, negative beta are not “immune” to the stock market, but tend to be depressed by a stock market rally.
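The beta calculation itself is a one-line regression. Here is a sketch with made-up quarterly numbers standing in for the actual Altos Research and S&P 500 series:

```python
import numpy as np

# Hypothetical quarterly series for one zipcode; in the real calculation these
# come from Altos Research median ask prices and S&P 500 closing levels.
median_ask = np.array([650_000, 653_000, 668_000, 661_000, 674_000, 672_000])
sp500      = np.array([1_100,   1_180,   1_130,   1_210,   1_190,   1_280])

# Quarterly proportional changes.
home_returns = np.diff(median_ask) / median_ask[:-1]
spx_returns  = np.diff(sp500) / sp500[:-1]

# Regress this quarter's home-price change against the *previous* quarter's
# stock market change, as described above. The slope is the real estate beta.
beta, intercept = np.polyfit(spx_returns[:-1], home_returns[1:], deg=1)
print(round(beta, 3))   # a small positive beta for this made-up zipcode
```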

Below is a map of the Bay Area’s real estate betas. These numbers were calculated using prices from Altos Research and benchmark levels from Yahoo! Finance. The darker red a zipcode, the greater an increase in the market’s home prices after a quarterly stock market rally. As we might expect, the betas in Silicon Valley are above average. However there are also some surprises in Visalia and Wine Country.

Real Estate Beta, Bay Area

Our hypothesis for positive real estate beta is easy: those IPO millionaires. But what could cause a real estate market to tank after a good run in the stocks? Perhaps negative real estate betas are in more mobile labor markets, where stock market wealth triggers a move away from home ownership. Or maybe negative real estate betas turn up in markets where the condo stock is higher quality than single-family homes, like in some college towns. Remember the betas mapped above are based on only single-family home prices.

Real estate remains a difficult asset to hedge, an asset almost impossible for non-institutions to short. This is unfortunate, because a short hedge would be a convenient way for people with their wealth tied up in real estate to ride out a depressed market cycle. However like long-only fund managers, real estate investors could benefit from thinking in terms of benchmark sensitivity. If we choose a benchmark that represents the broader real estate market, we could hedge real estate by purchasing non-property assets that have negative real estate betas. You would want your value-weighted real estate beta to net out to about zero. Now there is a plethora of problems and assumptions around making investment decisions with a crude linear sensitivity number, but at least real estate beta gives us another tool for thinking about housing risk.

(An abbreviated version of this post can found be at http://blog.altosresearch.com/a-different-house-hedge/ on Altos Research’s blog)

Posted in altos-research, hedging, quant, quantitative-analysis, real-estate, trading | 2 Comments

Fungal Houses

Ever wondered why your flat’s Zestimate bounces around so much?

In high school economics class you might have learned about fungible goods. This strange word refers to things that could be swapped without the owners especially caring. A dollar is almost perfectly fungible, and so is an ounce of pure silver. Paintings and emotional knick knacks are not at all fungible. Fungible stuff is easy to trade on a centralized market, since a buyer should be happy to deal with any seller. This network effect is so important that markets “push back,” and invent protocols to force fungibility. Two arbitrary flatbeds of lumber at Home Depot are probably not worth the same amount of cash. However the CME’s random length lumber contract puts strict guidelines on how that lumber can be delivered to satisfy the obligation of the futures contract’s short trader.

Real estate is seriously non-fungible. Even a sterile McMansion in the suburbs can have a leaky roof, quirky kitchen improvements, or emotional value for the house-hunting recent college grads. If we consider many similar homes as a basket, or a portfolio of the loans secured by the homes, then the idiosyncrasies of each home should net out to zero overall. Across those ten thousand McMansions, there should be a few people willing to pay extra for a man cave, but also a few people who would dock the price. This is the foundation of real estate “structured products,” such as the residential mortgage backed securities (RMBS) of recent infamy. Like flatbed trucks delivering a certain sort of wood for a lumber futures contract, a RMBS makes a non-fungible good more fungible.

The Usual Place
The combined idiosyncrasies of non-fungible things rarely net out to exactly zero, especially during a financial crisis. Nonetheless traders and real estate professionals want to think about a hypothetical, “typical” property. We define a local real estate market by city, neighborhood or even zipcode. How do we decide the value of a typical property? There is an entire industry built around answering this question. One simple, clean approach is to sample a bunch of real estate prices in a local market at a certain point in time, and then average the prices. Or maybe use a more robust descriptive statistic like the median price.

The most readily available residential home prices in the U.S. market are “closed” transactions, the price a home buyer actually paid for their new place. Using a closed transaction price is tricky, because it is published several months after a property is sold. Can a typical home price really be representative if it is so stale?

Sampling
Even if we ignore the time lag problem, there is another serious challenge in using transactions to calculate a typical home price. Within any local real estate market worth thinking about, there are very few actual transactions compared with overall listing activity and buzz. Your town may have had a hundred single-family homes listed for sale last week, but only four or five closed purchases. A surprise during the buyer’s final walkthrough could wildly swing the average, “typical” home price. For the statistically inclined, this is a classic sample size problem.

There are plenty of ways to address the sample size problem, such as rolling averages and dropping outliers. Or you could just include transactions from a wider area like the county or state. However the wider the net you cast, the less “typical” the price!

Another approach is to sample from the active real estate market, those properties currently listed for sale. You get an order of magnitude more data and the sample size problem goes away. However everyone knows that listing prices do not have a clear cut relationship with closing price. Some sellers are unrealistic and ask too much, and some ask for too little to start a bidding war. What is the premium or discount between listing price and actual value? We spend a lot of time thinking about this question. Even closed transaction prices are not necessarily the perfect measure of typical “value” since taxes and mortgage specifics can distort the final price. Our solution is to assume that proportional changes in listing prices over time will roughly match proportional changes in the value of a typical house, especially given a larger sample from the active market.

A Picture
Below is a chart of Altos Research’s real estate prices back through 2009, across about 730 zipcodes. For each week on the horizontal axis, and for each zipcode, I calculate the proportional change in listing price (blue) and in sold price (red) since the previous week. Then I average the absolute value of these proportional changes, for a rough estimate of volatility. The volatility of sold prices is extreme.

Price Volatility
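For reference, the volatility estimate behind the chart boils down to a few lines of pandas. The file and column names are placeholders for the actual weekly panel:

```python
import pandas as pd

# Hypothetical weekly panel: one row per (week, zipcode) with median listing
# and sold prices. Column names are illustrative, not the actual schema.
panel = pd.read_csv("weekly_prices.csv", parse_dates=["week"])
panel = panel.sort_values(["zipcode", "week"])

# Week-on-week proportional change within each zipcode.
changes = panel.groupby("zipcode")[["list_price", "sold_price"]].pct_change()

# Average absolute proportional change across zipcodes, per week:
# the rough volatility estimate plotted above.
volatility = changes.abs().groupby(panel["week"]).mean()
print(volatility.head())
```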

Posted in altos-research, real-estate | Leave a comment

Sarah Palin Email Word Cloud

After three years of legal wrangling, the diligent folks at Mother Jones released another set of Sarah Palin’s emails on Friday. There are plenty of subtleties to the story. Should a personal Yahoo! email account be used for government work? And why the frustrating digital / analog loop of printing emails to be scanned at the other end, like a fax machine?

For my own snickering, I spent a couple hours over the weekend downloading the email PDFs, converting them to text, and then parsing out the choice “holy moly’s” and tender bits about Track in the army. Here is a word cloud of the former governor’s emails, via the amazing Wordle project.

Sarah Palin's Email Word Cloud

Posted in natural-language-processing, politics | Leave a comment

Case-Shiller April Forecasts

Another finger in the air, in the beginning of the month lull.

My forecasts for the March, 2011 Case-Shiller index levels were quite rushed. They were released quickly so I could publicly compare the forecasts with the CFE futures contracts about to expire. However, since the statistical models use active market data, there is no mathematical reason to wait on our forecasts until the end of the month. The April, 2011 index levels will be released on June 28th, but here are my forecasts given what the real estate markets were doing a few months ago:

City Confidence Forecast Predicted HPI
Minneapolis, MN +1 -10.52% 94.46
Phoenix, AZ +1 -2.85% 97.42
Las Vegas, NV +3 -1.56% 95.67
Atlanta, GA +2 -1.45% 96.93
Boston, MA 0 -1.32% 145.42
Los Angeles, CA -2 -1.22% 165.73
Seattle, WA +3 -0.46% 132.35
New York, NY -1 -0.21% 163.15
San Francisco, CA -3 -0.20% 129.56
Chicago, IL +2 -0.06% 110.50
San Diego, CA -3 +0.18% 154.16
Detroit, MI 0 +0.41% 67.34
Charlotte, NC 0 +0.50% 107.50
Miami, FL 0 +1.01% 138.66
Dallas, TX +1 +1.62% 114.72
Cleveland, OH +1 +2.12% 98.85
Denver, CO 0 +2.27% 123.29
Tampa, FL +1 +2.28% 129.98
Portland, OR +1 +4.71% 138.92
(The confidence score ranges from negative three for our weakest signals, up to positive three for strength. Unfortunately I am still sorting out a bug in our Washington, DC model.)

Posted in altos-research, forecasting, real-estate, trading | Leave a comment