New Sentiment Dataset

The good folks in Stanford’s Natural Language Processing Group have built a powerful new dataset for a paper being presented at the EMNLP conference in Seattle next month. The underlying foundation of the dataset is not particularly exciting, being yet another corpus of labeled movie reviews: The review sentence “Stealing Harvard doesn’t care about cleverness, wit or any other kind of intelligent humor” is provided along with its negative sentiment label, for example. What is more interesting is the corpus providing sentiment labels at every level of composition. So for the same sentence, the dataset also provides a distinct sentiment label for the sub-phrase “any other kind of intelligent humor” which is actually positive. Hence the dataset is a treebank, not just your typical corpus. A lot of Mechanical Turk wrangling went into this! This compositional and recursive labeling is a great resource for training contextual models, especially ones that go beyond the bag-of-words legacy.

Here at Trending we are experimenting with an online, regularized, high-dimensional linear approximation to the Stanford paper’s tensor RNN model, one that lets us use the whole Vowpal Wabbit stack. Next month they plan to release some (Matlab) code to parse the treebank, but have already released the data itself. Therefore I put together a simple Ruby module to parse the treebank, for your own statistical NLP, sentiment and machine learning projects. It includes a bit of Graphviz logic to render phrase trees and their sentiment as SVG:

The module is hosted on Github at “http://github.com/someben/treebank/” under a nice Free license.