Things have been quiet around here since the winter because I have been focusing my modest writing and research skills on a new book for O’Reilly. We signed the contract a few days ago, so now I get to embrace a draconian authorship schedule over the next year. The book is titled Sequential Machine Learning and will focus on data mining techniques that train on petabytes of data. That is, far more training data than can fit in the memory of your entire Hadoop cluster.
Sequential machine learning algorithms do this by guaranteeing a constant memory footprint and a constant per-example processing cost. This ends up being an eyes-wide-open trade of some accuracy and non-linearity for serious scalability. Another perk is trivial out-of-sample model validation. The Vowpal Wabbit project by John Langford will be the book’s featured technology, and John has graciously offered to help out with a foreword. The book will therefore also serve as a detailed tutorial on using Vowpal Wabbit, for those of us who are more coder or hacker than statistician or academic.
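To make the "constant memory, constant work per example" idea concrete, here is a minimal sketch in Python of what a sequential learner looks like. This is my own toy illustration, not Vowpal Wabbit's implementation: a fixed-size hashed weight vector and one stochastic-gradient update per example, so memory never grows no matter how many examples stream past. The feature names and sizes are invented for the example.

```python
import math

NUM_WEIGHTS = 2 ** 18            # fixed memory budget via the hashing trick
weights = [0.0] * NUM_WEIGHTS
learning_rate = 0.1

def predict(features):
    """Logistic prediction from hashed sparse features."""
    score = sum(weights[hash(f) % NUM_WEIGHTS] * v for f, v in features)
    return 1.0 / (1.0 + math.exp(-score))

def update(features, label):
    """One SGD step on a single example; nothing else is retained."""
    error = predict(features) - label        # label is 0 or 1
    for f, v in features:
        weights[hash(f) % NUM_WEIGHTS] -= learning_rate * error * v

# Stream the training data: each example is seen once, then discarded.
stream = [
    ([("word:great", 1.0), ("word:movie", 1.0)], 1),
    ([("word:terrible", 1.0), ("word:movie", 1.0)], 0),
]
for features, label in stream:
    update(features, label)

print(predict([("word:great", 1.0)]))
```

The "trivial out-of-sample validation" perk falls out of the same loop: because each example is predicted on before it is used for an update, the running error is roughly an out-of-sample estimate rather than a fit to data the model has already seen.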
The academic literature often uses the term “online learning” for this approach, but I find that term way too confusing given what startups like Coursera and Khan Academy are doing. (Note the terminology at the end of Hilary Mason’s excellent Bacon talk back in April.) So, resident O’Reilly geekery-evangelist Mike Loukides and I are going to do a bit of trailblazing with the more descriptive term “sequential.” Bear with us.
From the most basic principles, I will build up the foundations of statistical modeling. Several chapters assume statistical natural language processing as a typical use case, so sentiment analysis experts and Twitter miners should also have fun. My readers should know computers but need not be mathematicians, although I have insisted that the O’Reilly toolstack support a few of my old-school LaTeX formulas…