*Data Smart* is pretty much done. Phew! So that took about 8 full months to write. It's got 10 very thorough chapters on aspects of data science -- if you ever wondered how a Big M constraint in optimization is like a dead squirrel or how Megaman X is related to naive Bayes then this book is for you. The book will come out October 28, so preorder now. And **if you'd like a sample chapter**, just sign up for my newsletter, and I'll send you one shortly.

So! Everything is turned in, edited, spreadsheets are checked, R code is checked. I've even assembled a playlist to go along with the middle school dance floor clustering tutorial.

So why teach data science with spreadsheets in *Data Smart*?

**1.** At every step along the way in an algorithm, I could show exactly what was happening to the data. You get to see every state of your data. It's like seeing hotdogs being assembled from beef, to slurry, to wiener.

**2.** Some algorithms just feel natural using the "drag the formula down to fill the cell" approach that you have in Excel. It's like an artisanal apply() function. ;-) For example, when looking at the error correction formulas for Holt-Winters, you can do a single time period, and then a second one, and then drag everything down. It feels a bit like induction.
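To make the drag-down idea concrete, here's a rough Python sketch of error-correction updates. It's Holt's trend-corrected smoothing, the simpler cousin of Holt-Winters with the seasonal term left out, and the function name, starting values, and smoothing parameters are all made up for illustration. Each pass through the loop is one spreadsheet row: it only looks at the row above it.

```python
def holt_error_correction(series, level0, trend0, alpha, gamma):
    """Trend-corrected exponential smoothing in error-correction form.

    Returns one dict per time period, like one row per period in a sheet.
    All parameter values passed in below are illustrative assumptions.
    """
    rows = []
    level, trend = level0, trend0
    for actual in series:
        forecast = level + trend                 # one-step forecast from the previous row
        error = actual - forecast                # how far off that forecast was
        level = forecast + alpha * error         # nudge the level toward the actual
        trend = trend + gamma * alpha * error    # nudge the trend by a share of the error
        rows.append({"actual": actual, "forecast": forecast,
                     "error": error, "level": level, "trend": trend})
    return rows


# Hypothetical demand series that happens to be perfectly linear.
demand = [3.0, 5.0, 7.0, 9.0]
rows = holt_error_correction(demand, level0=1.0, trend0=2.0, alpha=0.5, gamma=0.5)
for row in rows:
    print(row)
```

Because every row depends only on the one before it, getting the first period right and then repeating it is the whole algorithm, which is exactly why it feels like induction.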

**3.** Spreadsheets are great for teaching predictive modeling/forecasting, data mining/graphing, *and* **optimization modeling**. While many of the techniques are opaque in R when you use packages, if you do them by hand in R, they're actually pretty clear. Except for optimization. If you want to teach other modeling techniques *plus* optimization in R then you're kinda screwed, because all the optimization hooks in R just take a full-on constraint matrix and a right-hand-side vector. Contrast this with Excel Solver where you get to build constraints individually. It's totally better for teaching. Now, that said, Python has some nice hooks into optimization modeling that would be similar to Excel. Since spreadsheets are so nice for viewing data, then prepping data, objective functions, and constraints, and then optimizing, it means that algorithms such as modularity maximization using branch and bound plus divisive clustering can be taught there, and it's actually easier to see than it would be in nearly any other environment. Plus, if you're careful you can actually cluster data better than even Gephi's native Louvain method implementation can. Bam!
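Here's the matrix-versus-individual-constraints difference in miniature, as a Python sketch with a made-up two-product problem (the variable names, coefficients, and limits are all hypothetical, and nothing here calls a real solver). The first half states each constraint on its own, Solver-style; the helper then flattens them into the constraint matrix and right-hand-side vector that matrix-style interfaces expect.

```python
# Hypothetical toy problem: every name and number here is invented.
variables = ["chairs", "tables"]

# Solver-style: each constraint is named and stated on its own,
# as (name, coefficients, upper limit).
constraints = [
    ("wood", {"chairs": 2, "tables": 5}, 100),   # 2*chairs + 5*tables <= 100
    ("labor", {"chairs": 3, "tables": 2}, 60),   # 3*chairs + 2*tables <= 60
    ("table_cap", {"tables": 1}, 15),            # tables <= 15
]


def to_matrix_form(variables, constraints):
    """Flatten named constraints into a constraint matrix and RHS vector."""
    matrix = [[coeffs.get(v, 0) for v in variables] for _, coeffs, _ in constraints]
    rhs = [limit for _, _, limit in constraints]
    return matrix, rhs


def violated(plan, constraints):
    """Return the names of the constraints that a candidate plan breaks."""
    return [name for name, coeffs, limit in constraints
            if sum(c * plan.get(v, 0) for v, c in coeffs.items()) > limit]


A, b = to_matrix_form(variables, constraints)
print(A)
print(b)
print(violated({"chairs": 10, "tables": 16}, constraints))
```

Both forms describe the same model. The named version just lets a student read, check, and debug one constraint at a time, which the bare matrix doesn't.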

**4.** Quite simply, I didn't need to teach any code in the book. Yes, in two places I have the reader record a macro of some clicks and then press the macro shortcut key a couple times, but that's it. And actually watching this loop run using keypresses is in itself a valuable lesson for those who don't intuitively get how something like a Monte Carlo simulation works.
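If you'd rather see that loop spelled out, here's a minimal Python sketch. The dice game is a made-up stand-in rather than an example from the book, and the seed is fixed only so the run is repeatable. Each trip through the loop is one press of the macro key: recalculate the random cells, record the result, repeat.

```python
import random


def simulate_dice_totals(trials, seed=0):
    """Roll two dice `trials` times and record each total."""
    rng = random.Random(seed)  # fixed seed is an assumption, for repeatability only
    totals = []
    for _ in range(trials):
        # One "keypress": redraw the random inputs and log the outcome.
        roll = rng.randint(1, 6) + rng.randint(1, 6)
        totals.append(roll)
    return totals


totals = simulate_dice_totals(10_000)
average_total = sum(totals) / len(totals)
share_of_sevens = totals.count(7) / len(totals)
print(average_total, share_of_sevens)
```

The answer is nothing more than a summary of the recorded trials, so the more keypresses you log, the closer the summary settles toward the true values.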

So there are a few things I really enjoyed about using spreadsheets to teach data science. Where did the spreadsheets fail?

**1.** Visualization. Visualization in Excel is nice when there's native support for the particular type of graph you want. But if you want a fan chart or a correlogram with critical values marked, then things get slightly annoying. You can often graph what you need by doing formatting cartwheels. Grrrrr.

**2.** Spreadsheets are ugly for true matrix math. The beauty of something like R versus Excel becomes most apparent not when performing boosting or bagging or clustering or any of these more complex things. The place where it's most glaring is in taking a t test in a multiple regression by hand. Why? Because you have to do matrix inversions on large portions of data in order to get the standard error of the regression coefficients. And that's just unattractive in Excel. Sure, Excel does it for you using the LINEST function, but I wanted to teach t tests from the ground up. R would have been better there.
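For the curious, here's roughly what that by-hand calculation looks like as a Python sketch. It uses plain lists and a small Gauss-Jordan inversion so nothing is hidden in a library; the tiny dataset is invented and the helper names are hypothetical, not anything from Excel or R. The point is the (X'X)^-1 step: the coefficient standard errors come straight off its diagonal.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def invert(m):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(m)
    # Augment the matrix with the identity: [m | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regression_t_stats(X, y):
    """OLS coefficients, standard errors, and t statistics.

    X must already include a leading column of 1s for the intercept.
    """
    n, p = len(X), len(X[0])
    Xt = transpose(X)
    xtx_inv = invert(matmul(Xt, X))                      # (X'X)^-1
    y_col = [[v] for v in y]
    beta = [row[0] for row in matmul(xtx_inv, matmul(Xt, y_col))]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    sse = sum((actual - fit) ** 2 for actual, fit in zip(y, fitted))
    sigma2 = sse / (n - p)                               # residual variance
    std_errors = [math.sqrt(sigma2 * xtx_inv[j][j]) for j in range(p)]
    t_stats = [b / se for b, se in zip(beta, std_errors)]
    return beta, std_errors, t_stats


# Invented data: an intercept column plus one predictor.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [1, 3, 2, 5]
beta, std_errors, t_stats = regression_t_stats(X, y)
print(beta, std_errors, t_stats)
```

The same function handles more predictors by adding columns to X. In a spreadsheet, each of those helper functions turns into an array formula spread over a block of cells, which is where things get unattractive.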

**3.** Spreadsheets are occasionally slow. While Solver is awesome for teaching, its simplex and evolutionary algo implementations aren't going to blind anyone with speed. That's why in the book I recommend using OpenSolver plugged into Excel any time the reader can.

Anyway, I think that on balance the book is extremely powerful as a teaching tool, especially for a particular type of student...a student like me. Someone who has a deep-seated fear of script-kiddie-ness. Someone who needs to touch and see the data in order to believe. I am the Doubting Thomas of data scientists, but once I do work through a problem piece by piece, then I'm able to internalize a confidence in the technique. I know when and how to use it. Then and only then am I happy to stand on the shoulders of R packages and get work done.