Keith’s Stats Book Recommendations
Multivariate Statistical Analysis: A Conceptual Introduction
by Sam “Kash” Kachigan
As a statistics software trainer, I find that most folks I meet are not looking for a book that explains how to use a method by hand, without the computer. Unfortunately, almost all the books you will find on Multivariate Stats are very large tomes bloated with formulas and exercises. This book, in contrast: 1) is an easy read; 2) focuses on the concepts behind the techniques; 3) is reasonably priced; and 4) is not a door stopper. Check out the table of contents. You will probably find the technique you need: Factor Analysis, Cluster Analysis, etc. are all represented. If you are an SPSS user, as I am, you won’t find any pictures of SPSS or step by step instructions. You will find only the theory behind the techniques, but there is more than enough great content here to warrant adding it to your library. Not only that, most step by step guides skimp on the theory, so it is best that you get one of each. Check out Norusis for good “step by step” books. She also does a good job with the theory, but Kachigan gets the highest grade for explaining theory.
SPSS Guide to Data Analysis
by Marija Norusis
When I am approached by a brand new SPSS user – often a college student or new hire – I recommend this book. It starts with the very basics of the software as well as basic statistics. I don’t see the harm in using it to learn SPSS even if you have mastered the theory, because if you were to click through every instruction with the mouse, you would learn a lot. If you don’t know the theory, it is even more helpful. Median and mean, exploring data, and basic inferential statistics are all covered. It is not a comprehensive treatment of every subject for sure, but it is perhaps the very best place for the novice to start. As an added bonus, it comes with data so that you can really explore the exercises. This is a better choice than buying a user’s guide if you don’t have one already. It is also a better choice than her Statistical Procedures Companion if you are new to SPSS.
Note on version: I reviewed 14.0; SPSS software is now in version 18.0. Don’t accidentally buy a dusty old copy of an out of date version, even at a discount. It’s worth it to buy the new one. And certainly don’t pay full price for an old version!
SPSS Advanced Statistical Procedures Companion
by Marija Norusis
I make a living as a consultant helping people understand their SPSS results, among other things. I have always been a fan of this author's books, and I am glad I own this one. See in particular SPSS 15.0 Statistical Procedures Companion. The table of contents for this book can be found at http://www.norusis.com/book_ASPC_v15.php. The critical detail is that the word "Advanced" refers to a module in SPSS that performs particular tasks - the same tasks listed in the Table of Contents. Some tasks in the book require the "Regression" module.
Why should you be careful about this book? Simply, it is not a novice book. Some might take it to be the right choice for those who already have grounding in basic statistics. Not so. One of the other books would be a better choice. Or, if you are interested in a review of the basics leading up to intermediate level, consider the Field book Discovering Statistics Using SPSS (Introducing Statistical Methods S.) (2nd Edition). After all, Multidimensional Scaling and PLUM are not everyday tasks even if you have mastered statistics thoroughly.
The strongest aspect of the book is that the topics covered here are covered nowhere else! The treatments are more thorough than the help files, but not as much more thorough as you might hope. Complete, but without handholding. You have to be ready. It also comes with a CD. Some techniques are brand new, and covered only in this edition - frankly, the reason I felt I had to own this book. The 32 pages on GzLM are a noteworthy example.
The weakest aspect of the book (although perhaps necessary) is that these examples can only serve to supplement a lot of prior knowledge and an existing statistics library. Those who have both will benefit; others need to supplement this book with the others mentioned, and maybe more. (I, among others, have assembled a listmania list of good SPSS basic books.) Having said all that, some chapters are really quite extensive - note the 80 pages on MDS by Young and Harris of UNC.
Bottom line: If you are a power user who owns the stats modules and needs help setting up the menus and interpreting the output, you really need this book. Anyone fitting this description probably has access to several (dozens??) stats books, and might need them nearby to look up terms, to get more detail, etc. Perhaps one could get one of the 81 citations? A good bibliography like this one can be a great start.
Regression: A Primer
by Paul Allison
This starts with the very basics. What is correlation? What is regression? By the time it gets to multiple regression it is nearly over. ALL to its credit. It is eminently readable. It is non-technical and clear. It doesn’t have any SPSS step by step, but that is not the point of the book. Too basic for some, it is perfect for the novice. I benefited mostly from inspiration on how best to explain regression to others. Brief and relatively inexpensive, probably worth having on hand even if you are not a novice.
SPSS Survival Manual, Julie Pallant, 2nd Edition, 318 pages, 2005.
This book is often mentioned to me as a great introduction to SPSS. I have looked at it briefly when folks have brought it to class. Then I will get asked to endorse it to other members of the class. At first glance, it was hard to look past the copyright. The first edition – which is still in circulation – is quite old, and even the second edition is for version 12.0. Version 16.0 is around the corner. I was quite pleased when I saw the "visual bander", which is a fantastic feature that we got in 12.0.
I recently bought a copy to review. I was pleasantly surprised. Ms. Pallant has managed to keep the essentials to such a degree that the version change was barely noticeable. When I introduce someone to SPSS, they are learning it for work, and I have the better part of a week to teach them. Under those circumstances, due diligence requires that I teach all of the "latest and greatest".
This book knows its audience – undergraduates who are trying to survive an SPSS lab with a small chance of ever seeing Statistics again. This is not to slight the book in any way. The title and description make this explicit. As a result, the author wisely spends a number of chapters on setting up SPSS on the assumption that one will be typing from a survey in a university setting. This is done carefully and I am sure that the author wins the loyalty of her readers in these pages. Then, there are about 50 pages on descriptive statistics, and 150 pages on multivariate. The emphasis is clearly on the basics: the fundamentals often get more pages than advanced techniques. This is as it should be. Also, do not use this book to explain all the choices, and all the output in each technique. If you need this – and some do – buy Andy Field’s Discovering Statistics Using SPSS. Field’s book is also the better choice if this is the first of several statistics courses for you.
If you have a decent textbook, but need help better understanding it, and you need help in the lab, this is a good choice.
Discovering Statistics Using SPSS (Introducing Statistical Methods S.) (2nd Edition) by Andy Field
Best choice for the novice who is going to be studying Stats for a while.
This book comes up in conversation a lot. It is outstanding. I have come to the conclusion that if a serious user of SPSS's statistical features is to get only one reference, this is it. Something I have noticed is that when I meet someone who has spent time with the book, they are invariably quite good at SPSS. Even if they may not have mastered all the techniques in this large book, they know their stuff.
Pallant's SPSS Survival Manual, which I have also reviewed here, is designed to help you survive a first (and presumably last) course in basic statistics. The Field book, however, could be revisited again and again, each time reaching a deeper understanding.
I already know the statistics in this book well, so I can't claim that this book has taught me the basics, but it covers all the major topics of interest while keeping things as simple as possible. I wish it had existed earlier in my career. The main advantage to users of SPSS is that all of the examples are SPSS examples. However, make no mistake, this is a serious introduction to statistics, not merely a point and click guide. It is not current with version 15.0, but I don't think this is a major strike against it, given the excellent review of theory. If, however, you really need to keep up on the current features like I do, you will want to consider books in addition to this one. Consider one or more of the three Norusis books depending on your level and needs.
Data Mining
Data Preparation for Data Mining (The Morgan Kaufmann Series in Data Management Systems)
by Dorian Pyle
I have been helping folks learn Clementine - a data mining package - for several years. I have read a number of related books, but never got to this one until recently. That was a mistake. This may be an important book for you if you are new to Data Mining, even if, especially if, you already have expertise in statistics and/or data base technology.
Although I still believe that someone brand new to the field should begin with Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management, this should be the second book that they read. Far too many books in this area read like statistics books (notably Data Mining Methods and Models).
Statistics training can be of enormous benefit to data miners, but leads to certain predictable errors. Not only that, many data miners already have statistics training and that just compounds the likelihood that they will make these mistakes when the book author fails to show the difference clearly. Pyle performs consistently well in this regard. He consistently focuses on the kinds of problems data miners are likely to see in their work.
To give just a couple of examples: few variables will already be stored as continuous, normally distributed variables; principal components analysis can be a problematic, even dangerous, way to eliminate predictors; missing data differs from "empty" data; and non-linearity is constantly present.
His practice data set has a real variety of variable types, and dozens of predictors. If you are figuring out if Data Mining can help you, start with the Berry/Linoff book. But if you are about to begin in earnest, read this book. Then, time permitting, start reading specific books on modeling or software. For instance, another Larose book, Discovering Knowledge in Data: An Introduction to Data Mining, has good, detailed coverage of algorithms, and some information on Clementine.
Discovering Knowledge in Data
by Daniel T. Larose
I have softened my opinion of this book - seek out my more recent review on Amazon.
Here is the original review:
I have mixed feelings about this book! I don’t think I can recommend it, but there are some sections that I like. Here in a nutshell is what I don’t like about it: he strongly implies that the way to avoid overreaching with automated techniques is for the human to hand pick variables after a rigorous bivariate analysis. The author, Larose, would not put it that way in a single sentence, but his logic forces you there. First, data mining is “easy to do badly” and “there are no automatic data mining tools that will solve your problems mechanically ‘while you wait’”. Second, almost 40 pages are spent on data cleaning and prep, but only a couple of pages are spent on hold-out validation. Third, if you screw up, “the wrong analysis is worse than no analysis, since it leads to policy recommendations that will turn out to be expensive failures.” Conclusion: if a well prepared domain expert with good, clean data is at the reins, you have a good chance of a positive result if they liberally apply their business experience to the model. Although it seems impossible to disagree with this wisdom, it can actually be dangerous if taken literally. Read carefully, Larose says many of the right things, but if you aren’t careful, you will get the impression that business experience and the expert’s exploration of the data are driving the model.
In contrast, I would suggest that the critical piece is the hold-out validation! A good validation should be able to show you that the theory/model you developed works. At the risk of being too colorful, I don’t care if the insight comes to you in the shower, or is whispered to you by a tarot card reader: if you do a careful validation, you can confirm that the model works. The trick is to use data that the machine has never seen – or better, that you have never worked directly with either (it must be clean).
Those, like Larose, who emphasize the human pre-processing of the data seem to neglect the very real fear that immersion in the data can prevent surprise if you don’t take the validation very seriously. It is the linchpin of data mining. What is surprise? The discovery of something that you didn’t expect, but that you can prove is real. Not merely a fluke, but a true, real discovery. If you aren’t careful, you could ‘clean’ your surprises right out of the data if you drop variables that have a weak relationship to the dependent, or remove all of your outliers. In Larose’s defense, he explicitly says not to do this, but only after dozens of pages on outliers and bivariate analyses.
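To make the point about hold-out validation concrete, here is a minimal Python sketch. It is an illustration only, not something taken from Larose's book: the data are synthetic, and make_rows, split_holdout, and accuracy are hypothetical helper names. Column 0 is assumed to be a genuine predictor that agrees with the target about 90% of the time, and the other 200 columns are pure noise. The training set is deliberately kept small (20% of the cases) so that a lucky noise column is easy to find.

```python
import random


def make_rows(n_rows, n_noise, rng):
    """Build synthetic cases: column 0 is a real signal, the rest are noise."""
    rows = []
    for _ in range(n_rows):
        y = rng.randint(0, 1)
        # Assumption for this sketch: the real predictor matches y ~90% of the time.
        signal = y if rng.random() < 0.9 else 1 - y
        noise = [rng.randint(0, 1) for _ in range(n_noise)]
        rows.append(([signal] + noise, y))
    return rows


def split_holdout(rows, holdout_fraction, rng):
    """Shuffle a copy of the cases and set a portion aside, untouched."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


def accuracy(rows, column):
    """Share of cases where the 0/1 value in `column` equals the 0/1 target."""
    hits = sum(1 for x, y in rows if x[column] == y)
    return hits / len(rows)


rng = random.Random(42)
rows = make_rows(n_rows=300, n_noise=200, rng=rng)
train, holdout = split_holdout(rows, holdout_fraction=0.8, rng=rng)

# "Mine" the training data: pick the noise column that happens to look best.
noise_columns = range(1, 201)
fluke = max(noise_columns, key=lambda c: accuracy(train, c))

print("fluke column, training accuracy:", accuracy(train, fluke))
print("fluke column, hold-out accuracy:", accuracy(holdout, fluke))
print("real signal,  hold-out accuracy:", accuracy(holdout, 0))
```

The noise column that looked promising on the training cases should fall back toward coin-flip accuracy on the hold-out cases, while the real signal should hold up. It does not matter where the candidate rule came from; the hold-out cases are what separate a fluke from a discovery.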
What do I like about the book? One major piece, but I like it A LOT. He has the brilliant insight to use tiny, tiny data sets to walk through every step of potentially complicated algorithms like neural nets and CART. These data sets have only a dozen or so cases, which makes the walk through easy to understand. I don’t think this is enough to warrant buying the book since it is thin and expensive, but I have reread those sections several times, and I say more about the details than I used to in lecture because I have become convinced that they are easier to explain than I thought, and provide real insight into the outcomes.
Overall, unless your book budget for this stuff is virtually unlimited, buy Berry and Linoff’s Data Mining Techniques. Note: I have since purchased two other books in this series, and this author's books have value in a large collection.
Cluster Analysis
by Brian S. Everitt
Let’s face it. Few SPSS users need a 200+ page book on just cluster analysis. As a trainer, I am a happy owner of this book, but partly as a reference to look up rare questions. My method for choosing this book will help clarify this. I took terms that were poorly described in SPSS help, and then looked them up in the index of this book on Amazon. I found several, so I bought the book.
I was pleased with the result. It puts cluster analysis in a much broader context than SPSS classes or user’s guides do. It talks about techniques that SPSS can’t do. It obviously goes into greater detail, including more than a few formulas, but it reads fairly well. I still don’t think that more than a handful of the folks I meet in class need this. Kachigan, reviewed elsewhere, and SPSS’s Marketing Segmentation course or book would be more relevant to a wide audience.
Note that you won’t find any explicit references to software except for an appendix which lists stats software and the related cluster features. This part is quite out of date. There are no SPSS pictures or examples. Still, if you want the whole story, this is a fine choice.
Berry and Linoff, Data Mining Techniques, 2nd edition, 2004, Wiley
This is the best single volume on Data Mining you can buy. The table of contents, of course, says a lot. As one who mostly teaches methodologies, I was quick to notice that all the major topics are here: neural nets, market basket, cluster, and trees. But there are also techniques that SPSS and Clementine cannot do, like “link analysis”. Good background to know. Also, unlike Larose (see review), the data preparation reads like preparing data for data mining, not a carbon copy of preparing data for statistics. I have pretty much concluded that a data mining book that does not make clear that data mining and OLAP are not the same is not a great book. This book has an extended section on just that. It is highly readable and comprehensive.
What No One Ever Tells You About Blogging and Podcasting: Real-Life Advice from 101 People Who Successfully Leverage the Power of the Blogosphere
This book will take most readers about 101 minutes to read. It is designed in 101 very short chapters of about a page or two each. In fact each chapter is much like a Blog post. It is a series of interviews with commentary based on the author's discussions with 101 blog writers/experts. I got several ideas from it. It sent me off to the computer to find some of these bloggers. It inspired me to make a post or two. A couple of hours reading followed by a couple of hours surfing made the purchase well worth it. The content on podcasting is a handful of these short chapters. It is well worth a look, but it is not likely something one would read a second time.
Data Mining with Decision Trees: Theory and Applications
Specialists should consider it, practitioners should look elsewhere
I am always on the look-out for books to recommend to my data mining clients. I was anxious to read this book because as a data miner (and trainer), I need to know the details. I will recommend this to one or two colleagues, but it will not be something I will recommend to clients. My interest is as an almost daily user of techniques like this, but as one that does not write the software, nor the algorithms.
The first thing you notice about this book is its very academic style. It has numbered paragraphs like 2.0, and 6.1.8, and 7.3.1.12, etc. According to its preface, it has been used as a graduate text, presumably for mathematicians and computer scientists. I think it would be good for that purpose. I think it could work quite well for statisticians that are interested in the details of data mining algorithms. It is Vol. 69 of a series in Machine Perception and Artificial Intelligence. Other titles include “Fundamentals of Robotics”, and “Bridging the Gap Between Graph Edit Distance and Kernel Machines”. I cite these titles so that you don’t confuse this book with something like Data Mining Techniques, which is written for a general audience. This is not. It opens the 2nd chapter with (condensed): “A training set is a bag instance of a bag schema. A bag instance is a collection of tuples that may contain duplicates.” The folks that I work with can instantly divide themselves into those that would consider a book like this, and those that wouldn’t. Note that the first chapter (which you may be able to see online) is of a much lighter style, as one would expect with an introduction. It cites references in almost every sentence, which can be distracting to the casual reader, and eventually convinced me that I need to read the original authors like Breiman.
So having issued a warning to the unprepared, there is plenty to like. The authors have made a real attempt to cover everything - all the methods of building and evaluating trees. I found 1/3 that I knew, 1/3 that will be quite useful, and 1/3 that is too much detail for me. Chapter 3 “Evaluation of Classification Trees” will be great for statisticians that wondered how to judge the efficacy of a tree that was built without hypothesis testing. Also, I was very pleased to see a chapter on Decision Forest, which is a discussion of “ensemble methods” in Decision Trees - in other words combining a set of tree models. Some practitioners might not know as much about this option as it is newer, and has not historically been implemented in as many data mining software packages.
I was hoping for something that would have a detailed chapter on each of the most common decision tree algorithms with briefer sections on the obscure ones. It has all this information, but in a way that I have to work pretty hard to get to it. If you want a quick overview of data mining (even if you think that trees are the method you are going to use), try Data Mining Techniques. If you want to know the details, but are content to learn the details only on the well-known techniques (like CHAID and CART), then Larose is a good choice.
Six Sigma for Dummies
No substitute for training, but will help you decide if Six Sigma is for you
I started reading this book because I was assigned to a Data Mining project in an organization where Six Sigma is popular. I recommend this book to anyone that just wants a simple answer to the question: “What is Six Sigma?”. Or if you are new to data analysis in general, you could read this as a warm up before attending training, but don’t expect one book (or at least this book) to teach everything about statistics and six sigma. I liked the explanation of the history and of what ‘black belts’ learn and do.
The style is a little too informal for my tastes. It is basically 300 pages of bullet statements, but that is in keeping with the idea behind the series. Also, although a minor complaint, it is littered with phrases like ‘astounding results’, ‘unwavering focus’, and ‘so expertly skilled’.
To give you a quick sense of the similarities between six sigma and data mining, I will simply list the DMAIC steps discussed in detail in the book: Define, Measure, Analyze, Improve, and Control. These steps are even on the cover of the book. In Data Mining a popular process is CRISP-DM (www.crisp-dm.org). Its steps are Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment. It doesn’t take much of a leap to see the connection, and after hours of reflection the similarities are still there. The differences are real, but harder to summarize.
Most data miners could make a real dent in this book on one or two long airplane flights, so given the small investment in time and money, I recommend it highly. If you are new to statistics but need them for six sigma, there are parts that might seem a tough slog, but that is what everyone (including the authors) seems to say about statistics.

