An old review reflecting common Data Mining mistakes.

I am updating my book reviews, and I have decided to review more books briefly and direct everyone to Amazon for the full write-ups. I didn't want to remove this material entirely, though, since my Amazon review doesn't cover the flaws of this book in as much depth.

Let former statisticians who become data miners beware. The advantages of statistical training for data miners are too numerous to mention here, but there are dangers as well!

Here is the original review of Larose's first book: 

I have mixed feelings about this book! I don’t think I can recommend it, but there are some sections that I like. Here, in a nutshell, is what I don’t like about it: he strongly implies that the way to avoid overreaching with automated techniques is for the human to hand-pick variables after a rigorous bivariate analysis. The author, Larose, would not put it that way in a single sentence, but his logic forces you there. First, data mining is “easy to do badly” and “there are no automatic data mining tools that will solve your problems mechanically ‘while you wait’”. Second, almost 40 pages are spent on data cleaning and preparation, but only a couple of pages are spent on hold-out validation. Third, if you screw up, “the wrong analysis is worse than no analysis, since it leads to policy recommendations that will turn out to be expensive failures.” Conclusion: if a well-prepared domain expert with good, clean data is at the reins, you have a good chance of a positive result, provided they liberally apply their business experience to the model. Although it seems impossible to disagree with this wisdom, it can actually be dangerous if taken literally. Read carefully, Larose says many of the right things, but if you aren’t careful, you will get the impression that business experience and exploration should drive the model.

In contrast, I would suggest that the critical piece is the hold-out validation! A good validation should be able to show you that your developed theory/model works. At the risk of being too colorful: I don’t care if the insight comes to you in the shower or is whispered to you by a tarot card reader; if you do a careful validation, you can confirm that the model works. The trick is to use data that the machine has never seen, or better yet, data that you have never worked with directly either (it must be clean). Those, like Larose, who emphasize human pre-processing of the data seem to neglect the very real fear that immersion in the data can prevent surprise if you don’t take the validation very seriously. Validation is the linchpin of data mining.

What is surprise? The discovery of something that you didn’t expect, but that you can prove is real. Not merely a fluke, but a true, real discovery. If you aren’t careful, you could ‘clean’ your surprises right out of the data by dropping variables that have a weak relationship to the dependent variable, or by removing all of your outliers. In Larose’s defense, he explicitly says not to do this, but only after dozens of pages on outliers and bivariate analyses.
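To make the point concrete, here is a minimal hold-out validation sketch in Python. The data set is synthetic and the scikit-learn calls are my choice of illustration; none of this comes from Larose’s book. The point is only that the model is judged on rows it never saw during fitting.

```python
# A minimal hold-out validation sketch (invented example, not from the book).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a cleaned data mining data set.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Set aside a hold-out set BEFORE any modeling; the model never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Training accuracy flatters the model; hold-out accuracy is the real test.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```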

What do I like about the book? One major piece, but I like it A LOT. He has the brilliant insight to use tiny, tiny data sets to walk through every step of potentially complicated algorithms like neural nets and CART. These data sets have only a dozen or so cases, which makes the walk-throughs easy to understand. I don’t think this is enough to warrant buying the book, since it is thin and expensive, but I have reread those sections several times, and I now say more about the details in lecture than I used to, because I have become convinced that they are easier to explain than I thought and provide real insight into the outcomes.
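For a flavor of what that device looks like, here is a hedged sketch: CART fit to an invented dozen-case data set (the ages, incomes, and labels below are mine, not Larose’s), small enough that every split in the printed tree can be checked by eye.

```python
# A tiny-data-set walkthrough in the spirit of Larose's teaching device.
# The data is fabricated for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# A dozen cases: [age, income in thousands] -> bought (0/1)
X = [[25, 30], [32, 45], [47, 80], [51, 120], [23, 25], [36, 60],
     [29, 40], [58, 95], [41, 70], [22, 20], [45, 85], [33, 50]]
y = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# With so few cases, the printed tree can be verified split by split.
print(export_text(tree, feature_names=["age", "income"]))
```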

Overall, unless your book budget for this stuff is virtually unlimited, buy Berry and Linoff’s Data Mining Techniques instead. Note: I have since purchased two other books in this series, and this author’s books have value in a large collection.

One comment on “An old review reflecting common Data Mining mistakes.”

  1. I would counter the argument of:

    “the wrong analysis is worse than no analysis, since it leads to policy recommendations that will turn out to be expensive failures.”

    with

    “Errors using inadequate data are much less than those using no data at all.”
    Charles Babbage

    I really like the book “How to Measure Anything” by Hubbard. In that book, Hubbard talks about the value of reducing the entropy of information. Essentially, analytics is not about having 100% clarity. In the “real world”, making a BETTER decision is the goal, especially when there are millions of micro-level decisions to be made (e.g., highly automated ones). Propensity models are a great example. If you have to determine the likelihood of fraud across millions or billions of transactions, I’ll take a BETTER decision over NO decision any day. A minimal sketch of this idea follows below.
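To illustrate the commenter’s point, here is a minimal propensity-scoring sketch. The features, the toy fraud rule, and the top-100 cutoff are all invented for illustration; a real system would be far more involved.

```python
# A minimal propensity-scoring sketch: a logistic regression assigns each
# transaction a fraud probability, so millions of micro-decisions can be
# ranked even without 100% certainty. All data below is fabricated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Invented features: transaction amount and hour of day for 10,000 rows.
X = np.column_stack([rng.exponential(100, 10_000),
                     rng.integers(0, 24, 10_000)])
y = (X[:, 0] > 300).astype(int)  # toy stand-in for historical fraud labels

model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]  # fraud propensity per transaction

# Act on the riskiest slice: a BETTER decision beats NO decision.
flagged = np.argsort(scores)[-100:]
print("flagged transaction indices:", flagged[:5], "...")
```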



