I’m leaving for Las Vegas tonight for IBM Insight. It is one of the best chances all year to meet with the legacy SPSS Inc. folks at IBM, and learn about what is headed our way in 2015 in SPSS. Also, of course, there is always a lot of IBM news. The subtitle this year is “The Conference for Big Data and Analytics”. I’m not a fan of the phrase “Big Data” as everybody who uses the term uses it differently, but I am very interested in IBM BigInsight and what it might mean for those of us that use Modeler. There are sessions about SPSS Statistics 23. There is no release announced, but I would imagine that this is an early sign that a new version is coming soon. There are no sessions for Modeler 17, so the lack of sessions might also be a hint as to where we are in the development cycle. Watson is going to have a real presence at the show.
I will be at the conference bookstore at noon on Monday to sign copies of the IBM SPSS Modeler Cookbook.
Kevin Spacey will be on hand. I love House of Cards, and I’m a fan, so that will be fun. I’m curious to know what the topic will be and whether they are going to try to tie his presentation into a conference theme. That might be a stretch, but he is such a pro that I’m sure it will be a good talk. No Doubt will be performing. I’m usually wiped out after a day of sessions – I go to something in almost every time slot – so I’m not sure that I will partake. They definitely try to put on a good show, though.
IBM has really formalized the process of watching from home. They call it InsightGO.
I will try to Tweet a few times a day.
PACKT has just posted a sampling of four recipes that I curated from the entire book. I think they are a fun sampling. Here I’ve written a little bit about my rationale for choosing the recipes that I did. Enjoy.
From Chapter Two, Data Preparation: Select, I’ve chosen Using the Feature Selection node creatively to remove, or decapitate, perfect predictors, to illustrate this. It happens to be one of mine. It is not difficult, but it uses a key feature in an unexpected way.
From Chapter 6, Selecting and Building a Model, Next-best-offer for large data sets is our representative of the “pushing the limits” category. Most of the documentation of this subject uses a different approach that, while workable on smaller data sets, is not scalable. We were fortunate to have Scott Mutchler contribute this recipe in addition to his fine work in Chapter 8.
From Chapter Seven, Modeling – Assessment, Evaluation, Deployment, and Monitoring, Correcting a confusion matrix for an imbalanced target variable by incorporating priors, by Dean Abbott, is a great example of the unexpected. The Balance Node is not the only way to deal with an out of balance target. Also from Chapter Seven, I’ve chosen Combining generated filters. This short recipe definitely invokes that reaction of “I didn’t know you could do that!” It was provided by Tom Khabaza.
Depending on how you count, it has been a two year process. It was that long ago that I asked Tom Khabaza if he would consider taking on the challenge of an Introductory Guide to SPSS Modeler (aka Clementine). We had a number of spirited discussions via Skype and a flurry of email exchanges, but we were both so busy that we barely had time to have planning meetings much less write in between them. I don’t know if our busy consulting schedules made us the best candidates or the worst candidates for undertaking the first 3rd party book on the subject.
Some weeks after starting our quest, I got a LinkedIn message from an acquisitions editor at PACKT – a well established publisher of technical books in the UK. She wondered if I would consider writing a book about Modeler. I replied the same day. In fact, I replied in minutes because I had been working online. She had a very different idea for a book, however. She recommended a ‘Cookbook’: a large number of small problem solving ‘recipes’. Tom and I felt there was still a need for an Introductory book. (Look for it in Q1 of 2014). Nonetheless we were intrigued. Encouraged by the publisher we got to work again, but in a different direction. Believing, naively, that more authors made it easier, I recruited one more, then two more, and then, eventually, a third additional author. I can now tell you that five authors does not make it easier. However, it does make it better. I am very proud of the results.
We cover a wide variety of topics, but all the recipes have a focus on the practical step by step application of ‘tricks’ or non-obvious solutions to common problems. From the Preface: “Business Understanding, while critical, is not conducive to a recipe based format. It is such an important topic, however, that it is covered in a prose appendix. Data Preparation receives the most attention with 4 chapters. Modeling is covered, in depth, in its own chapter. Since Evaluation and Deployment often use Modeler in combination with other tools, we include somewhat fewer recipes, but that does not diminish its importance. The final chapter, Modeler Scripting, is not named after CRISP-DM phase or task, but is included at the end because its recipes are the most advanced.”
Perhaps our book is a bit more philosophical than most analysis or coding books. Certainly, the recipes are 90% of the material, but we absolutely insisted on the Business Understanding section: “Business objectives are the origin of every data mining solution. This may seem obvious, for how can there be a solution without an objective? Yet this statement defines the field of data mining; everything we do in data mining is informed by, and oriented towards, an objective in the business or domain in which we are operating. For this reason, defining the business objectives for a data mining project is the key first step from which everything else follows.” Weighing in at 20 pages, it is a substantial addition to a substantial eight chapter book with dozens of recipes including multiple data sets, and accompanying Modeler streams.
I am also terribly proud of my coauthors. We have a kind of mutual admiration society going. I am pleased that they agreed to coauthor with me. They, I suspect, were glad that they didn’t have to play the administration role that I ended up with. In the end, we produced a project where each one of us learned a great deal from the others. Our final ‘coauthor’, Colin Shearer, was kind enough to write a Foreword for us: “The first lines of code for Clementine were written on New Years Eve 1992, at my parents’ house, on a DEC Station 3100 I’d taken home for the holidays.”
Colin has been a part of the story of Modeler from the very beginning, so we were terribly pleased to have him support us in this effort. All 6 of us have run into each other repeatedly over the years. The worldwide Modeler community was a very small one 15 years ago when most of us were learning Modeler. (Tom has a bit of a lead on the rest of us.) With IBM’s acquisition of SPSS Inc. some years ago, the community has rapidly grown. From the Foreword: “The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art.”
The book is being released in November, just a few weeks away. More information on the book, including a prerelease purchase opportunity, can be found on the PACKT website.
More information on the authors can be found here:
Scott Mutchler and I are the managers of the Advanced Analytics Team at QueBIT.
Dean Abbott is President of Abbott Analytics.
Meta Brown blogs at MetaBrown.com
More information about Tom Khabaza can be found at Khabaza.com
After many years of trying to align my calendar and travel schedule I have finally made it. I am at KDD 2013 in Chicago.
As I have always feared, it is very academic in nature – lots of graduate student papers and the like. There is not a whole lot of focus on application here. Nonetheless I think it is important to monitor what our friends in the Computer Sciences are up to. So far I have been to a Big Data Camp and a workshop focusing on Healthcare. I have been constantly reminded of the vast gap between my clients – software end users – and the academic researchers. The distance between them is matched by the gap between the software users and their colleagues. Colleagues who don’t care terribly much about the software, but must understand the solution. I feel like a fragile bridge between these very different worlds. I won’t be able to justify coming every year, but I needed to experience this first hand.
My calendar has finally allowed me to attend the Ohio State Center for Public Health Practice's summer program. They had one full length weekend course. I just completed the first day of David Hosmer's Survival Analysis class. The class follows the content of his text (coauthored with Stanley Lemeshow and Susanne May).
The class is a bit intense, to be honest, at more than 200 slides per day and clocking in at almost 8 hours of content. There are breaks, of course, but class started at 8:30 and ended at just a few minutes to 5. Since most people reading my blog would be in industry and coming off a full week, be forewarned. Having issued the warning, however, I learned a great deal. I've taught chapter length treatments of this subject in SPSS Inc's old three day Advanced Stats class. That 90 minutes of material clearly had to leave plenty of detail out. Even at a full two days, Dr. Hosmer has to leave plenty of material out of the discussion. Some of the highlights of the experience included learning more about options in Stata and SAS, and when not to trust defaults – topics that just didn't fit my presentation on the subject.
I expect to post again when I've had a chance to reduce some of my lessons learned to writing. In the meantime make a note to check out the 2014 program! It is held around this time of year each year.
IBM has just released a new SPSS brand product. I have numerous friends in the SPSS community, and I have been a frequent beta tester, but I didn't know in advance about this release. It does resemble something that I saw demonstrated at last year's IOD. What to make of this product? It is web based, and looks pretty slick: Analytic Catalyst. There is also a video on YouTube. I like the visuals, and I agree that it looks easy to use. I'm anxious to try it, and might recommend it in certain client situations.
Never forgetting that the lion's share of a Data Mining project's labor is spent on Data Prep, and since I've never been on a project that didn't need Data Prep, I think that a tool like this is most useful after a successful Data Mining project is complete. For instance, I worked like crazy on a recent churn project, but after the project the marketing manager had to explore high churn segments to come up with intervention strategies. This could be used for that purpose. Or perhaps 'repurposed' for that since the video seems to indicate that it would be used in the early stages of a project.
My reaction, not a concern exactly, is the premise. It seems to assume that the problem is business users tapping directly into Big Data to explore it, searching for 'insight'. I don't think most organizations need more insight. I think they need more deployed solutions – solutions that have been validated that are inserted into the day to day running of the business. My two cents.
Statistical Hypothesis testing does an OK job at avoiding proving the presence of effects, but it does a mediocre job (or worse) at disproving them. There are a lot of reasons for this, poor training among them, but it is largely systemic. I spent my Thanksgiving morning watching the “Vanishing of the Bees,” and my mind kept drifting to thoughts of Type II error. I know. I can grasp the obvious … maybe I need a break.
I don’t have any biological expertise in evaluating, in detail, the research on either side of the fascinating Colony Collapse Disorder debate, but I am always suspicious of negative findings of any kind unless I can read the research. In the case of this documentary, they claim (a claim that is perhaps biased) that pesticides were determined to be safe after administering a fairly large dose to an adult bee, and determining that the adult bee did not die during the research period. Was that enough? I can’t speak to the biology/ecology research, but it got me thinking about Type II.
We know well the magnitude of the risk we face in committing Type I, and it is trained into us to the point of obsession. When meeting analysts wearing this obsession on their sleeve, reminding everyone who will listen, leveling their wrath on marketing researchers daring to use exploratory techniques, I am often tempted to ask about controlling for Type II. I am often underwhelmed with the reply. There are just so many things that can go wrong when you get a non-significant result. Although I wrote about something similar in my most recent post, I am compelled to reduce my thoughts to writing again:
1) The effect can be too small for the sample size. Ironically, the problem is usually the opposite. Often researchers don’t have enough data even though the effect is reasonably big. In this case, I was persuaded by the documentary’s argument that bee “birth defects” would be a serious effect. Maybe short term adult death was not subtle enough. More subtle would require more data.
2) The effect can be delayed. My own work doesn’t involve bees, but what about the effect of marketing? Do we always know when a promotion will kick in? Are we still experiencing the effects of last quarter’s campaign? Does that cloud our ability to measure the current campaign? Might the effects overlap?
3) The effect could be hidden in an untested interaction (AKA your model is too simple). The bee documentary proposed an easy to grasp hypothesis – that the pesticide accumulates over time in the adult bee. Maybe a proximity * time interaction? We may never know, but was the sample size sufficient to test for interactions, or was Power Analysis done assuming only main effects? Since they were studying bee autopsies the sample size was probably small. I don’t know the going rate for a bee autopsy, but they are probably a bit expensive since the expertise would seem rare.
4) Or it’s hidden in a tested interaction (AKA your model is too complex). I had a traumatic experience years ago when a friend asked me what “negative degrees of freedom” were. Since she was not able to produce a satisfactory answer to a query regarding her hypothesized interactions, her dissertation committee required her to “do all of them”. Enough said. It was horrible.
5) The effect might simply be, and what could be more obvious, not hypothesized. This, we might agree, is the real issue regarding the adult bee death hypothesis. It may not have been the real problem at all.
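To make the first point concrete, here is a minimal simulation sketch. It is not a reconstruction of any study mentioned above. It assumes two groups drawn from normal distributions with a known standard deviation of 1, a true difference of half a standard deviation, and a two-sided z-test at the .05 level. The function name and every number in it are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean


def type_ii_rate(n_per_group, true_diff, alpha=0.05, n_sims=2000, seed=1):
    """Estimate how often a real effect is missed (Type II error).

    Assumes normal data with known sigma = 1 and a two-sided z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    std_err = math.sqrt(2 / n_per_group)  # SE of a difference in means, sigma = 1
    misses = 0
    for _ in range(n_sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / std_err
        if abs(z) < z_crit:  # non-significant even though the effect is real
            misses += 1
    return misses / n_sims


for n in (20, 100):
    rate = type_ii_rate(n, true_diff=0.5)
    print(f"n per group = {n:3d}: real effect missed in {rate:.0%} of runs")
```

Under these assumptions the small design misses the real effect most of the time, while the larger one rarely does. A non-significant result from the first design says very little about whether the effect exists.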
Statistics doesn’t help you find answers. Not really. It only helps you prove a hypothesis. When you are lucky, you might be able to disprove one. Often, we have to simply “fail to prove”. In any case, I recommend the documentary. Now that I’ve been able to vent a bit about Type II, I should watch it again and focus more of my attention on the bees.
When you get a statistical result, it is too easy to jump immediately to the conclusion that the finding “is statistically significant” or “is not statistically significant.” While that is literally true since we use those words to describe below .05 and above .05, it does not imply that there are only two conclusions to draw about our finding. Have we ruled out the possible ways that our statistical result might be tricking us?
Things to think about if it is below .05
Real: You might have a Real Finding on your hands. Congrats. Consider the other possibilities first, but then start thinking about who needs to know about your finding.
Small Effect: Your finding is Real, but is of no practical consequence. Did you definitively prove a result with an effect so small that there is no real world application of what you have found? Did you prove that a drug lowers cholesterol at the .001 level, but the drug only lowers it at a level so small that no Doctor or patient will care? Is your finding of a large enough magnitude to prompt action or to get attention?
Poor Sample: Your data is not representative of the population. There is nothing you can do at this point. Are you sure you have a good sample? Did you start with a ‘Sampling Frame’ that accurately reflects the population? What was your response rate on this particular variable? Would the finding hold up if you had more complete data? Have you checked to see if the respondent and non-respondent status on this ‘significant’ variable is correlated with any other variable you have? Maybe you have a census, or you are Data Mining – are you sure you should be focused on p values?
Rare Event: You have encountered that 5% thing. It is going to happen. The good news is we know how often it is going to happen. If you are like everyone else, you probably are operating at 95% confidence, and then each test, by definition, has a 5% chance of coming in below .05 from random forces alone. So you have a dozen findings – which ones are real? Was choosing 95% Confidence a deliberate and thoughtful decision? Have you ensured that Type I error will be rare? If you have a modest sample size did you choose a level of confidence that gave you enough Statistical Power (see below)? If you are doing lots of tests (perhaps Multiple Comparisons) did you take this into account or did you use 95% confidence out of habit?
Too Liberal: You have violated an assumption which has made your result Liberal. Your p value only appears to be below .05. For instance, did you use the usual Pearson Chi-Sq when Continuity Correction would have been better? Maybe Pearson was .045, Likelihood Ratio was .049, Continuity Correction was .051. Did you choose wisely? Did you use Independent Samples T-Test when a non-parametric would have been better? Having good Stats books around can help, because they will often tell you that a particular assumption violation tends to produce Liberal results. You could always consider a Monte Carlo simulation or Exact Test, and make this problem go away. (An interesting ponderable is to ask if we are within a generation of abandoning distributional assumptions as ordinarily outfitted computers get more powerful?)
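Two of the items above lend themselves to a quick numeric check. The following is a rough sketch using only the Python standard library. The first function computes the chance of at least one spurious finding across several tests (the Rare Event item), assuming the tests are independent. The second compares the Pearson Chi-Square with the continuity-corrected version on a 2x2 table (the Too Liberal item). The table counts are made up for illustration and the function names are hypothetical. The sketch also prints a Bonferroni-adjusted level, one common and conservative way to account for multiple tests, which is named here only as an example.

```python
import math


def familywise_error(n_tests, alpha=0.05):
    """Chance of at least one p < alpha from chance alone, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


def chi_square_2x2(a, b, c, d, continuity=False):
    """Chi-square statistic and p value (1 df) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if continuity:
        diff = max(0.0, diff - n / 2)  # Yates continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value


# A dozen tests at 95% confidence, as in the Rare Event item
print(f"12 tests at .05: P(at least one false alarm) = {familywise_error(12):.3f}")
print(f"Bonferroni per-test level for 12 tests: {0.05 / 12:.4f}")

# Hypothetical counts chosen so the two versions of the test disagree
for label, corrected in (("Pearson", False), ("Continuity Correction", True)):
    stat, p = chi_square_2x2(12, 5, 6, 12, continuity=corrected)
    print(f"{label}: chi-square = {stat:.3f}, p = {p:.3f}")
```

With a dozen independent tests the chance of at least one false alarm is roughly 46%. With these made-up counts the uncorrected test lands below .05 and the corrected one lands above it, which is exactly the situation where the choice of test deserves a deliberate justification.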
Things to think about if it is above .05
Negative Finding: You might have disproven your hypothesis. (I know that you have ‘proven’ your ‘Null Hypothesis’, but does anyone talk that way outside of a classroom?) Congrats might be in order. Consider the other possibilities and then start thinking about who needs to know about your negative finding. If it is the real thing, a negative finding could be valuable. Be careful however before you shout that the literature was wrong. Make sure it is a bona fide finding.
Power: You may simply have lacked enough data. Did you do a Power Analysis before you began? Was your sample size commensurate with your number of Independent Variables? Did you begin with a reasonable amount of data, but attempt every interaction term under the sun? Did you thoughtlessly include effects like 5 way interactions without measuring the impact that it had on your ability to detect true effects? If you aren’t sure what a Power Analysis is, it is best that you describe your negative results using phrases like: “We failed to prove X”, not “We were able to prove that the claim of X, believed to be true for years, was disproved by our study (N=17)”. You can also Google Jacob Cohen’s wonderful “Things I have Learned (So Far)” to learn more about Power Analysis. I mention it in my Resources section, and it has influenced my thinking for years. Its influence is certainly present in this post.
Poor Sample: Your data is not representative of the population. This one can get your p value to move, incorrectly, in either direction.
Too Conservative: You have violated an assumption which has made your result Conservative. Your p value only appears to be above .05. Did you use an adjusted test in an instance when no adjustment was needed? Did you use Scheffe for Multiple Comparisons, but aren’t quite sure how to justify your choice? Most assumption violations make our tests lean Liberal, coming in too low, but the opposite can occur.
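For the Power item above, a back-of-the-envelope calculation can show how much data a study needs before a non-significant result means much. This sketch assumes a two-group comparison of means, uses the normal approximation rather than an exact t-based calculation, and takes Cohen's conventional small, medium, and large effect sizes (0.2, 0.5, 0.8) as illustrative inputs. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect_size is the standardized difference (Cohen's d). Normal
    approximation; an exact t-based answer is slightly larger.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} per group for 80% power")
```

By this approximation, a study with 17 cases in total would be underpowered even for a large effect, which is why “We failed to prove X” is the safer wording.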
This list has served me well for a long time. Always best to report your findings thoughtfully. Statistics, at first, seems like a system of Rule Following. It is more subtle than that. It is about extracting meaning, and then persuading an audience with data. Without an audience, there would be no point. They deserve to know how certain (or uncertain) we are.
I will be speaking in Kuala Lumpur, Malaysia next week on the subject of Data Mining. I will be discussing Data Mining, in general, and then participants will get a chance to try it using the resources provided by the excellent tool neutral Elder, Miner, Nisbet book. I believe the event is at capacity, but there are already tentative plans to try this format again in January, 2012, also to be held in Kuala Lumpur, Malaysia. The event organizer stays in charge of the details, but if you are interested in finding out more about the January four day event please email me.
I am an independent Statistics and Data Mining trainer. I keep pretty busy teaching people how to use related software. I also consult when I can because I love using real data and producing results that will be put to immediate use.
- Off to IBM Insight 2014
- A sampling from the new Cookbook
- Announcing the IBM SPSS Modeler Cookbook
- KDD 2013
- Ohio State Summer Epi Classes
- My take on IBM’s new SPSS Analytic Catalyst
- New Role at QueBIT Consulting
- Reflections on Statistical Non-Significance
- Your Statistical Result is Below .05: Now What?
- Seminar Series in Kuala Lumpur, Malaysia