



Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking
T**D
READ THIS BOOK!
Data Science for Business by Foster Provost and Tom Fawcett is a very important book about data mining and data-analytic thinking. In 1971, Abbie Hoffman shocked the world when he demanded that hippie readers (at the time, a likely oxymoron) "Steal This Book". While I wouldn't go so far as to encourage current and future data scientists to shoplift, I will demand that they READ THIS BOOK!
Not long ago, data was difficult and expensive to come by. Today, we're living in a world of far too much data, vast amounts of cheap computing power, and way too many poorly defined questions. Mix them all together and you're guaranteed to make a mess.
Going from data dearth to plethora presents substantive issues. In business, the balance between gut-feel decision-making and analysis paralysis is changing rapidly. Whether it moves too far from gut to paralysis, only time will tell. Through Data Science for Business, Provost and Fawcett offer practitioners a guide to equilibrium.
Read this book and you'll find yourself moving briskly down the road towards data-analytic enlightenment. While the book is not highly technical, the authors cover each topic with enough rigor for the reader to appreciate the tools being presented and the insights being offered.
From the outset, the authors are clear about the book's objectives: "The primary goals of this book are to help you view business problems from a data perspective and understand principles of extracting useful knowledge from data. There is fundamental structure to data-analytic thinking, and basic principles that should be understood. There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear… As you get better at data-analytic thinking you will develop intuition as to how and where to apply creativity and domain knowledge."
This paragraph makes me think of all those undergraduate and graduate students studying statistics at universities all over the world, my daughter included, who are being bombarded by one math or statistics class after another (Calculus III, Math Stat I and II, Linear Algebra, etc.). Yet, far too often, they enter the real world lacking "data-analytic thinking" or a sense of "basic principles". They do, however, have a sense of being overwhelmed and underprepared. The epic battle between "frequentists" and "Bayesians" takes a back seat to what should be the real controversy in statistics departments around the world: the balance between "application" and "theory". The book's "primary goals" should be the marching orders of every statistics program at any college or university anywhere.
Early on (page 2), the authors state, "Data mining is a craft. It involves the application of a substantial amount of science and technology, but the proper application still involves art as well." Absolutely true! It's great to read this stuff! This is followed by a concise discussion of CRISP-DM, a well-defined data mining process whose concepts are elementary, essential, and integral to the responsible, proper, and successful practice of data mining.
From this point on, the authors proceed to accomplish their primary goals. They present such topics as predictive modeling, correlation, classification, clustering, regression, logistic regression, linear discriminants, and much more. Their presentations are user-friendly, their real-world examples are interesting, and their guidance and insights are extremely valuable.
My criticisms are limited to their website. The Data Science for Business site leaves me wanting more real-world examples to enjoy, access to more resources and tools of the trade, more references to peruse, and a more rigorous approach to some of the solutions. Perhaps Data Science for Business, the sequel, is on the horizon?
Whether you're a seasoned statistician (or data scientist), a young aspiring novice, or an adventurous business person looking to expand his or her horizons, Data Science for Business by Foster Provost and Tom Fawcett is well worth the price of admission and the reading time you'll invest.
Foster Provost and Tom Fawcett state, "[i]deally, we envision a book that any data scientist would give to his collaborators…" I'll do them one better: I'm giving it to my daughter!
G**N
The profit curve is an excellent centerpiece. The slim book is necessary and important, but nowhere near sufficient.
It's an excellent, even mandatory book for your Data Science shelf. I am glad I bought it. I am 67% of the way through reading this book. It has nowhere near enough material on some areas, though, and is just missing some material that you need for DS. That's actually OK because of course no single book is enough to cover everything you need to know in a field. Look how many books you may have bought just to get an undergrad degree, and I bet it was not just one book.
So here is a list of good and bad about this excellent book.
Its good points:
The profit curve. After reading this book, I will never use accuracy to select a model again, as it is a nearly worthless metric, especially when marginal costs and marginal profits are involved in an application scenario. The book is just amazingly good at describing how to select models based on estimated profit, foremost with the profit curve, and with selected supporting measures such as the area under the ROC curve.
The expected profit computation and the cost-benefit matrix as a partner to the confusion matrix. This is great stuff. It's not even described in other data science courses that I have taken.
Other good points: don't worry about the other good points (there are some). The profit curve analysis, and the lead-up to that, are superior.
Its bad points:
p.224: "We will train on the complete dataset and then test on the same dataset we trained on." What follows for the rest of the chapter is an inappropriate error analysis, because it is overly optimistic (but otherwise the techniques are great). The models have seen the training data. We should never assess (test) models, and base the entire remainder of a chapter's material, on error (accuracy) estimates produced from data that the models have already seen.
In most chapters, there is just not enough detail in the material to enable this book to be used as a "correct reference" against which to write your own working code as you follow along with the text in whatever computer language you want to use for analysis.
In summary:
The book is outstanding. It is necessary for your DS bookshelf, but on the other hand it is nowhere near sufficient.
The data science course sequence by Johns Hopkins University identifies many of the elements of a nice overall outline of what DS practitioners need to be able to do (and even this is not sufficient): Reproducible research; Experimental design; R programming (or Python, or perhaps SAS or Octave, but some mathy language for sure); Exploratory data analysis; Regression models; Statistical inference; Practical machine learning; Scientific writing; Developing data products; Big data techniques (e.g. Apache Spark programming or at least MapReduce-style programming); SQL and NoSQL databases; Concurrent, distributed, and parallel programming; Advanced statistics (such as multiple testing corrections).
This book by Provost and Fawcett gives just a part of the necessary DS material. However, the part it provides is essential. I wish the biological data scientists in academia would adopt and integrate the cost-benefit matrix idea and the profit curve idea into their model selection techniques instead of mostly just using the accuracy metric.
Also, a data scientist could do several follow-on, added-value extensions to the profit curve chapter. You could produce a revenue curve (or a cost curve), since sometimes that matters more. You could quickly find alternatives that are nearly as profitable as the optimum but that exhibit (less revenue, less cost) or (more revenue, more cost). You could detail the model selection and profit consequences of fixed budgets. You could further assess the implications of marginal profit analysis for the optimal quantity when the profitability ratio changes. You could directly assess the data science solution against the best business wisdom solution and estimate how much profit is lost when using the old business wisdom decisions. It's a testament to this book's strong value that you can do a lot more based on its material.
Nice work. Recommended.
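For readers who want to see the cost-benefit idea in concrete terms, here is a minimal sketch of the expected profit computation and a profit curve as described in the review above. It is not taken from the book: the function names, the confusion-matrix counts, the dollar values, and the assumption that untargeted instances contribute zero profit are all hypothetical and chosen only for illustration.

```python
def expected_profit(confusion, cost_benefit):
    """Expected value per instance: sum over cells of p(cell) * value(cell).

    Both arguments are 2x2 nested lists with the same layout
    (rows = predicted class, columns = actual class).
    """
    total = sum(sum(row) for row in confusion)
    return sum(
        (count / total) * value
        for conf_row, value_row in zip(confusion, cost_benefit)
        for count, value in zip(conf_row, value_row)
    )


def profit_curve(scores, labels, benefit_tp, cost_fp):
    """Cumulative profit when targeting the top-k instances ranked by score.

    Assumption: instances that are not targeted contribute zero profit.
    Element k of the returned list is the profit from targeting the top k.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    curve = [0.0]  # profit when nobody is targeted
    for _, label in ranked:
        curve.append(curve[-1] + (benefit_tp if label == 1 else cost_fp))
    return curve


# Hypothetical counts: rows = predicted (yes, no), columns = actual (positive, negative)
confusion = [[56, 7], [5, 42]]
# Hypothetical values: a true positive earns 99, a false positive costs 1
cost_benefit = [[99.0, -1.0], [0.0, 0.0]]
profit_per_instance = expected_profit(confusion, cost_benefit)

# Hypothetical model scores and true labels for five instances
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
curve = profit_curve(scores, labels, benefit_tp=99.0, cost_fp=-1.0)
best_k = max(range(len(curve)), key=curve.__getitem__)  # how many to target
```

Comparing the maximum of `curve` across candidate models, rather than comparing their accuracy, is the kind of model selection the review praises. To avoid the optimism criticized above, the scores and labels would need to come from held-out data rather than from the data the model was trained on.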
K**R
Well-organized text
This is an excellent textbook on data science. The text itself explains concepts and theories well and provides definitions, examples, and formulas that help the reader understand and apply these concepts. The information presented is well-organized, and the visual aids include ample graphs and charts. Section breaks are obvious, with well-designed titles. Chapters are easy enough to read but don't oversimplify important concepts. The inclusion of a glossary, bibliography, and index, as well as a detailed table of contents, makes the book easy to navigate. The only exception our instructor took to the text during my course was the authors' insistence that only the best data scientists should be considered. Setting this bias aside, the information provided was clear, concise, and helpful for anyone working with big data or in data analytics.