[General] Exploratory Data Analysis (EDA)

I have long been fascinated by Exploratory Data Analysis (EDA), a very creative new statistical methodology that differs substantially from what most people know as statistics. 

Most tools in the typical statistician's kit are intended to help analysts confirm the results of statistical experiments or to validate a hypothesis via statistical manipulation of pre-existing data.  We can classify these approaches as "confirmatory statistical analysis."  The "standard" confirmatory statistical techniques are only suitable if the problem under study meets the very specific requirements and assumptions upon which parametric statistical theory is based.  Frequently, people -- including many professional statisticians who should know better -- blindly misuse the usual tools (e.g., mean and standard deviation) on data sets that do not come close to meeting the required conditions (such as being approximately normally distributed).  Only rarely can standard parametric statistical methods be used effectively to perform initial explorations on unknown batches of numbers.
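To make the point about misused summaries concrete, here is a minimal sketch using only Python's standard library.  The batch of numbers is made up purely for illustration; it simply shows how a single wild value drags the mean and standard deviation around while the median and interquartile range barely move.

```python
import statistics

# Hypothetical batch: nine small values plus one wild one (invented for illustration).
batch = [2, 3, 3, 4, 4, 5, 5, 6, 7, 250]

# Classical summaries, which assume a reasonably well-behaved distribution.
mean = statistics.mean(batch)
sd = statistics.stdev(batch)

# Resistant summaries, which are barely affected by the single wild value.
median = statistics.median(batch)
q1, _, q3 = statistics.quantiles(batch, n=4)  # quartile cut points (default method)
iqr = q3 - q1

print(f"mean   = {mean:.1f}, sd  = {sd:.1f}")
print(f"median = {median:.1f}, IQR = {iqr:.2f}")
```

Here the mean lands far above nine of the ten values, which is exactly the kind of misleading summary an initial exploration should catch.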

John W. Tukey, in his great classic text, Exploratory Data Analysis, gave us some cool tools for exploring data.  Sometimes, you end up with a bunch of data and have absolutely no idea what might be "in there."  Tukey's methods included some very interesting graphical techniques, such as "stem and leaf diagrams" and "box plots," that stand as excellent early modern data visualization examples.  I must hasten to add that many of the EDA techniques are not only effective but fun to do.  I strongly recommend EDA to absolutely anyone who must even occasionally attempt to find that elusive "something" in a batch of numbers.
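As a rough sketch of two of these ideas, the Python below builds a simple stem-and-leaf display and the five-number summary (extremes, hinges, and median) that underlies a box plot.  The function names and the sample batch are my own inventions, the display assumes non-negative integers with the tens digit as the stem, and the hinge rule follows the depth-counting approach as I understand it from Tukey.  His book develops these by hand with many refinements that are not attempted here.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a simple stem-and-leaf display (non-negative integers)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # tens digit(s) are the stem, units digit the leaf
        rows[stem].append(leaf)
    lines = []
    for stem in range(min(rows), max(rows) + 1):  # keep empty stems so gaps stay visible
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>2} | {leaves}".rstrip())
    return lines


def value_at_depth(sorted_values, depth):
    """Value at a 1-based depth; a depth ending in .5 averages the two neighbors."""
    lower = sorted_values[int(depth) - 1]
    upper = sorted_values[int(depth + 0.5) - 1]
    return (lower + upper) / 2


def five_number_summary(values):
    """Return (low extreme, lower hinge, median, upper hinge, high extreme)."""
    data = sorted(values)
    n = len(data)
    median_depth = (n + 1) / 2
    hinge_depth = (int(median_depth) + 1) / 2  # count in from each end
    return (
        data[0],
        value_at_depth(data, hinge_depth),
        value_at_depth(data, median_depth),
        value_at_depth(data, n + 1 - hinge_depth),
        data[-1],
    )


# Hypothetical batch, invented for illustration.
batch = [12, 15, 21, 23, 23, 27, 34, 36, 38, 45, 58]

print("\n".join(stem_and_leaf(batch)))
print(five_number_summary(batch))
```

The display keeps every data value visible while still showing the shape of the batch, and the five numbers are all you need to draw a basic box plot by hand.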

I consider it one of the canonical examples of the unfairness of the universe that Tukey's text appears to be out of print and is now somewhat difficult to find.  You can easily locate any number of derivative works but, IMNSHO, the true classics in any field should *never* be allowed to go out of print -- and Tukey's "orange book" certainly qualifies as one of those.  Find it in a library somewhere, take a look at it, and I think you will agree.  Even the format and layout of this book are creative, special, and clear.  But the techniques themselves are things of beauty, developed by that extremely rare type of statistician, one who actually tried to do real things with real numbers.

John W. Tukey died on July 26, 2000.  He certainly deserves to be ranked as one of the most influential statisticians of the late 20th century.  Oh, and by the way, you might be interested to know that John W. Tukey is credited with the first published use of the term "software," in a 1958 article.

The immediate motive for this post is that I just discovered two nice introductory sites about EDA that I had not previously seen:  Exploratory Data Analysis and Data Visualization, by the unusual Dr. Chong Ho (Alex) Yu, and the Exploratory Data Analysis section of the free online Engineering Statistics Handbook, provided by the Information Technology Laboratory (ITL) of NIST.  Both are excellent introductions and give the beginner a great starting point.

Enjoy!
