探索性数据分析(eda)的原则和过程

Aug 22 2016

what’s eda?特征如下:

a)an emphasis on the substantive understanding of data that address the broad question of what is going on here?
b)an emphasis on graphic representations of data;
c) a focus on tentative model building and hypothesis generation in an iterative process of model specification, residual analysis, and model respecification;
d) use of robust measures, reexpression, and subset analysis; and
e) positions of skepticism, flexibility, and ecumenism regarding which methods to apply.
The goal of EDA is to discover patterns in data. It cannot be overemphasized that an appropriate technique for EDA is determined not by computation but rather by a procedure’s purpose and use.

简单来说,EDA的目标是要发现数据中的模式。要对数据形成更深入的理解,明白数据所代表的事情,用图形化的方式表达出来, 聚焦于探索试验性模型的构建,以及迭代过程中生成假设,结果分析。

基本的步骤主要有以下几个方面:

1 Understand the Context理解上下文

This view holds that, in quantitative data analysis, numbers map onto aspects of reality. Numbers themselves are meaning- less unless the data analyst understands the mapping process and the nexus of theory and categorization in which objects under study are conceptualized. 定量数据分析,数字是对真相各个方面的映射。数字本身是无意义的,除非数据分析家理解对象的映射化的过程以及理论上的联结和当前研究对象的概念化后的类别。

2 Use Graphic Representations of Data

Graphical analysis is central to EDA. “the greatest value of a picture is when it forces us to notice what we never expected to see”

a. “stem-and-leaf plot” 数据量小的时候比较好用,数据量大的话不好看。 The stem-and-leaf plot shown in Figure 2 repre- sents a type of frequency table organized graphically to resemble a histogram while retaining information about the exact value of each observation。 When a large number of data points are examined, the stem- and-leaf plot may become cumbersome。

b. dot plot 查看单个分布或者对比分布

c.box-plot When seeking additional structure in univariate distributions or when a number of distributions need to be compared, a box plot is often used. A dot plot can be an effective tool to examine a single distribution or compare a number of distributions.

The box plot offers a five-number summary in schematic form. The ends of a box mark the first and third quartiles, and the median is indicated with a line positioned within the boxJ The ranges of most or all of the data in the tails of the distribution are marked using lines extend- ing away from the box, creating “whiskers” or “tails.”

d.核密度曲线Kernel density smoothers are graphic devices that provide estimates of a population shape,

A major component of the detective work of EDA is the rough assessment of hunches。

3 Develop Models in an Iterative Process of Tentative Model Specification and Residual Assessment

data = fit + residual data = smooth + rough. To create quantitative descriptions of data, the ex- ploratory data analyst conducts an iterative process of suggesting a tentative model, examining residuals from the model to assess model adequacy, and modi- fying the model in view of the residual analysis.

4 Building a Two-Way Fit

5 Data Analysis: A Picture is Worth a Thousand Word

Putting It All Together: A Reexamination of the Paap and Johansen Data A first look. boxplots,histograms, density plots, and dot plots scatter plot matrix

6 A Better Description.

对数据更好的描述方式,数据的再表现,其实就是数据的变换,目标是将数据分布转换为近似高斯分布的情况。

A straightforward way to find an appropriate description for the curved function is to find a reexpression of the univariate distributions that leads them to a roughly Gaussian shape.

A choice of transformation is recommended by moving up or down the ladder in the direction of the bulk of the data on the scale. Positively skewed dis- tributions with the bulk of the data lower on the scale can be normalized by moving down the ladder of reexpression; distributions with the bulk of the data high on the scale can be normalized by moving up the ladder of reexpression.

Ref:

[]