Exploring Exploratory Data Analysis
The point of Exploratory Data Analysis (EDA) is to take a step back and look at the dataset before doing anything with it. EDA is as important as any other part of a data project because real datasets are messy and a lot can go wrong. If you don’t know your data well enough, how are you going to know where to look for those sources of error or confusion?
Take some time to understand the dataset you are going to be working with. Here are a few things I routinely ask myself/check for whenever I work with a new dataset:
Was the data imported correctly?
I used to struggle a lot with finding the right way to import my data. The read.csv() function ships with base R (it lives in the utils package, which is loaded by default), so it was the very first function I used to import a dataset.
# the two assignment statements below render the same output
> df <- read.csv("mydataset.csv")
> df <- read.csv("mydataset.csv", header = TRUE)
The header option for read.csv() defaults to TRUE, which means the function uses the first row of the file as the column names.
Suppose that mydataset.csv contains only observations, with no column names. If header = FALSE is not specified, the first observation is consumed as the column names of the imported data frame and is lost as data. The best way to check for this is to look at the first few rows of the imported data. Using colnames() can be helpful as well.
# returns the first few rows of df
> head(df)
# returns a vector of characters that correspond to the column names
> colnames(df)
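To make the header pitfall concrete, here is a minimal sketch. The three rows written to the temporary file are made-up values for illustration, not part of any real dataset.

```r
# Write a small headerless CSV to a temporary file (hypothetical values).
path <- tempfile(fileext = ".csv")
writeLines(c("5.1,3.5,setosa", "4.9,3.0,setosa", "6.3,3.3,virginica"), path)

# Default header = TRUE: the first observation is consumed as column names.
with_header <- read.csv(path)

# header = FALSE: every row is kept and R supplies default names V1, V2, V3.
no_header <- read.csv(path, header = FALSE)

nrow(with_header)     # one row fewer than the file contains
nrow(no_header)       # all rows of the file
colnames(no_header)
```

Comparing the row counts of the two imports is a quick way to notice that an observation was silently turned into column names.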
One way to avoid this is to use fread() from the data.table package. Its header option defaults to "auto", so the function checks whether the first row looks like column names and sets header to TRUE or FALSE accordingly. I would still suggest looking at the first few rows, but fread() is a faster and more convenient alternative to read.csv().
Note: read.csv() returns a data frame, whereas fread() returns a data.table (which is also a data frame).
Are there missing and unusual values?
After successfully importing the data file, I always check for any missing values in my dataset. I find that the best way to inspect this is by summing all the TRUE values from is.na()
by column like this:
> colSums(is.na(df))
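As a small illustration, the sketch below runs the same check on a made-up data frame (the column names and values are hypothetical). It also uses colMeans() to get the share of missing values per column, which I find easier to read on large datasets.

```r
# A tiny stand-in data frame with some missing values (hypothetical data).
df <- data.frame(
  age   = c(23, NA, 35, 41),
  city  = c("Oslo", "Lima", NA, NA),
  score = c(1.5, 2.0, 3.2, 4.8)
)

# Number of missing values per column.
na_counts <- colSums(is.na(df))

# Proportion of missing values per column.
na_share <- colMeans(is.na(df))

na_counts
na_share
```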
There are many ways to treat missing values. The handling of missing values is a separate topic in itself, but it is the awareness of these values that I am trying to emphasize here. Once I know there are missing values in my dataset, I can then take appropriate steps to deal with these values.
It is also worth asking yourself “Why are these values missing?” Understanding why certain values in your dataset are missing can help you understand your data better. Remember that the whole point of this exercise is to help you get the full measure of your dataset so you know what you are dealing with.
Some datasets can also have very unusual values. It is easier to talk about this with an example so let’s look at the pima
dataset from the faraway
package. A description of the dataset can be found here.
Looking at some of the basic summary statistics of the dataset per column is a good place to start in making sure that the values in your dataset make sense. Take a look at the summary for the pima
dataset below.
Look specifically at the minimum values for glucose, diastolic, triceps, insulin, and bmi. Does it make sense for these variables to take on a value of 0? No. It is not possible for a living person to have no blood pressure or no body mass index. The most plausible explanation for these 0 values is that no data was collected for those entries (i.e. they are missing values). Set them to NA so you can handle them together with any other missing values in the dataset. Here is an example of how you can replace the 0 values for the variable bmi:
# pima loaded as a dataframe
> pima$bmi[pima$bmi==0] <- NA
Notice the change in the summary of bmi
. These values make more sense this way.
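The same replacement can be applied to several columns at once. The sketch below uses a small made-up data frame that only borrows a few column names from pima, so the values are hypothetical. Which columns treat 0 as "missing" is an assumption you have to make per dataset: here a 0 in pregnant is treated as a legitimate value and left alone.

```r
# Stand-in data frame with pima-like column names (hypothetical values).
toy <- data.frame(
  glucose  = c(148, 0, 183),
  bmi      = c(33.6, 26.6, 0),
  pregnant = c(6, 0, 8)
)

# Columns where a 0 cannot be a real measurement (an assumption for this sketch).
zero_as_missing <- c("glucose", "bmi")

# Replace 0 with NA in those columns only; which() skips any existing NAs.
toy[zero_as_missing] <- lapply(toy[zero_as_missing], function(x) {
  x[which(x == 0)] <- NA
  x
})

toy
colSums(is.na(toy))
```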
Now take a look at test. This variable is supposed to be a binary indicator for signs of diabetes. Why is there a mean value? It indicates that R is treating the 1’s and 0’s as a quantitative variable instead of a categorical one. Run class() to see how these values are treated in R:
> class(pima$test)
[1] "integer"
To get R to handle this column as a categorical variable, you can use factor():
> pima$test <- factor(pima$test)
> class(pima$test)
[1] "factor"
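Here is a minimal sketch of the difference this makes, using a short made-up vector of 0/1 codes. The labels "negative" and "positive" are my own illustrative choice and are not taken from the pima documentation.

```r
# A made-up binary indicator stored as integers.
test <- c(1L, 0L, 1L, 1L, 0L)

class(test)   # "integer": R will happily compute a mean for it
mean(test)

# Convert to a factor with explicit, readable labels (hypothetical labels).
test_f <- factor(test, levels = c(0, 1), labels = c("negative", "positive"))

class(test_f)
table(test_f)  # counts per category instead of a mean
```

After the conversion, summary() reports counts per level rather than quantiles and a mean.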
Detecting these cases is important: such unusual values can heavily bias prediction models and any conclusions drawn from them.
Visualizations
Visualizing your data in various ways can help you see things you may have missed out on in your early stages of exploration. Here are some of my go-to visualizations:
Histograms
Continuing the earlier example with the pima dataset, below is a set of histograms I plotted using the ggplot2 and tidyr packages. This method of visualization helps me look at the frequency/count of values for each variable in the dataset:
> pima %>%
+   gather() %>%
+   ggplot(aes(value)) +
+   facet_wrap(~ key, scales = "free") +
+   geom_histogram(stat="count")
Just by visualizing your data, you can see that triceps and insulin have a very large number of missing values. This shows that, in this case, missing data must be handled carefully. Understanding exactly why so many values are missing may itself be important.
Note: You would see a similar visualization if you did not correct for the unusual 0 values previously.
For details on the code for visualization using ggplot2
and tidyr
, take a look at this website.
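If you do not have ggplot2 and tidyr installed, a rough equivalent can be sketched with base R graphics. This is a different technique from the faceted ggplot above, and it uses the built-in iris dataset as a stand-in so that it runs without extra packages.

```r
# Pick out the numeric columns of the built-in iris dataset.
num_cols <- names(iris)[sapply(iris, is.numeric)]

# Draw one histogram per numeric column in a 2 x 2 grid.
old_par <- par(mfrow = c(2, 2))
for (col in num_cols) {
  hist(iris[[col]], main = col, xlab = col)
}
par(old_par)  # restore the previous plotting layout
```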
Scatterplot Matrix
A scatterplot matrix can help you see if some kind of a relationship exists between any two variables in your dataset. Take a look at this symmetric matrix of pairwise scatterplots using the iris
dataset.
> pairs(iris)
You can see, just by looking at the plots above, that there is a roughly linear relationship between Petal.Length and Petal.Width. You have not even gotten to the modeling stage and you already have an insight about the variables you want to keep your eyes on!
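One variation I find useful is restricting the matrix to the numeric columns and coloring the points by a grouping variable. The sketch below does this for iris, and also computes a single correlation as a quick numeric check on what the plot suggests.

```r
# Integer codes 1, 2, 3 for the three species, used as plotting colors.
species_cols <- as.integer(iris$Species)

# Scatterplot matrix of the four numeric columns, colored by species.
pairs(iris[, 1:4], col = species_cols, pch = 19)

# A numeric check of the relationship that stands out visually.
petal_cor <- cor(iris$Petal.Length, iris$Petal.Width)
petal_cor
```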
Correlation-Matrix Plot
A correlation matrix plot provides a visual aid for understanding how two numeric variables change in relation to each other. In other words, it is quite literally a visual representation of a correlation matrix.
The function cor() computes the correlation matrix, and the function corrplot() from the corrplot package draws a heat map based on it. Take a look at an example below using the same iris dataset.
# load the package that provides corrplot()
> library(corrplot)
# use only numeric type variables
> correlations <- cor(iris[,1:4])
# correlation plot
> corrplot(correlations)
You can see from the scale that large positive correlations are shown in darker blue and large negative correlations in darker red. From this graph, it looks like Petal.Length and Petal.Width are strongly correlated, and there also appears to be a high correlation between Petal.Length and Sepal.Length. Note that the diagonal shows a perfect positive correlation because it represents the correlation of each variable with itself.
Although these conclusions are similar to those drawn earlier from the scatterplots, the correlation values give a more concrete, quantitative reason to believe that two attributes are related.
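When there are many variables, reading values off a heat map gets tedious. As a sketch using base R only, the correlation matrix can be flattened into a table of variable pairs sorted by absolute correlation. The column name Freq is simply what as.data.frame() assigns when converting a table.

```r
# Correlation matrix of the four numeric iris columns.
correlations <- cor(iris[, 1:4])

# Keep each pair once by blanking the diagonal and the lower triangle.
correlations[lower.tri(correlations, diag = TRUE)] <- NA

# Flatten to one row per pair, drop the blanked entries, sort by strength.
pairs_df <- as.data.frame(as.table(correlations))
pairs_df <- pairs_df[!is.na(pairs_df$Freq), ]
pairs_df <- pairs_df[order(-abs(pairs_df$Freq)), ]

head(pairs_df, 3)
```

The strongest pair in this table is Petal.Length and Petal.Width, which matches what the plots suggest.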
Going through this process of understanding your data is vital for more than just the reasons I have mentioned. It might also eventually help you make informed decisions when it comes to selecting your model. The methods and processes I have outlined in this article are some of the ones I use most frequently whenever I get a new dataset. There are so many more visualizations that you can explore and experiment with using your dataset. Don’t hold yourself back. You’re just trying to look at your data and understand it here.
I hope this article helps provide some insight on what exploratory data analysis is and why it is important to go through this process.