Exploring With Intention - Exploratory data analysis for beginners in R

best-practices
EDA
beginners
Author: Soundarya Soundararajan

Published: March 29, 2023

Let’s explore

Photo by Andrew Neel

# Load the packages used below, installing any that are missing
pacman::p_load(palmerpenguins, tidyverse,
               report, skimr, summarytools)

For the dataset

Rough and quick

str(penguins) # structure: column types and first few values
tibble [344 × 8] (S3: tbl_df/tbl/data.frame)
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int [1:344] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
report(penguins)
The data contains 344 observations of the following 8 variables:

  - species: 3 levels, namely Adelie (n = 152, 44.19%), Chinstrap (n = 68,
19.77%) and Gentoo (n = 124, 36.05%)
  - island: 3 levels, namely Biscoe (n = 168, 48.84%), Dream (n = 124, 36.05%)
and Torgersen (n = 52, 15.12%)
  - bill_length_mm: n = 344, Mean = 43.92, SD = 5.46, Median = , MAD = 7.04,
range: [32.10, 59.60], Skewness = 0.05, Kurtosis = -0.88, 0.58% missing
  - bill_depth_mm: n = 344, Mean = 17.15, SD = 1.97, Median = , MAD = 2.22,
range: [13.10, 21.50], Skewness = -0.14, Kurtosis = -0.91, 0.58% missing
  - flipper_length_mm: n = 344, Mean = 200.92, SD = 14.06, Median = , MAD =
16.31, range: [172, 231], Skewness = 0.35, Kurtosis = -0.98, 0.58% missing
  - body_mass_g: n = 344, Mean = 4201.75, SD = 801.95, Median = , MAD = 889.56,
range: [2700, 6300], Skewness = 0.47, Kurtosis = -0.72, 0.58% missing
  - sex: 2 levels, namely female (n = 165, 47.97%), male (n = 168, 48.84%) and
missing (n = 11, 3.20%)
  - year: n = 344, Mean = 2008.03, SD = 0.82, Median = 2008.00, MAD = 1.48,
range: [2007, 2009], Skewness = -0.05, Kurtosis = -1.50, 0% missing
psych::describe(penguins)
                  vars   n    mean     sd  median trimmed    mad    min    max
species*             1 344    1.92   0.89    2.00    1.90   1.48    1.0    3.0
island*              2 344    1.66   0.73    2.00    1.58   1.48    1.0    3.0
bill_length_mm       3 342   43.92   5.46   44.45   43.91   7.04   32.1   59.6
bill_depth_mm        4 342   17.15   1.97   17.30   17.17   2.22   13.1   21.5
flipper_length_mm    5 342  200.92  14.06  197.00  200.34  16.31  172.0  231.0
body_mass_g          6 342 4201.75 801.95 4050.00 4154.01 889.56 2700.0 6300.0
sex*                 7 333    1.50   0.50    2.00    1.51   0.00    1.0    2.0
year                 8 344 2008.03   0.82 2008.00 2008.04   1.48 2007.0 2009.0
                   range  skew kurtosis    se
species*             2.0  0.16    -1.73  0.05
island*              2.0  0.61    -0.91  0.04
bill_length_mm      27.5  0.05    -0.89  0.30
bill_depth_mm        8.4 -0.14    -0.92  0.11
flipper_length_mm   59.0  0.34    -1.00  0.76
body_mass_g       3600.0  0.47    -0.74 43.36
sex*                 1.0 -0.02    -2.01  0.03
year                 2.0 -0.05    -1.51  0.04
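
The missing-value figures in the summaries above can be cross-checked with a few lines of base R. This is only a sketch; the names `n_missing` and `pct_missing` are illustrative and do not come from any of the packages used in this post.

```r
library(palmerpenguins)

# Number of missing values in each column
n_missing <- colSums(is.na(penguins))

# Share of missing values in each column, in percent
pct_missing <- round(100 * n_missing / nrow(penguins), 2)

n_missing
pct_missing
```

The counts should agree with the `n` column of `psych::describe()`, which reports the non-missing observations per variable.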

Neat and quick

skim(penguins)
Data summary
Name                 penguins
Number of rows       344
Number of columns    8
_______________________
Column type frequency:
  factor             3
  numeric            5
________________________
Group variables      None

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
species                0           1.00  FALSE           3  Ade: 152, Gen: 124, Chi: 68
island                 0           1.00  FALSE           3  Bis: 168, Dre: 124, Tor: 52
sex                   11           0.97  FALSE           2  mal: 168, fem: 165

Variable type: numeric

skim_variable      n_missing  complete_rate     mean      sd      p0      p25      p50     p75    p100  hist
bill_length_mm             2           0.99    43.92    5.46    32.1    39.23    44.45    48.5    59.6  ▃▇▇▆▁
bill_depth_mm              2           0.99    17.15    1.97    13.1    15.60    17.30    18.7    21.5  ▅▅▇▇▂
flipper_length_mm          2           0.99   200.92   14.06   172.0   190.00   197.00   213.0   231.0  ▂▇▃▅▂
body_mass_g                2           0.99  4201.75  801.95  2700.0  3550.00  4050.00  4750.0  6300.0  ▃▇▆▃▂
year                       0           1.00  2008.03    0.82  2007.0  2007.00  2008.00  2009.0  2009.0  ▇▁▇▁▇
penguins |>
  group_by(species) |>
  skim()
Data summary
Name                 group_by(penguins, specie…
Number of rows       344
Number of columns    8
_______________________
Column type frequency:
  factor             2
  numeric            5
________________________
Group variables      species

Variable type: factor

skim_variable  species    n_missing  complete_rate  ordered  n_unique  top_counts
island         Adelie             0           1.00  FALSE           3  Dre: 56, Tor: 52, Bis: 44
island         Chinstrap          0           1.00  FALSE           1  Dre: 68, Bis: 0, Tor: 0
island         Gentoo             0           1.00  FALSE           1  Bis: 124, Dre: 0, Tor: 0
sex            Adelie             6           0.96  FALSE           2  fem: 73, mal: 73
sex            Chinstrap          0           1.00  FALSE           2  fem: 34, mal: 34
sex            Gentoo             5           0.96  FALSE           2  mal: 61, fem: 58

Variable type: numeric

skim_variable      species    n_missing  complete_rate     mean      sd      p0      p25      p50      p75    p100  hist
bill_length_mm     Adelie             1           0.99    38.79    2.66    32.1    36.75    38.80    40.75    46.0  ▁▆▇▆▁
bill_length_mm     Chinstrap          0           1.00    48.83    3.34    40.9    46.35    49.55    51.08    58.0  ▂▇▇▅▁
bill_length_mm     Gentoo             1           0.99    47.50    3.08    40.9    45.30    47.30    49.55    59.6  ▃▇▆▁▁
bill_depth_mm      Adelie             1           0.99    18.35    1.22    15.5    17.50    18.40    19.00    21.5  ▂▆▇▃▁
bill_depth_mm      Chinstrap          0           1.00    18.42    1.14    16.4    17.50    18.45    19.40    20.8  ▅▇▇▆▂
bill_depth_mm      Gentoo             1           0.99    14.98    0.98    13.1    14.20    15.00    15.70    17.3  ▅▇▇▆▂
flipper_length_mm  Adelie             1           0.99   189.95    6.54   172.0   186.00   190.00   195.00   210.0  ▁▆▇▅▁
flipper_length_mm  Chinstrap          0           1.00   195.82    7.13   178.0   191.00   196.00   201.00   212.0  ▁▅▇▅▂
flipper_length_mm  Gentoo             1           0.99   217.19    6.48   203.0   212.00   216.00   221.00   231.0  ▂▇▇▆▃
body_mass_g        Adelie             1           0.99  3700.66  458.57  2850.0  3350.00  3700.00  4000.00  4775.0  ▅▇▇▃▂
body_mass_g        Chinstrap          0           1.00  3733.09  384.34  2700.0  3487.50  3700.00  3950.00  4800.0  ▁▅▇▃▁
body_mass_g        Gentoo             1           0.99  5076.02  504.12  3950.0  4700.00  5000.00  5500.00  6300.0  ▃▇▇▇▂
year               Adelie             0           1.00  2008.01    0.82  2007.0  2007.00  2008.00  2009.00  2009.0  ▇▁▇▁▇
year               Chinstrap          0           1.00  2007.97    0.86  2007.0  2007.00  2008.00  2009.00  2009.0  ▇▁▆▁▇
year               Gentoo             0           1.00  2008.08    0.79  2007.0  2007.00  2008.00  2009.00  2009.0  ▆▁▇▁▇
dfSummary(penguins)
Data Frame Summary  
penguins  
Dimensions: 344 x 8  
Duplicates: 0  

--------------------------------------------------------------------------------------------------------------------
No   Variable            Stats / Values             Freqs (% of Valid)    Graph                 Valid      Missing  
---- ------------------- -------------------------- --------------------- --------------------- ---------- ---------
1    species             1. Adelie                  152 (44.2%)           IIIIIIII              344        0        
     [factor]            2. Chinstrap                68 (19.8%)           III                   (100.0%)   (0.0%)   
                         3. Gentoo                  124 (36.0%)           IIIIIII                                   

2    island              1. Biscoe                  168 (48.8%)           IIIIIIIII             344        0        
     [factor]            2. Dream                   124 (36.0%)           IIIIIII               (100.0%)   (0.0%)   
                         3. Torgersen                52 (15.1%)           III                                       

3    bill_length_mm      Mean (sd) : 43.9 (5.5)     164 distinct values       .     . :         342        2        
     [numeric]           min < med < max:                                   . : : : : :         (99.4%)    (0.6%)   
                         32.1 < 44.5 < 59.6                                 : : : : : :                             
                         IQR (CV) : 9.3 (0.1)                               : : : : : : .                           
                                                                          : : : : : : : : .                         

4    bill_depth_mm       Mean (sd) : 17.2 (2)       80 distinct values              :           342        2        
     [numeric]           min < med < max:                                         : :           (99.4%)    (0.6%)   
                         13.1 < 17.3 < 21.5                                 : . : : : .                             
                         IQR (CV) : 3.1 (0.1)                             . : : : : : :                             
                                                                          : : : : : : : . .                         

5    flipper_length_mm   Mean (sd) : 200.9 (14.1)   55 distinct values          :               342        2        
     [integer]           min < med < max:                                     . :               (99.4%)    (0.6%)   
                         172 < 197 < 231                                      : : :   . .                           
                         IQR (CV) : 23 (0.1)                                . : : :   : : :                         
                                                                            : : : : : : : : :                       

6    body_mass_g         Mean (sd) : 4201.8 (802)   94 distinct values        :                 342        2        
     [integer]           min < med < max:                                   . :                 (99.4%)    (0.6%)   
                         2700 < 4050 < 6300                                 : : : :                                 
                         IQR (CV) : 1200 (0.2)                              : : : : : .                             
                                                                          . : : : : : :                             

7    sex                 1. female                  165 (49.5%)           IIIIIIIII             333        11       
     [factor]            2. male                    168 (50.5%)           IIIIIIIIII            (96.8%)    (3.2%)   

8    year                Mean (sd) : 2008 (0.8)     2007 : 110 (32.0%)    IIIIII                344        0        
     [integer]           min < med < max:           2008 : 114 (33.1%)    IIIIII                (100.0%)   (0.0%)   
                         2007 < 2008 < 2009         2009 : 120 (34.9%)    IIIIII                                    
                         IQR (CV) : 2 (0)                                                                           
--------------------------------------------------------------------------------------------------------------------
# summarytools::view(dfSummary(penguins)) # renders the summary as HTML; namespaced because tibble also exports view()

For individual variables

Continuous variables

Boxplots, histograms, and density plots
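
For a continuous variable, these three plot types might be sketched with ggplot2 as below. The choice of `body_mass_g`, the 250 g bin width, and the object names (`penguins_mass`, `p_box`, `p_hist`, `p_dens`) are illustrative assumptions, not part of the original analysis.

```r
library(palmerpenguins)
library(ggplot2)

# Drop the rows with missing body mass so ggplot2 does not warn about them
penguins_mass <- subset(penguins, !is.na(body_mass_g))

# Boxplot: compare the spread of body mass across species
p_box <- ggplot(penguins_mass, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  labs(x = "Species", y = "Body mass (g)")

# Histogram: overall shape of the distribution (bin width is an arbitrary choice)
p_hist <- ggplot(penguins_mass, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250) +
  labs(x = "Body mass (g)", y = "Count")

# Density plot: one smoothed curve per species
p_dens <- ggplot(penguins_mass, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  labs(x = "Body mass (g)", y = "Density")

p_box
p_hist
p_dens
```

Swapping `body_mass_g` for any of the other numeric columns gives the same three views for that variable.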

Shiny app - interactive

https://jgassen.shinyapps.io/expand/ (hosted ExPanD app; I was unable to use it)

library(ExPanDaR)
Warning: package 'ExPanDaR' was built under R version 4.2.3
# ExPanD(penguins) # would launch the interactive app locally; left commented out