Data Exploration with ggstatsplot

R Statistics

In this post, I use the ggstatsplot package to run and visualize five statistical tests for data exploration: Pearson’s correlation test, the chi-square goodness-of-fit test, the chi-square test of independence, the one-sample t-test, and the paired-samples t-test.

(8 min read)

Tarid Wongvorachan (University of Alberta) https://www.ualberta.ca
2022-12-31

It’s been a while, eh?

Data Preprocessing and Assumption Check

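# Packages assumed from the functions called in this post
library(tidyverse)     # dplyr, tidyr, ggplot2
library(DataExplorer)  # plot_intro(), plot_bar(), plot_qq()
library(psych)         # describe()
library(rstatix)       # identify_outliers()
library(ggstatsplot)

df <- read.csv("student-mat.csv", header = TRUE)

# Convert the categorical variables to factors
df <- df %>%
  mutate(across(c(sex, address, schoolsup, famsup), as.factor))

# Keep only the variables used in this post
df_clean <- df %>% select(sex, address, schoolsup, famsup, G1, G3)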
plot_intro(df_clean)

plot_bar(df_clean)

describe(df_clean)
           vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
sex*          1 395  1.47 0.50      1    1.47 0.00   1   2     1  0.11    -1.99 0.03
address*      2 395  1.78 0.42      2    1.85 0.00   1   2     1 -1.33    -0.24 0.02
schoolsup*    3 395  1.13 0.34      1    1.04 0.00   1   2     1  2.20     2.86 0.02
famsup*       4 395  1.61 0.49      2    1.64 0.00   1   2     1 -0.46    -1.79 0.02
G1            5 395 10.91 3.32     11   10.80 4.45   3  19    16  0.24    -0.71 0.17
G3            6 395 10.42 4.58     11   10.84 4.45   0  20    20 -0.73     0.37 0.23
set.seed(123)
plot_qq(df_clean)

df_clean %>% 
  group_by(sex) %>%
  identify_outliers(G1)
[1] sex        address    schoolsup  famsup     G1         G3        
[7] is.outlier is.extreme
<0 rows> (or 0-length row.names)
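
The empty result above means no G1 value was flagged within either sex group. As a rough illustration of the rule involved, here is a minimal base R sketch of the boxplot (1.5 × IQR) criterion, which is the rule I understand identify_outliers() to apply. The scores vector is hypothetical, not taken from student-mat.csv, and the value 40 is planted to show a flagged case.

```r
# Hypothetical scores (not the student data); 40 is planted as an outlier
scores <- c(8, 10, 11, 11, 12, 13, 14, 15, 40)

# First and third quartiles and the interquartile range
q <- quantile(scores, probs = c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]

# Fences: anything beyond 1.5 * IQR from the quartiles counts as an outlier
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr

is_outlier <- scores < lower | scores > upper
scores[is_outlier]
```

Widening the multiplier from 1.5 to 3 gives the stricter fences usually used for "extreme" points.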

Scatter plot - Pearson’s correlation test

set.seed(123)

ggstatsplot::ggscatterstats(
  data = df_clean, 
  x = G1, 
  y = G3, 
  type = "parametric",            # type of test that needs to be run
  conf.level = 0.95,
  
  title = "Correlation Test",
  xlab = "First period grade",       # label for x axis
  ylab = "Final grade",              # label for y axis 
  line.color = "blue", 
  messages = FALSE,
  
  label.var = famsup,
  label.expression = G3 >= 19)+
  
  ggplot2::theme(plot.title = element_text(hjust = 0.5))
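
To cross-check the statistics printed in the plot subtitle, the same parametric test can be run with base R's cor.test(). This is only a sketch on simulated stand-ins for G1 and G3 (the vectors g1 and g3 are hypothetical), so the numbers will not match the student data.

```r
set.seed(123)

# Simulated stand-ins for first-period and final grades (hypothetical data)
g1 <- round(rnorm(100, mean = 11, sd = 3))
g3 <- round(0.9 * g1 + rnorm(100, mean = 0, sd = 2))

# Pearson's correlation test with a 95% confidence interval
res <- cor.test(g1, g3, method = "pearson", conf.level = 0.95)

res$estimate   # Pearson's r
res$parameter  # degrees of freedom (n - 2)
res$p.value    # p-value for H0: r = 0
res$conf.int   # 95% confidence interval for r
```

With the real data, replacing g1 and g3 with df_clean$G1 and df_clean$G3 should give the r, degrees of freedom, and p-value to compare against the plot.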

Pie Chart - Chi-Square test

set.seed(123)

ggstatsplot::ggpiestats(
  data = df_clean,
  x = famsup,
  type = "parametric",
  ratio = c(0.50,0.50),
  messages = FALSE,
  paired = FALSE, # TRUE only for within-subjects (repeated measures) designs
  conf.level = 0.95,
  title = "Chi-Square Goodness of Fit Test"
)+
  ggplot2::theme(plot.title = element_text(hjust = 0.5))

set.seed(123)

ggstatsplot::ggpiestats(
  data = df_clean,
  x = famsup,
  y = schoolsup,
  type = "parametric",
  ratio = c(0.50,0.50),
  messages = FALSE,
  paired = FALSE, # TRUE only for within-subjects (repeated measures) designs
  conf.level = 0.95,
  title = "Chi-Square Test of Independence"
)+
  ggplot2::theme(plot.title = element_text(hjust = 0.5))
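
Both chi-square tests can also be sketched with base R's chisq.test(). The counts below are hypothetical, not taken from student-mat.csv. Turning off Yates' continuity correction (correct = FALSE) is my assumption about what matches the plot, so treat it as such.

```r
# Hypothetical counts of family support (not the student data)
famsup_counts <- c(no = 40, yes = 60)

# Goodness of fit: are the two levels equally frequent (ratio 0.50 : 0.50)?
gof <- chisq.test(famsup_counts, p = c(0.50, 0.50))

# Hypothetical famsup x schoolsup table; matrix() fills column by column
tab <- matrix(
  c(35, 50, 5, 10),
  nrow = 2,
  dimnames = list(famsup = c("no", "yes"), schoolsup = c("no", "yes"))
)

# Test of independence without the continuity correction (assumption)
ind <- chisq.test(tab, correct = FALSE)

gof$statistic  # (40 - 50)^2 / 50 + (60 - 50)^2 / 50
ind$p.value
```

With the real data, table(df_clean$famsup) and table(df_clean$famsup, df_clean$schoolsup) would take the place of the hypothetical counts.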

Histogram - One Sample t-test

set.seed(123)

ggstatsplot::gghistostats(
  data = df_clean,
  x = G3,
  title = "Distribution of Final Grade",
  centrality.type = "parametric",    # centrality measure to plot (the mean); the one-sample t-test is the default test type
  test.value = 10,                   # value the mean is tested against
  effsize.type = "d",                # Cohen's d as the effect size
  xlab = "Students' final score",
  ylab = "Number of students",
  centrality.para = "mean",          # which measure of central tendency is to be plotted
  normal.curve = TRUE,
  binwidth = 1,                      # bin width; adjust until the histogram reads well
  messages = FALSE                   # turn off the messages
)+
  
  ggplot2::theme(plot.title = element_text(hjust = 0.5))
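
The one-sample t-test behind this plot can be sketched in base R with t.test(). The g3 vector is simulated (hypothetical, not the student data), and the effect size is computed by hand as Cohen's d, the distance between the sample mean and the test value in standard deviation units.

```r
set.seed(123)

# Simulated final grades (hypothetical data)
g3 <- rnorm(100, mean = 10.5, sd = 4.5)

# H0: the population mean equals the test value of 10
res <- t.test(g3, mu = 10)

# Cohen's d for a one-sample design
cohens_d <- (mean(g3) - 10) / sd(g3)

res$statistic  # t, equal to cohens_d * sqrt(n)
res$parameter  # degrees of freedom (n - 1)
res$p.value
cohens_d
```

Because t equals d times the square root of n, a large sample can produce a small p-value even when d is small, which is why the plot reports both.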

Violin plot - Dependent Sample t-test

# Reshape to long format: one row per student per measurement point
df_long <- df_clean %>%
  pivot_longer(
    cols = c(G1, G3),
    names_to = "Measure_point",
    values_to = "Scores"
  )
# for reproducibility
set.seed(123)

ggstatsplot::ggwithinstats(
  data = df_long, 
  x = Measure_point, 
  y = Scores,
  type = "parametric", # type of statistical test
  xlab = "Period of testing",
  ylab = "Students' score",
  pairwise.comparisons = TRUE, 
  sphericity.correction = FALSE, ## don't display sphericity corrected dfs and p-values
  
  outlier.tagging = TRUE, ## whether outliers should be flagged
  outlier.label = address, ## label to attach to outlier values
  outlier.label.color = "green", ## outlier point label color
  
  mean.plotting = TRUE, ## whether the mean is to be displayed
  mean.color = "darkblue", ## color for mean
  
  title = "Comparison of students' score")+
  
  ggplot2::theme(plot.title = element_text(hjust = 0.5))
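
A paired-samples t-test is a one-sample t-test on the within-student differences, which the base R sketch below illustrates. The g1 and g3 vectors are simulated (hypothetical, not the student data). Each position in both vectors is assumed to belong to the same student.

```r
set.seed(123)

# Simulated paired grades: element i of g1 and g3 is the same student (hypothetical data)
g1 <- rnorm(50, mean = 11, sd = 3)
g3 <- g1 + rnorm(50, mean = -0.5, sd = 2)

# Paired-samples t-test on the two measurement points
paired <- t.test(g3, g1, paired = TRUE)

# Equivalent one-sample t-test on the differences
one_sample <- t.test(g3 - g1, mu = 0)

all.equal(unname(paired$statistic), unname(one_sample$statistic))
paired$parameter  # degrees of freedom (number of pairs - 1)
```

This is also why row order matters in the long-format data: the pairing only holds if each student's G1 and G3 rows stay matched.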

Conclusion

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2022, Dec. 31). Tarid Wongvorachan: Data Exploration with ggstatsplot. Retrieved from https://taridwong.github.io/posts/2022-12-31-ggstat/

BibTeX citation

@misc{wongvorachan2022data,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Data Exploration with ggstatsplot},
  url = {https://taridwong.github.io/posts/2022-12-31-ggstat/},
  year = {2022}
}