Missing Data Analysis

R Data Visualization Statistics

For this post, I will examine missing data in a large-scale dataset and discuss about numerous ways we can clean them as a part of data preparation.

(10 min read)

Tarid Wongvorachan (University of Alberta)https://www.ualberta.ca
2021-12-27

Looking for what that is not there

Photo by Adam Lingelbach

Import and read the data set

Show code
library(foreign) #To read SPSS data
library(tidyverse) #datawork toolbox
library(dlookr) #for missing data diagnosis
library(visdat) #for overall missingness visualization
library(naniar) #for missingness visualization
library(VIM) #for donor-based imputation
library(simputation) #for model-based imputation
library(mice) #for multiple imputation
Show code
#Import the data set
PISA_CAN <-read.spss("PISA2018CAN.sav",to.data.frame = TRUE, use.value.labels = FALSE)

#Subset and rename the variables
PISA_Subsetted <-  PISA_CAN %>% 
  select(REPEAT, FEMALE = ST004D01T, ESCS, DAYSKIP = ST062Q01TA,
         CLASSSKIP = ST062Q02TA, LATE = ST062Q03TA,
         BEINGBULLIED, DISCLIMA, ADAPTIVITY)

#Recode variables into factor
PISA_Subsetted$DAYSKIP <-as.factor(PISA_Subsetted$DAYSKIP)
PISA_Subsetted$CLASSSKIP <-as.factor(PISA_Subsetted$CLASSSKIP)
PISA_Subsetted$LATE <-as.factor(PISA_Subsetted$LATE)
PISA_Subsetted$FEMALE <-as.factor(PISA_Subsetted$FEMALE)
PISA_Subsetted$REPEAT <-as.factor(PISA_Subsetted$REPEAT)

# Renaming factor levels with dplyr
PISA_Subsetted$FEMALE <- recode_factor(PISA_Subsetted$FEMALE, 
                                       "1" = "1", "2" = "0")

glimpse(PISA_Subsetted)
Rows: 22,653
Columns: 9
$ REPEAT       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ FEMALE       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ ESCS         <dbl> -0.7302, 0.3078, 0.5059, 1.1147, 1.3626, -0.857~
$ DAYSKIP      <fct> 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 2, 1, 2,~
$ CLASSSKIP    <fct> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, 2, 2,~
$ LATE         <fct> 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3,~
$ BEINGBULLIED <dbl> 1.2618, 1.7669, 0.1462, -0.7823, 0.2907, 0.7703~
$ DISCLIMA     <dbl> -0.4186, -1.4179, 0.6019, -0.4995, -0.1045, 1.0~
$ ADAPTIVITY   <dbl> -0.4708, 0.6350, -0.5786, -0.5786, -0.0763, 0.5~

Check for missing data

Show code
dlookr::diagnose(PISA_Subsetted)
# A tibble: 9 x 6
  variables    types   missing_count missing_percent unique_count
  <chr>        <chr>           <int>           <dbl>        <int>
1 REPEAT       factor           1926         8.50               3
2 FEMALE       factor              2         0.00883            3
3 ESCS         numeric          1163         5.13           15366
4 DAYSKIP      factor           3341        14.7                5
5 CLASSSKIP    factor           3330        14.7                5
6 LATE         factor           3316        14.6                5
7 BEINGBULLIED numeric          3833        16.9               63
8 DISCLIMA     numeric          1300         5.74             934
9 ADAPTIVITY   numeric          2197         9.70              65
# ... with 1 more variable: unique_rate <dbl>
Show code
visdat::vis_miss(PISA_Subsetted, sort_miss = TRUE)

Show code
PISA_Subsetted %>% group_by (FEMALE) %>%
  miss_var_summary()
# A tibble: 24 x 4
# Groups:   FEMALE [3]
   FEMALE variable     n_miss pct_miss
   <fct>  <chr>         <int>    <dbl>
 1 1      BEINGBULLIED   1651    14.6 
 2 1      DAYSKIP        1451    12.8 
 3 1      CLASSSKIP      1445    12.8 
 4 1      LATE           1430    12.6 
 5 1      ADAPTIVITY      907     8.02
 6 1      REPEAT          812     7.18
 7 1      DISCLIMA        563     4.98
 8 1      ESCS            527     4.66
 9 0      BEINGBULLIED   2180    19.2 
10 0      DAYSKIP        1888    16.6 
# ... with 14 more rows

Types of Missing Data

Show code
oceanbuoys %>% arrange(year) %>% vis_miss()

Visualize Missing data

Missing pattern w/Upset plot

Show code
gg_miss_upset(PISA_Subsetted, nsets = 9, nintersects = 15)

General visual summaries of missing data

Missingness in variables with gg_miss_var

Show code
PISA_Subsetted %>% miss_var_table()
# A tibble: 9 x 3
  n_miss_in_var n_vars pct_vars
          <int>  <int>    <dbl>
1             2      1     11.1
2          1163      1     11.1
3          1300      1     11.1
4          1926      1     11.1
5          2197      1     11.1
6          3316      1     11.1
7          3330      1     11.1
8          3341      1     11.1
9          3833      1     11.1
Show code
gg_miss_var(PISA_Subsetted, show_pct = TRUE) 

Missingness in cases with gg_miss_case

Show code
PISA_Subsetted %>% miss_case_table()
# A tibble: 10 x 3
   n_miss_in_case n_cases pct_cases
            <int>   <int>     <dbl>
 1              0   18327  80.9    
 2              1     870   3.84   
 3              2     140   0.618  
 4              3      81   0.358  
 5              4    1254   5.54   
 6              5      83   0.366  
 7              6     756   3.34   
 8              7      90   0.397  
 9              8    1050   4.64   
10              9       2   0.00883
Show code
gg_miss_case(PISA_Subsetted)

Missingness across factors with gg_miss_fct

Show code
gg_miss_fct(x = PISA_Subsetted, fct = REPEAT) + 
  labs(title = "Missing data by the History of Class Repetition")
Show code
gg_miss_fct(x = PISA_Subsetted, fct = LATE) + 
  labs(title = "Missing data by Lateness History")

Missingness along a repeating span with gg_miss_span

Show code
gg_miss_span(PISA_Subsetted, REPEAT, span_every = 2000) +
  theme_dark()

Cumulative missing

Show code
PISA_Subsetted %>% gg_miss_case_cumsum(breaks = 2000) + theme_bw()

Show code
PISA_Subsetted %>% gg_miss_var_cumsum() + theme_bw()

What should we do with the missing data

Complete Case Analysis

Show code
PISA_Listwise <- PISA_Subsetted[complete.cases(PISA_Subsetted), ]
glimpse(PISA_Listwise)
Rows: 18,327
Columns: 9
$ REPEAT       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
$ FEMALE       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
$ ESCS         <dbl> -0.7302, 0.3078, 0.5059, 1.1147, 1.3626, -0.857~
$ DAYSKIP      <fct> 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 3, 1, 1, 2, 1, 2,~
$ CLASSSKIP    <fct> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, 2, 2,~
$ LATE         <fct> 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3,~
$ BEINGBULLIED <dbl> 1.2618, 1.7669, 0.1462, -0.7823, 0.2907, 0.7703~
$ DISCLIMA     <dbl> -0.4186, -1.4179, 0.6019, -0.4995, -0.1045, 1.0~
$ ADAPTIVITY   <dbl> -0.4708, 0.6350, -0.5786, -0.5786, -0.0763, 0.5~
Show code
pairwise_var <- c("BEINGBULLIED", "DISCLIMA")
cov(PISA_Subsetted[pairwise_var], use="pairwise.complete.obs")
             BEINGBULLIED   DISCLIMA
BEINGBULLIED    1.1315670 -0.1982891
DISCLIMA       -0.1982891  1.1373620

Missing Data Imputation

Donor-based imputation

Mean Imputation
Show code
PISA_meanimp <- PISA_Subsetted %>%
  mutate(DISCLIMA_imp = ifelse(is.na(DISCLIMA), TRUE, FALSE)) %>%
  mutate(ADAPTIVITY_imp = ifelse(is.na(ADAPTIVITY), TRUE, FALSE))

PISA_meanimp[c("DISCLIMA_imp","ADAPTIVITY_imp")] %>% head()
  DISCLIMA_imp ADAPTIVITY_imp
1        FALSE          FALSE
2        FALSE          FALSE
3        FALSE          FALSE
4        FALSE          FALSE
5        FALSE          FALSE
6        FALSE          FALSE
Show code
PISA_meanimp <- PISA_meanimp %>%
mutate(DISCLIMA = ifelse(is.na(DISCLIMA), mean(DISCLIMA, na.rm = TRUE), DISCLIMA)) %>%
mutate(ADAPTIVITY = ifelse(is.na(ADAPTIVITY), mean(ADAPTIVITY, na.rm = TRUE), ADAPTIVITY))

PISA_meanimp %>%select(DISCLIMA, ADAPTIVITY, DISCLIMA_imp, ADAPTIVITY_imp) %>%
head()
  DISCLIMA ADAPTIVITY DISCLIMA_imp ADAPTIVITY_imp
1  -0.4186    -0.4708        FALSE          FALSE
2  -1.4179     0.6350        FALSE          FALSE
3   0.6019    -0.5786        FALSE          FALSE
4  -0.4995    -0.5786        FALSE          FALSE
5  -0.1045    -0.0763        FALSE          FALSE
6   1.0832     0.5464        FALSE          FALSE
Show code
PISA_meanimp %>% select(DISCLIMA, ADAPTIVITY, DISCLIMA_imp, ADAPTIVITY_imp) %>% marginplot(delimiter="imp")

K-Nearest Neighbor(kNN) Imputation
Show code
PISA_kNNimp <- VIM::kNN(PISA_Subsetted, k = 6, variable = c("DISCLIMA", "ADAPTIVITY"))

PISA_kNNimp[c("DISCLIMA", "ADAPTIVITY","DISCLIMA_imp","ADAPTIVITY_imp")] %>% head()
  DISCLIMA ADAPTIVITY DISCLIMA_imp ADAPTIVITY_imp
1  -0.4186    -0.4708        FALSE          FALSE
2  -1.4179     0.6350        FALSE          FALSE
3   0.6019    -0.5786        FALSE          FALSE
4  -0.4995    -0.5786        FALSE          FALSE
5  -0.1045    -0.0763        FALSE          FALSE
6   1.0832     0.5464        FALSE          FALSE

Model-based imputation

Linear Regression Imputation
Show code
PISA_lmreg <- impute_lm(PISA_Subsetted, DISCLIMA + ADAPTIVITY ~.)

PISA_lmreg %>% 
  is.na() %>%
  colSums()
      REPEAT       FEMALE         ESCS      DAYSKIP    CLASSSKIP 
        1926            2         1163         3341         3330 
        LATE BEINGBULLIED     DISCLIMA   ADAPTIVITY 
        3316         3833         1278         2056 
Logistic Regression Imputation
Show code
missing_REPEAT <- is.na(PISA_Subsetted$REPEAT)
head(missing_REPEAT, 20)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Show code
PISA_logregimp <- PISA_Subsetted

logreg_model <- glm(REPEAT ~ DAYSKIP + BEINGBULLIED + ESCS,
                data = PISA_Subsetted, family = binomial)


preds <- predict(logreg_model, type = "response")

preds <- ifelse(preds >= 0.5, 1, 0)

PISA_logregimp[missing_REPEAT, "REPEAT"] <- preds[missing_REPEAT]


table(preds[missing_REPEAT])

   0    1 
1669    1 
Show code
table(PISA_Subsetted$REPEAT)

    0     1 
19575  1152 
Show code
PISA_logregimp %>% 
  is.na() %>%
  colSums()
      REPEAT       FEMALE         ESCS      DAYSKIP    CLASSSKIP 
         256            2         1163         3341         3330 
        LATE BEINGBULLIED     DISCLIMA   ADAPTIVITY 
        3316         3833         1300         2197 

Multiple Imputation by Chained Equation

Show code
mice_model <- mice(PISA_Subsetted, method='pmm', seed = 123)

 iter imp variable
  1   1  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  1   2  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  1   3  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  1   4  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  1   5  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  2   1  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  2   2  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  2   3  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  2   4  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  2   5  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  3   1  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  3   2  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  3   3  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  3   4  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  3   5  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  4   1  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  4   2  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  4   3  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  4   4  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  4   5  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  5   1  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  5   2  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  5   3  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  5   4  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
  5   5  REPEAT  FEMALE  ESCS  DAYSKIP  CLASSSKIP  LATE  BEINGBULLIED  DISCLIMA  ADAPTIVITY
Show code
PISA_miceimp <- complete(mice_model)

psych::describe(PISA_Subsetted)
             vars     n  mean   sd median trimmed  mad   min  max
REPEAT*         1 20727  1.06 0.23   1.00    1.00 0.00  1.00 2.00
FEMALE*         2 22651  1.50 0.50   2.00    1.50 0.00  1.00 2.00
ESCS            3 21490  0.38 0.83   0.48    0.41 0.87 -6.69 4.04
DAYSKIP*        4 19312  1.31 0.66   1.00    1.16 0.00  1.00 4.00
CLASSSKIP*      5 19323  1.44 0.76   1.00    1.27 0.00  1.00 4.00
LATE*           6 19337  1.82 0.98   2.00    1.65 1.48  1.00 4.00
BEINGBULLIED    7 18820  0.17 1.06   0.15    0.03 1.38 -0.78 3.86
DISCLIMA        8 21353 -0.12 1.07  -0.04   -0.11 1.00 -2.71 2.03
ADAPTIVITY      9 20456  0.18 1.08   0.19    0.20 0.98 -2.27 2.01
             range  skew kurtosis   se
REPEAT*       1.00  3.88    13.05 0.00
FEMALE*       1.00  0.00    -2.00 0.00
ESCS         10.72 -0.47     0.85 0.01
DAYSKIP*      3.00  2.41     5.80 0.00
CLASSSKIP*    3.00  1.84     2.87 0.01
LATE*         3.00  1.01    -0.06 0.01
BEINGBULLIED  4.64  0.88     0.31 0.01
DISCLIMA      4.75 -0.13     0.16 0.01
ADAPTIVITY    4.27 -0.18    -0.19 0.01
Show code
psych::describe(PISA_miceimp)
             vars     n  mean   sd median trimmed  mad   min  max
REPEAT*         1 22653  1.06 0.23   1.00    1.00 0.00  1.00 2.00
FEMALE*         2 22653  1.50 0.50   2.00    1.50 0.00  1.00 2.00
ESCS            3 22653  0.38 0.83   0.48    0.41 0.86 -6.69 4.04
DAYSKIP*        4 22653  1.32 0.67   1.00    1.17 0.00  1.00 4.00
CLASSSKIP*      5 22653  1.45 0.76   1.00    1.27 0.00  1.00 4.00
LATE*           6 22653  1.83 0.98   2.00    1.66 1.48  1.00 4.00
BEINGBULLIED    7 22653  0.18 1.07   0.15    0.04 1.38 -0.78 3.86
DISCLIMA        8 22653 -0.12 1.07  -0.04   -0.11 1.00 -2.71 2.03
ADAPTIVITY      9 22653  0.17 1.09   0.19    0.20 0.98 -2.27 2.01
             range  skew kurtosis   se
REPEAT*       1.00  3.84    12.74 0.00
FEMALE*       1.00  0.00    -2.00 0.00
ESCS         10.72 -0.47     0.84 0.01
DAYSKIP*      3.00  2.38     5.62 0.00
CLASSSKIP*    3.00  1.83     2.81 0.01
LATE*         3.00  1.00    -0.09 0.01
BEINGBULLIED  4.64  0.88     0.35 0.01
DISCLIMA      4.75 -0.13     0.15 0.01
ADAPTIVITY    4.27 -0.17    -0.20 0.01

Concluding Remark

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2021, Dec. 27). Tarid Wongvorachan: Missing Data Analysis. Retrieved from https://taridwong.github.io/posts/2021-12-27-missingdata/

BibTeX citation

@misc{wongvorachan2021missing,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Missing Data Analysis},
  url = {https://taridwong.github.io/posts/2021-12-27-missingdata/},
  year = {2021}
}