Examining the Big 5 Personality Dataset with Factor Analysis

R Statistics

For this entry, I will examine the Big 5 Personality Inventory data set with Exploratory Factor Analysis to identify potential structures of personality traits, and then verify them with Confirmatory Factor Analysis.

(8 min read)

Tarid Wongvorachan (University of Alberta), https://www.ualberta.ca
2022-01-01

Introduction

Show code
library(parameters) #for parameter processing
library(tidymodels) #for data splitting
library(tidyverse) #toolbox for R
library(psych) #for descriptive statistics and the data set
library(ggcorrplot) #for correlation matrix
library(see) #add-on for ggplot2
library(lavaan) #for SEM
library(performance) #for assessing regression model performance
library(semPlot) #to plot path models for CFA
library(dlookr) #for missing data diagnosis
library(mice) #for missing data imputation

Dataset

Show code
# Load the data
data <- psych::bfi[, 1:25]

head(data)
      A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1
61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3
61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4
61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4
61621  4  4  6  5  5  4  4  3  5  5  5  3  4  4  4  2  5  2  4  1  3
61622  2  3  3  4  5  4  4  5  3  2  2  2  5  4  5  2  3  4  4  3  3
61623  6  6  5  6  5  6  6  6  1  3  2  1  6  5  6  3  5  2  2  3  4
      O2 O3 O4 O5
61617  6  3  4  3
61618  2  4  3  3
61620  2  5  5  2
61621  3  4  3  5
61622  3  4  3  3
61623  3  5  6  1
Show code
#diagnose for missing value
dlookr::diagnose(data)
# A tibble: 25 x 6
   variables types   missing_count missing_percent unique_count
   <chr>     <chr>           <int>           <dbl>        <int>
 1 A1        integer            16           0.571            7
 2 A2        integer            27           0.964            7
 3 A3        integer            26           0.929            7
 4 A4        integer            19           0.679            7
 5 A5        integer            16           0.571            7
 6 C1        integer            21           0.75             7
 7 C2        integer            24           0.857            7
 8 C3        integer            20           0.714            7
 9 C4        integer            26           0.929            7
10 C5        integer            16           0.571            7
# ... with 15 more rows, and 1 more variable: unique_rate <dbl>
Show code
visdat::vis_miss(data, sort_miss = FALSE)

Data preprocessing

Show code
#imputation
mice_model <- mice(data, method='pmm', seed = 123)

 iter imp variable
  1   1  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  1   2  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  1   3  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  1   4  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  1   5  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  2   1  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  2   2  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  2   3  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  2   4  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  2   5  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  3   1  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  3   2  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  3   3  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  3   4  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  3   5  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  4   1  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  4   2  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  4   3  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  4   4  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  4   5  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  5   1  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  5   2  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  5   3  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  5   4  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
  5   5  A1  A2  A3  A4  A5  C1  C2  C3  C4  C5  E1  E2  E3  E4  E5  N1  N2  N3  N4  N5  O1  O3  O4  O5
Show code
data_complete <- complete(mice_model)

visdat::vis_miss(data_complete, sort_miss = FALSE)
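mice also provides built-in diagnostics for the imputation itself; a minimal sketch, assuming the mice_model object created in the chunk above:

# Sketch: optional imputation diagnostics for the fitted mice model.
# Trace plots of chain means and variances across iterations; roughly
# overlapping, well-mixed lines suggest the imputations have converged.
plot(mice_model)

# Compare densities of observed (blue) versus imputed (red) values
# for a few items to check that the imputed values look plausible.
mice::densityplot(mice_model, ~ A1 + C1 + N1)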

Show code
#check for factor structure
check_factorstructure(data_complete)
# Is the data suitable for Factor Analysis?

  - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.84).
  - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analysis (Chisq(300) = 20158.27, p < .001).
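check_factorstructure() bundles the two standard adequacy checks; the same diagnostics can be run directly with psych, as in the sketch below (assuming the imputed data_complete):

# Sketch: the same adequacy checks run directly with psych.
# Overall and per-item measures of sampling adequacy (KMO/MSA).
psych::KMO(data_complete)

# Bartlett's test of sphericity takes a correlation matrix and the sample size.
psych::cortest.bartlett(cor(data_complete), n = nrow(data_complete))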
Show code
variables.to.use<-c("A1", "A2", "A3", "A4", "A5",
                    "C1", "C2", "C3", "C4", "C5",
                    "E1", "E2", "E3", "E4", "E5",
                    "N1", "N2", "N3", "N4", "N5",
                    "O1", "O2", "O3", "O4", "O5")

data.corr<-cor(data_complete[variables.to.use],
                 method = "pearson",
                 use='all.obs')
ggcorrplot(data.corr,
           p.mat=cor_pmat(data_complete[variables.to.use]),
           hc.order=TRUE, 
           type='lower',
           colors=c('red3', 'white', 'green3'),
           outline.color = 'darkgoldenrod1', 
           lab=FALSE, #omit the correlation coefficient
           legend.title='Correlation',
           pch=4, 
           pch.cex=4, #size of the cross mark for non-significant indicator
           lab_size=6)+ 
  labs(title="Correlation Matrix")+
  theme(plot.title=element_text(face='bold',size=14,hjust=0.5,colour="darkred"))+
  theme(legend.position=c(0.10,0.80), legend.box.just = "bottom")

Show code
# Establish two sets of indices to split the dataset
N <- nrow(data_complete)
indices <- seq(1, N)
indices_EFA <- sample(indices, floor((.5*N)))
indices_CFA <- indices[!(indices %in% indices_EFA)]

# Use those indices to split the dataset into halves for your EFA and CFA
bfi_EFA <- data_complete[indices_EFA, ]
bfi_CFA <- data_complete[indices_CFA, ]
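Note that sample() is not seeded here, so the two halves will differ between runs; a minimal sketch of a reproducible version (the seed value is arbitrary and not from the original analysis):

# Sketch: seed the split so the EFA/CFA halves can be reproduced.
set.seed(123)  # arbitrary seed chosen for illustration
indices_EFA <- sample(indices, floor(.5 * N))
indices_CFA <- indices[!(indices %in% indices_EFA)]
bfi_EFA <- data_complete[indices_EFA, ]
bfi_CFA <- data_complete[indices_CFA, ]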
Show code
check_factorstructure(bfi_EFA)
# Is the data suitable for Factor Analysis?

  - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.84).
  - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analysis (Chisq(300) = 10244.89, p < .001).
Show code
check_factorstructure(bfi_CFA)
# Is the data suitable for Factor Analysis?

  - KMO: The Kaiser, Meyer, Olkin (KMO) measure of sampling adequacy suggests that data seems appropriate for factor analysis (KMO = 0.84).
  - Sphericity: Bartlett's test of sphericity suggests that there is sufficient significant correlation in the data for factor analysis (Chisq(300) = 10210.30, p < .001).

Exploratory Factor Analysis

Scree plot

Show code
# Calculate the correlation matrix first
bfi_EFA_cor <- cor(bfi_EFA, use = "pairwise.complete.obs") 

# Then use that correlation matrix to calculate eigenvalues
eigenvals <- eigen(bfi_EFA_cor)

# Look at the eigenvalues returned
eigenvals$values
 [1] 5.1697200 2.6134230 2.1562029 1.7907326 1.6067099 1.0566407
 [7] 0.8427603 0.7962665 0.7482662 0.7263301 0.6985177 0.6640562
[13] 0.6544223 0.5997338 0.5573358 0.5424657 0.5117708 0.4961194
[19] 0.4602673 0.4487993 0.4259742 0.4020084 0.3868776 0.3675815
[25] 0.2770178
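Under the Kaiser criterion, factors with eigenvalues above 1 are retained; a quick check on the values above:

# Sketch: Kaiser criterion -- count eigenvalues greater than 1.
sum(eigenvals$values > 1)
# Returns 6 here (the sixth eigenvalue is ~1.06), which matches the
# Kaiser criterion entry in the method-agreement table further down.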
Show code
# Then use the correlation matrix to create the scree plot

scree(bfi_EFA_cor, factors = FALSE)
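Parallel analysis, which also appears in the method-agreement procedure below, compares the observed eigenvalues against eigenvalues obtained from random data; a minimal sketch using psych, assuming the EFA half from above:

# Sketch: parallel analysis on the EFA half as a complement to the scree plot.
psych::fa.parallel(bfi_EFA_cor, n.obs = nrow(bfi_EFA), fa = "fa")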

Fit EFA models

Show code
# Fit an EFA
efa <- psych::fa(data, nfactors = 5) %>% 
  model_parameters(sort = TRUE, threshold = "max")

efa
# Rotated loadings from Factor Analysis (oblimin-rotation)

Variable | MR2  |  MR1  |  MR3  |  MR5  |  MR4  | Complexity | Uniqueness
-------------------------------------------------------------------------
N1       | 0.81 |       |       |       |       |    1.08    |    0.35   
N2       | 0.78 |       |       |       |       |    1.04    |    0.40   
N3       | 0.71 |       |       |       |       |    1.07    |    0.45   
N5       | 0.49 |       |       |       |       |    1.96    |    0.65   
N4       | 0.47 |       |       |       |       |    2.27    |    0.51   
E2       |      | -0.68 |       |       |       |    1.07    |    0.46   
E4       |      | 0.59  |       |       |       |    1.49    |    0.47   
E1       |      | -0.56 |       |       |       |    1.21    |    0.65   
E5       |      | 0.42  |       |       |       |    2.60    |    0.60   
E3       |      | 0.42  |       |       |       |    2.55    |    0.56   
C2       |      |       | 0.67  |       |       |    1.17    |    0.55   
C4       |      |       | -0.61 |       |       |    1.18    |    0.55   
C3       |      |       | 0.57  |       |       |    1.11    |    0.68   
C5       |      |       | -0.55 |       |       |    1.44    |    0.57   
C1       |      |       | 0.55  |       |       |    1.19    |    0.67   
A3       |      |       |       | 0.66  |       |    1.07    |    0.48   
A2       |      |       |       | 0.64  |       |    1.04    |    0.55   
A5       |      |       |       | 0.53  |       |    1.49    |    0.54   
A4       |      |       |       | 0.43  |       |    1.74    |    0.72   
A1       |      |       |       | -0.41 |       |    1.97    |    0.81   
O3       |      |       |       |       | 0.61  |    1.17    |    0.54   
O5       |      |       |       |       | -0.54 |    1.21    |    0.70   
O1       |      |       |       |       | 0.51  |    1.13    |    0.69   
O2       |      |       |       |       | -0.46 |    1.75    |    0.74   
O4       |      |       |       |       | 0.37  |    2.69    |    0.75   

The 5 latent factors (oblimin rotation) accounted for 41.48% of the total variance of the original data (MR2 = 10.28%, MR1 = 8.80%, MR3 = 8.12%, MR5 = 7.94%, MR4 = 6.34%).
Show code
summary(efa)
# (Explained) Variance of Components

Parameter                       |   MR2 |   MR1 |   MR3 |   MR5 |   MR4
-----------------------------------------------------------------------
Eigenvalues                     | 4.493 | 2.249 | 1.505 | 1.188 | 0.934
Variance Explained              | 0.103 | 0.088 | 0.081 | 0.079 | 0.063
Variance Explained (Cumulative) | 0.103 | 0.191 | 0.272 | 0.351 | 0.415
Variance Explained (Proportion) | 0.248 | 0.212 | 0.196 | 0.191 | 0.153
Show code
predict_result <- predict(efa, names = c("Neuroticism", "Conscientiousness", "Extraversion", "Agreeableness", "Openness"))

head(predict_result)
  Neuroticism Conscientiousness Extraversion Agreeableness   Openness
1 -0.21410935        0.06924675  -1.33208860   -0.85364725 -1.5809244
2  0.15008464        0.48139729  -0.59950262   -0.08478873 -0.1876070
3  0.62827949        0.10964162  -0.04800816   -0.55616873  0.2502735
4 -0.09425827        0.03836489  -1.05089539   -0.10394941 -1.1000032
5 -0.16368420        0.44253657  -0.10519669   -0.71857460 -0.6612203
6  0.18984314        1.08439177   1.40730835    0.39278790  0.6222356
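The item groupings recovered by the EFA can also be checked for internal consistency; a minimal sketch for the Neuroticism items (N1-N5, which load on MR2), assuming data_complete:

# Sketch: Cronbach's alpha for the five Neuroticism items.
# check.keys = TRUE lets psych reverse-score negatively keyed items automatically.
psych::alpha(data_complete[, c("N1", "N2", "N3", "N4", "N5")], check.keys = TRUE)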

How many factors should we retain?

The Method Agreement Procedure

Show code
n_factor <- parameters::n_factors(data_complete)

n_factor
# Method Agreement Procedure:

The choice of 6 dimensions is supported by 4 (21.05%) methods out of 19 (Optimal coordinates, Parallel analysis, Kaiser criterion, Scree (SE)).
Show code
as.data.frame(n_factor)
   n_Factors              Method              Family
1          1 Acceleration factor               Scree
2          3                 CNG                 CNG
3          4                beta Multiple_regression
4          4          Scree (R2)            Scree_SE
5          4    VSS complexity 1                 VSS
6          5    VSS complexity 2                 VSS
7          5       Velicer's MAP        Velicers_MAP
8          6 Optimal coordinates               Scree
9          6   Parallel analysis               Scree
10         6    Kaiser criterion               Scree
11         6          Scree (SE)            Scree_SE
12         7                   t Multiple_regression
13         7                   p Multiple_regression
14         8                 BIC                 BIC
15        12      BIC (adjusted)                 BIC
16        22             Bentler             Bentler
17        24            Bartlett             Barlett
18        24            Anderson             Barlett
19        24              Lawley             Barlett
Show code
summary(n_factor)
   n_Factors n_Methods
1          1         1
2          3         1
3          4         3
4          5         2
5          6         4
6          7         2
7          8         1
8         12         1
9         22         1
10        24         3
Show code
plot(n_factor) + theme_bw()

Confirmatory Factor Analysis

Train-test split

Show code
set.seed(999)

# Put 70% of the data into the training set
data_split <- initial_split(bfi_CFA, prop = 0.7)

# Create data frames for the two sets:
training <- training(data_split) #70%
test  <- testing(data_split)  #30%

Create CFA structures out of EFA models

Show code
structure_big5 <- psych::fa(training, nfactors = 5)
fa.diagram(structure_big5)

Show code
plot(density(structure_big5$scores, na.rm = TRUE), 
     main = "Factor Scores")

Show code
structure_big6 <- psych::fa(training, nfactors = 6) 
fa.diagram(structure_big6)

Show code
plot(density(structure_big6$scores, na.rm = TRUE), 
     main = "Factor Scores")

Show code
#Converting EFA into a lavaan-ready syntax
cfa_big5 <- efa_to_cfa(structure_big5)

#Investigate how the model looks
cfa_big5
# Latent variables
MR2 =~ N1 + N2 + N3 + N4 + N5
MR5 =~ A1 + A2 + A3 + A4 + A5
MR1 =~ E1 + E2 + E3 + E4 + E5
MR3 =~ C1 + C2 + C3 + C4 + C5
MR4 =~ O1 + O2 + O3 + O4 + O5
Show code
cfa_big6 <- efa_to_cfa(structure_big6)
cfa_big6
# Latent variables
MR2 =~ N1 + N2 + N3 + N4 + N5
MR5 =~ A1 + A2 + A3 + A4 + A5
MR1 =~ E1 + E2 + E4 + E5
MR3 =~ C1 + C2 + C3 + C4 + C5
MR4 =~ E3 + O1 + O3 + O4
MR6 =~ O2 + O5
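The auto-generated syntax keeps the MR* labels assigned by psych::fa(); for readability, the five-factor model could equally be written by hand with descriptive latent variable names, as in the sketch below (equivalent to cfa_big5 above):

# Sketch: the same five-factor structure with descriptive factor names.
cfa_big5_named <- '
  Neuroticism       =~ N1 + N2 + N3 + N4 + N5
  Agreeableness     =~ A1 + A2 + A3 + A4 + A5
  Extraversion      =~ E1 + E2 + E3 + E4 + E5
  Conscientiousness =~ C1 + C2 + C3 + C4 + C5
  Openness          =~ O1 + O2 + O3 + O4 + O5
'
# This string can be passed to lavaan::cfa() in the same way as cfa_big5 below.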

Fit and compare models

Show code
model_big5 <- lavaan::cfa(cfa_big5, data = test)
model_big6 <- lavaan::cfa(cfa_big6, data = test)

fitmeasures(model_big5, fit.measures = "all", output = "text")

Model Test User Model:

  Test statistic                               948.270
  Degrees of freedom                               265
  P-value                                        0.000

Model Test Baseline Model:

  Test statistic                              3122.085
  Degrees of freedom                               300
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.758
  Tucker-Lewis Index (TLI)                       0.726
  Bentler-Bonett Non-normed Fit Index (NNFI)     0.726
  Bentler-Bonett Normed Fit Index (NFI)          0.696
  Parsimony Normed Fit Index (PNFI)              0.615
  Bollen's Relative Fit Index (RFI)              0.656
  Bollen's Incremental Fit Index (IFI)           0.761
  Relative Noncentrality Index (RNI)             0.758

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -17371.604
  Loglikelihood unrestricted model (H1)     -16897.469
                                                      
  Akaike (AIC)                               34863.208
  Bayesian (BIC)                             35105.766
  Sample-size adjusted Bayesian (BIC)        34915.367

Root Mean Square Error of Approximation:

  RMSEA                                          0.078
  90 Percent confidence interval - lower         0.073
  90 Percent confidence interval - upper         0.084
  P-value RMSEA <= 0.05                          0.000

Standardized Root Mean Square Residual:

  RMR                                            0.174
  RMR (No Mean)                                  0.174
  SRMR                                           0.083

Other Fit Indices:

  Hoelter Critical N (CN) alpha = 0.05         135.952
  Hoelter Critical N (CN) alpha = 0.01         143.725
                                                      
  Goodness of Fit Index (GFI)                    0.833
  Adjusted Goodness of Fit Index (AGFI)          0.795
  Parsimony Goodness of Fit Index (PGFI)         0.679
                                                      
  McDonald Fit Index (MFI)                       0.444
                                                      
  Expected Cross-Validation Index (ECVI)         2.537
Show code
fitmeasures(model_big6, fit.measures = "all", output = "text")

Model Test User Model:

  Test statistic                               931.725
  Degrees of freedom                               260
  P-value                                        0.000

Model Test Baseline Model:

  Test statistic                              3122.085
  Degrees of freedom                               300
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.762
  Tucker-Lewis Index (TLI)                       0.725
  Bentler-Bonett Non-normed Fit Index (NNFI)     0.725
  Bentler-Bonett Normed Fit Index (NFI)          0.702
  Parsimony Normed Fit Index (PNFI)              0.608
  Bollen's Relative Fit Index (RFI)              0.656
  Bollen's Incremental Fit Index (IFI)           0.765
  Relative Noncentrality Index (RNI)             0.762

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)             -17363.331
  Loglikelihood unrestricted model (H1)     -16897.469
                                                      
  Akaike (AIC)                               34856.662
  Bayesian (BIC)                             35119.433
  Sample-size adjusted Bayesian (BIC)        34913.168

Root Mean Square Error of Approximation:

  RMSEA                                          0.078
  90 Percent confidence interval - lower         0.073
  90 Percent confidence interval - upper         0.084
  P-value RMSEA <= 0.05                          0.000

Standardized Root Mean Square Residual:

  RMR                                            0.168
  RMR (No Mean)                                  0.168
  SRMR                                           0.081

Other Fit Indices:

  Hoelter Critical N (CN) alpha = 0.05         135.927
  Hoelter Critical N (CN) alpha = 0.01         143.771
                                                      
  Goodness of Fit Index (GFI)                    0.837
  Adjusted Goodness of Fit Index (AGFI)          0.797
  Parsimony Goodness of Fit Index (PGFI)         0.670
                                                      
  McDonald Fit Index (MFI)                       0.450
                                                      
  Expected Cross-Validation Index (ECVI)         2.522
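semPlot was loaded earlier for path diagrams but is not called in the chunks shown; a minimal sketch of how the fitted five-factor CFA could be visualized (the arguments are illustrative choices):

# Sketch: path diagram of the five-factor CFA with standardized estimates.
semPlot::semPaths(model_big5,
                  what = "std",        # show standardized parameter estimates
                  layout = "tree",
                  edge.label.cex = 0.7,
                  sizeMan = 4, sizeLat = 6)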
Show code
model_comparison <- performance::compare_performance(model_big5, model_big6)

rmarkdown::paged_table(model_comparison)
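compare_performance() returns the indices as a table; the same comparison can also be pulled straight from lavaan for a hand-picked set of measures, as in the sketch below:

# Sketch: side-by-side extraction of selected fit indices from both models.
fit_indices <- c("chisq", "df", "cfi", "tli", "rmsea", "srmr", "aic", "bic")
round(rbind(big5 = fitmeasures(model_big5, fit_indices),
            big6 = fitmeasures(model_big6, fit_indices)), 3)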

Conclusion

Both the five- and six-factor structures recovered from the EFA reproduce the expected groupings of Big 5 items, but neither reaches conventional thresholds for good fit on the hold-out CFA data (CFI of roughly 0.76, RMSEA of 0.078, and SRMR of about 0.08 for both models). The six-factor solution, which splits the Openness items into two clusters, fits marginally better on CFI and AIC, whereas the five-factor solution is more parsimonious and has the lower BIC.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2022, Jan. 1). Tarid Wongvorachan: Examining the Big 5 personality Dataset with factor analysis. Retrieved from https://taridwong.github.io/posts/2022-01-01-efacfa/

BibTeX citation

@misc{wongvorachan2022examining,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Examining the Big 5 personality Dataset with factor analysis},
  url = {https://taridwong.github.io/posts/2022-01-01-efacfa/},
  year = {2022}
}