This blog post explores the use of psychometric methods—such as item analysis, independent t-tests, and item response theory (IRT)—to evaluate the quality of synthetic survey data.
Hi everyone. It’s been a while since my last blog post. I have some time while finishing up my semester to try out new research methods and document what I learned. As a researcher and data scientist, I often have the urge to try my hand at new datasets and, when possible, share my findings with the public.
However, handling sensitive datasets, particularly in psychological research, can pose challenges related to confidentiality and ethics. Psychological studies frequently involve surveys that ask participants about their personal experiences, thoughts, and feelings. These responses can include sensitive topics like trauma, substance abuse, or mental health issues. Researchers must handle this data with care to maintain participant confidentiality and trust.
Synthetic data offers a practical alternative, allowing researchers to analyze or teach using synthetic datasets while preserving the statistical properties of the original data. This blog demonstrates synthesizing a dataset of examinees’ responses to a 12-item psychological scale using R’s synthpop package and evaluating the psychometric properties of the synthetic data.
The dataset (data_org) contains real examinee responses to a 12-item scale measuring a psychological construct. Each item is answered on a seven-point Likert scale ranging from 1 (“mostly false”) to 7 (“definitely true”), indicating the examinee’s agreement with the statement presented.
Before synthesizing the data, rows with missing responses are removed with na.omit(), as sketched below.
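A minimal sketch of this step (the file name is hypothetical; any import that yields the 12-item response data frame works):

# Hypothetical file of raw responses; replace with the actual data source
data_org <- read.csv("responses.csv")

# Listwise-delete any cases with missing responses
data_org <- na.omit(data_org)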
The syn function from the synthpop package synthesizes data using the CART (Classification and Regression Trees) method. The documentation of the package can be viewed here.
This step generates a synthetic dataset (data_synth) designed to mirror the statistical properties of data_org.
library(synthpop)

# Synthesize the data with CART; setting a seed makes the synthesis reproducible
synth.obj <- syn(data_org, method = "cart", seed = 12345)
data_synth <- synth.obj$syn
The compare function visually compares the distributions of selected variables between the original and synthetic datasets; a minimal call is sketched below.
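Following synthpop’s documented interface, compare dispatches on the synthesis object, with the original data passed as the second argument:

# Plot side-by-side distributions of each variable in the synthetic vs. original data
compare(synth.obj, data_org)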
The utility.gen function quantifies the similarity between the datasets using utility measures:

util_gen <- utility.gen(synth.obj, data_org)
util_gen
Utility score calculated by method: cart
Call:
utility.gen.synds(object = synth.obj, data = data_org)
Null utilities simulated from a permutation test with 50 replications.
Selected utility measures
pMSE S_pMSE
0.063602 1.691113
The standardized pMSE (S_pMSE) of 1.69 is close to its null expectation of 1, suggesting that the synthetic data is difficult to distinguish from the original at the overall level.

To assess whether the synthetic data aligns with the original dataset at the item level, I conducted independent-samples t-tests for each of the 12 items in the scale. This statistical test evaluates whether there are significant differences in the mean responses between the original (data_org) and synthetic (data_synth) datasets.
First, I initialized a data frame, t_test_results, to store the outcomes of the t-tests. This data frame included columns for the item number (Item), the calculated t-value (T_Statistic), the corresponding p-value (P_Value), and a logical indicator (Significant) that flagged whether the difference was statistically significant at the 5% level (p < 0.05). This setup provided a structured way to systematically record and interpret the results.
Next, I used a loop to iterate through all 12 items in the dataset. For each item, the responses from the original and synthetic datasets were extracted. An independent-samples t-test was performed to compare their means, and the results were stored in the t_test_results data frame. Specifically, the t-statistic and p-value were recorded, and the Significant column was updated to indicate whether the observed differences were statistically significant. This approach ensured a consistent and transparent validation process across all items.
Finally, the completed t_test_results data frame offered a comprehensive summary of the t-test results for all items. Items with significant differences (p < 0.05) were flagged, highlighting variables where the synthetic data deviated notably from the original. Conversely, non-significant results confirmed that the synthetic data closely replicated the original for those items. These findings provided valuable insights into the synthetic dataset’s accuracy, identifying potential areas for refinement and reinforcing its suitability for statistical and psychometric applications.
# Initialize an empty data frame to store the t-test results
t_test_results <- data.frame(
  Item = 1:12,
  T_Statistic = numeric(12),
  P_Value = numeric(12),
  Significant = logical(12)
)

# Perform an independent-samples t-test for each item
for (i in 1:12) {
  # Extract responses for the current item from both datasets
  org_responses <- data_org[, i]
  synth_responses <- data_synth[, i]

  # Perform the independent t-test
  t_test <- t.test(org_responses, synth_responses)

  # Store the results in the data frame
  t_test_results$T_Statistic[i] <- t_test$statistic
  t_test_results$P_Value[i] <- t_test$p.value

  # Flag p-values below 0.05 (significant at the 5% level)
  t_test_results$Significant[i] <- t_test$p.value < 0.05
}

# View the t-test results
print(t_test_results)
Item T_Statistic P_Value Significant
1 1 1.24725294 0.2125793 FALSE
2 2 0.37685436 0.7063567 FALSE
3 3 0.03819008 0.9695433 FALSE
4 4 1.04233137 0.2974951 FALSE
5 5 0.75686676 0.4492967 FALSE
6 6 0.24844901 0.8038351 FALSE
7 7 -0.63765775 0.5238332 FALSE
8 8 0.08628668 0.9312547 FALSE
9 9 0.56532140 0.5719741 FALSE
10 10 0.74877731 0.4541563 FALSE
11 11 0.46332527 0.6432256 FALSE
12 12 0.65988433 0.5094704 FALSE
To visualize the results, I created a bar chart with the ggplot2 package. This visualization illustrates the t-statistic for each item, making it easier to identify which items showed significant differences between the original and synthetic datasets. The plot serves as a clear and intuitive representation of the t-test results, providing stakeholders with a quick overview of the datasets’ alignment.

library(ggplot2)

ggplot(t_test_results, aes(x = factor(Item), y = T_Statistic, fill = Significant)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "Independent Sample T-Test: Item-wise Comparison at p < .05",
       x = "Item", y = "T-Statistic") +
  scale_fill_manual(values = c("TRUE" = "turquoise", "FALSE" = "darkmagenta")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = ifelse(Significant, "*", "")),
            vjust = -0.5, color = "black", size = 5) +
  annotate("text", x = 2, y = max(t_test_results$T_Statistic) + 2,
           label = "Negative t-statistic: Mean of original < Synthesized\nPositive t-statistic: Mean of original > Synthesized",
           size = 4, hjust = 0, color = "black", fontface = "italic")
The t-statistics range from -0.64 to 1.25 across items, indicating only small differences between the means of the original and synthetic datasets for each item.
None of the comparisons are significant, as indicated by the legend (all labeled as “FALSE”). This means that the differences between the original and synthetic datasets are not statistically significant at the p < .05 level.
To evaluate and compare the psychometric properties of the original and synthetic datasets, I conducted item analyses using the itemAnalysis function from the CTT package. This step aimed to assess whether the synthetic dataset preserved key measurement characteristics, such as item means and discrimination indices, ensuring it aligns with the original dataset’s intended purpose as a measurement instrument.
For the original dataset (data_org), the itemAnalysis function was applied to generate a detailed report of psychometric properties. This analysis included calculations of item means, item-total correlations (discrimination indices), and flagged items based on pre-specified thresholds for performance issues. The same analysis was then conducted on the synthetic dataset (data_synth) using identical parameters.
By comparing the results from both datasets, I could evaluate whether the synthetic dataset accurately replicated the original dataset’s psychometric properties. Specifically, alignment in item means would suggest that the synthetic data preserves the central tendency of responses, while similarity in discrimination indices would indicate that the synthetic data retains the original dataset’s ability to differentiate between examinees effectively. These analyses are essential for validating the synthetic dataset’s utility for psychometric research and educational applications.
library(CTT)

# Perform item analysis on the original data
org_item_analysis <- itemAnalysis(data_org, itemReport = TRUE, NA.Delete = TRUE,
                                  pBisFlag = TRUE, bisFlag = TRUE, flagStyle = c("X", ""))

# Perform item analysis on the synthesized data
synth_item_analysis <- itemAnalysis(data_synth, itemReport = TRUE, NA.Delete = TRUE,
                                    pBisFlag = TRUE, bisFlag = TRUE, flagStyle = c("X", ""))
To visually compare the psychometric properties of the original and synthetic datasets, I created side-by-side bar charts for two key metrics: item means and item discrimination indices. These metrics are critical for understanding the central tendency of item responses and the items’ ability to differentiate between respondents based on their overall performance.
First, I extracted the item means and discrimination indices from the itemAnalysis results for both datasets. This involved organizing the data into a data frame that paired each item with its corresponding mean and discrimination index values for both the original and synthetic datasets. This structure enabled a direct comparison of the two datasets’ properties.

To prepare the data for visualization, I reshaped the data frame into a “long” format using the pivot_longer function from the tidyr package (part of the tidyverse). For item means, I created a data frame where each item was associated with its mean score from both datasets, labeled as either “Original_Mean” or “Synthesized_Mean.” Similarly, for item discrimination indices, I created another data frame with the labels “Original_Discrimination” and “Synthesized_Discrimination.” This long-format structure is ideal for creating grouped bar charts in ggplot2.
This step laid the groundwork for creating side-by-side bar charts that clearly depict how closely the synthetic data aligns with the original data. By focusing on item means and discrimination indices, these visualizations provide critical insights into whether the synthetic dataset faithfully preserves the original data’s psychometric characteristics, ensuring it remains a valid representation for educational or research purposes.
# Plot side-by-side bar charts for item means and discrimination
library(dplyr)  # for %>% and select()

# Extract item means and discrimination for original and synthetic data
item_comparison <- data.frame(
  Item = seq_along(org_item_analysis$itemReport$itemMean),
  Original_Mean = org_item_analysis$itemReport$itemMean,
  Synthesized_Mean = synth_item_analysis$itemReport$itemMean,
  Original_Discrimination = org_item_analysis$itemReport$pBis,
  Synthesized_Discrimination = synth_item_analysis$itemReport$pBis
)

# Reshape the data for ggplot
item_means_long <- item_comparison %>%
  select(Item, Original_Mean, Synthesized_Mean) %>%
  tidyr::pivot_longer(cols = c(Original_Mean, Synthesized_Mean),
                      names_to = "Dataset", values_to = "Mean")

item_discrimination_long <- item_comparison %>%
  select(Item, Original_Discrimination, Synthesized_Discrimination) %>%
  tidyr::pivot_longer(cols = c(Original_Discrimination, Synthesized_Discrimination),
                      names_to = "Dataset", values_to = "Discrimination")
# Plot item means
ggplot(item_means_long, aes(x = as.factor(Item), y = Mean, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Item Means Comparison", x = "Item", y = "Mean Score") +
  theme_minimal() +
  scale_fill_manual(values = c("Original_Mean" = "darkmagenta", "Synthesized_Mean" = "turquoise")) +
  theme(plot.title = element_text(hjust = 0.5))
For most items, the mean scores of the original and synthesized datasets are quite similar. This indicates that the synthetic data closely replicates the central tendencies of the original data.
To assess the ability of items to differentiate between respondents based on their trait levels, I created a grouped bar chart comparing the item discrimination indices (point-biserial correlations, pBis) of the original and synthetic datasets. A solid red horizontal line is added at a discrimination index of 0.2 using geom_hline(yintercept = 0.2, color = "red", linetype = "solid"). This line marks a commonly accepted threshold for minimal acceptable discrimination, offering a benchmark against which the indices can be evaluated.
# Plot item discrimination
ggplot(item_discrimination_long, aes(x = as.factor(Item), y = Discrimination, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Item Discrimination Comparison \n The red line represents minimal acceptable value",
       x = "Item", y = "Discrimination Index") +
  theme_minimal() +
  scale_fill_manual(values = c("Original_Discrimination" = "darkmagenta", "Synthesized_Discrimination" = "turquoise")) +
  geom_hline(yintercept = 0.2, color = "red", linetype = "solid") +
  theme(plot.title = element_text(hjust = 0.5))
To compare the reliability of the original and synthetic datasets, I constructed a bar chart that visualizes the coefficient alpha values for each dataset. Coefficient alpha is a measure of internal consistency, indicating the extent to which items in a scale measure the same construct. This comparison helps assess whether the synthetic data faithfully reproduces the reliability of the original data.
The input data for this visualization is encapsulated in the alpha_comparison data frame, which contains two rows: one for the original dataset and one for the synthesized dataset. Each row includes the respective alpha value, calculated during the item analysis process.
# Reliability comparison data frame
alpha_comparison <- data.frame(
  Dataset = c("Original", "Synthesized"),
  Alpha = c(org_item_analysis$alpha, synth_item_analysis$alpha)
)

# Bar plot with values displayed above the bars
ggplot(alpha_comparison, aes(x = Dataset, y = Alpha, fill = Dataset)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = round(Alpha, 3)), vjust = -0.3, color = "black", size = 5) + # Display values above bars
  labs(title = "Reliability (Coefficient Alpha) Comparison",
       y = "Alpha Reliability Coefficient") +
  theme_minimal() +
  scale_fill_manual(values = c("Original" = "darkmagenta", "Synthesized" = "turquoise")) +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
To evaluate the psychometric properties of the original and synthesized datasets under the framework of Item Response Theory (IRT), I implemented Graded Response Models (GRM) using the mirt package in R. GRMs are particularly well suited for analyzing polytomous item responses, as they estimate parameters that describe the relationship between latent traits (e.g., ability) and the probability of item responses across multiple categories.

The GRM was fitted separately to the original and synthesized datasets using the mirt function. The syntax specifies model = 1, indicating a unidimensional model, and itemtype = "graded", which defines the GRM as the chosen IRT model. The outputs are stored in model_org and model_synth for the original and synthesized data, respectively.
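The fitting code is shown below; it mirrors the Call lines printed in the model output that follows:

library(mirt)

# Fit a unidimensional graded response model (GRM) to each dataset
model_org <- mirt(data = data_org, model = 1, itemtype = "graded")
model_synth <- mirt(data = data_synth, model = 1, itemtype = "graded")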
Printing model_org provides an overview of the original dataset’s IRT model, including the number of dimensions and items, while summary(model_org) displays the standardized factor loadings (F1) and communalities (h2) of each item.

model_org
Call:
mirt(data = data_org, model = 1, itemtype = "graded")
Full-information item factor analysis with 1 factor(s).
Converged within 1e-04 tolerance after 31 EM iterations.
mirt version: 1.43
M-step optimizer: BFGS
EM acceleration: Ramsay
Number of rectangular quadrature: 61
Latent density type: Gaussian
Log-likelihood = -9210.903
Estimated parameters: 84
AIC = 18589.81
BIC = 18949.52; SABIC = 18682.87
G2 (1e+10) = 11699.78, p = 1
RMSEA = 0, CFI = NaN, TLI = NaN
summary(model_org)
F1 h2
Q1 0.726 0.5269
Q2 0.673 0.4536
Q3 0.178 0.0318
Q4 0.756 0.5723
Q5 0.250 0.0627
Q6 0.785 0.6155
Q7 0.133 0.0177
Q8 0.741 0.5497
Q9 0.743 0.5513
Q10 0.706 0.4990
Q11 0.224 0.0500
Q12 0.654 0.4283
SS loadings: 4.359
Proportion Var: 0.363
Factor correlations:
F1
F1 1
Printing model_synth and its summary provides the corresponding information for the synthesized dataset.

model_synth
Call:
mirt(data = data_synth, model = 1, itemtype = "graded")
Full-information item factor analysis with 1 factor(s).
Converged within 1e-04 tolerance after 22 EM iterations.
mirt version: 1.43
M-step optimizer: BFGS
EM acceleration: Ramsay
Number of rectangular quadrature: 61
Latent density type: Gaussian
Log-likelihood = -9220.621
Estimated parameters: 83
AIC = 18607.24
BIC = 18962.67; SABIC = 18699.2
G2 (1e+10) = 11721.99, p = 1
RMSEA = 0, CFI = NaN, TLI = NaN
summary(model_synth)
F1 h2
Q1 0.751 0.5643
Q2 0.705 0.4973
Q3 0.171 0.0294
Q4 0.811 0.6577
Q5 0.342 0.1169
Q6 0.789 0.6224
Q7 0.153 0.0233
Q8 0.732 0.5361
Q9 0.759 0.5758
Q10 0.716 0.5126
Q11 0.141 0.0199
Q12 0.632 0.3999
SS loadings: 4.556
Proportion Var: 0.38
Factor correlations:
F1
F1 1
To further compare the psychometric properties of the original and synthesized datasets, I evaluated the Test Information Function (TIF), a crucial aspect in IRT models. The TIF indicates the amount of information provided by the test at various levels of the latent trait (theta), which reflects the precision with which a test can measure a person’s ability at different points on the ability scale.
Comparing the TIFs for the original and synthesized datasets allows us to assess if the synthetic data preserves the measurement capabilities of the original test, ensuring that the synthesized items are psychometrically valid across the latent trait continuum.
By plotting and comparing the TIFs, we can also identify regions where the synthesized data may overestimate or underestimate test information, providing valuable insights into its alignment with the original dataset’s performance.
# Define a range of theta values (latent trait values) to calculate test information
theta_values <- seq(-3, 3, by = 0.1)  # Adjust the range and step size as needed

# Extract the Test Information Function (TIF) for the original data
TIF_org <- testinfo(model_org, Theta = theta_values)

# Extract the Test Information Function (TIF) for the synthesized data
TIF_synth <- testinfo(model_synth, Theta = theta_values)

# Plot the TIF for the original data
plot(theta_values, TIF_org, type = "l", col = "darkmagenta", lwd = 2,
     xlim = c(-3, 3), ylim = c(0, max(TIF_org, TIF_synth)),
     xlab = "Theta", ylab = "Test Information",
     main = "Test Information Function (TIF) Comparison")

# Add the TIF for the synthesized data
lines(theta_values, TIF_synth, col = "turquoise", lwd = 2)

# Add a legend
legend("topright", legend = c("Original Data", "Synthesized Data"),
       col = c("darkmagenta", "turquoise"), lwd = 2)
The purple line represents the original data, while the turquoise line represents the synthesized data. The graph shows that the synthesized data generally provides slightly higher test information than the original data across the theta range from -3 to 2, while the two curves converge for examinees at high latent trait levels (i.e., theta > 2).

However, given that the item response distributions of the two datasets are not statistically different, as indicated by the t-test results, the difference in test information between the two datasets is likely modest, and their measurement properties remain broadly comparable.
In conclusion, this analysis demonstrates how synthetic data, when generated thoughtfully and validated against original data, can serve as a valuable alternative for research purposes. By comparing metrics such as item means, discrimination, reliability, and the test information function (TIF), we were able to ensure that the synthesized dataset retains critical psychometric properties of the original dataset, providing robust support for further analysis.
In practical terms, the ability to generate synthetic data that closely mirrors the statistical properties of real-world data is valuable. This can be particularly beneficial in situations where access to original datasets is limited due to privacy concerns or logistical constraints. For instance, when working with sensitive data, synthetic datasets can serve as a useful substitute for testing algorithms, conducting simulations, or performing validation studies, all without compromising individual privacy.
Moreover, understanding how synthetic data behaves compared to its original counterpart allows researchers and data scientists to confidently use synthesized data for a variety of applications, such as model development, training machine learning algorithms, or performing sensitivity analyses. The ability to validate the quality of synthetic data ensures that any decisions or inferences drawn from its use are based on reliable, comparable data.
From a methodological perspective, this analysis also provides a valuable framework for evaluating synthetic data across different contexts. By applying psychometric methods such as item analysis and IRT, or traditional methods such as the t-test, researchers can identify areas of strength and potential improvement in synthetic data, tailoring their data-generation processes to meet specific research needs.
However, it’s essential to remember that synthetic data is not a one-size-fits-all solution. Its effectiveness depends on the quality of the generative models and the methods used to validate the data. It’s always crucial to perform rigorous checks, as we did in this example, to ensure that the synthetic data aligns closely with the original dataset’s characteristics. If the synthetic data does not match the original data, researchers can adjust the data-generation model to better reflect the variability of responses in real-world data.
In sum, the thorough validation of synthesized datasets against real-world data offers reassurance that synthetic data can serve as a trustworthy tool in many research and practical applications. This not only enhances the credibility of synthetic data as an alternative to original datasets but also supports its increasing adoption across disciplines where data privacy, availability, or ethical concerns may otherwise limit the use of actual datasets.
Thank you for following along—hopefully, this post has given you some ideas about how you might use synthetic data in your own work!