This blog post explores the use of psychometric methods—such as item analysis, independent t-tests, and item response theory (IRT)—to evaluate the quality of synthetic survey data.
Hi everyone. It’s been a while since my last blog post. I have some time while finishing up my semester to try out new research methods and document what I learned. As a researcher and data scientist, I often have the urge to try my hand at new datasets and, when possible, share my findings with the public.
However, handling sensitive datasets, particularly in psychological research, can pose challenges related to confidentiality and ethics. Psychological studies frequently involve surveys that ask participants about their personal experiences, thoughts, and feelings. These responses can include sensitive topics like trauma, substance abuse, or mental health issues. Researchers must handle this data with care to maintain participant confidentiality and trust.
Synthetic data offers a practical alternative, allowing researchers to analyze or teach using synthetic datasets while preserving the statistical properties of the original data. This blog demonstrates synthesizing a dataset of examinees’ responses to a 12-item psychological scale using R’s synthpop package and evaluating the psychometric properties of the synthetic data.
The dataset (data_org) contains real examinee responses to a 12-item scale measuring a psychological construct. Each item is answered on a seven-point Likert scale ranging from 1 (“mostly false”) to 7 (“definitely true”), indicating the examinee’s agreement with the statement presented.
Before synthesizing the data, rows with missing responses are removed with na.omit(), as sketched below.
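A minimal sketch of this step (the file name is hypothetical; any import that yields the 12-item response data frame works):

# Hypothetical file of raw responses; replace with the actual data source
data_org <- read.csv("responses.csv")

# Listwise-delete any cases with missing responses
data_org <- na.omit(data_org)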
The syn function from the synthpop package synthesizes data using the CART (Classification and Regression Trees) method. The documentation of the package can be viewed here.
This step generates a synthetic dataset (data_synth) designed to mirror the statistical properties of data_org.
library(synthpop)

# Synthesize the data with CART; setting a seed makes the synthesis reproducible
synth.obj <- syn(data_org, method = "cart", seed = 12345)
data_synth <- synth.obj$syn
The compare function visually compares the distributions of selected variables between the original and synthetic datasets; a minimal call is sketched below.
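Following synthpop’s documented interface, compare dispatches on the synthesis object, with the original data passed as the second argument:

# Plot side-by-side distributions of each variable in the synthetic vs. original data
compare(synth.obj, data_org)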
The utility.gen function quantifies the similarity between the datasets using utility measures:

util_gen <- utility.gen(synth.obj, data_org)
util_gen
Utility score calculated by method: cart
Call:
utility.gen.synds(object = synth.obj, data = data_org)
Null utilities simulated from a permutation test with 50 replications.
Selected utility measures
pMSE S_pMSE
0.063602 1.691113
The standardized pMSE (S_pMSE) of 1.69 is close to its null expectation of 1, suggesting that the synthetic data is difficult to distinguish from the original at the overall level.

To assess whether the synthetic data aligns with the original dataset at the item level, I conducted independent-samples t-tests for each of the 12 items in the scale. This statistical test evaluates whether there are significant differences in the mean responses between the original (data_org) and synthetic (data_synth) datasets.
First, I initialized a data frame, t_test_results, to store the outcomes of the t-tests. This data frame included columns for the item number (Item), the calculated t-value (T_Statistic), the corresponding p-value (P_Value), and a logical indicator (Significant) that flagged whether the difference was statistically significant at the 5% level (p < 0.05). This setup provided a structured way to systematically record and interpret the results.
Next, I used a loop to iterate through all 12 items in the dataset. For each item, the responses from the original and synthetic datasets were extracted. An independent-samples t-test was performed to compare their means, and the results were stored in the t_test_results data frame. Specifically, the t-statistic and p-value were recorded, and the Significant column was updated to indicate whether the observed differences were statistically significant. This approach ensured a consistent and transparent validation process across all items.
Finally, the completed t_test_results data frame offered a comprehensive summary of the t-test results for all items. Items with significant differences (p < 0.05) were flagged, highlighting variables where the synthetic data deviated notably from the original. Conversely, non-significant results confirmed that the synthetic data closely replicated the original for those items. These findings provided valuable insights into the synthetic dataset’s accuracy, identifying potential areas for refinement and reinforcing its suitability for statistical and psychometric applications.
# Initialize an empty data frame to store the t-test results
t_test_results <- data.frame(
  Item = 1:12,
  T_Statistic = numeric(12),
  P_Value = numeric(12),
  Significant = logical(12)
)

# Perform an independent-samples t-test for each item
for (i in 1:12) {
  # Extract responses for the current item from both datasets
  org_responses <- data_org[, i]
  synth_responses <- data_synth[, i]

  # Perform the independent t-test
  t_test <- t.test(org_responses, synth_responses)

  # Store the results in the data frame
  t_test_results$T_Statistic[i] <- t_test$statistic
  t_test_results$P_Value[i] <- t_test$p.value

  # Flag p-values below 0.05 (significant at the 5% level)
  t_test_results$Significant[i] <- t_test$p.value < 0.05
}

# View the t-test results
print(t_test_results)
Item T_Statistic P_Value Significant
1 1 1.24725294 0.2125793 FALSE
2 2 0.37685436 0.7063567 FALSE
3 3 0.03819008 0.9695433 FALSE
4 4 1.04233137 0.2974951 FALSE
5 5 0.75686676 0.4492967 FALSE
6 6 0.24844901 0.8038351 FALSE
7 7 -0.63765775 0.5238332 FALSE
8 8 0.08628668 0.9312547 FALSE
9 9 0.56532140 0.5719741 FALSE
10 10 0.74877731 0.4541563 FALSE
11 11 0.46332527 0.6432256 FALSE
12 12 0.65988433 0.5094704 FALSE
To visualize the results, I created a bar chart with the ggplot2 package. This visualization illustrates the t-statistic for each item, making it easier to identify which items showed significant differences between the original and synthetic datasets. The plot serves as a clear and intuitive representation of the t-test results, providing stakeholders with a quick overview of the datasets’ alignment.

library(ggplot2)

ggplot(t_test_results, aes(x = factor(Item), y = T_Statistic, fill = Significant)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7) +
  labs(title = "Independent Sample T-Test: Item-wise Comparison at p < .05",
       x = "Item", y = "T-Statistic") +
  scale_fill_manual(values = c("TRUE" = "turquoise", "FALSE" = "darkmagenta")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_text(aes(label = ifelse(Significant, "*", "")),
            vjust = -0.5, color = "black", size = 5) +
  annotate("text", x = 2, y = max(t_test_results$T_Statistic) + 2,
           label = "Negative t-statistic: Mean of original < Synthesized\nPositive t-statistic: Mean of original > Synthesized",
           size = 4, hjust = 0, color = "black", fontface = "italic")
The t-statistics range from -0.64 to 1.25 across items, indicating only small differences between the means of the original and synthetic datasets for each item.
None of the comparisons are significant, as indicated by the legend (all labeled as “FALSE”). This means that the differences between the original and synthetic datasets are not statistically significant at the p < .05 level.
To evaluate and compare the psychometric properties of the original and synthetic datasets, I conducted item analyses using the itemAnalysis function from the CTT package. This step aimed to assess whether the synthetic dataset preserved key measurement characteristics, such as item means and discrimination indices, ensuring it aligns with the original dataset’s intended purpose as a measurement instrument.
For the original dataset (data_org), the itemAnalysis function was applied to generate a detailed report of psychometric properties. This analysis included calculations of item means, item-total correlations (discrimination indices), and flagged items based on pre-specified thresholds for performance issues. The same analysis was then conducted on the synthetic dataset (data_synth) using identical parameters.
By comparing the results from both datasets, I could evaluate whether the synthetic dataset accurately replicated the original dataset’s psychometric properties. Specifically, alignment in item means would suggest that the synthetic data preserves the central tendency of responses, while similarity in discrimination indices would indicate that the synthetic data retains the original dataset’s ability to differentiate between examinees effectively. These analyses are essential for validating the synthetic dataset’s utility for psychometric research and educational applications.
library(CTT)

# Perform item analysis on the original data
org_item_analysis <- itemAnalysis(data_org, itemReport = TRUE, NA.Delete = TRUE,
                                  pBisFlag = TRUE, bisFlag = TRUE, flagStyle = c("X", ""))

# Perform item analysis on the synthesized data
synth_item_analysis <- itemAnalysis(data_synth, itemReport = TRUE, NA.Delete = TRUE,
                                    pBisFlag = TRUE, bisFlag = TRUE, flagStyle = c("X", ""))
To visually compare the psychometric properties of the original and synthetic datasets, I created side-by-side bar charts for two key metrics: item means and item discrimination indices. These metrics are critical for understanding the central tendency of item responses and the items’ ability to differentiate between respondents based on their overall performance.
First, I extracted the item means and discrimination indices from the itemAnalysis results for both datasets. This involved organizing the data into a data frame that paired each item with its corresponding mean and discrimination index values for both the original and synthetic datasets. This structure enabled a direct comparison of the two datasets’ properties.

To prepare the data for visualization, I reshaped the data frame into a “long” format using the pivot_longer function from the tidyr package (part of the tidyverse). For item means, I created a data frame where each item was associated with its mean score from both datasets, labeled as either “Original_Mean” or “Synthesized_Mean.” Similarly, for item discrimination indices, I created another data frame with the labels “Original_Discrimination” and “Synthesized_Discrimination.” This long-format structure is ideal for creating grouped bar charts in ggplot2.
This step laid the groundwork for creating side-by-side bar charts that clearly depict how closely the synthetic data aligns with the original data. By focusing on item means and discrimination indices, these visualizations provide critical insights into whether the synthetic dataset faithfully preserves the original data’s psychometric characteristics, ensuring it remains a valid representation for educational or research purposes.
# Plot side-by-side bar charts for item means and discrimination
library(dplyr)  # for %>% and select()

# Extract item means and discrimination for original and synthetic data
item_comparison <- data.frame(
  Item = seq_along(org_item_analysis$itemReport$itemMean),
  Original_Mean = org_item_analysis$itemReport$itemMean,
  Synthesized_Mean = synth_item_analysis$itemReport$itemMean,
  Original_Discrimination = org_item_analysis$itemReport$pBis,
  Synthesized_Discrimination = synth_item_analysis$itemReport$pBis
)

# Reshape the data for ggplot
item_means_long <- item_comparison %>%
  select(Item, Original_Mean, Synthesized_Mean) %>%
  tidyr::pivot_longer(cols = c(Original_Mean, Synthesized_Mean),
                      names_to = "Dataset", values_to = "Mean")

item_discrimination_long <- item_comparison %>%
  select(Item, Original_Discrimination, Synthesized_Discrimination) %>%
  tidyr::pivot_longer(cols = c(Original_Discrimination, Synthesized_Discrimination),
                      names_to = "Dataset", values_to = "Discrimination")
# Plot item means
ggplot(item_means_long, aes(x = as.factor(Item), y = Mean, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Item Means Comparison", x = "Item", y = "Mean Score") +
  theme_minimal() +
  scale_fill_manual(values = c("Original_Mean" = "darkmagenta", "Synthesized_Mean" = "turquoise")) +
  theme(plot.title = element_text(hjust = 0.5))
For most items, the mean scores of the original and synthesized datasets are quite similar. This indicates that the synthetic data closely replicates the central tendencies of the original data.
To assess the ability of items to differentiate between respondents based on their trait levels, I created a grouped bar chart comparing the item discrimination indices (point-biserial correlations, pBis) of the original and synthetic datasets. A solid red horizontal line is added at a discrimination index of 0.2 using geom_hline(yintercept = 0.2, color = "red", linetype = "solid"). This line marks a commonly accepted threshold for minimal acceptable discrimination, offering a benchmark against which the indices can be evaluated.
# Plot item discrimination
ggplot(item_discrimination_long, aes(x = as.factor(Item), y = Discrimination, fill = Dataset)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Item Discrimination Comparison \n The red line represents minimal acceptable value",
       x = "Item", y = "Discrimination Index") +
  theme_minimal() +
  scale_fill_manual(values = c("Original_Discrimination" = "darkmagenta", "Synthesized_Discrimination" = "turquoise")) +
  geom_hline(yintercept = 0.2, color = "red", linetype = "solid") +
  theme(plot.title = element_text(hjust = 0.5))
To compare the reliability of the original and synthetic datasets, I constructed a bar chart that visualizes the coefficient alpha values for each dataset. Coefficient alpha is a measure of internal consistency, indicating the extent to which items in a scale measure the same construct. This comparison helps assess whether the synthetic data faithfully reproduces the reliability of the original data.
The input data for this visualization is encapsulated in the alpha_comparison data frame, which contains two rows: one for the original dataset and one for the synthesized dataset. Each row includes the respective alpha value, calculated during the item analysis process.
# Reliability comparison data frame
alpha_comparison <- data.frame(
  Dataset = c("Original", "Synthesized"),
  Alpha = c(org_item_analysis$alpha, synth_item_analysis$alpha)
)

# Bar plot with values displayed above the bars
ggplot(alpha_comparison, aes(x = Dataset, y = Alpha, fill = Dataset)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = round(Alpha, 3)), vjust = -0.3, color = "black", size = 5) + # Display values above bars
  labs(title = "Reliability (Coefficient Alpha) Comparison",
       y = "Alpha Reliability Coefficient") +
  theme_minimal() +
  scale_fill_manual(values = c("Original" = "darkmagenta", "Synthesized" = "turquoise")) +
  theme(plot.title = element_text(hjust = 0.5)) # Center the title
To evaluate the psychometric properties of the original and synthesized datasets under the framework of Item Response Theory (IRT), I implemented Graded Response Models (GRM) using the mirt package in R. GRMs are particularly well suited for analyzing polytomous item responses, as they estimate parameters that describe the relationship between latent traits (e.g., ability) and the probability of item responses across multiple categories.

The GRM was fitted separately to the original and synthesized datasets using the mirt function. The syntax specifies model = 1, indicating a unidimensional model, and itemtype = "graded", which defines the GRM as the chosen IRT model. The outputs are stored in model_org and model_synth for the original and synthesized data, respectively.
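The fitting code is shown below; it mirrors the Call lines printed in the model output that follows:

library(mirt)

# Fit a unidimensional graded response model (GRM) to each dataset
model_org <- mirt(data = data_org, model = 1, itemtype = "graded")
model_synth <- mirt(data = data_synth, model = 1, itemtype = "graded")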
Printing model_org provides an overview of the original dataset’s IRT model, including the number of dimensions and items, while summary(model_org) displays the standardized factor loadings (F1) and communalities (h2) of each item.

model_org
Call:
mirt(data = data_org, model = 1, itemtype = "graded")
Full-information item factor analysis with 1 factor(s).
Converged within 1e-04 tolerance after 31 EM iterations.
mirt version: 1.43
M-step optimizer: BFGS
EM acceleration: Ramsay
Number of rectangular quadrature: 61
Latent density type: Gaussian
Log-likelihood = -9210.903
Estimated parameters: 84
AIC = 18589.81
BIC = 18949.52; SABIC = 18682.87
G2 (1e+10) = 11699.78, p = 1
RMSEA = 0, CFI = NaN, TLI = NaN
summary(model_org)
F1 h2
Q1 0.726 0.5269
Q2 0.673 0.4536
Q3 0.178 0.0318
Q4 0.756 0.5723
Q5 0.250 0.0627
Q6 0.785 0.6155
Q7 0.133 0.0177
Q8 0.741 0.5497
Q9 0.743 0.5513
Q10 0.706 0.4990
Q11 0.224 0.0500
Q12 0.654 0.4283
SS loadings: 4.359
Proportion Var: 0.363
Factor correlations:
F1
F1 1
Printing model_synth and its summary provides the corresponding information for the synthesized dataset.

model_synth
Call:
mirt(data = data_synth, model = 1, itemtype = "graded")
Full-information item factor analysis with 1 factor(s).
Converged within 1e-04 tolerance after 22 EM iterations.
mirt version: 1.43
M-step optimizer: BFGS
EM acceleration: Ramsay
Number of rectangular quadrature: 61
Latent density type: Gaussian
Log-likelihood = -9220.621
Estimated parameters: 83
AIC = 18607.24
BIC = 18962.67; SABIC = 18699.2
G2 (1e+10) = 11721.99, p = 1
RMSEA = 0, CFI = NaN, TLI = NaN
summary(model_synth)
F1 h2
Q1 0.751 0.5643
Q2 0.705 0.4973
Q3 0.171 0.0294
Q4 0.811 0.6577
Q5 0.342 0.1169
Q6 0.789 0.6224
Q7 0.153 0.0233
Q8 0.732 0.5361
Q9 0.759 0.5758
Q10 0.716 0.5126
Q11 0.141 0.0199
Q12 0.632 0.3999
SS loadings: 4.556
Proportion Var: 0.38
Factor correlations:
F1
F1 1
To further compare the psychometric properties of the original and synthesized datasets, I evaluated the Test Information Function (TIF), a crucial aspect in IRT models. The TIF indicates the amount of information provided by the test at various levels of the latent trait (theta), which reflects the precision with which a test can measure a person’s ability at different points on the ability scale.
Comparing the TIFs for the original and synthesized datasets allows us to assess if the synthetic data preserves the measurement capabilities of the original test, ensuring that the synthesized items are psychometrically valid across the latent trait continuum.
By plotting and comparing the TIFs, we can also identify regions where the synthesized data may overestimate or underestimate test information, providing valuable insights into its alignment with the original dataset’s performance.
# Define a range of theta values (latent trait values) to calculate test information
theta_values <- seq(-3, 3, by = 0.1)  # Adjust the range and step size as needed

# Extract the Test Information Function (TIF) for the original data
TIF_org <- testinfo(model_org, Theta = theta_values)

# Extract the Test Information Function (TIF) for the synthesized data
TIF_synth <- testinfo(model_synth, Theta = theta_values)

# Plot the TIF for the original data
plot(theta_values, TIF_org, type = "l", col = "darkmagenta", lwd = 2,
     xlim = c(-3, 3), ylim = c(0, max(TIF_org, TIF_synth)),
     xlab = "Theta", ylab = "Test Information",
     main = "Test Information Function (TIF) Comparison")

# Add the TIF for the synthesized data
lines(theta_values, TIF_synth, col = "turquoise", lwd = 2)

# Add a legend
legend("topright", legend = c("Original Data", "Synthesized Data"),
       col = c("darkmagenta", "turquoise"), lwd = 2)
The purple line represents the original data, while the turquoise line represents the synthesized data. The graph shows that the synthesized data generally provides slightly higher test information than the original data across the theta range from -3 to 2, while the two curves converge for examinees at high latent trait levels (i.e., theta > 2).

However, given that the item response distributions of the two datasets are not statistically different, as indicated by the t-test results, the difference in test information between the two datasets is likely modest, and their measurement properties remain broadly comparable.
In conclusion, this analysis demonstrates how synthetic data, when generated thoughtfully and validated against original data, can serve as a valuable alternative for research purposes. By comparing metrics such as item means, discrimination, reliability, and the test information function (TIF), we were able to ensure that the synthesized dataset retains critical psychometric properties of the original dataset, providing robust support for further analysis.
In practical terms, the ability to generate synthetic data that closely mirrors the statistical properties of real-world data is valuable. This can be particularly beneficial in situations where access to original datasets is limited due to privacy concerns or logistical constraints. For instance, when working with sensitive data, synthetic datasets can serve as a useful substitute for testing algorithms, conducting simulations, or performing validation studies, all without compromising individual privacy.
Moreover, understanding how synthetic data behaves compared to its original counterpart allows researchers and data scientists to confidently use synthesized data for a variety of applications, such as model development, training machine learning algorithms, or performing sensitivity analyses. The ability to validate the quality of synthetic data ensures that any decisions or inferences drawn from its use are based on reliable, comparable data.
From a methodological perspective, this analysis also provides a valuable framework for evaluating synthetic data across different contexts. By applying psychometric methods such as item analysis and IRT, or traditional methods such as the t-test, researchers can identify areas of strength and potential improvement in synthetic data, tailoring their data-generation processes to meet specific research needs.
However, it’s essential to remember that synthetic data is not a one-size-fits-all solution. Its effectiveness depends on the quality of the generative models and the methods used to validate the data. It’s always crucial to perform rigorous checks, as we did in this example, to ensure that the synthetic data aligns closely with the original dataset’s characteristics. If the synthetic data does not match the original data, researchers can adjust the data-generation model to better reflect the variability of responses in real-world data.
In sum, the thorough validation of synthesized datasets against real-world data offers reassurance that synthetic data can serve as a trustworthy tool in many research and practical applications. This not only enhances the credibility of synthetic data as an alternative to original datasets but also supports its increasing adoption across disciplines where data privacy, availability, or ethical concerns may otherwise limit the use of actual datasets.
Thank you for following along—hopefully, this post has given you some ideas about how you might use synthetic data in your own work!