In this post, I explore factors that influence postdoctoral researchers’ perspectives on a career in scientific research, using predictive modeling and natural language processing (NLP).
Hi everyone. I have some time during the winter break to write another post. I came across Nature’s 2023 postdoc survey, a series of articles published by Nature that examines the situation of postdoctoral researchers, including their post-pandemic well-being, their use of artificial intelligence chatbots such as ChatGPT, and their career outlook. My general impression is that postdoctoral researchers are under pressure to secure funding and stable jobs despite their high level of education. Some of them even went on strike.
Aside from postdocs, I also came across articles on the prospects of early-career researchers who earned their Ph.D. and moved into a faculty position, such as 'Academics without publications are just like imperial concubines without sons': the 'new times' of Chinese higher education. The paper highlights how the publish-or-perish culture drives researchers to produce as much research as possible to earn job security as a tenured professor. This status quo reflects a state of perpetual competition (dubbed the “academic hunger games”) that threatens the mental health and work-life balance of early-career researchers.
As discussed in van Dalen’s (2020) paper “How the publish-or-perish principle divides a science: the case of economists”, the publish-or-perish culture helps academics advance in rank and position, but it has serious drawbacks: it produces an excessive number of papers that are hardly read, it leads researchers to turn their backs on national issues as they focus on publishing, and it increases the probability of unethical behavior such as fraud or data manipulation.
Interestingly, Nature made the dataset from its 2023 postdoc survey available. Drawing on what I learned from the survey and from articles on the publish-or-perish culture, I thought it would be interesting to explore factors that influence postdoctoral researchers’ perspectives on a career in scientific research with predictive modeling and natural language processing. The dataset has 89 variables, including both numerical and textual data. I divided it into a numerical dataset for the predictive machine learning model and a textual dataset for the follow-up natural language processing analysis. The data were cleaned in R, but I will be using Python in this post.
import pandas as pd
import numpy as np
import sweetviz
import seaborn as sns
import matplotlib.pyplot as plt
#Read the data set
df = pd.read_csv("postdoc_processed.csv", encoding = "ISO-8859-1") #For prediction task
df_text = df[['HOWEMPLOYED', 'EMPLOYSTATUS', 'WHYMOVE', 'WHYEXPCT',
              'WHYINCREASE', 'WHYDECREASE', 'WHYSECONDJOB', 'DISCRIMCAT',
              'DISCRIMPER', 'DISCRIMOBSCAT', 'CHALLENGE', 'LACKEDSKILLS',
              'REFLECT', 'CONTINENT', 'COUNTRY', 'AGE',
              'GENDER', 'RACE']]
df_reflect = df_text.dropna(subset=['REFLECT']) #For NLP
#Read the data set
df = pd.read_csv("postdoc_prediction_cleaned.csv", encoding = "ISO-8859-1")
# Use the analyze function to perform EDA
analyze_df = sweetviz.analyze(df, target_feat = "REGRET")
# Render and show the results of EDA
analyze_df.show_html("analyze.html")
I used the sweetviz package in Python to generate the exploratory data analysis report shown above. The target variable is named REGRET, which is respondents’ answer to the question “Would you recommend to your younger self that you pursue a career in scientific research?”. The answer is either 1 (Yes) or 0 (No). As seen in the data exploration report, yes and no occur in similar proportions.
Next, I select features with RFECV (recursive feature elimination with cross-validation), using a default random forest classifier, a minimum of 20 features, and “roc_auc” as the performance metric.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
X = df.drop('REGRET', axis=1)
y = df['REGRET']
# Perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=123)
# Initialize the RFECV
rfecv_model = RFECV(estimator=rf_classifier,
                    step=1, cv=5,
                    scoring='roc_auc',
                    min_features_to_select=20)
rfecv = rfecv_model.fit(X_train, y_train)
print('Optimal number of features :', rfecv.n_features_)
print('Best features by RFECV with random forest:', X_train.columns[rfecv.support_])
# Keep the features selected by RFECV (same list for the train and test splits)
selected_features = ['POSTDOCNATIVE', 'POSTDOCMOVE', 'FIRSTPOSTDOC', 'PREVPOSTDOC',
                     'POSTDOCEXPCT', 'CAREERSECTOR', 'SALARY', 'SALARYCHANGE', 'PAIDVACAY',
                     'PAIDSICK', 'INSURANCE', 'PENSION', 'PARENTLEAVE', 'CHILDCARE',
                     'BECOMEPARENT', 'DEBT', 'SAVING', 'SAVINGPROBLEM', 'WKHOURS',
                     'OVERTIME', 'OVERNIGHT', 'WEEKENDWK', 'POSTDOCVACANCY', 'SATCHANGE',
                     'SATSALARY', 'SATWELLNESS', 'SATFUNDING', 'SATTIME', 'SATCAREER',
                     'SATTRAINING', 'SATJOBSCURITY', 'SATOPT', 'SATSUPERVISE', 'SATORG',
                     'SATDECISION', 'SATWLB', 'SATWKHOURS', 'SATINTEREST', 'SATACCOM',
                     'SATRELATION', 'SATINDE', 'SATRECOG', 'SATSAFETY', 'SATDIVERSE',
                     'SUPERVISETIME', 'DISCRIMEXP', 'DISCRIMOBS', 'NEEDMHHELP',
                     'AGREEMHSUPP', 'AGREEDIRECTSUPP', 'AGREEWKSUPP', 'AGREEWLB',
                     'AGREELNGWK', 'MHLEAVESCI', 'BLIVGENDEREQ', 'BLIVETHEQ', 'BLIVSAFETY',
                     'BLIVDIGNITY', 'JOBPRSPCT', 'PRSPCTCHNGE', 'POSTDOCLEAVE', 'MINORITY',
                     'LTHEALTHPROB']
X_train_reduced = X_train[selected_features]
X_test_reduced = X_test[selected_features]
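The hard-coded feature names above mirror what RFECV printed. As a minimal sketch on synthetic data (the dataset, the column names `f0` to `f7`, and the small settings are assumptions for illustration, not the survey data), the same subset can be taken directly from the fitted selector's `support_` mask:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the survey data (assumption: 8 hypothetical features)
X_arr, y_demo = make_classification(n_samples=200, n_features=8,
                                    n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

selector = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=1, cv=3, scoring="roc_auc", min_features_to_select=2,
)
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns
selected_cols = X_demo.columns[selector.support_]
X_demo_reduced = X_demo[selected_cols]
print(list(selected_cols))
```

Indexing with the mask keeps the reduced data in sync with the selector if the data or the random seed changes.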
I then use RandomizedSearchCV to randomly search the hyperparameter space for the combination of values that yields the best result. I am tuning n_estimators, max_depth, min_samples_leaf, and min_samples_split with five-fold cross-validation, using accuracy as the performance metric.
# Define the hyperparameter grid to search
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Initialize RandomizedSearchCV with the classifier, hyperparameter grid, and cross-validation
random_search = RandomizedSearchCV(
    rf_classifier,
    param_distributions=param_grid,
    n_iter=10,  # Adjust the number of iterations as needed
    scoring='accuracy',
    cv=5,
    random_state=42
)
# Fit the randomized search on the selected features
random_search.fit(X_train_reduced, y_train)
# Get the best hyperparameters
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)
# Initialize the RandomForestClassifier
rf_classifier_tuned = RandomForestClassifier(n_estimators = 100, min_samples_split = 5, min_samples_leaf = 2, max_depth = 20, random_state=123)
rf_classifier_tuned.fit(X_train_reduced, y_train)
rfc_tuned_predict = rf_classifier_tuned.predict(X_test_reduced)
# Cross-validated AUC; note that this refits the model on folds of the test split
rfc_tuned_cv_score = cross_val_score(rf_classifier_tuned, X_test_reduced, y_test, cv=5, scoring='roc_auc')
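The tuned model above is rebuilt by typing the best hyperparameters in by hand, and its AUC is cross-validated on the test split. A hedged alternative, sketched here on synthetic data (the data and the reduced search space are assumptions, not the survey setup), is to reuse the search's `best_estimator_` and score it once on the held-out split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the reduced survey features (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=123),
    param_distributions={
        "n_estimators": [25, 50],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=4, scoring="accuracy", cv=3, random_state=42,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refit on the whole training split (refit=True by default)
tuned_model = search.best_estimator_
test_auc = roc_auc_score(y_te, tuned_model.predict_proba(X_te)[:, 1])
print(search.best_params_, round(test_auc, 3))
```

This avoids copying hyperparameters by hand and keeps the test split untouched until the final score.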
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn import metrics
#%% Evaluate the tuned RF
print("=== Confusion Matrix ===")
print(confusion_matrix(y_test, rfc_tuned_predict))
print('\n')
print("=== Classification Report ===")
print(classification_report(y_test, rfc_tuned_predict))
print('\n')
print("=== All AUC Scores ===")
print(rfc_tuned_cv_score)
print('\n')
print("=== Mean AUC Score ===")
print("Mean AUC Score - Random Forest: ", rfc_tuned_cv_score.mean())
#Accuracy of the tuned RF: test data
print("accuracy score of the test set for tuned RF", rf_classifier_tuned.score(X_test_reduced, y_test))
#Root mean square error
mse_rfc_tuned = mean_squared_error(y_test, rfc_tuned_predict)
rmse_rfc_tuned = sqrt(mse_rfc_tuned)
print('RMSE of tuned RF = ', rmse_rfc_tuned)
#define metrics for tuned RFC
y_pred_proba_tuned_rf = rf_classifier_tuned.predict_proba(X_test_reduced)[::,1]
fpr_tuned_rf, tpr_tuned_rf, _ = metrics.roc_curve(y_test, y_pred_proba_tuned_rf)
auc_tuned_rf = metrics.roc_auc_score(y_test, y_pred_proba_tuned_rf)
#create ROC curve
plt.plot(fpr_tuned_rf, tpr_tuned_rf, label="Tuned AUC for RF="+str(auc_tuned_rf.round(3)))
plt.legend(loc="lower right")
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
# displaying the title
plt.title("ROC of Tuned RF")
plt.show()
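Because both the true and the predicted labels are 0 or 1, each squared error is 1 for a misclassification and 0 otherwise, so the RMSE printed above is simply the square root of the error rate (1 minus accuracy). A small check with made-up labels (assumed for illustration only):

```python
from math import sqrt, isclose

# Made-up true and predicted 0/1 labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
rmse = sqrt(mse)

# For binary labels, RMSE carries the same information as accuracy
assert isclose(rmse, sqrt(1 - accuracy))
print(accuracy, rmse)
```

In other words, for a classifier the RMSE adds no information beyond the accuracy score already reported.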
#%% Feature importance report of the tuned RF
# Create a pd.Series of features importances
importances_rf = pd.Series(rf_classifier_tuned.feature_importances_, index = X_test_reduced.columns)
# Sort importances_rf
sorted_importance_rf = importances_rf.sort_values()
#Horizontal bar plot
plt.figure(figsize=(8, 20))
sorted_importance_rf.plot(kind='barh', color='lightgreen');
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
From the plot above, the four most influential predictors are as follows:
MHLEAVESCI: Have you considered leaving science because of depression, anxiety, or other mental health concerns related to your work? (0 = No, 1 = Yes).
JOBPRSPCT: How do you feel about your future job prospects? (1 = Extremely negative, 2 = Somewhat negative, 3 = Neither positive nor negative, 4 = Somewhat positive, 5 = Extremely positive).
SATCAREER: Thinking about your current postdoc, how satisfied are you with the following? - Career advancement opportunities. (1 = Extremely dissatisfied, 4 = Neither satisfied nor dissatisfied, 7 = Extremely satisfied).
SATCHANGE: In the past year would you say your level of satisfaction has… (1 = Significantly worsened, 2 = Worsened a little, 3 = Stayed the same, 4 = Improved a little, 5 = Significantly improved).
Let’s explain the model further with DALEX: moDel Agnostic Language for Exploration and eXplanation (Baniecki et al., 2021). We will use a partial dependence plot (PDP), break-down plots, SHapley Additive exPlanations (SHAP) plots, and LIME to explain the predictions.
import dalex as dx
# create an explainer with the dalex package
exp = dx.Explainer(rf_classifier_tuned, X_test_reduced, y_test)
# PDP profiles for the four most influential predictors
RF_mprofile = exp.model_profile(variables = ["MHLEAVESCI", "JOBPRSPCT", "SATCAREER", "SATCHANGE"] , type = "partial")
# plot the PDP profiles
RF_mprofile.plot()
The PDP plot above shows that MHLEAVESCI has a negative influence on the prediction: individuals who respond 1 (considered leaving science because of mental health concerns) are more likely to have the negative outcome (not recommending that their younger self pursue a career in scientific research).
The other three predictors have a positive influence on the prediction: the more positive individuals feel about their job prospects, their career advancement opportunities, and the change in their satisfaction over the past year, the more likely they are to have the positive outcome (recommending that their younger self pursue a career in scientific research).
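To make the PDP concrete: the partial dependence value for a feature is the average predicted probability when that feature is forced to a given value for every row. The following is a minimal sketch on synthetic data; the two columns, the assumed outcome model, and the helper `partial_dependence_1d` are hypothetical and are not dalex's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical data: column 0 is a 0/1 flag, column 1 is a 1-5 rating
X_demo = np.column_stack(
    [rng.integers(0, 2, 400), rng.integers(1, 6, 400)]
).astype(float)
# Assumed outcome: less likely to be 1 when the flag is set, more likely with a high rating
logit = -1.5 * X_demo[:, 0] + 0.8 * (X_demo[:, 1] - 3)
y_demo = (rng.random(400) < 1 / (1 + np.exp(-logit))).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted P(y=1) with one feature fixed at each grid value."""
    averages = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_fixed)[:, 1].mean())
    return np.array(averages)

pd_flag = partial_dependence_1d(model, X_demo, 0, [0, 1])
pd_rating = partial_dependence_1d(model, X_demo, 1, [1, 2, 3, 4, 5])
print(pd_flag, pd_rating)
```

A downward profile for the flag and an upward profile for the rating correspond to the negative and positive influences described above.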
We can further explore the prediction for individual positive and negative cases. A positive case is a postdoctoral researcher who answered 1 on the target variable (recommending that their younger self pursue a career in scientific research); a negative case answered 0. We will use break-down, SHAP, and LIME plots to explore the prediction for each individual case.
# Break Down for observation 202
rf_pparts = exp.predict_parts(new_observation = X_test_reduced.iloc[202], type = "break_down")
# plot Break Down
rf_pparts.plot()
# SHAP for observation 202
rf_pparts_shap = exp.predict_parts(new_observation = X_test_reduced.iloc[202], type = "shap")
# plot SHAP
rf_pparts_shap.plot()
# lime explanation
rf_pparts_lime = exp.predict_surrogate(new_observation = X_test_reduced.iloc[202])
# plot LIME
rf_pparts_lime.show_in_notebook()
# Break Down for observation 6
rf_pparts = exp.predict_parts(new_observation = X_test_reduced.iloc[6], type = "break_down")
# plot Break Down
rf_pparts.plot()
# SHAP for observation 6
rf_pparts_shap = exp.predict_parts(new_observation = X_test_reduced.iloc[6], type = "shap")
# plot SHAP
rf_pparts_shap.plot()
# lime explanation
rf_pparts_lime = exp.predict_surrogate(new_observation = X_test_reduced.iloc[6])
# plot LIME
rf_pparts_lime.show_in_notebook()
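Break-down and SHAP plots both split one prediction into per-feature contributions that add up to the difference between that prediction and a reference prediction. As a rough sketch of the idea, the snippet below computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all orderings. The function `toy_model`, its coefficients, and the two respondents are hypothetical, and this is not dalex's internal implementation.

```python
from itertools import permutations

# Hypothetical scoring function over three survey-like features
def toy_model(x):
    return (0.5 - 0.3 * x["MHLEAVESCI"] + 0.05 * x["JOBPRSPCT"]
            + 0.02 * x["MHLEAVESCI"] * x["SATCAREER"])

baseline = {"MHLEAVESCI": 0, "JOBPRSPCT": 3, "SATCAREER": 4}  # assumed reference respondent
instance = {"MHLEAVESCI": 1, "JOBPRSPCT": 5, "SATCAREER": 2}  # assumed respondent to explain

def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)
        prev = model(current)
        for f in order:
            current[f] = instance[f]  # switch this feature to the explained value
            new = model(current)
            contrib[f] += new - prev
            prev = new
    return {f: total / len(orderings) for f, total in contrib.items()}

phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

A single ordering of this loop corresponds to one break-down path; averaging over orderings removes the dependence on the order in which features are introduced.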
df_text['REFLECT']
from wordcloud import WordCloud, STOPWORDS
# Custom stopwords
custom_stopwords = ["post doc", "doc", "nan", "don", "phd", "postdoc"]
# Concatenate all text data in the 'REFLECT' column
text_data = ' '.join(df_reflect['REFLECT'])
# Remove English stopwords and custom stopwords
stopwords = set(STOPWORDS)
stopwords.update(custom_stopwords)
# Create a WordCloud object with specified stopwords
wordcloud = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords).generate(text_data)
# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off') # Turn off axis numbers and ticks
plt.show()
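A word cloud sizes each word by how often it appears once stopwords are removed. The counting step can be sketched with the standard library alone; the sample sentences and the tiny stopword set below are made up for illustration (the post itself relies on wordcloud's `STOPWORDS`).

```python
import re
from collections import Counter

# Made-up reflections standing in for the REFLECT column
reflections = [
    "Low salary and no job security in academia",
    "I wish I had known how low the salary is",
    "Job security matters more than publications",
]
# Tiny illustrative stopword set (assumption)
demo_stopwords = {"i", "and", "no", "in", "had", "how", "the", "is", "more", "than"}

# Lowercase, split into alphabetic tokens, and drop stopwords
tokens = [
    word
    for sentence in reflections
    for word in re.findall(r"[a-z]+", sentence.lower())
    if word not in demo_stopwords
]
word_freq = Counter(tokens)
print(word_freq.most_common(3))
```

A frequency mapping like `word_freq` can also be passed to `WordCloud.generate_from_frequencies` when the counts need to be inspected or adjusted before plotting.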
def plot_10_most_common_words(count_data, tfidf_vectorizer):
    import matplotlib.pyplot as plt
    # get_feature_names() was removed in recent scikit-learn; use get_feature_names_out()
    words = tfidf_vectorizer.get_feature_names_out()
    total_counts = np.zeros(len(words))
    for t in count_data:
        total_counts += t.toarray()[0]

    count_dict = (zip(words, total_counts))
    count_dict = sorted(count_dict, key=lambda x: x[1], reverse=True)[0:10]
    words = [w[0] for w in count_dict]
    counts = [w[1] for w in count_dict]
    x_pos = np.arange(len(words))

    plt.bar(x_pos, counts, align='center')
    plt.xticks(x_pos, words, rotation=90)
    plt.xlabel('Words')
    plt.ylabel('Counts')
    plt.title('10 Most Common Words')
    plt.show()
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
# Custom stopwords, including your own stopwords
custom_stopwords = set(["post doc", "doc", "nan", "don", "phd", "postdoc"])
# Combine standard English stopwords with custom stopwords
# (passed as a list, since recent scikit-learn versions reject a set for stop_words)
stopwords = list(set(text.ENGLISH_STOP_WORDS).union(custom_stopwords))
# Handle NaN values by filling them with an empty string
df_text['REFLECT'] = df_text['REFLECT'].fillna('')
# Initialize the TF-IDF vectorizer with the combined stop words
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
# Fit and transform the reflections
count_data = tfidf_vectorizer.fit_transform(df_text['REFLECT'])
# Visualize the 10 most common words
plot_10_most_common_words(count_data, tfidf_vectorizer)
from sklearn.feature_extraction import text
import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
# Custom stopwords, including your own stopwords
custom_stopwords = set(["post doc", "doc", "nan", "don", "phd"])
# Combine standard English stopwords with custom stopwords
# (passed as a list, since recent scikit-learn versions reject a set for stop_words)
stopwords = list(set(text.ENGLISH_STOP_WORDS).union(custom_stopwords))
tf_vectorizer = CountVectorizer(strip_accents='unicode',
                                stop_words=stopwords,
                                lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b',
                                ngram_range=(1, 2),
                                max_df=0.5,
                                min_df=10)
dtm_tf = tf_vectorizer.fit_transform(df_text['REFLECT'].values.astype('U'))
tfidf_vectorizer = TfidfVectorizer(**tf_vectorizer.get_params())
dtm_tfidf = tfidf_vectorizer.fit_transform(df_text['REFLECT'].values.astype('U'))
# for TF DTM
lda_tf = LatentDirichletAllocation(n_components=5, random_state=0)
lda_tf.fit(dtm_tf)
# for TFIDF DTM
lda_tfidf = LatentDirichletAllocation(n_components=5, random_state=0)
lda_tfidf.fit(dtm_tfidf)
pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
# Print the ten highest-weighted terms for each topic
feature_names = tf_vectorizer.get_feature_names_out()
for i, topic in enumerate(lda_tf.components_):
    print(f'Top 10 words for topic #{i}:')
    print([feature_names[j] for j in topic.argsort()[-10:]])
    print('\n')
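One way to find representative excerpts for a topic is to assign each response to its most probable topic using the document-topic matrix. The sketch below uses made-up sentences and two topics; both are assumptions for illustration and not the five-topic model fitted above.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up responses standing in for the REFLECT column
docs = [
    "low salary and high cost of living",
    "salary does not cover rent and living costs",
    "grant funding deadlines leave little time for research",
    "funding applications take time away from research",
    "cost of living rises while salary stays low",
    "research time is lost to funding paperwork",
]
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)          # one row per document; each row sums to 1
dominant_topic = doc_topic.argmax(axis=1)   # index of the most probable topic

for doc, topic in zip(docs, dominant_topic):
    print(topic, doc)
```

Filtering the original responses by `dominant_topic` then gives candidate excerpts for each topic, which still need to be read and judged by hand.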
Most topics are about work in academia and research, but topics 0 and 2 stand out. Topic 0 seems to be about concerns over resources such as time and funding. Topic 2 seems to be about financial matters such as cost of living and salary. Some excerpts from respondents’ reflections are:
“Mostly, I would advise my past self to choose a different PI and lab for my postdoc. Beyond that, I wish had known about the cost of doing a postdoc in terms of lost income potential relative to other careers, and the reality of how few academic jobs there are in my field.”
“High degree of competitiveness, low salaries, strenuous hours, publish or perish, excessive stress, no psychological support, salary relationship with respect to the degree of studies/specialization with respect to other jobs”
“Wish I was aware of lack of support system for Postdocs. Beyond publishing our work, there is no other way of measuring ones caliber in science. It also depends of how well known the PI is. I also wish I had a better insight into how low the salary and other benefits are. Above all I wish I had known the difficulty and lack of available positions in academia to be pursued.”
“I wish that they were teaching us more that working in academia means never having job security. I wish that I knew that there is not enough jobs and that only way to land a permanent job in academia is to become a professor.”
“Science is not valued as it should be. Job prospects and opportunities have gotten worse. And most importantly the salary as post doc doesn’t help you make the ends meet at all. Post doc and academia only work for those who have lots of savings and strong financial ability to be underpaid for years of post doc until they build up their resume and publish sufficient amount of papers without having to think about financial problems.”
“There is a wealth of research possible in industry, that pays better, is organized better, and allows you to do MORE research, less beauracracy and teaching, working with real professionals, is goal oriented, not article oriented, and allows you to maintain a work life balance, have time for a partner, family, a house, and retirement.”
The wording of the topics is similar to the negative-case results as well. However, topic 4 differs from the negative-case results in that it concerns balancing tasks and responsibilities (e.g., work, project, writing), which does not appear in the negative-case dataset. Some excerpts from respondents’ reflections are:
“Personal enthusiasm and expectations for academic research will decline rapidly at a certain time, and good habits need to be developed to resist the frustration brought about by the decline in enthusiasm and physical strength; to exercise writing skills; to balance family and research, and to Research can always help yourself, but family doesn’t necessarily”
“Although I would still pursue a career in science, I would definitely not waste the years I have spent in my postdoc and rather head straight for industry job after my PhD. I am already miles behind in terms of salary and growth compared to my peers who went straight to industry after their degrees.”
“Having a better saving culture, work/family time balance, and having participated more in financing processes or acquiring scholarships, etc.”
“Everyone is going to be working just as hard as you, the best you can do is to do good science, follow your interests, and pursue every potential opportunity for funding and development.”
“Publishing in high impact journals is not as awarding as having a great life-work balance. Reach out and accept help and initiate collaborations.”
“You will have way less time for bench research because you will be writing fellowships/grants, spending time for committee service, and mentoring compared to when in grad school.”
In conclusion, the data analysis methods discussed in this post are not novel (I have posted about them previously), but combining them can offer insights into a subject. The findings align with the existing literature, emphasizing the pivotal role of mental health in individual resilience in academia.
Navigating a postdoctoral research role naturally prompts considerations of job prospects, salary, and work-life balance, all crucial factors influencing one’s professional satisfaction. In academia, where competition is constant, elements such as maintaining work-life balance, coping with the pressure to publish, and confronting job insecurity can significantly impact mental well-being.
This analysis underscores the importance of prioritizing personal well-being over professional pursuits when faced with such choices. Although academia boasts merits, uncertainties regarding financial stability, job security, and work-life balance still exist. Choosing a career in academia demands thoughtful consideration of these complexities. Anyway, thank you so much for reading!
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wongvorachan (2023, Dec. 15). Tarid Wongvorachan: Exploring Nature's 2023 postdoc survey with predictive modeling and NLP. Retrieved from https://taridwong.github.io/posts/2023-12-10-naturepostdoc/
BibTeX citation
@misc{wongvorachan2023exploring, author = {Wongvorachan, Tarid}, title = {Tarid Wongvorachan: Exploring Nature's 2023 postdoc survey with predictive modeling and NLP}, url = {https://taridwong.github.io/posts/2023-12-10-naturepostdoc/}, year = {2023} }