For this post, I will use a semi-supervised learning approach to perform a classification task on highly imbalanced data.
(7 min read)
Machine Learning (ML) is the process of learning the best possible and most relevant patterns, relationships, or associations from a data set to predict outcomes on unseen data. Broadly, there are three types of ML processes:
Supervised Learning, which is a process that trains an ML model on a labelled data set. The model aims to find the relationships between the independent and dependent variables to predict unseen data we may receive in the future.
Unsupervised Learning, which is a process of training an ML model on a data set in which the target variable is not known. The model aims to find the most relevant patterns in the data or segments of the data.
Semi-Supervised Learning, which is a combination of the supervised and unsupervised learning processes, in which unlabeled data is also used to train the model. In this approach, the properties of unsupervised learning are used to learn the best possible representation of the data, and the properties of supervised learning are used to learn the relationships needed to make predictions.
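As a quick illustration of the semi-supervised idea (separate from the autoencoder approach used later in this post), scikit-learn ships label-propagation models that spread a handful of known labels across unlabeled points, which are marked with -1. The snippet below is a minimal sketch on synthetic data; the variable names are made up for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X_demo, y_demo = make_classification(n_samples=200, random_state=0)
y_partial = y_demo.copy()
y_partial[50:] = -1  ## pretend only the first 50 labels are known
semi_model = LabelSpreading().fit(X_demo, y_partial)
## transduction_ holds the label inferred for every point
print((semi_model.transduction_[50:] == y_demo[50:]).mean())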
Each ML algorithm has its own use depending on a) the size, quality, and nature of the data, b) the available computational time, c) the urgency of the task, and d) the expected result. Among the vast range of available algorithms, each ML type has its own role in tackling different types of data science problems under these considerations. The ML algorithm cheat sheet below is a good starting point for choosing algorithms that are appropriate for your specific problem.
Semi-supervised learning can be seen as a restatement of the missing data imputation problem, specific to small data sets with missing-label cases. This problem is commonly encountered when generating data sets, as retrieving clean labelled data can be costly and time consuming. Applying supervised machine learning techniques to a small data set may yield poor results that cannot be used further; it is therefore more useful to address this problem with a combination of both machine learning approaches (i.e., unsupervised and supervised) for optimal results.
For this post, I will be using the Credit Card Fraud Detection data set from the Université Libre de Bruxelles (Brussels, Belgium) machine learning group. The data set contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 fraud transactions out of 284,807 transactions. We will first import the Python modules and the data set as usual.
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, accuracy_score
from sklearn.manifold import TSNE
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
set(style="whitegrid")
sns.203)
np.random.seed(
= pd.read_csv("creditcard.csv")
data "Time"] = data["Time"].apply(lambda x : x / 3600 % 24)
data[ data.head()
Time V1 V2 V3 ... V27 V28 Amount Class
0 0.000000 -1.359807 -0.072781 2.536347 ... 0.133558 -0.021053 149.62 0
1 0.000000 1.191857 0.266151 0.166480 ... -0.008983 0.014724 2.69 0
2 0.000278 -1.358354 -1.340163 1.773209 ... -0.055353 -0.059752 378.66 0
3 0.000278 -0.966272 -0.185226 1.792993 ... 0.062723 0.061458 123.50 0
4 0.000556 -1.158233 0.877737 1.548718 ... 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
vc = data['Class'].value_counts().to_frame().reset_index()
vc['percent'] = vc["Class"].apply(lambda x : round(100*float(x) / len(data), 2))
vc = vc.rename(columns = {"index" : "Target", "Class" : "Count"})
vc
Target Count percent
0 0 284315 99.83
1 1 492 0.17
non_fraud = data[data['Class'] == 0].sample(1000)
fraud = data[data['Class'] == 1]

df = non_fraud.append(fraud).sample(frac=1).reset_index(drop=True)
X = df.drop(['Class'], axis = 1).values
Y = df["Class"].values
tsne = TSNE(n_components=2, random_state=0)

TSNE_result = tsne.fit_transform(X)

sns.scatterplot(x=TSNE_result[:,0], y=TSNE_result[:,1], hue=Y, legend='full', palette="hls")
Autoencoders are a special type of neural network architecture in which the output is the same as the input. In other words, an autoencoder, once trained on appropriate training data, can generate a compressed copy of an input data point while preserving most of the information (features) from the input. For our case, the model will try to learn the best representation of non-fraud cases and then generate representations of the fraud cases.
As for how it works, autoencoders are designed with a bottleneck architecture: a hidden layer with only a few neurons compresses the knowledge about the representation of the original input, which the decoder then uses to reproduce it. The picture below shows this architecture and how the model learns from the input data.
- First, we will create a network with one input layer and one output layer. Both will have identical dimensions.
## input layer
input_layer = Input(shape=(X.shape[1],))

## encoding part
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='relu')(encoded)

## decoding part
decoded = Dense(50, activation='tanh')(encoded)
decoded = Dense(100, activation='tanh')(decoded)

## output layer
output_layer = Dense(X.shape[1], activation='relu')(decoded)

autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="adadelta", loss="mse")
= data.drop(["Class"], axis=1)
x = data["Class"].values
y
= preprocessing.MinMaxScaler().fit_transform(x.values)
x_scale = x_scale[y == 0], x_scale[y == 1] x_norm, x_fraud
autoencoder.fit(x_norm[0:2000], x_norm[0:2000],
                batch_size = 256, epochs = 10,
                shuffle = True, validation_split = 0.20)
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
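An equivalent way to extract the encoder (my addition) is the Keras functional API, reusing the encoded tensor from the architecture above; because the layers are the same objects, this model shares the trained weights:
encoder = Model(input_layer, encoded)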
norm_hid_rep = hidden_representation.predict(x_norm[:3000])
fraud_hid_rep = hidden_representation.predict(x_fraud)
new_x = np.append(norm_hid_rep, fraud_hid_rep, axis = 0)
new_y_not_fraud = np.zeros(norm_hid_rep.shape[0])
new_y_fraud = np.ones(fraud_hid_rep.shape[0])
new_y = np.append(new_y_not_fraud, new_y_fraud)
TSNE_result_new = tsne.fit_transform(new_x)
sns.scatterplot(x=TSNE_result_new[:,0], y=TSNE_result_new[:,1], hue=new_y, legend='full', palette="hls")
X_train, X_test, y_train, y_test = train_test_split(new_x, new_y, test_size=0.3)

clf_lr = LogisticRegression(max_iter = 450, random_state = 123)
clf_lr.fit(X_train, y_train)
LogisticRegression(max_iter=450, random_state=123)
pred_lr = clf_lr.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_lr)))
Accuracy: 0.9781
report_lr = classification_report(y_test, pred_lr)
print(report_lr)
precision recall f1-score support
0.0 0.97 1.00 0.99 894
1.0 1.00 0.85 0.92 154
accuracy 0.98 1048
macro avg 0.99 0.93 0.95 1048
weighted avg 0.98 0.98 0.98 1048
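Since plain accuracy can be flattering on data this imbalanced, it may also be worth reporting a threshold-independent metric such as ROC AUC (my addition, using scikit-learn's roc_auc_score):
from sklearn.metrics import roc_auc_score
print("ROC AUC: {:0.4f}".format(roc_auc_score(y_test, clf_lr.predict_proba(X_test)[:, 1])))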
conf_matrix = confusion_matrix(y_test, pred_lr)
sns.heatmap(conf_matrix.T, square=True, annot=True, fmt='d', cbar=False)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.title("Confusion matrix of Logistic Regression")
plt.show()
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wongvorachan (2022, April 28). Tarid Wongvorachan: Addressing Data Imbalance with Semi-Supervised Learning. Retrieved from https://taridwong.github.io/posts/2022-04-28-semisupervised/
BibTeX citation
@misc{wongvorachan2022addressing,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Addressing Data Imbalance with Semi-Supervised Learning},
  url = {https://taridwong.github.io/posts/2022-04-28-semisupervised/},
  year = {2022}
}