In this post, I will be using two unsupervised learning techniques with a data set, namely K-means clustering and Hierarchical clustering, to determine groups of customers from their age, income, and spending behavior data.
Many of my recent posts have covered supervised machine learning, which is defined by its use of labeled data sets to train algorithms that classify or predict outcomes for unseen data. In other words, the model is trained (or supervised) with human-provided labels to achieve the best possible predictive capability. Unsupervised machine learning, in contrast, is learning without labels. It is pure pattern discovery, unguided by a prediction task: the model learns from raw data without any labels or human guidance.
For example, suppose you have a group of customers with a variety of characteristics such as age, location, and financial history, and you wish to discover patterns and sort the customers into natural “clusters” without tampering with the data in any way. Or perhaps you have a set of texts, such as Wikipedia pages, and you wish to segment them into categories based on their content. These are examples of the unsupervised learning techniques called “clustering” and “dimension reduction”.
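To make the contrast with supervised learning concrete, here is a minimal sketch on made-up toy data (the two blobs, the variable names, and the choice of logistic regression as the supervised model are all assumptions for illustration, not part of the customer analysis). The supervised model is fitted on features and labels, whereas K-means is fitted on the features alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated blobs of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)  # labels exist only in the supervised case

# Supervised: the model sees both the features X and the labels y.
clf = LogisticRegression().fit(X, y)

# Unsupervised: the model sees X only and has to find the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(clf.predict(X[:3]))  # predicted labels
print(km.labels_[:3])      # cluster ids (the numbering is arbitrary)
```

Note that the cluster ids returned by K-means carry no meaning of their own; only the grouping matters.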
Unsupervised learning is so called because the pattern discovery is not guided by a prediction task; instead, it uncovers hidden structure in unlabeled data. In this post, I will apply two unsupervised learning techniques, K-means clustering and hierarchical clustering, to a customer data set to identify groups of customers based on their age, income, and spending behavior. As usual, we first import the necessary modules to set up the environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("customers.csv")
# data set shape
print("The data set has", df.shape[0], "cases and", df.shape[1], "variables")
# print head of data set
df.head(10)
The data set has 200 cases and 5 variables
CustomerID Gender Age Annual Income (k$) Spending Score (1-100)
0 1 1 19 15 39
1 2 1 21 15 81
2 3 0 20 16 6
3 4 0 23 16 77
4 5 0 31 17 40
5 6 0 22 17 76
6 7 0 35 18 6
7 8 0 23 18 94
8 9 1 64 19 3
9 10 0 30 19 72
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 200 non-null int64
1 Gender 200 non-null int64
2 Age 200 non-null int64
3 Annual Income (k$) 200 non-null int64
4 Spending Score (1-100) 200 non-null int64
dtypes: int64(5)
memory usage: 7.9 KB
# Summary statistics
df.describe()
CustomerID Gender ... Annual Income (k$) Spending Score (1-100)
count 200.000000 200.000000 ... 200.000000 200.000000
mean 100.500000 0.440000 ... 60.560000 50.200000
std 57.879185 0.497633 ... 26.264721 25.823522
min 1.000000 0.000000 ... 15.000000 1.000000
25% 50.750000 0.000000 ... 41.500000 34.750000
50% 100.500000 0.000000 ... 61.500000 50.000000
75% 150.250000 1.000000 ... 78.000000 73.000000
max 200.000000 1.000000 ... 137.000000 99.000000
[8 rows x 5 columns]
# Check for missing data
df.isnull().sum()
CustomerID 0
Gender 0
Age 0
Annual Income (k$) 0
Spending Score (1-100) 0
dtype: int64
plt.figure(1 , figsize = (15 , 6))
n = 0
for x in ['Age' , 'Annual Income (k$)' , 'Spending Score (1-100)']:
    n += 1
    plt.subplot(1 , 3 , n)
    plt.subplots_adjust(hspace = 0.5 , wspace = 0.5)
    # distplot is deprecated in recent seaborn; sns.histplot(df[x], bins = 15, kde = True) is the usual replacement
    sns.distplot(df[x] , bins = 15)
    plt.title('Distplot of {}'.format(x))
plt.show()
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.figure(1, figsize = (16 ,8))
sns.clustermap(df)
sns.pairplot(df, vars = ['Spending Score (1-100)', 'Annual Income (k$)', 'Age'], hue = "Gender")
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. For this method, we define a target number k, which is the number of centroids (and therefore clusters) we want in the data set. A centroid is the imaginary or real location representing the center of a cluster. The algorithm allocates every data point to its nearest centroid while keeping the within-cluster sum of squared distances (the inertia) as small as possible. The “means” in K-means refers to averaging the data points in each cluster to find its centroid.
Instead of using equations, this short animation by Allison Horst explains k-means clustering in a very cute and comprehensive way.
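For readers who prefer code to pictures, the assign-then-average loop can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation used in this post: the function name kmeans_sketch, the toy data, and the plain random initialization (instead of k-means++) are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means: assign points, then re-average centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (simple random init,
    # not the k-means++ initialisation used elsewhere in this post).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: distance of every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Toy data (made up): three tight groups of points.
X_toy = np.array([[1, 1], [1, 2], [2, 1],
                  [8, 8], [8, 9], [9, 8],
                  [1, 9], [2, 9], [1, 8]], dtype=float)
labels, centroids = kmeans_sketch(X_toy, k=3)
print(labels)
print(centroids)
```

Because the starting centroids are random, a single run can settle in a poor local optimum; this is one reason the scikit-learn calls in this post use k-means++ initialization together with n_init = 10.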
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, s = 100)
plt.show()
X1 = df[['Age' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 15):
    # algorithm='full' was renamed to 'lloyd' in scikit-learn 1.1 and removed in 1.3
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
                        tol=0.0001, random_state= 111 , algorithm='full') )
    algorithm.fit(X1)
    inertia.append(algorithm.inertia_)
KMeans(algorithm='full', n_clusters=1, random_state=111)
KMeans(algorithm='full', n_clusters=2, random_state=111)
KMeans(algorithm='full', n_clusters=3, random_state=111)
KMeans(algorithm='full', n_clusters=4, random_state=111)
KMeans(algorithm='full', n_clusters=5, random_state=111)
KMeans(algorithm='full', n_clusters=6, random_state=111)
KMeans(algorithm='full', n_clusters=7, random_state=111)
KMeans(algorithm='full', random_state=111)
KMeans(algorithm='full', n_clusters=9, random_state=111)
KMeans(algorithm='full', n_clusters=10, random_state=111)
KMeans(algorithm='full', n_clusters=11, random_state=111)
KMeans(algorithm='full', n_clusters=12, random_state=111)
KMeans(algorithm='full', n_clusters=13, random_state=111)
KMeans(algorithm='full', n_clusters=14, random_state=111)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 15) , inertia , 'o')
plt.plot(np.arange(1 , 15) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
(Text(0.5, 0, 'Number of Clusters'), Text(0, 0.5, 'Inertia'))
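The quantity on the y-axis of this plot, inertia, is the sum of squared distances from each point to the centroid of its cluster. The sketch below recomputes it by hand and compares it with scikit-learn's inertia_ attribute. It runs on synthetic stand-in data, since the customer file is not bundled here; the helper inertia_by_hand and the blob locations are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption; not the customer data set).
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2))
                    for c in ([0, 0], [10, 0], [5, 8])])

def inertia_by_hand(X, model):
    """Sum of squared distances from each point to its assigned centroid."""
    assigned = model.cluster_centers_[model.labels_]
    return float(((X - assigned) ** 2).sum())

inertias, manual = [], []
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=111).fit(X_demo)
    inertias.append(model.inertia_)
    manual.append(inertia_by_hand(X_demo, model))

# Drop in inertia gained by each additional cluster; a sharp fall-off
# in these gains is the "elbow" one looks for in the plot.
gains = -np.diff(inertias)
print(np.round(inertias, 1))
print(np.round(manual, 1))
print(np.round(gains, 1))
```

Inertia generally keeps shrinking as k grows, so the aim is not to minimize it but to find the k after which adding clusters stops paying off.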
algorithm = (KMeans(n_clusters = 4 ,init='k-means++', n_init = 10 ,max_iter=300,
                    tol=0.0001, random_state= 111 , algorithm='full') )
algorithm.fit(X1)
KMeans(algorithm='full', n_clusters=4, random_state=111)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
(Text(0, 0.5, 'Spending Score (1-100)'), Text(0.5, 0, 'Age'))
plt.title("Four Clusters", loc='center')
plt.show()
#%% Applying KMeans for k=5
algorithm = (KMeans(n_clusters = 5, init='k-means++', n_init = 10, max_iter=300,
                    tol=0.0001, random_state= 111 , algorithm='elkan'))
algorithm.fit(X1)
KMeans(algorithm='elkan', n_clusters=5, random_state=111)
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect = 'auto', origin='lower')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y = centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Age')
(Text(0, 0.5, 'Spending Score (1-100)'), Text(0.5, 0, 'Age'))
plt.title("Five Clusters", loc='center')
plt.show()
The two diagrams above show the data grouped into 4 and 5 clusters. It is important that the clusters are stable. Even though the algorithm begins by randomly initializing the cluster centers, if k-means is the right choice for the data, different runs of the algorithm will produce similar clusters in terms of size and variable distribution. If the clusters change a lot between runs, k-means clustering may not be the right choice for the data. However, it is not possible to validate that the clusters obtained from the algorithm are accurate because the customers have no ground-truth labels; thus, it is necessary to examine how the clusters change between different runs of the algorithm and to check whether the number of clusters makes sense both theoretically and practically. We can also ask domain experts whether the customer clusters make practical sense.
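One way to put a number on this kind of stability is to refit K-means with several random seeds and measure how much the resulting partitions agree, for example with the adjusted Rand index (1.0 means identical groupings, values near 0 mean chance-level agreement). The sketch below is an assumption-laden illustration rather than part of the original analysis: the stability helper, the seeds, and the synthetic data are all made up, and X_demo can be swapped for X1 or X2 from this post.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (an assumption; swap in X1 or X2 from this post).
rng = np.random.default_rng(7)
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(40, 2))
                    for c in ([0, 0], [8, 0], [4, 7], [12, 7])])

def stability(X, k, seeds=(0, 1, 2, 3, 4)):
    """Mean pairwise adjusted Rand index between runs with different seeds."""
    # n_init=1 so that each run reflects a single random initialisation.
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    scores = [adjusted_rand_score(runs[i], runs[j])
              for i in range(len(runs)) for j in range(i + 1, len(runs))]
    return float(np.mean(scores))

for k in (2, 4, 6):
    print(k, round(stability(X_demo, k), 3))
```

The adjusted Rand index ignores how the clusters are numbered, which matters here because K-means assigns cluster ids arbitrarily from run to run.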
As one more exercise, we can cluster the customers based on annual income and spending score.
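X2 = df[['Annual Income (k$)' , 'Spending Score (1-100)']].iloc[: , :].values
inertia = []
for n in range(1 , 11):
    # algorithm='full' was renamed to 'lloyd' in scikit-learn 1.1 and removed in 1.3
    algorithm = (KMeans(n_clusters = n ,init='k-means++', n_init = 10 ,max_iter=300,
                        tol=0.0001, random_state= 111 , algorithm='full') )
    algorithm.fit(X2)
    inertia.append(algorithm.inertia_)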
KMeans(algorithm='full', n_clusters=1, random_state=111)
KMeans(algorithm='full', n_clusters=2, random_state=111)
KMeans(algorithm='full', n_clusters=3, random_state=111)
KMeans(algorithm='full', n_clusters=4, random_state=111)
KMeans(algorithm='full', n_clusters=5, random_state=111)
KMeans(algorithm='full', n_clusters=6, random_state=111)
KMeans(algorithm='full', n_clusters=7, random_state=111)
KMeans(algorithm='full', random_state=111)
KMeans(algorithm='full', n_clusters=9, random_state=111)
KMeans(algorithm='full', n_clusters=10, random_state=111)
plt.figure(1 , figsize = (15 ,6))
plt.plot(np.arange(1 , 11) , inertia , 'o')
plt.plot(np.arange(1 , 11) , inertia , '-' , alpha = 0.5)
plt.xlabel('Number of Clusters') , plt.ylabel('Inertia')
(Text(0.5, 0, 'Number of Clusters'), Text(0, 0.5, 'Inertia'))
algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300,
                    tol=0.0001, random_state= 111 , algorithm='elkan') )
algorithm.fit(X2)
KMeans(algorithm='elkan', n_clusters=5, random_state=111)
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_
h = 0.02
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z2 = Z2.reshape(xx.shape)
plt.imshow(Z2 , interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           aspect = 'auto', origin='lower')
plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = df , c = labels2 ,
            s = 100 )
plt.scatter(x = centroids2[: , 0] , y = centroids2[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)') , plt.xlabel('Annual Income (k$)')
(Text(0, 0.5, 'Spending Score (1-100)'), Text(0.5, 0, 'Annual Income (k$)'))
plt.title("Five Clusters", loc='center')
plt.show()
An alternative to k-means clustering is hierarchical clustering (also known as hierarchical cluster analysis), which groups similar objects into hierarchies (or levels) of clusters. The end product is a set of clusters, where each cluster is distinct on its own, and the objects within each cluster are broadly similar to each other. This method works well when data have a nested structure - meaning that one characteristic is related to another (e.g., spending habit of a certain age group).
Again, Allison Horst did a really good job explaining how hierarchical clustering works with her visuals below. Note that the visual is for the “single” method, but we will be using the “ward” method. However, they are similar in terms of how they build a hierarchical dendrogram. A dendrogram is a diagram representing a tree that, in this context, illustrates the arrangement of the clusters produced by the corresponding analyses.
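To see what a dendrogram is built from, the sketch below runs SciPy's linkage function on a handful of made-up one-dimensional points with both the “single” and the “ward” method, prints the merge table behind each tree, and then cuts each tree into three clusters with fcluster. The toy points and the choice of three clusters are assumptions for illustration only.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Toy 1-D data (made up): two tight pairs and one far-away point.
pts = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

for method in ("single", "ward"):
    Z = sch.linkage(pts, method=method)
    # Each row of Z is one merge: [cluster a, cluster b, distance, new size].
    print(method)
    print(np.round(Z, 2))
    # Cut the tree so that at most 3 clusters remain.
    print(sch.fcluster(Z, t=3, criterion="maxclust"))
```

On this toy example both methods merge the same pairs in the same order; they differ only in the merge distances they report, which is what sets the heights of the branches in the dendrogram.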
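Hierarchical clustering comes in two forms. Divisive clustering is a top-down approach that starts with one cluster containing the whole data set and recursively splits it into smaller clusters until every data point is a singleton cluster or a termination condition holds. Agglomerative clustering is the bottom-up counterpart: it starts with every data point as its own cluster and repeatedly merges the two closest clusters. The code in this post (sch.linkage and AgglomerativeClustering) uses the agglomerative form. Both forms are rigid: once a merge or split is done, it can never be undone.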
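The bottom-up (agglomerative) variant, which is what sch.linkage and AgglomerativeClustering compute, can be sketched naively as follows. This is a teaching sketch only: the function agglomerative_sketch, the toy points, and the use of average linkage are assumptions for the example, and the triple loop is far too slow for real data.

```python
import numpy as np

def agglomerative_sketch(X, n_clusters):
    """Naive bottom-up clustering with average linkage (demo only)."""
    # Start with every point in its own singleton cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Merging is final: the two clusters are replaced by their union
        # and are never split again in later steps.
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Toy data (made up): two close pairs and one isolated point.
X_toy = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]], dtype=float)
clusters, merges = agglomerative_sketch(X_toy, n_clusters=3)
print(clusters)  # → [[4], [0, 1], [2, 3]]
```

The list of merges is exactly the information a dendrogram draws: which clusters were joined, and at what distance.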
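plt.figure(1, figsize = (16 ,8))
# note: df still includes CustomerID and Gender, so they also enter the distances
dendrogram = sch.dendrogram(sch.linkage(df, method = "ward"))
# with the default orientation, the merge distances are on the y-axis
plt.ylabel('Euclidean distances')
plt.show()
# affinity was renamed to metric in scikit-learn 1.2 and removed in 1.4
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage ='average')
y_hc = hc.fit_predict(df)
y_hc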
array([3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4,
3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 2,
3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
0, 1], dtype=int64)
X = df.iloc[:, [3,4]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income(k$)')
plt.ylabel('Spending Score(1-100)')
plt.legend()
plt.show()
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wongvorachan (2022, April 18). Tarid Wongvorachan: Examining Customer Cluster with Unsupervised Machine Learning. Retrieved from
BibTeX citation
@misc{wongvorachan2022examining, author = {Wongvorachan, Tarid}, title = {Tarid Wongvorachan: Examining Customer Cluster with Unsupervised Machine Learning}, url = {}, year = {2022} }