Examining Customer Cluster with Unsupervised Machine Learning

Python Unsupervised Machine Learning

In this post, I apply two unsupervised learning techniques, K-means clustering and hierarchical clustering, to a customer data set in order to identify groups of customers from their age, income, and spending behavior.

Tarid Wongvorachan (University of Alberta), https://www.ualberta.ca
2022-04-18

Introduction

Show code
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn.cluster import KMeans
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering 

import warnings
warnings.filterwarnings('ignore')

About the Data Set

Show code
df = pd.read_csv("customers.csv")

# data set shape
print("The data set has", df.shape[0], "cases and", df.shape[1], "variables")
The data set has 200 cases and 5 variables
Show code
# print head of data set
print(df.head(10))
   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1       1   19                  15                      39
1           2       1   21                  15                      81
2           3       0   20                  16                       6
3           4       0   23                  16                      77
4           5       0   31                  17                      40
5           6       0   22                  17                      76
6           7       0   35                  18                       6
7           8       0   23                  18                      94
8           9       1   64                  19                       3
9          10       0   30                  19                      72
Show code
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    int64
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(5)
memory usage: 7.9 KB
Show code
df.describe()
       CustomerID      Gender  ...  Annual Income (k$)  Spending Score (1-100)
count  200.000000  200.000000  ...          200.000000              200.000000
mean   100.500000    0.440000  ...           60.560000               50.200000
std     57.879185    0.497633  ...           26.264721               25.823522
min      1.000000    0.000000  ...           15.000000                1.000000
25%     50.750000    0.000000  ...           41.500000               34.750000
50%    100.500000    0.000000  ...           61.500000               50.000000
75%    150.250000    1.000000  ...           78.000000               73.000000
max    200.000000    1.000000  ...          137.000000               99.000000

[8 rows x 5 columns]
Show code
# Check for missing data
df.isnull().sum()
CustomerID                0
Gender                    0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
Show code
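fig, axes = plt.subplots(1, 3, figsize=(15, 6))
# Histogram with a kernel density overlay for each numeric feature
# (sns.distplot is deprecated; sns.histplot is its replacement)
for ax, x in zip(axes, ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']):
    sns.histplot(df[x], bins=15, kde=True, stat='density', ax=ax)
    ax.set_title('Distribution of {}'.format(x))

plt.subplots_adjust(wspace=0.5)
plt.show()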

Show code
sns.heatmap(df.corr(), annot=True, cmap="YlGnBu")
plt.show()

Show code
# clustermap creates its own figure, so the size is passed to it directly
sns.clustermap(df, figsize=(16, 8))
plt.show()

Show code
sns.pairplot(df, vars = ['Spending Score (1-100)', 'Annual Income (k$)', 'Age'], hue = "Gender")

K-Means Clustering

The Process of K-means Clustering by Allison Horst. No copyright infringement intended
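To make the process in the illustration concrete, here is a minimal sketch of the assign-then-update loop behind K-means, written with NumPy only. The function name `lloyd_kmeans`, the random initialisation, and the toy data are assumptions for illustration; this is not the scikit-learn implementation used in the rest of this post, which also adds k-means++ initialisation and multiple restarts.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means loop (assignment step, then update step); illustrative only."""
    rng = np.random.default_rng(seed)
    # Initialise centroids by sampling k distinct observations
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated blobs in (age, spending score) space
rng = np.random.default_rng(42)
toy = np.vstack([
    rng.normal(loc=[25, 80], scale=2.0, size=(50, 2)),
    rng.normal(loc=[55, 20], scale=2.0, size=(50, 2)),
])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids.round(1))
```

With blobs this far apart, the loop should settle after a few iterations with one centroid near each blob, regardless of which two points were drawn as starting centroids.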
Show code
plt.figure(1 , figsize = (15 , 7))
plt.title('Scatter plot of Age v/s Spending Score', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, s = 100)
plt.show()

Show code
X1 = df[['Age', 'Spending Score (1-100)']].values
inertia = []

# Fit K-means for k = 1..14 and record the inertia of each solution
# (algorithm='full' was renamed 'lloyd' in scikit-learn 1.1)
for n in range(1, 15):
    algorithm = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300,
                       tol=0.0001, random_state=111, algorithm='lloyd')
    algorithm.fit(X1)
    inertia.append(algorithm.inertia_)
Show code
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 15), inertia, 'o')
plt.plot(np.arange(1, 15), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()
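The inertia on the vertical axis of the elbow plot is the within-cluster sum of squared distances between each observation and its assigned centroid. The sketch below computes it by hand on a tiny made-up example; the helper `within_cluster_ss` and the toy arrays are hypothetical names introduced only for illustration.

```python
import numpy as np

def within_cluster_ss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its own centroid."""
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Hypothetical toy example: four points that form two obvious groups
X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])

# One cluster: every point is measured against the overall mean (5.5, 1.0)
one_centroid = X_toy.mean(axis=0, keepdims=True)
ss_k1 = within_cluster_ss(X_toy, np.zeros(4, dtype=int), one_centroid)

# Two clusters: each pair of points is measured against its own mean
labels_toy = np.array([0, 0, 1, 1])
centroids_toy = np.array([[1.0, 0.0], [10.0, 2.0]])
ss_k2 = within_cluster_ss(X_toy, labels_toy, centroids_toy)

print(ss_k1, ss_k2)  # 95.0 10.0
```

The sharp drop from one to two clusters in this toy case is the kind of bend the elbow plot is meant to reveal; once extra clusters only split already tight groups, the curve flattens.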

Show code
# algorithm='full' was renamed 'lloyd' in scikit-learn 1.1
algorithm = KMeans(n_clusters=4, init='k-means++', n_init=10, max_iter=300,
                   tol=0.0001, random_state=111, algorithm='lloyd')
algorithm.fit(X1)
Show code
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_

h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()]) 

plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y =  centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Age')
plt.title("Four Clusters", loc='center')
plt.show()  

Show code
# Applying KMeans for k=5

algorithm = (KMeans(n_clusters = 5, init='k-means++', n_init = 10, max_iter=300, 
                        tol=0.0001, random_state= 111 , algorithm='elkan'))
algorithm.fit(X1)
KMeans(algorithm='elkan', n_clusters=5, random_state=111)
Show code
labels1 = algorithm.labels_
centroids1 = algorithm.cluster_centers_

h = 0.02
x_min, x_max = X1[:, 0].min() - 1, X1[:, 0].max() + 1
y_min, y_max = X1[:, 1].min() - 1, X1[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()]) 

plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z = Z.reshape(xx.shape)
plt.imshow(Z , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Age', y = 'Spending Score (1-100)', data = df, c = labels1, s = 100)
plt.scatter(x = centroids1[: , 0] , y =  centroids1[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Age')
plt.title("Five Clusters", loc='center')
plt.show()

Show code
X2 = df[['Annual Income (k$)', 'Spending Score (1-100)']].values
inertia = []

# Fit K-means for k = 1..10 and record the inertia of each solution
# (algorithm='full' was renamed 'lloyd' in scikit-learn 1.1)
for n in range(1, 11):
    algorithm = KMeans(n_clusters=n, init='k-means++', n_init=10, max_iter=300,
                       tol=0.0001, random_state=111, algorithm='lloyd')
    algorithm.fit(X2)
    inertia.append(algorithm.inertia_)
Show code
plt.figure(1, figsize=(15, 6))
plt.plot(np.arange(1, 11), inertia, 'o')
plt.plot(np.arange(1, 11), inertia, '-', alpha=0.5)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')
plt.show()

Show code
algorithm = (KMeans(n_clusters = 5 ,init='k-means++', n_init = 10 ,max_iter=300, 
                        tol=0.0001,  random_state= 111  , algorithm='elkan') )
algorithm.fit(X2)
KMeans(algorithm='elkan', n_clusters=5, random_state=111)
Show code
labels2 = algorithm.labels_
centroids2 = algorithm.cluster_centers_

# Build a fine grid over the feature space and predict a cluster for every grid point
h = 0.02
x_min, x_max = X2[:, 0].min() - 1, X2[:, 0].max() + 1
y_min, y_max = X2[:, 1].min() - 1, X2[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z2 = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])

plt.figure(1 , figsize = (15 , 7) )
plt.clf()
Z2 = Z2.reshape(xx.shape)
plt.imshow(Z2 , interpolation='nearest', 
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap = plt.cm.Pastel2, aspect = 'auto', origin='lower')

plt.scatter( x = 'Annual Income (k$)' ,y = 'Spending Score (1-100)' , data = df , c = labels2 , 
            s = 100 )
plt.scatter(x = centroids2[: , 0] , y =  centroids2[: , 1] , s = 300 , c = 'red' , alpha = 0.5)
plt.ylabel('Spending Score (1-100)')
plt.xlabel('Annual Income (k$)')
plt.title("Five Clusters", loc='center')
plt.show()

Hierarchical Clustering

The Process of Hierarchical Clustering by Allison Horst. No copyright infringement intended
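The bottom-up process in the illustration can be sketched in a few lines of NumPy: start with every observation as its own cluster, then repeatedly merge the two closest clusters until the desired number remains. The function `agglomerative_average` and the toy data below are hypothetical and use average linkage; this naive version is far slower than the SciPy and scikit-learn routines used in this post and is meant only to show the idea.

```python
import numpy as np

def agglomerative_average(X, n_clusters):
    """Naive bottom-up clustering with average linkage; illustrative only."""
    # Pairwise Euclidean distances between all observations
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with every observation in its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        # Find the pair of clusters with the smallest mean pairwise distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair and drop the absorbed cluster
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Convert the list of member indices into one label per observation
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels

# Hypothetical toy data: three tight groups along a line
toy = np.array([[0.0], [0.5], [1.0], [10.0], [10.5], [20.0], [20.5]])
toy_labels = agglomerative_average(toy, n_clusters=3)
print(toy_labels)  # [0 0 0 1 1 2 2]
```

The sequence of merges, together with the distance at which each merge happens, is exactly what a dendrogram draws.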
Show code
plt.figure(1, figsize = (16 ,8))
dendrogram = sch.dendrogram(sch.linkage(df, method  = "ward"))

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distances')
plt.show()

Show code
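# Euclidean distance is the default, so the deprecated `affinity` argument is omitted.
# Note: this model uses average linkage, whereas the dendrogram uses Ward linkage.
hc = AgglomerativeClustering(n_clusters=5, linkage='average')

# Fit on all five columns of df, including CustomerID and Gender
y_hc = hc.fit_predict(df)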
y_hc
array([3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4,
       3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 2,
       3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       0, 1], dtype=int64)
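A long label array like this is easier to read once it is summarised as cluster sizes. The snippet below is a small sketch of that step; `example_labels` is a short hypothetical stand-in for `y_hc`, so the counts it prints are not the ones for the customer data.

```python
import numpy as np

# Hypothetical label vector standing in for y_hc
example_labels = np.array([3, 4, 3, 4, 2, 2, 2, 1, 0, 1, 0])

# Count how many observations fall into each cluster
cluster_ids, sizes = np.unique(example_labels, return_counts=True)
for cid, size in zip(cluster_ids, sizes):
    print(f"cluster {cid}: {size} observations")
```

Running the same two lines on `y_hc` would give the size of each of the five hierarchical clusters.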
Show code
X = df.iloc[:, [3,4]].values
plt.scatter(X[y_hc==0, 0], X[y_hc==0, 1], s=100, c='red', label ='Cluster 1')
plt.scatter(X[y_hc==1, 0], X[y_hc==1, 1], s=100, c='blue', label ='Cluster 2')
plt.scatter(X[y_hc==2, 0], X[y_hc==2, 1], s=100, c='green', label ='Cluster 3')
plt.scatter(X[y_hc==3, 0], X[y_hc==3, 1], s=100, c='purple', label ='Cluster 4')
plt.scatter(X[y_hc==4, 0], X[y_hc==4, 1], s=100, c='orange', label ='Cluster 5')
plt.title('Clusters of Customers (Hierarchical Clustering Model)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
# Show the legend so the cluster labels defined above are visible
plt.legend()
plt.show()

Conclusion

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2022, April 18). Tarid Wongvorachan: Examining Customer Cluster with Unsupervised Machine Learning. Retrieved from https://taridwong.github.io/posts/2022-04-18-unsupervisedml/

BibTeX citation

@misc{wongvorachan2022examining,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Examining Customer Cluster with Unsupervised Machine Learning},
  url = {https://taridwong.github.io/posts/2022-04-18-unsupervisedml/},
  year = {2022}
}