For this entry, I am trying my hand at audio data: extracting its features for exploratory data analysis (EDA), using machine learning algorithms to perform music classification, and finally building on those results to develop a recommendation system for music with similar characteristics.
(13 min read)
When we think of data, we may think of numbers and text in tables. Some may even think of using images as data, but we can also convert and extract features from audio data (i.e., music) to understand and make use of it as well! Here, we will visualize music sound waves from .wav files to understand what differentiates one tone from another (we can actually see sound waves!).
I primarily relied on Olteanu et al. (2019)'s Music genre classification article and an Analytics Vidhya guide on the same topic for this reproduction and experimentation with the data.
To introduce the data set a bit: I will be using the GTZAN dataset, a public data set used for evaluation in machine listening research on music genre recognition (MGR). The files were collected in 2000-2001 from a variety of sources, including personal CDs, radio, and microphone recordings, to represent a range of recording conditions.
We will start by importing audio data into our Python environment for data visualization; then, we will explore its features, such as the sound wave, spectrogram, mel spectrogram, harmonic and percussive components, tempo, spectral centroid, and chroma frequencies. We will then conduct an exploratory data analysis with a correlation heatmap of the extracted features, generate a box plot of the genre distributions, and perform a principal component analysis to divide genres into groups.
Lastly, we will perform machine learning classification, training algorithms to recognize and predict the genre of new audio files (e.g., rock, pop, jazz), and develop a music recommendation system using the cosine similarity statistic. This kind of functionality is part of music delivery platforms such as Spotify, YouTube Music, and Apple Music.
We will begin by importing the necessary libraries for graphing (seaborn and matplotlib), data manipulation (pandas), machine learning (sklearn), and audio work (librosa).
# Usual Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
import librosa
import librosa.display
These include librosa, the main library for audio work in Python. Let us first explore our audio data to see how it looks (we will work with the pop.00002.wav file). We will check the sound, i.e., the sequence of vibrations in varying pressure strengths (y), and the sample rate (sr), the number of samples of audio carried per second, measured in Hz or kHz.

# Importing 1 file
y, sr = librosa.load('D:/Program/Private_project/DistillSite/_posts/2021-12-11-applying-machine-learning-to-audio-data/genres_original/pop/pop.00002.wav')
print('y:', y, '\n')
y: [-0.09274292 -0.11630249 -0.11886597 ... 0.14419556 0.16311646
0.09634399]
print('y shape:', np.shape(y), '\n')
y shape: (661504,)
print('Sample Rate (KHz):', sr, '\n')
Sample Rate (KHz): 22050

# Verify length of the audio
print('Check Length of the audio in seconds:', 661794/22050)
Check Length of the audio in seconds: 30.013333333333332
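As a cross-check, librosa can also compute the duration directly from the loaded signal; here is a minimal sketch (note that y itself holds 661504 samples, slightly fewer than the 661794 listed in the dataset metadata used above):

# Duration computed from the signal itself rather than a hard-coded sample count
print('Duration (s):', librosa.get_duration(y=y, sr=sr))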
# Trim leading and trailing silence from an audio signal (silence before and after the actual audio)
audio_file, _ = librosa.effects.trim(y)

# the result is a numpy ndarray
print('Audio File:', audio_file, '\n')
Audio File: [-0.09274292 -0.11630249 -0.11886597 ... 0.14419556 0.16311646
0.09634399]
print('Audio File shape:', np.shape(audio_file))
Audio File shape: (661504,)
plt.figure(figsize = (16, 6))
librosa.display.waveplot(y = audio_file, sr = sr, color = "#A300F9");
plt.title("Sound Waves in Pop 02", fontsize = 23);
plt.show()
# Default FFT window size
n_fft = 2048 # FFT window size
hop_length = 512 # number of audio frames between STFT columns (looks like a good default)
# Short-time Fourier transform (STFT)
D = np.abs(librosa.stft(audio_file, n_fft = n_fft, hop_length = hop_length))
print('Shape of D object:', np.shape(D))
Shape of D object: (1025, 1293)
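Those dimensions follow directly from the parameters above: an n_fft of 2048 yields n_fft/2 + 1 = 1025 frequency bins, and with librosa's centered framing we get one column per hop_length samples plus one. A quick sanity check:

# Expected STFT shape from n_fft and hop_length
n_bins = n_fft // 2 + 1                       # 2048 // 2 + 1 = 1025
n_frames = len(audio_file) // hop_length + 1  # 661504 // 512 + 1 = 1293
print(n_bins, n_frames)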
plt.figure(figsize = (16, 6))
plt.plot(D);
plt.show()
# Convert an amplitude spectrogram to a decibel-scaled spectrogram
DB = librosa.amplitude_to_db(D, ref = np.max)

# Creating the Spectrogram
plt.figure(figsize = (16, 6))
librosa.display.specshow(DB, sr = sr, hop_length = hop_length, x_axis = 'time', y_axis = 'log', cmap = 'cool')
plt.colorbar();
plt.show()
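For intuition, amplitude_to_db is essentially 20·log10(D/ref), floored at top_db (80 dB by default) below the peak. A rough sketch of the conversion, approximating librosa's implementation (which also applies a small amplitude floor of 1e-5):

# Approximate decibel conversion relative to the peak amplitude
DB_manual = 20 * np.log10(np.maximum(D, 1e-5) / np.max(D))
DB_manual = np.maximum(DB_manual, DB_manual.max() - 80.0)  # limit the dynamic range
print('Max abs difference vs librosa:', np.abs(DB - DB_manual).max())  # should be near zero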
Next, we create mel spectrograms, which display the spectrum on the perceptually motivated mel scale, for two contrasting genres: metal and classical.

y, sr = librosa.load('D:/Program/Private_project/DistillSite/_posts/2021-12-11-applying-machine-learning-to-audio-data/genres_original/metal/metal.00036.wav')
y, _ = librosa.effects.trim(y)

S = librosa.feature.melspectrogram(y, sr=sr)
S_DB = librosa.amplitude_to_db(S, ref=np.max)
plt.figure(figsize = (16, 6))
librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length, x_axis = 'time', y_axis = 'log', cmap = 'cool');
plt.colorbar();
plt.title("Metal Mel Spectrogram", fontsize = 23);
plt.show()
y, sr = librosa.load('D:/Program/Private_project/DistillSite/_posts/2021-12-11-applying-machine-learning-to-audio-data/genres_original/classical/classical.00036.wav')
y, _ = librosa.effects.trim(y)

S = librosa.feature.melspectrogram(y, sr=sr)
S_DB = librosa.amplitude_to_db(S, ref=np.max)
plt.figure(figsize = (16, 6))
librosa.display.specshow(S_DB, sr=sr, hop_length=hop_length, x_axis = 'time', y_axis = 'log', cmap = 'cool');
plt.colorbar();
plt.title("Classical Mel Spectrogram", fontsize = 23);
plt.show()
# Total zero_crossings in our 1 song
zero_crossings = librosa.zero_crossings(audio_file, pad=False)
print(sum(zero_crossings))
78769
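The raw count depends on the clip length, so the per-frame zero-crossing rate is the more comparable feature across songs (it is what the feature CSVs used later store as zero_crossing_rate_mean). A minimal sketch:

# Frame-wise zero-crossing rate: the fraction of sign changes in each frame
zcr = librosa.feature.zero_crossing_rate(audio_file)
print('Mean ZCR:', zcr.mean())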
Harmonics (the purple wave below) are audio characteristics that human ears cannot distinguish individually; they represent the sound color. Percussives (the orange wave below) are sound waves that represent the rhythm and emotion of the music.
y_harm, y_perc = librosa.effects.hpss(audio_file)

plt.figure(figsize = (16, 6))
plt.plot(y_harm, color = '#A300F9');
plt.plot(y_perc, color = '#FFB100');
plt.show()
Next, we estimate the tempo of the track in beats per minute (BPM).

tempo, _ = librosa.beat.beat_track(y, sr = sr)
tempo
107.666015625
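beat_track derives this estimate from an onset-strength envelope; the same figure can be obtained from librosa's tempo estimator directly. A sketch, assuming the same librosa version used throughout this post:

# Tempo estimated straight from the onset-strength envelope
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
print(librosa.beat.tempo(onset_envelope=onset_env, sr=sr))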
# Calculate the Spectral Centroids
spectral_centroids = librosa.feature.spectral_centroid(audio_file, sr=sr)[0]
# Shape is a vector
print('Centroids:', spectral_centroids, '\n')
Centroids: [3042.39242043 3057.96296504 3043.45666379 ... 3476.4010229 3908.31319501
3834.930348 ]
print('Shape of Spectral Centroids:', spectral_centroids.shape, '\n')
Shape of Spectral Centroids: (1293,)
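Each centroid value is the magnitude-weighted mean frequency of one STFT frame, i.e., sum(f * S[f, t]) / sum(S[f, t]). A hand-rolled sketch using the STFT parameters from earlier (S_mag and centroid_manual are hypothetical names for illustration):

# Manual spectral centroid: weighted average frequency per frame
S_mag = np.abs(librosa.stft(audio_file, n_fft=n_fft, hop_length=hop_length))
freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
centroid_manual = (freqs[:, None] * S_mag).sum(axis=0) / S_mag.sum(axis=0)
print(centroid_manual[:3])  # should be close to the librosa values above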
# Computing the time variable for visualization
frames = range(len(spectral_centroids))
# Converts frame counts to time (seconds)
t = librosa.frames_to_time(frames)
print('frames:', frames, '\n')
frames: range(0, 1293)
print('t:', t)
t: [0.00000000e+00 2.32199546e-02 4.64399093e-02 ... 2.99537415e+01
 2.99769615e+01 3.00001814e+01]

# Function that normalizes the Sound Data
def normalize(x, axis=0):
    return sklearn.preprocessing.minmax_scale(x, axis=axis)
#Plotting the Spectral Centroid along the waveform
plt.figure(figsize = (16, 6))
librosa.display.waveplot(audio_file, sr=sr, alpha=0.4, color = '#A300F9');
plt.plot(t, normalize(spectral_centroids), color='#FFB100');
plt.show()
# Spectral Rolloff Vector: the frequency below which a specified fraction (85% by default) of the total spectral energy lies
spectral_rolloff = librosa.feature.spectral_rolloff(audio_file, sr=sr)[0]
# The plot
plt.figure(figsize = (16, 6))
librosa.display.waveplot(audio_file, sr=sr, alpha=0.4, color = '#A300F9');
plt.plot(t, normalize(spectral_rolloff), color='#FFB100');
plt.show()
mfccs = librosa.feature.mfcc(audio_file, sr=sr)
print('mfccs shape:', mfccs.shape)
mfccs shape: (20, 1293)

# Displaying the MFCCs:
plt.figure(figsize = (16, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time', cmap = 'cool');
plt.show()
# Perform Feature Scaling
mfccs = sklearn.preprocessing.scale(mfccs, axis=1)
C:\Users\tarid\AppData\Roaming\Python\Python38\site-packages\sklearn\preprocessing\_data.py:174: UserWarning: Numerical issues were encountered when centering the data and might not be solved. Dataset may contain too large values. You may need to prescale your features.
warnings.warn("Numerical issues were encountered "
C:\Users\tarid\AppData\Roaming\Python\Python38\site-packages\sklearn\preprocessing\_data.py:191: UserWarning: Numerical issues were encountered when scaling the data and might not be solved. The standard deviation of the data is probably very close to 0.
warnings.warn("Numerical issues were encountered "
print('Mean:', mfccs.mean(), '\n')
Mean: 3.097782e-09
print('Var:', mfccs.var())
Var: 1.0
plt.figure(figsize = (16, 6))
librosa.display.specshow(mfccs, sr=sr, x_axis='time', cmap = 'cool');
plt.show()
# Increase or decrease hop_length to change how granular you want your data to be
hop_length = 5000

# Chromagram
chromagram = librosa.feature.chroma_stft(audio_file, sr=sr, hop_length=hop_length)
print('Chromagram shape:', chromagram.shape)
Chromagram shape: (12, 133)
plt.figure(figsize=(16, 6))
librosa.display.specshow(chromagram, x_axis='time', y_axis='chroma', hop_length=hop_length, cmap='coolwarm');
plt.show()
Next, we load the features_30_sec.csv data, which contains the mean and variance of the features discussed above for every audio file in the data bank. We have 10 genres of music, and each genre has 100 audio files, for a total of 1,000 songs. There are 60 columns in total for each song.

data = pd.read_csv('features_30_sec.csv')
data.head()
filename length chroma_stft_mean ... mfcc20_mean mfcc20_var label
0 blues.00000.wav 661794 0.350088 ... 1.221291 46.936035 blues
1 blues.00001.wav 661794 0.340914 ... 0.531217 45.786282 blues
2 blues.00002.wav 661794 0.363637 ... -2.231258 30.573025 blues
3 blues.00003.wav 661794 0.404785 ... -3.407448 31.949339 blues
4 blues.00004.wav 661794 0.308526 ... -11.703234 55.195160 blues
[5 rows x 60 columns]
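As a quick sanity check on the class balance described above (each genre should contribute exactly 100 tracks):

# Confirm 10 genres x 100 audio files each
print(data['label'].value_counts())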
# Computing the Correlation Matrix
spike_cols = [col for col in data.columns if 'mean' in col]
corr = data[spike_cols].corr()
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=np.bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(16, 11));
# Generate a custom diverging colormap
cmap = sns.diverging_palette(0, 25, as_cmap=True, s = 90, l = 45, n = 5)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title('Correlation Heatmap (for the MEAN variables)', fontsize = 25)
plt.xticks(fontsize = 10)
plt.yticks(fontsize = 10);
plt.show()
= data[["label", "tempo"]]
x
= plt.subplots(figsize=(16, 9));
f, ax = "label", y = "tempo", data = x, palette = 'husl');
sns.boxplot(x
'BPM Boxplot for Genres', fontsize = 25) plt.title(
Text(0.5, 1.0, 'BPM Boxplot for Genres')
= 14) plt.xticks(fontsize
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), [Text(0, 0, 'blues'), Text(1, 0, 'classical'), Text(2, 0, 'country'), Text(3, 0, 'disco'), Text(4, 0, 'hiphop'), Text(5, 0, 'jazz'), Text(6, 0, 'metal'), Text(7, 0, 'pop'), Text(8, 0, 'reggae'), Text(9, 0, 'rock')])
= 10);
plt.yticks(fontsize "Genre", fontsize = 15) plt.xlabel(
Text(0.5, 0, 'Genre')
"BPM", fontsize = 15) plt.ylabel(
Text(0, 0.5, 'BPM')
plt.show()
from sklearn import preprocessing
data = data.iloc[0:, 1:]
y = data['label']
X = data.loc[:, data.columns != 'label']

#### NORMALIZE X ####
cols = X.columns
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)
X = pd.DataFrame(np_scaled, columns = cols)

#### PCA 2 COMPONENTS ####
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(X)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])

# concatenate with target label
finalDf = pd.concat([principalDf, y], axis = 1)
pca.explained_variance_ratio_
array([0.2439355 , 0.21781804])
# about 46.2% of the variance is explained by the first two components
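Since two components capture under half the variance, the 2-D scatter below is only a rough summary. As an optional aside (not part of the original analysis), fitting PCA without a component limit shows how the cumulative share grows:

# Cumulative explained variance for the first ten components (exploratory)
pca_full = PCA().fit(X)
print(np.cumsum(pca_full.explained_variance_ratio_)[:10])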
plt.figure(figsize = (16, 9))
sns.scatterplot(x = "principal component 1", y = "principal component 2", data = finalDf, hue = "label", alpha = 0.7, s = 100);

plt.title('PCA on Genres', fontsize = 25)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 10);
plt.xlabel("Principal Component 1", fontsize = 15)
plt.ylabel("Principal Component 2", fontsize = 15)
plt.show()
Using the features_3_sec.csv file, we can build a machine learning classification model that predicts the genre of a new audio file. We will load a variety of machine learning models to see which one performs best.

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier, XGBRFClassifier
from xgboost import plot_tree, plot_importance
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
data = pd.read_csv('features_3_sec.csv')
data = data.iloc[0:, 1:]
data.head()
length chroma_stft_mean chroma_stft_var ... mfcc20_mean mfcc20_var label
0 66149 0.335406 0.091048 ... -0.243027 43.771767 blues
1 66149 0.343065 0.086147 ... 5.784063 59.943081 blues
2 66149 0.346815 0.092243 ... 2.517375 33.105122 blues
3 66149 0.363639 0.086856 ... 3.630866 32.023678 blues
4 66149 0.335579 0.088129 ... 0.536961 29.146694 blues
[5 rows x 59 columns]
y = data['label'] # genre variable
X = data.loc[:, data.columns != 'label'] # select all columns but the labels

#### NORMALIZE X ####
# Normalize so everything is on the same scale.
cols = X.columns
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(X)

# new data frame with the new scaled data
X = pd.DataFrame(np_scaled, columns = cols)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Creating a predefined function to assess the accuracy of a model
def model_assess(model, title = "Default"):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    #print(confusion_matrix(y_test, preds))
    print('Accuracy', title, ':', round(accuracy_score(y_test, preds), 5), '\n')
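A single 70/30 split can be noisy; as a hedged variant (not part of the original workflow), the same comparison could instead average accuracy over k folds with sklearn's cross_val_score:

from sklearn.model_selection import cross_val_score

def model_assess_cv(model, title = "Default", folds = 5):
    # Mean accuracy over k folds rather than one train/test split
    scores = cross_val_score(model, X, y, cv = folds, scoring = 'accuracy')
    print('CV Accuracy', title, ':', round(scores.mean(), 5))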
# Naive Bayes
nb = GaussianNB()
model_assess(nb, "Naive Bayes")
Accuracy Naive Bayes : 0.51952

# Stochastic Gradient Descent
sgd = SGDClassifier(max_iter=5000, random_state=0)
model_assess(sgd, "Stochastic Gradient Descent")
Accuracy Stochastic Gradient Descent : 0.65532

# KNN
knn = KNeighborsClassifier(n_neighbors=19)
model_assess(knn, "KNN")
Accuracy KNN : 0.80581

# Decision trees
tree = DecisionTreeClassifier()
model_assess(tree, "Decision trees")
Accuracy Decision trees : 0.6383

# Random Forest
rforest = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)
model_assess(rforest, "Random Forest")
Accuracy Random Forest : 0.81415

# Support Vector Machine
svm = SVC(decision_function_shape="ovo")
model_assess(svm, "Support Vector Machine")
Accuracy Support Vector Machine : 0.75409

# Logistic Regression
lg = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
model_assess(lg, "Logistic Regression")
C:\Users\tarid\AppData\Roaming\Python\Python38\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
Accuracy Logistic Regression : 0.6977

# Neural Nets
nn = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5000, 10), random_state=1)
model_assess(nn, "Neural Nets")
C:\Users\tarid\AppData\Roaming\Python\Python38\site-packages\sklearn\neural_network\_multilayer_perceptron.py:471: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
Accuracy Neural Nets : 0.67401

# Cross Gradient Booster
xgb = XGBClassifier(n_estimators=1000, learning_rate=0.05, eval_metric='mlogloss')
model_assess(xgb, "Cross Gradient Booster")
C:\Users\tarid\ANACON~1\lib\site-packages\xgboost\sklearn.py:1224: UserWarning: The use of label encoder in XGBClassifier is deprecated and will be removed in a future release. To remove this warning, do the following: 1) Pass option use_label_encoder=False when constructing XGBClassifier object; and 2) Encode your labels (y) as integers starting with 0, i.e. 0, 1, 2, ..., [num_class - 1].
warnings.warn(label_encoder_deprecation_msg, UserWarning)
Accuracy Cross Gradient Booster : 0.90224

# Cross Gradient Booster (Random Forest)
xgbrf = XGBRFClassifier(objective= 'multi:softmax', eval_metric='mlogloss')
model_assess(xgbrf, "Cross Gradient Booster (Random Forest)")
Accuracy Cross Gradient Booster (Random Forest) : 0.74575
Extreme Gradient Boosting (XGBoost) achieves the highest performance, with 90% accuracy. We will use this model to create the final prediction model and to compute its feature importance output along with its confusion matrix.
Note that I have also included the Multilayer Perceptron, a type of neural network, in the list of candidate models. While neural networks may be known for their complexity, that does not mean the model is a silver bullet for every machine learning task. This idea derives from the No Free Lunch Theorem, which implies that there is no single best algorithm for every problem.
# Final model
xgb = XGBClassifier(n_estimators=1000, learning_rate=0.05, eval_metric='mlogloss')
xgb.fit(X_train, y_train)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
eval_metric='mlogloss', gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.05, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=1000, n_jobs=8,
num_parallel_tree=1, objective='multi:softprob', predictor='auto',
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
subsample=1, tree_method='exact', validate_parameters=1,
verbosity=None)
preds = xgb.predict(X_test)
print('Accuracy', ':', round(accuracy_score(y_test, preds), 5), '\n')
Accuracy : 0.90224

# Confusion Matrix
confusion_matr = confusion_matrix(y_test, preds) #normalize = 'true'
plt.figure(figsize = (16, 9))
="Blues", annot=True,
sns.heatmap(confusion_matr, cmap= ["blues", "classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"],
xticklabels =["blues", "classical", "country", "disco", "hiphop", "jazz", "metal", "pop", "reggae", "rock"]);
yticklabels plt.show()
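The commented-out normalize = 'true' option hints at a useful variant: dividing each row by its total turns the counts into per-class recall. A minimal sketch:

# Per-class recall from the row-normalized confusion matrix
confusion_norm = confusion_matr / confusion_matr.sum(axis=1, keepdims=True)
print(np.round(confusion_norm.diagonal(), 3))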
According to the permutation importance results below, perceptr_var is one of the two most important variables in genre classification.

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(estimator=xgb, random_state=1)
perm.fit(X_test, y_test)
PermutationImportance(estimator=XGBClassifier(base_score=0.5, booster='gbtree',
colsample_bylevel=1,
colsample_bynode=1,
colsample_bytree=1,
enable_categorical=False,
eval_metric='mlogloss', gamma=0,
gpu_id=-1, importance_type=None,
interaction_constraints='',
learning_rate=0.05,
max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan,
monotone_constraints='()',
n_estimators=1000, n_jobs=8,
num_parallel_tree=1,
objective='multi:softprob',
predictor='auto', random_state=0,
reg_alpha=0, reg_lambda=1,
scale_pos_weight=None,
subsample=1, tree_method='exact',
validate_parameters=1,
verbosity=None),
random_state=1)
eli5.show_weights(estimator=perm, feature_names = X_test.columns.tolist())
Lastly, we build the music recommendation system using the cosine_similarity statistic.

# Libraries
import IPython.display as ipd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing
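For intuition, the cosine similarity between two feature vectors a and b is their dot product divided by the product of their norms, i.e., the cosine of the angle between them. A minimal sketch with two hypothetical vectors:

# Cosine similarity by hand (hypothetical 3-feature vectors)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.5])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # near 1: nearly the same direction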
# Read data
data = pd.read_csv('features_30_sec.csv', index_col='filename')

# Extract labels
labels = data[['label']]

# Drop labels from original dataframe
data = data.drop(columns=['length','label'])
data.head()
chroma_stft_mean chroma_stft_var ... mfcc20_mean mfcc20_var
filename ...
blues.00000.wav 0.350088 0.088757 ... 1.221291 46.936035
blues.00001.wav 0.340914 0.094980 ... 0.531217 45.786282
blues.00002.wav 0.363637 0.085275 ... -2.231258 30.573025
blues.00003.wav 0.404785 0.093999 ... -3.407448 31.949339
blues.00004.wav 0.308526 0.087841 ... -11.703234 55.195160
[5 rows x 57 columns]

# Scale the data
data_scaled = preprocessing.scale(data)
print('Scaled data type:', type(data_scaled))
Scaled data type: <class 'numpy.ndarray'>
# Cosine similarity
similarity = cosine_similarity(data_scaled)
print("Similarity shape:", similarity.shape)
Similarity shape: (1000, 1000)

# Convert into a dataframe and then set the row index and column names as labels
sim_df_labels = pd.DataFrame(similarity)
sim_df_names = sim_df_labels.set_index(labels.index)
sim_df_names.columns = labels.index
sim_df_names.head()
filename blues.00000.wav ... rock.00099.wav
filename ...
blues.00000.wav 1.000000 ... 0.304098
blues.00001.wav 0.049231 ... 0.311723
blues.00002.wav 0.589618 ... 0.321069
blues.00003.wav 0.284862 ... 0.183210
blues.00004.wav 0.025561 ... 0.061785
[5 rows x 1000 columns]
Next, we define a function, find_similar_songs(), that takes the name of a song and returns the top 5 best matches for it.

def find_similar_songs(name):
    # Find songs most similar to another song
    series = sim_df_names[name].sort_values(ascending = False)

    # Remove cosine similarity == 1 (songs will always have the best match with themselves)
    series = series.drop(name)

    # Display the top 5 matches
    print("\n*******\nSimilar songs to ", name)
    print(series.head(5))

find_similar_songs('pop.00023.wav')
*******
Similar songs to pop.00023.wav
filename
pop.00075.wav 0.875235
pop.00089.wav 0.874246
pop.00088.wav 0.872443
pop.00091.wav 0.871975
pop.00024.wav 0.869849
Name: pop.00023.wav, dtype: float64
find_similar_songs('pop.00078.wav')
*******
Similar songs to pop.00078.wav
filename
pop.00088.wav 0.914322
hiphop.00077.wav 0.876289
pop.00089.wav 0.871822
pop.00074.wav 0.855630
pop.00023.wav 0.854349
Name: pop.00078.wav, dtype: float64
find_similar_songs('rock.00018.wav')
*******
Similar songs to rock.00018.wav
filename
rock.00017.wav 0.921997
metal.00028.wav 0.913790
metal.00058.wav 0.912421
rock.00016.wav 0.912421
rock.00026.wav 0.910113
Name: rock.00018.wav, dtype: float64
find_similar_songs('metal.00002.wav')
*******
Similar songs to metal.00002.wav
filename
metal.00028.wav 0.904367
metal.00059.wav 0.896096
rock.00018.wav 0.891910
rock.00017.wav 0.886526
rock.00016.wav 0.867508
Name: metal.00002.wav, dtype: float64
The output above shows the similarity scores for the sampled songs. For example, the top three songs most similar to pop.00023 (Britney Spears, "I'm so curious (2009 remaster)") are pop.00075, pop.00089, and pop.00088, respectively.
The algorithm can also recommend similar songs from other genres. For example, metal.00002 (Iron Maiden, "Flight of Icarus") has similar songs in both the metal and rock genres. The same applies to rock.00018 (Queen, "Another One Bites the Dust"), which also has similar songs in both metal and rock.
It is interesting how we are able to process audio data into numbers and images. The application of music recognition algorithms could be highly beneficial to the entertainment industry in meeting the needs of the consumer market. Researchers can also apply algorithms of this nature to extract characteristics that may be relevant to their variable of interest, such as attention or mental concentration.
One thing worth noting is that I am not a music expert, though I would love to practice piano at some point. The algorithm I used is just one way of classifying music into genres with the available information (e.g., tempo, harmonic wave). Domain expertise is important in data work regardless of your skill in data science. That is why it is crucial to consult with subject matter experts (i.e., musicians) to make the most of the insights we gain from this data. This also applies to other areas, such as testing. I can do the math and the programming, but I don't know much about students or English testing. This is where domain experts come into play. I just want to emphasize the importance of collaboration between fields to ensure the best results for the collective good.
Due to the nature of my field (education), it is unlikely that I will have many chances to work with audio data, but this practice is still valuable regardless. The model_assess function that I used can be applied to any machine learning work that requires comparing several models to find the most suitable algorithm for the task. The cosine_similarity statistic is also useful for recommendation systems for other products, such as textbooks or novels. Anyway, it was a good practice, and I had fun nonetheless. As always, thank you very much for reading! I hope you have a good day wherever you are!
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Wongvorachan (2021, Dec. 11). Tarid Wongvorachan: Applying Machine Learning to Audio Data: Visualization, Classification, and Recommendation. Retrieved from https://taridwong.github.io/posts/2021-12-11-applying-machine-learning-to-audio-data/
BibTeX citation
@misc{wongvorachan2021applying,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Applying Machine Learning to Audio Data: Visualization, Classification, and Recommendation},
  url = {https://taridwong.github.io/posts/2021-12-11-applying-machine-learning-to-audio-data/},
  year = {2021}
}