# Spoken Speaker Identification based on Gaussian Mixture Models : Python Implementation

Similar to our previous post “Voice Gender Detection“, this blog post focuses on a beginner’s method for answering the question ‘who is the speaker?‘ for a speech file. Recently, a lot of voice biometric systems have been developed that can extract speaker information from recorded voice and identify the speaker from a set of trained speakers in a database. In this blog post, we illustrate the same idea with a naive approach using Gaussian Mixture Models (GMMs). There are other conventional as well as modern approaches that are more robust to channel noise and also perform better than the approach followed in this blog post.

1. GMM-UBM (Gaussian Mixture Model – Universal Background Model) using MAP (Maximum A Posteriori) adaptation [1] is one of the successful conventional techniques for implementing speaker identification.
2. I-vector based speaker identification [2] is the state-of-the-art technique implemented in a lot of voice biometric products.

As a beginner, you may find the above-mentioned techniques overwhelming, as they are mathematically complex and require some research effort to comprehend. Therefore, I am not following either of the two approaches. Instead, I want to show you the implementation of the fundamental step of speaker identification (using GMMs), which can then lead to the development of a GMM-UBM or i-vector approach.

1. Training corpus: It has been developed from audio files taken from the online VoxForge speech database and consists of 5 speech utterances for each of 34 speakers (i.e., 20-30 seconds/speaker).
2. Test corpus: This consists of the remaining 5 unseen utterances of the same 34 speakers used in the training corpus. All audio files are 10 seconds long and are sampled at 16000 Hz.

I strongly recommend reading our previous post ‘Voice Gender Detection’, as it gives a brief primer on how to work with speech signals. We have also previously discussed extracting a popular speech feature, Mel Frequency Cepstral Coefficients (MFCCs). A GMM takes as input the MFCCs and the derivatives of the MFCCs of a speaker's training samples and tries to learn their distribution, which is then representative of that speaker. A typical speaker identification process is shown in the flow diagram below.

[Figure: Speaker Identification Process]

During testing, when the speaker of a new voice sample is to be identified, the 40-dimensional features (MFCCs + delta MFCCs) of the sample are extracted first, and then the trained speaker GMMs are used to calculate the scores of the features under every model. The speaker model with the maximum score is predicted as the speaker of the test speech. With that said, we will go through the Python implementation of the following steps:

1. 40-Dimensional Feature Extraction
2. Training Speaker Models
3. Evaluating Performance on the Test Set

Let's get started!

### 1. Feature Extraction.

We extract 40-dimensional features from speech frames: 20 MFCC features and 20 derivatives of the MFCC features. The derivatives capture the dynamics of the MFCCs over time. It turns out that calculating the delta-MFCCs and appending them to the original (20-dimensional) MFCC features increases performance in a lot of speech analytics applications. To calculate delta features from MFCCs, we apply the following equation.

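$d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$

where $d_t$ is the delta coefficient at frame $t$, $c_t$ is the MFCC vector at frame $t$, and ‘N’ is the number of neighbouring frames on each side that the deltas are summed over, typically taken as 2.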
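As a quick numeric check of this formula: with N = 2 the delta at frame t reduces to $\big((c_{t+1}-c_{t-1}) + 2\,(c_{t+2}-c_{t-2})\big)/10$, because $2\,(1^2 + 2^2) = 10$. The sketch below is only an illustration. The function name `delta_n2` and the toy ramp are made up for this example, and it assumes frame indices are clamped at both ends of the signal.

```python
import numpy as np

def delta_n2(feats):
    """Delta features for N = 2, clamping frame indices at both ends."""
    rows = feats.shape[0]
    t = np.arange(rows)
    deltas = np.zeros_like(feats, dtype=float)
    for n in (1, 2):
        ahead = np.minimum(t + n, rows - 1)   # frame t+n, clamped to the last frame
        behind = np.maximum(t - n, 0)         # frame t-n, clamped to the first frame
        deltas += n * (feats[ahead] - feats[behind])
    return deltas / 10.0                      # 2 * (1**2 + 2**2) = 10

# One coefficient that rises by 1 per frame: the delta is 1 away from the edges
# and smaller at the edges, where the clamped neighbours shrink the differences.
ramp = np.arange(6, dtype=float).reshape(-1, 1)
print(delta_n2(ramp).ravel())
```

A constant slope giving a delta of 1 in the interior is a handy sanity check that the weights and the divisor agree with each other.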

The Python functions below extract MFCC features from an audio signal and derive the delta coefficients.

    import numpy as np
    from sklearn import preprocessing
    import python_speech_features as mfcc

    def calculate_delta(array):
        """Calculate and return the delta of the given feature vector matrix."""
        rows, cols = array.shape
        deltas = np.zeros((rows, cols))
        N = 2
        for i in range(rows):
            index = []
            j = 1
            while j <= N:
                # clamp the neighbouring frame indices at the start and end of the signal
                if i - j < 0:
                    first = 0
                else:
                    first = i - j
                if i + j > rows - 1:
                    second = rows - 1
                else:
                    second = i + j
                index.append((second, first))
                j += 1
            # weighted differences for n = 1 and n = 2, divided by 2 * (1^2 + 2^2) = 10
            deltas[i] = (array[index[0][0]] - array[index[0][1]] + (2 * (array[index[1][0]] - array[index[1][1]]))) / 10
        return deltas

    def extract_features(audio, rate):
        """Extract 20-dim MFCC features from an audio signal, perform CMS and
        append the deltas to make a 40-dim feature vector."""
        # 25 ms window, 10 ms step, 20 cepstral coefficients
        mfcc_feat = mfcc.mfcc(audio, rate, 0.025, 0.01, 20, appendEnergy=True)
        # normalise every coefficient to zero mean and unit variance
        mfcc_feat = preprocessing.scale(mfcc_feat)
        delta = calculate_delta(mfcc_feat)
        combined = np.hstack((mfcc_feat, delta))
        return combined
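
One practical detail when reusing this feature extraction on your own recordings: python_speech_features uses an FFT size of 512 by default and truncates frames that are longer than that. A 25 ms frame at 16000 Hz is 400 samples, so the corpus used here is fine, but higher sampling rates need a larger `nfft`. The helper names below are made up for this sketch, and picking the next power of two is a common convention rather than a requirement stated in this post.

```python
def frame_length(rate, win_ms=25):
    """Samples in one analysis window, rounded to the nearest sample (half up)."""
    return (rate * win_ms + 500) // 1000

def suggested_nfft(rate, win_ms=25, minimum=512):
    """Smallest power of two that is >= the frame length (and >= minimum)."""
    nfft = minimum
    while nfft < frame_length(rate, win_ms):
        nfft *= 2
    return nfft

for rate in (16000, 22050, 44100):
    print(rate, frame_length(rate), suggested_nfft(rate))

# The value could then be passed to the MFCC call, for example:
# mfcc.mfcc(audio, rate, 0.025, 0.01, 20, nfft=suggested_nfft(rate), appendEnergy=True)
```

If the training and test audio use different sampling rates, it is probably safer to resample everything to one rate than to vary `nfft` per file.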


### 2. Training Speaker Models.

As we know, there are 34 distinct speakers in the training corpus, selected from the many speakers provided by VoxForge. The paths of all the audio files (5 per speaker) used for training are given in this file. Usually there is a very important pre-processing step, voice activity detection (VAD), which includes noise removal and silence truncation. I have assumed that VAD is not required here.

In order to build a speaker identification system from the above extracted features, we need to model all the speakers independently now. We employ GMMs for this task.

A Gaussian mixture model is a probabilistic clustering model for representing the presence of sub-populations within an overall population. The idea of training a GMM is to approximate the probability distribution of a class by a linear combination of ‘k’ Gaussian distributions/clusters, also called the components of the GMM. The likelihood of the data points (feature vectors) under a model is given by the following equation:

$P(X|\lambda) = \sum_{k=1}^{K} w_k P_k(X|\mu_k, \Sigma_k)$

where $P_k(X|\mu_k, \Sigma_k)$ is the Gaussian distribution of the $k$-th component, for $d$-dimensional feature vectors:

$P_k(X|\mu_k,\Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \thinspace e^{-\frac{1}{2}(X-\mu_k)^T \Sigma_k^{-1}(X-\mu_k)}$

The training data $X_i$ of the class $\lambda$ are used to estimate the parameters of these k components: the means $\mu$, covariance matrices $\Sigma$ and weights $w$.
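To make the likelihood concrete, here is a minimal sketch that evaluates $\log P(x|\lambda)$ frame by frame for a GMM with diagonal covariance matrices (the covariance type used by the training code in this post). The function name and the toy parameters are assumptions for illustration; in practice the library computes this for you.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log P(x | lambda) for a diagonal-covariance GMM.

    X: (T, d) frames; weights: (K,); means, variances: (K, d).
    """
    d = X.shape[1]
    diff = X[:, None, :] - means[None, :, :]                    # (T, K, d)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)   # (T, K)
    # log of 1 / sqrt((2*pi)^d * |Sigma_k|), with |Sigma_k| = product of variances
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_comp = np.log(weights) + log_norm - 0.5 * mahal         # log(w_k * P_k(x))
    # log-sum-exp over the components, for numerical stability
    top = log_comp.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.sum(np.exp(log_comp - top), axis=1))

# Toy model: two equally weighted components in 2-D, unit variances.
X = np.array([[0.0, 0.0], [3.0, 3.0]])
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
frame_ll = gmm_log_likelihood(X, weights, means, variances)
print(frame_ll, frame_ll.sum())
```

Working in the log domain avoids underflow: with 40-dimensional features the raw densities are often far too small to multiply safely.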

Initially, the clusters in the data are identified by the K-means algorithm and an equal weight $w = \frac{1}{k}$ is assigned to each cluster. ‘k’ Gaussian distributions are then fitted to these k clusters. The parameters $\mu$, $\Sigma$ and $w$ of all the clusters are updated iteratively until they converge. The most popular method for this estimation is the Expectation Maximization (EM) algorithm.
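The update loop can be sketched in a few lines of NumPy. This is a simplified illustration and not the scikit-learn implementation: it assumes diagonal covariances, uses randomly chosen frames as initial means in place of K-means, and runs a fixed number of iterations instead of testing for convergence. All names and the synthetic data are made up for the example.

```python
import numpy as np

def log_components(X, w, mu, var):
    """log(w_k * N(x_t | mu_k, diag(var_k))) for every frame t and component k."""
    d = X.shape[1]
    diff = X[:, None, :] - mu[None, :, :]
    log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(var).sum(axis=1))
    return np.log(w) + log_norm - 0.5 * (diff ** 2 / var[None, :, :]).sum(axis=2)

def em_step(X, w, mu, var, floor=1e-6):
    """One EM iteration; returns new parameters and the log-likelihood before the update."""
    lc = log_components(X, w, mu, var)
    top = lc.max(axis=1, keepdims=True)
    log_px = top + np.log(np.exp(lc - top).sum(axis=1, keepdims=True))  # (T, 1)
    resp = np.exp(lc - log_px)             # E-step: responsibilities, rows sum to 1
    nk = resp.sum(axis=0)                  # soft count of frames per component
    w_new = nk / X.shape[0]                # M-step: weights, means, variances
    mu_new = (resp.T @ X) / nk[:, None]
    var_new = (resp.T @ X ** 2) / nk[:, None] - mu_new ** 2
    return w_new, mu_new, np.maximum(var_new, floor), float(log_px.sum())

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (200, 2)), rng.normal(3, 1, (200, 2))])
k = 2
w = np.full(k, 1.0 / k)                            # equal initial weights, w = 1/k
mu = X[rng.choice(len(X), k, replace=False)]       # random frames stand in for K-means centres
var = np.tile(X.var(axis=0), (k, 1))
history = []
for _ in range(20):
    w, mu, var, ll = em_step(X, w, mu, var)
    history.append(ll)
print(history[0], history[-1])
```

The total log-likelihood should never decrease from one iteration to the next, which makes `history` a convenient thing to plot when a model refuses to converge.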

We use Python’s sklearn.mixture package to learn a GMM from the feature matrix containing the 40-dimensional MFCC and delta-MFCC features. More about GMMs in sklearn can be read in section 3 of our previous post ‘Voice Gender Detection‘. The following Python code trains the GMM speaker models with 16 components. The training step runs once for each speaker (after every 5 files), and train_file is a variable holding the name of the text file that lists the paths of the audio files, grouped by speaker. Also, you have to create a “speaker_models” directory, where all the models will be dumped after training.

    import pickle
    import numpy as np
    from scipy.io.wavfile import read
    # GaussianMixture replaces the GMM class, which recent scikit-learn releases no longer provide
    from sklearn.mixture import GaussianMixture
    # the feature extraction functions from section 1, saved as speakerfeatures.py
    from speakerfeatures import extract_features
    import warnings
    warnings.filterwarnings("ignore")

    # path to training data
    source = "development_set\\"

    # path where the trained speaker models will be saved
    dest = "speaker_models\\"
    train_file = "development_set_enroll.txt"
    file_paths = open(train_file, 'r')

    count = 1
    # Extracting features for each speaker (5 files per speaker)
    features = np.asarray(())
    for path in file_paths:
        path = path.strip()
        print(path)

        # read the audio; the sampling rate is returned first
        sr, audio = read(source + path)

        # extract 40 dimensional MFCC & delta MFCC features
        vector = extract_features(audio, sr)

        if features.size == 0:
            features = vector
        else:
            features = np.vstack((features, vector))
        # when features of 5 files of speaker are concatenated, then do model training
        if count == 5:
            gmm = GaussianMixture(n_components=16, max_iter=200, covariance_type='diag', n_init=3)
            gmm.fit(features)

            # dumping the trained gaussian model (pickle needs a binary file mode)
            picklefile = path.split("-")[0] + ".gmm"
            pickle.dump(gmm, open(dest + picklefile, 'wb'))
            print('+ modeling completed for speaker:', picklefile, " with data point = ", features.shape)
            features = np.asarray(())
            count = 0
        count = count + 1
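
The loop above relies on the listing file holding exactly five consecutive paths per speaker, and it names each model after the part of the path before the first hyphen. As a hedged alternative, the paths could be grouped by that same prefix first, so that neither the order nor the number of files per speaker matters. The helper name and the example paths are hypothetical, and the sketch assumes that a speaker id never contains a hyphen.

```python
from collections import defaultdict

def group_by_speaker(paths):
    """Map speaker id -> list of that speaker's file paths, keeping input order."""
    groups = defaultdict(list)
    for path in paths:
        path = path.strip()
        if path:                                   # skip blank lines
            groups[path.split("-")[0]].append(path)
    return dict(groups)

# Hypothetical listing in the VoxForge naming style, for illustration only.
listing = [
    "anthonyschaller-20071221-\\wav\\a0491.wav\n",
    "speakerB-20080101-\\wav\\b0001.wav\n",
    "anthonyschaller-20071221-\\wav\\a0492.wav\n",
]
groups = group_by_speaker(listing)
for speaker, files in groups.items():
    print(speaker, len(files))
```

Each entry of `groups` could then be turned into one feature matrix and one GMM, exactly as in the training loop.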


### 3. Evaluating Performance on Test set.

The test set consists of 5 unseen utterances from each of the 34 trained speakers. The paths of all the audio files (5 per speaker) used for evaluation are given in this file.

When a test voice sample arrives for speaker identification, we begin by extracting its 40-dimensional features, using a 25 ms frame size and a 10 ms step between successive frames (the values passed to the MFCC function in section 1). Next, we need the log-likelihood scores of each frame of the sample, $x_1, x_2, ... ,x_i$, under each speaker model, i.e., $P(x_i|S_j)$ is to be calculated for every speaker $S_j$ in the set $S$. The likelihood of a frame coming from a particular speaker is calculated by substituting the $\mu$ and $\Sigma$ of that speaker’s GMM into the likelihood equation shown in the previous section. This is done for each of the ‘k’ Gaussian components in the model, and the weighted sum of the ‘k’ component likelihoods is taken according to the weight parameter ‘$w$‘ of the model. Taking the logarithm of this sum gives the log-likelihood value for the frame. This is repeated for all the frames of the sample, and the log-likelihoods of all the frames are added. The speaker model with the highest total log-likelihood score is taken as the identified speaker.
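In symbols, the decision rule is $\hat{S} = \arg\max_j \sum_i \log P(x_i|S_j)$. The toy sketch below shows just that rule. To stay self-contained it replaces each speaker GMM with a single 1-D Gaussian, and the speaker names and numbers are invented for illustration.

```python
import numpy as np

def identify(frame_scores, speakers):
    """frame_scores[j][i] is log P(x_i | S_j); returns the best speaker and the totals."""
    totals = np.array([np.sum(s) for s in frame_scores])  # add log-likelihoods over frames
    return speakers[int(np.argmax(totals))], totals

def gauss_logpdf(x, mean, std):
    """Log density of a 1-D Gaussian, a stand-in for a full speaker GMM."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - 0.5 * ((x - mean) / std) ** 2

frames = np.array([0.9, 1.2, 0.7, 1.1])              # toy 1-D "features" of one utterance
models = {"alice": (1.0, 0.5), "bob": (-1.0, 0.5)}   # hypothetical speakers: (mean, std)
speakers = list(models)
scores = [gauss_logpdf(frames, *models[name]) for name in speakers]
winner, totals = identify(scores, speakers)
print(winner, totals)
```

Adding log-likelihoods corresponds to multiplying the frame likelihoods, i.e. treating the frames as independent given the speaker.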

The Python code given below predicts the speaker of the test audio.

    import os
    import pickle
    import numpy as np
    from scipy.io.wavfile import read
    from speakerfeatures import extract_features
    import warnings
    warnings.filterwarnings("ignore")
    import time

    # path to the data set
    source = "development_set\\"
    modelpath = "speaker_models\\"
    test_file = "development_set_test.txt"
    file_paths = open(test_file, 'r')

    gmm_files = [os.path.join(modelpath, fname) for fname in
                 os.listdir(modelpath) if fname.endswith('.gmm')]

    # load the trained speaker models (pickle needs a binary file mode)
    models = [pickle.load(open(fname, 'rb')) for fname in gmm_files]
    speakers = [fname.split("\\")[-1].split(".gmm")[0] for fname
                in gmm_files]

    # Read the list of test audio files and identify the speaker of each one
    for path in file_paths:

        path = path.strip()
        print(path)
        sr, audio = read(source + path)
        vector = extract_features(audio, sr)

        log_likelihood = np.zeros(len(models))

        for i in range(len(models)):
            gmm = models[i]  # checking with each model one by one
            # per-frame log-likelihoods of the test features under this model
            scores = np.array(gmm.score_samples(vector))
            log_likelihood[i] = scores.sum()

        winner = np.argmax(log_likelihood)
        print("\tdetected as - ", speakers[winner])
        time.sleep(1.0)
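
The script above only prints the detected speaker for each file. To turn that into an accuracy figure, the predictions can be compared with the true speaker, which this sketch takes to be the part of the test path before the first hyphen (the same prefix used to name the model files). The function name, paths and predictions below are hypothetical.

```python
def accuracy(test_paths, predictions):
    """Fraction of test files whose predicted speaker matches the path prefix."""
    truth = [p.strip().split("-")[0] for p in test_paths]
    correct = sum(t == p for t, p in zip(truth, predictions))
    return correct / len(truth)

# Hypothetical paths and predictions, for illustration only.
paths = [
    "spkA-001-\\wav\\a1.wav",
    "spkA-001-\\wav\\a2.wav",
    "spkB-002-\\wav\\b1.wav",
    "spkB-002-\\wav\\b2.wav",
]
preds = ["spkA", "spkA", "spkB", "spkA"]
print(accuracy(paths, preds))
```

In the test loop, this would mean appending `speakers[winner]` to a list for every file and calling the function once at the end.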


### Results and Conclusion

This beginner’s approach achieves an in-set accuracy of 100%, identifying all 170 speech utterances correctly. There are a few reasons for such a perfect result, and a few caveats to keep in mind.

1. The unseen utterances of the speakers taken from VoxForge are possibly from the same channel or environment as the training utterances.
2. The evaluation is performed on a small dataset. Consider a data inflow in which you receive some thousands of calls a day.
3. Consider the situation in which we have to identify speakers from a set of 1000 speakers.
4. In this evaluation, we have not taken out-of-set speakers into account, i.e., if the audio is not from any enrolled speaker, our system will still identify it as one of the speakers in the trained set, namely the one with the highest likelihood.
5. In a real environment, we may get noisier and less clean data. A speaker identification system needs to be robust to that.
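
On the out-of-set point: the GMM-UBM approach of reference [1] scores an utterance by the difference between its log-likelihood under the claimed speaker model and under a background model, and accepts the speaker only if that difference exceeds a threshold. The sketch below illustrates that idea with invented per-frame scores. The function name and the threshold of 0.0 are assumptions; a real threshold would have to be tuned on held-out data, and this post does not train a background model.

```python
import numpy as np

def accept_or_reject(speaker_frame_ll, background_frame_ll, threshold=0.0):
    """Average per-frame log-likelihood ratio test.

    Both arguments hold per-frame log-likelihoods of the same utterance,
    under the best speaker model and under a background model.
    """
    llr = float(np.mean(speaker_frame_ll) - np.mean(background_frame_ll))
    return llr > threshold, llr

# Hypothetical per-frame scores, for illustration only.
known = accept_or_reject(np.array([-38.0, -41.0, -40.0]), np.array([-45.0, -46.0, -44.0]))
unknown = accept_or_reject(np.array([-52.0, -50.0, -51.0]), np.array([-46.0, -45.0, -47.0]))
print(known, unknown)
```

Averaging over frames keeps the score comparable between short and long utterances, which a plain sum would not.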

We hope this blog post succeeded in explaining a basic approach to the speaker identification task. We expect you to be able to reproduce the results we posted. Remember, this is not the end; I hope it forms the background for further research on this task. To read about more effective techniques, see the references for the speaker identification task below.

[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, M.I.T. Lincoln Laboratory, 2000

[2] Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, 2010

The full implementation of the approach followed here for training and evaluating speaker identification from voice can be downloaded from the GitHub link here. Also remember to download the data set provided at the beginning of the blog post.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂

## 45 thoughts on “Spoken Speaker Identification based on Gaussian Mixture Models : Python Implementation”

1. venkat narendra says:

in the above code indentation is missing in feature extraction part , can you send me the code with indentation , it will help me a lot.
thank you

• venkat narendra says:

thank you very much for the reply 🙂

• nani says:

sir can u provide for voice disorder identification

• Abhijit Kumar says:

Hi Abhijeet,Thanks for a wonderful post. Please suggest us a complete voice/sound tutorial or book.

2. I liked this. Im trying this with my local files. however, it gives me a warning saying that:
WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
+ modeling completed for speaker: Tilak\wav\TIlak1.gmm with data point = (598, 40)

• Hey Tilak, I had the same warning saying mine is frame length of 551 and is greater than FFT size of 512, so the trained model was not producing good results. However as the warning suggested, I increased the NFFT value from 512 to 1024 (it should be a power of 2, try searching google for more information) by manipulating the mfcc function in extract_features function. It is set as 512 as default. Hope it works

3. julie says:

hi I tried to apply the program however I am getting the same error but I don’t know how to increase the NFFT value. Can anyone help?

4. Snehankit Chikhalekar says:

Sir I have problem with installation of cPickle module please sir can you help me with that
installation .

5. JARVIISS says:

Hey…This blog is quite helpful …
Is there is any way to declare as “unknowns” for silence or noise (except from trained models).
e.g some threshold for declaring speaker match..etc

• Silence: You can check the energy and discard such segments early, as there would be no voice and the energy will be low.
Noise: Train an SVM/GMM discriminator (noise vs. speech) on acoustic features like MFCCs.
You may like to do a little research on these.

6. deveid says:

I am having the issue of ImportError: cannot import name GMM. What could be the issue.I am using python 2.7 and sklearn version 0.20.2 and I ‘m importing GMM with from sklearn.mixture import GaussianMixture as GMM

• That’s true. Probably, the scikit-learn design has changed since this code was written. Check where and how GMM is implemented in scikit-learn now.
Do write it here if you figure it out. It may help others.

7. Tarang V. says:

What are the ‘N’ component’s , taken as 16 here , Is it same as number of speakers ? Wanted to understand why 16 specifically ?

• GMM is Gaussian Mixture Model.

N is number of mixtures. Basically, GMM is mixture of gaussians. So ‘N’ tells about number of Gaussians that has to be fit on data.
If data is huge, we may require N – 32,64 or 128 in order to capture all the variability.

Thanks.

8. Brhane says:

Thank you Abhijeet Kumar!!! for creating such kind of important blog. I am facing a problem and I don’t know how GMM-UBM is implemented in Speaker-recognition. Can you help me please? It is very very decisive to me.

• Hi,
GMM-UBM systems can be implemented in python by implementing the research paper given in reference [1].

The technique is as follows:
1. To train a large universal GMM model on 1000 speakers. This is called UBM.
2. Further, Each of the target speakers has to be adapted from the mean, covariance from the trained UBM model. Thus you will get adapted GMM models for each speaker.

I would strongly recommend to read reference [1] paper. After you understand the paper, you may like to modify the scikit learn GMM library in order to make it adapted GMM implementation (or you may search and find open source implementation for the same).

Thanks,
Abhijeet

• Brhane says:

Thank you too much. I really appreciate your work

9. prashu says:

Hi Abhijeet Kumar, I appreciate your work. but I have a query that when I tested an input file which is form training data set then model works well but the model gives a false result for the data which is not in the training data set, why this is so?

• Hi Prashu,
That is true. So question is how does it classify ? It checks the gaussian probability with all the speaker models and gives the speaker name which is highest in terms of probability/likelihood. It will always give you the name of speaker which has highest probability and there is no way to find if it is out of set.

There are ways, such as the GMM-UBM method mentioned in the comments above, to filter out unknowns from the target speakers in the database. Kindly go through the GMM-UBM method in order to understand it.

• Prashu Gupta says:

I am not stating about the unknown speaker but the same trained speaker having a different voice sample that was not at the time of training. In my case, I use real-time audio testing from the mic so obviously, that audio file would not present in the training dataset. hope you get me

• What was the source of training data for your speaker model ? Channel has its effect. This basic technique will work if the channel is same.

10. Prashu Gupta says:

For the training data set I use Recorder.ipynb which has channel 2 and run it every time by changing the file name, 6 for every speaker. then train the model as you illustrate above.
after that, the model works fine for the stored audio file. but for real-time testing I use real-time_test.ipynb which first record the file by the mic and then test concurrently after that.
but this shows the false result.

this is my final yr project, I am very grateful for you.

“check my work”
https://github.com/prashu22/speaker_reco

11. Hi,
I am getting file not found error. But, I have given the development set and training model files correctly. Here is the error I am getting.

anthonyschaller-20071221-\wav\a0491.wav
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in ()
30
---> 32 sr,audio = read(source + path)
33
34 # extract 40 dimensional MFCC & delta MFCC features

231 mmap = False
232 else:
--> 233 fid = open(filename, 'rb')
234
235 try:

FileNotFoundError: [Errno 2] No such file or directory: 'development_set\\anthonyschaller-20071221-\\wav\\a0491.wav'

• Kindly check the directory path properly. Hope you can debug it on your own.

12. takwa says:

hi,
I can use python 3.6 with this code ? if no what can i do cause my project with 3.6 ?

13. M7md says:

Hi,
Thank you for this helpful tutorial

I am doing voice authentication project where user can login by record his voice
So, what I do is making user submit enrollment and extract features
but I am struggling with part where I should compare the previous submitted enrollment with voice recording he is doing to login.

can explain more
“# Read the test directory and get the list of test audio files
for path in file_paths:

path = path.strip()
print path
vector = extract_features(audio,sr)

log_likelihood = np.zeros(len(models))

for i in range(len(models)):
gmm = models[i] #checking with each model one by one
scores = np.array(gmm.score(vector))
log_likelihood[i] = scores.sum()

winner = np.argmax(log_likelihood)
print “\tdetected as – “, speakers[winner]
time.sleep(1.0)”

Best regards,

14. Mamun says:

I need GMM model python source code for accent variation detection. Could you please provide that?

• Look at the package “pyAudioAnalysis”. You may find something useful in segmentation or speaker Diarization part.

15. pratiksha says:

I am using python 3.6…and this the error i am getting

runfile('D:/pyspeaker/test_speaker.py', wdir='D:/pyspeaker')
Traceback (most recent call last):

File "", line 1, in
runfile('D:/pyspeaker/test_speaker.py', wdir='D:/pyspeaker')

File "D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile

File "D:/pyspeaker/test_speaker.py", line 28, in

TypeError: a bytes-like object is required, not 'str'

• Seems you have not trained the speaker models as it could not find pickled models.

16. VInit Dhamale says:

If Multiple people are talking in clip can model detect which speaker said what?

17. Akhil Krishna says:

hey, speakerfeatures and extract_features are shown as unknown package while trying to install
Can u help me out with appropriate names of the packages to be installed for speakerfeatures
And I have tried to import extract_features function which is again showing as unknown

18. Akshay Kishan says:

Hi Abhijeet, thanks for this tutorial. I am encountering the following error while implementing the exract_features function.

TypeError Traceback (most recent call last)

in ()
----> 1 a = extract_features('sp1.wav',16000)
2 a

3 frames

/usr/local/lib/python2.7/dist-packages/python_speech_features/sigproc.pyc in preemphasis(signal, coeff)
116 :returns: the filtered signal.
117 """
--> 118 return numpy.append(signal[0],signal[1:]-coeff*signal[:-1])
119
120

TypeError: can't multiply sequence by non-int of type 'float'

What can be the reason for this? I was trying to extract_features of a sample audio sampled at 16kHz. Other than that, I didn’t change anything.

• Akshay Kishan says:

Resolved.

19. yan says:

I am wondering if taken out-of-set speakers into account i.e. if the audio is not from any speaker in the trained models,how to recognize the out-of-set speaker as imposter,or maybe how to set the threshold?I hope that you can give me some ideas.Thank you very much.

• Hi Yan,

The Question is how does it recognize ? It checks the Gaussian probability with all the speaker models and gives the speaker name which is highest in terms of probability/likelihood. It will always give you the name of speaker which has highest probability and there is no way to find if it is out of set.

There are ways, such as the GMM-UBM method mentioned in the comments above, to filter out unknowns from the target speakers in the database. Kindly go through the GMM-UBM method in order to understand it.

GMM-UBM systems can be implemented in python by implementing the research paper given in reference [1].

The technique is as follows:
1. To train a large universal GMM model on 1000 speakers. This is called UBM.
2. Further, Each of the target speakers has to be adapted from the mean, covariance from the trained UBM model. Thus you will get adapted GMM models for each speaker.

I would strongly recommend to read reference [1] paper. After you understand the paper, you may like to modify the scikit learn GMM library in order to make it adapted GMM implementation (or you may search and find open source implementation for the same).

Thanks,
Abhijeet

• yan says:

Hi Abhijeet,
Thanks very much for your reply,I am going to learn the GMM-UBM model.Thank you!

20. praveen says:

index 299 is out of bounds for axis 0 with size 299

21. Minakshi says:

how we can find out accuracy of the above model