Spoken Speaker Identification based on Gaussian Mixture Models : Python Implementation

Like our previous post “Voice Gender Detection“, this blog-post presents a beginner’s method for answering the question ‘who is the speaker‘ for a speech file. Many voice biometric systems have been developed recently that extract speaker information from recorded voice and identify the speaker from a set of trained speakers in a database. In this blog-post, we illustrate the same task with a naive approach using Gaussian Mixture Models (GMMs). Other approaches, both conventional and modern, are more robust to channel noise and perform better than the approach followed here.

  1. GMM-UBM (Gaussian Mixture Model – Universal Background Model) with MAP (maximum a posteriori) adaptation [1] is one of the most successful conventional techniques for speaker identification.
  2. I-vector based speaker identification [2] is the state-of-the-art technique implemented in a lot of voice biometric products.

As a beginner, you may find the above techniques overwhelming, as they are mathematically complex and require some research effort to comprehend. Therefore, I am not following either of the two approaches. Instead, I want to show you the implementation of the fundamental step of speaker identification (using GMMs), which can then lead to the development of a GMM-UBM or i-vector approach.

Data-sets: The data-sets below can be downloaded from here.

  1. Training corpus: It has been developed from audio taken from the on-line VoxForge speech database and consists of 5 speech utterances for each of 34 speakers (i.e., 20-30 seconds/speaker).
  2. Test corpus: This consists of the remaining 5 unseen utterances of the same 34 speakers used in the training corpus. All audio files are 10 seconds long and are sampled at 16000 Hz.

I strongly recommend reading our previous post ‘Voice Gender Detection’, which gives a brief primer on how to work with speech signals. It also discusses extracting a popular speech feature, Mel Frequency Cepstrum Coefficients (MFCCs). A GMM takes the MFCCs and the derivatives of the MFCCs of a speaker’s training samples as input and tries to learn their distribution, which will be representative of that speaker. A typical speaker identification process is shown in the flow diagram below.


Speaker Identification Process

At test time, when the speaker of a new voice sample is to be identified, the 40-dimensional features (MFCCs + delta MFCCs) of the sample are extracted first, and the trained speaker GMMs are then used to calculate the scores of the features under every model. The speaker model with the maximum score is predicted as the speaker of the test speech. With that said, we will go through the Python implementation of the following steps:

  1. 40-Dimensional Feature Extraction
  2. Training Speaker Models.
  3. Evaluating Performance on test set

Let’s get started!

1. Feature Extraction.

We extract 40-dimensional features from speech frames: 20 MFCC features and 20 derivatives of the MFCC features. The derivatives provide information about the dynamics of the MFCCs over time. It turns out that calculating the delta-MFCCs and appending them to the original (20-dimensional) MFCC features improves performance in a lot of speech analytics applications. To calculate delta features from MFCCs, we apply the following equation.

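d_t = \frac{\sum_{n=1}^{N} n (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}

where d_t is the delta coefficient of frame t, c_t is the MFCC vector of frame t, and ‘N’ is the number of deltas summed over, typically taken as 2 (which makes the denominator 2(1^2 + 2^2) = 10).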

The Python functions below extract MFCC features from an audio signal and derive the delta coefficients.

import numpy as np
from sklearn import preprocessing
import python_speech_features as mfcc

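def calculate_delta(array):
    """Calculate and return the delta of the given feature vector matrix"""

    rows,cols = array.shape
    deltas = np.zeros((rows,cols))
    N = 2
    for i in range(rows):
        index = []
        j = 1
        while j <= N:
            # clamp the neighbouring frame indices i-j and i+j to the valid range
            if i-j < 0:
                first = 0
            else:
                first = i-j
            if i+j > rows-1:
                second = rows-1
            else:
                second = i+j
            index.append((second,first))
            j += 1
        # delta formula for N = 2; the denominator is 2*(1^2 + 2^2) = 10
        deltas[i] = ( array[index[0][0]]-array[index[0][1]] + (2 * (array[index[1][0]]-array[index[1][1]])) ) / 10
    return deltas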

def extract_features(audio,rate):
    """extract 20 dim mfcc features from an audio, performs CMS and combines
    delta to make it 40 dim feature vector"""    

    mfcc_feat = mfcc.mfcc(audio,rate, 0.025, 0.01,20,appendEnergy = True)
    mfcc_feat = preprocessing.scale(mfcc_feat)
    delta = calculate_delta(mfcc_feat)
    combined = np.hstack((mfcc_feat,delta))
    return combined
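
As a cross-check on the delta computation, here is a minimal vectorized sketch of the same calculation. The function name delta_vectorized and the use of np.pad with edge padding are my own choices for illustration and are not part of the original implementation. Edge padding repeats the first and last frames, which has the same effect as clamping the indices i-j and i+j to the valid range, as calculate_delta does.

```python
import numpy as np

def delta_vectorized(feats, N=2):
    """Delta features of a (frames x coefficients) matrix (illustrative sketch)."""
    feats = np.asarray(feats, dtype=float)
    rows = feats.shape[0]
    # Repeat the first/last frame N times so that frames t-n and t+n always exist.
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))  # 10 when N = 2
    deltas = np.zeros_like(feats)
    for n in range(1, N + 1):
        ahead = padded[N + n : N + n + rows]   # frames t+n (clamped at the end)
        behind = padded[N - n : N - n + rows]  # frames t-n (clamped at the start)
        deltas += n * (ahead - behind)
    return deltas / denom

# A ramp that grows by 1 per frame: the middle frame gets a delta of exactly 1,
# while frames near the edges get smaller values because of the clamping.
ramp = np.arange(5, dtype=float).reshape(-1, 1)
print(delta_vectorized(ramp).ravel())
```

For the 5-frame ramp the deltas are 0.5, 0.8, 1.0, 0.8 and 0.5, which is what the loop-based function gives when the same clamping is worked through by hand.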

2. Training Speaker Models.

As we know, there are 34 distinct speakers in the training corpus, taken from the many speakers provided by VoxForge. The paths of all the audio files (5 per speaker) used for training are given in this file. Usually there is a very important pre-processing step, which includes noise removal and silence truncation (voice activity detection, VAD). I have assumed that there is no need to perform VAD here.

In order to build a speaker identification system from the above extracted features, we need to model all the speakers independently now. We employ GMMs for this task.

A Gaussian mixture model is a probabilistic clustering model for representing the presence of sub-populations within an overall population.  The idea of training a GMM is to approximate the probability distribution of a class by a linear combination of ‘k’ Gaussian distributions/clusters, also called the components of the GMM. The likelihood of data points (feature vectors) for a model is given by following equation:

P(X|\lambda) = \sum_{k=1}^{K} w_k P_k(X|\mu_k, \Sigma_k)
, where P_k(X|\mu_k, \Sigma_k) is the Gaussian distribution

   P_k(X|\mu_k,\Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \thinspace e^{-\frac{1}{2}(X-\mu_k)^T \Sigma_k^{-1}(X-\mu_k)}

Here d is the dimension of the feature vector X (40 in our case).

The training data X_i of the class \lambda are used to estimate the parameters of these k components: the means \mu, the co-variance matrices \Sigma and the weights w.

Initially, it identifies clusters in the data by the K-means algorithm and assigns equal weight w = \frac{1}{k} to each cluster. ‘k’ Gaussian distributions are then fitted to these k clusters. The parameters \mu, \Sigma and w of all the clusters are updated iteratively until they converge. The most popular method for this estimation is the Expectation Maximization (EM) algorithm.

We use Python’s sklearn.mixture package to learn a GMM from the feature matrix containing the 40-dimensional MFCC and delta-MFCC features. More about the sklearn GMM can be read in section 3 of our previous post ‘Voice Gender Detection‘. The following Python code trains the GMM speaker models with 16 components. train_file is the variable holding the name of the text file that contains the paths to the training audio files; the script trains one model for every 5 consecutive files, so the 5 files of each speaker must be listed together. Also, you have to create a “speaker_models” directory, where all the models will be dumped after training.

import cPickle
import numpy as np
from scipy.io.wavfile import read
from sklearn.mixture import GMM
from speakerfeatures import extract_features
import warnings

#path to training data
source   = "development_set\\"   

#path where training speakers will be saved
dest = "speaker_models\\"
train_file = "development_set_enroll.txt"
file_paths = open(train_file,'r')

count = 1
# Extracting features for each speaker (5 files per speakers)
features = np.asarray(())
for path in file_paths:
    path = path.strip()
    print path

    # read the audio
    sr,audio = read(source + path)

    # extract 40 dimensional MFCC & delta MFCC features
    vector   = extract_features(audio,sr)

    if features.size == 0:
        features = vector
    else:
        features = np.vstack((features, vector))
    # when features of 5 files of speaker are concatenated, then do model training
    if count == 5:
        gmm = GMM(n_components = 16, n_iter = 200, covariance_type='diag',n_init = 3)
        gmm.fit(features)

        # dumping the trained gaussian model
        picklefile = path.split("-")[0]+".gmm"
        cPickle.dump(gmm,open(dest + picklefile,'w'))
        print '+ modeling completed for speaker:',picklefile," with data point = ",features.shape
        features = np.asarray(())
        count = 0
    count = count + 1

3. Evaluating Performance on Test set.

The test set consists of 5 unseen utterances from each of the 34 trained speakers. The paths of all the audio files (5 per speaker) used for evaluation are given in this file.

When a test voice sample arrives for speaker identification, we begin by extracting its 40-dimensional features, with a 25 ms frame size and a 10 ms step between frames (so consecutive frames overlap by 15 ms). Next we need the log-likelihood score of each frame of the sample, x_1, x_2, ... , x_i, under each speaker model, i.e., P(x_i|S_j) is to be calculated for every speaker S_j in the set S. The likelihood of a frame for a particular speaker is calculated by substituting the \mu and \Sigma of that speaker’s GMM into the likelihood equation shown in the previous section. This is done for each of the ‘k’ Gaussian components in the model, and the ‘k’ component likelihoods are summed, weighted by the weight parameter ‘w’ of the model. Taking the logarithm of this sum gives the log-likelihood of the frame. This is repeated for all the frames of the sample, and the log-likelihoods of all the frames are added. The speaker model with the highest total log-likelihood is taken as the identified speaker.
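
The scoring steps described above can be sketched directly in NumPy for a diagonal-covariance GMM (the covariance type used in the training script). The function name gmm_log_likelihood and its argument layout are assumptions made for illustration; in the actual scripts the trained scikit-learn model does this work. The sum over components is carried out in the log domain (log-sum-exp) for numerical stability, which is mathematically the same as taking the log of the weighted sum.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of frames (T x D) under a diagonal GMM.

    weights: (K,), means: (K, D), variances: (K, D) -- hypothetical layout.
    """
    dim = frames.shape[1]
    diff = frames[:, None, :] - means[None, :, :]                # (T, K, D)
    # log of the Gaussian normalisation term, one value per component
    log_norm = -0.5 * (dim * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
    # log density of every frame under every component
    log_gauss = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)  # (T, K)
    weighted = log_gauss + np.log(weights)                       # add log w_k
    # log of the weighted sum over components, computed stably
    peak = weighted.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))
    return per_frame.sum()                                       # add over frames

# Two toy single-component "speakers" in 2-D; the frames sit at speaker A's mean.
frames = np.zeros((10, 2))
score_a = gmm_log_likelihood(frames, np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))
score_b = gmm_log_likelihood(frames, np.array([1.0]), np.full((1, 2), 5.0), np.ones((1, 2)))
print("identified speaker:", "A" if score_a > score_b else "B")
```

Picking the larger of the two totals is the same argmax step that the test script performs over all the speaker models.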

The Python code given below predicts the speaker of the test audio.

import os
import cPickle
import numpy as np
from scipy.io.wavfile import read
from speakerfeatures import extract_features
import warnings
import time

#path to test data
source   = "development_set\\"
modelpath = "speaker_models\\"
test_file = "development_set_test.txt"
file_paths = open(test_file,'r')

gmm_files = [os.path.join(modelpath,fname) for fname in
              os.listdir(modelpath) if fname.endswith('.gmm')]

#Load the Gaussian speaker models
models    = [cPickle.load(open(fname,'r')) for fname in gmm_files]
speakers   = [fname.split("\\")[-1].split(".gmm")[0] for fname
              in gmm_files]

# Read the test directory and get the list of test audio files
for path in file_paths:   

    path = path.strip()
    print path
    sr,audio = read(source + path)
    vector   = extract_features(audio,sr)

    log_likelihood = np.zeros(len(models)) 

    for i in range(len(models)):
        gmm    = models[i]  #checking with each model one by one
        scores = np.array(gmm.score(vector))
        log_likelihood[i] = scores.sum()

    winner = np.argmax(log_likelihood)
    print "\tdetected as - ", speakers[winner]
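
The two scripts above are written for Python 2 and for the sklearn.mixture.GMM class, which newer scikit-learn releases no longer provide. As a rough sketch only, and not the author's code, the same train-then-score flow on Python 3 could look like the following with sklearn.mixture.GaussianMixture. The synthetic feature matrices, the speaker names, the identify helper and the smaller n_components=4 are assumptions made to keep the example self-contained; with real audio the matrices would come from extract_features. Note that GaussianMixture.score returns the average log-likelihood per frame, so score_samples is summed here to get the total.

```python
import pickle

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-ins for the (n_frames, 40) matrices from extract_features.
train = {
    "speaker_a": rng.normal(loc=0.0, scale=1.0, size=(500, 40)),
    "speaker_b": rng.normal(loc=3.0, scale=1.0, size=(500, 40)),
}

models = {}
for name, feats in train.items():
    # max_iter is the GaussianMixture counterpart of the old n_iter argument.
    gmm = GaussianMixture(n_components=4, covariance_type="diag",
                          max_iter=200, n_init=3, random_state=0)
    gmm.fit(feats)
    # Round-trip through pickle, as the scripts do with files on disk
    # (on Python 3 such files must be opened in binary mode: 'wb' / 'rb').
    models[name] = pickle.loads(pickle.dumps(gmm))

def identify(features, models):
    """Return the name of the model with the highest total log-likelihood."""
    totals = {name: m.score_samples(features).sum() for name, m in models.items()}
    return max(totals, key=totals.get)

test_b = rng.normal(loc=3.0, scale=1.0, size=(100, 40))
print("detected as -", identify(test_b, models))
```

Because every model is scored on the same frames, using the per-frame average instead of the sum would select the same winner; the sum is used only to mirror the description above.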

Results and Conclusion

This beginner’s approach achieves an in-set accuracy of 100%, identifying all 170 test utterances correctly. There are a few reasons for such a perfect result, and a few limitations to keep in mind.

  1. The unseen utterances of the speakers taken from VoxForge probably come from the same channel or environment as the training utterances.
  2. The evaluation is performed on a small dataset. Consider instead a data inflow of several thousand calls a day.
  3. Consider also the situation where we have to identify speakers from a set of 1000 speakers.
  4. In this evaluation we have not taken out-of-set speakers into account, i.e., if the audio is not from any trained speaker, our system will still identify it as one of the speakers in the trained set, whichever has the highest likelihood.
  5. In a real environment we may get noisier and less clean data, so a speaker identification system needs to be robust.

We hope this blog post succeeded in explaining a basic approach to the speaker identification task. We expect you to be able to reproduce the results posted here. Remember, this is not the end. I hope it forms the background for further research on this task. To read about more effective techniques for speaker identification, see the references below.

[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, M.I.T. Lincoln Laboratory, 2000

[2] Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, 2010

The full implementation of the approach followed here for training and evaluating speaker identification from voice can be downloaded from the GitHub link here. Also remember to download the data-set provided at the beginning of the blog-post.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂


39 thoughts on “Spoken Speaker Identification based on Gaussian Mixture Models : Python Implementation”

  1. in the above code indentation is missing in feature extraction part , can you send me the code with indentation , it will help me a lot.
    thank you

  2. I liked this. Im trying this with my local files. however, it gives me a warning saying that:
    WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
    + modeling completed for speaker: Tilak\wav\TIlak1.gmm with data point = (598, 40)

    • Hey Tilak, I had the same warning saying mine is frame length of 551 and is greater than FFT size of 512, so the trained model was not producing good results. However as the warning suggested, I increased the NFFT value from 512 to 1024 (it should be a power of 2, try searching google for more information) by manipulating the mfcc function in extract_features function. It is set as 512 as default. Hope it works

  3. hi I tried to apply the program however I am getting the same error but I don’t know how to increase the NFFT value. Can anyone help?
    Thanks in advance

  4. Sir I have problem with installation of cPickle module please sir can you help me with that
    installation .

  5. Hey…This blog is quite helpful …
    Is there is any way to declare as “unknowns” for silence or noise (except from trained models).
    e.g some threshold for declaring speaker match..etc

    • Silence: You can check the energy and discard it early, as there would be no voice and the energy will be low.
      Noise: Train an SVM/GMM discriminator (noise vs speech) on acoustic features like MFCCs.
      You may like to do a little research on these.

  6. I am having the issue of ImportError: cannot import name GMM. What could be the issue.I am using python 2.7 and sklearn version 0.20.2 and I ‘m importing GMM with from sklearn.mixture import GaussianMixture as GMM

    • That’s true. Probably, the scikit-learn design had been changed since this code was written. Check out where and how is GMM implemented now in scikit-learn !!.
      Do write it here if you figure out. It may help others.

  7. What are the ‘N’ component’s , taken as 16 here , Is it same as number of speakers ? Wanted to understand why 16 specifically ?

    • GMM is Gaussian Mixture Model.

      N is number of mixtures. Basically, GMM is mixture of gaussians. So ‘N’ tells about number of Gaussians that has to be fit on data.
      If data is huge, we may require N – 32,64 or 128 in order to capture all the variability.


  8. Thank you Abhijeet Kumar!!! for creating such kind of important blog. I am facing a problem and I don’t know how GMM-UBM is implemented in Speaker-recognition. Can you help me please? It is very very decisive to me.

    • Hi,
      GMM-UBM systems can be implemented in python by implementing the research paper given in reference [1].

      The technique is as follows:
      1. To train a large universal GMM model on 1000 speakers. This is called UBM.
      2. Further, Each of the target speakers has to be adapted from the mean, covariance from the trained UBM model. Thus you will get adapted GMM models for each speaker.

      I would strongly recommend to read reference [1] paper. After you understand the paper, you may like to modify the scikit learn GMM library in order to make it adapted GMM implementation (or you may search and find open source implementation for the same).


  9. Hi Abhijeet Kumar, I appreciate your work. but I have a query that when I tested an input file which is form training data set then model works well but the model gives a false result for the data which is not in the training data set, why this is so?

    • Hi Prashu,
      That is true. So question is how does it classify ? It checks the gaussian probability with all the speaker models and gives the speaker name which is highest in terms of probability/likelihood. It will always give you the name of speaker which has highest probability and there is no way to find if it is out of set.

      There are ways GMM-UBM method mentioned in above comments which can filter out unknowns from target speakers in database. Kindly go through GMM-UBM method in order to understand it.

      • I am not stating about the unknown speaker but the same trained speaker having a different voice sample that was not at the time of training. In my case, I use real-time audio testing from the mic so obviously, that audio file would not present in the training dataset. hope you get me

  10. For the training data set I use Recorder.ipynb which has channel 2 and run it every time by changing the file name, 6 for every speaker. then train the model as you illustrate above.
    after that, the model works fine for the stored audio file. but for real-time testing I use real-time_test.ipynb which first record the file by the mic and then test concurrently after that.
    but this shows the false result.

    this is my final yr project, I am very grateful for you.

    “check my work”

  11. Hi,
    I am getting file not found error. But, I have given the development set and training model files correctly. Here is the error I am getting.

    FileNotFoundError Traceback (most recent call last)
    in ()
    31 # read the audio
    —> 32 sr,audio = read(source + path)
    34 # extract 40 dimensional MFCC & delta MFCC features

    ~/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py in read(filename, mmap)
    231 mmap = False
    232 else:
    –> 233 fid = open(filename, ‘rb’)
    235 try:

    FileNotFoundError: [Errno 2] No such file or directory: ‘development_set\\anthonyschaller-20071221-\\wav\\a0491.wav’

  12. Hi,
    Thank you for this helpful tutorial

    I am doing voice authentication project where user can login by record his voice
    So, what I do is making user submit enrollment and extract features
    but I am struggling with part where I should compare the previous submitted enrollment with voice recording he is doing to login.

    can explain more
    “# Read the test directory and get the list of test audio files
    for path in file_paths:

    path = path.strip()
    print path
    sr,audio = read(source + path)
    vector = extract_features(audio,sr)

    log_likelihood = np.zeros(len(models))

    for i in range(len(models)):
    gmm = models[i] #checking with each model one by one
    scores = np.array(gmm.score(vector))
    log_likelihood[i] = scores.sum()

    winner = np.argmax(log_likelihood)
    print “\tdetected as – “, speakers[winner]

    Best regards,

  13. I am using python 3.6…and this the error i am getting

    runfile(‘D:/pyspeaker/test_speaker.py’, wdir=’D:/pyspeaker’)
    Traceback (most recent call last):

    File “”, line 1, in
    runfile(‘D:/pyspeaker/test_speaker.py’, wdir=’D:/pyspeaker’)

    File “D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 705, in runfile
    execfile(filename, namespace)

    File “D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py”, line 102, in execfile
    exec(compile(f.read(), filename, ‘exec’), namespace)

    File “D:/pyspeaker/test_speaker.py”, line 28, in
    models = [cPickle.load(open(fname,’r’,errors=’ignore’))]

    TypeError: a bytes-like object is required, not ‘str’

  14. hey, speakerfeatures and extract_features are shown as unknown package while trying to install
    Can u help me out with appropriate names of the packages to be installed for speakerfeatures
    And I have tried to import extract_features function which is again showing as unknown

  15. Hi Abhijeet, thanks for this tutorial. I am encountering the following error while implementing the exract_features function.

    TypeErrorTraceback (most recent call last)

    in ()
    —-> 1 a = extract_features(‘sp1.wav’,16000)
    2 a

    3 frames

    /usr/local/lib/python2.7/dist-packages/python_speech_features/sigproc.pyc in preemphasis(signal, coeff)
    116 :returns: the filtered signal.
    117 “””
    –> 118 return numpy.append(signal[0],signal[1:]-coeff*signal[:-1])

    TypeError: can’t multiply sequence by non-int of type ‘float’

    What can be the reason for this? I was trying to extract_features of a sample audio sampled at 16kHz. Other than that, I didn’t change anything.
