Similar to our previous post “Voice Gender Detection“, this blog-post focuses on a beginner’s method to answer the question ‘who is the speaker‘ in a speech file. Recently, many voice biometric systems have been developed that can extract speaker information from recorded voice and identify the speaker from a set of trained speakers in a database. In this blog-post, we will illustrate the same with a naive approach using Gaussian Mixture Models (GMM). There are other conventional as well as modern approaches which are more robust to channel noise and also perform better than the approach followed in this blog-post.

  1. GMM-UBM (Gaussian Mixture Model – Universal Background Model) with MAP (Maximum A Posteriori) adaptation [1] is one of the most successful conventional techniques for speaker identification.
  2. I-vector based speaker identification [2] is the state-of-the-art technique implemented in many voice biometric products.

As a beginner, the above-mentioned techniques may overwhelm you, as they are mathematically complex and require some research effort to comprehend. Therefore, I am not following either of the two approaches. Instead, I am interested in showing you the implementation of the fundamental step of speaker identification (using GMMs), which can then lead to the development of the GMM-UBM or i-vector approach.

Data-sets: The below data-sets can be downloaded from here.

  1. Training corpus: It has been developed from audio taken from the on-line VoxForge speech database and consists of 5 speech utterances for each of 34 speakers (i.e., 20-30 seconds per speaker).
  2. Test corpus: This consists of 5 more unseen utterances of the same 34 speakers taken in the training corpus. All audio files are of 10 seconds duration and are sampled at 16000 Hz.

I strongly recommend reading our previous post ‘Voice Gender Detection’, as it provides a brief primer on how to work with speech signals. We have also previously discussed the extraction of a popular speech feature, Mel Frequency Cepstral Coefficients (MFCCs). A GMM will take as input the MFCCs and the derivatives of the MFCCs of a speaker’s training samples and will try to learn their distribution, which will be representative of that speaker. A typical speaker identification process is shown in the flow diagram below.

[Figure: Speaker Identification Process]

During testing, when the speaker of a new voice sample is to be identified, the 40-dimensional features (MFCCs + delta MFCCs) of the sample are first extracted, and then the trained speaker GMM models are used to compute the score of those features under every model. The speaker model with the maximum score is predicted as the identified speaker of the test speech. Having said that, we will go through the Python implementation of the following steps:

  1. 40-Dimensional Feature Extraction
  2. Training Speaker Models.
  3. Evaluating Performance on test set

Let's get started!

1. Feature Extraction.

We extract 40-dimensional features from the speech frames: 20 MFCC features and 20 derivatives of the MFCC features. The derivatives of the MFCCs capture the dynamics of the MFCCs over time. It turns out that calculating the delta-MFCCs and appending them to the original MFCC features (20-dimensional) improves performance in a lot of speech analytics applications. To calculate the delta features from the MFCCs, we apply the following equation:

d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}

where d_t is the delta coefficient of frame t, c_t is the MFCC vector of frame t, and ‘N’ is the number of neighbouring frames summed over, typically taken as 2. With N = 2 the denominator becomes 2(1^2 + 2^2) = 10, so d_t = ((c_{t+1} - c_{t-1}) + 2(c_{t+2} - c_{t-2})) / 10, which is exactly what the function below computes (clamping the frame indices at the boundaries of the utterance).

The Python functions below extract MFCC features and derive the delta coefficients from an audio signal.

import numpy as np
from sklearn import preprocessing
import python_speech_features as mfcc

def calculate_delta(array):
    """Calculate and return the delta of the given feature vector matrix"""

    rows, cols = array.shape
    deltas = np.zeros((rows, 20))
    N = 2
    for i in range(rows):
        index = []
        j = 1
        while j <= N:
            # clamp the neighbouring frame indices i-j and i+j at the boundaries
            if i - j < 0:
                first = 0
            else:
                first = i - j
            if i + j > rows - 1:
                second = rows - 1
            else:
                second = i + j
            index.append((second, first))
            j += 1
        # delta formula with N = 2: ((c[i+1]-c[i-1]) + 2*(c[i+2]-c[i-2])) / 10
        deltas[i] = (array[index[0][0]] - array[index[0][1]]
                     + 2 * (array[index[1][0]] - array[index[1][1]])) / 10
    return deltas

def extract_features(audio, rate):
    """Extract 20-dim MFCC features from an audio signal, apply per-feature
    mean/variance normalisation (a form of cepstral mean subtraction) and
    append deltas to make a 40-dim feature vector"""

    # 25 ms window, 10 ms step, 20 cepstral coefficients, energy appended
    mfcc_feat = mfcc.mfcc(audio, rate, 0.025, 0.01, 20, appendEnergy=True)
    mfcc_feat = preprocessing.scale(mfcc_feat)
    delta = calculate_delta(mfcc_feat)
    combined = np.hstack((mfcc_feat, delta))
    return combined
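
As a quick sanity check, the short sketch below (which assumes a 16 kHz mono WAV file, here hypothetically named sample.wav, is available in the working directory) runs extract_features on one recording and prints the shape of the resulting matrix, which should be (number_of_frames, 40).

from scipy.io.wavfile import read

# hypothetical file name -- use any 16 kHz mono WAV from the training corpus
sr, audio = read("sample.wav")
features = extract_features(audio, sr)
print("feature matrix shape: %s" % (features.shape,))   # expected: (num_frames, 40)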

2. Training Speaker Models.

As we know, there are 34 distinct speakers in the training corpus, taken from the many speakers provided by VoxForge. The paths of all the audio files (5 per speaker) utilized for training are given in this file. Usually there is a very important pre-processing step, known as voice activity detection (VAD), which includes noise removal and silence truncation from the audio. I have assumed that there is no requirement of performing VAD here.

In order to build a speaker identification system from the features extracted above, we now need to model each speaker independently. We employ GMMs for this task.

A Gaussian mixture model is a probabilistic clustering model for representing the presence of sub-populations within an overall population. The idea of training a GMM is to approximate the probability distribution of a class by a linear combination of ‘k’ Gaussian distributions/clusters, also called the components of the GMM. The likelihood of data points (feature vectors) for a model is given by the following equation:

P(X|\lambda) = \sum_{k=1}^{K} w_k P_k(X|\mu_k, \Sigma_k)
, where P_k(X|\mu_k, \Sigma_k) is the Gaussian distribution

   P_k(X|\mu_k,\Sigma_k) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}} \thinspace e^{-\frac{1}{2}(X-\mu_k)^T \Sigma_k^{-1}(X-\mu_k)} , where d is the dimensionality of the feature vector X.

The training data X_i of class \lambda are used to estimate the parameters of these k components: the means \mu_k , the covariance matrices \Sigma_k and the weights w_k .

Initially, the algorithm identifies clusters in the data with the K-means algorithm and assigns equal weight w = \frac{1}{k} to each cluster. ‘k’ Gaussian distributions are then fitted to these k clusters. The parameters \mu , \Sigma and w of all the clusters are updated iteratively until convergence. The most popular method for this estimation is the Expectation Maximization (EM) algorithm.
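
To make this concrete, here is a minimal sketch (note: it uses the newer sklearn.mixture.GaussianMixture API rather than the older GMM class used in the training script below) that fits a 2-component mixture to toy 2-D data and prints the weights, means and covariances estimated by EM.

import numpy as np
from sklearn.mixture import GaussianMixture

# two synthetic clusters standing in for two sub-populations
rng = np.random.RandomState(0)
X = np.vstack((rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
               rng.normal(loc=5.0, scale=0.5, size=(200, 2))))

# fit a 2-component GMM with diagonal covariances via EM
gmm = GaussianMixture(n_components=2, covariance_type='diag', max_iter=200)
gmm.fit(X)

print("weights: %s" % gmm.weights_)          # mixing weights w_k
print("means: %s" % gmm.means_)              # component means mu_k
print("covariances: %s" % gmm.covariances_)  # diagonal covariances Sigma_k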

We use Python’s sklearn.mixture package to learn a GMM from the feature matrix containing the 40-dimensional MFCC and delta-MFCC features. More about sklearn GMMs can be read in section 3 of our previous post ‘Voice Gender Detection‘. The following Python code trains the GMM speaker models with 16 components: it iterates over the enrolment list in train_file (a text file containing the paths to all the training audios) and trains and dumps one model every time the 5 files of a speaker have been accumulated. Also, you have to create a “speaker_models” directory where all the models will be dumped after training.

import cPickle
import numpy as np
from scipy.io.wavfile import read
from sklearn.mixture import GMM
from speakerfeatures import extract_features
import warnings
warnings.filterwarnings("ignore")

#path to training data
source   = "development_set\\"   

#path where training speakers will be saved
dest = "speaker_models\\"
train_file = "development_set_enroll.txt"
file_paths = open(train_file,'r')

count = 1
# Extracting features for each speaker (5 files per speaker)
features = np.asarray(())
for path in file_paths:
    path = path.strip()
    print path

    # read the audio
    sr,audio = read(source + path)

    # extract 40 dimensional MFCC & delta MFCC features
    vector   = extract_features(audio,sr)

    if features.size == 0:
        features = vector
    else:
        features = np.vstack((features, vector))
    # when features of 5 files of speaker are concatenated, then do model training
    if count == 5:
        gmm = GMM(n_components = 16, n_iter = 200, covariance_type='diag',n_init = 3)
        gmm.fit(features)

        # dumping the trained gaussian model
        picklefile = path.split("-")[0]+".gmm"
        cPickle.dump(gmm,open(dest + picklefile,'w'))
        print '+ modeling completed for speaker:',picklefile," with data point = ",features.shape
        features = np.asarray(())
        count = 0
    count = count + 1
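
Note that the training script above targets Python 2 and the old sklearn.mixture.GMM class, which was removed in scikit-learn 0.20. If you are on Python 3 with a recent scikit-learn, a roughly equivalent sketch of the model-fitting and dumping step would look like the snippet below (n_iter becomes max_iter, and the pickle file should be opened in binary mode).

import pickle
from sklearn.mixture import GaussianMixture

# roughly equivalent to GMM(n_components=16, n_iter=200, covariance_type='diag', n_init=3)
gmm = GaussianMixture(n_components=16, max_iter=200, covariance_type='diag', n_init=3)
gmm.fit(features)

# dump the trained model in binary mode
with open(dest + picklefile, 'wb') as f:
    pickle.dump(gmm, f)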

3. Evaluating Performance on Test set.

The test set consists of 5 unseen utterances from each of the 34 trained speakers. The paths of all the audio files (5 per speaker) utilized for evaluation are given in this file.

Upon arrival of a test voice sample for speaker identification, we begin by extracting its 40-dimensional features, with a 25 ms frame size and a 10 ms shift between frames. Next, we require the log-likelihood score of each frame of the sample, x_1, x_2, ... ,x_i , under each speaker model, i.e., P(x_i|S_j) for every speaker S_j. The likelihood of a frame under a particular speaker is calculated by substituting the \mu and \Sigma of that speaker’s GMM into the likelihood equation shown in the previous section. This is done for each of the ‘k’ Gaussian components in the model, and the weighted sum of the ‘k’ component likelihoods is taken according to the weight ‘w’ parameters of the model. Taking the logarithm of the obtained sum gives the log-likelihood value for the frame. This is repeated for all the frames of the sample and the log-likelihoods of all the frames are added. The speaker model with the highest total score is considered the identified speaker.

The Python code given below predicts the speaker of the test audio.

import os
import cPickle
import numpy as np
from scipy.io.wavfile import read
from speakerfeatures import extract_features
import warnings
warnings.filterwarnings("ignore")
import time

#path to test data
source   = "development_set\\"
modelpath = "speaker_models\\"
test_file = "development_set_test.txt"
file_paths = open(test_file,'r')

gmm_files = [os.path.join(modelpath,fname) for fname in
              os.listdir(modelpath) if fname.endswith('.gmm')]

#Load the trained speaker GMM models
models    = [cPickle.load(open(fname,'r')) for fname in gmm_files]
speakers   = [fname.split("\\")[-1].split(".gmm")[0] for fname
              in gmm_files]

# Read the test directory and get the list of test audio files
for path in file_paths:   

    path = path.strip()
    print path
    sr,audio = read(source + path)
    vector   = extract_features(audio,sr)

    log_likelihood = np.zeros(len(models)) 

    for i in range(len(models)):
        gmm    = models[i]  #checking with each model one by one
        scores = np.array(gmm.score(vector))
        log_likelihood[i] = scores.sum()

    winner = np.argmax(log_likelihood)
    print "\tdetected as - ", speakers[winner]
    time.sleep(1.0)
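
The script above only prints the predicted speaker for each test utterance. To reproduce the accuracy figure reported below, the predictions can be compared against the ground-truth speaker; here is a minimal sketch, assuming (as the training script does) that the speaker name is the part of each file path before the first '-'.

# sketch of an accuracy computation; assumes the file path encodes the speaker
# as everything before the first '-' (the convention used at training time)
correct = 0
total = 0
for path in open(test_file, 'r'):
    path = path.strip()
    sr, audio = read(source + path)
    vector = extract_features(audio, sr)
    log_likelihood = np.array([models[i].score(vector).sum() for i in range(len(models))])
    predicted = speakers[np.argmax(log_likelihood)]
    correct += int(predicted == path.split("-")[0])
    total += 1
print("identification accuracy: %.2f%%" % (100.0 * correct / total))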

Results and Conclusion

This beginner’s approach achieves an in-set accuracy of 100%, identifying all 170 test utterances correctly. There are a few reasons for such a perfect result:

  1. The unseen utterances of the speakers taken from VoxForge were possibly recorded over the same channel or in the same environment.
  2. The evaluation is performed on a small dataset. Consider a data inflow where you receive perhaps thousands of calls a day.
  3. Consider the situation where we have to identify speakers from a set of 1000 speakers.
  4. In this evaluation, we have not taken out-of-set speakers into account, i.e., if the audio is not from any trained speaker, our system will still identify it as one of the speakers in the trained set, based on the highest likelihood.
  5. In a real environment, we may get noisier and less clean data; a speaker identification system needs to be robust to this.

We hope this blog post was successful in explaining a basic approach to the speaker identification task, and we encourage you to reproduce the results posted here. Remember, this is not the end; I hope it forms the background for further research on this task. For more effective techniques, the references for the speaker identification task are provided below.

[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, M.I.T. Lincoln Laboratory, 2000

[2] Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, 2010

The full implementation of the followed approach for training and evaluation of speaker identification from voice can be downloaded from the GitHub link here. Also remember to download the data-set provided at the beginning of the blog-post.

If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually benefit from it. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.

Happy machine learning 🙂