Similar to our previous post “Voice Gender Detection“, this blog-post focuses on a beginner’s method to answer the question ‘who is the speaker’ in a speech file. Recently, a lot of voice biometric systems have been developed that can extract speaker information from recorded voice and identify the speaker from a set of trained speakers in a database. In this blog-post, we will illustrate the same with a naive approach using Gaussian Mixture Models (GMMs). There are other conventional as well as modern approaches that are more robust to channel noise and also perform better than the approach followed in this blog-post.
- GMM-UBM (Gaussian Mixture Model – Universal Background Model) with MAP (Maximum A Posteriori) adaptation [1] is one of the most successful conventional techniques for speaker identification.
- I-vector based speaker identification [2] is the state-of-the-art technique implemented in many voice biometric products.
As a beginner, the above-mentioned techniques may overwhelm you, as they are mathematically complex and require some research effort to comprehend. Therefore, I am not following either of those two approaches. Instead, I want to show you the implementation of the fundamental step of speaker identification (using GMMs), which can then lead to the development of the GMM-UBM or i-vector approach.
Data-sets: The data-sets below can be downloaded from here.
- Training corpus: It has been built from audios taken from the on-line VoxForge speech database and consists of 5 speech utterances for each speaker, spoken by 34 speakers (i.e., 20-30 seconds per speaker).
- Test corpus: This consists of 5 further unseen utterances from the same 34 speakers used in the training corpus. All audio files are 10 seconds long and sampled at 16000 Hz.
I strongly recommend reading our previous post ‘Voice Gender Detection’, as it gives a brief primer on how to work with speech signals. We have also previously discussed extracting a popular speech feature, Mel Frequency Cepstral Coefficients (MFCCs). A GMM will take as input the MFCCs and the derivatives of the MFCCs of a speaker’s training samples and will try to learn their distribution, which will be representative of that speaker. A typical speaker identification process is shown in the flow diagram below.
At test time, when the speaker of a new voice sample is to be identified, the 40-dimensional features (MFCCs + delta MFCCs) of the sample are first extracted, and then the trained speaker GMMs are used to compute the score of the features against each model. The speaker model with the maximum score is predicted as the speaker of the test speech. Having said that, we will go through the Python implementation of the following steps:
- 40-Dimensional Feature Extraction
- Training Speaker Models
- Evaluating Performance on Test Set
Let’s get started!
1. Feature Extraction.
We extract 40-dimensional features from speech frames: 20 MFCC features and 20 derivatives of the MFCC features. The derivatives of the MFCCs capture the dynamics of the MFCCs over time. It turns out that calculating the delta-MFCCs and appending them to the original MFCC features (20-dimensional) improves performance in a lot of speech analytics applications. To calculate the delta features from the MFCCs, we apply the following equation:

$$d_t = \frac{\sum_{n=1}^{N} n\,(c_{t+n} - c_{t-n})}{2\sum_{n=1}^{N} n^2}$$

where $d_t$ is the delta vector for frame $t$, $c_t$ is the MFCC vector of frame $t$, and ‘N’ is the number of frames the deltas are summed over, typically taken as 2.
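With N = 2 this works out to

$$d_t = \frac{(c_{t+1} - c_{t-1}) + 2\,(c_{t+2} - c_{t-2})}{2\,(1^2 + 2^2)} = \frac{(c_{t+1} - c_{t-1}) + 2\,(c_{t+2} - c_{t-2})}{10},$$

which is exactly the division by 10 in the helper function below (frames beyond the edges of the utterance are clamped to the first or last frame).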
The Python functions below extract MFCC features and derive the delta coefficients from an audio signal.
import numpy as np
from sklearn import preprocessing
import python_speech_features as mfcc

def calculate_delta(array):
    """Calculate and return the deltas of the given feature vector matrix"""
    rows, cols = array.shape
    deltas = np.zeros((rows, 20))
    N = 2
    for i in range(rows):
        index = []
        j = 1
        while j <= N:
            if i - j < 0:
                first = 0
            else:
                first = i - j
            if i + j > rows - 1:
                second = rows - 1
            else:
                second = i + j
            index.append((second, first))
            j += 1
        deltas[i] = (array[index[0][0]] - array[index[0][1]]
                     + (2 * (array[index[1][0]] - array[index[1][1]]))) / 10
    return deltas

def extract_features(audio, rate):
    """Extract 20-dim MFCC features from an audio signal, perform CMS
    and append delta features to make a 40-dim feature vector"""
    mfcc_feat = mfcc.mfcc(audio, rate, 0.025, 0.01, 20, appendEnergy=True)
    mfcc_feat = preprocessing.scale(mfcc_feat)
    delta = calculate_delta(mfcc_feat)
    combined = np.hstack((mfcc_feat, delta))
    return combined
2. Training Speaker Models.
As we know, there are 34 distinct speakers in the training corpus, taken from the many speakers provided by VoxForge. The paths of all the audio files (5 per speaker) used for training are given in this file. Usually there is a very important pre-processing step, known as voice activity detection (VAD), which includes noise removal and silence truncation of the audios. I have assumed that there is no need to perform VAD here; a minimal sketch of such a step is shown below for reference.
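This is only an illustration of the silence-truncation idea, not part of the accompanying code: a crude, hypothetical energy-based trimmer that keeps frames whose energy exceeds a fixed threshold.

import numpy as np

def trim_silence(audio, rate, frame_ms=25, energy_threshold=0.01):
    """Crude stand-in for VAD: keep only frames of a mono signal whose
    mean energy exceeds a fixed threshold."""
    frame_len = int(rate * frame_ms / 1000)
    signal = audio.astype(np.float64)
    signal = signal / (np.max(np.abs(signal)) + 1e-10)    # normalize to [-1, 1]
    voiced = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        if np.mean(frame ** 2) > energy_threshold:         # keep energetic frames
            voiced.append(frame)
    return np.concatenate(voiced) if voiced else signal

A proper VAD (e.g. a trained speech/non-speech classifier) is more robust than a fixed threshold, but this conveys the idea.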
In order to build a speaker identification system from the extracted features, we now need to model each speaker independently. We employ GMMs for this task.
A Gaussian mixture model is a probabilistic clustering model for representing the presence of sub-populations within an overall population. The idea of training a GMM is to approximate the probability distribution of a class by a linear combination of ‘k’ Gaussian distributions/clusters, also called the components of the GMM. The likelihood of a data point (feature vector) $x$ under a model $\lambda$ is given by the following equation:

$$p(x \mid \lambda) = \sum_{i=1}^{k} w_i \, g(x \mid \mu_i, \Sigma_i)$$

where $g(x \mid \mu_i, \Sigma_i)$ is the Gaussian distribution with mean $\mu_i$ and covariance matrix $\Sigma_i$, and the $w_i$ are the mixture weights (summing to 1).
The training data of the class are used to estimate the parameters of these k components: the means $\mu_i$, the covariance matrices $\Sigma_i$ and the weights $w_i$.
Initially, k clusters are identified in the data by the K-means algorithm and equal weight is assigned to each cluster. k Gaussian distributions are then fitted to these k clusters. The parameters $\mu_i$, $\Sigma_i$ and $w_i$ of all the clusters are updated iteratively until they converge. The most popular method for this estimation is the Expectation Maximization (EM) algorithm.
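For reference (not spelled out in the original post), one EM iteration consists of an E-step that computes each component's responsibility for each frame, followed by an M-step that re-estimates the parameters from those responsibilities:

$$\gamma_{t,i} = \frac{w_i \, g(x_t \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{k} w_j \, g(x_t \mid \mu_j, \Sigma_j)}$$

$$w_i = \frac{1}{T}\sum_{t=1}^{T}\gamma_{t,i}, \qquad \mu_i = \frac{\sum_{t}\gamma_{t,i}\, x_t}{\sum_{t}\gamma_{t,i}}, \qquad \Sigma_i = \frac{\sum_{t}\gamma_{t,i}\,(x_t-\mu_i)(x_t-\mu_i)^{\top}}{\sum_{t}\gamma_{t,i}}$$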
Python’s sklearn.mixture package is used to learn a GMM from the feature matrix containing the 40-dimensional MFCC and delta-MFCC features. More about sklearn GMMs can be read in section 3 of our previous post ‘Voice Gender Detection‘. The following Python code trains a GMM speaker model with 16 components for each speaker; train_file is a variable holding the name of a text file that lists the paths to all the training audios of the respective speakers. Also, you have to create a “speaker_models” directory where all the models will be dumped after training.
import cPickle
import numpy as np
from scipy.io.wavfile import read
from sklearn.mixture import GMM
from speakerfeatures import extract_features
import warnings
warnings.filterwarnings("ignore")

#path to training data
source = "development_set\\"
#path where training speakers will be saved
dest = "speaker_models\\"
train_file = "development_set_enroll.txt"
file_paths = open(train_file,'r')

count = 1
# Extracting features for each speaker (5 files per speakers)
features = np.asarray(())
for path in file_paths:
    path = path.strip()
    print path

    # read the audio
    sr,audio = read(source + path)

    # extract 40 dimensional MFCC & delta MFCC features
    vector = extract_features(audio,sr)

    if features.size == 0:
        features = vector
    else:
        features = np.vstack((features, vector))

    # when features of 5 files of speaker are concatenated, then do model training
    if count == 5:
        gmm = GMM(n_components = 16, n_iter = 200, covariance_type='diag', n_init = 3)
        gmm.fit(features)

        # dumping the trained gaussian model
        picklefile = path.split("-")[0] + ".gmm"
        cPickle.dump(gmm, open(dest + picklefile,'w'))
        print '+ modeling completed for speaker:', picklefile, " with data point = ", features.shape
        features = np.asarray(())
        count = 0
    count = count + 1
3. Evaluating Performance on Test set.
The test set consists of 5 unseen utterances from each of the 34 trained speakers. The paths of all the audio files (5 per speaker) used for evaluation are given in this file.
When a test voice sample arrives for speaker identification, we begin by extracting its 40-dimensional features, using a 25 ms frame size and a 10 ms shift between frames. Next, the log-likelihood of each frame of the sample has to be calculated for each speaker model $\lambda_j$ (for all speakers $j$ in the set $S$). The likelihood of a frame coming from a particular speaker is calculated by substituting the $\mu$ and $\Sigma$ of that speaker’s GMM into the likelihood equation shown in the previous section. This is done for each of the ‘k’ Gaussian components in the model, and the weighted sum of the ‘k’ component likelihoods is taken according to the weight parameters $w$ of the model. Taking the logarithm of the resulting sum gives the log-likelihood of the frame. This is repeated for all frames of the sample and the log-likelihoods of all frames are added. The speaker model with the highest total likelihood score is taken as the identified speaker.
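In symbols, for test frames $x_1, \dots, x_T$ and speaker model $\lambda_j = \{w_i, \mu_i, \Sigma_i\}$, the score described above is

$$\log p(X \mid \lambda_j) = \sum_{t=1}^{T} \log \sum_{i=1}^{k} w_i \, g(x_t \mid \mu_i, \Sigma_i),$$

and the identified speaker is $\hat{s} = \arg\max_{j \in S} \log p(X \mid \lambda_j)$.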
The Python code given below predicts the speaker of the test audio.
import os
import cPickle
import numpy as np
from scipy.io.wavfile import read
from speakerfeatures import extract_features
import warnings
warnings.filterwarnings("ignore")
import time

#path to training data
source = "development_set\\"
modelpath = "speaker_models\\"
test_file = "development_set_test.txt"
file_paths = open(test_file,'r')

gmm_files = [os.path.join(modelpath,fname) for fname in
             os.listdir(modelpath) if fname.endswith('.gmm')]

#Load the Gaussian speaker models
models = [cPickle.load(open(fname,'r')) for fname in gmm_files]
speakers = [fname.split("\\")[-1].split(".gmm")[0] for fname in gmm_files]

# Read the test directory and get the list of test audio files
for path in file_paths:
    path = path.strip()
    print path
    sr,audio = read(source + path)
    vector = extract_features(audio,sr)

    log_likelihood = np.zeros(len(models))
    for i in range(len(models)):
        gmm = models[i]  #checking with each model one by one
        scores = np.array(gmm.score(vector))
        log_likelihood[i] = scores.sum()

    winner = np.argmax(log_likelihood)
    print "\tdetected as - ", speakers[winner]
    time.sleep(1.0)
Results and Conclusion
This beginner’s approach achieves an in-set accuracy of 100%, identifying all 170 test utterances correctly. There are a few reasons for such a perfect result:
- The unseen utterances of the speakers taken from VoxForge were probably recorded over the same channel and in the same environment.
- The evaluation is performed on a small dataset. Consider a data inflow where you receive perhaps thousands of calls in a day.
- Consider the situation where we have to identify speakers from a set of 1000 speakers.
- In this evaluation, we have not taken out-of-set speakers into account, i.e., if the audio is not from any trained speaker, our system will still identify it as one of the speakers in the trained set, based on the highest likelihood.
- In a real environment, we may get noisier and less clean data. A speaker identification system needs to be robust to this.
We hope this blog post was successful in explaining a basic approach to the speaker identification task, and we encourage you to reproduce the results posted here. Remember, this is not the end; I hope it forms the background for further research on this task. For more effective techniques, the references for speaker identification are provided below.
[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, M.I.T. Lincoln Laboratory, 2000.
[2] Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, 2010.
The full implementation of the approach followed here for training and evaluation of speaker identification from voice can be downloaded from the GitHub link here. Also remember to download the data-set provided at the beginning of the blog-post.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach readers who can actually gain from it. Please feel free to discuss anything regarding the post. I would love to hear your feedback.
Happy machine learning 🙂
In the above code, indentation is missing in the feature extraction part. Can you send me the code with indentation? It would help me a lot.
Thank you
Hi venkat,
You can get all the python codes from my GitHub account.
https://github.com/abhijeet3922/Speaker-identification-using-GMMs
I would suggest you go through the blog once. The GitHub link for all the code in each of my blog-posts is given towards the end of the blog.
thank you very much for the reply 🙂
Sir, can you provide the same for voice disorder identification?
What do you mean by voice disorder identification?
Hi Abhijeet, thanks for a wonderful post. Please suggest a complete voice/sound tutorial or book.
Hi, I have a question for the code
mfcc.mfcc(audio,rate, 0.025, 0.01,20,appendEnergy = True)
This should return an array of n rows and 20 columns, each row corresponds to a frame. If so, what did you do to handle audio clips with variable lengths?
I liked this. I’m trying this with my local files. However, it gives me a warning saying that:
WARNING:root:frame length (1103) is greater than FFT size (512), frame will be truncated. Increase NFFT to avoid.
+ modeling completed for speaker: Tilak\wav\TIlak1.gmm with data point = (598, 40)
Hey Tilak, I had the same warning (my frame length of 551 was greater than the FFT size of 512), so the trained model was not producing good results. However, as the warning suggested, I increased the NFFT value from 512 to 1024 (it should be a power of 2; try searching Google for more information) by changing the mfcc call in the extract_features function. It is set to 512 by default. Hope it works.
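For readers hitting the same warning: the change described above amounts to passing a larger nfft when calling python_speech_features inside extract_features (nfft is a keyword argument of mfcc.mfcc, defaulting to 512), for example:

# inside extract_features(): use an FFT size that covers the frame length
mfcc_feat = mfcc.mfcc(audio, rate, 0.025, 0.01, 20,
                      nfft=1024, appendEnergy=True)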
Hi, I tried to run the program, but I am getting the same error and I don’t know how to increase the NFFT value. Can anyone help?
Thanks in advance
You can refer to python_speech_features’ documentation.
The default nfft size is 512.
https://python-speech-features.readthedocs.io/en/latest/
Sir, I have a problem with the installation of the cPickle module. Please can you help me with that installation?
Probably you are using Python 3; there is no cPickle there.
Kindly follow this and Google your problem.
https://askubuntu.com/questions/742782/how-to-install-cpickle-on-python-3-4
Hey, this blog is quite helpful.
Is there any way to declare “unknown” for silence or noise (i.e., anything apart from the trained models), e.g. some threshold for declaring a speaker match, etc.?
Silence: you can check the energy and discard such audio early, as there would be no voice and the energy would be low.
Noise: train an SVM/GMM discriminator (noise vs. speech) on acoustic features like MFCCs.
You may like to do a little research on these.
I am having the issue of ImportError: cannot import name GMM. What could be the issue? I am using Python 2.7 and sklearn version 0.20.2, and I’m importing GMM with: from sklearn.mixture import GaussianMixture as GMM
That’s true. The scikit-learn API has probably changed since this code was written. Check out where and how GMM is implemented now in scikit-learn!
Do write it here if you figure it out. It may help others.
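For anyone who does figure it out: in newer scikit-learn the old GMM class was replaced by GaussianMixture, so a rough equivalent of the calls used in this post (n_iter became max_iter, and per-frame log-likelihoods come from score_samples rather than score) would be:

from sklearn.mixture import GaussianMixture

# training: roughly equivalent to GMM(n_components=16, n_iter=200, ...) above
gmm = GaussianMixture(n_components=16, max_iter=200,
                      covariance_type='diag', n_init=3)
gmm.fit(features)

# testing: per-frame log-likelihoods, summed over the utterance
log_likelihood = gmm.score_samples(vector).sum()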
What are the ‘N’ components, taken as 16 here? Is it the same as the number of speakers? I wanted to understand why 16 specifically.
GMM is a Gaussian Mixture Model.
N is the number of mixture components. Basically, a GMM is a mixture of Gaussians, so ‘N’ tells you how many Gaussians have to be fitted to the data.
If the data is huge, we may require N = 32, 64 or 128 in order to capture all the variability.
Thanks.
Thank you, Abhijeet Kumar, for creating such an important blog. I am facing a problem: I don’t know how GMM-UBM is implemented in speaker recognition. Can you help me, please? It is very important to me.
Hi,
GMM-UBM systems can be implemented in Python by implementing the research paper given in reference [1].
The technique is as follows:
1. Train a large universal GMM model on 1000 speakers. This is called the UBM.
2. Further, each target speaker model is adapted from the means and covariances of the trained UBM. Thus you will get an adapted GMM model for each speaker.
I would strongly recommend reading the paper in reference [1]. After you understand it, you may like to modify the scikit-learn GMM library to make it an adapted-GMM implementation (or you may search and find an open-source implementation of the same).
Thanks,
Abhijeet
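As a rough illustration of step 2 above, a sketch of mean-only MAP adaptation in the style of reference [1], assuming the UBM is a scikit-learn GaussianMixture (this is not code from the post):

import numpy as np

def map_adapt_means(ubm, features, relevance_factor=16.0):
    """Adapt the UBM means towards a target speaker's features
    (mean-only MAP adaptation, as in reference [1])."""
    resp = ubm.predict_proba(features)                  # (frames, components) responsibilities
    n_k = resp.sum(axis=0)                              # soft count per component
    e_k = resp.T.dot(features) / np.maximum(n_k[:, None], 1e-10)   # first-order statistics
    alpha = n_k / (n_k + relevance_factor)              # adaptation coefficient per component
    return alpha[:, None] * e_k + (1.0 - alpha[:, None]) * ubm.means_

A test utterance is then typically scored as a likelihood ratio between the adapted speaker model and the UBM, which is also what makes it possible to reject out-of-set speakers with a threshold.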
Thank you so much. I really appreciate your work.
I am Glad !!
Hi Abhijeet Kumar, I appreciate your work, but I have a query: when I test an input file which is from the training data set, the model works well, but it gives a false result for data which is not in the training data set. Why is this so?
Hi Prashu,
That is true. So the question is: how does it classify? It checks the Gaussian probability with all the speaker models and returns the speaker whose model gives the highest probability/likelihood. It will always give you the name of the speaker with the highest probability, and there is no way to tell whether the sample is out of set.
There are ways, such as the GMM-UBM method mentioned in the comments above, which can filter out unknowns from the target speakers in the database. Kindly go through the GMM-UBM method in order to understand it.
I am not talking about an unknown speaker, but about the same trained speaker with a different voice sample that was not available at training time. In my case, I test real-time audio from the mic, so obviously that audio file would not be present in the training dataset. Hope you get me.
What was the source of the training data for your speaker models? The channel has an effect. This basic technique will work if the channel is the same.
For the training data set I use Recorder.ipynb, which has channel 2, and run it every time with a different file name, 6 recordings for every speaker. Then I train the model as you illustrate above.
After that, the model works fine for the stored audio files, but for real-time testing I use real-time_test.ipynb, which first records a file from the mic and then tests it right after.
But this shows a false result.
This is my final year project; I am very grateful to you.
Check my work:
https://github.com/prashu22/speaker_reco
Hi,
I am getting a file not found error, even though I have given the development set and training model files correctly. Here is the error I am getting.
anthonyschaller-20071221-\wav\a0491.wav
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in ()
30
31 # read the audio
---> 32 sr,audio = read(source + path)
33
34 # extract 40 dimensional MFCC & delta MFCC features
~/anaconda3/lib/python3.6/site-packages/scipy/io/wavfile.py in read(filename, mmap)
231 mmap = False
232 else:
--> 233 fid = open(filename, 'rb')
234
235 try:
FileNotFoundError: [Errno 2] No such file or directory: 'development_set\\anthonyschaller-20071221-\\wav\\a0491.wav'
Kindly check the directory path properly. Hope you can debug it on your own.
Hi,
Can I use Python 3.6 with this code? If not, what can I do, because my project uses 3.6?
You can use 3.6 also. It’s just code, man!
Hi,
Thank you for this helpful tutorial
I am doing a voice authentication project where a user can log in by recording his voice.
So what I do is have the user submit an enrollment and extract its features,
but I am struggling with the part where I should compare the previously submitted enrollment with the voice recording he makes to log in.
Can you explain this part more?
# Read the test directory and get the list of test audio files
for path in file_paths:
    path = path.strip()
    print path
    sr,audio = read(source + path)
    vector = extract_features(audio,sr)
    log_likelihood = np.zeros(len(models))
    for i in range(len(models)):
        gmm = models[i]  #checking with each model one by one
        scores = np.array(gmm.score(vector))
        log_likelihood[i] = scores.sum()
    winner = np.argmax(log_likelihood)
    print "\tdetected as - ", speakers[winner]
    time.sleep(1.0)
Best regards,
I need GMM model python source code for accent variation detection. Could you please provide that?
Look at the package “pyAudioAnalysis”. You may find something useful in its segmentation or speaker diarization parts.
I am using Python 3.6, and this is the error I am getting:
runfile('D:/pyspeaker/test_speaker.py', wdir='D:/pyspeaker')
Traceback (most recent call last):
File "", line 1, in
runfile('D:/pyspeaker/test_speaker.py', wdir='D:/pyspeaker')
File "D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "D:\Anacondanew\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "D:/pyspeaker/test_speaker.py", line 28, in
models = [cPickle.load(open(fname,'r',errors='ignore'))]
TypeError: a bytes-like object is required, not 'str'
Seems you have not trained the speaker models as it could not find pickled models.
If multiple people are talking in a clip, can the model detect which speaker said what?
Hey, speakerfeatures and extract_features show up as unknown packages when I try to install them.
Can you help me out with the appropriate names of the packages to be installed for speakerfeatures?
I have also tried to import the extract_features function, which again shows up as unknown.
Hi Abhijeet, thanks for this tutorial. I am encountering the following error while using the extract_features function.
TypeError Traceback (most recent call last)
in ()
----> 1 a = extract_features('sp1.wav',16000)
2 a
3 frames
/usr/local/lib/python2.7/dist-packages/python_speech_features/sigproc.pyc in preemphasis(signal, coeff)
116 :returns: the filtered signal.
117 """
--> 118 return numpy.append(signal[0],signal[1:]-coeff*signal[:-1])
119
120
TypeError: can't multiply sequence by non-int of type 'float'
What could be the reason for this? I was trying to extract features from a sample audio file sampled at 16 kHz. Other than that, I didn't change anything.
Resolved.
Hi, thank you for your helpful tutorial. It is very clear to me!
I am wondering about taking out-of-set speakers into account, i.e., if the audio is not from any speaker in the trained models, how can the out-of-set speaker be recognized as an impostor, or how should the threshold be set? I hope you can give me some ideas. Thank you very much.
Hi Yan,
The question is: how does it recognize? It checks the Gaussian probability with all the speaker models and returns the speaker whose model gives the highest probability/likelihood. It will always give you the name of the speaker with the highest probability, and there is no way to tell whether the sample is out of set.
There are ways, such as the GMM-UBM method mentioned in the comments above, which can filter out unknowns from the target speakers in the database. Kindly go through the GMM-UBM method in order to understand it.
GMM-UBM systems can be implemented in Python by implementing the research paper given in reference [1].
The technique is as follows:
1. Train a large universal GMM model on 1000 speakers. This is called the UBM.
2. Further, each target speaker model is adapted from the means and covariances of the trained UBM. Thus you will get an adapted GMM model for each speaker.
I would strongly recommend reading the paper in reference [1]. After you understand it, you may like to modify the scikit-learn GMM library to make it an adapted-GMM implementation (or you may search and find an open-source implementation of the same).
Thanks,
Abhijeet
Hi Abhijeet,
Thanks very much for your reply. I am going to learn the GMM-UBM model. Thank you!
index 299 is out of bounds for axis 0 with size 299
How can we find out the accuracy of the above model?
Thank you so much sir. It will be useful for my research.
Regards
Arichandran R
Thanks for sharing! You’ve done a great job for those who want to begin with speech recognition, like me 🙂
I’ve done a little refactoring for Python 3 and put it into classes, so it’s more useful.
Thanks man!