Similar to our previous post “Voice Gender Detection”, this blog post presents a beginner’s method for answering the question ‘who is the speaker?’ in a speech file. Recently, many voice biometric systems have been developed that can extract speaker information from recorded voice and identify the speaker from a set of trained speakers in a database. In this blog post, we will illustrate the same with a naive approach using Gaussian Mixture Models (GMMs). There are other conventional as well as modern approaches that are more robust to channel noise and also perform better than the approach followed in this blog post:
- GMM-UBM (Gaussian Mixture Model – Universal Background Model) using MAP (Maximum A Posteriori) adaptation [1] is one of the successful conventional techniques for implementing speaker identification.
- I-vector based speaker identification [2] is the state-of-the-art technique implemented in many voice biometric products.
As a beginner, the above-mentioned techniques may overwhelm you, as they are mathematically complex and require some research effort to comprehend. Therefore, I am not following either of the two approaches. Instead, I am interested in showing you the implementation of the fundamental step of speaker identification (using GMMs), which can then lead to the development of a GMM-UBM or I-vector approach.
Data-sets: The data-sets below can be downloaded from here.
- Training corpus: It has been developed from audio taken from the ‘on-line VoxForge speech database’ and consists of 5 speech utterances for each of 34 speakers (i.e., 20-30 seconds/speaker).
- Test corpus: This consists of the remaining 5 unseen utterances of the same 34 speakers used in the training corpus. All audio files are 10 seconds long and are sampled at 16000 Hz.
I strongly recommend reading our previous post ‘Voice Gender Detection’, as it gives a brief primer on how to work with speech signals. We also discussed there how to extract a popular speech feature, Mel Frequency Cepstral Coefficients (MFCCs). A GMM takes as input the MFCCs and the derivatives of the MFCCs of a speaker's training samples and tries to learn their distribution, which will be representative of that speaker. A typical speaker identification process can be shown by the flow diagram below.
During testing, when the speaker of a new voice sample is to be identified, the 40-dimensional features (MFCCs + delta MFCCs) of the sample are extracted first, and then the trained speaker GMMs are used to calculate the scores of the features for all the models. The speaker model with the maximum score is predicted as the speaker of the test speech. Having said that, we will go through the Python implementation of the following steps:
- 40-Dimensional Feature Extraction
- Training Speaker Models.
- Evaluating Performance on test set
Let's get started!
1. Feature Extraction.
We extract 40-dimensional features from speech frames: 20 MFCC features and 20 derivatives of the MFCC features. The derivatives of the MFCCs provide information about the dynamics of the MFCCs over time. It turns out that calculating the delta-MFCCs and appending them to the original (20-dimensional) MFCC features increases performance in many speech analytics applications. To calculate delta features from MFCCs, we apply the following equation:

$$d_t = \frac{\sum_{n=1}^{N} n \, (c_{t+n} - c_{t-n})}{2 \sum_{n=1}^{N} n^2}$$

where $d_t$ is the delta coefficient of frame $t$, $c_{t-n}$ and $c_{t+n}$ are the MFCC vectors of the neighbouring frames, and ‘N’ is the number of deltas summed over, typically taken as 2 (which makes the denominator equal to 10).
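As a quick numerical check of the delta formula with N = 2, here is a minimal NumPy sketch. The function name `delta_features` is made up for this illustration, and the edge handling (re-using the first and last frames when t-n or t+n falls outside the signal) is an assumption that mirrors the clamping in this post's code; other conventions exist.

```python
import numpy as np

def delta_features(feats, N=2):
    """Delta coefficients of a (frames x dims) matrix, clamping at the edges."""
    feats = np.asarray(feats, dtype=float)
    rows = feats.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))  # equals 10 when N = 2
    idx = np.arange(rows)
    deltas = np.zeros_like(feats)
    for n in range(1, N + 1):
        ahead = np.minimum(idx + n, rows - 1)  # clamp t+n to the last frame
        behind = np.maximum(idx - n, 0)        # clamp t-n to the first frame
        deltas += n * (feats[ahead] - feats[behind])
    return deltas / denom

# A feature that grows by 1 per frame has a delta of 1 away from the edges.
ramp = np.arange(6, dtype=float).reshape(-1, 1)
print(delta_features(ramp).ravel())
```

Away from the edges the delta of the ramp is exactly 1; the first and last two frames come out smaller because of the clamping.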
The Python functions below extract MFCC features from an audio signal and derive the delta coefficients.
    import numpy as np
    from sklearn import preprocessing
    import python_speech_features as mfcc

    def calculate_delta(array):
        """Calculate and return the delta of the given feature vector matrix"""
        rows, cols = array.shape
        deltas = np.zeros((rows, cols))
        N = 2
        for i in range(rows):
            # collect the (t+j, t-j) frame index pairs, clamped to the valid range
            index = []
            j = 1
            while j <= N:
                if i - j < 0:
                    first = 0
                else:
                    first = i - j
                if i + j > rows - 1:
                    second = rows - 1
                else:
                    second = i + j
                index.append((second, first))
                j += 1
            # delta formula with N = 2: the denominator 2 * (1^2 + 2^2) is 10
            deltas[i] = (array[index[0][0]] - array[index[0][1]]
                         + (2 * (array[index[1][0]] - array[index[1][1]]))) / 10
        return deltas

    def extract_features(audio, rate):
        """Extract 20-dim MFCC features from an audio signal, perform CMS and
        append the deltas to make a 40-dim feature vector"""
        # 25 ms window, 10 ms step, 20 cepstral coefficients
        mfcc_feat = mfcc.mfcc(audio, rate, 0.025, 0.01, 20, appendEnergy=True)
        # scale each coefficient to zero mean and unit variance
        mfcc_feat = preprocessing.scale(mfcc_feat)
        delta = calculate_delta(mfcc_feat)
        combined = np.hstack((mfcc_feat, delta))
        return combined
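To get a feel for the size of the feature matrix, the sketch below estimates how many frames a 10-second, 16000 Hz file yields with the 25 ms window and 10 ms step passed to the MFCC call above. The helper `num_frames` is hypothetical, and it assumes the last partial frame is zero-padded and counted, so treat the exact number as an estimate rather than a guarantee of what the library returns.

```python
import math

def num_frames(n_samples, rate, winlen=0.025, winstep=0.01):
    """Frame count, assuming the final partial frame is zero-padded and kept."""
    frame_len = int(round(winlen * rate))    # 400 samples at 16 kHz
    frame_step = int(round(winstep * rate))  # 160 samples at 16 kHz
    if n_samples <= frame_len:
        return 1
    return 1 + int(math.ceil((n_samples - frame_len) / frame_step))

rate = 16000
frames = num_frames(10 * rate, rate)
print(frames, "frames, each with 40 features")
```

Under these assumptions, five such files give a training matrix of a few thousand rows per speaker.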
2. Training Speaker Models.
As we know, there are 34 distinct speakers in the training corpus, taken from the many speakers provided by VoxForge. The paths of all the audio files (5 per speaker) used for training are given in this file. Usually there is a very important pre-processing step, also known as voice activity detection (VAD), which includes noise removal and silence truncation of the audio. I have assumed that there is no need to perform VAD here.
In order to build a speaker identification system from the above extracted features, we need to model all the speakers independently now. We employ GMMs for this task.
A Gaussian mixture model is a probabilistic clustering model for representing the presence of sub-populations within an overall population. The idea of training a GMM is to approximate the probability distribution of a class by a linear combination of ‘k’ Gaussian distributions/clusters, also called the components of the GMM. The likelihood of a data point (feature vector) $X$ for a model $\lambda$ is given by the following equation:

$$P(X \mid \lambda) = \sum_{m=1}^{k} w_m \, P_m(X \mid \mu_m, \Sigma_m)$$

where $P_m(X \mid \mu_m, \Sigma_m)$ is the Gaussian distribution

$$P_m(X \mid \mu_m, \Sigma_m) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma_m|}} \exp\left(-\frac{1}{2}(X - \mu_m)^T \Sigma_m^{-1} (X - \mu_m)\right)$$

and $d$ is the dimension of the feature vector.
The training data of the class are used to estimate the parameters of these k components: the means $\mu$, the covariance matrices $\Sigma$ and the weights $w$.
Initially, training identifies k clusters in the data using the K-means algorithm and assigns equal weight to each cluster. ‘k’ Gaussian distributions are then fitted to these k clusters. The parameters $\mu$, $\Sigma$ and $w$ of all the clusters are updated iteratively until they converge. The most popular method for this estimation is the Expectation Maximization (EM) algorithm.
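The procedure just described can be sketched on a toy problem. The code below fits a 2-component GMM to one-dimensional synthetic data with plain NumPy: a few K-means steps for initialisation, equal starting weights, then EM updates. The function names and the synthetic data are assumptions for illustration only, and real implementations (sklearn's included) add safeguards such as covariance regularisation and convergence checks that are left out here.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fit_gmm_1d(x, k=2, n_iter=50, kmeans_iter=10):
    """Fit a k-component 1-D GMM: K-means initialisation, then EM updates."""
    # K-means initialisation, starting from centres spread over the data range
    means = np.linspace(x.min(), x.max(), k)
    for _ in range(kmeans_iter):
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        means = np.array([x[labels == c].mean() for c in range(k)])
    variances = np.array([x[labels == c].var() for c in range(k)])
    weights = np.full(k, 1.0 / k)  # equal weight for every cluster

    history = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        joint = weights * gaussian_pdf(x[:, None], means, variances)
        total = joint.sum(axis=1, keepdims=True)
        history.append(np.log(total).sum())  # log-likelihood before the update
        resp = joint / total
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
    return weights, means, variances, history

# Synthetic data: two well-separated sub-populations
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
weights, means, variances, history = fit_gmm_1d(data)
print("weights:", weights.round(2), "means:", means.round(2))
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes EM a convenient choice for this estimation.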
The sklearn.mixture package is used to learn a GMM from the features matrix containing the 40-dimensional MFCC and delta-MFCC features. More about the sklearn GMM can be read in section 3 of our previous post ‘Voice Gender Detection’. The following Python code is used to train the GMM speaker models with 16 components. The code is run once for each speaker, and train_file is the variable holding the name of the text file that contains the paths to all the audio files of the respective speaker. Also, you have to create a “speaker_models” directory where all the models will be dumped after training.
    import cPickle
    import numpy as np
    from scipy.io.wavfile import read
    # sklearn.mixture.GMM is the older scikit-learn API (removed in version 0.20)
    from sklearn.mixture import GMM
    from speakerfeatures import extract_features
    import warnings
    warnings.filterwarnings("ignore")

    # path to training data
    source = "development_set\\"

    # path where trained speaker models will be saved
    dest = "speaker_models\\"
    train_file = "development_set_enroll.txt"
    file_paths = open(train_file, 'r')

    count = 1
    # Extracting features for each speaker (5 files per speaker)
    features = np.asarray(())
    for path in file_paths:
        path = path.strip()
        print path

        # read the audio
        sr, audio = read(source + path)

        # extract 40-dimensional MFCC & delta MFCC features
        vector = extract_features(audio, sr)

        if features.size == 0:
            features = vector
        else:
            features = np.vstack((features, vector))

        # when the features of 5 files of a speaker are concatenated, train the model
        if count == 5:
            gmm = GMM(n_components=16, n_iter=200, covariance_type='diag', n_init=3)
            gmm.fit(features)

            # dump the trained Gaussian model, named after the speaker
            picklefile = path.split("-")[0] + ".gmm"
            cPickle.dump(gmm, open(dest + picklefile, 'w'))
            print '+ modeling completed for speaker:', picklefile, " with data point = ", features.shape
            features = np.asarray(())
            count = 0
        count = count + 1
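The listing above targets Python 2 and the older `sklearn.mixture.GMM` class, which was removed in scikit-learn 0.20. For readers on Python 3, the following is a minimal sketch of the same training step using `GaussianMixture`, where `n_iter` becomes `max_iter`. The helper name `train_speaker_model`, the fixed seed and the random stand-in features are assumptions for illustration; this is not the code used to produce the results in this post.

```python
import pickle
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_model(features, n_components=16, seed=0):
    """Fit a diagonal-covariance GMM to a (frames x dims) feature matrix."""
    gmm = GaussianMixture(n_components=n_components, max_iter=200,
                          covariance_type="diag", n_init=3, random_state=seed)
    gmm.fit(features)
    return gmm

# Random stand-in for the stacked 40-dimensional features of one speaker
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 40))

gmm = train_speaker_model(features)

# Models survive a pickle round trip; on disk, open the file in binary mode
restored = pickle.loads(pickle.dumps(gmm))

# score_samples gives one log-likelihood per frame; score gives their mean
frame_scores = restored.score_samples(features)
print(frame_scores.shape, frame_scores.sum())
```

Note the change in scoring: with `GaussianMixture`, `score` returns the average log-likelihood, so the per-frame values needed for summing come from `score_samples`.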
3. Evaluating Performance on Test set.
The test set consists of 5 unseen utterances from each of the 34 trained speakers. The paths of all the audio files (5 per speaker) used for evaluation are given in this file.
Upon arrival of a test voice sample for speaker identification, we begin by extracting the 40-dimensional features for it, with a 25 ms frame size and a 10 ms step between frames (so consecutive frames overlap by 15 ms). Next we require the log-likelihood scores for each frame of the sample, $x_1, x_2, \dots, x_T$, belonging to each speaker, i.e., $P(x_t \mid S_j)$ (for all $j$ that belong to $S$) is to be calculated. The likelihood of the frame being from a particular speaker is calculated by substituting the $\mu$ and $\Sigma$ of that speaker's GMM into the likelihood equation shown in the previous section. This is done for each of the ‘k’ Gaussian components in the model, and the weighted sum of the ‘k’ likelihoods from the components is taken as per the weight parameter $w$ of the model. Applying the logarithm to the obtained sum gives us the log-likelihood value for the frame. This is repeated for all the frames of the sample, and the log-likelihoods of all the frames are added. The speaker model with the highest total log-likelihood is considered the identified speaker.
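To make the scoring rule concrete, here is a small NumPy sketch of the computation for diagonal-covariance GMMs: a per-frame weighted sum over components, taken in the log domain for numerical stability, then summed over frames and compared across speakers. The function names, the two toy "speakers" and their parameters are all made up for illustration; in the actual scripts this work is done by the trained sklearn models.

```python
import numpy as np

def frame_log_likelihoods(frames, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM.

    frames: (T, D); weights: (K,); means and variances: (K, D).
    """
    diff = frames[:, None, :] - means[None, :, :]  # shape (T, K, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    weighted = np.log(weights) + log_gauss         # shape (T, K)
    # log of the weighted sum over components (log-sum-exp trick)
    peak = weighted.max(axis=1, keepdims=True)
    return peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))

def identify(frames, models):
    """Return the speaker whose model gives the largest summed log-likelihood."""
    totals = {name: frame_log_likelihoods(frames, *params).sum()
              for name, params in models.items()}
    return max(totals, key=totals.get), totals

# Two hypothetical speakers, each a 2-component GMM over 2-D features
models = {
    "speaker_a": (np.array([0.5, 0.5]),
                  np.array([[0.0, 0.0], [1.0, 1.0]]), np.ones((2, 2))),
    "speaker_b": (np.array([0.5, 0.5]),
                  np.array([[5.0, 5.0], [6.0, 6.0]]), np.ones((2, 2))),
}

# Test frames drawn near speaker_b's components
rng = np.random.default_rng(1)
test_frames = rng.normal(loc=5.5, scale=1.0, size=(200, 2))
winner, totals = identify(test_frames, models)
print("detected as:", winner)
```

Summing log-likelihoods over frames amounts to treating the frames as independent, which is the simplification this approach relies on.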
The Python code given below predicts the speaker of the test audio.
    import os
    import cPickle
    import numpy as np
    from scipy.io.wavfile import read
    from speakerfeatures import extract_features
    import warnings
    warnings.filterwarnings("ignore")
    import time

    # path to the data set
    source = "development_set\\"

    modelpath = "speaker_models\\"

    test_file = "development_set_test.txt"
    file_paths = open(test_file, 'r')

    gmm_files = [os.path.join(modelpath, fname) for fname in
                 os.listdir(modelpath) if fname.endswith('.gmm')]

    # Load the Gaussian speaker models
    models = [cPickle.load(open(fname, 'r')) for fname in gmm_files]
    # speaker name = model file name without its directory and ".gmm" extension
    speakers = [fname.split("\\")[-1].split(".gmm")[0] for fname in gmm_files]

    # Read the list of test audio files and identify the speaker of each one
    for path in file_paths:
        path = path.strip()
        print path
        sr, audio = read(source + path)
        vector = extract_features(audio, sr)

        log_likelihood = np.zeros(len(models))

        for i in range(len(models)):
            gmm = models[i]  # checking with each model one by one
            scores = np.array(gmm.score(vector))
            log_likelihood[i] = scores.sum()

        winner = np.argmax(log_likelihood)
        print "\tdetected as - ", speakers[winner]
        time.sleep(1.0)
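The script above only prints each detection. To turn those detections into an accuracy figure, one could tally them against the true speaker of each file, as in the minimal sketch below. It assumes, as the training script does when naming the models, that the speaker name is the part of the file path before the first "-". The helper names, file paths and predictions are hypothetical.

```python
def speaker_from_path(path):
    """Speaker label taken as the text before the first '-' (assumed layout)."""
    return path.strip().split("-")[0]

def in_set_accuracy(paths, predictions):
    """Fraction of test files whose predicted speaker matches the path label."""
    correct = sum(speaker_from_path(p) == pred
                  for p, pred in zip(paths, predictions))
    return correct / len(paths)

# Hypothetical file paths and detections, for illustration only
paths = [
    "alice-20100101-abc/wav/a0001.wav",
    "bob-20100202-xyz/wav/b0001.wav",
    "bob-20100202-xyz/wav/b0002.wav",
]
predictions = ["alice", "bob", "alice"]
print("accuracy: {:.1%}".format(in_set_accuracy(paths, predictions)))
```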
Results and Conclusion
This beginner’s approach achieves an in-set accuracy of 100%, identifying all 170 speech utterances correctly. There are a few reasons for such a perfect result, and a few points to consider before relying on it:
- The unseen utterances of each speaker taken from VoxForge are possibly from the same channel or environment as the training utterances.
- The evaluation task is performed on a small dataset. Consider a data inflow where you are getting probably some thousands of calls in a day.
- Consider the situation where we have to identify speakers from a set of 1000 speakers.
- In this evaluation, we have not taken out-of-set speakers into account, i.e., if the audio is not from any trained speaker, our system will still identify it as one of the speakers in the trained set, depending upon the highest likelihood.
- In a real environment, we may get noisier and less clean data. A speaker identification system needs to be robust.
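The out-of-set point can be illustrated with a small sketch. One simple idea, which is not part of this post's evaluation, is to accept the best-scoring speaker only when its average per-frame log-likelihood clears a threshold. The function name, scores and threshold below are hypothetical, and a usable threshold would have to be tuned on held-out data.

```python
def identify_or_reject(avg_scores, threshold):
    """Pick the best-scoring speaker, or None if even the best score is too low.

    avg_scores maps speaker name -> average per-frame log-likelihood.
    """
    best = max(avg_scores, key=avg_scores.get)
    return best if avg_scores[best] >= threshold else None

# Hypothetical scores and threshold, for illustration only
in_set = identify_or_reject({"alice": -42.0, "bob": -55.0}, threshold=-50.0)
out_of_set = identify_or_reject({"alice": -71.0, "bob": -68.0}, threshold=-50.0)
print(in_set, out_of_set)
```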
We hope the blog post was successful in explaining a basic approach to the speaker identification task. We encourage you to reproduce the results we have posted. Remember, this is not the end. I hope it forms the background for further research on this task. To read about more effective techniques, see the references for speaker identification below.
[1] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn, “Speaker Verification Using Adapted Gaussian Mixture Models”, M.I.T. Lincoln Laboratory, 2000
[2] Najim Dehak et al., “Front-End Factor Analysis for Speaker Verification”, IEEE Transactions on Audio, Speech, and Language Processing, 2010
The full implementation of the approach followed for training and evaluating speaker identification from voice can be downloaded from the GitHub link here. Also remember to download the data-set provided at the beginning of the blog post.
If you liked the post, follow this blog to get updates about upcoming articles. Also, share it so that it can reach out to the readers who can actually gain from this. Please feel free to discuss anything regarding the post. I would love to hear feedback from you.
Happy machine learning 🙂