<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1815-5928</journal-id>
<journal-title><![CDATA[Ingeniería Electrónica, Automática y Comunicaciones]]></journal-title>
<abbrev-journal-title><![CDATA[EAC]]></abbrev-journal-title>
<issn>1815-5928</issn>
<publisher>
<publisher-name><![CDATA[Universidad Tecnológica de La Habana José Antonio Echeverría, Cujae]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1815-59282012000200007</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[New Missing Features Mask Estimation Method for Speaker Recognition in Noisy Environments]]></article-title>
<article-title xml:lang="es"><![CDATA[Nuevo método de estimación de máscara de rasgos perdidos para reconocimiento de locutores en ambientes ruidosos]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Ribas González]]></surname>
<given-names><![CDATA[Dayana]]></given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Calvo de Lara]]></surname>
<given-names><![CDATA[José R.]]></given-names>
</name>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[CENATAV]]></institution>
<addr-line><![CDATA[Havana]]></addr-line>
<country>Cuba</country>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>08</month>
<year>2012</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>08</month>
<year>2012</year>
</pub-date>
<volume>33</volume>
<numero>2</numero>
<fpage>50</fpage>
<lpage>56</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_arttext&amp;pid=S1815-59282012000200007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_abstract&amp;pid=S1815-59282012000200007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://scielo.sld.cu/scielo.php?script=sci_pdf&amp;pid=S1815-59282012000200007&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[Currently, many speaker recognition applications must handle speech corrupted by environmental additive noise without having a priori knowledge about the characteristics of noise. Some previous works in speaker recognition have used the Missing Feature (MF) approach to compensate for noise. In most of those applications the spectral reliability decision step is done using the Signal to Noise Ratio (SNR) criterion. This has the goal of enhancing signal power rather than noise power, which could be dangerous in speaker recognition tasks, because useful speaker information could be removed. This work proposes a new mask estimation method based on Speaker Discriminative Information (SDI) for determining spectral reliability in speaker recognition applications based on the MF approach. The proposal was evaluated through speaker verification experiments in speech corrupted by additive noise. Experiments demonstrated that this new criterion has a promising performance in speaker verification tasks.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[En la actualidad, muchas aplicaciones de reconocimiento de locutores deben manejar voz corrupta por ruido aditivo ambiental sin tener conocimiento previo sobre las características del ruido. Trabajos previos de reconocimiento de locutores han usado la teoría de Rasgos Perdidos (MF: Missing Features) para compensar el ruido. En muchas de estas aplicaciones el paso de la decisión de confiabilidad espectral se hace usando el criterio de Relación Señal a Ruido (SNR: Signal to Noise Ratio). Este tiene el objetivo de resaltar la potencia de señal sobre la potencia de ruido, lo que pudiera ser peligroso en tareas de reconocimiento de locutores, porque se pudiera eliminar información útil del locutor. Este trabajo propone un nuevo método de estimación de máscara basado en Información Discriminativa del Locutor (SDI: Speaker Discriminative Information) para determinar la confiabilidad espectral en aplicaciones de reconocimiento de locutores basadas en la teoría de MF. La propuesta fue evaluada en experimentos de verificación de locutores con voces corruptas por ruido aditivo. Los experimentos demostraron que este criterio tiene un desempeño prometedor en verificación de locutores.]]></p></abstract>
<kwd-group>
<kwd lng="en"><![CDATA[formants]]></kwd>
<kwd lng="en"><![CDATA[missing features approach]]></kwd>
<kwd lng="en"><![CDATA[speaker recognition]]></kwd>
<kwd lng="es"><![CDATA[formantes]]></kwd>
<kwd lng="es"><![CDATA[teoría de rasgos perdidos]]></kwd>
<kwd lng="es"><![CDATA[reconocimiento de locutores]]></kwd>
</kwd-group>
</article-meta>
</front><body><![CDATA[ <P align="right" ><font size="2" face="Verdana"><strong>ARTICULO ORIGINAL</strong></font></p>      <P >&nbsp;</p>     <P ><font size="2" face="Verdana"><B><font size="4">New Missing Features Mask Estimation Method for Speaker Recognition    in Noisy Environments</font></B></font></p>     <P >&nbsp;</p>     <P ><strong><font size="3" face="Verdana">Nuevo m&eacute;todo de estimaci&oacute;n de m&aacute;scara de rasgos perdidos para reconocimiento de locutores en ambientes ruidosos</font></strong></p>     <P >&nbsp;</p>     <P >&nbsp;</p>     <P ><font size="2" face="Verdana"><B>Ing. Dayana Ribas Gonz&aacute;lez, Jos&eacute; R. Calvo de Lara</B></font></p>     <P class="style1"><font size="2" face="Verdana">CENATAV,Havana  City, Cuba, <a href="mailto:dribas@cenatav.co.cu">dribas@cenatav.co.cu</a> , <a href="mailto:jcalvo@cenatav.co.cu">jcalvo@cenatav.co.cu</a></font></p>     <P class="style1">&nbsp;</p>     ]]></body>
<body><![CDATA[<P class="style1">&nbsp;</p> <hr>     <P class="style1"><font size="2" face="Verdana"><b>ABSTRACT</b> </font></p>     <P class="style1"><font size="2" face="Verdana">Currently, many speaker recognition applications must handle speech corrupted by environmental additive noise without having a priori knowledge about the characteristics of noise. Some previous works in speaker recognition have used the Missing Feature (MF) approach to compensate for noise. In most of those applications the spectral reliability decision step is done using the Signal to Noise Ratio (SNR) criterion. This has the goal of enhancing signal power rather than noise power, which could be dangerous in speaker recognition tasks, because useful speaker information could be removed. This work proposes a new mask estimation method based on Speaker Discriminative Information (SDI) for determining spectral reliability in speaker recognition applications based on the MF approach. The proposal was evaluated through speaker verification experiments in speech corrupted by additive noise. Experiments demonstrated that this new criterion has a promising performance in speaker verification tasks. </font></p>     <P class="style1"><font size="2" face="Verdana"><B>Key words: </B>formants, missing features approach, speaker recognition.    </font>    <br> </p> <hr>     <P class="style1"><font size="2" face="Verdana"><B>RESUMEN</B></font></p>     <P class="style1"><font size="2" face="Verdana">En la actualidad, muchas aplicaciones de reconocimiento de locutores deben manejar voz corrupta por ruido aditivo ambiental sin tener conocimiento previo sobre las caracter&iacute;sticas del ruido. Trabajos previos de reconocimiento de locutores han usado la teor&iacute;a de Rasgos Perdidos (MF: <em>Missing Features</em>) para compensar el ruido. En muchas de estas aplicaciones el paso de la decisi&oacute;n de confiabilidad espectral se hace usando el criterio de Relaci&oacute;n Se&ntilde;al a Ruido (SNR: <em>Signal to Noise Ratio</em>). Este tiene el objetivo de resaltar la potencia de se&ntilde;al sobre la potencia de ruido, lo que pudiera ser peligroso en tareas de reconocimiento de locutores, porque se pudiera eliminar informaci&oacute;n &uacute;til del locutor. Este trabajo propone un nuevo m&eacute;todo de estimaci&oacute;n de m&aacute;scara basado en Informaci&oacute;n Discriminativa del Locutor (SDI: <em>Speaker Discriminative Information</em>) para determinar la confiabilidad espectral en aplicaciones de reconocimiento de locutores basadas en la teor&iacute;a de MF. La propuesta fue evaluada en experimentos de verificaci&oacute;n de locutores con voces corruptas por ruido aditivo. Los experimentos demostraron que este criterio tiene un desempe&ntilde;o prometedor en verificaci&oacute;n de locutores.</font>&nbsp;</p>     <P class="style1"><font size="2" face="Verdana"><strong>Palabras clave:</strong></font>    <font face="Verdana, Arial, Helvetica, sans-serif" size="2">formantes, teor&iacute;a de rasgos perdidos, reconocimiento de locutores</font>.    <br> </p> <hr>     <P class="style1">&nbsp;</p>     ]]></body>
<body><![CDATA[<P class="style1">&nbsp;</p>     <P class="style1"><font size="3" face="Verdana"><B>INTRODUCTION</B></font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font size="2" face="Verdana">Plenty of advances have been made in automatic speaker recognition (ASR) technology in the last decades. Despite that, robust speaker recognition in noisy environments is still a challenging task for the current technology [1]. </font></p>     <P class="style1"><font size="2" face="Verdana">Robust ASR in noisy environments deserves great attention from researchers because it is a very common scenario in applications. For example, in forensics there is a current trend to implement auditory and semiautomatic analysis over telephone conversations for recognizing the persons in a conversation [2]. Speaker diarization, which consists in determining who spoke at each moment, is another special case of speaker recognition in noisy environments, where the voices of other persons are the acoustic noise; this kind of noise is called babble noise and is very difficult to deal with [3]. In remote access services, identity verification using the user's voice is advantageous for both users and providers, because of the security that a biometric measure offers compared with a text password or PIN, which can be stolen or cracked easily. What is more, the operation and personalization of those services can be done by speaking. This requires the integration of speech and language recognition technologies besides speaker recognition, but it provides a great degree of automation for those services. There are several other applications of ASR technology; however, these examples are enough to demonstrate the importance of strengthening the speaker recognition scheme when dealing with voices acquired in noisy environments. </font></p>     <P class="style1"><font size="2" face="Verdana">The missing feature approach [4] has been applied to robust speaker recognition in noisy environments to compensate for noise, with promising results [5-7]. It is based on the fact that any noise affects the time-frequency (t-f) components of the speech spectrum in different ways, so it consists in detecting the corruption level of the spectrum and determining which part of it is reliable enough to be used in recognition. </font></p>     <P class="style1"><font size="2" face="Verdana">The use of the MF approach in speaker recognition has two steps. The first is the detection of the reliability degree of the corrupted speech spectrum, by creating a map of reliability in correspondence with the t-f components, called a spectrographic mask. The mask is formed by reliable (R) and unreliable (U) labels, one for each t-f component of the spectrum: components highly corrupted by noise are tagged with U labels, and components with a low level of corruption with R labels. The second is the missing feature compensation, which is based on the spectrographic mask. This step has two options: to reconstruct the unreliable components and perform recognition with the newly reconstructed spectrum, or to bypass the unreliable components so that they are not used in the recognition process. The first option relies on reconstruction techniques for unreliable components, called imputation techniques, developed for speech recognition; the second is known as marginalization and requires a change in the score computation method to handle an incomplete set of spectral features in speaker verification. </font></p>     <P class="style1"><font size="2" face="Verdana">As can be seen, the potential for improvement depends mainly on the mask estimation accuracy. Missing feature compensation works only with the U components determined by the mask, so if the mask is not accurate the error propagates: some R components will be compensated, while some U components will be kept untouched. In short, mask estimation is the most important process in the MF approach, so in this paper we focus on the mask estimation step. </font></p>     <P class="style1"><font size="2" face="Verdana">The paradigm behind most spectrographic mask estimation methods used in speaker recognition consists of determining whether a t-f component is dominated by speech or by noise. This is achieved by SNR computation, wherein a noise energy estimate for each t-f component is compared with an estimate of the clean signal energy through a local SNR computation, and then a threshold is used to determine the spectrum reliability. This paradigm, which we will call the &#171;SNR criterion&#187;, is the basis of most mask estimation methods used in speaker recognition works, where the key is the way the elements needed to compute the local SNR are obtained. An example is the widely used method proposed by Drygajlo and El-Maliki in 1998 [8], which employs Spectral Subtraction; another example is the method proposed in [9], which uses the Minimum Mean Square Error (MMSE). Those methods perform accurately for stationary noise but degrade severely in non-stationary noise conditions, so other methods have been developed over the SNR criterion. The work presented in [10] is an example of that: specially designed for the highly difficult and non-stationary babble noise, it is based on a speech segregation system that uses a pitch estimator to discern between target and impostor speech, and the t-f components dominated by the target are then selected as the R components of the mask. Another is the proposal of Pullella et al. [7], which reuses the method of [8] and is tuned using a feature selection method based on multicondition training. </font></p>     <P class="style1"><font size="2" face="Verdana">In general, methods based on the SNR criterion work quite well, some even for non-stationary noises, but they have the limitation that all their performance relies on a single feature, the SNR; hence the system's performance depends on the accuracy of the signal and noise estimations, and for non-stationary noises it is quite difficult to achieve accurate estimations [11]. What is more, SNR is not the main criterion used by ASR systems for recognizing speakers: those systems are based on Speaker Discriminative Information (SDI) instead, which is affected by the SNR but is not precisely the key measure. </font></p>     ]]></body>
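<body><![CDATA[<P class="style1"><font size="2" face="Verdana">As an illustration of the SNR criterion, the following minimal Python sketch (our own, not taken from the cited works) labels each t-f component by comparing a spectral-subtraction estimate of the clean power against an estimated noise power; the noise-only leading frames and the 0 dB threshold are illustrative assumptions, not details fixed by [8] or [9]:</font></p>
<pre>
# Illustrative sketch of the SNR criterion for mask estimation.
# Assumptions (not from the paper): noise is estimated from the first
# frames, and the reliability threshold is 0 dB of local SNR.
import numpy as np

def snr_mask(power_spec, noise_frames=10, thr_db=0.0):
    """power_spec: (n_frames, n_bins) noisy power spectrogram."""
    noise_est = np.maximum(power_spec[:noise_frames].mean(axis=0), 1e-10)
    clean_est = np.maximum(power_spec - noise_est, 1e-10)  # spectral subtraction
    local_snr = 10.0 * np.log10(clean_est / noise_est)     # local SNR in dB
    return local_snr > thr_db                              # True = R, False = U
</pre>
<P class="style1"><font size="2" face="Verdana">Everything below the threshold is marked U and later compensated, so the whole scheme hinges on the noise estimate, which is exactly the weakness discussed above for non-stationary noise.</font></p>]]></body>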
<body><![CDATA[<P class="style1"><font size="2" face="Verdana">In view of these facts, this paper proposes a new mask estimation method for speaker recognition which copes with the limitations suffered by methods based on the SNR criterion. This new method employs an SDI criterion and a feature classification algorithm, wherein the reliability of each t-f component is characterized with several speaker-discriminative features; the paradigm of this new mask thus lies in determining whether a t-f component keeps enough SDI to be useful in the speaker recognition process, where the SDI of a t-f component is mainly affected by additive noise corruption. </font></p>     <P class="style1"><font size="2" face="Verdana">The key contributions of this paper are, firstly, the introduction of a new concept in the paradigm of estimated spectrographic masks for the MF approach, tailored to speaker recognition tasks. Moreover, a new mask estimation method is proposed which supports this paradigm, considering the SDI as the main measure in the reliability decision. The proposal is evaluated experimentally, showing promising performance over the method of [8] based on the SNR criterion, which is a widely used baseline in previous similar works. From now on, section 2 presents the proposed mask estimation method. Section 3 presents an evaluation of the proposal through a speaker verification experiment. Section 4 shows the results obtained, with a discussion of them, and finally section 5 draws some conclusions from the whole study.</font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font size="3" face="Verdana"><span class="style1"><B>PROPOSED METHOD </B></span> </font></p>     <P class="style1"></p>     <P class="style1"><font size="2" face="Verdana"><b>Hypothesis</b> </font></p>     <P class="style1"><font size="2" face="Verdana">Formant estimation methods tend to fail when processing speech signals corrupted by noise. These failures consist in detecting false formants or omitting formants in spectral regions where they actually exist. Therefore, we decided to take advantage of this issue, designing a mask estimation method which uses the mistakes of formant estimation methods as a detector of noisy t-f components. On the other hand, previous works have demonstrated that formants are valuable SDI in the process of speaker recognition [12], so if formant information cannot be accurately recovered from the spectrum, that t-f component will hardly have a favorable impact on the posterior speaker recognition process. Thus, we can conclude that determining the unreliable components from formant corruption ensures that speaker recognition will work only with spectral regions with enough SDI, i.e. those that have been capable of surviving the acoustic noise effects. </font></p>     <P class="style1"><font size="2" face="Verdana"><B>The method</B> </font></p>     <P class="style1"><font size="2" face="Verdana">The mask estimation method was designed as a supervised classification scheme <a href="#f1">(fig. 1)</a> with two stages:  </font></p>     <P class="style1"><font size="2" face="Verdana">1.     Creation of formant models  </font></p>     ]]></body>
<body><![CDATA[<P class="style1"><font size="2" face="Verdana">2.     Computation of spectrographic masks </font></p>     <P align="center" class="style1"><img src="/img/revistas/eac/v33n2/f0107212.jpg" width="335" height="200"><a name="f1"></a></p>     
<P class="style1"></p>     <P class="style1"><font size="2" face="Verdana">As it is based on formant information, it is called the FMT mask. </font></p>      <P class="style1"><font size="2" face="Verdana"><B>Formant models creation</B> </font></p>     <P class="style1"><font size="2" face="Verdana">The first stage is executed offline with respect to the process of creating spectrographic masks; in it, the six formants are modeled in separate models. To represent the formants, their frequency (F), energy (E) and bandwidth (BW) were used as features, extracted frame by frame from several samples of clean speech. To describe the distribution of those features, histograms of all the F samples of each formant were calculated; they showed the Gaussian distribution of the formant frequencies, which encouraged us to use Gaussian Mixture Models (GMM) to model the formants. Then histograms of the E and BW samples per formant were computed to fix the number of Gaussians needed for each specific GMM, as indicated in <a href="#t1">table I.</a></font></p>     <P align="center" class="style1"><img src="/img/revistas/eac/v33n2/t0107212.jpg" width="438" height="100"><a name="t1"></a></p>     
<P class="style1"><font size="2" face="Verdana">Let <I>FMT</I> denote a formant vector, composed of the <I>F, E</I> and <I>BW</I> measures of a specific formant (f) in a frame of the speech signal, defined as: <I>FMT<SUB>f</SUB> = (F<SUB>f</SUB>, E<SUB>f</SUB>, BW<SUB>f</SUB>)</I>. Given a collection of <I>FMT</I> vectors, the GMM parameters are estimated using the iterative expectation maximization (EM) algorithm [13]. The EM algorithm iteratively refines the maximum likelihood model parameters to monotonically increase the likelihood of the estimated model for the observed formant vectors. The EM equations for training a GMM can be found in [14]. The formant models are denoted as <I>&lambda;<SUB>f</SUB></I>, where f is the formant number. We can assume the formant vectors to be statistically independent, so the log-likelihood of a model <I>&lambda;<SUB>f</SUB></I> for a sequence of formant vectors of a specific formant is computed as:   </font></p>      <P align="center" class="style1"><img src="/img/revistas/eac/v33n2/e0107212.jpg" width="224" height="54"></p>     
<div align="center">(1) </div>     ]]></body>
<body><![CDATA[<P align="center" class="style1"></p>     <P class="style1" align="center"><img src="/img/revistas/eac/v33n2/e0207212.jpg" width="209" height="63"></p>     
<div align="center">(2) </div>     <P class="style1" align="center"></p>     <P class="style1" align="center"><img src="/img/revistas/eac/v33n2/e0307212.jpg" width="230" height="69">  </p>     
<div align="center">(3) </div>     <P class="style1"></p>     <p>fr: frame    <br>   &micro;<sub>m</sub>: distribution&rsquo;s mean    <br>   &Sigma;<sub>m</sub>: distribution&rsquo;s variance    <br>   w<sub>m</sub>: distribution&rsquo;s weight</p>     ]]></body>
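<body><![CDATA[<P class="style1"><font size="2" face="Verdana">A minimal sketch of this first stage, assuming scikit-learn's GaussianMixture as the EM implementation (the paper does not name a toolkit) and an illustrative mixture count in place of the per-formant values of table I:</font></p>
<pre>
# Stage 1 sketch: one GMM per formant over (F, E, BW) vectors, fit by EM.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_formant_models(fmt_vectors, n_mix=4):
    """fmt_vectors: dict {f: (n_samples, 3) array of (F, E, BW) rows};
    n_mix is an illustrative placeholder for the table I values."""
    models = {}
    for f, X in fmt_vectors.items():
        models[f] = GaussianMixture(n_components=n_mix).fit(X)  # EM training
    return models

def sequence_loglik(model, X):
    # Log-likelihood of a sequence of formant vectors under lambda_f,
    # summed over frames under the independence assumption (eq. 1).
    return model.score_samples(X).sum()
</pre>]]></body>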
<body><![CDATA[<p><font size="2" face="Verdana"><B>Spectrographic mask computation</B> </font></p>     <P class="style1"><font size="2" face="Verdana">The second stage handles the computation of the spectrographic masks by determining the reliability of the t-f components, labeling each one as R/U to create the mask. For that, the FMT<SUB>f</SUB> of each formant in each frame of the target speech signal is extracted and compared with the formant models to obtain the log-likelihood of FMT<SUB>f</SUB> with respect to all the formant models. If the maximum log-likelihood of FMT<SUB>f</SUB> belongs to the corresponding &lambda;<SUB>f</SUB> (for example: FMT<SUB>2</SUB> &#8594; &lambda;<SUB>2</SUB>) then this t-f component is labeled as R, otherwise as U <a href="#e4">(eq. 4)</a>. </font></p>     <P class="style1" align="center"><img src="/img/revistas/eac/v33n2/e0407212.jpg" width="185" height="39">    <a name="e4"></a></p>     
<div align="center">(4) </div>     
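<P class="style1"><font size="2" face="Verdana">Continuing the sketch above, the decision of eq. 4 can be written as an argmax over the per-formant model log-likelihoods; the dictionary of models is the one returned by the hypothetical train_formant_models shown earlier:</font></p>
<pre>
# Eq. 4 sketch: a component observed for formant f is reliable only if
# the matching model lambda_f gives the highest log-likelihood.
import numpy as np

def label_component(fmt_f, f, models):
    """fmt_f: observed (F, E, BW) vector for formant f; models: {f: GMM}."""
    x = np.asarray(fmt_f, dtype=float).reshape(1, -1)
    scores = {k: m.score_samples(x)[0] for k, m in models.items()}
    return 'R' if max(scores, key=scores.get) == f else 'U'
</pre>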
<P class="style1"><font size="2" face="Verdana"><B>Formant tracking method</B> </font></p>     <P class="style1"><font size="2" face="Verdana">For both the online and offline stages it is necessary to extract <I>FMT<SUB>f</SUB></I>. So we designed a method for obtaining <I>F<SUB>f</SUB></I> and measuring the corresponding <I>E<SUB>f</SUB></I> and <I>BW<SUB>f</SUB></I>. This method is based on the spectral phase acquired from the Chirp Group Delay (CGD) function [15]. </font></p>     <P class="style1"><font size="2" face="Verdana">First the CGD of the speech signal is computed; then the frequencies corresponding to the peaks of the CGD are evaluated to select which one corresponds to each formant, and <I>F<SUB>f</SUB></I> is obtained according to <a href="#e5">eq. 5</a>: </font></p>     <P class="style1" align="center"><img src="/img/revistas/eac/v33n2/e0507212.jpg" width="163" height="22">    <a name="e5"></a></p>     
<div align="center">(5) </div>     ]]></body>
<body><![CDATA[<P class="style1" align="center"><img src="/img/revistas/eac/v33n2/e0607212.jpg" width="381" height="79"><font size="2" face="Verdana">    (6) </font></p> <ul>       
<li><em>df</em>: measures how near the frequency of the analyzed CGD peak is to the central frequency of the formant subband.&nbsp; </li>       <li><em>CFreq</em>: central frequency of the formant subbands (F1 = 500 Hz, F2 = 1500 Hz, F3 = 2750 Hz, F4 = 4000 Hz, F5 = 5000 Hz, F6 = 6000 Hz)</li>       <li><em>freq</em>: frequency of the CGD peak</li>     </ul>     <P class="style1" align="center"><font size="2" face="Verdana"> <img src="/img/revistas/eac/v33n2/e0707212.jpg" width="195" height="55">(7)    </font></p>      
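<P class="style1"><font size="2" face="Verdana">A sketch of the frequency selection in eqs. 5 and 6 (the peak picking on the CGD itself, and eq. 7, are outside this sketch), assuming the per-frame CGD peak frequencies are already available:</font></p>
<pre>
# Eqs. 5-6 sketch: each formant frequency F_f is taken as the CGD peak
# nearest (smallest df) to that formant's subband central frequency.
import numpy as np

CFREQ = {1: 500.0, 2: 1500.0, 3: 2750.0, 4: 4000.0, 5: 5000.0, 6: 6000.0}

def assign_formants(peak_freqs):
    """peak_freqs: frequencies (Hz) of the CGD peaks found in one frame."""
    peaks = np.asarray(peak_freqs, dtype=float)
    return {f: peaks[np.abs(peaks - cf).argmin()]  # eq. 6 distance, eq. 5 argmin
            for f, cf in CFREQ.items()}
</pre>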
<P class="style1"></p>          <P class="style1"><font size="2" face="Verdana">The energy is computed for each formant frequency by taking the spectral intensity at that frequency. The bandwidth is computed by taking, from the spectrogram, the two frequencies with 3 dB of attenuation around the corresponding formant frequency; these two values are subtracted to obtain the bandwidth. </font></p>     ]]></body>
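<body><![CDATA[<P class="style1"><font size="2" face="Verdana">A sketch of those two measures on a dB-scaled spectrum; the bin-walking search for the 3 dB points is our own illustrative implementation of the description above, not the paper's exact procedure:</font></p>
<pre>
# E and BW sketch: energy is the spectral intensity at the formant
# frequency; bandwidth is the width of the region within 3 dB of it.
import numpy as np

def energy_and_bandwidth(spec_db, freqs, f_formant):
    """spec_db: spectrum in dB for one frame; freqs: bin frequencies (Hz)."""
    i = int(np.argmin(np.abs(freqs - f_formant)))  # bin of the formant
    e = spec_db[i]                                 # E: spectral intensity
    above = spec_db >= e - 3.0                     # within 3 dB of the peak
    lo, hi = i, i
    while lo > 0 and above[lo - 1]:
        lo -= 1
    while hi < len(freqs) - 1 and above[hi + 1]:
        hi += 1
    return e, freqs[hi] - freqs[lo]                # (E, BW)
</pre>]]></body>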
<body><![CDATA[<P class="style1">   <font size="2" face="Verdana"><span class="style1"><B>Experimental setup   </B></span> </font></p>     <P class="style1"><font size="2" face="Verdana"><b>Corpus</b> </font></p>     <P class="style1"><font size="2" face="Verdana">This article evaluates the performance of the FMT mask estimator of the MF approach through a speaker verification experiment, conducted with a set of 100 male speakers of AHUMADA [16], a Spanish NIST 2001 speech database for speaker characterization and identification.  </font></p>     <P class="style1"><font size="2" face="Verdana">To perform the evaluation, the speaker verification system was trained and tested with clean speech to establish the &#171;clean&#187; baseline; then, for setting the &#171;dirty&#187; baseline, it was tested with corrupted speech without using the MF approach. Later on, the system was tested with the same corrupted speech used in the dirty baseline, but using the MF approach. Each training and test utterance contains about 90 seconds of spontaneous speech from Ahumada's microphonic section M1. All speech material used for training and testing is digitized at 16 bits, with a 16000 Hz sample rate. </font></p>     <P class="style1"><font size="2" face="Verdana">The corrupting signal comes from a special case of non-stationary noise, called babble noise, which is highly correlated with voice because it is the voice of other speakers. It was added electronically to the test speech signals at different SNR levels, from 0 to 20 dB in 5 dB steps. </font></p>     <P class="style1"><font size="2" face="Verdana"><B>Missing feature protocol    </B> </font></p>     <P class="style1"><font size="2" face="Verdana">The MF approach is divided into 2 steps: missing feature detection and missing feature compensation. For unreliable-component compensation, the classical marginalization technique [17] was used. This simple method was selected considering our intention to focus on the mask estimator performance, and taking into account the good results reached by this method in previous works [5]. For detection, three types of masks were used: </font></p>     <P class="style1"><font size="2" face="Verdana">a) Oracle masks, to determine the ideal performance that speaker verification could reach using the MF approach. </font></p>     <P class="style1"><font size="2" face="Verdana">b) Spectral Subtraction mask (SS-mask), based on the SNR criterion, which allows us to establish a comparative line. </font></p>     <P class="style1"><font size="2" face="Verdana">c) FMT mask (FMT-mask), based on the SDI criterion, which is the proposal of this article. </font></p>     <P class="style1"><font size="2" face="Verdana">To estimate the FMT-mask (c), a set of clean speech signals was first selected to create the formant models. The signals were short read phrases from 50 speakers of AHUMADA [16], from 3 microphonic and 3 telephonic sections, 300 signals in total, around 20 minutes of speech for each model. </font></p>     ]]></body>
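<body><![CDATA[<P class="style1"><font size="2" face="Verdana">For reference, a minimal sketch of marginalization with a diagonal-covariance GMM: since the per-mixture density factorizes over dimensions, the unreliable ones are simply dropped from the frame score (a generic illustration, not the exact formulation of [17]):</font></p>
<pre>
# Marginalization sketch: score a frame using only its reliable (R)
# dimensions under a diagonal-covariance GMM.
import numpy as np
from scipy.stats import norm

def marginal_loglik(x, mask, weights, means, stds):
    """x: (D,) log-Mel frame; mask: (D,) booleans, True = reliable;
    weights: (M,); means, stds: (M, D) diagonal GMM parameters."""
    r = np.asarray(mask, dtype=bool)
    comp = [np.log(w) + norm.logpdf(x[r], m[r], s[r]).sum()
            for w, m, s in zip(weights, means, stds)]
    return np.logaddexp.reduce(comp)   # log of sum_m w_m * N(x_R; m_R, s_R)
</pre>]]></body>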
<body><![CDATA[<P class="style1"><font size="2" face="Verdana"><B>Speaker verification protocol</B> </font></p>     <P class="style1"><font size="2" face="Verdana">For applying the MF approach, the speech signals were represented with Log-Mel spectral features: a Hamming window with a 20 ms window length and 10 ms of overlap is applied to each frame, and a short-time spectrum is obtained by applying an FFT. Then a 20-filter Mel filterbank was applied over it, followed by a logarithmic transformation. For implementing the &#171;dirty&#187; baseline, state-of-the-art MFCC features were used, computed according to the process described previously, adding the transformation to the cepstrum domain and finally selecting 15 cepstral coefficients as features. </font></p>     ]]></body>
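<body><![CDATA[<P class="style1"><font size="2" face="Verdana">A sketch of this front-end, assuming librosa for the FFT and Mel filterbank (the paper does not name a toolkit):</font></p>
<pre>
# Log-Mel front-end sketch: 20 ms Hamming windows, 10 ms shift, FFT,
# 20-filter Mel filterbank, then log compression.
import numpy as np
import librosa

def log_mel(y, sr=16000):
    win = int(0.020 * sr)                 # 20 ms window
    hop = int(0.010 * sr)                 # 10 ms shift
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        window='hamming', n_mels=20)
    return np.log(mel + 1e-10).T          # (n_frames, 20) log-Mel features
</pre>]]></body>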
<body><![CDATA[<P class="style1"><font size="2" face="Verdana">Speech from 50 male speakers was used to create a gender-dependent Universal Background Model (UBM) [18] using a GMM of 512 Gaussians. The number of mixtures of the GMM was chosen taking into account the number of speakers, the phonetic richness and the duration of the signals used to create the UBM. Another 50 male speakers were used as targets, and their models were obtained by adapting the UBM with the Maximum a Posteriori (MAP) approach. In line with our goal, only the mismatch between the train and test sessions produced by additive noise was measured. We did not introduce any other mismatch source; hence, the same signals used for training were used for testing, in a text-dependent speaker verification, with the difference that they were clean in the train session and corrupted by babble noise in the test session. </font></p>     <P class="style1"><font size="2" face="Verdana"> All in all, 2500 trials (50 client speakers against each of 50 target models) were done for each SNR level (0, 5, 10, 15, 20 dB) and noise compensation method (without any: MFCC baseline, MF-Oracle, MF-SNR, MF-FMT). In total, 50000 trials in 20 experiments were done. </font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font size="3" face="Verdana"><B>RESULTS AND DISCUSSION</B> </font></p>     <P class="style1"></p>      <P class="style1"><font size="2" face="Verdana"><a href="#t2">Table II</a> presents a summary of the speaker verification effectiveness, in EER percentage vs. SNR, reached by the experiments described in the previous section. </font></p>     <P align="center" class="style1"><img src="/img/revistas/eac/v33n2/t0207212.jpg" width="408" height="198"><a name="t2"></a></p>     
<P class="style1"></p>     <P class="style1"><font size="2" face="Verdana">The table shows that the MF-FMT mask offers the best speaker verification results under highly contaminated noise conditions (SNR&lt;10 dB); however, when the SNR increases, the MF-FMT results are not better than the MF-SNR results. This happens because, when the power of the noise is low, the EER results tend to the values that would be obtained if the speaker verification had been carried out with clean speech. This is a very common behavior for noise compensation methods applied to high-SNR speech in speaker verification, which can be seen in [4, 9] too. On the other hand, those results show that formants alone are not capable of providing enough SDI to reach the MF-Oracle performance, so adding other features with SDI could improve the performance.  </font></p>     ]]></body>
<body><![CDATA[<P class="style1">&nbsp;</p>     <P class="style1"><font size="3" face="Verdana"><B>CONCLUSIONS AND FUTURE WORK</B> </font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font size="2" face="Verdana">Although babble noise is very difficult for compensation methods to handle, due to its non-stationarity and its high correlation with voice, the proposed mask estimation criterion (MF-SDI) outperformed MF-SNR in the most difficult conditions, SNR&lt;10 dB. </font></p>     <P class="style1"><font size="2" face="Verdana">The analytical conclusions and experimental results obtained in this article encourage us to continue using the SDI criterion to create mask estimation methods, while exploring other features more related to the speaker identity, in order to associate the reliability decision with a measure of the corruption of the information useful for speaker recognition. Future work will be in this direction. </font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font size="3" face="Verdana"><B>REFERENCES</B> </font></p>     <P class="style1">&nbsp;</p>     <!-- ref --><P class="style1"><font size="2" face="Verdana">1. Kinnunen, T. and H. Li, An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 2010. <B>52</B>: p. 12-40.     </font></p>     ]]></body>
<body><![CDATA[<!-- ref --><P class="style1"><font size="2" face="Verdana">2. Campbell, J.P., et al., Forensic Speaker Recognition, in IEEE Signal Processing Magazine. 2009. p. 95-103.     </font></p>     <!-- ref --><P class="style1"><font size="2" face="Verdana">3. Tranter, S. and D. Reynolds, An Overview of Automatic Speaker Diarisation Systems. IEEE Transactions on Speech and Audio Processing, 2006. <B>14</B>: p. 1557-1565.     </font></p>     <!-- ref --><P class="style1"><font size="2" face="Verdana">4. Raj, B. and R.M. Stern, Missing-Feature Approaches in Speech Recognition, in IEEE Signal Processing Magazine. 2005. p. 101-116.     </font></p>     <!-- ref --><P class="style1"><font size="2" face="Verdana">5. Padilla, M.T., T.F. Quatieri, and D.A. Reynolds, Missing Feature Theory with Soft Spectral Subtraction for Speaker Verification, in Interspeech 2006: Pittsburgh, Pennsylvania.     </font></p>     <!-- ref --><P class="style1"><font size="2" face="Verdana">6. Ming, J., et al., Robust Speaker Recognition in Noisy Conditions. IEEE Transactions on Audio, Speech and Language Processing, 2007. <B>15</B>: p. 1711-1723.     </font></p>     ]]></body>
<body><![CDATA[<!-- ref --><P class="style1"><font size="2" face="Verdana">7. Pullella, D., M. Kuhne, and R. Togneri, Robust Speaker Identification Using Combined Feature Selection and Missing Data Recognition, in International Conference on Acoustics, Speech and Signal Processing (ICASSP-08). 2008. Las Vegas, NV.     </font></p>     <P class="style1"><font size="2" face="Verdana">8. Drygajlo, A. and M. El-Maliki, Speaker Verification in Noisy Environments with Combined Spectral Subtraction and Missing Feature Theory. 1998, Signal Processing Laboratory, Swiss Federal Institute of Technology at Lausanne. </font></p>     <P class="style1"><font size="2" face="Verdana">9. El-Maliki, M. and A. Drygajlo, Missing features detection and handling for robust speaker verification, in Eurospeech. 1999. Budapest, Hungary. </font></p>     <P class="style1"><font size="2" face="Verdana">10. Shao, Y. and D. Wang, Robust speaker recognition using binary time-frequency masks, in ICASSP. 2006. </font></p>     <P class="style1"><font size="2" face="Verdana">11. Davis, G.M., Noise Reduction in Speech Applications. 2002, New York: CRC Press LLC. </font></p>     <P class="style1"><font size="2" face="Verdana">12. Rose, P., Forensic Speaker Identification. Taylor &amp; Francis Forensic Science Series, ed. J. Robertson. 2002, London: Taylor &amp; Francis. </font></p>     <P class="style1"><font size="2" face="Verdana">13. Dempster, A., N. Laird, and D. Rubin, Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc., 1977. <B>39</B>: p. 1-38. </font></p>     <P class="style1"><font size="2" face="Verdana">14. Duda, R.O. and P.E. Hart, Pattern Classification and Scene Analysis. 1973, New York: Wiley. </font></p>     <P class="style1"><font size="2" face="Verdana">15. Bozkurt, B., L. Couvreur, and T. Dutoit, Chirp group delay analysis of speech signals. Speech Communication, 2007. <B>49</B>: p. 159-176.  </font></p>     ]]></body>
<body><![CDATA[<P class="style1"><font size="2" face="Verdana">16. Ortega, J., J. Gonzalez, and V. Marrero, AHUMADA: A large speech corpus in Spanish for speaker characterization and identification. Speech Communication, 2000. <B>31</B>: p. 255-264. </font></p>     <P class="style1"><font size="2" face="Verdana">17. Drygajlo, A. and M. El-Maliki, Missing Features Detection and Handling for Robust Speaker Verification, in EUROSPEECH'99. 1999: Budapest, Hungary. </font></p>     <P class="style1"><font size="2" face="Verdana">18. Reynolds, D.A., T.F. Quatieri, and R.B. Dunn, Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing, 2000. <B>10</B>: p. 19-41.</font></p>     <P class="style1">&nbsp;</p>     <P class="style1"><font face="Verdana, Arial, Helvetica, sans-serif" size="2">Recibido: Marzo 2012    <br>   Aprobado: Mayo 2012</font></p>     ]]></body>
<back>
<ref-list>
<ref id="B1">
<label>1</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Kinnunen]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
<name>
<surname><![CDATA[Li]]></surname>
<given-names><![CDATA[H.]]></given-names>
</name>
</person-group>
<source><![CDATA[An overview of text-independent speaker recognition: From features to supervectors]]></source>
<year>2010</year>
<volume>52</volume>
<page-range>12-40</page-range><publisher-name><![CDATA[Speech communication]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B2">
<label>2</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Campbell]]></surname>
<given-names><![CDATA[J.P.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Forensic Speaker Recognition]]></article-title>
<source><![CDATA[IEEE Signal Processing Magazine]]></source>
<year>2009</year>
<page-range>95-103</page-range></nlm-citation>
</ref>
<ref id="B3">
<label>3</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Tranter]]></surname>
<given-names><![CDATA[S.]]></given-names>
</name>
<name>
<surname><![CDATA[Reynolds]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<source><![CDATA[An Overview of Automatic Speaker Diarisation Systems]]></source>
<year>2006</year>
<volume>14</volume>
<page-range>1557-1565</page-range><publisher-name><![CDATA[IEEE Transactions on Speech and Audio Processing]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B4">
<label>4</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Raj]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Stern]]></surname>
<given-names><![CDATA[R.M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Missing-Feature Approaches in Speech Recognition]]></article-title>
<source><![CDATA[IEEE Signal Processing Magazine]]></source>
<year>2005</year>
<page-range>101-116</page-range></nlm-citation>
</ref>
<ref id="B5">
<label>5</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Padilla]]></surname>
<given-names><![CDATA[M.T.]]></given-names>
</name>
<name>
<surname><![CDATA[Quatieri]]></surname>
<given-names><![CDATA[T.F.]]></given-names>
</name>
<name>
<surname><![CDATA[Reynolds]]></surname>
<given-names><![CDATA[D.A.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Missing Feature Theory with Soft Spectral Subtraction for Speaker Verification]]></article-title>
<source><![CDATA[Interspeech]]></source>
<year>2006</year>
<publisher-loc><![CDATA[Pittsburgh, Pennsylvania]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B6">
<label>6</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ming]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
</person-group>
<source><![CDATA[Robust Speaker Recognition in Noisy Conditions]]></source>
<year>2007</year>
<volume>15</volume>
<page-range>1711-1723</page-range><publisher-name><![CDATA[IEEE Transactions on Audio, Speech and Language Processing]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B7">
<label>7</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Pullella]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
<name>
<surname><![CDATA[Kuhne]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Togneri]]></surname>
<given-names><![CDATA[R.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Robust Speaker Identification Using Combined Feature Selection and Missing Data Recognition]]></article-title>
<source><![CDATA[International Conference on Acoustics, Speech and Signal Processing (ICASSP-08)]]></source>
<year>2008</year>
<publisher-loc><![CDATA[Las Vegas, NV]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B8">
<label>8</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[El-Maliki]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
<name>
<surname><![CDATA[Drygajlo]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
</person-group>
<source><![CDATA[Missing features detection and handling for robust speaker verification]]></source>
<year>1999</year>
<publisher-loc><![CDATA[Budapest ]]></publisher-loc>
<publisher-name><![CDATA[Eurospeech]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B9">
<label>9</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Shao]]></surname>
<given-names><![CDATA[Y.]]></given-names>
</name>
<name>
<surname><![CDATA[Wang]]></surname>
<given-names><![CDATA[D.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Robust speaker recognition using binary time-frequency masks]]></article-title>
<source><![CDATA[ICASSP]]></source>
<year>2006</year>
</nlm-citation>
</ref>
<ref id="B10">
<label>10</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Davis]]></surname>
<given-names><![CDATA[G.M.]]></given-names>
</name>
</person-group>
<source><![CDATA[Noise Reduction in Speech Applications]]></source>
<year>2002</year>
<publisher-loc><![CDATA[New York ]]></publisher-loc>
<publisher-name><![CDATA[CRC PRESS LLC]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B11">
<label>11</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Duda]]></surname>
<given-names><![CDATA[R.O.]]></given-names>
</name>
<name>
<surname><![CDATA[Hart]]></surname>
<given-names><![CDATA[P.E.]]></given-names>
</name>
</person-group>
<source><![CDATA[Pattern Classification and Scene Analysis]]></source>
<year>1973</year>
<publisher-loc><![CDATA[New York ]]></publisher-loc>
<publisher-name><![CDATA[Wiley]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B12">
<label>12</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Bozkurt]]></surname>
<given-names><![CDATA[B.]]></given-names>
</name>
<name>
<surname><![CDATA[Couvreur]]></surname>
<given-names><![CDATA[L.]]></given-names>
</name>
<name>
<surname><![CDATA[Dutoit]]></surname>
<given-names><![CDATA[T.]]></given-names>
</name>
</person-group>
<source><![CDATA[Chirp group delay analysis of speech signals]]></source>
<year>2007</year>
<volume>49</volume>
<page-range>159-176</page-range><publisher-name><![CDATA[Speech Communication]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B13">
<label>13</label><nlm-citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Ortega]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Gonzalez]]></surname>
<given-names><![CDATA[J.]]></given-names>
</name>
<name>
<surname><![CDATA[Marrero]]></surname>
<given-names><![CDATA[V.]]></given-names>
</name>
</person-group>
<source><![CDATA[AHUMADA: A large speech corpus in Spanish for speaker characterization and identification]]></source>
<year>2000</year>
<volume>31</volume>
<page-range>255-264</page-range><publisher-name><![CDATA[Speech Communication]]></publisher-name>
</nlm-citation>
</ref>
<ref id="B14">
<label>14</label><nlm-citation citation-type="">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Drygajlo]]></surname>
<given-names><![CDATA[A.]]></given-names>
</name>
<name>
<surname><![CDATA[El-Maliki]]></surname>
<given-names><![CDATA[M.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Speaker Verification in Missing Features Detection and Handling for Robust Speaker Verification]]></article-title>
<source><![CDATA[EUROSPEECH'99]]></source>
<year>1999</year>
<publisher-loc><![CDATA[Budapest ]]></publisher-loc>
</nlm-citation>
</ref>
<ref id="B15">
<label>15</label><nlm-citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname><![CDATA[Reynolds]]></surname>
<given-names><![CDATA[D.A.]]></given-names>
</name>
<name>
<surname><![CDATA[Quatieri]]></surname>
<given-names><![CDATA[T.F.]]></given-names>
</name>
<name>
<surname><![CDATA[Dunn]]></surname>
<given-names><![CDATA[R.B.]]></given-names>
</name>
</person-group>
<article-title xml:lang="en"><![CDATA[Speaker Verification Using Adapted Gaussian Mixture Models]]></article-title>
<source><![CDATA[Digital Signal Processing]]></source>
<year>2000</year>
<volume>10</volume>
<page-range>19-41</page-range></nlm-citation>
</ref>
</ref-list>
</back>
</article>
