Somlal Das, Md. Ekramul Hamid, Keikichi Hirose and Md. Khademul Islam Molla
Single-Channel Speech Enhancement by NWNS and EMD
Somlal Das firstname.lastname@example.org
Dept. of Computer Science and Engineering
University of Rajshahi
Md. Ekramul Hamid email@example.com
Department of Network Engineering
King Khalid University
Abha, Kingdom of Saudi Arabia
Keikichi Hirose firstname.lastname@example.org
Dept. of Information and Communication Eng.
The University of Tokyo
Md. Khademul Islam Molla email@example.com
Dept. of Information and Communication Eng.
The University of Tokyo
This paper addresses noise reduction from observed speech, improving the quality and/or intelligibility of the speech with a single-channel speech enhancement method. We propose two approaches to speech enhancement. One is based on the traditional Fourier transform and uses Noise Subtraction (NS), which is equivalent to Spectral Subtraction (SS); the other is based on Empirical Mode Decomposition (EMD) and uses adaptive thresholding. We first implement the two methods individually and observe that both are noise dependent and can enhance the speech signal only up to a certain limit. Moreover, traditional NS generates unwanted residual noise. We introduce a nonlinear weight to suppress this effect and propose the Nonlinear Weighted Noise Subtraction (NWNS) method. In the first stage, we estimate the noise and then calculate the Degree of Noise (DON1), frame by frame, from the ratio of the estimated noise power to the observed speech power for different input Signal-to-Noise Ratios (SNRs) of the given speech signal. The Minima Value Sequence (MVS) alone does not estimate the noise accurately, so the noise estimation accuracy is improved by incorporating DON1 into the MVS. The first stage performs well for wideband stationary noise over a wide range of SNRs. Most real-world noise, however, is narrowband and non-stationary, and EMD is a powerful tool for analyzing nonlinear and non-stationary signals such as speech. EMD decomposes any signal into a finite number of band-limited signals called intrinsic mode functions (IMFs). Because the IMFs have different noise and speech energy distributions, each IMF has a different noise and speech variance. An adaptive threshold function is therefore used, which changes with the newly computed variances of each IMF.
In the adaptive threshold function, the adaptation factor is the ratio of the square root of the added noise variance to the square root of the estimated noise variance. We observe experimentally that the best enhancement is achieved with an optimum adaptation factor. Using EMD-based adaptive thresholding alone, the enhancement again reaches only a certain limit. To go beyond what either method achieves individually, we propose a two-stage processing technique, NWNS+EMD. The first stage serves as a pre-process that removes noise to a certain level and yields a first enhanced speech signal; this is passed to the second stage, which removes the remaining noise as well as musical noise to obtain the final enhanced speech. Traditional NS in the first stage, however, produces better output SNR only up to 10 dB input SNR. Furthermore, spectrogram and waveform analysis and informal listening tests show musical noise and distortion in its enhanced speech. We use white, pink and high frequency channel noises to demonstrate the performance of the proposed NWNS+EMD algorithm.
Keywords: speech enhancement, nonlinear weighted noise subtraction, degree of noise, empirical mode decomposition, adaptive thresholding.
In many speech-related systems, such as mobile communication in an adverse environment, the desired signal is not available directly; rather, it is mostly contaminated by interfering noise sources. These background noise signals degrade the quality and intelligibility of the original speech, causing a severe drop in application performance. This degradation is a serious problem in speech-related systems and should therefore be reduced through speech enhancement algorithms. In our previous study, we proposed a two-stage noise reduction algorithm based on noise subtraction and blind source separation . In that report, we recommended further research to improve the algorithm over wide ranges of SNRs and to improve its noise reduction performance for narrow-band noises.
Research on speech enhancement techniques started more than 40 years ago at AT&T Bell Laboratories with Schroeder, as mentioned in . Schroeder proposed an analog implementation of the spectral magnitude subtraction method. The method was then modified by Schroeder’s colleagues in a published work . More than 15 years later, the spectral subtraction method proposed by Boll  became a popular noise-reduction technique for speech enhancement because of its simple underlying concept and its effectiveness in enhancing speech degraded by additive noise. The technique is based on direct estimation of the short-term spectral magnitude. Recent studies have focused on a nonlinear approach to the subtraction procedure [5-7]. Martin's algorithm  modifies the short-time spectral magnitude of the corrupted speech signal so that the synthesized signal is perceptually as close as possible to the clean speech signal. The noise estimate is obtained as the minima of a smoothed power estimate of the noisy signal, multiplied by a factor that compensates for the bias. The algorithm eliminates the need for a speech activity detector by exploiting the short-time characteristics of the speech signal. Martin compared the results with those of Malah  and found an improved SNR. However, this noise estimate is sensitive to outliers, and its variance is about twice as large as that of a conventional noise estimator. These approaches are justified by the variation of the signal-to-noise ratio across the speech spectrum. Unlike white Gaussian noise, which has a flat spectrum, real-world noise does not have a flat spectrum. Thus, the noise does not affect the speech signal uniformly over the whole spectrum; some frequencies are affected more adversely than others. In high frequency (HF) channel noise, for instance, the low frequencies, where most of the speech energy resides, are affected more than the high frequencies.
It is therefore imperative to estimate a suitable factor that subtracts just the necessary amount of the noise spectrum from each frequency bin (ideally), preventing destructive subtraction of the speech while removing most of the residual noise. It is usually difficult to design a standard algorithm that performs uniformly across all types of noise. For that reason, a speech enhancement system is based on certain assumptions and constraints that typically depend on the application and the environment.
Fourier spectral analysis has some crucial restrictions : the system must be linear, and the data must be strictly periodic or stationary; otherwise the resulting spectrum makes little physical sense. From this point of view, Fourier filter methods fail when the processes are nonlinear. The empirical mode decomposition (EMD), proposed by Huang et al.  as a new and powerful data analysis method for nonlinear and non-stationary signals, has opened a new path for speech enhancement research. EMD is a data-adaptive decomposition method that decomposes data into zero-mean oscillating components called intrinsic mode functions (IMFs). It is mentioned in  that most of the noise components of a noisy speech signal are concentrated in the first three IMFs because of their frequency characteristics. EMD can therefore be used to identify and remove these noise components effectively. Xiaojie et al.  proposed an EMD-based method that effectively identifies and removes noise components. Recently, many EMD-based speech enhancement methods [12-14] have been developed in dual-channel and single-channel modes. In , EMD-based speech enhancement is achieved by removing those IMFs whose energies exceed a predefined threshold value; the noise model is generated from the IMFs observed empirically when EMD is applied to speech contaminated with white Gaussian noise. In , EMD-MMSE speech enhancement is performed by filtering the IMFs generated from the decomposition of speech contaminated with white Gaussian noise. In , an optimum gain function is estimated for each IMF to suppress the residual noise that may remain after single-channel speech enhancement algorithms.
In our previous study, Hamid  proposed a noise subtraction (NS) technique in which noise is estimated using the minimum value sequence (MVS) and the noise floor is updated with the help of the estimated degree of noise (DON). The main drawback of that method is that DON is estimated on the basis of the pitch period over the frame, and the pitch period of unvoiced sections is not estimated accurately. To solve this problem, in this paper we estimate the DON (EDON) on the basis of the estimated SNRs of the clean and noisy speech spectra. The EDON is estimated in two stages from a function prepared in advance as a function of the degree-of-noise parameter . We use the valleys of the smoothed power spectrum of the observed noisy speech signal to estimate the noise power. This spectrum is tuned by EDON to adjust the noise level for a particular SNR. We also take suitable steps to minimize the residual noise problem. The estimated noise spectrum, scaled by a controlled nonlinear factor, is then subtracted from the observed speech in the time domain to obtain noise-reduced speech. This paper presents a parametric formulation for estimating the noise weight on the basis of EDON. The weighting factor increases with increasing SNR, which results in a weighting factor that varies nonlinearly with speech activity. Although the Fourier transform and wavelet analysis have made great contributions, they suffer from many shortcomings for nonlinear and non-stationary signals. For this reason, the EMD technique is used in this work for further enhancement and robust analysis of noisy speech.
Because the IMFs produced by EMD have different noise and speech energy distributions, each IMF has a different noise and speech variance. An adaptive threshold function is therefore used, which changes with the newly computed variances of each IMF. Since the IMFs are generated by EMD, we call the proposed method the EMD-based adaptive thresholding technique. To enhance the speech, the adaptive thresholding algorithm is applied to each IMF to remove the noise embedded in it. In the adaptive threshold function, the adaptation factor is the ratio of the square root of the added noise variance to the square root of the estimated noise variance. We observe experimentally that the best enhancement is achieved with an optimum adaptation factor. Using EMD-based adaptive thresholding alone, the enhancement reaches only a certain limit. Each individual method thus has its own performance limitations.
To go beyond what either method achieves individually, we propose a two-stage processing technique, namely a time-domain NS or NWNS followed by EMD-based adaptive thresholding. The first stage serves as a pre-process that removes noise to a certain level and yields a first enhanced speech signal; this is passed to the second stage, which removes the remaining noise as well as musical noise to obtain the final enhanced speech. Traditional NS in the first stage, however, produces better output SNR only up to 10 dB input SNR. Furthermore, spectrogram and waveform analysis and informal listening tests show musical noise and distortion in its enhanced speech. EMD-based adaptive thresholding does not work well on distorted speech and cannot recover the speech from the distorted signal when it is cascaded with NS. As a result, the overall performance of NS followed by EMD-based adaptive thresholding is poor on both objective and subjective measures. Introducing a nonlinear weight in NS, namely NWNS, controls the noise level and improves first-stage performance over a wide range of input SNRs, providing a first enhanced speech signal without distortion and with minimal musical noise. The overall performance is further improved by cascading NWNS in the first stage with EMD-based adaptive thresholding in the second stage. In this two-stage processing, NWNS helps to increase the performance of EMD-based adaptive thresholding. The advantage of the method is that it removes noise effectively, produces better output SNR over a wide range of input SNRs, and improves speech quality by reducing residual noise.
The main component of speech noise reduction is noise estimation, which is the most difficult task for a single-channel enhancement system. The noise estimate can have a major impact on the quality of the enhanced speech: with a better noise estimate, a more accurate SNR is obtained, resulting in enhanced speech with low distortion. We assume that speech and noise are uncorrelated with each other. We further assume that signal and noise are statistically independent.
Sections of consecutive samples are grouped into frames of length l (320 samples) spaced l’ (100 samples) apart, which gives an overlap of 68.75%. The short-term representation of a signal y(n) is obtained by Hamming windowing and analyzed using an N=512 point Discrete Fourier Transform (DFT) at a sampling frequency of 16 kHz. Initially, the noise spectrum is estimated from the valleys of the amplitude spectrum . The algorithm for noise estimation is as follows:
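The framing arithmetic can be checked with a short sketch. With a 320-sample frame and a 100-sample shift, consecutive frames share 220 samples, that is 220/320 = 68.75% overlap. The function names below are illustrative, and the naive O(N^2) DFT is used only to keep the example dependency-free; it is not meant as the authors' implementation.

```python
import cmath
import math

FRAME_LEN = 320   # samples per frame (from the text)
HOP = 100         # frame shift in samples (from the text)
N_FFT = 512       # DFT size (from the text)

def hamming(length):
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def split_frames(signal, frame_len=FRAME_LEN, hop=HOP):
    """Cut the signal into overlapping, Hamming-windowed frames."""
    window = hamming(frame_len)
    return [[signal[start + i] * window[i] for i in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def amplitude_spectrum(frame, n_fft=N_FFT):
    """Magnitude of the zero-padded DFT for bins 0..n_fft/2 (naive form)."""
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum

# Fraction of each frame shared with the next one: 220 / 320.
overlap = (FRAME_LEN - HOP) / FRAME_LEN
```

At 16 kHz, bin k of the 512-point DFT corresponds to k × 31.25 Hz, so a 1 kHz tone peaks at bin 32.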
Compute the RMS value Yrms of the amplitude spectrum Y(k). We detect the minima of Y(k) by obtaining the vector kmin such that Y(kmin) are the minima of Y(k). Interpolation is then performed between adjoining minima positions to obtain Ymin(k), the minimum value sequence (MVS). We smooth the sequence by taking partial averages, which gives the smoothed minimum value sequence (SMVS). A noise estimate taken from the SMVS alone over- or underestimates the SNR; this is controlled by the proposed EDON. The block diagram of the noise estimation process is shown in Figure 1.
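The MVS and SMVS steps can be sketched as follows. This is a minimal illustration under stated assumptions: linear interpolation between adjoining minima, the first and last minimum held flat at the spectrum edges, and a 3-point partial average. The text does not specify these details, and the EDON-based update is not included.

```python
def local_minima(y):
    """Indices k where y[k] is a local minimum of the sequence."""
    return [k for k in range(1, len(y) - 1)
            if y[k] < y[k - 1] and y[k] <= y[k + 1]]

def minimum_value_sequence(y):
    """Interpolate linearly between adjoining minima (assumed scheme)."""
    idx = local_minima(y)
    if not idx:
        return list(y)
    mvs = [0.0] * len(y)
    for k in range(idx[0]):                 # hold the first minimum at the left edge
        mvs[k] = y[idx[0]]
    for k in range(idx[-1], len(y)):        # hold the last minimum at the right edge
        mvs[k] = y[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for k in range(a, b):
            t = (k - a) / (b - a)
            mvs[k] = (1 - t) * y[a] + t * y[b]
    return mvs

def smooth(seq, width=3):
    """Partial (moving) average; the width of 3 is an assumption."""
    half = width // 2
    out = []
    for k in range(len(seq)):
        lo, hi = max(0, k - half), min(len(seq), k + half + 1)
        out.append(sum(seq[lo:hi]) / (hi - lo))
    return out

# Toy amplitude spectrum with minima at bins 1, 3 and 5.
Y = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 7.0]
mvs = minimum_value_sequence(Y)
smvs = smooth(mvs)
```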
Figure 1: Block diagram of the 1st estimated DON, Z1m.
In a single-channel method, we know only the power of the observed signal. To obtain EDON, we estimate the noise of the observed signal in every analysis frame m. First, white noise at various SNRs is added to voiced vowel sounds. For each SNR, the DON of each phoneme is estimated and averaged, giving the value that corresponds to that input SNR. Each of these 1st averaged DONs of frame m for the corresponding input SNR is expressed as . The estimated  is aligned with the true DON (Ztr) using the least-squares (LS) method, which yields the 1st estimated DON Z1m of that frame. The true DON (Ztr) is given by
where dB is the input SNR. The 1st averaged DON is
where ^ are the noise-added frames, and P(m) and Pobs(m) are the powers of the noise and observed signals, respectively. Note that we consider only voiced phonemes in this experiment, so the value of  should be limited to the voiced portion of a speech sentence. We repeated the same experiment with unvoiced speech. In practice, the unvoiced portion is contaminated with a higher degree of noise, so the estimated noise is higher for an unvoiced frame than for a voiced frame. Consequently, a higher DON value is obtained from an unvoiced frame than from a voiced frame, which is logically consistent. The degree of noise estimated from a function using the least-squares method is given as
Here a and b are unknown. We estimate a and b via the LS method, yielding  and , and the estimated degree of noise is given by
where Z1m is the 1st estimated DON of frame m. The value of Z1m is applied to update the MVS. Next, the noise level is re-estimated and updated with the help of Z1m. Finally, from the estimated noise, we again estimate the 2nd averaged DON () and, similarly, the 2nd estimated DON (Z2m), which is used to estimate the noise weight for nonlinear weighted noise subtraction.
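The alignment step can be illustrated with a small sketch. Two assumptions are made here and are not confirmed by the text: (1) DON is the ratio of noise power to observed power, so with uncorrelated speech and noise the DON implied by an input SNR is 1 / (1 + 10^(dB/10)); (2) the fitted function is a straight line with the unknowns a and b. The averaged DON values below are synthetic placeholders, not measurements.

```python
def true_don(snr_db):
    """DON implied by an input SNR, assuming P_obs = P_speech + P_noise."""
    return 1.0 / (1.0 + 10.0 ** (snr_db / 10.0))

def fit_line(x, y):
    """Ordinary least squares for the model y ~ a * x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

snr_levels_db = [-10, -5, 0, 5, 10, 15, 20, 25, 30]
z_true = [true_don(s) for s in snr_levels_db]

# Synthetic "averaged DON" values: a biased copy of the truth, for illustration only.
z_avg = [0.8 * z + 0.1 for z in z_true]

a_hat, b_hat = fit_line(z_avg, z_true)
z_est = [a_hat * z + b_hat for z in z_avg]   # aligned (estimated) DON
```

Because the synthetic bias is itself linear, the fit recovers the true DON exactly in this toy case; with real measurements the alignment would only be approximate.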
We detect the minima values of amplitude spectrum Y(k) when the following condition (Y(k)
where Yrms is the RMS value of the amplitude spectrum. We then update Dm(k), smooth the updated spectrum again with a three-point moving average, and finally identify and suppress the main maximum of the spectrum . Figure 2 shows the spectra.
Figure 2: Noise spectrums (true and estimated).
Noise reduction in the front end is based on traditional spectral subtraction (SS), which requires an estimate of the embedded noise; because we perform it in the time domain, we call it noise subtraction (NS). The goal of this section is to modify the noise subtraction process by adopting a nonlinear weight that minimizes the effect of residual noise in the processed speech, and then to improve the performance by using EMD.
For subtraction in the time domain, the noise estimated in the previous section is recombined with the phase of the noisy speech and inverse transformed. Then we obtain  by removing the effect of the window. The NWNS is given by:
where  is the nonlinear weighting factor. We use the least-squares method for the estimation process. We find that each input SNR requires a certain weight for the best noise reduction results over wide ranges of SNR. In this experiment, we used 10 different sentences spoken by 7 male and 7 female speakers at different SNR levels, randomly selected from the TIMIT database. We use 3rd-degree polynomials to derive the above formulation. It is observed from Eq. (1) that the input SNR is needed. The input SNR can be estimated from the variances and is given by
where  and  are the variances of speech and noise, respectively. We assume that, because noise and speech are independent, the variance of the noisy speech equals the sum of the speech variance and the noise variance. We find that adopting a nonlinear weight in NS gives good noise reduction. Although informal listening tests show that NWNS performs well with little musical noise, we cascade it with a second method, EMD, for further enhancement and obtain better results.
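A minimal sketch of the variance-based SNR estimate and the weighted subtraction, assuming the subtraction has the form y(n) minus weight times the time-domain noise estimate, and that the speech variance is the noisy variance minus the noise variance. The function names are illustrative, and any cubic coefficients passed to `cubic_weight` are placeholders; the fitted values are not given in the text.

```python
import math

def variance(x):
    """Population variance of a sequence."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def estimate_input_snr_db(noisy, noise_estimate):
    """SNR from variances, using var(noisy) = var(speech) + var(noise)."""
    var_noise = variance(noise_estimate)
    var_speech = max(variance(noisy) - var_noise, 1e-12)  # floor keeps the log finite
    return 10.0 * math.log10(var_speech / var_noise)

def cubic_weight(snr_db, coeffs):
    """3rd-degree polynomial in the SNR (Horner form); coeffs are placeholders."""
    c3, c2, c1, c0 = coeffs
    return ((c3 * snr_db + c2) * snr_db + c1) * snr_db + c0

def weighted_noise_subtraction(noisy, noise_estimate, weight):
    """Time-domain subtraction of the weighted noise estimate (assumed form)."""
    return [y - weight * d for y, d in zip(noisy, noise_estimate)]

# Toy signals chosen so that speech and noise are exactly uncorrelated.
speech = [1.0, -1.0, 1.0, -1.0] * 50
noise = [0.5, 0.5, -0.5, -0.5] * 50
noisy = [s + d for s, d in zip(speech, noise)]

snr_db = estimate_input_snr_db(noisy, noise)          # 10*log10(1 / 0.25)
enhanced = weighted_noise_subtraction(noisy, noise, 1.0)
```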
The general block diagram of the proposed system is shown in Figure 3. In the block diagram, the first stage is a Noise Subtraction (NS) method with a weight and the second stage is an Empirical Mode Decomposition (EMD) based adaptive thresholding method.
Figure 3: The block diagram of the two-stage NWNS+EMD method.
The principle of the EMD technique is to decompose any signal y(n) into a set of band-limited functions, which are zero-mean oscillating components called intrinsic mode functions (IMFs) . Although a mathematical model has not been developed yet, different methods for computing EMD have been proposed since its introduction . The very first algorithm, called the sifting process, is adopted here to find the IMFs and includes the following steps:
Once the first IMF is derived, we continue by finding the remaining IMFs. For this purpose, we subtract the first IMF c1(n) from the original data to get the residue signal r1(n). The residue now contains the information about the components of longer periods. We treat it as the new data and repeat steps 1 to 6 until we find the second IMF.
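A minimal sketch of the sifting loop as it is commonly described: locate the extrema, build upper and lower envelopes, subtract their mean, and repeat until the change is small; then subtract the IMF from the data and continue with the residue. Several choices here are assumptions made to keep the example dependency-free: piecewise-linear envelopes replace the cubic-spline envelopes normally used, envelope ends are held flat, and the stopping tolerance and iteration cap are arbitrary.

```python
import math

def extrema(x):
    """Indices of interior local maxima and minima."""
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] <= x[i + 1]]
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear envelope through x at idx, held flat at the edges."""
    env = [0.0] * len(x)
    for k in range(idx[0]):
        env[k] = x[idx[0]]
    for k in range(idx[-1], len(x)):
        env[k] = x[idx[-1]]
    for a, b in zip(idx, idx[1:]):
        for k in range(a, b):
            t = (k - a) / (b - a)
            env[k] = (1 - t) * x[a] + t * x[b]
    return env

def sift_one_imf(x, max_iter=50, tol=1e-3):
    """Repeatedly subtract the mean envelope until the update is small."""
    h = list(x)
    for _ in range(max_iter):
        maxima, minima = extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper = envelope(h, maxima)
        lower = envelope(h, minima)
        new_h = [v - (u + l) / 2 for v, u, l in zip(h, upper, lower)]
        change = sum((a - b) ** 2 for a, b in zip(h, new_h))
        energy = sum(a * a for a in h) or 1.0
        h = new_h
        if change / energy < tol:
            break
    return h

def emd(x, max_imfs=8):
    """Extract IMFs one by one; whatever is left over is the residue."""
    imfs, residue = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = extrema(residue)
        if len(maxima) < 2 or len(minima) < 2:
            break
        imf = sift_one_imf(residue)
        imfs.append(imf)
        residue = [r - c for r, c in zip(residue, imf)]
    return imfs, residue

# Two-tone test signal: a fast oscillation riding on a slow one.
signal = [math.sin(2 * math.pi * 0.05 * n) + 0.5 * math.sin(2 * math.pi * 0.005 * n)
          for n in range(400)]
imfs, residue = emd(signal)
```

By construction, the IMFs and the final residue sum back to the input signal, which is a convenient sanity check for any EMD implementation.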
The soft thresholding strategy proposed in  for a frame m of length L in the transform domain is
where  denotes the average power of the frame,  is the global noise variance of the speech, Yq is the qth coefficient of the frame obtained by the required transformation, and  denotes the thresholded samples of the frame. The multiplication factor jγ is the linear threshold function, with j being the sorted index number of |Yq|. An estimated value of γ can be obtained as:
where  is an adaptation factor whose value is determined experimentally such that 0<<1. The first part of Eq. (7) applies to a signal-dominant frame, when the condition is satisfied, and the second part applies to a noise-dominant frame, where soft thresholding has to be applied. The classification of a frame as signal dominant or noise dominant therefore depends on the average power of the frame and the global noise variance of the given noisy speech. In this paper, we apply this soft thresholding strategy adaptively to each IMF, as discussed in the next section.
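The frame-level rule can be sketched as follows, under assumptions that the text does not confirm: a frame counts as signal dominant when its average power exceeds the global noise variance, coefficients are ranked in ascending order of magnitude with j starting at 1, and γ is passed in directly rather than computed from the adaptation factor.

```python
import math

def soft_threshold_frame(coeffs, noise_var, gamma):
    """Keep signal-dominant frames; shrink noise-dominant ones by j * gamma."""
    length = len(coeffs)
    avg_power = sum(c * c for c in coeffs) / length
    if avg_power > noise_var:              # assumed signal-dominant condition
        return list(coeffs)
    # Rank coefficients by magnitude (ascending order is an assumption).
    order = sorted(range(length), key=lambda q: abs(coeffs[q]))
    out = [0.0] * length
    for j, q in enumerate(order, start=1):
        shrunk = max(0.0, abs(coeffs[q]) - j * gamma)
        out[q] = math.copysign(shrunk, coeffs[q])
    return out

loud = soft_threshold_frame([3.0, -4.0], noise_var=1.0, gamma=0.1)
quiet = soft_threshold_frame([0.1, -0.5, 0.3], noise_var=1.0, gamma=0.1)
```

In the second call the frame power is below the noise variance, so the three coefficients are shrunk by 0.1, 0.3 and 0.2 respectively according to their magnitude ranks, and each keeps its original sign.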
The soft thresholding strategy performs better over a wide range of input SNRs because it thresholds only the noise-dominant frames and leaves the signal-dominant frames unchanged, but the misclassification of frames is a major drawback that causes musical noise . This method is therefore mainly appropriate for white noise. These drawbacks can be significantly reduced with the proposed EMD-based adaptive thresholding strategy, which modifies the frame classification criteria. Since the IMFs have different noise and speech energy distributions, each IMF has a different noise and speech variance. After applying EMD, the soft thresholding technique is applied to each subframe of each IMF based on the computed variances. The variances change from subframe to subframe as well as from IMF to IMF. The threshold changes with the newly computed variances, and hence this technique is termed adaptive thresholding. The proposed EMD-based adaptive thresholding strategy for a subframe of an IMF is:
Here, denotes to the thresholded samples of subframe of the IMF, is coefficient of subframe of IMF and the multiplication is the adaptive threshold function while being the sorted index-number of . The threshold factor is varied adaptively for individual IMF according to its variance. An estimated value of can be obtained as:
where, , , adaptation factor and noise variance of the IMF. Since the global noise variance is estimated from silent frames, each frame and each subframe is assumed to share that variance. For this reason, the boundary for classifying subframes should be set to twice the globally estimated noise variance, which is the subframe power when its noise variance and speech variance are equal. The enhanced speech signal of the EMD-based adaptive thresholding is given by
where, I=total number of IMFs,
R=total number of subframe and
Q=length of a subframe.
We test the effectiveness of the proposed NWNS+EMD based adaptive thresholding algorithm on speech data corrupted by three different types of additive noise (white, pink and HF channel noise) taken from the NOISEX database. N=56320 samples of the clean speech /she had your dark suit in greasy wash water all year/ from the TIMIT database were used for all simulations. The noises are added to the clean speech at SNRs from -10 dB to 30 dB in steps of 5 dB to obtain the noisy speech signals.
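Mixing noise into clean speech at a prescribed SNR amounts to scaling the noise so that the power ratio matches the target. A minimal sketch follows, with synthetic stand-ins for the TIMIT and NOISEX signals (a sine wave and seeded Gaussian noise); the function names are illustrative.

```python
import math
import random

def power(x):
    """Mean squared value of a sequence."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db."""
    gain = math.sqrt(power(clean) / (power(noise) * 10 ** (snr_db / 10)))
    scaled = [gain * v for v in noise]
    noisy = [c + d for c, d in zip(clean, scaled)]
    return noisy, scaled

# SNR grid from the text: -10 dB to 30 dB in steps of 5 dB.
snr_levels_db = list(range(-10, 31, 5))

rng = random.Random(0)                                   # seeded for repeatability
clean = [math.sin(0.05 * n) for n in range(2000)]        # stand-in for speech
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]       # stand-in for white noise
noisy, scaled_noise = mix_at_snr(clean, noise, 5)
achieved_db = 10 * math.log10(power(clean) / power(scaled_noise))
```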
To evaluate the performance of the method, we use the overall output SNR and the average segmental SNR, which are also presented graphically, as objective measures of speech quality. The average output SNRs for white noise, pink noise and HF channel noise at various SNR levels are given in Table 1 for the pre-processed speech of the first stage and the final enhanced speech of the second stage, respectively. In real-world environments, the noise power is sometimes equal to or greater than the signal power, or the noise spectral characteristics change rapidly with time; NS or NWNS is not very effective in such situations because large errors are introduced in the noise estimation process. EMD-based adaptive thresholding plays a vital role in such cases, as Table 1 shows. Table 2 compares the overall average output SNR of our previous methods, WNS and WNS+BSS, with that of the proposed NWNS+EMD method.
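The two objective measures can be sketched as follows. The exact definitions used in the experiments are not given in the text, so this follows a common convention: segmental SNR is the mean of per-frame SNRs, each clamped to the range -10 dB to 35 dB. The frame length of 320 samples and the clamping limits are assumptions.

```python
import math

def overall_snr_db(clean, enhanced):
    """Signal energy over error energy across the whole utterance."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((c - e) ** 2 for c, e in zip(clean, enhanced))
    return 10 * math.log10(signal_energy / error_energy)

def segmental_snr_db(clean, enhanced, frame_len=320, lo=-10.0, hi=35.0):
    """Average of clamped per-frame SNRs over non-overlapping frames."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(v * v for v in c)
        error_energy = sum((a - b) ** 2 for a, b in zip(c, e))
        if signal_energy == 0.0 or error_energy == 0.0:
            continue                      # skip frames where the ratio is undefined
        frame_snr = 10 * math.log10(signal_energy / error_energy)
        values.append(min(hi, max(lo, frame_snr)))
    return sum(values) / len(values) if values else float("nan")

# Toy check: an output that is 90% of the clean signal has a 10% error,
# which corresponds to 20 dB in every frame.
clean = [math.sin(0.1 * n) for n in range(640)]
enhanced = [0.9 * v for v in clean]
out_snr = overall_snr_db(clean, enhanced)
seg_snr = segmental_snr_db(clean, enhanced)
```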
Table 1: The average output SNR for various types of noises at different input SNR by NWNS and NWNS+EMD (indicated as EMD).
Table 2: The average output SNR for various types of noises at different input SNR by WNS, WNS+BSS (previous methods) and NWNS+EMD (indicated as EMD).
In terms of speech quality and intelligibility, the proposed two-stage method (NWNS+EMD based adaptive thresholding) gives a better tradeoff between noise reduction and speech distortion. We investigate this effect using the enhanced speech waveforms obtained from the various methods, as shown in Figure 4. The waveforms show that with the NS method the enhanced speech is distorted in the low voiced parts as a result of the noise removal, whereas with NWNS it is not. A small amount of noise is removed from the corrupted speech by the NWNS method. Thus NS loses speech intelligibility, while NWNS maintains it. Although EMD-based adaptive thresholding successfully removes the noise from the voiced parts, some noise remains in the silent parts because subframes are misclassified as signal dominant. This problem can be avoided using the proposed method. We also observe that with NS+EMD based adaptive thresholding there is a loss of information in the lower voiced parts, and speech intelligibility is reduced as a result. In contrast, the waveform obtained by NWNS+EMD based adaptive thresholding shows no loss of information in the lower voiced parts and maintains speech intelligibility. We use two perceptually motivated objective speech quality assessments, namely the average segmental SNR (ASEGSNR) and the Perceptual Evaluation of Speech Quality (PESQ), to study the effectiveness of the proposed method.
Figure 4: Speech waveforms of (from top) clean, noisy (HF noise at 10dB), enhanced by NWNS and NWNS+EMD.
In Figures 5 and 6, it is observed that our proposed NWNS+EMD based adaptive thresholding approach achieves comparable improvements in speech quality. The PESQ scores of the speech at -10 dB and -5 dB (pink and HF channel noise) are almost equal to the input PESQ scores. This is due to the presence of musical noise in the first stage.
Figure 5: Comparisons of the average output segmental SNR (ASEGSNR) by NWNS and NWNS+EMD methods for pink noise (left) and HF channel noise (right).
Figure 6: Comparison of PESQ scores by NWNS and NWNS+EMD methods for pink noise (left) and HF channel noise (right).
In this paper, we presented a new algorithm to effectively remove the noise components at all frequency levels of a noisy speech signal. We combined two powerful methods, NWNS and empirical mode decomposition, to remove the noise in two stages. The main advantage of the algorithm is the effective removal of the noise components at all levels of SNR. We obtain not only a better SNR but also better speech quality without residual noise. Future studies will most probably give better results.
International Journal of Computer Science and Security