Understanding common features used: [Audio/Speech Processing Part 1]

If you’re interested in working on digital audio processing, and speech processing in particular, this is the article for you. Here I will talk about the common features used for various tasks such as speech-to-text, text-to-speech, and noise cancellation.

This is Part 1 of the series, covering:


  1. Waveforms
  2. Spectrogram
  3. Mel-spectrograms
  4. Mel Frequency Cepstral Coefficients (MFCC)

All of the following plots can be reproduced here.


Waveform for the word ‘seven’.

This is what the spoken number ‘seven’ looks like in a 1D representation called a waveform. But what does this actually mean? To really understand it, we look at how frequencies are represented at a fundamental level.

Superposition of different frequencies

Consider these two frequencies, 1 Hz and 2 Hz. When we add the two waves together, the result is a superposition of these two frequencies.
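A minimal sketch of this superposition, assuming NumPy is available. The sampling rate and duration are values I picked for illustration, not the ones used for the plots:

```python
import numpy as np

# Assumed values for illustration: 1 second of signal sampled at 100 Hz.
sample_rate = 100
t = np.arange(sample_rate) / sample_rate  # time axis in seconds

wave_1hz = np.sin(2 * np.pi * 1 * t)  # 1 Hz sine wave
wave_2hz = np.sin(2 * np.pi * 2 * t)  # 2 Hz sine wave
superposition = wave_1hz + wave_2hz   # sample-by-sample sum of the two waves

# A Fourier transform of the sum recovers the two component frequencies.
magnitudes = np.abs(np.fft.rfft(superposition))
freqs = np.fft.rfftfreq(len(superposition), d=1 / sample_rate)
dominant = sorted(freqs[np.argsort(magnitudes)[-2:]].tolist())
print(dominant)  # the two strongest frequencies, in Hz
```

The waveform itself is just the sum of the two sine waves; the Fourier transform at the end is what lets us pull the components apart again.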

Now look back at the number ‘seven’ spoken by a human: it is a superposition of many frequencies within the range of human hearing (roughly 20 Hz to 20 kHz).

Now, how do we actually pinpoint which frequencies are present? For that we look at a 2D representation of audio: the spectrogram.


Waveform for the word ‘seven’, and its corresponding spectrogram.

By performing a Fourier transform on short frames of the waveform, we get its frequency content over time, with frequency on the y-axis. Here the y-axis stops at 8 kHz because the audio is sampled at 16 kHz, and a sampled signal can only represent frequencies up to half its sampling rate (the Nyquist frequency). The redder the value, the greater the amplitude of that frequency in that time period.

Notice there are red values in the top left-hand corner. This is because the ‘shhh’ sound in the earlier part of the word ‘seven’ is of higher frequency (4 kHz to 8 kHz), hence the redder values in that region.

On the other hand, more ‘vibrational’ frequencies are observed in the middle part. These are caused by the vibration of our vocal cords in the later half of the word ‘seven’.

To understand this physically, simply touch your Adam’s apple and say the word ‘SHHH-EEEE-VEEEEN’ really slowly. You will realise that during the ‘SHHH’ part, your vocal cords are not vibrating at all.
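The frame-by-frame Fourier transform behind a spectrogram can be sketched as follows. This is a simplified version assuming NumPy; the function name `spectrogram`, the frame length of 512, the hop of 128, and the Hann window are my own choices for illustration, and a synthetic 1 kHz tone stands in for the recording of ‘seven’:

```python
import numpy as np

def spectrogram(signal, frame_length=512, hop_length=128):
    """Magnitude spectrogram: one FFT per overlapping, windowed frame."""
    window = np.hanning(frame_length)
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + frame_length] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies: 0 Hz up to sample_rate / 2.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # shape: (freq_bins, time_frames)

# Hypothetical test signal: 1 second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
freqs = np.fft.rfftfreq(512, d=1 / sample_rate)  # frequency of each row, in Hz
peak_hz = freqs[spec.mean(axis=1).argmax()]      # row with the most energy
print(spec.shape, peak_hz, freqs[-1])
```

The last frequency row sits at half the sampling rate, which is why a 16 kHz recording gives a spectrogram that tops out at 8 kHz.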


Overlapping mel filters
Relationship of Linear to Mel : Obtained from here

Another interesting and commonly used feature is the mel filterbank.

The reason this is studied is that human perception of frequency is nonlinear: as frequency increases, equal differences in Hz become harder to hear. That is, the difference between 10 Hz and 20 Hz is very perceptible, but the difference between 10010 Hz and 10020 Hz is almost imperceptible.

The figures above demonstrate how audio experts model this relationship.
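To put rough numbers on this, here is a small sketch using one widely used Hz-to-mel formula, m = 2595 · log10(1 + f / 700). Other variants of the mel scale exist, and I am assuming this one only for illustration; the helper names `hz_to_mel` and `mel_to_hz` are my own:

```python
import math

def hz_to_mel(freq_hz):
    """Convert Hz to mels using one common formula (an assumption here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 10 Hz gap, measured in mels at the low end and at the high end.
low_gap = hz_to_mel(20) - hz_to_mel(10)
high_gap = hz_to_mel(10020) - hz_to_mel(10010)
print(low_gap, high_gap)
```

Under this formula, the 10 Hz step at the low end spans more than ten times as many mels as the same step around 10 kHz, which matches the intuition above.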

Left : Linear spectrogram. Right : Melspectrogram

Using this relationship, we transform the spectrogram of ‘seven’ into its mel-spectrogram by applying the mel filterbank. Above is a side-by-side comparison of the linear spectrogram and its mel conversion. Notice that the higher frequencies are now ‘squeezed’ together, while the lower frequencies are spread further apart.

Note that this conversion is also loosely considered a dimensionality-reduction technique, as the number of frequency bins is reduced from 256 linear bins to 64 mel bins.
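The conversion can be sketched as a matrix of overlapping triangular filters multiplied with the linear spectrogram. This is a simplified construction assuming NumPy and the 2595 · log10(1 + f / 700) mel formula; the function names and the FFT size of 512 (which gives 257 linear bins rather than exactly 256) are my assumptions, and real libraries differ in details such as filter normalisation. Random values stand in for a real spectrogram:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=64, n_fft=512, sample_rate=16000):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1 / sample_rate)  # Hz of each FFT bin
    # n_mels filters need n_mels + 2 edge points (left, centre, right).
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, centre, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - left) / (centre - left)
        falling = (right - fft_freqs) / (right - centre)
        # Triangle: rises to 1 at the centre, falls back to 0, never negative.
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

bank = mel_filterbank()
# Stand-in linear spectrogram: 257 frequency bins x 100 frames of random values.
linear_spec = np.random.default_rng(0).random((bank.shape[1], 100))
mel_spec = bank @ linear_spec  # each mel bin is a weighted sum of linear bins
print(linear_spec.shape, mel_spec.shape)
```

Because the filter edges are evenly spaced in mels, the filters are narrow at low frequencies and wide at high frequencies, which is what produces the ‘squeezing’ effect.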


A further post-processed feature of the mel filterbank output is the Mel-Frequency Cepstral Coefficients (MFCC). It is constructed by performing a discrete cosine transform (DCT) on each frame of the log mel-spectrogram. Most of the time, we remove the higher coefficients, as they are mostly redundant.
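As a rough sketch of that step, the DCT can be written as a matrix and applied to every frame at once. I am assuming an orthonormal DCT-II here (a common choice, but not necessarily the one used for the plots), with random values standing in for a real log mel-spectrogram; `dct_matrix` is a helper name of my own:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n)."""
    k = np.arange(n)[:, None]   # coefficient index (rows)
    m = np.arange(n)[None, :]   # input index, i.e. mel bin (columns)
    basis = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)    # rescale the first row so the matrix is orthonormal
    return basis

# Stand-in log mel-spectrogram: 64 mel bins x 100 frames.
log_mel = np.log(np.random.default_rng(0).random((64, 100)) + 1e-6)

dct = dct_matrix(64)
mfcc = dct @ log_mel   # DCT along the mel axis of every frame (column)
mfcc_low = mfcc[:32]   # keep only the lower 32 coefficients
print(mfcc.shape, mfcc_low.shape)
```

Note that the transform runs along the mel axis within each frame, so every frame is processed independently of its neighbours in time.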

Left : All 64 coefficients of the MFCC. Right : Only keep the lower 32 coefficients.

Here we have two MFCCs: on the left, the MFCC has all 64 coefficients; on the right, we removed the higher 32 coefficients. As observed, we can’t really interpret the MFCC as well as we could the previously observed spectrograms, as the DCT decorrelates the mel bins within each frame. Now, how do we interpret the reduced dimensionality?

Respective reconstructed Log Melspectrogram

We reconstruct the spectrogram, of course!

Notice that the reconstructed log mel-spectrogram is more ‘blurry’. This is due to the higher coefficients (less important information) being removed. Focusing on the red column, this is the spectral envelope for that particular frame.

Respective reconstructed Log Melspectrogram

The left plot shows the original spectral envelope. The right plot shows the original in blue in the background, and in red the spectral envelope reconstructed from the MFCC with the higher 32 coefficients removed. Although we removed half of the coefficients, we still largely retain the shape of the spectral envelope.
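The truncate-and-reconstruct step can be sketched like this, again assuming NumPy and an orthonormal DCT-II (so the inverse is simply the transpose). The envelope below is a made-up smooth bump with a fine ripple added on top, not the real frame from ‘seven’:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n)."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * (m + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis

# Hypothetical spectral envelope for one frame: a smooth bump over 64 mel bins
# plus a small, rapidly oscillating ripple.
bins = np.arange(64)
smooth = np.exp(-((bins - 20) / 12.0) ** 2)
ripple = 0.05 * np.cos(np.pi * (bins + 0.5) * 50 / 64)
envelope = smooth + ripple

dct = dct_matrix(64)
coeffs = dct @ envelope
coeffs_truncated = coeffs.copy()
coeffs_truncated[32:] = 0.0               # remove the higher 32 coefficients
reconstructed = dct.T @ coeffs_truncated  # inverse DCT via the transpose

relative_error = np.linalg.norm(envelope - reconstructed) / np.linalg.norm(envelope)
print(relative_error)
```

In this toy example the reconstruction keeps the smooth bump and loses the ripple, which is the same ‘blurring’ seen in the reconstructed log mel-spectrogram.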

An interesting thing to note from this is that if we express a particular spectral envelope as a one-dimensional plot, in a way this is like expressing the frequencies as a waveform. Removing the higher MFCC coefficients is then like removing the high-frequency content of the spectral envelope. Hence, we are removing the high-frequency oscillations of the frequencies (frequenception !!!).

If you are interested in reproducing all of the above, or in testing with your own audio, please see here.