Speech Data Augmentation: [Audio/Speech Processing Part 2]

Leonard Loo
5 min read · Apr 18, 2020

In this article, we discuss how to perform data augmentation by simulating scenarios where audio data is degraded. This lets us increase the amount of training data, which is very useful for machine learning, and also lets us test the robustness of our algorithms.

This is Part 2 of the series.

Contents:

  1. Noise
  2. Reverberation (Echoey effect)
  3. SpecAugment (Recent advancement in audio augmentation)

Noise

When we collect audio data in the wild, there is bound to be some noise in the background. We largely split it into two categories: stationary noise and non-stationary noise.

Stationary Noise

The most common example is Gaussian noise. We can simulate it simply by sampling from a normal distribution.

Top left is the clean audio signal; top right is the Gaussian noise to be added.

We add Gaussian noise to a word at different signal-to-noise ratios (SNRs) and observe the effects.

As the SNR decreases, the noise amplitude increases, which is much more evident in the SNR=0 waveform than in the SNR=15 one.
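Here is a minimal sketch of this mixing step, assuming the waveform is already loaded into a NumPy array; the exact SNR values and noise generation used for the plots above are the article's, and the function name is just illustrative.

```python
import numpy as np

def add_gaussian_noise(clean, snr_db):
    """Add zero-mean Gaussian noise to `clean` at the requested SNR (in dB)."""
    noise = np.random.normal(0.0, 1.0, size=clean.shape)
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# e.g. noisy_15 = add_gaussian_noise(signal, snr_db=15)
#      noisy_0  = add_gaussian_noise(signal, snr_db=0)
```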

How does this affect the frequencies in particular?

In the noise spectrogram, we can see that all frequencies are uniformly affected by the noise, which is evident in the SNR=15 and SNR=0 plots.

Non-Stationary Noise

In reality, however, we mostly deal with non-stationary, real-world noise that overlaps with the speech.

On the waveform level, it is hard to see the non-stationarity of the noise. On the spectrogram of the babble noise, however, we can see some dynamic frequencies that stand out, and they are hard to filter out of the clean audio, especially at low SNR levels.
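Mixing a recorded noise clip such as babble works the same way as the Gaussian case, except that the noise comes from a file instead of a random generator. The sketch below assumes the clip has already been loaded (and resampled to the same rate as the speech); loading is not shown and the helper name is illustrative.

```python
import numpy as np

def mix_noise_at_snr(clean, noise, snr_db):
    """Mix a recorded noise clip (e.g. babble) into `clean` at a target SNR in dB."""
    # Tile or trim the noise so it matches the length of the clean signal
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```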

Reverberation

Reverberation occurs when a conversation takes place in an enclosed space with highly reflective walls and ceilings. You notice it most when you are in a large hall, far away from the speaker.

Comparison of the waveform between clean and reverb. Samples are available in the provided links.

Here we use a longer phrase rather than a single word, lasting about two seconds, because the effect of reverberation is only noticeable on a longer phrase with more context.

The spectrogram on the left shows a very clean sample, with frequencies that are clearly distinct from each other. In the reverberant one, however, the frequencies are smeared and the spectrogram becomes ‘blurry’, yet it looks quite different from the noise shown above. This is reverberation.
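A common way to simulate this effect is to convolve the clean signal with a room impulse response. The sketch below uses a synthetic, exponentially decaying noise burst as a stand-in impulse response (the `rt60` value is an arbitrary assumption); how the article's own reverberant sample was produced is not shown here, and in practice you would convolve with a measured room impulse response.

```python
import numpy as np
from scipy.signal import fftconvolve

def add_reverb(clean, sr, rt60=0.6):
    """Simulate reverberation by convolving `clean` with a synthetic impulse
    response: white noise shaped by an exponential decay of length rt60 seconds."""
    ir_len = int(sr * rt60)
    t = np.arange(ir_len) / sr
    # Energy decays by 60 dB over rt60 seconds
    decay = 10 ** (-3.0 * t / rt60)
    ir = np.random.randn(ir_len) * decay
    ir /= np.max(np.abs(ir)) + 1e-9
    reverbed = fftconvolve(clean, ir, mode="full")[: len(clean)]
    # Normalise to avoid clipping
    return reverbed / (np.max(np.abs(reverbed)) + 1e-9)
```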

SpecAugment

SpecAugment is a recent paper by Google Brain that boosts accuracy on Automatic Speech Recognition (ASR) tasks. The augmentation consists of three operations: 1. Time Warping, 2. Time Masking, and 3. Frequency Masking.

Time Warping

Time Warping speeds up certain parts of the audio and/or slows down others. It does not change the meaning of what is spoken in the sentence; it merely adjusts the speed at different parts of the audio.

Left : Original. Right : Time Warped. Red underline shows matching section

Notice that for sections A and B the duration is shortened, while for C, D, and E it is extended. When you listen to the samples, the earlier part is spoken faster and the later part slower. Notice also that where the audio is sped up, the frequencies shift slightly upwards, and where it is slowed down they shift downwards, because speeding up the audio shortens the wavelength of all frequencies.
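The SpecAugment paper implements Time Warping with a sparse image warp over the spectrogram; the sketch below is only a simplified 1-D approximation that shifts a random anchor frame and linearly re-interpolates the time axis. The (frequency x time) layout and parameter values are assumptions.

```python
import numpy as np

def time_warp(spec, max_shift=8, rng=None):
    """Simplified time warp: move one random anchor frame left/right and
    linearly re-interpolate the time axis of a (freq x time) spectrogram."""
    if rng is None:
        rng = np.random.default_rng()
    n_freq, n_time = spec.shape
    if n_time <= 2 * max_shift + 2:
        return spec
    center = int(rng.integers(max_shift + 1, n_time - max_shift - 1))
    shift = int(rng.integers(-max_shift, max_shift + 1))
    new_center = center + shift
    # For every new frame index, compute the old (fractional) frame it reads from
    left = np.linspace(0, center, new_center + 1)
    right = np.linspace(center, n_time - 1, n_time - new_center)[1:]
    old_idx = np.concatenate([left, right])
    # Interpolate each frequency bin along the warped time axis
    return np.stack([np.interp(old_idx, np.arange(n_time), row) for row in spec])
```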

Time Masking

Time Masking causes certain words/phones of the audio to be missing.

Left : Time Warped. Right : Time Masked. Red border shows the missing chunks.

This happens in real life, especially when we transmit audio over long distances: certain bits are corrupted or lost, leaving the audio with missing chunks. As humans, we can still make out the words when the missing chunks are below a certain threshold (if a whole word is missing, of course we cannot guess it without context; language modelling can deal with that, but it is another matter). Computers, however, are prone to fail in these cases. To make algorithms robust to this kind of problem, we perform Time Masking.
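A minimal time-masking sketch on a (freq x time) spectrogram follows. The mask widths and counts are arbitrary assumptions; the paper samples them from configurable ranges and masks with the mean of the log-mel spectrogram, whereas here the masked frames are simply zeroed.

```python
import numpy as np

def time_mask(spec, max_width=30, n_masks=2, rng=None):
    """Zero out `n_masks` random spans of consecutive time frames."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_time = spec.shape[1]
    for _ in range(n_masks):
        width = int(rng.integers(0, max_width + 1))
        start = int(rng.integers(0, max(1, n_time - width)))
        spec[:, start:start + width] = 0.0
    return spec
```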

Frequency Masking

Frequency Masking causes certain frequency bins of the audio to be missing.

Left : Time Warped. Right : Frequency Masked. Red border shows the missing frequency band.

We want algorithms to be robust to missing frequency bins: most spoken words are still intelligible even when some frequency bands are removed, which means humans do not depend on every frequency being present to recognise a word. We want algorithms to share that robustness, hence we perform Frequency Masking.
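Frequency masking mirrors the time-masking sketch above, only along the frequency axis; again the parameter values are assumptions and the masked bins are zeroed for simplicity.

```python
import numpy as np

def freq_mask(spec, max_width=15, n_masks=2, rng=None):
    """Zero out `n_masks` random bands of consecutive frequency bins."""
    if rng is None:
        rng = np.random.default_rng()
    spec = spec.copy()
    n_freq = spec.shape[0]
    for _ in range(n_masks):
        width = int(rng.integers(0, max_width + 1))
        start = int(rng.integers(0, max(1, n_freq - width)))
        spec[start:start + width, :] = 0.0
    return spec
```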

The final augmented data involves all three operations: Time Warping, Time Masking, and Frequency Masking.
Here is a side-by-side comparison of each SpecAugment step on the spectrogram, along with its reconstructed waveform.
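Chaining the three sketches above (assuming `spec` is a freq x time magnitude or log-mel spectrogram) would look like this:

```python
# Apply the three SpecAugment operations sketched above in sequence
augmented = freq_mask(time_mask(time_warp(spec)))
```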

I hope this is sufficient to get you started on performing data augmentation for your audio/speech data and boosting the accuracy of your model!

All the code can be found and reproduced here.
