Posted February 19th 2009
A technician working for an operator mailed me a few days ago wondering why the recorded voice clips they use for their IVR sound so bad, "like they're coming from the bottom of a deep well". It turned out that the clips actually sounded OK on a telephone, just not through his laptop's speaker. He asked if I recommend any specific filter parameters when converting audio from 44.1kHz wav to 8kHz Alaw voice clips.
I took this audio snippet from the introduction to an audio book. It was originally a .mp3 file. I converted it to a .wav file with a 44.1kHz sampling rate and 16 bits per sample. For my purposes, artefacts from mp3 are negligible.
1_mono.wav (44.1kHz, 16 bit linear samples)
Next, I converted it to 8kHz Alaw using sox. 8kHz Alaw is what runs on the fixed telephone network in most of the world (North America and Japan use a close relative, μlaw):
sox 1_mono.wav --encoding a-law --rate 8000 2_8kHz_alaw.wav
2_8kHz_alaw.wav (8kHz, 8 bit Alaw samples)
That sounds a bit less clear than the original, but it's OK. It's what you'd expect coming out of a telephone. There's some weirdness though. The audible difference between the two files varies from one PC to another and even from one playback program to another. Why? Because laptop speakers vary in quality, and because playback programs usually quietly convert everything back to a 48kHz or 44.1kHz sampling rate, each with its own resampling method. For fun, I resampled to 44.1kHz:
sox 2_8kHz_alaw.wav --rate 44100 --encoding signed 3_resampled.wav
3_resampled.wav (44.1kHz, 16 bit linear samples)
2_8kHz_alaw.wav and 3_resampled.wav should sound almost the same. But on some PCs they sound markedly different.
The GTH has a simple approach to playing back audio. It just copies the bytes you give it to the destination timeslot. No format or rate conversion happens, though the GTH does make sure the data is played out at the E1's frame rate (8000Hz). The downside of that is that you have to convert all the files for your IVR system before giving them to a GTH, e.g. using sox. The upside is that it's simple. Nothing happens behind your back.
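Because one byte goes out per frame, the playing time of a raw clip follows directly from its size. Here is a minimal sketch in Python, assuming the clip is raw 8-bit Alaw as described above; the function name is made up for illustration:

```python
FRAME_RATE_HZ = 8000  # E1 frame rate: one byte per timeslot per frame


def clip_duration_seconds(num_bytes, frame_rate_hz=FRAME_RATE_HZ):
    """Playing time of a raw 8-bit clip copied byte-for-byte onto one timeslot."""
    return num_bytes / frame_rate_hz


if __name__ == "__main__":
    # A 10 second prompt is 80000 bytes of raw Alaw.
    print(clip_duration_seconds(80000))
```

If the number that comes out doesn't match how long the prompt should be, the file probably wasn't converted to 8kHz, 8 bit, mono before it was handed over.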
To convert an audio recording to raw a-law for GTH:
sox original.wav --rate 8000 --channels 1 --encoding a-law --type raw gth.raw
To convert a raw recording from a GTH to something most audio programs can play:
sox --type raw --rate 8000 --channels 1 --encoding a-law gth.raw --rate 44100 --encoding signed gth.wav
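If sox isn't at hand, the decoding half of that conversion can be sketched in a few lines of Python. This is an illustration, not what sox does internally: it uses the standard G.711 Alaw expansion, keeps the 8kHz rate rather than resampling to 44.1kHz, and the function names and the gth.raw/gth.wav file names are just the ones used in this post.

```python
import struct
import wave


def alaw_to_linear(byte):
    """Decode one G.711 Alaw byte to a signed 16-bit linear sample."""
    byte ^= 0x55                      # undo the alternate-bit inversion
    mantissa = (byte & 0x0F) << 4
    segment = (byte & 0x70) >> 4
    if segment == 0:
        magnitude = mantissa + 8
    else:
        magnitude = (mantissa + 0x108) << (segment - 1)
    # In Alaw a set top bit means a positive sample.
    return magnitude if byte & 0x80 else -magnitude


def raw_alaw_to_wav(raw_bytes, wav_file, rate_hz=8000):
    """Write raw mono Alaw bytes as a 16-bit linear PCM WAV (no resampling)."""
    samples = [alaw_to_linear(b) for b in raw_bytes]
    with wave.open(wav_file, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)           # 16 bits per sample
        out.setframerate(rate_hz)
        out.writeframes(struct.pack("<%dh" % len(samples), *samples))


if __name__ == "__main__":
    with open("gth.raw", "rb") as f:
        raw_alaw_to_wav(f.read(), "gth.wav")
```

The resulting file is still sampled at 8kHz, so whatever plays it will do its own rate conversion, with the variations described earlier.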
(May 2016: I updated this section because sox has changed since I first wrote this in 2009. The options above work for sox 14.4.1.)
There's a certain sound quality level expected in telephone networks, and part of that is that the network carries everything up to about 3500Hz. Analog local loop specifications mention that, and pretty much all digital telephone systems use an 8kHz sampling rate, which can represent audio up to 4kHz and so leaves a little headroom for filtering above 3.5kHz. Even the GSM and AMR codecs start off with the assumption that the incoming audio is limited to 3500Hz.
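The link between sampling rate and bandwidth is textbook sampling theory, nothing specific to sox or the GTH: a sampling rate can only represent frequencies up to half its value, and anything above that folds back down into the band. A small sketch in Python (the function names are made up for illustration):

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2


def apparent_frequency_hz(tone_hz, sample_rate_hz):
    """Frequency a pure tone shows up at after sampling with no filtering."""
    nearest_multiple = round(tone_hz / sample_rate_hz) * sample_rate_hz
    return abs(tone_hz - nearest_multiple)


if __name__ == "__main__":
    print(nyquist_limit_hz(8000))             # 4000.0
    print(apparent_frequency_hz(3500, 8000))  # 3500
    print(apparent_frequency_hz(5000, 8000))  # 3000
```

A 5kHz tone in a 44.1kHz recording would turn into a 3kHz tone if the samples were simply thinned out to 8kHz, which is why a rate converter has to low-pass filter the audio before dropping the rate.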
So the bar is set pretty low. I haven't come across any systems which set out to provide higher quality. Even Skype compresses the hell out of the audio to save bandwidth, even when both parties in a conversation have huge amounts of it. That's surprising: why not aim for VoIP to sound much better than a regular telephone?