Speech and audio are two preferred ways of communication. By speech, we generally mean two-way sound communication as used in telephony or similar situations, for example video conferencing that includes audio. By audio, we generally mean the higher-quality sound used in broadcast, CD, DVD, music and video.
Speech or audio coding digitizes the analog sound and then reduces the number of bits required to represent it. The challenge is to use as few bits as possible while keeping the decoded sound as close to the original as possible. In general, a higher bit rate gives better quality and a lower bit rate gives inferior quality. The bit rate and the quality of the decoded sound depend on the coding algorithm and scheme chosen for a particular application. An understanding of the sound, the environment, and the channel conditions and limitations is important for making a correct choice of coding scheme.
Issues in Speech Coding
The basic requirement of a speech codec is to produce high-quality sound while using as little bandwidth as possible. The codec should require little processing power. The coding delay should be small. The codec should perform well under error-prone network conditions. Finally, interconnected networks, where multiple coders and decoders are chained together, should still offer acceptable speech quality.
Quality and Bandwidth
Quality is the foremost parameter against which all codecs are compared. The goal is to satisfy the generally contradictory requirements of a lower data rate and higher quality. Two important parameters for comparing codecs are therefore the bandwidth requirement and the quality. Quality is measured in terms of MOS, or Mean Opinion Score.
The Mean Opinion Score (MOS) is used to evaluate the performance of speech coders. To find the MOS of a coder, listeners are asked to classify the quality of the encoded speech into one of five categories: excellent (5), good (4), fair (3), poor (2) or bad (1). The numerical values assigned by all the listeners are averaged to produce the MOS rating. A MOS rating of 4.0 or higher indicates good quality. MOS is not a mathematical way of evaluating a speech codec: it is subjective to the listener, and running a MOS test for a given codec is time consuming. The results also vary from experiment to experiment. It is, however, still widely used as a measure of codec quality.
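The averaging step can be sketched in a few lines of Python. This is only an illustration: the function name `mean_opinion_score` and the listener panel are made up, and a real MOS test involves a controlled listening procedure that the code does not model.

```python
def mean_opinion_score(scores):
    """Average listener scores on the five-point scale (1 = worst, 5 = best)."""
    if not scores:
        raise ValueError("at least one listener score is required")
    for s in scores:
        if s not in (1, 2, 3, 4, 5):
            raise ValueError(f"score out of range: {s!r}")
    return sum(scores) / len(scores)


# Hypothetical panel of eight listeners (made-up data for illustration).
panel = [4, 5, 4, 3, 4, 5, 4, 3]
mos = mean_opinion_score(panel)
print(f"MOS = {mos:.2f}")
print("good quality" if mos >= 4.0 else "below the 4.0 threshold")
```

Because each score is a subjective judgement, repeating the test with a different panel would normally give a somewhat different average.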
Signal-to-noise ratio (SNR) is another way to express the quality of a codec.
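As a rough sketch of how such a figure could be computed, the snippet below treats the difference between the original and the decoded samples as noise and reports the ratio in decibels. The function name `snr_db` and the sample values are assumptions made for illustration; the article does not prescribe a particular SNR variant.

```python
import math


def snr_db(original, decoded):
    """Signal-to-noise ratio in dB, with (original - decoded) taken as the noise."""
    if len(original) != len(decoded):
        raise ValueError("signals must have the same number of samples")
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, decoded))
    if signal_power == 0:
        raise ValueError("original signal has zero power")
    if noise_power == 0:
        return math.inf  # decoded signal is identical to the original
    return 10 * math.log10(signal_power / noise_power)


# Made-up samples: the decoded signal is the original scaled by 0.9.
original = [1.0, -1.0, 1.0, -1.0]
decoded = [0.9, -0.9, 0.9, -0.9]
print(f"SNR = {snr_db(original, decoded):.1f} dB")
```

Unlike MOS, this number is cheap and repeatable to compute, but it does not necessarily track how listeners perceive the decoded speech.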
Codec Complexity
If two codecs achieve roughly the same MOS figure for a given rate, preference is given to the codec with lower computational complexity. The computational complexity of a codec is measured in MIPS (million instructions per second). The MIPS figure generally refers to the DSP rather than the CPU. A codec with a lower MIPS figure will generally be cheaper to implement, and a lower computational requirement reduces the burden on the processor. Other figures that can be important are the maximum and the average amount of memory required at run time. Coding schemes should therefore be implemented so that they need as little processing power and memory as possible.
Coding Delay
Another issue in the implementation of a speech codec is the delay encountered in coding a given sample of speech. The human ear will not detect an end-to-end delay of up to 150 ms. A delay of 400 ms or more is definitely annoying and hampers the ability to follow the speech smoothly. Delays between 150 ms and 400 ms are a grey area, in which a delay of as much as 250 ms is found acceptable in most cases.
The delay consists of three parts: coding delay, network delay and decoding delay. Speech codecs are compared for coding delays if the overall delay exceeds 150 ms.
ITU-T Recommendation G.114 provides an overview of the effect of delay on user satisfaction. According to it, a delay of up to about 200 ms keeps users in the very satisfied range. A delay of 400 ms or more makes many users dissatisfied.
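The delay budget described above can be sketched as a small calculation. The 150 ms and 400 ms thresholds are the ones quoted in this section; the function names, the category labels and the example figures are hypothetical.

```python
def total_delay_ms(coding_ms, network_ms, decoding_ms):
    """End-to-end delay as the sum of its three parts."""
    return coding_ms + network_ms + decoding_ms


def classify_delay(total_ms):
    """Classify an end-to-end delay using the thresholds quoted in the text."""
    if total_ms <= 150:
        return "not noticeable"
    if total_ms < 400:
        return "grey area"
    return "annoying"


# Made-up figures: 25 ms coding, 180 ms network, 10 ms decoding.
total = total_delay_ms(25, 180, 10)
print(total, "ms:", classify_delay(total))
```

In this made-up case the network contributes most of the delay, but the total exceeds 150 ms, so the coding delay becomes a relevant point of comparison between codecs.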
Asynchronous Tandem Connection
In an end-to-end conversation, the call may pass through a number of interconnected networks. It is sometimes necessary to decode the digital speech, perform digital-to-analog conversion, and re-encode the analog signal. The term asynchronous tandeming refers to an interconnection of networks in which coded speech has to be converted to an analog signal and then re-encoded.
Asynchronous tandem connection gives rise to two undesired effects. First, it degrades the audio quality because of the reconstruction and re-sampling. Second, it adds delay because of the decoding and re-encoding.
Error and Packet loss in Network
Unlike normal data transfer, packet retransmission is either not an option for speech or, if implemented, introduces further delay in the network. Packet loss for speech packets is a common phenomenon in IP-based networks. The success of a speech codec depends on how well it performs on an error-prone network where packets may be lost.
Speech Coding
A speech waveform f(t) can be represented as a function of time t. The waveform coding method collects a defined number of samples per second. Each sample is then digitized to represent the amplitude of the waveform.
The higher the sample rate, the more accurately the waveform is represented. Coding at 2 ksps will be inferior in quality to coding at 8 ksps. The 8 ksps sample rate used in speech coding comes from the Nyquist criterion, which states that the sampling rate must be at least twice the highest frequency in the signal. Telephone speech is band-limited to below 4 kHz, so 8 ksps is sufficient.
Pulse code modulation (PCM) is the simplest form of coding. In the speech domain, PCM consists of 8 kilosamples per second with each sample coded in 8 bits, giving a bit rate of 64 kbps.
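A minimal sketch of these steps (sampling a waveform, digitizing each sample, and working out the resulting bit rate) is shown below. It assumes uniform quantization of amplitudes in the range -1.0 to 1.0 for simplicity, whereas telephony PCM such as G.711 uses logarithmic companding. The function names and the 1 kHz test tone are made up for illustration.

```python
import math

SAMPLE_RATE_HZ = 8000   # 8 kilosamples per second
BITS_PER_SAMPLE = 8


def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Bit rate in bits per second of uncompressed PCM."""
    return sample_rate_hz * bits_per_sample


def quantize(sample, bits=BITS_PER_SAMPLE):
    """Uniformly map an amplitude in [-1.0, 1.0] to an integer code (assumption:
    uniform steps, not the companded steps used by real telephony PCM)."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))
    return int((clipped + 1.0) / 2.0 * (levels - 1) + 0.5)


def encode_tone(freq_hz, duration_s):
    """Sample a sine wave f(t) at SAMPLE_RATE_HZ and quantize every sample."""
    n = round(SAMPLE_RATE_HZ * duration_s)
    return [
        quantize(math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE_HZ))
        for i in range(n)
    ]


codes = encode_tone(1000, 0.01)   # 10 ms of a 1 kHz tone
print(len(codes), "samples")
print(pcm_bit_rate(SAMPLE_RATE_HZ, BITS_PER_SAMPLE) // 1000, "kbit/s")
```

The bit rate follows directly from the two parameters: 8000 samples per second times 8 bits per sample is 64,000 bits per second.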
Areas of Development and Refinement
There are many areas of speech coding that need refinement and further research. Methods used in one set of applications need to be tested and applied in other scenarios and situations. For example, the packet loss concealment developed for G.711 needs to be investigated and evaluated for other codecs as well.
Error Concealment
VoIP is gaining importance to the extent that it is about to overtake other forms of telephony and speech communication. Packet loss is a common phenomenon in cable-based networks and, to some extent, in DSL networks. Research into error concealment algorithms (G.711 concealment, Global IP Sound's iLBC codec) is a recent development. Redundancy algorithms can be used to overcome the poor quality of coded speech that results from packet loss. More development needs to be done, and this area promises scope for improvement in audio coding for VoIP.
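To give a feel for what concealment means, here is a deliberately simple sketch that fills each lost frame by repeating the last received frame at reduced gain. This is a generic illustration only, not the G.711 or iLBC concealment algorithm; the function name `conceal_losses`, the `attenuation` parameter and the frame data are all hypothetical.

```python
def conceal_losses(frames, frame_size, attenuation=0.5):
    """Replace lost frames (given as None) with a faded copy of the last good frame.

    Consecutive losses are attenuated further each time so that a long gap
    fades towards silence instead of repeating the same sound at full level.
    """
    output = []
    last_good = None
    gain = 1.0
    for frame in frames:
        if frame is not None:
            last_good = list(frame)
            gain = 1.0
            output.append(list(frame))
        elif last_good is None:
            # Nothing received yet, so the only option is silence.
            output.append([0.0] * frame_size)
        else:
            gain *= attenuation
            output.append([s * gain for s in last_good])
    return output


# Made-up stream of two-sample frames; the second and third packets are lost.
received = [[1.0, 2.0], None, None, [3.0, 4.0]]
print(conceal_losses(received, frame_size=2))
```

Practical schemes are considerably more elaborate, but the basic idea of synthesizing a plausible replacement from recently received speech is the same.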
Codec Performance Assessment
If a mathematical way could be found to assess the performance of codecs, it would be a great tool to replace human-oriented tools such as MOS. Modelling a mathematical codec assessor is a challenge. More challenging still is assessing codec performance under varying channel conditions, error conditions and packet loss conditions.
Network specific Speech codec design
The original codecs, e.g. G.711, G.726 and G.728, were designed for the PSTN. Modifications were made to these codecs to make them suitable for other networks, e.g. voice over Wi-Fi. As a result, features such as packet loss concealment and comfort noise were added to expand their capabilities. However, designing a codec with a particular network in mind is expected to provide better quality, a lower rate, lower delay and robustness to errors.
By Vikas Shukla
Vikas Shukla is currently working as Senior Design Engineer at BL Healthcare. He has a degree in Computer Science and Engineering from IT-BHU, Varanasi, India. Mr. Shukla has over 15 years of experience in the design of microprocessor-based systems. His expertise includes signal integrity and the architecture and design of remote patient monitoring systems. The views expressed are his own.