IP Audio Delay

Revision as of 05:30, 12 April 2010 by Jrietschel (talk | contribs)

Whenever you transport audio over an IP network, you will incur a delay - that is true for every vendor and every technology, and it is unavoidable. Why? This page explains the general concept of how Audio over IP works, and specifically the reasons for this inherent delay.

Concept of Audio over IP

To transmit analog audio over an IP network, the signal must be sampled (a measurement taken) at the sample rate. These samples can then be handled in the digital domain and transferred over the network to the decoder, which ultimately converts the samples back to an analog voltage on the audio output, clocked at the same sample frequency. The original signal as it was present at the input is reproduced at the output.

Sending every sample over the network immediately

In theory, every sample could be sent immediately as its own block over the IP network with minimal delay, but doing so generates a huge amount of traffic (48,000 blocks per second at a 48 kHz sample rate) and bandwidth requirement: a minimum Ethernet frame is 60 bytes, so you would actually use more than 23 Mbit/s!
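The bandwidth figure above can be checked with a quick back-of-the-envelope calculation (assuming one 16-bit mono sample per minimum-size Ethernet frame, as in the text):

```python
# One minimum-size Ethernet frame per sample at 48 kHz.
SAMPLE_RATE = 48_000      # samples per second
MIN_FRAME_BYTES = 60      # minimum Ethernet frame (without FCS)

frames_per_second = SAMPLE_RATE                        # one frame per sample
bits_per_second = frames_per_second * MIN_FRAME_BYTES * 8

print(frames_per_second)        # 48000 frames/s
print(bits_per_second / 1e6)    # 23.04 Mbit/s
```

Note that the real on-wire load is even higher, since the preamble and inter-frame gap are not counted here.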

EtherSound actually does this - all channel samples for a given point in time (one per channel) travel together in one frame on the network, one frame per sample set, generating a practically constant network load of 100 Mbit/s (not over IP, but directly over Ethernet).

For standard applications, where the audio stream needs to coexist on the network with other services, something must be done to generate less network load.

Collecting Samples, sending Packets

Let us assume the network load, expressed in packets per second, needs to be limited to no more than 100 packets per second. To achieve that, the device needs to collect 10 ms worth of samples before sending them out together in one IP block.

With a 48 kHz sample frequency, that means 480 samples (960 bytes for one 16-bit channel) have to be collected before the block can be sent. That makes perfect sense, as one Ethernet block can carry up to about 1400 bytes of data.
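The packetization numbers follow directly from the packet rate (a short sketch, assuming 16-bit mono PCM as in the text):

```python
# Packetization delay for PCM at a limited packet rate.
SAMPLE_RATE = 48_000        # Hz
PACKETS_PER_SECOND = 100
BYTES_PER_SAMPLE = 2        # 16-bit PCM, one channel

samples_per_packet = SAMPLE_RATE // PACKETS_PER_SECOND   # 480 samples
payload_bytes = samples_per_packet * BYTES_PER_SAMPLE    # 960 bytes
delay_ms = 1000 / PACKETS_PER_SECOND                     # 10.0 ms

print(samples_per_packet, payload_bytes, delay_ms)
```

The first sample collected waits the full packet interval before it leaves the encoder, so the packet rate directly sets this first delay component.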

Receiving and Decoding

At the receiving end (the "decoder"), a constant stream of samples at the sample frequency must be generated. To do this, the decoder always needs samples "in storage" it can use; if it ever runs out of samples while generating the stream, an "underrun" condition exists, which causes dropouts and other issues. Consequently, a buffer of samples must be maintained, large enough that it is always replenished before the D/A runs out of data.

Delays introduced in the chain

As already discussed above, to limit the bandwidth used to a reasonable level, samples must be collected and sent as blocks to the destination via the network.


Audio must first be sampled into the digital domain at the encoder. As explained above, sending every sample immediately as its own block would create an enormous amount of traffic (48,000 blocks per second at a 48 kHz sample rate, more than 23 Mbit/s with minimum 60-byte Ethernet frames), so the encoder needs to packetize the data and send multiple samples in one block. Here is your first source of delay: if you want to limit the network load to 100 packets per second, you need to collect 10 ms worth of samples before sending them over the network. With a 48 kHz sample frequency, that means 480 samples (960 bytes) for one channel, which fits comfortably in an Ethernet block of up to about 1400 bytes. Your first sample taken will leave with a 10 ms delay.

If, instead of PCM, you use MP3, the delay is much larger, because MP3 encoding is done on "frames": somewhere between 20 ms and 40 ms of audio is first sampled, then analyzed and encoded. The encoding process is resource intensive and time consuming, but on average it must finish within the same time range, otherwise work backs up. In the MP3 case, expect a "transmit" delay to the network of about twice the frame duration, let's say 60 ms.
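As a rough check of the MP3 numbers above: an MPEG-1 Layer III frame carries 1152 samples, so the frame duration (and with it the encoder-side delay) depends on the sample rate. A minimal sketch:

```python
# MPEG-1 Layer III frames carry 1152 samples each, so frame duration
# depends on the sample rate. The text's rule of thumb is a transmit
# delay of roughly two frame durations.
MP3_FRAME_SAMPLES = 1152

def frame_ms(sample_rate_hz: int) -> float:
    """Duration of one MP3 frame in milliseconds."""
    return 1000 * MP3_FRAME_SAMPLES / sample_rate_hz

print(frame_ms(48_000))   # 24.0 ms per frame -> ~48 ms transmit delay
print(frame_ms(32_000))   # 36.0 ms per frame -> ~72 ms transmit delay
```

This matches the 20-40 ms frame range quoted above for common broadcast sample rates.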

Now that a block is going to the network, some delay has already been introduced (let's stick with the example: 10 ms for PCM and 60 ms for MP3). The block now needs to be sent over the network, potentially competing with other traffic for bandwidth. On a local LAN, the delay is typically quite low (a few milliseconds), but beware: if there is occasionally contention for bandwidth or buffering in a switch or router, you may see very low average delays while PEAK delays are substantially higher. Why is that a problem? The receiving side must always be fed with samples before its buffers run empty. If a block is delayed by, say, 30 ms, the receiver must have its buffers configured so that it can survive that delay before running into an error condition (empty buffer). The difference between the minimum and maximum delay of a network block arriving at the decoder is commonly called "jitter". Jitter can be significant especially with WiFi networks, where invisible retries happen in a lower-level protocol - you might see (the ping command shows all this) an average delay of only 5 ms and zero block loss, but a maximum delay of 200 ms. Any device with a receive buffer configured for less than 200 ms will incur dropouts. Period.
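The key point - size the buffer for the worst case, not the average - can be sketched with a simple calculation over assumed ping-style delay measurements (hypothetical values mirroring the WiFi example above):

```python
# The receive buffer must cover the delay spread (jitter), i.e. the
# difference between the fastest and slowest packet, not the average.
def required_buffer_ms(delays_ms):
    """Minimum buffer depth to survive the observed jitter."""
    return max(delays_ms) - min(delays_ms)

# Assumed WiFi measurements: ~5 ms average, but one 200 ms outlier.
wifi_delays = [4, 5, 6, 5, 200, 5]
print(required_buffer_ms(wifi_delays))   # 196 ms -> buffer ~200 ms needed
```

A buffer sized from the average delay alone (5 ms here) would underrun the moment that one late packet arrives.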

Now let's take a closer look at the decoding side. As stated before, to maintain a constant, consistent stream of samples (which are then converted to analog values and sent to the audio interface), buffering is a must. The buffers must be configured to hold as much data as necessary to survive the longest possible "dry period" during which no block arrives from the source, for whatever reason. If (as in the example above) the source sends a block every 10 ms with very precise timing, and zero jitter is introduced by the network (an unlikely scenario), in theory a buffer of one frame is sufficient: when the block arrives from the network, it is copied into the buffer (holding 480 samples, i.e. 10 ms in our example) and the output can start converting to analog. The buffer then drains, but just as it approaches "empty", the next block arrives from our perfect encoder through the perfect network infrastructure. That is, unfortunately, not a real-life scenario; jitter is always introduced somewhere along the way, so a realistic setup uses a buffer holding several blocks. And for MP3, you have to add another source of delay at the decoder side: once a frame is received, it cannot be output immediately, but first needs to be decoded, which is resource intensive and can take several milliseconds.

The last addition of delay is not strictly necessary technology-wise, but is a fact in Barix devices. We use a main CPU for network tasks and a DSP driving the D/A, which turns the samples back into analog audio. The DSP is necessary for MP3 (and AAC etc.) decoding; for PCM it mainly works as a pass-through. The interface between the main CPU and the DSP also introduces a buffer on the DSP side, which is counted in bytes and can introduce quite hefty delays for low-bitrate/low-sample-rate streams. Why? The buffer is counted in bytes, as I said; let's assume for these examples it is 2 kBytes (2000 bytes). A 48 kHz PCM stream moves 96 bytes per ms, so that is roughly 20 ms - but if you send an 8 kHz PCM stream, you get six times that delay.
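The byte-counted buffer behavior is easy to verify numerically (a sketch with the assumed 2000-byte buffer and 16-bit mono PCM from the example):

```python
# Pass-through delay of a buffer counted in bytes: the slower the
# stream fills it, the longer the delay.
BUFFER_BYTES = 2000         # assumed DSP-side buffer size
BYTES_PER_SAMPLE = 2        # 16-bit PCM, one channel

def buffer_delay_ms(sample_rate_hz: int) -> float:
    bytes_per_ms = sample_rate_hz * BYTES_PER_SAMPLE / 1000
    return BUFFER_BYTES / bytes_per_ms

print(buffer_delay_ms(48_000))   # ~20.8 ms
print(buffer_delay_ms(8_000))    # 125.0 ms, six times the 48 kHz delay
```

This is why, counterintuitively, a higher sample rate can reduce end-to-end delay through fixed byte-sized buffers.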

In the end, you have several sources introducing delay, with the buffering for network jitter often being the most significant one, but a "base delay", depending on sample rate, encoding format etc., is always present. As you can see from the examples above, if you have the bandwidth, it often makes sense to configure higher sample rates and bitrates, as that effectively lowers the delay: the constant (byte-wise) buffers in the chain drain faster, so their pass-through delay is smaller.
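Putting the pieces together, a rough end-to-end budget for the PCM example might look like this (illustrative numbers from the discussion above, not measurements; the 30 ms jitter allowance is an assumption):

```python
# Rough end-to-end delay budget for 48 kHz PCM, summing the sources
# discussed above. All values are illustrative assumptions.
budget_pcm_ms = {
    "packetization (10 ms blocks)": 10.0,
    "network + jitter buffer":      30.0,   # assumed jitter allowance
    "DSP byte buffer @ 48 kHz":     20.8,
}
total_ms = sum(budget_pcm_ms.values())
print(total_ms)   # ~60.8 ms end-to-end
```

An MP3 stream would replace the 10 ms packetization term with the ~60 ms encoder delay and add several ms of decode time on top.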

One last comment, now specific to Barix devices: the Exstreamer 1xx and 2xx decoder devices use a DSP with ample buffering. There are reasons for that, not to be detailed further here. In contrast, the Exstreamer 1000 and the "to be announced Monday at NAB" device (hint hint!), as well as all Annuncicom devices, use a different DSP with smaller buffers, and we are currently beta testing a DSP software patch which reduces the buffers much further, to almost nothing. With the Exstreamer 1000 and the new device, you can currently achieve delays below 50 ms - the software has not yet been optimized for very low delay. With optimized software, however, the delay can be brought down to well below 20 ms - that has been proven in our labs (for a specific project). We are in the process of bringing this down further; obviously, that can only be done by sending many more blocks over the network (for example, one per ms - 1000 blocks per second... ask your WiFi router what it thinks about that), so it needs optimized software.

Everyone doing IP codecs is bound by the constraints explained above. If you can show me a device from any manufacturer which routes audio over IP, works over a network with 50 ms jitter, sends less than 100 packets per second and introduces an end-to-end delay from input to output of less than 10 ms, that manufacturer has defied physics, and I will be happy to buy you a beer and go work for them :)



Latency measurements

The latencies of both Barix decoder devices were measured in the test described below.

Test environment

The latency was measured with the following equipment:

  • Digigram ES220 as an input device
  • another ES220 as a reference output device
  • Exstreamer 100 as an output device (VLSI based device)
  • Annuncicom 100 as an output device (Micronas based device)
  • HP ProCurve 1700-8 switch to which all devices were connected

Both Barix devices were loaded with the version 00.03 of the Ethersound decoder module.

The input of the ES220 was fed with a 500 Hz tone from a signal generator. The input, both Barix output devices, and the reference ES220 device were monitored with an oscilloscope.

A trigger was set to capture the waveform after switching on the signal generator.

Test results

The three diagrams below show the results of the measurement. The top line (channel 4, green) is the clock source. Channels 1 (yellow), 2 (blue) and 3 (violet) are the VLSI device (Exstreamer 100), the Micronas device (Annuncicom 100) and the reference ES220, respectively.

The first diagram shows the 57.2 ms latency of the VLSI decoder (Exstreamer 100).


The second screenshot shows the end-to-end latency using the Micronas-based device (Annuncicom 100). The latency is significantly lower - only 6 ms.

The last picture shows the end-to-end delay using the reference ES220 device. The latency is 1.44 ms.