MediaTech Intelligence

Embracing Immersive Audio

Journal Article from Genelec

Wed 07, 04 2021

Aki Mäkivirta
R&D Director, Genelec

The popularity of OTT broadcasting is helping to drive the growth of immersive content, and this presents both opportunities and challenges for the broadcast audio world. Mixing in immersive allows the audio engineer to create a sense of envelopment and realism like never before, but as channel count and mix complexity increase, neutral, uncoloured studio monitoring with precise imaging becomes even more important.

At Genelec we’ve long been involved with designing monitoring systems that are scalable from stereo to surround to immersive, so here we’ll examine some of the principles of immersive audio, and some of the considerations that audio professionals need to be aware of.

The principles

Immersive audio formats not only surround the listener in the horizontal plane, they also envelop them in the height dimension. One way to describe the capability of an immersive playback system is by how many height layers it offers. Two-channel stereo and conventional surround formats offer only one height layer, located at the height of the listener’s ears, with all loudspeakers at an equal distance from the listener (in terms of acoustic delay) and playing back at the same level.

The different channel layouts for immersive formats serve several purposes. One target is to create envelopment, and a realistic sense of being inside an audio field. One height layer alone cannot create this sensation with sufficient realism, because a significant part of the listening experience is created by the sound arriving at the listener from above. So the extra height layers of a true immersive system provide this envelopment, and therefore add a significant dimension to the experience.

The second aim for immersive systems used with video is to be able to localise the apparent source of audio at any location across the picture. This is the reason why the 22.2 immersive format (pioneered by NHK in Japan) has three height layers, including the layer below the listener's ears. Since the UHDTV picture can be very large, extending from floor to ceiling, the audio system has to be able to localise audio across the whole area of the picture.

[Image: Genelec immersive room with a GLM 4 immersive setup]

The growth of immersive

With the demand for immersive content gaining momentum, several systems are competing for dominance in the world of 3D immersive audio recordings. The front-runners are now the cinema audio formats, which are trying to increase their presence in the audio-only area and to enter the television broadcast market too.

Whereas the cinema industry is always searching for the next ‘wow-effect’ to lure the audience from the comfort of their homes into theatres, the growth of immersive audio has been slightly slower in the world of television. But the pace is now really picking up, with several companies studying 3D immersive sound as a companion to ultra-high definition television formats, and the International Telecommunication Union (ITU) issuing recommendations about the sound formats to accompany UHDTV pictures. In preparation for the delayed Tokyo Olympic Games, NHK has already started to deliver 8K programming, with 22.2 audio.

How many layers?

We touched on this earlier, but modern immersive formats typically offer two or three height layers: current cinema formats offer two, while the emerging broadcasting formats have three or more.

One of the height layers is always at the height of the listener’s ears, and this typically creates a layout with backwards compatibility to both surround formats and basic stereo. Typically, other layers are above the listener and, as previously mentioned, layers can also be located below the listener, to enhance the sense of envelopment.

Certain encoding methods for broadcast applications can compress 3D immersive audio into a very compact data package for storage or transmission to the customer. These formats offer a very interesting advantage over many other immersive audio formats, since the channel count and the presentation channel orientations can be selected according to the playback venue or room. Essentially any number of height layers and any density of loudspeaker locations can be used - and furthermore, this density does not need to be constant.

Creating the feeds for loudspeakers dynamically from the transport format is called rendering. The compact audio transport package is decoded, and the feeds to all the loudspeakers are calculated in real time while the immersive audio is played back at the user’s location. This compact delivery format, plus the freedom to adjust and optimise the number and location of the playback loudspeakers, makes these flexible formats very exciting.
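As a rough illustration of what a renderer calculates, the sketch below turns a source direction into one gain per loudspeaker for whatever layout is present in the room. It uses pairwise amplitude panning in the horizontal plane (a two-dimensional form of vector base amplitude panning), which is one well-known technique and not necessarily what any particular immersive format uses. The function names, the angle convention and the example layout are assumptions made for illustration, and a real immersive renderer would also handle the height layers.

```python
import math

def unit(azimuth_deg):
    """Unit vector for an azimuth in degrees (assumed convention: 0 = straight ahead,
    positive angles counter-clockwise)."""
    a = math.radians(azimuth_deg)
    return (math.cos(a), math.sin(a))

def render_gains(source_az_deg, speaker_az_deg):
    """Return one gain per loudspeaker for a source at the given azimuth.

    Assumes an ear-level ring of loudspeakers in which neighbouring
    loudspeakers are less than 180 degrees apart.
    """
    # Walk around the ring so that each loudspeaker is paired with its neighbour.
    order = sorted(range(len(speaker_az_deg)), key=lambda i: speaker_az_deg[i] % 360)
    px, py = unit(source_az_deg)
    gains = [0.0] * len(speaker_az_deg)
    for k in range(len(order)):
        i, j = order[k], order[(k + 1) % len(order)]
        (ax, ay), (bx, by) = unit(speaker_az_deg[i]), unit(speaker_az_deg[j])
        det = ax * by - ay * bx
        if abs(det) < 1e-9:
            continue  # the two loudspeakers are collinear, so they cannot form a pair
        # Solve source = gi * speaker_i + gj * speaker_j for the two gains.
        gi = (px * by - py * bx) / det
        gj = (ax * py - ay * px) / det
        if gi >= -1e-9 and gj >= -1e-9:
            # The source lies between this pair; normalise to constant power.
            norm = math.hypot(gi, gj)
            gains[i], gains[j] = max(gi, 0.0) / norm, max(gj, 0.0) / norm
            return gains
    raise ValueError("source direction is not covered by the layout")

# Hypothetical five-loudspeaker ear-level layout, source slightly left of centre.
layout = [30.0, -30.0, 0.0, 110.0, -110.0]
gains = render_gains(15.0, layout)
```

Because the gains are computed from the loudspeaker angles at playback time, the same source direction can be rendered to a different layout simply by passing a different list of angles.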

Common assumptions

Popular immersive audio playback systems typically share two assumptions about the loudspeaker layout and one assumption about the loudspeaker characteristics. Concerning layout, it is assumed that the same level of sound will be delivered to the listening location from all loudspeakers, and the time taken for the audio to travel from each loudspeaker to the listener will also be the same. If each loudspeaker in the system has similar internal audio delay, then this can be achieved by positioning each loudspeaker at an equal distance from the listening position. Otherwise, electronic adjustments of the level and delay are required to align the system.
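The electronic adjustment described above can be sketched as a small calculation: delay the closer loudspeakers so that their sound arrives together with that of the farthest one, and attenuate the louder loudspeakers to match the quietest. This is a minimal sketch, assuming that all loudspeakers have the same internal delay and that the distances and levels have already been measured; the function name and the example values are made up for illustration and do not describe how any particular calibration software works.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at about 20 degrees C

def alignment(distances_m, levels_db):
    """Return {name: (delay_ms, trim_db)} that equalises arrival time and level.

    distances_m: loudspeaker name -> distance to the listening position in metres
    levels_db:   loudspeaker name -> measured level at the listening position in dB SPL
    """
    farthest = max(distances_m.values())
    quietest = min(levels_db.values())
    result = {}
    for name, distance in distances_m.items():
        # Closer loudspeakers are delayed to match the time of flight of the farthest one.
        delay_ms = (farthest - distance) / SPEED_OF_SOUND_M_S * 1000.0
        # Louder loudspeakers are turned down to the quietest one (trim is never positive).
        trim_db = quietest - levels_db[name]
        result[name] = (round(delay_ms, 2), round(trim_db, 1))
    return result

# Hypothetical measurements for four loudspeakers.
example = alignment(
    {"L": 2.0, "R": 2.0, "C": 1.8, "TopL": 2.4},
    {"L": 79.0, "R": 79.5, "C": 80.2, "TopL": 78.0},
)
```

In this example the top loudspeaker is both the farthest and the quietest, so it needs no correction, while the centre loudspeaker, being closest and loudest, receives the largest delay and attenuation.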

Concerning loudspeaker characteristics, a fundamental assumption is that all the loudspeakers in the playback system have a similar frequency response. Sometimes this is taken to mean that all the loudspeakers in the system should be of the same make and model. In reality, loudspeaker sound is affected by the acoustics of the room in many ways. This can significantly change the character of the audio signal, so that even when the same make and model of loudspeaker is used throughout the system, the individual locations of the loudspeakers will change the audio in a way that makes the performance of each loudspeaker slightly different.

Getting aligned

To turn these assumptions into reality, Genelec has created a comprehensive range of Smart Active Monitors that integrate tightly with our own GLM (Genelec Loudspeaker Manager) software. This allows the creation of immersive systems of more than 80 monitors and subwoofers, making the combination compatible with all existing audio playback formats.

GLM 4, the newest version of GLM, takes care of the essentials of calibrating an immersive audio playback system, providing systematic and controlled monitoring. This includes the alignment of levels and time of flight at the listening location, subwoofer integration, and compensation for the acoustical effects of loudspeaker placement. This ensures that all the loudspeakers in the system deliver a consistent and neutral sound character.

For the audio engineer, this will improve both the quality of the production and the speed of the working process, allowing them to produce reliable mixes that will translate consistently to any playback medium. Additionally, one of the key requirements for immersive monitoring is to accurately maintain a standard playback level to the listener, in line with new recommendations about maintaining loudness in broadcast signals - including a definition of the SPL at the listening location. Happily, GLM’s powerful monitor control features make this a simple and repeatable process.

Using headphones

We’d always recommend in-room loudspeaker monitoring as the best method for evaluating an immersive mix, since our head, outer ear shapes and head movements provide us with a wonderful ability to localise sound sources. However, good headphones are also a useful complementary tool – particularly for mobile audio professionals working remotely in ad-hoc environments. The drawback is that headphones break the link to these natural mechanisms that we have acquired over our lifetime for localising sound. This causes sound to appear ‘inside’ our head when presented over headphones, rather than appearing all around us.

Fortunately, Genelec has a solution for this challenge too – in the form of our Aural ID technology. Aural ID contains all the information about the user’s personal sound localisation. When we create the Aural ID for the user, we compute how their head, external ear and upper body affect and colour audio arriving from any given direction. This effect is described by the Head-Related Transfer Function (HRTF), and is unique to every user. Aural ID builds a computer model of the acoustics of the head and upper torso, based on data extracted from a simple 360 degree smartphone video showing the user from all directions. The user’s individual HRTF is then delivered as a SOFA-format file, which can be integrated into the audio workstation’s signal processing for the headphone output. This makes the immersive headphone listening experience much more truthful and reliable, with a far more natural sense of space and direction.
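To give a feel for what applying an HRTF in the headphone signal path means, the sketch below filters a mono source with a left-ear and a right-ear impulse response (the time-domain form of an HRTF for one direction) to produce the two headphone signals. It is a minimal sketch only: the impulse responses are made-up numbers rather than measured data, reading them from a SOFA file is not shown, and the function names are assumptions made for illustration.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two lists of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(impulse_response):
            out[n + k] += x * h
    return out

def binauralise(mono, hrir_left, hrir_right):
    """Return (left, right) headphone signals for a mono source in one direction."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Made-up impulse responses for a source on the listener's left:
# the right ear receives the sound later and quieter than the left ear.
hrir_left = [1.0, 0.3]
hrir_right = [0.0, 0.0, 0.5, 0.2]

left, right = binauralise([1.0, 0.5], hrir_left, hrir_right)
```

The arrival-time and level differences between the two outputs, together with the spectral colouration carried by real measured responses, are the cues that allow a listener to hear the source outside the head rather than inside it.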

This is a subject that Genelec is researching intensively, so stay tuned for more developments from us in this area.

Need help?

So, whether you’re a Genelec user or not, if you need help and advice on any aspect of immersive audio, our free helpdesk is ready to guide you through the principles, technologies and practicalities involved in handling immersive audio content. Staffed by our team of global experts, it can advise on room layout, acoustics, loudspeaker choice and placement, dimensioning, room calibration, playback standards and the other equipment choices you may find useful in the immersive recording and mixing process.

For personal advice, feel free to contact us at immersive.helpdesk@genelec.com, and for a wealth of useful general information on immersive audio, please download our Immersive Solutions Guide.