Immersive Audio: Spatial Sound Objects and Acoustic Augmentation

1. What is Immersive Audio?

Immersive audio is an umbrella term that is widely used, but rarely defined consistently. In practice, it overlaps with terms such as spatial audio, surround sound, 3D sound, voice-lift, soundscapes and electronic acoustic enhancement. Even within the professional audio industry, these terms are often used to describe different technologies, design philosophies or product features.

As a result, clients, consultants and system integrators are frequently confronted with unclear expectations. Similar terminology is used for fundamentally different technical approaches, while comparable systems are marketed under entirely different names. This makes it difficult to assess capabilities, understand design trade-offs, or determine which solution is appropriate for a specific venue or application.

To resolve this ambiguity, immersive audio must be described within a neutral and internationally accepted framework. Such a framework must accommodate both spatial sound objects and the acoustic environment in which those objects exist. Only then can immersive audio systems be specified, compared and designed in a transparent and meaningful way.

This framework exists in the form of the ISO 12913 Soundscape standard, which provides the conceptual foundation for understanding immersive audio as a combination of sound sources, space and perception.

2. The Soundscape Framework (ISO 12913)

The concept of a soundscape originates from the work of composer and researcher R. Murray Schafer in the 1960s. He introduced the idea that every environment has its own acoustic identity, shaped by human activity, natural sounds and the built environment. In this context, sound is not merely a signal, but an integral part of how a space is experienced.

Schafer classified the audible elements of a soundscape into three categories:

  • Keynote sounds
    Continuous or regularly occurring background sounds that define the character of a place, such as distant traffic or wind through trees.

  • Signals
    Foreground sounds intended to convey information or prompt action, such as bells, alarms or warning sounds.

  • Soundmarks
    Locally distinctive sounds that function as acoustic landmarks, helping to identify a specific place.

See Figure 1 below for the concept of a Soundscape.

Soundscape components: keynote sounds, signals and soundmarks in an acoustic environment.

Figure 1: Concept of a Soundscape

Recognizing the need for a standardized definition, the International Organization for Standardization introduced ISO 12913-1:2014, which formally defines a soundscape as consisting of two fundamental elements:

  • Sound sources
    Sounds generated by natural processes or human activity.

  • Acoustic environment
    The sound at the listener’s position, shaped by reflections, absorption and spatial characteristics of the environment.

The ISO 12913 soundscape framework: sound sources combined with the acoustic environment.

Figure 2: ISO 12913 Soundscape Framework

This definition directly aligns with the core components of immersive audio systems. Immersive audio is not only about placing sounds in three-dimensional space, but also about shaping the acoustic environment in which those sounds are perceived.

From this perspective, immersive audio can be defined as:

The controlled introduction of artificial sound sources and acoustic augmentation into a natural soundscape.

This definition is intentionally broad. It accommodates spatial audio objects, active and virtual acoustics, hybrid systems and voice-lift solutions within a single coherent framework. It also makes clear that immersive audio is not a single technology, but a system-level approach that integrates sound sources, space and perception into one controlled acoustic experience.

3. Sound Sources in Immersive Audio

Within the soundscape framework, immersive audio systems do not replace natural sound sources. Instead, they introduce artificial sound sources into the existing acoustic environment. These artificial sources are commonly referred to as spatial audio objects.

Spatial audio objects are audio signals that are positioned and rendered within a three-dimensional space. Depending on the application, they can represent musical instruments, voices, sound effects or environmental sounds, and they may be perceived as stationary or moving relative to the listener.

Spatial audio objects can be categorized as:

  • Static objects
    Fixed in position relative to the audience or venue, such as an off-stage sound effect or an overhead ambience source.

  • Moving objects
    Objects that follow a defined trajectory through the three-dimensional sound field, for example a performer moving across the stage or a simulated fly-over effect.

To render these objects convincingly, immersive audio systems rely on dedicated spatial rendering techniques. In practice, two primary approaches are used: surround sound (3-D sound) and Wave Field Synthesis (WFS).

Immersive audio overview combining spatial audio objects with acoustic environment augmentation.

Figure 3: A Spatial Audio System and the Acoustic Environment

3.1 Surround-Based Spatial Rendering

Surround-based rendering determines the X-Y-Z location of a sound object using:

• level panning (Vector Base Amplitude Panning, VBAP) to create phantom images
• level + delay panning to improve low-frequency localisation
• dense horizontal and vertical loudspeaker coverage to maintain accuracy

Accurate localisation requires multiple horizontal rows and overhead loudspeakers, particularly in smaller rooms.

In live applications, spatial object movement can be controlled using performer tracking systems, such as optical, infrared or RF-based tracking. This allows sound objects to follow performers, stage action and lighting cues in real time.
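The level-panning step can be illustrated with the textbook two-dimensional VBAP formulation. This is a generic sketch under stated assumptions, not any vendor's implementation, and the loudspeaker angles are hypothetical:

```python
import numpy as np

def vbap_pair_gains(source_deg, spk1_deg, spk2_deg):
    """Pairwise 2-D VBAP: solve L g = p, where the columns of L are the
    loudspeaker direction vectors and p is the source direction. Assumes
    the source lies between the pair (otherwise one gain goes negative)."""
    def unit(deg):
        rad = np.radians(deg)
        return np.array([np.cos(rad), np.sin(rad)])
    L = np.column_stack([unit(spk1_deg), unit(spk2_deg)])
    p = unit(source_deg)
    g = np.linalg.solve(L, p)
    return g / np.linalg.norm(g)   # power normalisation: g1^2 + g2^2 = 1

# A source midway between loudspeakers at +30 and -30 degrees pans equally:
g = vbap_pair_gains(0.0, 30.0, -30.0)
```

The power normalisation in the last step keeps perceived loudness constant as the phantom image moves between the pair.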

Surround-based spatial rendering with loudspeaker arrays around audience for 3D sound.

Figure 4: Surround sound

3.2 Wave Field Synthesis (WFS)

Wave Field Synthesis takes a fundamentally different approach to spatial audio reproduction. Instead of creating phantom images through level and delay differences, WFS aims to reconstruct the physical sound wavefronts that a real sound source would produce.

In a WFS system, a large number of closely spaced loudspeakers are arranged in linear or circular arrays around the listening area. Each loudspeaker reproduces a small portion of the sound signal so that, collectively, the emitted wavefront matches that of an actual acoustic source.

The result is a virtual sound source that remains perceptually stable across a wide listening area and for multiple listeners simultaneously.

Wave Field Synthesis reconstructing virtual sound source wavefronts using dense loudspeaker arrays.

Figure 5: WFS

How WFS differs from conventional panning

In a conventional left-right loudspeaker system, level panning can create the illusion of a sound moving between two speakers. However, this illusion only holds for listeners positioned near the center line. Outside this narrow sweet spot, localisation becomes unstable.

Wave Field Synthesis addresses this limitation by reconstructing wavefronts rather than relying on psychoacoustic illusions. This allows virtual sources to maintain their perceived position even as listeners move through the space.

Key benefits of WFS

  • A large usable listening area with consistent spatial imaging

  • Stable localisation for multiple listeners

  • Smooth and continuous motion of moving sound sources

  • A highly realistic and immersive spatial experience

Practical limitations

Despite its advantages, WFS also introduces technical challenges:

  • It requires a large number of closely spaced loudspeakers

  • Loudspeaker spacing limits the usable upper frequency before spatial aliasing occurs

  • Room reflections and reverberation can reduce spatial accuracy

  • Precise loudspeaker placement and calibration are critical

As a practical reference, loudspeaker spacing determines the upper frequency limit for accurate wavefront reconstruction:

  • Approximately 1.7 kHz with 10 cm spacing

  • Approximately 3.4 kHz with 5 cm spacing
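These reference values follow directly from the spatial-aliasing relation f_alias = c / (2d), with c ≈ 343 m/s; a minimal sketch:

```python
def wfs_aliasing_frequency(spacing_m, c=343.0):
    """Upper frequency limit for accurate WFS wavefront reconstruction.
    Above f_alias = c / (2 * d) the array spatially aliases."""
    return c / (2.0 * spacing_m)

# Reproduces the reference values above:
f_10cm = wfs_aliasing_frequency(0.10)  # ~1715 Hz (approx. 1.7 kHz)
f_5cm  = wfs_aliasing_frequency(0.05)  # ~3430 Hz (approx. 3.4 kHz)
```

Halving the spacing doubles the usable bandwidth, which is why dense arrays are so critical to full-range WFS.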

Hybrid WFS systems

To overcome these limitations, many high-performance immersive systems use a hybrid approach that combines WFS with level and delay panning techniques.

A typical hybrid architecture includes:

  • Dense WFS loudspeaker arrays for low and mid frequencies

  • Near-field, overhead or clustered loudspeakers for high-frequency detail

  • A real-time DSP engine capable of WFS, VBAP and delay-based rendering

A common hybrid rendering strategy is:

  • Global scene elements such as ambience and distant sources are rendered using WFS

  • Local and high-frequency detail such as speech, soloists and transients are rendered using VBAP and delay panning

  • Moving sources use WFS for smooth low and mid-frequency motion, with additional high-frequency cues from local loudspeakers

  • Frequency-dependent routing, typically:

    • Below approximately 800–1500 Hz: WFS dominant

    • Above approximately 1–2 kHz: panning dominant

    • Overlapping bands are smoothly crossfaded to avoid timbral artifacts

This hybrid strategy combines the spatial stability of WFS with the precision and clarity of conventional panning, making it suitable for complex, large-scale immersive audio installations.
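The frequency-dependent routing above can be sketched as complementary crossfade weights between the two renderers. This is an illustrative model only: the 800 Hz and 2 kHz crossover points are taken from the ranges quoted above, and real systems use proper crossover filters rather than static per-frequency weights:

```python
import numpy as np

def renderer_weights(freq_hz, lo=800.0, hi=2000.0):
    """Frequency-dependent routing: WFS below `lo`, panning above `hi`,
    with a smooth (raised-cosine) crossfade in between. The two weights
    are complementary, so the renderers sum to unity at every frequency."""
    f = np.asarray(freq_hz, dtype=float)
    x = np.clip((f - lo) / (hi - lo), 0.0, 1.0)  # 0 in WFS band, 1 in panning band
    w_pan = 0.5 - 0.5 * np.cos(np.pi * x)        # raised-cosine ramp
    return 1.0 - w_pan, w_pan                    # (w_wfs, w_pan)

w_wfs, w_pan = renderer_weights([100.0, 1400.0, 5000.0])
# 100 Hz -> fully WFS, 5 kHz -> fully panning, 1.4 kHz -> blended
```

The smooth overlap region is what avoids the timbral artifacts that a hard band split would introduce.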

3.3 Hybrid Spatial Rendering

In practical immersive audio installations, a single spatial rendering technique is rarely sufficient. To achieve both wide-area stability and precise localisation, systems often employ a hybrid spatial rendering strategy.

This approach typically combines:

  • Wave Field Synthesis (WFS) for low and mid frequencies, providing stable wavefront reconstruction and consistent spatial imaging across a large listening area

  • VBAP and delay-based panning for higher frequencies, delivering accurate localisation, sharp transients and detailed spatial cues close to the listener

By distributing frequency ranges across different rendering methods, hybrid spatial rendering maintains consistent timbre, avoids spatial artefacts and ensures that both stationary and moving sound objects remain perceptually coherent throughout the venue.

Hybrid spatial rendering is therefore considered best practice in contemporary immersive audio design, as it balances perceptual stability, localisation accuracy and system scalability within a single rendering architecture.

4. The Acoustic Environment

Immersive audio does not exist in isolation. The perception of spatial audio objects is fundamentally influenced by the acoustic environment in which they are reproduced. Even the most advanced spatial rendering will fail to convince if the surrounding acoustic field does not support it.

In practice, immersive audio systems are often used to augment or transform a venue’s natural acoustics. Two typical scenarios can be distinguished:

  • Quiet acoustic environments
    Spaces with minimal natural reverberation, such as modern cinemas, black box theatres or multi-purpose halls. Electronic acoustic augmentation can transform these spaces into venues suitable for speech, amplified music or orchestral performance by introducing controllable reverberation.

  • Application-specific acoustic environments
    Spaces designed for a primary function, such as concert halls optimized for symphonic music. While acoustically successful for their main purpose, these venues often require flexibility to accommodate a wider range of performances. Traditionally this was achieved through architectural variable acoustics. Today, active acoustic systems provide a more flexible and cost-effective alternative.

In both cases, immersive audio extends beyond spatial sound objects and becomes a tool for shaping the acoustic identity of the space itself.

4.1 Virtual Acoustics and Active Acoustics

Electronic reverberation enhancement for live environments generally follows one of two approaches: Virtual Acoustics or Active Acoustics. While both aim to increase acoustic flexibility, their underlying principles and perceptual results differ significantly.

Virtual Acoustics

Virtual Acoustics is based on capturing direct sound source signals, either via close microphones or line-level inputs, before significant interaction with the room. These dry signals are processed through DSP using pre-recorded or synthesized acoustic responses, and subsequently distributed through the venue’s loudspeaker system.

In essence, Virtual Acoustics imposes the acoustic signature of a known space, often a renowned concert hall, onto the current venue.

This approach is particularly effective in environments where natural acoustics are minimal, such as:

• Outdoor performance spaces
• Rehearsal rooms
• Very dry or acoustically neutral venues

Advantages:
• Access to a known and predictable acoustic character
• Fast setup and tuning
• Consistent results across different locations

Limitations:
• Potential conflict between the imposed acoustic signature and the venue’s natural acoustics in larger or more reflective spaces
• The audience is not part of the enhancement process, which can result in unnatural perception of applause and audience response

In practice, Virtual Acoustics prioritizes predictability and repeatability, while Active Acoustics prioritizes authenticity by working with the venue’s native acoustic identity.

Figure 6: An Inline Virtual Acoustics System and the Acoustic Environment

Active Acoustics

Active Acoustics takes a fundamentally different approach. Instead of imposing an external acoustic signature, it augments the existing acoustic environment.

Distributed microphones capture sound within the space, including direct sound from performers and audience, as well as early and late reflections. This captured sound is processed and redistributed through loudspeakers to shape the perceived reverberation and spatial impression of the venue.

Active Acoustics therefore works with the venue’s native acoustic identity, extending or modifying it rather than replacing it.

In summary, both Virtual and Active Acoustics enable venues to host a wider range of events without permanent architectural changes. The choice between them depends on the acoustic characteristics of the space, the desired level of authenticity and, of course, the available budget.

4.2 Types of Active Acoustics

Active acoustic systems are not a single technology but a family of approaches that differ in signal flow, control strategy and perceptual outcome. In practice, three main categories can be distinguished: regenerative systems, inline systems and hybrid systems. Each approach offers specific advantages and limitations, depending on the acoustic characteristics of the venue and the intended use.

4.2.1 Regenerative (Non-Inline) Active Acoustic Systems

Regenerative active acoustic systems reinforce a venue’s existing acoustic identity by creating a controlled electro-acoustic feedback loop between microphones and loudspeakers.

How it works

Capture:
Distributed microphones capture direct sound from performers and audience, as well as early and late reflections from the room.

Processing:
The DSP stabilizes the microphone-loudspeaker loop to prevent acoustic feedback. Rather than introducing synthetic reverberation, the system primarily reinforces the room’s natural acoustic energy using techniques such as equalization, gain control and feedback suppression.

Redistribution:
The reinforced acoustic energy is redistributed through the loudspeaker system, effectively extending the natural reverberation time of the space.

Regenerative active acoustics using microphone-loudspeaker feedback to extend natural reverberation.

Figure 7: A Regenerative (non-Inline) Acoustics System and the Acoustic Environment

Advantages

• Preserves and extends the native acoustic character of the venue
• Enhances both audience and performer experience
• Often perceived as highly authentic, as it builds on the hall’s natural identity

Limitations

• Some degree of coloration may be introduced to maintain stability
• Adaptive filtering can further affect timbre
• Reinforces poor acoustics if the native room sound is weak
• Heavily damped venues with very short reverberation times offer limited energy for regeneration
• Excessive reverberation extension can lead to unwanted amplification of audience noise or mechanical sounds
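A commonly quoted first-order model relates the round-trip loop gain of a regenerative system to the resulting reverberation-time extension, and also exposes the level rise that accompanies it. This is a textbook approximation under the stated assumptions, not a description of any specific product's algorithm:

```python
import math

def regenerative_rt(rt_natural_s, loop_gain):
    """First-order model: a round-trip loop (power) gain g re-injects a
    fraction of the decaying energy, stretching the reverberation time by
    1 / (1 - g). g must stay below 1.0, or the loop becomes unstable
    (audible feedback)."""
    if not 0.0 <= loop_gain < 1.0:
        raise ValueError("loop gain must be in [0, 1) for a stable loop")
    rt = rt_natural_s / (1.0 - loop_gain)
    # Steady-state reverberant level rises with the RT ratio:
    level_increase_db = 10.0 * math.log10(rt / rt_natural_s)
    return rt, level_increase_db

rt, gain_db = regenerative_rt(1.2, 0.5)  # 1.2 s room, 50 % loop gain
# -> RT doubles to 2.4 s, reverberant level rises by ~3 dB
```

The coupled level rise is the root of the "longer is louder" behaviour often associated with purely regenerative designs.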

4.2.2 Inline Active Acoustic Systems

Inline active acoustic systems use a direct electronic signal path between source microphones and loudspeakers. Unlike regenerative systems, the signal does not first propagate acoustically through the room before being processed.

How it works

Captured microphone signals are routed directly into DSP algorithms that generate a synthetic diffuse field. These algorithms may include reverberation engines, convolution, decorrelation, spectral shaping and diffuse-field synthesis. The processed signals are then distributed to the loudspeaker system to create the desired acoustic impression.

Inline systems are therefore capable of actively shaping or correcting aspects of the acoustic environment, independent of the venue’s natural reverberation.
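One widely used decorrelation technique for synthesizing a diffuse field from dry microphone signals is convolution with sparse random-sign ("velvet noise") FIR filters. The sketch below is a generic illustration of that technique, not a description of any particular product's DSP:

```python
import numpy as np

def velvet_decorrelators(n_speakers, length=1024, density=0.03, seed=1):
    """Generate one sparse random-sign FIR per loudspeaker. Convolving the
    same dry signal with these filters yields mutually decorrelated feeds
    that fuse perceptually into a diffuse impression."""
    rng = np.random.default_rng(seed)
    filters = np.zeros((n_speakers, length))
    for k in range(n_speakers):
        taps = rng.random(length) < density           # sparse tap positions
        signs = rng.choice([-1.0, 1.0], size=length)  # random polarity
        filters[k] = taps * signs
        filters[k, 0] = 1.0                           # keep the direct sound
        filters[k] /= np.linalg.norm(filters[k])      # equal energy per feed
    return filters

firs = velvet_decorrelators(4)
feeds = [np.convolve(np.ones(64), h) for h in firs]  # one dry input, 4 decorrelated feeds
```

Because each filter uses an independent tap pattern, no two loudspeaker feeds are coherent copies of each other, which is what prevents phantom imaging and creates envelopment instead.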

Inline active acoustics generating diffuse reverberant field using DSP and loudspeakers.

Figure 8: An Inline Active Acoustics System and the Acoustic Environment

Advantages

• Avoids the “longer is louder” limitation associated with regenerative systems
• Strong feedback control without heavy coloration of the reverberant field
• Capable of correcting deficiencies in the natural acoustic environment
• Highly predictable and repeatable behaviour

Limitations

• Often perceived as less authentic than regenerative systems
• More complex configuration required for natural audience-to-audience and audience-to-stage interaction

Enhancement zones in inline systems

Stage to audience:
Provides reverberation enhancement so the audience perceives an appropriately sized venue for the performance type.

Audience to stage:
Allows performers to perceive audience presence and ambience.

Stage to stage:
Functions as an electronic orchestra shell, improving ensemble coherence.

Audience to audience:
Creates an immersive diffuse field that envelops listeners within the acoustic space.
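As a configuration sketch, the four zones above could be represented as capture-to-playback pairs with independent send trims. All names and dB values below are hypothetical, chosen only to show the structure:

```python
# Hypothetical gain offsets (dB) per enhancement zone, relative to a global
# reverberation level; the zone names follow the list above.
ZONE_TRIM_DB = {
    ("stage", "audience"):    0.0,  # main reverberation enhancement
    ("audience", "stage"):   -6.0,  # ambience back to the performers
    ("stage", "stage"):      -3.0,  # electronic orchestra shell
    ("audience", "audience"): -4.0, # enveloping diffuse field
}

def zone_trim(src, dst):
    """Look up the send trim for a capture-zone -> playback-zone pair."""
    return ZONE_TRIM_DB[(src, dst)]
```

Keeping each zone independently adjustable is what allows an inline system to balance audience envelopment against on-stage ensemble support.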

4.2.3 Hybrid Active Acoustic Systems

Hybrid active acoustic systems combine the authenticity of regenerative processing with the control and flexibility of inline processing. Rather than treating these approaches as separate layers, hybrid systems integrate both processes into a unified acoustic architecture.

Key characteristics

• Regenerative processing extends the native acoustic environment
• Inline processing provides corrective tools and extended control
• Both processes operate in tight synchronization
• Greater tuning flexibility across diverse acoustic conditions

Scalability and venue-dependent balance

In hybrid systems, the balance between regenerative and inline contributions is not fixed. It is adjusted based on venue-specific parameters, including:

• Room geometry
• Absorption characteristics
• Modal behaviour
• Gain-before-feedback margins

By dynamically balancing both processes, hybrid active acoustic systems can deliver consistent and natural results across a wide range of venues.

The perceptual success of a hybrid active acoustic system is ultimately determined by the quality of integration between regenerative and inline processes. Poorly synchronized systems behave as layered effects, while well-integrated systems function as a single coherent acoustic field.

5. Voice-lift

Voice-lift is an active acoustic application designed to improve speech intelligibility by subtly reinforcing a presenter’s voice, without the need for handheld or body-worn microphones. Unlike conventional sound reinforcement, voice-lift provides natural, evenly distributed speech that integrates seamlessly with the acoustic environment.

Voice-lift systems operate by capturing speech through the existing active acoustic microphones, processing the signals via DSP, and redistributing them through amplifiers and loudspeakers. The result is clear and comfortable speech throughout the space, without the perceptual impression of amplification.


Target level and purpose

Conversational speech typically measures approximately 65–70 dB SPL at the listener position. Voice-lift systems are designed to reproduce speech at this level, avoiding the artificially elevated sound pressure levels (often 85 dB SPL or higher) commonly associated with public address or sound reinforcement systems.

The purpose of voice-lift is therefore to enhance intelligibility while preserving a natural listening experience, suitable for lecture halls, council chambers, courtrooms and similar environments.


Practical limitations and hybrid approaches

In practice, many voice-lift implementations are limited by available gain before feedback, which can lead to inconsistent results in larger or acoustically challenging spaces. To improve performance, some systems augment the active acoustic microphones with directional or beam-steering microphones, typically positioned near the presenter location.

This hybrid approach can increase usable gain and speech clarity, but requires careful system design and tuning to maintain transparency and avoid perceptual artifacts.
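The gain-before-feedback budget can be estimated with the classic PAG/NAG rule from sound-system engineering: a design is feasible when the potential acoustic gain (PAG) meets or exceeds the needed acoustic gain (NAG). The formula is standard; the distances in the example are hypothetical:

```python
import math

def potential_acoustic_gain(d0, d1, ds, d2, nom=1, margin_db=6.0):
    """PAG = 20*log10(d0*d1 / (ds*d2)) - 10*log10(NOM) - margin.
    d0: talker->listener, ds: talker->mic, d1: mic->loudspeaker,
    d2: loudspeaker->listener (metres). NOM = number of open microphones;
    margin_db is the feedback stability margin (commonly 6 dB)."""
    return (20.0 * math.log10((d0 * d1) / (ds * d2))
            - 10.0 * math.log10(nom) - margin_db)

def needed_acoustic_gain(d0, ead):
    """NAG = 20*log10(d0 / EAD): the gain needed so the farthest listener
    hears the talker as if at the equivalent acoustic distance EAD."""
    return 20.0 * math.log10(d0 / ead)

# Hypothetical lecture hall: farthest listener 15 m away, a distributed mic
# 2 m from the presenter, ceiling loudspeakers 4 m from the mic and 2 m
# from the listener, with two microphones open.
pag = potential_acoustic_gain(d0=15.0, d1=4.0, ds=2.0, d2=2.0, nom=2)
nag = needed_acoustic_gain(d0=15.0, ead=5.0)
feasible = pag >= nag
```

Doubling the number of open microphones costs 3 dB of PAG, which is one reason distributed voice-lift systems depend so heavily on careful microphone management.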


Voice-lift within immersive audio systems

Within an immersive audio framework, voice-lift should not be treated as an isolated function. Speech reinforcement interacts directly with both spatial audio objects and the acoustic environment. Well-designed immersive audio systems therefore integrate voice-lift as a subtle, transparent layer within the overall acoustic architecture, ensuring intelligibility without compromising spatial coherence or acoustic authenticity.

Voice-lift system reinforcing speech intelligibility via active acoustics microphones and loudspeakers.

Figure 9: A Voice-lift Active Acoustics System and the Acoustic Environment

6. SIAP’s Approach to Immersive Audio

Immersive audio systems are defined by the interaction between spatial audio objects and the acoustic environment in which they are reproduced. If the augmented acoustic environment lacks coherence or authenticity, spatial rendering alone cannot deliver a convincing immersive experience. In practice, this mismatch is a common cause of unsatisfactory results.

This system-level approach distinguishes SIAP from solutions that treat spatial audio, active acoustics and voice-lift as separate or loosely coupled subsystems.

SIAP’s approach to immersive audio is therefore based on the principle that acoustic augmentation and spatial rendering must function as a single, integrated system, rather than as independent layers.


Integrated system architecture

SIAP immersive audio systems are built on a hybrid architecture that combines regenerative and inline active acoustics with spatial audio rendering. These processes are tightly synchronized at hardware, software and firmware level to ensure phase-aligned operation across the entire system.

By avoiding loosely coupled or layered processing stages, SIAP systems maintain temporal and spectral coherence within the augmented acoustic field. This architectural integration allows immersive audio systems to scale reliably across different venue types, from highly absorptive spaces to reflective concert halls.


Tuning as a defining factor

While system architecture determines technical capability, the perceived quality of immersive audio is largely defined by tuning. SIAP therefore places strong emphasis on detailed tuning control.

SIAP systems provide parameters for spectral shaping, spatial distribution of reverberant energy and precise control of decay behaviour. This enables optimization of the acoustic response for each venue, while avoiding common artifacts such as double-decay or perceptual layering.


Scalable hybrid integration

In SIAP systems, regenerative and inline processes are treated as complementary elements within a unified acoustic framework. The balance between these processes is adjustable and venue-dependent, based on factors such as room geometry, absorption characteristics, modal behaviour and gain-before-feedback margins.

This hybrid integration ensures consistent performance and perceptual stability across a wide range of applications.


A unified immersive audio framework

By integrating spatial audio rendering, active acoustics and voice-lift within a soundscape-based framework, SIAP delivers immersive audio systems that are both technically robust and perceptually authentic, enabling confident deployment across diverse venues and use cases.

7. Applications

Within the soundscape-based immersive audio framework, these systems are commonly applied in:

• Concert halls
• Theatres
• Multipurpose venues
• Houses of worship
• Museums
• Education spaces
• Immersive art installations
• Spatial music performance
• Post-production & research environments

8. FAQ - Immersive Audio

Your questions, our answers.

What is immersive audio?

Immersive audio refers to systems that create a three-dimensional sound field around the listener. This includes spatial sound sources and acoustic enhancements that modify how a space sounds and feels. SIAP uses the internationally recognised ISO 12913 soundscape framework to define immersive audio objectively.

What is the difference between spatial audio and immersive audio?

Spatial audio focuses on the positioning of individual sound objects in space, while immersive audio includes both spatial rendering and modification of the acoustic environment. SIAP systems combine both to deliver a complete immersive experience.

What is active acoustics?

Active acoustics uses microphones, loudspeakers and signal processing to modify a room’s natural acoustics in real time. This allows venues to switch between different acoustic settings – for example, from a dry speech setup to a reverberant concert hall – at the push of a button.

What is Wave Field Synthesis?

Wave Field Synthesis (WFS) uses a dense array of loudspeakers to recreate the physical wavefronts of virtual sound sources. It provides stable localisation and smooth motion across large areas, but requires precise loudspeaker spacing and room tuning.

Can immersive audio be installed in an existing venue?

Yes. Immersive audio systems like SIAP’s are often retrofitted into existing venues. The design adapts to the room’s dimensions, acoustics and loudspeaker positions, with minimal architectural impact.

How many loudspeakers and microphones are needed?

The number depends on the application: spatial audio typically requires dense horizontal and vertical coverage, while active acoustics needs well-distributed microphones and loudspeakers. SIAP systems are scalable – from small rooms to large halls – with flexible channel counts.

What is hybrid active acoustics?

Hybrid active acoustics combines regenerative and inline processing. This approach merges the natural interaction of a room with precise control and tuning, resulting in transparent, dynamic acoustic enhancement across frequencies.

How does SIAP integrate spatial audio and active acoustics?

SIAP Acoustics systems process spatial audio and active acoustics in a unified platform, ensuring phase alignment, flexible routing and consistent timbre. This allows sound sources and room response to work as one coherent system.

Does SIAP support voice-lift?

Yes. SIAP’s voice-lift functionality enhances speech intelligibility by reinforcing natural speech with minimal latency and coloration – ideal for auditoriums, churches and education spaces.

9. Contact

Contact SIAP to discuss your immersive audio project or to request a technical consultation.
