How Resilient is Redundancy?

Journal Article from Lawo

Thu, 11 Aug 2022

Christian Struck

Senior Product Manager Audio Production, Lawo

Axel Kern

Senior Director Cloud and Infrastructure Solutions, Lawo


Redundancy has always been a major topic for broadcast operations to ensure that the show goes on despite a defective power supply or other failure. While all eggs were in one basket—i.e. in one place and close to one another—this approach was certainly helpful. Calling such a setup resilient would nevertheless be a stretch.

Conventional solutions launched before open-standards-based IP came along may be redundant up to a point, but that doesn’t make the operation as a whole resilient. Most are able to exchange control, audio and/or video data over an on-prem network, but connectivity to a wide-area network (WAN) spanning several cities or continents is not on the cards for them.

Widen the Area

A WAN-based infrastructure has enabled operators to leverage processing resources in off-premises data centers, of which there may be two for redundancy, as in Eurosport’s and many other cases. That, and the ability to access these resources from just about anywhere, are indisputable assets of an IP setup, but a lot more is required to make an operation resilient.

Separating the mixing console from the processing unit and the I/O stageboxes was an important step. Building WAN communication into all of these devices through ST 2110 and RAVENNA/AES67 compliance raised specific expectations regarding resilience, which a recently conducted trial confirms. The aim was to show a prospective client that immersive audio mixing remains possible even if one of the two processing cores is down. The trial involved one A__UHD Core in a sporting arena in Hamburg, and a second near Frankfurt, at the production facility.

The team successfully demonstrated that if the preferred core in Hamburg, controlled from an mc² console near Frankfurt for the live broadcast mix, becomes unavailable, the core closer to the console immediately takes over. The physical location of the second core is irrelevant, by the way, as long as it is connected to the same network. WAN-based redundancy is an important element of a solid resilience strategy, even though the unavailability of a processing core is only the seventh likeliest incident in a series of plausible failures from which operators can recover automatically, according to Lawo’s customer service. This degree of redundancy involves so-called “air gapped units”, i.e. hardware in two separate locations, to ensure continuity if the “red” data center is flooded or subject to a fire: the redundant, “blue” data center automatically takes over.
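The takeover behavior described above can be illustrated with a minimal sketch. It is not Lawo's implementation: the `Core` class, the `priority` field and the `select_active_core` function are hypothetical names chosen for illustration, and the sketch simply assumes that each core reports whether it is reachable on the shared network.

```python
from dataclasses import dataclass


@dataclass
class Core:
    """Hypothetical model of a processing core on the shared network."""
    name: str
    location: str
    priority: int          # lower value = preferred core
    reachable: bool = True


def select_active_core(cores):
    """Return the reachable core with the lowest priority value, or None."""
    candidates = [c for c in cores if c.reachable]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.priority)


cores = [
    Core("core-a", "Hamburg", priority=0),
    Core("core-b", "near Frankfurt", priority=1),
]

print(select_active_core(cores).name)   # core-a

# The preferred core becomes unavailable: the second core takes over.
cores[0].reachable = False
print(select_active_core(cores).name)   # core-b
```

Note that the location field plays no role in the selection, which mirrors the point that the physical location of the second core does not matter as long as it is on the same network.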

Strictly speaking, the five likeliest glitches—control connection loss, routing failure, media connection failure, control system failure, and power supply failure—require no hardware redundancy when the audio infrastructure is built around an A__UHD Core. That said, having a spare unit online somewhere is always a good idea. It is also required as fail-over for incident number six, DSP/FPGA failure.

Explode to Reinforce

A second important aspect is to decentralize what used to be in one box. Even some IP-savvy solutions are still supplied as a single unit that handles both control and processing. For maximum resilience, one device should do the processing, while a COTS server or Dockerized container relays the control commands it receives from a mixing console to the processing core, and a switch fabric does the routing. Separating control, processing and routing, and making all three redundant, minimizes the risk of downtime. In addition, except for at least one switch close to each required component, all devices or CPU services can be in different geographic locations.

It doesn’t stop there. A redundant IP network with red and blue lines is built around a switch fabric. Without going into too much detail, certain management protocols (PIM and IGMP) may cause issues that could seriously affect broadcast workflows or even make them impossible. The first is related to situations where the red and blue lines are routed to the same spine switch. An issue with that switch means that this part of the network not only ceases to be redundant but may also stop working altogether: it is a single point of failure. The second issue is related to how switches distribute multicast streams over the available number of ports when they are not bandwidth-aware. In a non-SDN network, this may lead to situations where one port is oversubscribed, i.e. asked to transmit more gigabits per second than it can muster, which causes errors at the receiving end.
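Both issues can be made concrete with a small sketch. It does not model PIM, IGMP or any vendor's SDN controller; the function names, switch names and stream rates below are assumptions made purely for illustration. The first function flags switches that appear on both the red and the blue path, and the remaining functions compare a placement that ignores bandwidth with one that tracks the load on each port.

```python
def shared_switches(red_path, blue_path):
    """Switches used by both paths: each one is a single point of failure."""
    return sorted(set(red_path) & set(blue_path))


def assign_round_robin(streams_gbps, num_ports):
    """Place streams on ports in turn, without looking at their bit rates."""
    loads = [0.0] * num_ports
    for i, rate in enumerate(streams_gbps):
        loads[i % num_ports] += rate
    return loads


def assign_bandwidth_aware(streams_gbps, num_ports, capacity_gbps):
    """Place each stream on the least-loaded port; refuse if it cannot fit."""
    loads = [0.0] * num_ports
    for rate in streams_gbps:
        port = min(range(num_ports), key=lambda p: loads[p])
        if loads[port] + rate > capacity_gbps:
            raise ValueError("no port has enough headroom for this stream")
        loads[port] += rate
    return loads


def oversubscribed(loads, capacity_gbps):
    """Indices of ports asked to carry more than their capacity."""
    return [p for p, load in enumerate(loads) if load > capacity_gbps]


# Hypothetical topology: both lines cross the same spine switch.
red = ["leaf-r1", "spine-1", "leaf-r2"]
blue = ["leaf-b1", "spine-1", "leaf-b2"]
print(shared_switches(red, blue))            # ['spine-1']

# Hypothetical stream rates in Gbit/s, two 10 Gbit/s ports.
streams = [6.0, 1.5, 6.0, 1.5]
capacity = 10.0

naive = assign_round_robin(streams, 2)
print(naive, oversubscribed(naive, capacity))    # [12.0, 3.0] [0]

aware = assign_bandwidth_aware(streams, 2, capacity)
print(aware, oversubscribed(aware, capacity))    # [7.5, 7.5] []
```

In the naive case the same four streams overload one port while the other sits largely idle; the bandwidth-aware placement spreads the identical traffic without exceeding either port's capacity.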

These and other topics are being addressed by companies like Arista and Lawo via a Multi Control Service routine and direct influence of the VSM studio manager software on traffic shaping. The goal is to avoid failures and the oversubscription of network ports, and to allow operators of large installations to immediately confirm the status of their switching and routing operations.

Combining the above with the HOME management platform for IP infrastructures adds yet another building block. HOME assists operators with automatic discovery and registration. It also helps them control processing cores by hosting the MCX control software for mc² consoles either on networked COTS servers or directly in a virtualized environment, and it allows them to switch dynamically from one processing core to the other, one console surface to the next, or one MCX control instance to another if the need arises.

Stay in Control

Resilience necessarily includes control. VSM achieves seamless control redundancy with two pairs of COTS servers stationed in two different locations and automatic fail-over routines. Hardware control panels are not forgotten: if one stops working, connecting a spare, or firing up a software panel, and assigning it the same ID—which takes less than a minute—restores interactive control. (The control processes as such are not affected by control hardware failures, by the way.)

As installations migrate towards a private cloud/data center infrastructure, provisioning two (or in HOME’s case, three) geographically separated COTS servers with permanent status updates between the main and redundant units allows users to remain in control. If the underlying software architecture is cloud-ready, those who wish can ultimately move from hardware servers to service-based infrastructures in the cloud. Technologies like Kubernetes and AWS Load Balancer can then be used to provide elastic compute capacity that instantly grows and shrinks in line with changing workflow requirements. A welcome side effect is that no new hardware servers need to be purchased to achieve this kind of instant, high-level resilience.
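How permanent status updates between geographically separated servers can drive a takeover is sketched below. This is an assumption-laden illustration rather than a description of how VSM or HOME work internally: the site names, the two-second timeout and the functions `stale_units` and `pick_main` are all hypothetical. Each unit is assumed to record the time of the last status update it received from every unit, and a unit whose update is too old is treated as unavailable.

```python
def stale_units(last_update, now, timeout_s):
    """Units whose most recent status update is older than timeout_s seconds."""
    return sorted(u for u, t in last_update.items() if now - t > timeout_s)


def pick_main(preference, last_update, now, timeout_s=2.0):
    """Return the first unit in preference order that is still reporting."""
    stale = set(stale_units(last_update, now, timeout_s))
    for unit in preference:
        if unit in last_update and unit not in stale:
            return unit
    return None


preference = ["site-a", "site-b", "site-c"]

# Timestamps in seconds; all three units reported recently.
last_update = {"site-a": 100.0, "site-b": 100.4, "site-c": 100.2}
print(pick_main(preference, last_update, now=101.0))   # site-a

# site-a stops reporting while the other two keep sending updates.
last_update.update({"site-b": 104.4, "site-c": 104.2})
print(stale_units(last_update, now=105.0, timeout_s=2.0))   # ['site-a']
print(pick_main(preference, last_update, now=105.0))        # site-b
```

Passing the current time in as a parameter keeps the sketch deterministic; a real system would read a clock and would also need to handle the case where the sites lose contact with one another but each keeps running.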

After experiencing the benefits of resilient, elastic control, some operators may wonder whether a similar strategy is also possible for audio and video processing. The short answer is: “If you like.” Quite a few operators are wary of the “intangible cloud” and may be relieved to learn that the ability to architect private data centers in a redundant configuration already allows them to achieve a high degree of resilience.

One Leap Closer

A genuinely resilient broadcast or AV network is a self-healing architecture that always finds a way to get essences from A to B in a secure way. Users may not know—or care—where those locations are, but the tools they use to control them do. And they quickly find alternatives to keep the infrastructure humming.

The one remaining challenge was to provide operators with an almost failsafe infrastructure. A lot has been achieved to make broadcast and AV infrastructures resilient by design while keeping them intuitive to operate.
