
Enabling Fault-Tolerant Multicast in Cloud-Native Architectures: Bridging On-Prem and Multi-Cloud Environments


Tejas Gajjar, Technical Lead & Principal Architect, Macy’s Inc. | Cloud, Infrastructure and Platform Engineering | IEEE Senior Member | BCS Fellow

Collaborators: swXtch.io, Oracle Cloud Infrastructure (OCI), Azure Networking Team


This white paper was originally published as a preprint on IEEE TechRxiv on August 4, 2025.

Original source: Tejas Gajjar. "Enabling Fault-Tolerant Multicast in Cloud-Native Architectures - Bridging On-Prem and Multi-Cloud Environments." TechRxiv. DOI: 10.36227/techrxiv.175433366.65304469 Direct link: https://www.techrxiv.org/doi/full/10.36227/techrxiv.175433366.65304469


ABSTRACT

Cloud providers lack native support for multicast, limiting deployment of high-availability applications that depend on real-time coordination and consistent data propagation. This white paper presents a solution for enabling software-defined, fault-tolerant multicast communication in Oracle Cloud Infrastructure (OCI) and multi-cloud environments using swXtch.io’s cloudSwXtch platform. It covers technical challenges, overlay architecture, license management, NSG troubleshooting, and validation in real-world use cases.


The solution supports sub-millisecond synchronization, seamless failover, and elastic scalability across Availability Domains (ADs), enabling enterprises to modernize legacy data center multicast patterns in the cloud.


  1. INTRODUCTION

Multicast communication is essential for applications like distributed databases, real-time analytics, trading systems, and event coordination tools. However, cloud platforms lack support for foundational multicast protocols [8][9][10]. This prevents businesses from migrating real-time systems to the cloud without significant architecture changes.

To solve this, we developed and tested a multicast overlay using swXtch.io’s cloudSwXtch platform [1]. This solution brings resilient, high-performance multicast support to public cloud environments using enhanced UDP, time synchronization, and intelligent routing.


  2. BUSINESS NEED

Industries such as financial services, logistics, media, and AI increasingly depend on systems that are both low-latency and fault-tolerant. Because public clouds do not support multicast, teams fall back on alternatives such as TCP unicast replication, which degrades performance and adds operational overhead.

The goal of this project was to make multicast replication possible in OCI’s Ashburn region, with automatic failover, flexibility, and compatibility across availability domains.


  3. TECHNICAL CHALLENGES

Cloud-based multicast implementation faces multiple hurdles:

  • Lack of native IGMP/MLD protocol support in cloud networking [8][9].

  • Unreliable UDP transport layer and dynamic packet routing.

  • Clock synchronization gap across availability zones [3].

  • NSG (Network Security Group) misconfiguration [13] and inconsistent logging in OCI.

  • Complex overlay membership and routing table management.

  • Ensuring seamless failover with minimal latency.


4. OVERLAY-BASED MULTICAST SOLUTION

4.1 Architecture

The multicast overlay uses swXtch.io nodes with dual virtual NICs, forming a distributed mesh. Enhanced UDP transport supports packet reliability, while PTP ensures sub-millisecond synchronization [3]. IGMP logic is embedded in the overlay, making multicast joins/leaves dynamic.
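The native mechanism the overlay emulates can be made concrete with a short sketch. The helper below builds the `ip_mreq` structure behind a standard IGMP group join at the socket level; the group address is hypothetical, and cloudSwXtch's actual join path is proprietary, so this shows only the well-known sockets-API equivalent.

```python
import ipaddress
import socket
import struct

def make_mreq(group: str, iface_ip: str = "0.0.0.0") -> bytes:
    """Build the 8-byte ip_mreq structure used by an IGMP group join.

    This is the membership request that the overlay's embedded IGMP logic
    carries on behalf of applications; a native cloud VNIC silently drops
    the resulting IGMP report, which is the gap the overlay fills.
    """
    if not ipaddress.ip_address(group).is_multicast:
        raise ValueError(f"{group} is not in the multicast range 224.0.0.0/4")
    # ip_mreq: 4-byte group address followed by 4-byte local interface address
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface_ip))

def join_group(group: str, port: int) -> socket.socket:
    """Open a UDP socket and ask the kernel to join `group`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, make_mreq(group))
    return sock

print(make_mreq("239.1.1.10").hex())  # ef01010a00000000
```

On-premises, this `setsockopt` triggers an IGMP membership report that switches honor; in OCI or Azure it has no effect on the fabric, which is why the overlay must track membership itself.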

My Role:

  • Designed and implemented the multi-cloud orchestration framework using Terraform and Helm [6][7].

  • Architected deployment across OCI, Azure, and GCP with consistent overlay behavior.

  • Created self-healing, fault-tolerant routes with < 50ms failover recovery [2].

4.2 Membership & Routing

Overlay nodes natively propagate multicast group joins and leaves, eliminating the need for tunneling. Custom routing logic ensures deterministic delivery across cloud regions and allows dynamic scale-out.
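As an illustration of the membership propagation described above, here is a toy group table; the class and its method names are invented for this article and are not part of the cloudSwXtch API.

```python
from collections import defaultdict

class GroupTable:
    """Toy multicast membership table: maps group address -> member node IDs.

    Overlay nodes exchange joins/leaves as control messages; each node applies
    them to a table like this to decide where to replicate each packet.
    """
    def __init__(self):
        self._members = defaultdict(set)

    def join(self, group: str, node: str) -> None:
        self._members[group].add(node)

    def leave(self, group: str, node: str) -> None:
        self._members[group].discard(node)
        if not self._members[group]:
            del self._members[group]  # prune groups with no remaining members

    def fanout(self, group: str, sender: str) -> list:
        """Deterministic delivery set: every member except the sender, sorted."""
        return sorted(self._members.get(group, set()) - {sender})

table = GroupTable()
table.join("239.1.1.10", "node-a")
table.join("239.1.1.10", "node-b")
table.join("239.1.1.10", "node-c")
print(table.fanout("239.1.1.10", "node-a"))  # ['node-b', 'node-c']
```

Sorting the fanout set is one simple way to get the deterministic delivery order the text mentions; new members extend the table without any per-node reconfiguration, which is what allows dynamic scale-out.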

4.3 Fault Tolerance & Synchronization

Using SDN Fast Failover techniques and PTP-based synchronization, the system ensures both network resilience and strict event ordering.

Contribution:

  • Developed failover detection logic that reduced node recovery time by 50%.

  • Authored auto-synchronization routines to dynamically maintain group consistency.

  • Integrated Prometheus and Grafana for full-stack monitoring and observability [4][5].
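A minimal sketch of heartbeat-style failover detection, assuming a fixed 50 ms timeout budget in line with the recovery target in [2]; the class, peer names, and timestamps are illustrative, not the production logic.

```python
class FailoverDetector:
    """Heartbeat-based failure detector with a fixed timeout budget.

    A peer not heard from within `timeout_ms` is declared failed, at which
    point traffic would be rerouted to its standby overlay path.
    """
    def __init__(self, timeout_ms: float = 50.0):
        self.timeout_ms = timeout_ms
        self.last_seen = {}  # peer id -> timestamp of last heartbeat, in ms

    def heartbeat(self, peer: str, now_ms: float) -> None:
        self.last_seen[peer] = now_ms

    def failed_peers(self, now_ms: float) -> list:
        """Peers whose last heartbeat is older than the timeout, sorted."""
        return sorted(p for p, t in self.last_seen.items()
                      if now_ms - t > self.timeout_ms)

det = FailoverDetector(timeout_ms=50.0)
det.heartbeat("node-a", 0.0)
det.heartbeat("node-b", 40.0)
print(det.failed_peers(60.0))  # ['node-a']
```

In practice the timeout must be tuned against heartbeat jitter: too tight and healthy peers flap, too loose and recovery time exceeds the <50 ms target.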


5. IMPLEMENTATION IN ORACLE CLOUD (OCI)

5.1 swXtch.io VM Deployment & Configuration

  • Deployed swXtch.io overlay nodes with dual vNICs and updated routing using netplan.

  • Aligned MAC and IP settings to ensure multicast consistency.

  • Used iptables and static routing to accept multicast from X.X.X.0/21.

My Role:

  • Led the design of overlay deployment strategy in OCI.

  • Troubleshot complex routing and firewall issues across subnets.

5.2 License Provisioning and Service Restart SOP

A validated procedure was created for safely applying licenses without causing disruption:

1. Validate the license fingerprint:

curl -s http://X.X.X.Y/top/dashboard | grep -m 2 -Eo '"fingerprint"[^,]*' | head -1

2. Back up the current license, then apply the new one:

sudo cp /swxtch/license.json /swxtch/license.json.$(date +%Y%m%d%H%M%S)
sudo cp NEWLICENSEfile /swxtch/license.json

3. Restart the swXtch services:

sudo systemctl restart swxtch-ctrl.service swxtch-repl.service
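For operators scripting the SOP rather than running it interactively, the fingerprint grep above can be mirrored in Python; the dashboard payload shown is a made-up example, since the real response format is not documented here.

```python
import re

def extract_fingerprint(dashboard_text: str) -> str:
    """Pull the fingerprint field out of a dashboard response.

    Mirrors the SOP's grep: the first `"fingerprint"...` token up to the
    next comma. The sample payload below is hypothetical; the actual
    swXtch dashboard format may differ.
    """
    match = re.search(r'"fingerprint"[^,]*', dashboard_text)
    if match is None:
        raise ValueError("no fingerprint field in dashboard response")
    return match.group(0)

sample = '{"host":"swxtch-1","fingerprint":"AB12-CD34","licensed":true}'
print(extract_fingerprint(sample))  # "fingerprint":"AB12-CD34"
```

Raising on a missing field (rather than returning an empty string, as the shell pipeline would) makes it easier to abort an automated rollout before touching the license file.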

Contribution:

  • Authored and validated the license update SOP.

  • Ensured zero-downtime upgrades through controlled validation.


5.3 NSG Troubleshooting and Resolution

Issue: UDP packets were dropped between overlay nodes due to an incorrect NSG (Network Security Group) source CIDR configuration. Cloud flow logs misleadingly displayed an “ACCEPT” action despite the actual packet drops [13].

Original NSG Configuration: X.X.X.X/22

Failing Source IP: X.X.X.Y (outside of the configured CIDR block)

Root Cause: The NSG source rule did not fully cover all relevant subnets in the non-production VCN, leading to silent packet drops during overlay node communication.

Resolution: Expanded the NSG ingress rule to X.X.X.X/21, which encompassed all required subnets. This update immediately restored connectivity and resolved the UDP drop issue between overlay nodes.
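The root cause is easy to reproduce with Python's `ipaddress` module; the 10.0.0.0/22 and 10.0.4.7 values below are hypothetical stand-ins for the redacted X.X.X.* addresses, chosen only to show how a /22 rule can miss a subnet that the widened /21 covers.

```python
import ipaddress

# Hypothetical addresses standing in for the redacted X.X.X.* values.
old_rule = ipaddress.ip_network("10.0.0.0/22")  # original NSG source CIDR (4 /24 subnets)
new_rule = ipaddress.ip_network("10.0.0.0/21")  # widened rule from the fix (8 /24 subnets)
peer = ipaddress.ip_address("10.0.4.7")         # overlay node in the missed subnet

print(peer in old_rule)  # False -> NSG silently drops the node's traffic
print(peer in new_rule)  # True  -> the /21 covers the additional subnets
```

A membership check like this is cheap to run as a pre-deployment validation step for every overlay node IP against the NSG source rules.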

Fig. 1: NSG Troubleshooting Workflow

Additional Insight: The inconsistency between NSG flow logs and actual packet routing behavior was identified as a misdirection point in troubleshooting. This issue was escalated as a product-level feedback item to the cloud provider for improved diagnostic transparency.

  6. CASE STUDY: DISTRIBUTED CONSENSUS WITH MULTICAST REPLICATION

6.1 Use Case

A globally distributed database cluster implementing Paxos required multicast to replicate state rapidly across nodes.

6.2 Integration

  • Deployed swXtch.io nodes alongside database nodes.

  • Enabled IGMP/FEC support with application layer validation.
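To illustrate the role of FEC in the integration, here is a single-loss XOR parity sketch; cloudSwXtch's actual FEC scheme is not described in this paper, so this only demonstrates the principle of recovering one lost packet from the survivors plus a parity packet, without retransmission.

```python
from functools import reduce

def xor_parity(packets) -> bytes:
    """Compute one XOR parity packet over equal-length data packets."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), packets))

def recover(received, parity: bytes) -> bytes:
    """Rebuild the single missing packet: XOR of survivors and parity."""
    return xor_parity(list(received) + [parity])

data = [b"pkt1", b"pkt2", b"pkt3"]
parity = xor_parity(data)          # sent alongside the data packets
survivors = [data[0], data[2]]     # pretend pkt2 was lost in transit
print(recover(survivors, parity) == data[1])  # True
```

Recovering losses locally instead of requesting retransmits is what keeps replication latency flat even when the underlying UDP path drops packets.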

6.3 Results

Metric                  | Baseline | With Overlay
------------------------|----------|--------------------------
Synchronization latency | 18 ms    | 2.4 ms
Failover response time  | ~120 ms  | <50 ms
Packet loss rate        | 0.1%     | <0.01%
Cluster scale-out       | 1x       | 3x without config change

Contribution:

  • Led the design and live multi-cloud validation of the database overlay integration.

  • Benchmarked and optimized performance, influencing future database architecture.


7. EVALUATION & RESULTS

  • Performance: Sub-millisecond propagation verified across regions.

  • Observability: Prometheus/Grafana stack monitored group health and performance.

  • Resilience: Packet loss under 0.01%; seamless failover in failure simulation.


  8. INDUSTRY IMPACT AND INNOVATION

8.1 Bringing Multicast into the Cloud-Native Era

Traditionally, multicast communication has been confined to on-premises environments due to its reliance on Layer-2 and Layer-3 protocols like IGMP, MLD, and PIM [8][9][10]. These protocols are not natively supported by most public cloud providers, which has hindered the migration of real-time, event-driven applications to cloud platforms like Oracle Cloud Infrastructure (OCI), Microsoft Azure, and Google Cloud Platform.


The multicast overlay solution introduced in this white paper, built on swXtch.io’s cloudSwXtch platform [1][13], directly addresses this limitation. It enables lossless, fault-tolerant multicast communication over software-defined networks in cloud and hybrid environments. This effectively extends multicast capabilities into the cloud-native era, making it possible to run latency-sensitive, distributed applications that were once considered cloud-incompatible.


8.2 Broad Industry Applicability

The ability to support multicast in cloud environments opens the door to significant benefits across multiple industries:

Financial Services

Facilitates the distribution of real-time market data and order book synchronization using protocols such as FIX and FAST.

Reduces end-to-end data transmission latency from tens of milliseconds to sub-3 ms, enhancing competitiveness in high-frequency trading [10][3].


Retail and Logistics

Supports distributed inventory synchronization, warehouse automation, and edge-device communication.


Real-time updates for promotions and loyalty programs across geographically dispersed point-of-sale systems improve customer experience and operational efficiency [13].


Media and Broadcasting

Enables multicast replication for IPTV and live streaming.

Integrates with cloud-based ingest, encoding, and distribution pipelines using overlay networking, without the need for physical broadcast infrastructure [10][1][13].


Industrial IoT and Smart Manufacturing

Powers event-driven propagation in MES/SCADA systems.

Ensures accurate, real-time telemetry sharing across production zones or plants [4][5].


Cloud AI and Digital Twins

Facilitates low-latency model parameter exchange for collaborative AI training or federated learning systems.

Provides real-time broadcasting of simulation state for use cases such as metaverse development and smart infrastructure monitoring (Shawish & Salama, 2014).


8.3 Key Technical Contributions

This work introduces several original engineering enhancements:

Engineering Advancements

Implements reliable multicast over UDP using Forward Error Correction (FEC) and redundant overlay paths [1].

Achieves sub-millisecond synchronization using Precision Time Protocol (PTP), critical for maintaining order in event-driven architectures [3].

Introduces auto-healing group orchestration to dynamically manage joins/leaves across distributed nodes.

Utilizes SDN-based fast failover strategies to maintain event propagation with under 50 ms recovery [2].
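The PTP synchronization mentioned in this list rests on a simple four-timestamp exchange. The arithmetic below follows the standard IEEE 1588 offset and mean-path-delay formulas; the microsecond timestamps are made up for illustration.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Standard IEEE 1588 two-way exchange arithmetic.

    t1: master sends Sync          t2: slave receives Sync
    t3: slave sends Delay_Req      t4: master receives Delay_Req
    Assumes a symmetric network path in both directions.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2   # one-way path delay
    return offset, delay

# Hypothetical timestamps in microseconds: slave clock runs 500 us ahead
# of the master over a symmetric 200 us path.
offset, delay = ptp_offset_and_delay(t1=0, t2=700, t3=1000, t4=700)
print(offset, delay)  # 500.0 200.0
```

The symmetric-path assumption is why the overlay pins PTP traffic to stable routes: asymmetry between the two directions shows up directly as clock offset error.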

Observability and Monitoring

Integrates Prometheus and Grafana for real-time metrics and alerting, ensuring visibility into group health and packet metrics [4][5].

Enables per-group and per-node telemetry for fault isolation and performance tuning.

DevOps and Infrastructure-as-Code

Deploys infrastructure seamlessly across OCI, Azure, and GCP using Terraform and Helm charts [6][15].

Automates license provisioning and NSG configuration for scalable, reproducible rollouts [13][14].

Standards Readiness

Supports future collaboration with IETF’s MBONED and INTAREA working groups to standardize multicast-over-cloud architectures [18].

Proposes encapsulation models aligned with RFC 7348 (VXLAN) [11] and RFC 5110 [12] for compatibility across multi-vendor environments.
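To make the RFC 7348 alignment concrete, the sketch below packs the 8-byte VXLAN header defined in that RFC; the VNI value is arbitrary, and nothing here is taken from the overlay's actual encapsulation format.

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header from RFC 7348.

    Byte 0 carries the I flag (0x08) marking the VNI as valid; the 24-bit
    VNI occupies bytes 4-6; all remaining bits are reserved and zero.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags_word = 0x08000000          # I flag set, reserved bits zero
    vni_word = vni << 8              # VNI in the upper 24 bits of the word
    return struct.pack("!II", flags_word, vni_word)

hdr = vxlan_header(5001)             # 5001 = 0x1389, an arbitrary example VNI
print(hdr.hex())  # 0800000000138900
```

In a VXLAN deployment this header sits between the outer UDP header (destination port 4789) and the encapsulated inner Ethernet frame.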

8.4 Strategic Impact

This multicast overlay solution offers a transformative pathway for enterprises:

  • Accelerates Cloud Adoption: Makes it feasible to lift-and-shift mission-critical multicast-dependent systems.

  • Improves Resilience: Achieves fault tolerance through auto-failover and self-healing overlays.

  • Enhances Performance Predictability: Ensures application responsiveness by maintaining low latency and consistent packet delivery.

  • Fosters Ecosystem Collaboration: Encourages cross-industry standardization and interoperability through active participation in IETF and open-source communities.

8.5 Adoption Readiness and Competitive Advantage

Performance benchmarks during real-world OCI deployments confirm the solution’s robustness:

  • Latency improved by up to 86%

  • Failover recovery time cut by up to 60%

  • Packet loss reduced to less than 0.01%

These gains make the multicast overlay ready for enterprise production environments and position it as a reference model for cloud-native multicast implementations. For vendors and platforms, adopting this architecture offers a strategic advantage in delivering reliable, real-time communication at scale.


  9. FUTURE WORK


  10. CONCLUSION

By engineering and validating a multicast overlay across OCI and multi-cloud settings, this work resolves a long-standing gap in public cloud capabilities. The project demonstrates that real-time, lossless, and fault-tolerant communication is achievable in cloud-native architecture with proper tooling, orchestration, and technical leadership.


REFERENCES

  1. swXtch.io. cloudSwXtch Technical Documentation.

  2. SDN Fast Failover Research. arXiv preprint, 2022.

  3. Berkeley AutoLab. Precision Time Protocol for Distributed Systems (PTP).

  4. Prometheus Monitoring Toolkit.

  5. Grafana Observability Platform.

  6. Kubernetes Helm.

  7. Kubernetes Horizontal Pod Autoscaler (HPA).

  8. IETF RFC 2236 – Internet Group Management Protocol, Version 2 (IGMPv2).

  9. IETF RFC 3810 – Multicast Listener Discovery Version 2 (MLDv2).

  10. IETF RFC 7761 – Protocol Independent Multicast – Sparse Mode (PIM-SM).

  11. IETF RFC 7348 – Virtual eXtensible Local Area Network (VXLAN).

  12. IETF RFC 5110 – Overview of Multicast in MPLS/BGP IP VPNs.

  13. Oracle Cloud Infrastructure Documentation – Networking, VCN, and NSG configuration.

  14. Azure Virtual Network and NSG Documentation.

  15. GitHub Actions Documentation – CI/CD pipelines for Helm charts.

  16. ArtifactHub – Helm Chart Repository.

  17. Helm Best Practices Guide.

  18. MBONED IETF Working Group Charter.

