Ramtin Rampour, principal solutions architect at Opengear, looks at network resilience.
Operational environments are growing more heterogeneous by the month, as workloads span SD-WAN, 5G, and cloud interconnects, while MPLS and fibre continue to carry the majority of critical traffic. Complexity keeps climbing, yet customer expectations remain simple. Services must always be available, without excuses and without long recovery windows.
The pressure is rising because density and data flow are accelerating inside AI estates, which strains both network links and switching fabrics. The push to the edge, driven by the need to reduce latency and support real-time processing, is also multiplying exposure.
Sensors, gateways, and remote servers increase operational reach, but they also expand the number of places a fault can originate and attackers can probe. Outages continue to impact revenue, widen security risk, and threaten credibility.
That’s why resilience has moved from best practice to essential requirement. Resilience means driving the likelihood of failure as low as possible while recognising that failure can still occur and planning for it. It also means maintaining control when failure happens and recovering fast enough that disruption does not turn into lasting damage.
Scoping the challenge
Modern networks fail in familiar ways, even if the architecture looks new. Misconfiguration triggers outages. Routine changes introduce instability. Alerts often arrive too late, after users feel the impact. When a fault occurs within a remote site, the recovery plan too often depends on travel time, local access, or on a sequence of manual steps carried out under pressure.
The economics make that fragility harder to tolerate. A recent New Relic study puts the cost of high-impact outages for UK and Irish businesses at around $1 million to $3 million per hour, with a mean annual cost per organisation running far higher. Numbers vary by sector, but the point holds across all of them. Downtime drains money quickly, and the bill rarely stays contained to the IT function. Security pressure adds another layer of risk and operational complexity.
Distributed environments expand the attack surface, especially where edge sites have limited local support and less room for error. Verizon’s 2025 Data Breach Investigations Report highlights how common ransomware remains, linking it to 75% of system-intrusion breaches. Attackers exploit confusion during disruption, and a degraded network can slow containment precisely when speed is critical.
AI is often presented as the answer, and it can help, particularly for detection and operational automation. However, it is not a fail-safe option against downtime or attacks. Effective AI defence also depends on data quality, reliable telemetry, and stable and secure access to the systems it is meant to protect. When the primary network is unstable or when access has been restricted for containment, in-band tools can fail right when teams need them most.
Finding a solution
Organisations need to treat complexity as something to reduce and manage, not something merely to tolerate. The goal is practical resilience: preventing avoidable incidents, detecting degradation early, and recovering fast when disruption does occur.
Automation plays an important role here by removing repetitive steps and reducing the risk of misconfiguration. Many outages still begin with routine changes. A small update, a rushed fix, or an overlooked dependency can have a system-wide impact in complex environments.
By standardising common actions and enforcing configuration consistency, automation also reduces the number of manual interventions required. Predictive analytics builds on that foundation. Telemetry that feeds predictive models allows teams to see stress before users do, guiding capacity planning, maintenance windows, and early intervention.
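Enforcing configuration consistency can be as simple as comparing each device's running configuration against a golden baseline before and after a change. The sketch below illustrates the idea; the baseline lines and addresses are invented for illustration, not taken from any particular vendor's configuration syntax.

```python
# Hypothetical sketch: detect configuration drift against a golden baseline.
# Baseline lines and addresses are illustrative, not a real device config.

GOLDEN_BASELINE = {
    "ntp server 192.0.2.10",
    "logging host 192.0.2.20",
    "snmp-server community readonly RO",
}

def config_drift(running_config: str) -> dict:
    """Compare a device's running config against the golden baseline.

    Returns lines missing from the device and lines present that the
    baseline does not expect, so an automation pipeline can flag (or
    remediate) drift before it causes an outage.
    """
    running = {line.strip() for line in running_config.splitlines() if line.strip()}
    return {
        "missing": sorted(GOLDEN_BASELINE - running),
        "unexpected": sorted(running - GOLDEN_BASELINE),
    }
```

Run against every device after each change window, a check like this turns "an overlooked dependency" into a report line rather than an outage.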
This shifts operations away from reactive response and towards control, particularly in environments where AI workloads and distributed architectures leave little room for performance headroom. When changes are predictable, recovery becomes faster and less error-prone.
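A minimal form of "seeing stress before users do" is smoothing utilisation telemetry and alerting on sustained pressure rather than single spikes. The sketch below uses an exponentially weighted moving average (EWMA); the smoothing factor, threshold, and sample values are illustrative assumptions.

```python
# Hypothetical sketch: flag sustained link stress from utilisation
# telemetry using an exponentially weighted moving average (EWMA).
# alpha and threshold are illustrative assumptions, not tuned values.

def ewma_alerts(samples, alpha=0.3, threshold=0.8):
    """Return (index, smoothed_value) pairs where the EWMA of link
    utilisation crosses the threshold -- sustained stress, not a blip."""
    ewma = samples[0]
    alerts = []
    for i, x in enumerate(samples[1:], start=1):
        ewma = alpha * x + (1 - alpha) * ewma  # new sample weighted by alpha
        if ewma > threshold:
            alerts.append((i, round(ewma, 3)))
    return alerts
```

Because the EWMA dampens one-off spikes, it only fires after utilisation stays high across several samples, which is the signal teams want for capacity planning and early intervention.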
The key role of out-of-band
Alongside automation and analytics, maintaining access during disruption is critical. Traditional in-band management depends on the same production network that may fail during outages or become restricted during security incidents. When those paths are unavailable, visibility and control disappear at exactly the wrong moment. Out-of-band (OOB) management addresses this gap by providing an independent control plane that remains available when the primary network is down, unstable, or untrusted.
Independent access is often the difference between a brief disruption and a prolonged outage. It allows teams to remotely diagnose issues, reboot or reconfigure devices, validate system state and service health, and begin recovery without waiting for partial network restoration or on-site intervention. In distributed and edge environments, where travel time alone can add hours to recovery, that capability directly affects availability and operational confidence.
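The core pattern is a management client that prefers the production path but falls back to the independent OOB path when in-band access fails. The sketch below is a minimal illustration of that decision; the addresses, ports, and the injectable probe helper are assumptions for the example, not a vendor API.

```python
import socket

# Hypothetical sketch: prefer in-band management, fall back to an
# out-of-band console-server path (e.g. cellular) when the production
# network is down. Addresses and ports are illustrative assumptions.

IN_BAND = ("10.0.0.1", 22)          # management over the production network
OUT_OF_BAND = ("198.51.100.9", 22)  # independent OOB path

def reachable(addr, timeout=2.0, probe=None):
    """Return True if a TCP connection to addr succeeds within timeout.

    `probe` is injectable so the logic can be exercised without a live
    network; by default it attempts a real TCP connection.
    """
    if probe is None:
        probe = lambda a, t: socket.create_connection(a, timeout=t).close()
    try:
        probe(addr, timeout)
        return True
    except OSError:
        return False

def pick_management_path(probe=None):
    """Prefer the in-band path; fall back to OOB when it is unreachable."""
    if reachable(IN_BAND, probe=probe):
        return "in-band", IN_BAND
    if reachable(OUT_OF_BAND, probe=probe):
        return "out-of-band", OUT_OF_BAND
    raise RuntimeError("no management path available")
```

In practice the OOB leg rides an independent transport, so the fallback succeeds precisely when the production network is the thing that failed.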
OOB management also changes how incidents are handled during periods of pressure. With access preserved, teams can work methodically rather than reactively. If telemetry or logs are routed over the OOB network, teams can see where the issue lies and act on it live, as it unfolds. That reduces the risk of compounding an incident through rushed fixes or incomplete visibility, which is a common reason outages persist.
Security response is another area where independent access proves its value. During a cyber incident, production networks are often segmented or locked down to contain the spread. OOB access ensures operators can still reach critical infrastructure to isolate affected systems, restore trusted configurations, and support remediation without reopening compromised paths. It supports faster containment and reduces reliance on temporary workarounds that introduce fresh exposure.
When combined, automation, predictive analytics, and OOB management create a more resilient operating model. Automation reduces the likelihood of error, analytics provide earlier warning, and independent access preserves control when the primary network or control plane collapses. The outcome is fewer extended outages, faster recovery, and maintained control even when conditions are at their worst.