Critical server outages cost businesses an average of US$300,000 per hour, with many incidents exceeding $5 million. As network demands grow and maximum uptime becomes a necessity, it is crucial to implement the processes and systems that allow organizations to consistently mitigate the threat of outages.
A host of factors can cause network or system downtime, from ISP carrier problems to power cuts and simple human error. Additionally, network infrastructures are becoming more complicated, and as software stacks require more frequent updates, they become more susceptible to increasingly effective cyberattacks and exploits, as well as glitches and bugs.
The ongoing move to virtualization and SD-WAN is also an issue. These solutions enable greater flexibility and more efficient services, reduce costs, and can enable cloud-based control, but they also introduce new points of failure.
What if the SD-WAN overlay goes down at a vulnerable point like the last-mile connection? What if a firmware update goes wrong, or a security breach occurs in a visibility blind spot?
All of this adds up to more opportunities for downtime, which can quickly spell disaster for a brand, hurting revenues and an organization’s ability to provide services. To help avoid and mitigate the impacts of downtime, both now and in the future, here are a few critical tips.
Resilience vs. Redundancy
If you want to develop a business plan to limit the chance of downtime and mitigate the impacts of a problem if it does occur, where should you start? To begin, it is critical to consider outages from two different angles: the operating network and the physical infrastructure supporting it.
For the physical infrastructure, organizations must consider hardware components of the network, such as power and cooling systems. To ensure systems are kept up and running, many large data center environments will have redundant components, such as backup generators, redundant power sources and uninterruptible power supplies.
Redundancy is also important on the IT side, and organizations have many options for building the right ecosystem for their needs. For instance, a company may choose to host and run applications in multiple locations and use virtualization to allow a smooth transfer of load between them.
Additionally, an organization may need to enable the business to migrate to another location, like a second data center, colocation site or hybrid cloud environment, if there is a critical failure.
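To make that idea concrete, here is a minimal sketch of client-side failover between two application locations: the client prefers a primary replica and falls back to a second site when the primary’s health check fails. The endpoint URLs are hypothetical, and a production deployment would more likely rely on health-checked load balancers or DNS failover than on application code like this.

```python
from typing import Optional
import urllib.error
import urllib.request

# Hypothetical application replicas in two locations; a real deployment
# would typically put a load balancer or DNS failover in front of these.
ENDPOINTS = [
    "https://app.primary-dc.example.com/health",
    "https://app.secondary-dc.example.com/health",
]

def first_healthy_endpoint(timeout: float = 3.0) -> Optional[str]:
    """Return the first replica whose health check answers, or None."""
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # Replica unreachable; try the next location.
    return None

if __name__ == "__main__":
    target = first_healthy_endpoint()
    print(f"Routing traffic to {target}" if target else "All replicas are down.")
```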
While a resilient network may contain some redundancy, a redundant system isn’t necessarily resilient enough to ensure business continuity. It is therefore critical for businesses to distinguish between merely implementing network redundancy and having the network resilience to monitor and keep core “backbone” and mission-critical networks up and running, even in today’s complex and challenging virtualized environments.
If there is a primary network failure, or something goes wrong with a piece of hardware that has not been duplicated, the network can still go down.
In many instances, simply adding more routers or switches will not make a network more resilient. Imagine an engineer cutting a cable: the network could go down regardless of how much duplicate equipment is installed.
Additionally, the capital and operations and maintenance (O&M) costs of redundancy outside the data center are often prohibitive, so many businesses choose not to spend sizable sums on backup data connections and equipment that will likely sit idle most of the time.
If an organization truly values maximizing network uptime, it has to go beyond redundant equipment. That is where a strategy for end-to-end resilience is so vital. Resilience is about recovering swiftly so the organization operates normally soon after a network outage, and this can often be achieved by providing an alternative path, such as a cellular network, to devices at remote sites when the primary network is down.
Future-Proofing for the Edge
Historically, many enterprises have focused on shoring up the large data centers or cloud environments at the core of their operations. Times are changing, however, and the need for infrastructure closer to the user, triggered by new data-intensive applications, is challenging traditional cloud computing on performance, data security and operational costs. This is driving many networks to move to the edge for faster delivery, reduced costs and enhanced scalability.
While edge computing provides many benefits, it is also a challenging ecosystem to protect. For instance, the level of resilience and redundancy that organizations and their customers have come to rely on is more difficult to maintain at the edge.
In this environment, network outages may become more prevalent, and it may become harder to recover from them. So what can organizations do to prepare for this and future-proof their network for what’s to come? A first move may be to consider the network infrastructure from a holistic point of view.
To begin building a future-proofed infrastructure, an organization should start by homing in on customer expectations for uptime and resilience. From that baseline, it can decide how to deploy its network, systems and architecture, and what redundancy and resilience to put in place.
Tools Needed to Ensure Resilience
When striving to meet the needs of customers, tools that ensure network resilience will be critical to success. One thing to consider here is that true network resilience cannot be achieved by providing resilience to one single piece of equipment, whether that be a core switch or a router.
Instead, it is important that any resilience solution can plug into all the equipment at an edge site or data center, map what is there, and establish what is online and offline at any time.
One priority must be ensuring a business has visibility and the agility to pivot if problems do occur. Consider a large finance or healthcare enterprise with a network operations center that may require constant uptime for applications and customer service. They may have several branch locations spread across the world with attendant time zone issues.
As a result, they may struggle even to see that an outage has occurred, because they are not proactively notified when something goes offline. Even when they are aware, it may be difficult to pinpoint which piece of equipment at which location has a problem if nobody is on site to look.
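As an illustration, the minimal sketch below polls a hypothetical device inventory over each device’s management port and raises a proactive alert for anything unreachable. The device names, addresses and notification hook are assumptions made for the example, not a reference to any particular vendor’s tooling.

```python
import socket

# Hypothetical inventory: device name -> (management IP, port, site).
INVENTORY = {
    "core-sw-01":  ("10.0.1.10", 22, "London"),
    "edge-rtr-07": ("10.8.3.1",  22, "Singapore"),
    "edge-rtr-12": ("10.9.4.1",  22, "New York"),
}

def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try a TCP connection to the device's management port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def notify_noc(name: str, site: str) -> None:
    # In production this would page the NOC (email, webhook, SNMP trap)
    # rather than relying on someone noticing at 3 a.m. local time.
    print(f"ALERT: {name} at {site} is unreachable; investigate via OOB path.")

def poll_inventory() -> None:
    for name, (host, port, site) in INVENTORY.items():
        online = is_reachable(host, port)
        print(f"{name} ({site}): {'online' if online else 'OFFLINE'}")
        if not online:
            notify_noc(name, site)

if __name__ == "__main__":
    poll_inventory()
```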
Many errors can be resolved with a quick remote reboot. If that does not work, the fault may lie with a failed software update or a corrupted configuration, and that is where the latest smart out-of-band (OOB) management systems come in. An image of the core equipment and its configuration, whether router or switch, can be retained, and the device can quickly be reconfigured remotely, without sending an engineer on site.
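The general shape of such a remote restore is sketched below, assuming the device’s console is reachable over SSH through an OOB console server. The hostname, credentials, CLI commands and the use of the third-party paramiko library are all assumptions; real console servers and device CLIs vary by vendor.

```python
import paramiko  # third-party: pip install paramiko

# Hypothetical console server, reached over the OOB path (e.g., cellular).
CONSOLE_SERVER = "oob-console.branch7.example.com"

def restore_config(saved_config: str, username: str, password: str) -> None:
    """Push a retained 'golden' configuration to a device via its OOB console."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(CONSOLE_SERVER, username=username, password=password)
    try:
        shell = client.invoke_shell()
        # Replay the retained configuration line by line over the console session.
        for line in saved_config.splitlines():
            shell.send((line + "\n").encode())
        shell.send(b"write memory\n")  # persist; the exact command is vendor-specific
    finally:
        client.close()

if __name__ == "__main__":
    with open("golden-config-edge-rtr-07.txt") as f:
        restore_config(f.read(), username="netops", password="CHANGE_ME")
```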
If an outage does occur, failover to cellular can preserve network resilience, keeping the business up and running while the original fault is remotely addressed, even with the primary network down.
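As a rough sketch of what that failover can look like on a single Linux-based edge gateway, the script below probes an external host and swaps the default route to a cellular interface when the probe fails. The interface names and gateway addresses are hypothetical, and a production design would bind the probe to the primary interface and add hysteresis so the route does not flap.

```python
import socket
import subprocess

# Hypothetical Linux gateway: wired uplink on eth0, cellular modem on wwan0.
PRIMARY_GW  = ("192.0.2.1",  "eth0")
CELLULAR_GW = ("100.64.0.1", "wwan0")
PROBE_HOST  = ("8.8.8.8", 53)  # simple reachability probe

def primary_is_up(timeout: float = 3.0) -> bool:
    """Return True if the probe host answers a TCP connection attempt."""
    try:
        with socket.create_connection(PROBE_HOST, timeout=timeout):
            return True
    except OSError:
        return False

def set_default_route(gateway: str, device: str) -> None:
    # Requires root; 'ip route replace' swaps the default route atomically.
    subprocess.run(
        ["ip", "route", "replace", "default", "via", gateway, "dev", device],
        check=True,
    )

if __name__ == "__main__":
    if primary_is_up():
        set_default_route(*PRIMARY_GW)
    else:
        print("Primary uplink down; failing over to cellular.")
        set_default_route(*CELLULAR_GW)
```

Commercial OOB appliances package this logic for you, but the principle is the same: detect the failure, shift traffic to the alternative path, and shift back once the primary recovers.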
Though incorporating extra resilience through OOB costs money, the ROI can exceed the expense. An organization may use this alternative access path only rarely, but when it is needed, it becomes a critical success factor.
It’s also worth noting that resilience is usually much cheaper than buying large amounts of redundant equipment, and this is increasingly true as edge deployments multiply. While it may be feasible to purchase redundancy for a core data center, the same redundancy can’t be built into every data closet or rack at a small remote location.
Beyond ensuring an ironclad backup solution with tools like smart OOB management and failover to cellular, organizations can gain further protection and cost savings by stacking tools like NetOps automation on top of solutions for secure, offsite provisioning. This can eliminate many repetitive tasks, remove the potential for human error, and free up time.
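As a small example of the repetitive work such automation removes, the sketch below renders per-site device configurations from one vetted template instead of hand-typing each one. The site data and configuration syntax are invented for illustration.

```python
from string import Template

# One vetted template, many sites: no hand-typed configs, no copy-paste errors.
CONFIG_TEMPLATE = Template(
    "hostname $hostname\n"
    "interface wan0\n"
    " ip address $wan_ip\n"
    "snmp-server location $site\n"
)

SITES = [
    {"hostname": "edge-rtr-07", "wan_ip": "198.51.100.7/30",  "site": "Singapore"},
    {"hostname": "edge-rtr-12", "wan_ip": "198.51.100.12/30", "site": "New York"},
]

def render_all() -> None:
    for site in SITES:
        config = CONFIG_TEMPLATE.substitute(site)
        # In practice the rendered config would be pushed over the secure
        # provisioning or OOB channel; here it is written to a file for review.
        path = f"{site['hostname']}.cfg"
        with open(path, "w") as f:
            f.write(config)
        print(f"Rendered {path}")

if __name__ == "__main__":
    render_all()
```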
Consider the Customer
Organizations and their leadership should consider the customer experience they are providing at the edge and ensure their systems can deliver it consistently. Otherwise, they risk downtime and subpar service.
When a problem does occur, it is vital for a business to communicate clearly. Comprehensive visibility and resilient failover options play an important role in quickly informing customers about what has happened and how the situation is being rectified.
Unfortunately, network outages are a challenge every organization has to face, and it is difficult to prevent downtime entirely. However, smart tools like OOB management systems, failover to cellular, and NetOps automation can help by providing essential benefits, ranging from resource-efficient remote monitoring and management to continued Internet connectivity when an ISP or physical problem occurs.
Implementing the right processes and systems for network resilience is essential because it lets businesses significantly mitigate the threat of outages. This helps prevent problems like the loss of critical systems or social media blow-ups from dissatisfied customers, which can have a dramatic impact on a business’s bottom line.
Implementing a program for network resilience, therefore, isn’t just a luxury for large corporations; it’s Loss Prevention 101.