EXPERT ADVICE

Ensuring True High Availability in an Online Retail Environment

Unlike most of the world, an online retail business never sleeps, which means that the systems powering an online retailer’s critical operations can never sleep either. If the organization wants to sell products and meet the needs of customers on a 24-by-7 basis, it’s going to need a way to ensure that its infrastructure remains not only online but also operational and accessible.

That “operational and accessible” part is often overlooked. Cloud service providers can offer high availability (HA) configurations with a service level agreement (SLA), guaranteeing that at least one node in a multi-node cluster will be online 99.99% of the time. However, that SLA doesn’t ensure that the applications or data powering an online business will be operational or accessible.

The node can be online, but if that node cannot access the applications or the data supporting the business — because of human error, a compatibility issue, storage that has gone offline, or any of a dozen other reasons — then the business is effectively offline.

Online retailers that want to avoid this fate need to configure their infrastructures to ensure the uninterrupted availability of critical applications and data, and that requires more than a redundant hardware infrastructure.

They need to ensure that their active infrastructure can fail over to a standby infrastructure — located in a separate data center that will not be affected by whatever incident has caused the active infrastructure to go offline — and they need to ensure that the standby infrastructure can access all applications and data.

Building a Failover Infrastructure

At the heart of a true HA solution — defined as one that ensures that your applications and data will be accessible no less than 99.99% of the time — lies a set of server nodes configured in a failover cluster (FC). This can be done whether the infrastructure runs on Windows or Linux, on-premises or in the cloud.

A failover cluster always involves at least two nodes; optimally, each node is located in a physically separate data center for disaster protection. One node might be on-prem and the other in the cloud; both could be in geographically separated on-premises data centers; or both could be in the cloud in different availability zones. Typically, one of the nodes in the FC operates as the primary node, and the other(s) act as secondary or standby nodes.

An FC relies on cluster failover management software that monitors the health of the nodes in the cluster. If the cluster management software detects that the primary node has gone offline, it orchestrates a failover of operations to one of the secondary nodes. That (formerly) secondary node then becomes the primary node actively supporting operations. The cluster management software should also perform related housekeeping tasks, such as updating routing tables, logical names, and the like to ensure that your operations can continue on the new primary infrastructure without interruption.
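
To make those mechanics concrete, here is a minimal Python sketch of that monitor-and-promote loop. It is an illustration only: the node names and the update_client_routing helper are hypothetical stand-ins, and production cluster managers add quorum voting and fencing to guard against split-brain scenarios.

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is presumed down


def update_client_routing(new_primary_name):
    # Stand-in for the housekeeping described above: updating the virtual IP,
    # DNS records, or routing tables so clients reach the new primary.
    print(f"Routing client traffic to {new_primary_name}")


class Node:
    def __init__(self, name, role):
        self.name = name
        self.role = role  # "primary" or "secondary"
        self.last_heartbeat = time.monotonic()

    def is_alive(self):
        return time.monotonic() - self.last_heartbeat < HEARTBEAT_TIMEOUT


def failover(nodes):
    """Promote the first healthy secondary if the primary has gone offline."""
    primary = next(n for n in nodes if n.role == "primary")
    if primary.is_alive():
        return primary  # nothing to do

    for candidate in nodes:
        if candidate.role == "secondary" and candidate.is_alive():
            primary.role = "failed"
            candidate.role = "primary"
            update_client_routing(candidate.name)
            return candidate

    raise RuntimeError("No healthy secondary available for failover")


if __name__ == "__main__":
    cluster = [Node("node-a", "primary"), Node("node-b", "secondary")]
    cluster[0].last_heartbeat -= HEARTBEAT_TIMEOUT + 1  # simulate a dead primary
    print(f"Active primary is now {failover(cluster).name}")
```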

When the former primary node becomes operational again, the cluster management software should automatically recognize it as a secondary node in the cluster that can be called into service in case a second failover should become necessary. However, these features of a failover cluster don’t ensure access to data that had been used by the applications running on the old primary infrastructure.

In traditional brick-and-mortar data centers, all nodes in an FC might have been connected to a shared storage area network (SAN). In the cloud or an on-prem/cloud hybrid environment, you’re more likely to attach local storage to each of the nodes of your FC. The challenge then becomes one of replicating data in real time from storage attached to the primary node to storage connected to the secondary node(s). Then, in the event of a failover, the secondary node can access an identical copy of the data the old primary node had been using.

Application-Centric Data Replication Solutions

There are several ways to meet that challenge. Some well-known database vendors, including Oracle, Microsoft, and SAP, offer services that can automatically replicate database content from one node to another.

In Microsoft SQL Server, for example, you’d configure the databases on each cluster node in an “Availability Group” (AG), and the AG feature in SQL Server would automatically replicate any updates to the database on the primary node to instances of the database sitting on each of the secondary nodes.

If the primary node were to go offline suddenly, the cluster would fail over to a secondary node where all the data in the SQL Server database would already be waiting and ready to go.
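
For administrators who want to confirm that an AG is healthy before trusting it with a failover, SQL Server exposes replica state through standard dynamic management views. Below is a minimal sketch, assuming a reachable AG listener and the pyodbc driver; the server name and authentication details are placeholders.

```python
import pyodbc

# Placeholder connection details; point this at your own AG listener.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=ag-listener.example.com;"
    "DATABASE=master;"
    "Trusted_Connection=yes;"
)

# Standard SQL Server DMVs reporting each replica's role and sync health.
query = """
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
    ON ars.replica_id = ar.replica_id
"""

for server, role, health in conn.cursor().execute(query):
    print(f"{server}: {role}, sync health = {health}")
```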

SAP and Oracle have similar data replication offerings, but all of these services, SQL Server's AG functionality included, share a weakness: they replicate only the data associated with their particular SAP, Oracle, or SQL Server databases. If you have any other critical data residing in storage, that data won't be replicated by these application-specific services.

Also, depending on how many databases you want to replicate — and to how many secondary nodes — you may have to upgrade your database licenses to gain access to the replication services you need.

If you're replicating more than one SQL Server database, or replicating to more than one secondary node, you'll need the Always On AG services bundled into SQL Server Enterprise Edition rather than SQL Server Standard Edition — and that can involve a steep price increase, particularly if you're not using any of the other features that are only available in Enterprise Edition.

Application-Agnostic Data Replication Solutions

Alternatively, you can accomplish the same data replication goals through third-party tools that are fundamentally application agnostic. These tools create what is known as a SANless cluster, and they perform synchronous, block-level data replication from storage on one node to storage on another.

It doesn't matter whether the data is associated with an Oracle database, a SQL Server database, a media file, or a text file. The SANless clustering software pays no attention to the content of a given data block; it simply replicates changed blocks from the primary node's storage to the secondary's.
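
To make that distinction concrete, here is a toy Python sketch of synchronous, block-level replication: the write is acknowledged only after the secondary confirms it holds an identical copy of the block. Real SANless clustering products operate at the storage-driver level rather than on ordinary files; the temporary files below merely stand in for two storage devices.

```python
import hashlib
import tempfile

BLOCK_SIZE = 4096  # bytes per block


class ReplicatedVolume:
    """Toy model of synchronous, block-level replication between two devices."""

    def __init__(self, primary_dev, secondary_dev):
        self.primary = primary_dev
        self.secondary = secondary_dev

    def write_block(self, index, data):
        assert len(data) == BLOCK_SIZE
        offset = index * BLOCK_SIZE

        # The block's content is opaque: a database page, a slice of a media
        # file, or plain text all look identical at this layer.
        self.primary.seek(offset)
        self.primary.write(data)

        self.secondary.seek(offset)
        self.secondary.write(data)
        self.secondary.flush()

        # Synchronous semantics: acknowledge the write only after verifying
        # the secondary holds an identical copy of the block.
        self.secondary.seek(offset)
        mirror = self.secondary.read(BLOCK_SIZE)
        if hashlib.sha256(mirror).digest() != hashlib.sha256(data).digest():
            raise IOError(f"Replication of block {index} failed")


# Demo: temporary files stand in for the primary and secondary storage.
with tempfile.TemporaryFile() as p, tempfile.TemporaryFile() as s:
    volume = ReplicatedVolume(p, s)
    volume.write_block(0, b"x" * BLOCK_SIZE)
    print("Block 0 written and verified on both nodes")
```

The synchronous approach guarantees that the failover target holds an identical copy of every acknowledged write, at the cost of adding replication latency to each write. That trade-off is why synchronous replication is generally reserved for nodes connected by low-latency network links.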

The advantage of a third-party approach is that you can use a SANless clustering solution with any software infrastructure that might be supporting your online retail operations — Microsoft, Oracle, SAP, anyone. Moreover, because the SANless clustering tools are application agnostic, there are no limitations on the number of databases you can replicate or the number of secondary nodes you can copy to.

So, while you'll need to license the SANless clustering software for each of the nodes in your FC, you don't run into the steep price hike you'd encounter going from SQL Server Standard Edition to SQL Server Enterprise Edition just because you want to replicate more than one SQL Server database to your secondary infrastructure.

What is the downside of a third-party approach to ensuring HA for your retail infrastructure? SANless clustering software means engaging yet another vendor and licensing software to provide replication functionality that may already be present in the database software you're using.

SANless clustering software is essentially a set-it-and-forget-it solution from a management standpoint, but it is one more product that your system admins will need to understand. Still, if your need for data replication extends beyond the narrow confines of the replication systems built into the solutions you are already using, the assurance of HA that these third-party products provide is well worth the burden of relying on them to support uninterrupted access to your online retail operations.

Todd Doane

Todd Doane is a Solutions Architect at SIOS Technology with more than two decades of experience designing and implementing HA solutions. He has developed highly resilient reference architectures for telecommunications, financial, and healthcare applications that have stood the test of time, and his work has been implemented by Fortune 500 companies and government entities.
