3 Determining Your High Availability Requirements

This chapter includes the following topics:

Why It Is Important to Determine High Availability Requirements
Analysis Framework for Determining High Availability Requirements
High Availability Architecture Requirements

3.1 Why It Is Important to Determine High Availability Requirements

Because high availability is critical for any modern enterprise, an enterprise that is designing and implementing a high availability strategy must perform a thorough analysis and fully understand the business drivers that require high availability. Implementing high availability may involve critical tasks such as:

Retiring legacy systems
Investing in more sophisticated and robust systems and facilities
Redesigning the overall IT architecture to adapt to this high availability model
Redesigning business processes
Hiring and training personnel

Higher degrees of availability reduce downtime significantly. An analysis of the business requirements for high availability, together with an understanding of the accompanying costs, enables an optimal solution that both meets business managers' needs for a highly available system and stays within the financial and resource limitations of the business. This chapter provides a simple framework that can be used effectively to evaluate the high availability requirements of a business.
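To make the link between availability and downtime concrete, the following is a minimal sketch that converts an availability percentage into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; they are not part of the framework described in this chapter.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored for simplicity


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the downtime per year, in hours, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    # Compare a few commonly quoted availability levels.
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% availability allows about {hours:.2f} hours of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why the cost of reaching the next level should be weighed against the business requirements.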

3.2 Analysis Framework for Determining High Availability Requirements

The elements of this analysis framework are:

Business Impact Analysis
Cost of Downtime
Recovery Time Objective
Recovery Point Objective

3.2.1 Business Impact Analysis

A rigorous business impact analysis identifies the critical business processes within an organization, calculates the quantifiable loss risk for unplanned and planned IT outages affecting each of these business processes, and outlines the impacts of these outages. It takes into consideration essential business functions, people and system resources, government regulations, and internal and external business dependencies. This analysis uses objective and subjective data gathered from interviews with knowledgeable and experienced personnel and from reviews of business practice histories, financial reports, IT systems logs, and so on.

The business impact analysis categorizes the business processes based on the severity of the impact of IT-related outages. For example, consider a semiconductor manufacturer, with chip design centers located worldwide. An internal corporate system providing access to human resources, business expenses and internal procurement is not likely to be considered as mission-critical as the internal e-mail system. Any downtime of the e-mail system is likely to severely affect the collaboration and communication capabilities among the global R&D centers, causing unexpected delays in chip manufacturing, which in turn will have a material financial impact on the company.

In a similar fashion, an internal knowledge management system is likely to be considered mission-critical for a management consulting firm because the business of a client-focused company is based on internal research accessibility for its consultants and knowledge workers. The cost of downtime of such a system is extremely high for this business. This leads us to the next element in the high availability requirements framework: cost of downtime.

3.2.2 Cost of Downtime

A well-implemented business impact analysis provides insights into the costs that result from unplanned and planned downtimes of the IT systems supporting the various business processes. Understanding this cost is essential because it directly influences the high availability technology chosen to minimize the downtime risk.

Various reports have been published, documenting the costs of downtime across industry verticals. These costs range from millions of dollars for each hour of brokerage operations and credit card sales, to tens of thousands of dollars for each hour of package shipping services.
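One simple way to turn business impact analysis data into a comparable figure is to multiply an hourly downtime cost by the expected outage hours for each process. The sketch below assumes that cost scales linearly with outage duration, which is a simplification. The class and function names are hypothetical, and any values supplied to them would come from your own analysis.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ProcessEstimate:
    """Hypothetical per-process inputs taken from a business impact analysis."""
    name: str
    cost_per_hour: float          # estimated loss for each hour of downtime
    outage_hours_per_year: float  # expected planned plus unplanned outage hours


def annual_downtime_cost(estimate: ProcessEstimate) -> float:
    """Estimate yearly downtime cost, assuming cost grows linearly with duration."""
    return estimate.cost_per_hour * estimate.outage_hours_per_year


def rank_by_downtime_cost(estimates: Iterable[ProcessEstimate]) -> List[ProcessEstimate]:
    """Order processes from the most to the least costly when down."""
    return sorted(estimates, key=annual_downtime_cost, reverse=True)
```

Ranking processes this way gives a first-pass view of where high availability investment is likely to matter most.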

While these numbers are staggering, the reasons are quite obvious. The Internet has brought millions of customers directly to the businesses' electronic storefronts. Critical and interdependent business issues such as customer relationships, competitive advantages, legal obligations, industry reputation, and shareholder confidence are even more critical now because of their increased vulnerability to business disruptions.

3.2.3 Recovery Time Objective

A business impact analysis, as well as the calculated cost of downtime, provides insights into the recovery time objective (RTO), an important statistic in business continuity planning. It is defined as the maximum amount of time that an IT-based business process can be down before the organization starts suffering significant material losses. RTO indicates the downtime tolerance of a business process or an organization in general.

The RTO requirements become more stringent as the business becomes more mission-critical: the more critical the process, the lower the RTO. Thus, for a system running a stock exchange, the RTO is zero or very near zero.

An organization is likely to have varying RTO requirements across its various business processes. Thus, for a high volume e-commerce Web site, for which there is an expectation of rapid response times and for which customer switching costs are very low, the Web-based customer interaction system that drives e-commerce sales is likely to have an RTO close to zero. However, the RTO of the systems that support backend operations such as shipping and billing can be higher. If these backend systems are down, then the business may resort to manual operations temporarily without a significantly visible impact.

A systems statistic related to RTO is the network recovery objective (NRO), which indicates the maximum time that network operations can be down for a business. Components of network operations include communication links, routers, name servers, load balancers, and traffic managers. NRO impacts the RTO of the whole organization because individual servers are useless if they cannot be accessed when the network is down.

3.2.4 Recovery Point Objective

Recovery point objective (RPO) is another important statistic for business continuity planning and is calculated through an effective business impact analysis. It is defined as the maximum amount of data an IT-based business process may lose before causing detrimental harm to the organization. RPO indicates the data-loss tolerance of a business process or an organization in general. This data loss is often measured in terms of time, for example, 5 hours' or 2 days' worth of data loss.

A stock exchange where millions of dollars' worth of transactions occur every minute cannot afford to lose any data. Thus, its RPO must be zero. Referring to the e-commerce example, the Web-based sales system does not strictly require an RPO of zero, although a low RPO is essential for customer satisfaction. However, its backend merchandising and inventory update system may have a higher RPO; lost data in this case can be re-entered.
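Because both RTO and RPO are usually expressed as time spans, a candidate protection scheme can be checked against them directly. The following sketch assumes you can estimate a scheme's recovery time and its worst-case data loss window; the function name and all example durations are hypothetical.

```python
from datetime import timedelta


def meets_objectives(
    rto: timedelta,
    rpo: timedelta,
    expected_recovery_time: timedelta,
    worst_case_data_loss: timedelta,
) -> bool:
    """Return True if a protection scheme satisfies both the RTO and the RPO."""
    return expected_recovery_time <= rto and worst_case_data_loss <= rpo


if __name__ == "__main__":
    # Hypothetical scheme: nightly backups restored from tape.
    restore_time = timedelta(hours=8)   # assumed time to rebuild and restore
    loss_window = timedelta(hours=24)   # up to one day of changes since the last backup

    # Hypothetical objectives for a customer-facing system and a backend system.
    print(meets_objectives(timedelta(minutes=5), timedelta(minutes=1), restore_time, loss_window))
    print(meets_objectives(timedelta(days=2), timedelta(days=1), restore_time, loss_window))
```

With these assumed numbers, the nightly backup scheme fails the customer-facing objectives but satisfies the more relaxed backend objectives.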

3.3 High Availability Architecture Requirements

Using the high availability analysis framework, a business can:

Complete a business impact analysis
Identify and categorize the critical business processes that have the high availability requirements
Formulate the cost of downtime
Establish RTO and RPO goals for these various business processes

This enables the business to define service level agreements (SLAs) in terms of high availability for critical aspects of its business. For example, it can categorize its businesses into several high availability tiers:

Tier 1 business processes have maximum business impact. They have the most stringent high availability requirements, with RTO and RPO close to zero, and the systems supporting them need to be available on a continuous basis. For a business with a high-volume e-commerce presence, this may be the Web-based customer interaction system.
Tier 2 business processes can have slightly relaxed high availability and RTO/RPO requirements. For example, the second tier of an e-commerce business may be its supply chain and merchandising systems. These systems do not need to maintain extremely high degrees of availability and may have nonzero RTO/RPO values. Thus, the high availability systems and technologies chosen to support tier 2 processes are likely to be different from those chosen for tier 1 processes.
Tier 3 business processes may be related to internal development and quality assurance processes. Systems supporting these processes need not have the rigorous high availability requirements of the other tiers.
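The tiering described above can be sketched as a simple rule that maps RTO and RPO goals to a tier. The threshold values below are hypothetical and would be set by each business. Using the stricter of the two objectives to decide the tier is also an assumption of this sketch, not a rule stated in this chapter.

```python
from datetime import timedelta

# Hypothetical cut-off values; each business defines its own.
TIER1_LIMIT = timedelta(minutes=5)  # "close to zero" objectives
TIER2_LIMIT = timedelta(hours=4)    # slightly relaxed objectives


def assign_tier(rto: timedelta, rpo: timedelta) -> int:
    """Map RTO and RPO goals to a high availability tier (1 is the most stringent)."""
    strictest = min(rto, rpo)  # assumption: the tighter objective drives the tier
    if strictest <= TIER1_LIMIT:
        return 1
    if strictest <= TIER2_LIMIT:
        return 2
    return 3


if __name__ == "__main__":
    print(assign_tier(timedelta(seconds=30), timedelta(0)))      # customer interaction system
    print(assign_tier(timedelta(hours=2), timedelta(hours=1)))   # supply chain system
    print(assign_tier(timedelta(days=2), timedelta(days=1)))     # development and QA system
```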

The next step for the business is to evaluate the capabilities of the various high availability systems and technologies and choose the ones that meet its SLA requirements, within the guidelines as dictated by business performance issues, budgetary constraints, and anticipated business growth.

Figure 3-1 illustrates this process.

Figure 3-1 Planning and Implementing a Highly Available Enterprise


The following sections provide further details about this methodology:

High Availability Systems Capabilities
Business Performance, Budget and Growth Plans

See Also:
"Choosing the Correct High Availability Architecture"

3.3.1 High Availability Systems Capabilities

A broad range of high availability and business continuity solutions exists today. As the sophistication and scope of these systems increase, they make more of the IT infrastructure, such as the data storage, server, network, applications, and facilities, highly available. They also reduce RTO and RPO from days to hours, or even to minutes and seconds. Increased availability often comes with an increased cost, and on some occasions, with an increased impact on systems performance. Higher availability does not always equate to higher cost, however, and the high availability approach to satisfying business requirements may differ for a legacy system.

Organizations need to carefully analyze the capabilities of these high availability systems and map their capabilities to the business requirements to ensure they have an optimal combination of high availability solutions to keep their business running. Consider the business with a significant e-commerce presence as an example.

For this business, the IT infrastructure supporting the system that customers encounter, the core e-commerce engine, needs to be highly available and disaster-proof. The business may consider clustering for the Web servers, application servers and the database servers serving this e-commerce engine. With built-in redundancy, clustered solutions eliminate single points of failure. Also, modern clustering solutions are application-transparent, provide scalability to accommodate future business growth, and provide load-balancing to handle heavy traffic. Thus, such clustering solutions are ideally suited for mission-critical high-transaction applications.

If unplanned or planned outages occur, the data that supports the high volume e-commerce transactions must be protected adequately and be available with minimal downtime. This data should not only be backed up at regular intervals at the local data centers, but should also be replicated to databases at a remote data center connected over a high-speed, redundant network. This remote data center should be equipped with readily available secondary servers and databases that are synchronized with the primary servers and databases. This gives the business the capability to switch to these servers at a moment's notice with minimal downtime if there is an outage, instead of waiting hours or days to rebuild servers and recover data from backup tapes. Factors to consider when planning a remote data center include the network bandwidth and latency (distance) between sites, as well as usage considerations (such as whether the sites are fully or partially staffed). These factors should be used to determine whether remote data centers are feasible and where they should be located in relation to the primary data center.

Maintaining synchronized remote data centers is an example where redundancy is built along the entire system's infrastructure. This may be expensive; however, the mission-critical nature of the systems and the data they protect may warrant this expense. Considering another aspect of the business, the high availability requirements are less stringent for systems that gather clickstream data and perform data mining. The cost of downtime is low, and the RTO and RPO requirements for this system could be a few days, because even if this system is down and some data is lost, that will not have a detrimental effect on the business. While the business may need powerful machines to perform data mining, it does not need to mirror this data on a real-time basis. Data protection may be obtained by simply performing regularly scheduled backups, and archiving the tapes for offsite storage.

For this e-commerce business, the back-end merchandising and inventory systems are expected to have higher high availability requirements than the data mining systems, and thus they may employ technologies such as local mirroring or local snapshots, in addition to scheduled backups and offsite archiving.

The business should employ a management infrastructure that performs overall systems management, administration and monitoring, and provides an executive dashboard. This management infrastructure should be highly available and fault-tolerant.

Finally, the overall IT infrastructure for this e-commerce business should be extremely secure, to protect against malicious external and internal electronic attacks.

3.3.2 Business Performance, Budget and Growth Plans

High availability solutions must also be chosen keeping in mind business performance issues. For example, a business may use a zero-data-loss solution that synchronously mirrors every transaction on the primary database to a remote database. However, considering the speed-of-light limitations and the physical limitations associated with a network, there will be round-trip delays in the network transmission. This delay increases with distance, and varies based on network bandwidth, traffic congestion, router latencies, and so on. Thus, this synchronous mirroring, if performed over large WAN distances, may impact the primary site performance. Online buyers may notice these system latencies and be frustrated with long system response times; consequently, they may go somewhere else for their purchases. This is an example where the business must make a trade-off between having a zero data loss solution and maximizing system performance.
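The speed-of-light limitation can be estimated with simple arithmetic. The sketch below computes a lower bound on the round-trip delay between two sites, assuming signals travel through optical fiber at roughly two-thirds of the speed of light in a vacuum. That factor, the function names, and the example distances are assumptions for illustration. Real delays will be higher because of routing, congestion, and equipment latencies.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458  # speed of light in a vacuum
FIBER_VELOCITY_FACTOR = 2 / 3          # assumption: typical propagation speed in fiber


def min_round_trip_ms(distance_km: float) -> float:
    """Return a lower bound on the network round-trip time, in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    one_way_seconds = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_seconds * 1000.0


def max_serial_sync_commits_per_second(distance_km: float) -> float:
    """Upper bound on commits per second for one session that waits for each remote acknowledgment."""
    rtt_ms = min_round_trip_ms(distance_km)
    if rtt_ms == 0:
        return float("inf")
    return 1000.0 / rtt_ms


if __name__ == "__main__":
    for km in (10, 100, 1000, 5000):
        print(f"{km} km: at least {min_round_trip_ms(km):.2f} ms per round trip")
```

Under these assumptions, every 1000 km between sites adds about 10 ms to each synchronously mirrored transaction, before any other network overhead is counted.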

High availability solutions must also be chosen keeping in mind financial considerations and future growth estimates. It is tempting to build redundancies throughout the IT infrastructure and claim that the infrastructure is completely failure-proof. Although higher availability does not always equate to higher cost, going overboard with such solutions may not only lead to budget overruns but may also lead to an unmanageable and unscalable combination of solutions that are extremely complex and expensive to integrate and maintain.

A high availability solution that has very impressive performance benchmark results may look good on paper. However, if an investment is made in such a solution without a careful analysis of how the technology capabilities match the business drivers, then a business may end up with a solution that does not integrate well with the rest of the system infrastructure, has annual integration and maintenance costs that easily exceed the upfront license costs, and forces a vendor lock-in. Cost-conscious and business-savvy CIOs must invest only in solutions that are well-integrated, standards-based, easy to implement, maintain and manage, and have a scalable architecture for accommodating future business growth.