Oracle® Application Server High Availability Guide
10g Release 2 (10.1.2) B14003-03
This release of Oracle Application Server extends and improves upon the high availability solutions that were available in earlier releases. New flexible and automated high availability solutions for Oracle Application Server have been tested and are described in this guide. All of these solutions seek to ensure that applications that you deploy on Oracle Application Server meet the required availability to achieve your business goals. The solutions and procedures described in this book seek to eliminate single points of failure of any Oracle Application Server components with no or minimal outage in service.
This chapter explains high availability and its importance from the perspective of Oracle Application Server.
This section provides an overview of high availability from a problem-solution perspective.
Mission critical computer systems need to be available 24 hours a day, 7 days a week, and 365 days a year. However, part or all of the system may be down during planned or unplanned downtime. A system's availability is measured as the percentage of time that it provides service out of the total time since it was deployed. Table 1-1 provides an example.
Table 1-1 Availability percentages and corresponding downtime values
Availability Percentage | Approximate Downtime Per Year |
---|---|
95% | 18 days |
99% | 4 days |
99.9% | 9 hours |
99.99% | 1 hour |
99.999% | 5 minutes |
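The downtime figures in Table 1-1 follow directly from the availability percentage. The sketch below shows the arithmetic, assuming a 365-day year; the function name `downtime_hours_per_year` is illustrative and not part of any Oracle tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (95.0, 99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% -> {hours:.3f} hours "
              f"({hours / 24:.2f} days, {hours * 60:.1f} minutes)")
```

Running the loop gives unrounded values, which the table rounds to the nearest convenient unit.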
Table 1-2 depicts the various types of failures that are possible with a computer system.
Table 1-2 System downtime and failure types
Downtime Type | Failure Type |
---|---|
Unplanned downtime | System failure |
 | Data failure |
 | Disasters |
 | Human error |
Planned downtime | System maintenance (footnote 1) |
 | Data maintenance |
These two types of downtimes (planned and unplanned) are usually considered separately when designing a system's availability requirements. A system's needs may be very restrictive regarding its unplanned downtimes, but very flexible for planned downtimes. This is the typical case for applications with high peak loads during working hours, but that remain practically inactive at night and during weekends.
High availability solutions can be categorized into local high availability solutions that provide high availability in a single data center deployment, and disaster recovery solutions, which are usually geographically distributed deployments that protect your applications from disasters such as floods or regional network outages.
Among the possible types of failures, local high availability solutions can protect against process, node, and media failures as well as human errors. Geographically distributed disaster recovery solutions can protect against local physical disasters.
To solve the high availability problem, a number of technologies and best practices are needed. The most important mechanism is redundancy. High availability comes from redundant systems and components. Local high availability solutions can be categorized, by their level of redundancy, into active-active solutions and active-passive solutions (see Figure 1-1). Active-active solutions deploy two or more active system instances and can be used to improve scalability as well as provide high availability. All instances handle requests concurrently.
Active-passive solutions deploy an active instance that handles requests and a passive instance that is on standby. In addition, a heartbeat mechanism is set up between these two instances. This mechanism is provided and managed through operating system vendor-specific clusterware. Generally, vendor-specific cluster agents are also available to automatically monitor the cluster nodes and fail over between them: when the active instance fails, an agent shuts down the active instance completely and brings up the passive instance, so that application services can resume processing. As a result, the active and passive roles are switched. The same procedure can be performed manually for planned or unplanned downtime. Active-passive solutions are also generally referred to as cold failover clusters.
Figure 1-1 Active-active and active-passive high availability solutions
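The active-passive behavior described above can be sketched as a small simulation. This is a simplified model, not vendor clusterware: the class names `Node` and `ColdFailoverPair`, and the 10-second heartbeat timeout, are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float = 0.0  # time of the most recent heartbeat, in seconds


@dataclass
class ColdFailoverPair:
    active: Node
    passive: Node
    heartbeat_timeout: float = 10.0  # hypothetical threshold, in seconds

    def record_heartbeat(self, now: float) -> None:
        """Note that the active node reported in at time `now`."""
        self.active.last_heartbeat = now

    def failover(self, now: float) -> None:
        """Swap the active and passive roles (also usable for a manual switch)."""
        self.active, self.passive = self.passive, self.active
        self.active.last_heartbeat = now

    def check(self, now: float) -> bool:
        """Fail over if the active node missed its heartbeat window.

        Returns True if a failover took place.
        """
        if now - self.active.last_heartbeat <= self.heartbeat_timeout:
            return False
        self.failover(now)
        return True
```

For example, with `pair = ColdFailoverPair(Node("node1"), Node("node2"))`, calling `pair.check(now)` periodically plays the part of the cluster agent, while calling `pair.failover(now)` directly corresponds to the manual procedure.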
In addition to architectural redundancies, the following local high availability technologies are also necessary in a comprehensive high availability system:
Process death detection and automatic restart
Processes may die unexpectedly due to configuration or software problems. A proper process monitoring and restart system should monitor all system processes constantly and restart them should problems appear.
A process monitoring system should also track the number of restarts within a specified time interval, because continually restarting a process within short time periods may lead to additional faults or failures. Therefore, the design should also set a maximum number of restarts or retries within a specified time interval.
Clustering
Clustering components of a system together allows the components to be viewed functionally as a single entity from the perspective of a client for runtime processing and manageability. A cluster is a set of processes running on single or multiple computers that share the same workload. There is a close correlation between clustering and redundancy. A cluster provides redundancy for a system.
Configuration management
A clustered group of similar components often needs to share a common configuration. Proper configuration management ensures that components provide the same reply to the same incoming request, allows these components to synchronize their configurations, and provides highly available configuration management for less administration downtime.
State replication and routing
For stateful applications, client state can be replicated to enable stateful failover of requests in the event that processes servicing these requests fail.
Server load balancing and failover
When multiple instances of identical server components are available, client requests to these components can be load balanced to ensure that the instances have roughly the same workload. With a load balancing mechanism in place, the instances are redundant. If any of the instances fail, requests to the failed instance can be sent to the surviving instances.
Backup and recovery
User errors may cause a system to malfunction. In certain circumstances, a component or system failure may not be repairable. A backup and recovery facility should be available to back up the system at certain intervals and restore a backup when an unrepairable failure occurs.
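The restart limit described under process death detection and automatic restart can be sketched as a sliding-window counter. The class name `RestartLimiter` and its parameters are hypothetical; this is not how any particular Oracle component implements the policy.

```python
from collections import deque


class RestartLimiter:
    """Allow at most `max_restarts` restarts within any `window_seconds` interval."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restart_times = deque()  # timestamps of recent restarts, oldest first

    def allow_restart(self, now: float) -> bool:
        """Return True and record the restart if the limit is not yet reached."""
        # Forget restarts that have fallen out of the time window.
        while self._restart_times and now - self._restart_times[0] >= self.window_seconds:
            self._restart_times.popleft()
        if len(self._restart_times) >= self.max_restarts:
            return False  # too many recent restarts; leave the process down
        self._restart_times.append(now)
        return True
```

A monitor would call `allow_restart` each time it detects a dead process and restart the process only when the call returns True, escalating to an administrator otherwise.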
Disaster recovery solutions typically set up two homogeneous sites, one active and one passive. Each site is a self-contained system. The active site is generally called the production site, and the passive site is called the standby site. During normal operation, the production site services requests; in the event of a site failover or switchover, the standby site takes over the production role and all requests are routed to that site. To maintain the standby site for failover, not only must the standby site contain homogeneous installations and applications, data and configurations must also be synchronized constantly from the production site to the standby site.
Figure 1-2 Geographically distributed disaster recovery
An overview of high availability for Oracle Application Server is presented in the following sections:
Section 1.2.2, "Oracle Application Server Base Architecture"
Section 1.2.3, "Oracle Application Server High Availability Architectures"
Section 1.2.4, "Choosing the Best High Availability Architecture"
The definitions of terms below are useful in helping to understand the concepts presented in this book:
active-active: In a high availability system, the equivalent members of that system can be servicing requests concurrently. Under normal operation where none of the members have failed, all equivalent members are active and none are on standby. This is called an active-active system.
active-passive: In a high availability system, some members of the system can be actively servicing requests and performing work, while other members can be inactive. These inactive members are said to be passive. They are not activated until one or more of the active nodes have failed. Consumers of services provided by the system may or may not notice the failure. An active-active system generally provides more transparency and options for scalability to consumers than an active-passive system.
failover: When a member of a highly available system fails unexpectedly (unplanned downtime), in order to continue offering services to its consumers, the system undergoes a failover operation. If the system is an active-passive system, the passive member is activated during the failover operation and consumers are directed to it instead of the failed member. The failover process can be performed manually, or it can be automated by setting up hardware cluster services to detect failures and move cluster resources from the failed node to the standby node. If the system is an active-active system, the failover is performed by the load balancer entity serving requests to the active members. If an active member fails, the load balancer detects the failure and automatically redirects requests for the failed member to the surviving active members.
failback: After a system undergoes a successful failover operation, the original failed member can be repaired over time and be re-introduced into the system as a standby member. If desired, a failback process can be initiated to activate this member and deactivate the other. This process reverts the system back to its pre-failure configuration.
hardware cluster: A hardware cluster is a collection of computers that provides a single view of network services (for example: an IP address) or application services (for example: databases, Web servers) to clients of these services. Each node in a hardware cluster is a standalone server that runs its own processes. These processes can communicate with one another to form what looks like a single system that cooperatively provides applications, system resources, and data to users.
A hardware cluster achieves high availability and scalability through the use of specialized hardware (cluster interconnect, shared storage) and software (health monitors, resource monitors). (The cluster interconnect is a private link used by the hardware cluster for heartbeat information to detect node death.) Due to the need for specialized hardware and software, hardware clusters are commonly provided by hardware vendors such as SUN, HP, IBM, and Dell. While the number of nodes that can be configured in a hardware cluster is vendor dependent, for the purpose of Oracle Application Server high availability, only two nodes are required. Hence, this document assumes a two-node hardware cluster for high availability solutions employing a hardware cluster.
cluster agent: The software that runs on a node member of a hardware cluster that coordinates availability and performance operations with other nodes. Clusterware provides resource grouping, monitoring, and the ability to move services. A cluster agent can automate the service failover.
clusterware: Software that manages the operations of the members of a cluster as a system. It allows one to define a set of resources and services to monitor via a heartbeat mechanism between cluster members and to move these resources and services to a different member in the cluster as efficiently and transparently as possible.
shared storage: Even though each hardware cluster node is a standalone server that runs its own set of processes, the storage subsystem required for any cluster-aware purpose is usually shared. Shared storage refers to the ability of the cluster to access the same storage, usually disks, from both nodes. While the nodes have equal access to the storage, only one node, the primary node, has active access to the storage at any given time. The hardware cluster's software grants the secondary node access to this storage if the primary node fails. For the OracleAS Infrastructure in the OracleAS Cold Failover Cluster environment, its ORACLE_HOME is on such a shared storage file system. This file system is mounted by the primary node; if that node fails, the secondary node takes over and mounts the file system. In some cases, the primary node may relinquish control of the shared storage, such as when the hardware cluster's software deems the Infrastructure unusable from the primary node and decides to move it to the secondary node.
primary node: The node that is actively executing one or more Infrastructure installations at any given time. If this node fails, the Infrastructure is failed over to the secondary node. Since the primary node runs the active Infrastructure installation(s), it is considered the "hot" node. See the definition for "secondary node" in this section.
secondary node: This is the node that takes over the execution of the Infrastructure if the primary node fails. Since the secondary node does not originally run the Infrastructure, it is considered the "cold" node. And, because the application fails from a hot node (primary) to a cold node (secondary), this type of failover is called cold failover. See the definition for "primary node" in this section.
network hostname: Network hostname is a name assigned to an IP address either through the /etc/hosts file (in UNIX), the C:\WINDOWS\system32\drivers\etc\hosts file (in Windows), or through DNS resolution. This name is visible in the network to which the machine it refers to is connected. Often, the network hostname and physical hostname are identical. However, each machine has only one physical hostname but may have multiple network hostnames. Thus, a machine's network hostname may not always be its physical hostname.
physical hostname: This guide differentiates between the terms physical hostname and network hostname. This guide uses physical hostname to refer to the "internal name" of the current machine. In UNIX, this is the name returned by the hostname command.
Physical hostname is used by Oracle Application Server middle-tier installation types to reference the local host. During installation, the installer automatically retrieves the physical hostname from the current machine and stores it in the Oracle Application Server configuration metadata on disk.
switchover: During normal operation, active members of a system may require maintenance or upgrading. A switchover process can be initiated to allow a substitute member to take over the workload performed by the member that requires maintenance or upgrading, which undergoes planned downtime. The switchover operation ensures continued service to consumers of the system.
switchback: When a switchover operation is performed, a member of the system is deactivated for maintenance or upgrading. When the maintenance or upgrading is completed, the system can undergo a switchback operation to activate the upgraded member and bring the system back to the pre-switchover configuration.
virtual hostname: Virtual hostname is a network addressable hostname that maps to one or more physical machines via a load balancer or a hardware cluster. For load balancers, the name "virtual server name" is used interchangeably with virtual hostname in this book. A load balancer can hold a virtual hostname on behalf of a set of servers, and clients communicate indirectly with the machines using the virtual hostname. A virtual hostname in a hardware cluster is a network hostname assigned to a cluster virtual IP. Because the cluster virtual IP is not permanently attached to any particular node of a cluster, the virtual hostname is not permanently attached to any particular node either.
Note: Whenever the phrase "virtual hostname" is used in this document, it is assumed to be associated with a virtual IP address. In cases where just the IP address is needed or used, it will be explicitly stated.
virtual IP: Also, cluster virtual IP and load balancer virtual IP. Generally, a virtual IP can be assigned to a hardware cluster or load balancer. To present a single system view of a cluster to network clients, a virtual IP serves as an entry point IP address to the group of servers which are members of the cluster.
A hardware cluster uses a cluster virtual IP to present to the outside world the entry point into the cluster (it can also be set up on a standalone machine). The hardware cluster's software manages the movement of this IP address between the two physical nodes of the cluster while clients connect to this IP address without the need to know which physical node this IP address is currently active on. In a typical two-node hardware cluster configuration, each machine has its own physical IP address and physical hostname, while there could be several cluster IP addresses. These cluster IP addresses float or migrate between the two nodes. The node with current ownership of a cluster IP address is active for that address.
A load balancer also uses a virtual IP as the entry point to a set of servers. These servers tend to be active at the same time. This virtual IP address is not assigned to any individual server but to the load balancer which acts as a proxy between servers and their clients.
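The load balancer role described in the failover and virtual IP definitions above can be sketched as a round-robin dispatcher that skips failed members. The class name `RoundRobinBalancer` and the server names are hypothetical; real load balancers detect failures through their own health checks, which this sketch replaces with explicit `mark_failed` calls.

```python
class RoundRobinBalancer:
    """Distribute requests across active members, skipping any that have failed."""

    def __init__(self, servers) -> None:
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._next = 0  # index of the next server to try

    def mark_failed(self, server: str) -> None:
        self._healthy.discard(server)

    def mark_recovered(self, server: str) -> None:
        if server in self._servers:
            self._healthy.add(server)

    def route(self) -> str:
        """Return the server that should handle the next request."""
        # Try each server at most once, starting from the rotation position.
        for _ in range(len(self._servers)):
            server = self._servers[self._next]
            self._next = (self._next + 1) % len(self._servers)
            if server in self._healthy:
                return server
        raise RuntimeError("no surviving instances")
```

Clients only ever address the balancer (the virtual IP), so a failed member is bypassed without any client-side change.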
The first thing to understand for high availability is the system's base architecture. Then, to make this system highly available, examine every component and connection path between components and make each one of them highly available. This produces a highly available architecture by essentially adding redundancy to the base architecture.
Figure 1-3 illustrates the base architecture of Oracle Application Server.
Figure 1-3 Oracle Application Server base architecture
At a high level, Oracle Application Server consists of the Oracle Application Server middle-tier business applications, Oracle Identity Management, and OracleAS Metadata Repository. The latter two are part of the OracleAS Infrastructure.
Oracle Identity Management software manages user authentication, authorization, and identity information. Functionally, its main components are:
OracleAS Single Sign-On
Oracle Delegated Administration Services
Oracle Internet Directory
Oracle Directory Integration and Provisioning
Architecturally, Oracle Identity Management can be broken down into a Web server tier of Oracle HTTP Server, an OracleAS Single Sign-On/Oracle Delegated Administration Services middle-tier composed of an Oracle Application Server Containers for J2EE (OC4J) instance for these security applications, and an Oracle Internet Directory/Oracle Directory Integration and Provisioning tier at the back end. The OracleAS Metadata Repository is an Oracle database that manages configuration, management, and product metadata for components throughout the OracleAS Infrastructure and OracleAS middle-tier.
The middle tier hosts most of Oracle Application Server business applications, such as:
Oracle Application Server Portal
Oracle Application Server Wireless
Oracle Application Server Integration
These applications rely on Oracle Identity Management and OracleAS Metadata Repository for security and metadata support. The middle tier also includes a Web caching sub-tier (Oracle Application Server Web Cache), a Web server sub-tier (Oracle HTTP Server), and OC4J instance(s). Behind the middle tier, the OracleAS Metadata Repository serves as the data tier. In actual deployments, other databases may also exist in the data tier (for example, a customer database for OC4J applications deployed on the middle tier).
Figure 1-4 shows the various sub-tiers that are traversed by client requests to the Oracle Application Server business applications and the Oracle Application Server Infrastructure services. An overall view of Infrastructure services is provided in Figure 1-5. These services include Oracle Identity Management, metadata repository, and LDAP services.
The base architecture supports many availability features, such as automatic process monitoring and restart and application server backup and recovery. However, it does not provide complete high availability, because several single points of failure exist. To eliminate them, redundancy has to be provided for each component. This can be achieved by extending the base architecture with additional high availability architectures.
Figure 1-4 Sub-tiers of the base architecture of Oracle Application Server
Figure 1-5 Overview of Infrastructure Services
Oracle Application Server provides both local high availability and disaster recovery solutions for maximum protection against any kind of failure with flexible installation, deployment, and security options. The redundancy of Oracle Application Server local high availability and disaster recovery originates from its redundant high availability architectures.
At a high level, Oracle Application Server local high availability architectures include several active-active and active-passive architectures for the OracleAS middle-tier and the OracleAS Infrastructure. Although both types of solutions provide high availability, active-active solutions generally offer higher scalability and faster failover, but they also tend to be more expensive. Within either the active-active or the active-passive category, multiple solutions exist that differ in ease of installation, cost, scalability, and security.
Building on top of the local high availability solutions is the Oracle Application Server Disaster Recovery solution, Oracle Application Server Guard. This unique solution combines the proven Oracle Data Guard technology in the Oracle Database with advanced disaster recovery technologies in the application realm to create a comprehensive disaster recovery solution for the entire application system. This solution requires homogeneous production and standby sites, but other Oracle Application Server instances can be installed in either site as long as they do not interfere with the instances in the disaster recovery setup. Configurations and data must be synchronized regularly between the two sites to maintain homogeneity.
There is no single best high availability solution for all systems in the world, but there may be a best solution for your system. Perhaps the most important decision in designing a highly available system is choosing the most appropriate high availability architecture or type of redundancy based on service level requirements as needed by a business or application. Understanding the availability requirements of the business is critical since cost is also associated with the different levels of high availability.
Oracle Application Server offers many high availability solutions to meet service level requirements. The most comprehensive solution may not necessarily be the best for your application. To choose the correct high availability architecture, first ensure that you understand your business's service level requirements.
The high level questions to determine your high availability architectures are:
Local high availability: does your production system need to be available 24 hours per day, 7 days per week, and 365 days per year?
Scalability: is the scalability of multiple active Oracle Application Server instances required?
Site-to-site disaster recovery: is this required?
Based on the answers to these questions, you need to make your selection in two dimensions:
Instance redundancy: base, active-active, or active-passive.
Site-to-site disaster recovery-enabled architecture: yes or no.
Table 1-3 shows the architecture choices based on business requirements.
Table 1-3 Service level requirements and architecture choices
Requirement: Local High Availability | Requirement: Scalability | Requirement: Disaster Recovery | Choice: Instance Redundancy | Choice: Disaster Recovery |
---|---|---|---|---|
N | N | N | Base | N |
Y | N | N | Active-passive | N |
N | Y | N | Active-active | N |
N | N | Y | Base | Y |
Y | Y | N | Active-active | N |
Y | N | Y | Active-passive | Y |
N | Y | Y | Active-active (middle tier), Base (Infrastructure) (footnote 1) | Y |
Y | Y | Y | Active-active (middle tier), Active-passive and active-active (Infrastructure) (footnote 1) | Y |
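The selection in Table 1-3 amounts to a lookup on the three yes/no requirements. The sketch below transcribes the table rows into a dictionary; the names `ARCHITECTURE_CHOICES` and `choose_architecture` are illustrative only.

```python
# Key: (local high availability, scalability, disaster recovery) requirements.
# Value: (instance redundancy, disaster recovery-enabled architecture).
# The entries are transcribed from Table 1-3.
ARCHITECTURE_CHOICES = {
    (False, False, False): ("Base", False),
    (True, False, False): ("Active-passive", False),
    (False, True, False): ("Active-active", False),
    (False, False, True): ("Base", True),
    (True, True, False): ("Active-active", False),
    (True, False, True): ("Active-passive", True),
    (False, True, True): (
        "Active-active (middle tier), Base (Infrastructure)", True),
    (True, True, True): (
        "Active-active (middle tier), "
        "Active-passive and active-active (Infrastructure)", True),
}


def choose_architecture(local_ha: bool, scalability: bool,
                        disaster_recovery: bool) -> tuple:
    """Return (instance redundancy, disaster recovery enabled) for the requirements."""
    return ARCHITECTURE_CHOICES[(local_ha, scalability, disaster_recovery)]
```

For example, `choose_architecture(True, False, True)` returns `("Active-passive", True)`, matching the sixth row of the table.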
Although you can choose different high availability architectures for your OracleAS middle-tier and OracleAS Infrastructure, their local high availability and disaster recovery requirements should be identical. Scalability requirements should be evaluated separately for OracleAS middle-tier and OracleAS Infrastructure. The latter does not usually need to be as scalable as the middle tier because it handles fewer identity management requests.
Because of the differences in scalability requirements, deployment choices for the OracleAS middle-tier and the OracleAS Infrastructure may differ in architecture. For example, if your deployment requires local high availability, site-to-site disaster recovery, scalable middle tier but basic OracleAS Infrastructure scalability, you can choose an active-active middle tier, an active-passive OracleAS Infrastructure, and deploy a standby disaster recovery site that mirrors all middle-tier and OracleAS Infrastructure configuration in the production site.
The following table provides a list of cross-references to high availability information in other documents in the Oracle library. This information mostly pertains to high availability of various Oracle Application Server components.
Table 1-4 Cross-references to high availability information in Oracle documentation
Component | Location of Information |
---|---|
Overall high availability concepts | In the high availability chapter of Oracle Application Server Concepts. |
Oracle installer | In the chapter for installing in a high availability environment in Oracle Application Server Installation Guide. |
Oracle Application Server Backup and Recovery Tool | In the backup and restore part of Oracle Application Server Administrator's Guide. |
Oracle Application Server Web Cache | Oracle Application Server Web Cache Administrator's Guide |
Identity Management service replication | In "Advanced Configurations" chapter of Oracle Application Server Single Sign-On Administrator's Guide. |
Identity Management high availability deployment | In "Oracle Identity Management Deployment Planning" chapter of Oracle Identity Management Concepts and Deployment Planning Guide. |
Database high availability | Oracle High Availability Architecture and Best Practices |
Distributed Configuration Management commands | Distributed Configuration Management Administrator's Guide |
Oracle Process Manager and Notification Server commands | Oracle Process Manager and Notification Server Administrator's Guide |
OC4J high availability | Oracle Application Server Containers for J2EE Services Guide; Oracle Application Server Containers for J2EE User's Guide; Oracle Application Server Containers for J2EE Enterprise JavaBeans Developer's Guide |
Java Object Cache | Oracle Application Server Web Services Developer's Guide |
Load balancing to OC4J processes | Oracle HTTP Server Administrator's Guide |
Oracle Application Server Wireless high availability | Oracle Application Server Wireless Administrator's Guide |
Oracle Business Intelligence Discoverer high availability | Oracle Business Intelligence Discoverer Configuration Guide |
OracleAS Forms Services | Oracle Application Server Forms Services Deployment Guide |
OracleAS Reports Services | Oracle Application Server Reports Services Publishing Reports to the Web |
Oracle Application Server Integration InterConnect ini file information | Oracle Application Server Integration InterConnect User's Guide |
In addition, references to these and other documentation are noted in the text of this guide, where applicable.