Oracle® Enterprise Manager Advanced Configuration
10g Release 2 (10.2)
B16242-01
 

9 Sizing and Maximizing the Performance of Oracle Enterprise Manager

Oracle Enterprise Manager 10g Grid Control has the ability to scale for hundreds of users and thousands of systems and services on a single Enterprise Manager implementation.

This chapter describes techniques for achieving optimal performance using the Oracle Enterprise Manager application. It can also help you with capacity planning, sizing, and maximizing Enterprise Manager performance in a large-scale environment. By performing routine housekeeping and monitoring performance regularly, you ensure that you have the data required to make accurate forecasts of future sizing requirements. Establishing good baseline values for the Enterprise Manager Grid Control vital signs, and setting reasonable warning and critical thresholds on those baselines, allows Enterprise Manager to monitor itself for you.

This chapter also provides practical approaches to backup, recovery, and disaster recovery topics while addressing different strategies when practical for each tier of Enterprise Manager.

This chapter contains the following sections:

9.1 Oracle Enterprise Manager Grid Control Architecture Overview

The architecture for Oracle Enterprise Manager 10g Grid Control exemplifies two key concepts in application performance tuning: distribution and parallelization of processing. Each component of Grid Control can be configured to apply both these concepts.

The components of Enterprise Manager Grid Control include:

Figure 9-1 Overview of Enterprise Manager Architecture Components


For more information about the Grid Control architecture, see the Oracle Enterprise Manager 10g documentation:

The Oracle Enterprise Manager 10g documentation is available at the following location on the Oracle Technology Network (OTN):

http://otn.oracle.com/documentation/oem.html

9.2 Enterprise Manager Grid Control Sizing and Performance Methodology

An accurate predictor of capacity at scale is the actual metric trend information from each individual Enterprise Manager Grid Control deployment. This information, combined with an established, rough, starting host system size and iterative tuning and maintenance, produces the most effective means of predicting capacity for your Enterprise Manager Grid Control deployment. It also assists in keeping your deployment performing at an optimal level.

Here are the steps to follow to enact the Enterprise Manager Grid Control sizing methodology:

  1. If you have not already installed Enterprise Manager Grid Control 10g, choose a rough starting host configuration as listed in Table 9-1.

  2. Periodically evaluate your site's vital signs (detailed later).

  3. Eliminate bottlenecks using routine DBA/Enterprise Manager administration housekeeping.

  4. Eliminate bottlenecks using tuning.

  5. Extrapolate linearly into the future to plan for future sizing requirements.

Step one need only be done once for a given deployment. Steps two, three, and four must be done on a regular basis for the life of the deployment, regardless of whether you plan to grow your Enterprise Manager Grid Control site. These steps are essential to an efficient Enterprise Manager Grid Control site regardless of its size or workload. You must complete steps two, three, and four before you continue on to step five. This is critical, because extrapolating from a site that still has bottlenecks projects those bottlenecks into the forecast. Step five is only required if you intend to grow the deployment size in terms of monitored targets. However, evaluating these trends regularly can be helpful in evaluating any other changes to the deployment.
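Step five can be illustrated with a minimal sketch of a linear extrapolation. The weekly repository size samples below are hypothetical, and the least-squares fit is only one reasonable way to "extrapolate linearly"; the same approach could be applied to any vital sign you track over time.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def extrapolate(xs, ys, future_x):
    """Project the fitted line out to future_x."""
    slope, intercept = fit_line(xs, ys)
    return slope * future_x + intercept


# Hypothetical samples: total Management Repository size (GB) per week.
weeks = [0, 1, 2, 3, 4]
size_gb = [50.0, 51.5, 53.0, 54.5, 56.0]

projected = extrapolate(weeks, size_gb, 26)
print(f"Projected repository size at week 26: {projected:.1f} GB")
```

A projection like this is only meaningful once steps two through four have been completed, since it assumes the observed trend reflects a healthy site.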

9.2.1 Step 1: Choosing a Starting Platform Grid Control Deployment

If you have not yet installed Enterprise Manager Grid Control on an initial platform, this step helps you choose a rough approximation based on experiences with real world Enterprise Manager Grid Control deployments. If you have already installed Enterprise Manager Grid Control, proceed to Step 2. Three typical deployment sizes are defined: small, medium, and large. The number and type of systems (or targets) it monitors largely defines the size of an Enterprise Manager Grid Control deployment.

Table 9-1 Management Server

Deployment Size                    Hosts   CPUs/Host   Memory/Host (GB)
Small (100 monitored targets)      1       1 (3 GHz)   2
Medium (1,000 monitored targets)   1       2 (3 GHz)   2
Large (10,000 monitored targets)   2       2 (3 GHz)   2


Table 9-2 Management Repository

Deployment Size   Hosts                                CPUs/Host                            Memory/Host (GB)
Small             Shares host with Management Server   Shares host with Management Server   Shares host with Management Server
Medium            1                                    2                                    4
Large             2                                    4                                    6


Table 9-3 Total Management Repository Storage (GB)

Deployment Size   Total Management Repository Storage (GB)
Small             10
Medium            30
Large             100


The previous tables show the estimated minimum hardware requirements for each deployment size. Management Servers running on more than one host, as portrayed in the large deployment above, will divide work amongst themselves.
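As a rough illustration, the starting configurations in Tables 9-1 through 9-3 can be expressed as a small lookup. Treating each tier's monitored target count as an upper bound is an assumption made for this sketch; the tables themselves only describe three typical sizes. The function and key names are hypothetical.

```python
# Values transcribed from Tables 9-1, 9-2, and 9-3.
# repo_* of None means the repository shares the Management Server host.
STARTING_CONFIGS = [
    {"size": "Small", "max_targets": 100,
     "ms_hosts": 1, "ms_cpus_per_host": 1, "ms_mem_gb": 2,
     "repo_hosts": None, "repo_cpus_per_host": None, "repo_mem_gb": None,
     "repo_storage_gb": 10},
    {"size": "Medium", "max_targets": 1_000,
     "ms_hosts": 1, "ms_cpus_per_host": 2, "ms_mem_gb": 2,
     "repo_hosts": 1, "repo_cpus_per_host": 2, "repo_mem_gb": 4,
     "repo_storage_gb": 30},
    {"size": "Large", "max_targets": 10_000,
     "ms_hosts": 2, "ms_cpus_per_host": 2, "ms_mem_gb": 2,
     "repo_hosts": 2, "repo_cpus_per_host": 4, "repo_mem_gb": 6,
     "repo_storage_gb": 100},
]


def starting_config(monitored_targets):
    """Return the smallest tier whose target count covers the deployment.

    Assumption: tier target counts are treated as upper bounds.
    """
    if monitored_targets < 0:
        raise ValueError("monitored_targets must be non-negative")
    for config in STARTING_CONFIGS:
        if monitored_targets <= config["max_targets"]:
            return config
    raise ValueError("beyond the largest deployment size in the tables")


choice = starting_config(750)
print(choice["size"], choice["repo_storage_gb"], "GB repository storage")
```

The result is only a rough starting point; the vital signs described in Step 2 are what determine whether the configuration is adequate in practice.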

Deploying multiple Management Servers also provides basic fail-over capabilities, with the remaining servers continuing to operate in the event of the failure of one. Use of a Server Load Balancer, or SLB, provides transparent failover for Enterprise Manager UI clients in the event of a Management Server host failure, and it also balances the request load between the available Management Servers. SLBs are host machines dedicated for load balancing purposes. SLBs can be clustered to provide fail-over capability.

Using multiple hosts for the Management Repository assumes the use of Oracle Real Application Clusters (RAC). Doing so allows the same Oracle database to be accessible on more than one host system. Beyond the storage required for the Management Server, Management Repository storage may also be required. Management Server storage is less impacted by the number of management targets. The numbers suggested in the Enterprise Manager Grid Control documentation should be sufficient in this regard.

9.2.1.1 Network Topology Considerations

A critical consideration when deploying Enterprise Manager Grid Control is network performance between tiers. Enterprise Manager Grid Control ensures tolerance of network glitches, failures, and outages between application tiers through error tolerance and recovery. The Management Agent in particular is able to handle a less performant or reliable network link to the Management Service without severe impact to the performance of Enterprise Manager as a whole. The scope of the impact, as far as a single Management Agent's data being delayed due to network issues, is not likely to be noticed at the Enterprise Manager Grid Control system wide level.

The impact of slightly higher network latencies between the Management Service and Management Repository will be substantial, however. Implementations of Enterprise Manager Grid Control have experienced significant performance issues when the network link between the Management Service and Management Repository is not of sufficient quality. The following diagram displays the Enterprise Manager components and the performance requirements of the network links that connect them. These are minimum requirements based on larger real world Enterprise Manager Grid Control deployments and testing.

Figure 9-2 Network Links Related to Enterprise Manager Components


You can see in Figure 9-2 that the bandwidth and latency minimum requirements of network links between Enterprise Manager Grid Control components greatly impact the performance of the Enterprise Manager application.

9.2.2 Step 2: Periodically Evaluate the Vital Signs of Your Site

This is the most important step of the five. Without some degree of monitoring and understanding of trends or dramatic changes in the vital signs of your Enterprise Manager Grid Control site, you are placing site performance at serious risk. Every monitored target sends data to the Management Repository for loading and aggregation through its associated Management Agent. This adds up to a considerable volume of activity that requires the same level of management and maintenance as any other enterprise application.

Enterprise Manager has "vital signs" that reflect its health. These vital signs should be monitored for trends over time as well as against established baseline thresholds. You must establish realistic baselines for the vital signs when performance is acceptable. Once baselines are established, you can use built-in Oracle Enterprise Manager Grid Control functionality to set baseline warning and critical thresholds. This allows you to be notified automatically when something significant changes on your Enterprise Manager site. The following table is a point-in-time snapshot of the Enterprise Manager Grid Control vital signs for two sites:



                                                 EM Site 1           EM Site 2
Site URL                                         emsite1.acme.com    emsite2.acme.com

Target Counts
  Database Targets                               192 (45 not up)     1218 (634 not up)
  Host Targets                                   833 (12 not up)     1042 (236 not up)
  Total Targets                                  2580 (306 not up)   12293 (6668 not up)

Loader Statistics
  Loader Threads                                 6                   16
  Total Rows/Hour                                1,692,000           2,736,000
  Rows/Hour/Loader Thread                        282,000             171,000
  Rows/Second/Loader Thread                      475                 187
  Percent of Hour Run                            15                  44

Rollup Statistics
  Rows per Second                                2,267               417
  Percent of Hour Run                            5                   19

Job Statistics
  Job Dispatchers                                2                   4
  Job Steps/Second/Dispatcher                    32                  10

Notification Statistics
  Notifications per Second                       8                   1
  Percent of Hour Run                            1                   13

Alert Statistics
  Alerts per Hour                                536                 1,100

Management Service Host Statistics
  Average % CPU (Host 1)                         9 (emhost01)        13 (emhost01)
  Average % CPU (Host 2)                         6 (emhost02)        17 (emhost02)
  Average % CPU (Host 3)                         N/A                 38 (em6003)
  Average % CPU (Host 4)                         N/A                 12 (em6004)
  Number of CPUs per Host                        2 X 2.8 (Xeon)      4 X 2.4 (Xeon)
  Memory per Host (GB)                           6                   6

Management Repository Host Statistics
  Average % CPU (Host 1)                         12 (db01rac)        32 (em6001rac)
  Average % CPU (Host 2)
  Average % CPU (Host 3)
  Average % CPU (Host 4)
  Number of CPUs per Host
  Buffer Cache Size (MB)
  Memory per Host (GB)                           6                   12
  Total Management Repository Size (GB)          56                  98
  RAC Interconnect Traffic (MB/s)                1                   4
  Management Server Traffic (MB/s)               4                   4
  Total Management Repository I/O (MB/s)         6                   27

Enterprise Manager UI Page Response (seconds)
  Home Page                                      3                   6
  All Host Page                                  3                   30+
  All Database Page                              6                   30+
  Database Home Page                             2                   2
  Host Home Page                                 2                   2

The two Enterprise Manager sites are at the opposite ends of the scale for performance.

EM Site 1 is performing very well, with high loader rows/second/thread and high rollup rows/second. It also has a very low percent of hour run for both the loader and the rollup. The CPU utilization on both the Management Server and Management Repository hosts is low. Most importantly, the UI page response times are excellent. To summarize, Site 1 is doing substantial work with minimal effort. This is how a well configured, tuned, and maintained Oracle Enterprise Manager Grid Control site should look.

Conversely, EM Site 2 is having difficulty. The loader and rollup are working hard and not moving many rows. Worst of all are the UI page response times. There is clearly a bottleneck on Site 2, possibly more than one.

These vital signs are all available from within the Enterprise Manager interface. Most values can be found on the All Metrics page for each host, or the All Metrics page for Management Server. Keeping an eye on the trends over time for these vital signs, in addition to assigning thresholds for warning and critical alerts, allows you to maintain good performance and anticipate future resource needs. You should plan to monitor these vital signs as follows:

  • Take a baseline measurement of the vital sign values seen in the previous table when the Enterprise Manager Grid Control site is running well.

  • Set reasonable thresholds and notifications based on these baseline values so you can be notified automatically if they deviate substantially. This may require some iteration to fine-tune the thresholds for your site. Receiving too many notifications is not useful.

  • On a daily (or weekly at a minimum) basis, watch for trends in the 7-day graphs for these values. This will not only help you spot impending trouble, but it will also allow you to plan for future resource needs.
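The baseline-and-threshold approach in the list above can be sketched as follows. The "mean plus a multiple of the standard deviation" rule, the multipliers, and the sample values are all assumptions chosen for illustration; Enterprise Manager does not prescribe this formula, and it only suits vital signs where a higher value is worse (for example, loader percent of hour run).

```python
from statistics import mean, stdev


def thresholds_from_baseline(samples, warn_sigmas=2.0, crit_sigmas=3.0):
    """Derive warning/critical thresholds for a 'higher is worse' vital sign."""
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    baseline = mean(samples)
    spread = stdev(samples)
    return {
        "baseline": baseline,
        "warning": baseline + warn_sigmas * spread,
        "critical": baseline + crit_sigmas * spread,
    }


def classify(value, thresholds):
    """Map a new observation to clear, warning, or critical."""
    if value >= thresholds["critical"]:
        return "critical"
    if value >= thresholds["warning"]:
        return "warning"
    return "clear"


# Hypothetical baseline: loader percent of hour run sampled while
# the site was known to be running well.
baseline_samples = [14, 15, 16, 15, 15]
limits = thresholds_from_baseline(baseline_samples)
print(classify(15, limits), classify(44, limits))
```

If a rule like this produces too many notifications, widen the multipliers; that is the kind of iteration the second bullet describes.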

The next step provides some guidance on what to do when the vital sign values are not within established thresholds. It also explains how to maintain your site's performance through routine housekeeping.

9.2.3 Step 3: Use DBA and Enterprise Manager Tasks To Eliminate Bottlenecks Through Housekeeping

It is critical to note that routine housekeeping helps keep your Enterprise Manager Grid Control site running well. The following are lists of housekeeping tasks and the interval on which they should be done.

9.2.3.1 Online Weekly Tasks

  • Check the system error page and resolve the causes of all errors. Some may be related to product bugs, but resolve as many as you can. Look for applicable patches if you suspect a bug. Clear the error table from the Enterprise Manager interface when you are done or when you have resolved all that you can.

  • Check the alerts and errors for any metric collection errors. Most of these will be due to configuration issues at the target being monitored. Resolve these errors by fixing the reported problem. The error should then clear automatically.

  • Try to resolve any open alerts in the system. Also, if there are severities that are frequently oscillating between clear and warning or critical, try adjusting the threshold to stop frequent warning and critical alert conditions. Frequent alert oscillation can add significant load at the Management Server. Adjusting the threshold to a more reasonable level will help Enterprise Manager to work more efficiently for you. Adjusting the threshold for an alert may be the only way to close the alert. This is perfectly acceptable in cases where the tolerances are too tight for a metric.

  • Watch for monitored targets that are always listed with a down status. Try to get them up and working again, or remove them from Oracle Enterprise Manager.

  • Watch the Alert Log error metric for the Management Repository database for critical (ORA-0600, for example) errors. Resolve these as soon as possible. A search on Metalink using the error details almost always will reveal some clues to its cause and provide available patches.

  • Analyze the three major tables in the Management Repository: MGMT_METRICS_RAW, MGMT_METRICS_1HOUR, and MGMT_METRICS_1DAY. If your Management Repository is in an Oracle 10g database, then these tables are automatically analyzed weekly and you can skip this task. If your Management Repository is in an Oracle version 9 database, then you will need to ensure that the following commands are run weekly:

    • exec dbms_stats.gather_table_stats('SYSMAN', 'MGMT_METRICS_RAW', null, .000001, false, 'for all indexed columns', null, 'global', true, null, null, null);

    • exec dbms_stats.gather_table_stats('SYSMAN', 'MGMT_METRICS_1HOUR', null, .000001, false, 'for all indexed columns', null, 'global', true, null, null, null);

    • exec dbms_stats.gather_table_stats('SYSMAN', 'MGMT_METRICS_1DAY', null, .000001, false, 'for all indexed columns', null, 'global', true, null, null, null);

9.2.3.2 Offline Monthly Tasks

  • Drop old partitions. Oracle Enterprise Manager automatically truncates the data and reclaims the space used by partitions older than the default retention times for each table. Due to a database bug, however, Enterprise Manager cannot drop partitions while the Management Service is running, because many SQL cursors will be invalidated incorrectly, leading to some strange errors in the Enterprise Manager interface. The following command must be run with all Management Servers down:

    • exec emd_maintenance.partition_maintenance;

  • Rebuild and defragment indexes and reorganize tables as required. You may not actually need to rebuild any indexes or tables on a monthly basis. All you should do monthly is evaluate the Management Repository for tables and indexes that have grown significantly and been purged back down to a fraction of their allocated size. Unnecessarily rebuilding tables and indexes causes the Management Repository to work harder than necessary to reallocate needed space. The tables that require reorganization are easily identifiable. These tables will be large in allocated size but will hold a relatively small number of rows, that is, a small actual size. In a real Management Repository, you may see one table that is approximately 800 MB in size but only contains 6 rows. If the table is this badly oversized, it requires reorganization. Tables can be reorganized and rebuilt using a command similar to the following example:

    • exec dbms_redefinition.start_redef_table('SYSMAN','MGMT_SEVERITY');

    This command rebuilds the table and returns its physical structure to its clean initial state. The 800 MB table is an extreme case. Smaller disparities between actual size and row count may also indicate the need for reorganization. The Management Server(s) must be down when reorganizing a table. If you reorganize the table, the indexes must also be rebuilt. This helps make index range scans more efficient again. Indexes can be reorganized using a command similar to the following example:

    • ALTER INDEX SEVERITY_PRIMARY_KEY REBUILD;

    There are a few tables (along with their indexes) that may require rebuilding more frequently than others based on the higher volume of inserts and deletes they typically see. These tables are:

    • MGMT_SEVERITY

    • MGMT_CURRENT_SEVERITY

    • MGMT_SYSTEM_ERROR_LOG

    • MGMT_SYSTEM_PERFORMANCE_LOG

    • MGMT_METRIC_ERRORS

    • MGMT_CURRENT_METRIC_ERRORS

    • MGMT_STRING_METRIC_HISTORY

    • MGMT_JOB_OUTPUT

    These are a sampling of tables that may require more DBA attention than others, but all non-IOT Enterprise Manager tables and indexes should be evaluated monthly for defragmentation and rebuild needs. The following query gives a rough idea of the tables that may require rebuild, reorganization, or both:

    SELECT UT.TABLE_NAME,
           ROUND(UT.NUM_ROWS * UT.AVG_ROW_LEN / 1024 / 1024, 2) "CALCULATED SIZE MB",
           ROUND(US.BYTES / 1024 / 1024, 2) "ALLOCATED SIZE MB",
           ROUND(US.BYTES / (UT.NUM_ROWS * UT.AVG_ROW_LEN), 2) "TIMES LARGER"
      FROM USER_TABLES UT, USER_SEGMENTS US
     WHERE (UT.NUM_ROWS > 0 AND UT.AVG_ROW_LEN > 0 AND US.BYTES > 0)
       AND UT.PARTITIONED = 'NO'
       AND UT.IOT_TYPE IS NULL
       AND UT.IOT_NAME IS NULL
       AND UT.TABLE_NAME = US.SEGMENT_NAME
       AND ROUND(US.BYTES / 1024 / 1024, 2) > 5
       AND ROUND(US.BYTES / 1024 / 1024, 2) > (ROUND(UT.NUM_ROWS * UT.AVG_ROW_LEN / 1024 / 1024, 2) * 2)
     ORDER BY 4 DESC;

    Sample query output:

    Table Name                     Calculated Size MB   Allocated Size MB   Times Larger
    MGMT_JOB_OUTPUT                2.25                 440                 195.57
    MGMT_FLAT_TARGET_MEMBERSHIPS   2.33                 160                 68.67
    MGMT_ANNOTATIONS               1.21                 21                  17.3
    MGMT_SQL_SUMMARY               6.06                 80                  13.2
    MGMT_JOB_EXECUTION             1.03                 12                  11.6
    MGMT_SYSTEM_ERROR_LOG          5.6                  61                  10.88
    MGMT_JOB_HISTORY               1.11                 10                  9.04
    MGMT_NOTIFICATION_LOG          1.59                 14                  8.78


    Note:

    This query calculates the actual size of a table based on the number of rows and average row size. It compares this actual size to the currently allocated size of the table. The final column shows how many times larger the allocated size is than the calculated actual size. Use this as a guide to determine which tables and indexes should be rebuilt. Tables that are many times larger than the actual size should be rebuilt, along with their indexes, using the commands mentioned previously.
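    The arithmetic behind the query can be reproduced outside the database as a minimal sketch. The function name and the input figures below are hypothetical (modeled loosely on the 800 MB, 6-row table mentioned earlier); the 5 MB floor and the factor of 2 mirror the query's WHERE clause.

```python
MB = 1024 * 1024


def bloat_report(num_rows, avg_row_len, allocated_bytes):
    """Mirror the query's columns for one table.

    Returns None when the query's own filter (all inputs > 0) would
    exclude the table.
    """
    if num_rows <= 0 or avg_row_len <= 0 or allocated_bytes <= 0:
        return None
    calculated_mb = round(num_rows * avg_row_len / MB, 2)
    allocated_mb = round(allocated_bytes / MB, 2)
    times_larger = round(allocated_bytes / (num_rows * avg_row_len), 2)
    # Same conditions as the query: over 5 MB allocated and more than
    # twice the calculated size.
    candidate = allocated_mb > 5 and allocated_mb > calculated_mb * 2
    return {
        "calculated_mb": calculated_mb,
        "allocated_mb": allocated_mb,
        "times_larger": times_larger,
        "rebuild_candidate": candidate,
    }


# Hypothetical table: 6 rows of about 120 bytes each in an 800 MB segment.
report = bloat_report(num_rows=6, avg_row_len=120, allocated_bytes=800 * MB)
print(report)
```

    Because NUM_ROWS and AVG_ROW_LEN come from optimizer statistics, the result is only as current as the last statistics gathering on the table.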

    Good housekeeping will prevent many bottlenecks from occurring on your Enterprise Manager Grid Control site, but there may be times when you should investigate performance problems on your site that are not related to housekeeping. This is where the Enterprise Manager vital signs become important.

9.2.4 Step 4: Eliminate Bottlenecks Through Tuning

The most common causes of performance bottlenecks in the Enterprise Manager Grid Control application are listed below (in order of most to least common):

  1. Housekeeping that is not being done (far and away the biggest source of performance problems)

  2. Hardware or software that is incorrectly configured

  3. Hardware resource exhaustion

When the vital signs are routinely outside of an established threshold, or are trending that way over time, you must address two areas. First, you must ensure that all previously listed housekeeping is up to date. Secondly, you must address resource utilization of the Enterprise Manager Grid Control application. The vital signs listed in the previous table reflect key points of resource utilization and throughput in Enterprise Manager Grid Control. The following sections cover some of the key vital signs along with possible options for dealing with vital signs that have crossed thresholds established from baseline values.

9.2.4.1 High CPU Utilization

When you are asked to evaluate a site for performance and notice high CPU utilization, there are a few common steps you should follow to determine what resources are being used and where.

  1. The Management Server is typically a very minimal consumer of CPU. High CPU utilization in Enterprise Manager Grid Control almost always manifests as a symptom at the Management Repository.

  2. Use the Processes display on the Enterprise Manager Host home page to determine which processes are consuming the most CPU on any Management Service or Management Repository host that has crossed a CPU threshold.

  3. Once you have established that Enterprise Manager is consuming the most CPU, use Enterprise Manager to identify what activity is the highest CPU consumer. Typically this manifests itself on a Management Repository host where most of the Management Service's work is performed. It is very rare that the Management Service itself is the source of the bottleneck. Here are a few typical spots to investigate when the Management Repository appears to be using too many resources.

    1. Click on the CPU Used database resource listed on the Management Repository's Database Performance page to examine the SQL that is using the most CPU at the Management Repository.

    2. Check the Database Locks on the Management Repository's Database Performance page looking for any contention issues.

High CPU utilization is probably the most common symptom of any performance bottleneck. Typically, the Management Repository is the biggest consumer of CPU, which is where you should focus. A properly configured and maintained Management Repository host system that is not otherwise hardware resource constrained should average roughly 40 percent or less total CPU utilization. A Management Server host system should average roughly 20 percent or less total CPU utilization. These relatively low average values should allow sufficient headroom for spikes in activity. Allowing for activity spikes helps keep your page performance more consistent over time. If your Enterprise Manager Grid Control site interface pages happen to be responding well (approximately 3 seconds) while there is no significant (constant) loader backlog, and it is using more CPU than recommended, you may not have to address it unless you are concerned it is part of a larger upward trend.
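The CPU guidelines above (roughly 40 percent for a Management Repository host and 20 percent for a Management Server host) can be checked against observed averages with a short sketch. The host names and percentages are the EM Site 2 values from the vital signs table; the function name is hypothetical, and the guidelines are approximate rather than hard limits.

```python
# Approximate average CPU guidelines stated in this section (percent).
REPOSITORY_CPU_GUIDELINE = 40
MANAGEMENT_SERVER_CPU_GUIDELINE = 20


def hosts_over_guideline(avg_cpu_by_host, guideline_pct):
    """Return the hosts whose average CPU exceeds the guideline, sorted by name."""
    return sorted(
        host for host, pct in avg_cpu_by_host.items() if pct > guideline_pct
    )


# EM Site 2 averages from the vital signs table.
site2_management_servers = {
    "emhost01": 13, "emhost02": 17, "em6003": 38, "em6004": 12,
}
site2_repository = {"em6001rac": 32}

print(hosts_over_guideline(site2_management_servers,
                           MANAGEMENT_SERVER_CPU_GUIDELINE))
print(hosts_over_guideline(site2_repository, REPOSITORY_CPU_GUIDELINE))
```

A host that appears in the result is a starting point for the investigation steps above, not proof of a problem on its own.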

The recommended path for tracking down the root cause of high Management Repository CPU utilization is captured under step 3 above. This allows you to start at the Management Repository Performance page and work your way down to the SQL that is consuming the most CPU in its processing. This approach has been used very successfully on several real world sites.

If you are running Enterprise Manager on Intel-based hosts, the Enterprise Manager Grid Control Management Service and Management Repository will both benefit from Hyper-Threading (HT) being enabled on the host or hosts on which they are deployed. HT is a feature of certain late-model Intel processors that allows some CPU instructions to execute in parallel. This gives the appearance of double the number of CPUs physically available on the system. Testing has shown that HT provides approximately 1.5 times the CPU processing power of the same system without HT enabled. This can significantly improve system performance. The Management Service and Management Repository both frequently have more than one process executing simultaneously, so they can benefit greatly from HT.

9.2.4.2 Loader Vital Signs

The vital signs for the loader indicate exactly how much data is continuously coming into the system from all the Enterprise Manager Agents. The most important items here are the percent of hour run and rows/second/thread. The (Loader) % of hour run indicates whether the configured loader threads are able to keep pace with the incoming data volume. As this value approaches 100%, it becomes apparent that the loading process is failing to keep pace with the incoming data volume. The lower this value, the more efficiently your loader is running and the fewer resources it requires from the Management Service host. Adding more loader threads to your Management Server can help reduce the percent of hour run for the loader.

Rows/Second/Thread is a precise measure of each loader thread's throughput per second. The higher this number, the better. Rows/Second/Thread values as high as 1200 have been observed on some smaller, well configured and maintained Enterprise Manager Grid Control sites. If you have not increased the number of loader threads and this number is trending down, it may indicate a problem later. One way to overcome a decreasing rows/second/thread is to add more loader threads.

The number of Loader Threads is always set to one by default in the Management Server configuration file. Each Management Server can have a maximum of 10 loader threads. Adding loader threads to a Management Server typically increases the overall host CPU utilization by 2% to 5% on an Enterprise Manager Grid Control site with many Management Agents configured. Customers can change this value as their site requires. Most medium size and smaller configurations will never need more than one loader thread. Here is a simple guideline for adding loader threads:

Max total (across all Management Servers) loader threads = 2 X number of Management Repository host CPUs

There is a diminishing return when adding loader threads. You will not yield 100% capacity from the second, or higher, thread. There should be a positive benefit, however. As you add loader threads, you should see rows/second/thread decrease, but total rows/hour throughput should increase. If you are not seeing significant improvement in total rows/hour, and there is a constantly growing loader file backlog, it may not be worth the cost of the increase in loader threads. You should explore other tuning or housekeeping opportunities in this case.
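The guideline and the diminishing-return check can be put into a short sketch. The function names are hypothetical, and reading "number of Management Repository host CPUs" as the total CPU count across all repository hosts is an assumption. The throughput figures are the loader values from the vital signs table.

```python
MAX_THREADS_PER_MANAGEMENT_SERVER = 10  # per-server limit stated above


def max_total_loader_threads(repository_host_cpus):
    """Guideline: 2 x number of Management Repository host CPUs."""
    return 2 * repository_host_cpus


def plan_fits_guideline(threads_per_server, repository_host_cpus):
    """Check a per-server thread plan against both limits."""
    within_server_limit = all(
        1 <= n <= MAX_THREADS_PER_MANAGEMENT_SERVER for n in threads_per_server
    )
    total = sum(threads_per_server)
    return within_server_limit and total <= max_total_loader_threads(
        repository_host_cpus
    )


def rows_per_hour_per_thread(total_rows_per_hour, loader_threads):
    """Average hourly throughput contributed by each loader thread."""
    return total_rows_per_hour / loader_threads


# Assumed example: two Management Servers, 4 repository CPUs in total.
print(max_total_loader_threads(4))
print(plan_fits_guideline([4, 4], repository_host_cpus=4))
print(plan_fits_guideline([6, 4], repository_host_cpus=4))

# Loader figures from the vital signs table (Site 1, then Site 2).
print(rows_per_hour_per_thread(1_692_000, 6))
print(rows_per_hour_per_thread(2_736_000, 16))
```

Comparing total rows/hour before and after adding a thread, rather than rows/hour/thread alone, is what shows whether the extra thread was worth its CPU cost.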

To add more loader threads you can change the following configuration parameter:

em.loader.threadPoolSize=n

Where 'n' is a positive integer [1-10]. The default is one and any value other than [1-10] will result in the thread pool size defaulting to one. This property file is located in the {ORACLE_HOME}/sysman/config directory. Changing this parameter will require a restart of the Management Service to be reloaded with the new value.
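The fallback rule described above can be modeled with a small sketch, which may be useful for validating a value before editing the property file. This is only a model of the documented behavior, not Enterprise Manager's own parsing code, and the function name is hypothetical.

```python
import re


def effective_thread_pool_size(raw_value):
    """Model the documented rule: integers 1-10 are used, anything else becomes 1."""
    text = raw_value.strip()
    if re.fullmatch(r"[0-9]+", text) is None:
        return 1  # not a plain positive integer
    n = int(text)
    return n if 1 <= n <= 10 else 1


for candidate in ["4", "10", "0", "11", "four"]:
    print(candidate, "->", effective_thread_pool_size(candidate))
```

Because an out-of-range value silently falls back to one thread, it is worth confirming the loader thread count in the vital signs after the Management Service restarts.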

9.2.4.3 Rollup Vital Signs

The rollup process is the aggregation mechanism for Enterprise Manager Grid Control. Once an hour, it processes all the new raw data loaded into the Management Repository table MGMT_METRICS_RAW, calculates averages and stores them in the tables MGMT_METRICS_1HOUR and MGMT_METRICS_1DAY. The two vital signs for the rollup are the rows/second and % of hour run. Due to the large volume of data rows processed by the rollup, it tends to be the largest consumer of Management Repository buffer cache space. Because of this, the rollup vital signs can be great indicators of the benefit of increasing buffer cache size.

Rollup rows/second shows exactly how many rows are being processed, or aggregated and stored, every second. This value is usually around 2,000 (+/- 500) rows per second on a site with a decent size buffer cache and reasonably speedy I/O. A downward trend over time for this value may indicate a future problem, but as long as % of hour run is under 100 your site is probably fine.

If rollup % of hour run is trending up (or is higher than your baseline), and you have not yet set the Management Repository buffer cache to its maximum, it may be advantageous to increase the buffer cache setting. Usually, if there is going to be a benefit from increasing buffer cache, you will see an overall improvement in resource utilization and throughput on the Management Repository host. The loader statistics will appear a little better. CPU utilization on the host will be reduced and I/O will decrease. The most telling improvement will be in the rollup statistics. There should be a noticeable improvement in both rollup rows/second and % of hour run. If you do not see any improvement in any of these vital signs, you can revert the buffer cache to its previous size. The old Buffer Cache Hit Ratio metric can be misleading. It has been observed in testing that Buffer Cache Hit Ratio will appear high when the buffer cache is significantly undersized and Enterprise Manager Grid Control performance is struggling because of it. There will be times when increasing buffer cache will not help improve performance for Grid Control. This is typically due to resource constraints or contention elsewhere in the application. Consider using the steps listed in the High CPU Utilization section to identify the point of contention. Grid Control also provides advice on buffer cache sizing from the database itself. This is available on the database Memory Parameters page.

One important thing to note when considering increasing buffer cache is that there may be operating system mechanisms that can help improve Enterprise Manager Grid Control performance. One example of this is the "large memory" option available on Red Hat Linux. The Linux OS Red Hat Advanced Server™ 2.1 (RHAS) has a feature called big pages. In RHAS 2.1, bigpages is a boot up parameter that can be used to pre-allocate large shared memory segments. Use of this feature, in conjunction with a large Management Repository SGA, can significantly improve overall Grid Control application performance. Starting in Red Hat Enterprise Linux™ 3, big pages functionality is replaced with a new feature called huge pages, which no longer requires a boot-up parameter.

9.2.4.4 Job, Notification, and Alert Vital Signs

Jobs, notifications, and alerts are indicators of the processing efficiency of the Management Service(s) on your Enterprise Manager Grid Control site. Any negative trends in these values are usually a symptom of contention elsewhere in the application. The best use of these values is to measure the benefit of running with more than one Management Server. There is one job dispatcher in each Management Server. Adding Management Servers will not always improve these values. In general, adding Management Servers will improve overall throughput for Grid Control when the application is not otherwise experiencing resource contention issues. Job, Notification, and Alert vital signs can help measure that improvement.

9.2.4.5 I/O Vital Signs

Monitoring the I/O throughput of the different channels in your Enterprise Manager Grid Control deployment is essential to ensuring good performance. At minimum, there are three different I/O channels on which you should have a baseline and alert thresholds defined:

  • Disk I/O from the Management Repository instance to its data files

  • Network I/O between the Management Server and Management Repository

  • RAC interconnect (network) I/O (on RAC systems only)

You should understand the potential peak and sustained throughput I/O capabilities for each of these channels. Based on these and the baseline values you establish, you can derive reasonable thresholds for warning and critical alerts on them in Grid Control. You will then be notified automatically if you approach these thresholds on your site. Some Grid Control site administrators can be unaware or mistaken about what these I/O channels can handle on their sites. This can lead to Enterprise Manager Grid Control saturating these channels, which in turn cripples performance on the site. In such an unfortunate situation, you would see that many vital signs would be impacted negatively.
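One way to turn a baseline and a known channel capacity into alert thresholds is sketched below. This is only an illustration: the function name and the choice of placing the warning and critical thresholds at 50% and 80% of the headroom between baseline and sustained capacity are assumptions, not values recommended by Grid Control. Choose fractions that suit your site.

```python
def derive_thresholds(baseline, sustained_capacity,
                      warning_fraction=0.5, critical_fraction=0.8):
    """Return (warning, critical) thresholds for one I/O channel.

    baseline           -- normal observed throughput (for example, MB/s)
    sustained_capacity -- throughput the channel can sustain, same unit
    The two fractions are illustrative assumptions: they say how much of
    the headroom above the baseline may be used before each alert fires.
    """
    if not 0 <= baseline < sustained_capacity:
        raise ValueError("baseline must be below the sustained capacity")
    if not 0 < warning_fraction < critical_fraction <= 1:
        raise ValueError("need 0 < warning_fraction < critical_fraction <= 1")

    headroom = sustained_capacity - baseline
    warning = baseline + warning_fraction * headroom
    critical = baseline + critical_fraction * headroom
    return warning, critical
```

For example, a channel with a baseline of 40 MB/s and a sustained capacity of 100 MB/s would get a warning threshold of about 70 MB/s and a critical threshold of about 88 MB/s under these assumed fractions.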

To discover whether the Management Repository is involved, you can use Grid Control to check the Database Performance page. On the Performance page for the Management Repository, click on the wait graph showing the largest amount of time spent. From this you can continue to drill down into the actual SQL code or sessions that are waiting. This should help you to understand where the bottleneck is originating.

Another area to check is unexpected I/O load from non-Enterprise Manager Grid Control sources like backups, another application, or a possible data-mining co-worker who engages in complex SQL queries, multiple Cartesian products, and so on.

Total Repository I/O trouble can be caused by two factors. The first is a lack of regular housekeeping. Some of the Grid Control segments can be very badly fragmented causing a severe I/O drain. Second, there can be some poorly tuned SQL statements consuming much of the site I/O bandwidth. These two main contributors can cause most of the Grid Control vital signs to plummet. In addition, the lax housekeeping can cause the Management Repository's allocated size to increase dramatically.

One important feature of which to take advantage is asynchronous I/O. Enabling asynchronous I/O can dramatically improve overall performance of the Grid Control application. The Sun Solaris™ and Linux operating systems have this capability, but it may be disabled by default. The Microsoft Windows™ operating system uses asynchronous I/O by default. Oracle strongly recommends enabling this operating system feature on the Management Repository hosts and on Management Service hosts as well.

9.2.4.6 The Oracle Enterprise Manager Performance Page

There may be occasions when Enterprise Manager user interface pages are slow in the absence of any other performance degradation. The typical cause for these slowdowns will be an area of Enterprise Manager housekeeping that has been overlooked. The first line of monitoring for Enterprise Manager page performance is the use of Enterprise Manager Beacons. These functionalities are also useful for web applications other than Enterprise Manager.

Beacons are designed to be lightweight page performance monitoring targets. After defining a Beacon target on a Management Agent, you can then define UI performance transactions using the Beacon. These transactions are a series of UI page hits that you will manually walk through once. Thereafter, the Beacon will automatically repeat your UI transaction on a specified interval. Each time the Beacon transaction is run, Enterprise Manager will calculate its performance and store it for historical purposes. In addition, alerts can be generated when page performance degrades below thresholds you specify.

When you configure the Enterprise Manager Beacon, you begin with a single predefined transaction that monitors the home page you specify during this process. You can then add as many transactions as are appropriate. You can also set up additional Beacons from different points on your network against the same web application to measure the impact of WAN latency on application performance. This same functionality is available for all Web applications monitored by Enterprise Manager Grid Control.

After you are alerted to a UI page that is performing poorly, you can then use the second line of page performance monitoring in Enterprise Manager Grid Control. This new end-to-end (or E2E) monitoring functionality in Grid Control is designed to allow you to break down processing time of a page into its basic parts. This will allow you to pinpoint when maintenance may be required to enhance page performance. E2E monitoring in Grid Control lets you break down both the client side processing and the server side processing of a single page hit.

The next page down in the Middle Tier Performance section will break out the processing time by tier for the page. By clicking on the largest slice of the Processing Time Breakdown pie chart (JDBC time in this example), you can get the SQL details. By clicking on the SQL statement, you break out the performance of its execution over time.

The JDBC page displays the SQL calls the system is spending most of its page time executing. This SQL call could be an individual DML statement or a PL/SQL procedure call. In the case of an individual SQL statement, you should examine the segments (tables and their indexes) accessed by the statement to determine their housekeeping (rebuild and reorg) needs. The PL/SQL procedure case is slightly more involved because you must look at the procedure's source code in the Management Repository to identify the tables and associated indexes accessed by the call.

Once you have identified the segments, you can then run the necessary rebuild and reorganization statements for them with the Management Server down. This should dramatically improve page performance. There are cases where page performance will not be helped by rebuild and reorganization alone, such as when excessive numbers of open alerts, system errors, and metric errors exist. The only way to improve these calls is to address (for example, clean up or remove) the numbers of these issues. After these numbers are reduced, then the segment rebuild and reorganization should be completed to optimize performance. These scenarios are covered in Section 9.2.3. If you stay current, you should not need to analyze UI page performance as often, if at all.

9.2.5 Step 5: Extrapolating Linearly Into the Future for Sizing Requirements

Determining future storage requirements is an excellent example of effectively using vital sign trends. You can use two built-in Grid Control charts to forecast this: the total number of targets over time and the Management Repository size over time.

Both of the graphs are available on the All Metrics page for the Management Service. It should be obvious that there is a correlation between the two graphs. A straight line applied to both curves would reveal a fairly similar growth rate. After a target is added to Enterprise Manager Grid Control for monitoring, there is a 31-day period where Management Repository growth will be seen because most of the data that will consume Management Repository space for a target requires approximately 31 days to be fully represented in the Management Repository. A small amount of growth will continue for that target for the next year because that is the longest default data retention time at the highest level of data aggregation. This should be negligible compared with the growth over the first 31 days.

When you stop adding targets, the graphs will level off in about 31 days. When the graphs level off, you should see a correlation between the number of targets added and the amount of additional space used in the Management Repository. Tracking these values from early on in your Enterprise Manager Grid Control deployment process helps you to manage your site's storage capacity proactively. This history is an invaluable tool.

The same type of correlation can be made between CPU utilization and total targets to determine those requirements. There is a more immediate leveling off of CPU utilization as targets are added. There should be no significant increase in CPU cost over time after adding the targets beyond the relatively immediate increase. Introducing new monitoring to existing targets, whether new metrics or increased collections, would most likely lead to increased CPU utilization.
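The straight-line extrapolation described in this section can be sketched as a least-squares fit of Management Repository size against target count. The function names and the sample figures in the usage note are hypothetical. The sketch assumes each sample was taken after the graphs had leveled off (about 31 days after the last targets were added), so that the repository size reflects the full cost of those targets.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_repository_gb(target_counts, repository_gb, planned_targets):
    """Extrapolate repository size (GB) for a planned number of targets.

    target_counts / repository_gb are paired historical samples read from
    the two charts; planned_targets is the target count to size for.
    """
    slope, intercept = fit_line(target_counts, repository_gb)
    return slope * planned_targets + intercept
```

With hypothetical samples of 100, 200, 300, and 400 targets occupying 12, 22, 32, and 42 GB, the fitted growth rate is about 0.1 GB per target, and `forecast_repository_gb` projects roughly 102 GB for 1000 targets. The same fit can be applied to CPU utilization against total targets.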

9.3 Oracle Enterprise Manager Backup, Recovery, and Disaster Recovery Considerations

The newest release of Oracle Enterprise Manager Grid Control incorporates a portable browser-based interface to the management console and the Oracle application server technology to serve as the middle-tier Management Service. The foundation of the tool remains rooted in database server technology to manage the Management Repository and historical data. This new architecture requires a different approach to backup, recovery and Disaster Recovery (DR) planning. This section provides practical approaches to these availability topics and discusses different strategies when practical for each tier of Enterprise Manager.

9.3.1 Best Practices for Backup and Recovery

For the database, the best practice is to use the standard database tools for any database backup: keep the database in archivelog mode and perform regular online backups using RMAN or OS commands.

There are two cases to consider with regard to recovery:

  • Full recovery of the Management Repository is possible: No special considerations for Enterprise Manager. When the database is recovered, restart the database and Management Service processes. Management Agents will then upload pending files to the Management Repository.

  • Only point-in-time or incomplete recovery is possible: Management Agents will be unable to communicate with the Management Repository correctly until they are reset. This is a manual process: shut down the Management Agent, delete the agntstmp.txt and lastupld.xml files in the $AGENT_HOME/sysman/emd directory, and then clear the contents of its /state and /upload subdirectories. The Management Agent can then be restarted. This must be done for each Management Agent.

For the case of incomplete recovery, Management Agents may not be able to upload data until the previous steps are completed. Additionally, there is no indication in the interface that the Management Agents may not communicate with the Management Service after this type of recovery. This information would be available from the Management Agent logs or command line Management Agent status. If incomplete recovery is required, it is best to perform this procedure for each Management Agent.
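Because the reset must be repeated on every Management Agent, it may be worth scripting. The following is a minimal sketch of the file cleanup only, assuming the Management Agent has already been shut down and that the caller passes the agent's sysman/emd directory. The function name is hypothetical, and stopping and restarting the Management Agent is left to the operator.

```python
import os
import shutil

# File and subdirectory names taken from the reset procedure above.
STALE_FILES = ("agntstmp.txt", "lastupld.xml")
CLEARED_SUBDIRS = ("state", "upload")


def reset_agent_state(emd_dir):
    """Delete the timestamp files and empty state/ and upload/ under emd_dir.

    Run only while the Management Agent is down. Returns the list of
    paths that were removed so the caller can log them.
    """
    removed = []
    for name in STALE_FILES:
        path = os.path.join(emd_dir, name)
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)

    for sub in CLEARED_SUBDIRS:
        subdir = os.path.join(emd_dir, sub)
        if not os.path.isdir(subdir):
            continue
        # Clear the contents but keep the directory itself.
        for entry in os.listdir(subdir):
            path = os.path.join(subdir, entry)
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed.append(path)
    return removed
```

The function leaves the state and upload directories in place and removes only their contents, matching the manual procedure.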

9.3.1.1 Oracle Management Service

As the Management Service is stateless, the task is to restore the binaries and configuration files in the shortest time possible. There are two alternatives in this case.

  • Back up the entire software directory structure and, in the event of failure, restore it to the same directory path. The Management Agent associated with this Management Service install should also be backed up at the same time and restored with the Management Service files if a restore is required.

  • Reinstall from the original media.

For any highly available Management Service install it is a recommended practice to make sure the /recv directory is protected with some mirroring technology. This is the directory the Management Service uses to stage files sent to it from Management Agents before writing their contents to the database Management Repository. After the Management Agent finishes transmission of its XML files to the Management Service, it will delete its copy. In the event of a Management Service disk failure, this data would be lost. Warnings and alerts sent from the Management Agents would then be lost. This may require Management Agent resynchronization steps similar to those used with an incomplete database recovery.

9.3.1.2 Management Agent

This is a similar case to the Management Service except that the Management Agent is not stateless. There are two strategies that can be used:

  • A disk backup and restore is sufficient, assuming the host name has not changed. Delete the agntstmp.txt and the lastupld.xml files from the /sysman/emd directory. The /state and /upload sub-directories should be cleared of all entries before restarting. Starting the Management Agent will then force a rediscovery of targets on the host.

  • Reinstall from the original media.

As with the Management Service, it is a recommended best practice to protect the /state and /upload directories with some form of disk mirroring.

9.3.2 Best Practice for Disaster Recovery (DR)

In the event of a node failure, the database can be restored using RMAN or OS commands. To speed this process, implement Data Guard to replicate the Management Repository to a different hardware node.

9.3.2.1 Management Repository

If restoring the Management Repository to a new host, restore a backup of the database and modify the emoms.properties file for each Management Service manually to point to the new Management Repository location. In addition, the targets.xml for each Management Service will have to be updated to reflect the new Management Repository location. If there is a data loss during recovery, see the notes above on incomplete recovery of the Management Repository.

To speed Management Repository reconnection from the Management Service in the event of a single Management Service failure, configure the Management Service with a TAF-aware connect string. The Management Service can be configured with a TAF connect string in the emoms.properties file that will automatically redirect communications to another node using the 'FAILOVER' syntax. An example follows:

EM=
  (description=
    (failover=on)
    (address_list=
      (failover=on)
      (address=(protocol=tcp)(port=1522)(host=EMPRIM1.us.oracle.com))
      (address=(protocol=tcp)(port=1522)(host=EMPRIM2.us.oracle.com)))
    (address_list=
      (failover=on)
      (address=(protocol=tcp)(port=1522)(host=EMSEC1.us.oracle.com))
      (address=(protocol=tcp)(port=1522)(host=EMSEC2.us.oracle.com)))
    (connect_data=(service_name=EMrep.us.oracle.com)))

9.3.2.2 Oracle Management Service

Preinstall the Management Service and Management Agent on the hardware that will be used for Disaster Recovery. This eliminates the step of restoring a copy of the Enterprise Manager binaries from backup and modifying the Management Service and Management Agent configuration files.

Note that it is not recommended to restore the Management Service and Management Agent binaries from an existing backup to a new host in the event of a disaster as there are host name dependencies. Always do a fresh install.

9.3.2.3 Management Agent

In the event of a true disaster recovery, it is easier to reinstall the Management Agent and allow it to do a clean discovery of all targets running on the new host.

9.4 Configuring Enterprise Manager for High Availability

Oracle customers deploy systems that are considered critical to their business. These systems often have strict availability requirements and maintenance windows. Downtime is often measured in minutes and maintenance windows are short. Oracle has addressed this business need with the rollout of the 'Unbreakable' database and blueprints for highly available systems such as the Maximum Availability Architecture. With the release of Oracle Enterprise Manager 10g Grid Control, Oracle has increased the manageability of highly available systems. This also increases the availability requirements for the manageability infrastructure.

This section describes a highly available deployment of Grid Control. The section should help you understand the steps required to configure each component for high availability. It also discusses the strengths and limitations of the current solution and explains how to recover from outages of each tier.

9.4.1 Architectural Overview

The architecture for a highly available Grid Control deployment is based on two key concepts: redundancy and component monitoring. Each component of Grid Control can be configured to apply both these concepts.

The components of Grid Control discussed in this section include:

  • Management Agent

  • Management Service

  • Management Repository

For more detail about each of these components, see Oracle Enterprise Manager Grid Control Architecture Overview.

The Management Agent uploads collected monitoring data to a Management Service. The Management Service in turn loads the data into the Management Repository. The Management Repository represents the persistent historic view of collected information that is presented to clients using a web user interface.

A change in target state, whether an availability state change or a metric threshold being crossed, results in a notification being sent. The Management Agent detects the change and forwards the information to the Management Service, which in turn records the state change in the Management Repository. The Management Service then posts messages to any registered users requesting notification, using the registered notification methods, and the console display is updated.

Details on using Enterprise Manager to configure high availability features such as RMAN and Data Guard can be found in the Oracle documentation and in the Enterprise Manager Grid Control online help.

9.4.2 Installation and Configuration for High Availability

The following sections document best practices for installation and configuration of each Grid Control component.

9.4.2.1 Management Agent

Enterprise Manager uses a software process called the Oracle Management Agent to monitor a target. The Management Agent is a system daemon that consists of two processes: a process that provides monitoring, alerting, and job system capabilities, and a watchdog process that is responsible for ensuring the Management Agent is up and available.

The data that is collected by the Management Agent is stored temporarily on the monitored host in files. Once the Management Agent deems it necessary to upload the information to the Grid Control system, it contacts the Management Service to establish a connection and uploads the data.

The Management Service accepts the data from the Management Agent, stores the information as files local to the Management Service and acknowledges receipt of the information to the Management Agent. Depending on the volume of work the Management Service is performing, a period of time may elapse before the Management Service loads the data into the Management Repository.

Notifications of alerts, warnings and target state changes do not follow this delayed model. When the Management Agent uploads the information, the Management Service commits the data immediately to the Management Repository before acknowledgement is returned to the Management Agent.

The Management Agent and its watchdog are started through the command '$ORACLE_HOME/bin/emctl start agent'.

9.4.2.1.1 Configuring the Management Agent to Automatically Start on Boot and Restart on Failure

The Management Agent is started manually. It is important that the Management Agent be started automatically when the host boots to ensure monitoring of critical resources on the administered host. To that end, use the operating system mechanisms available to start the Management Agent automatically. For example, on UNIX systems, place an entry in /etc/init.d that starts the Management Agent on boot; on Windows, set the Management Agent service to start automatically.

9.4.2.1.2 Configuring Restart for the Management Agent

Once the Management Agent is started, the watchdog process monitors the Management Agent and attempts to restart it in the event of a failure. The behavior of the watchdog is controlled by environment variables set before the Management Agent process starts. The variables that control this behavior follow. All testing discussed here was done with the default settings.

  • EM_MAX_RETRIES – This is the maximum number of times the watchdog will attempt to restart the Management Agent within the EM_RETRY_WINDOW. The default is to attempt restart of the Management Agent 3 times.

  • EM_RETRY_WINDOW - This is the time interval in seconds that is used together with the EM_MAX_RETRIES environment variable to determine whether the Management Agent is to be restarted. The default is 600 seconds.

The watchdog will not restart the Management Agent if it detects that the Management Agent has required a restart more than EM_MAX_RETRIES times within the EM_RETRY_WINDOW time period.
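The interaction of the two variables can be illustrated with a small model of the restart decision. This is not the watchdog's actual implementation, which this chapter does not show. It is a sketch of one reading of the rule above, in which a restart is allowed only while fewer than EM_MAX_RETRIES restarts have occurred in the last EM_RETRY_WINDOW seconds. The function names are hypothetical.

```python
import os


def watchdog_settings(environ=None):
    """Read EM_MAX_RETRIES and EM_RETRY_WINDOW, falling back to the
    documented defaults (3 restarts, 600 seconds)."""
    env = os.environ if environ is None else environ
    max_retries = int(env.get("EM_MAX_RETRIES", "3"))
    retry_window = int(env.get("EM_RETRY_WINDOW", "600"))
    return max_retries, retry_window


def should_restart(previous_restarts, now, max_retries=3, retry_window=600):
    """Decide whether another restart is allowed at time `now`.

    previous_restarts -- timestamps (seconds) of earlier restarts
    Assumed interpretation: only restarts less than retry_window seconds
    old count against the limit.
    """
    recent = [t for t in previous_restarts if 0 <= now - t < retry_window]
    return len(recent) < max_retries
```

Under the defaults, an agent that was restarted at 0, 100, and 200 seconds and fails again at 300 seconds is not restarted, because that would be a fourth restart inside one 600-second window. If the next failure instead comes at 650 seconds, only two earlier restarts fall within the window and the restart is allowed.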

9.4.2.1.3 Configuring the Connection Between Management Agents and the Management Service

Management Agents do not maintain a persistent connection to the Management Service. When a Management Agent needs to upload collected monitoring data or an urgent target state change, the Management Agent establishes a connection to the Management Service. If the connection is not possible, such as in the case of a network failure or a host failure, the Management Agent retains the data and re-attempts to send the information later.

Server Load Balancers (SLBs) such as the F5 Networks Big-IP provide logical service abstractions for network clients. Clients establish connections to the virtual service exposed by the SLB. The SLB routes the request to any one of a number of available servers that provide the requested service. The service chosen by an SLB as the destination is dependent upon the virtual service definition. One such criterion is whether a service is capable of accepting connections.

The Grid Control Management Service is a network service that can be fronted by an SLB to address the need for resiliency.

To accomplish the goal of having a highly available Management Service that the Management Agents can use for data upload, configure a virtual pool that consists of the hosts and the services that the hosts provide. In the case of the Management Services pool, the hostname and Management Agent upload port would be specified. To ensure a highly available Management Service, you should have two or more Management Services defined within the virtual pool.
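The behavior of such a virtual pool can be modeled in a few lines. The sketch below is a conceptual illustration only and does not represent how any particular SLB product is configured or implemented. The class name, the round-robin policy, and the injected health probe are all assumptions, and the host names and port in the usage note are placeholders.

```python
class VirtualPool:
    """Toy model of an SLB virtual pool of Management Services.

    members            -- list of (hostname, upload_port) tuples
    accepts_connections -- callable(member) -> bool; stands in for the
                           SLB's health check (an assumption of this sketch)
    """

    def __init__(self, members, accepts_connections):
        self._members = list(members)
        self._probe = accepts_connections
        self._next = 0  # index at which the next search starts

    def pick(self):
        """Return the next healthy member in round-robin order,
        or None when no member is accepting connections."""
        count = len(self._members)
        for offset in range(count):
            index = (self._next + offset) % count
            member = self._members[index]
            if self._probe(member):
                self._next = (index + 1) % count
                return member
        return None
```

With two placeholder members such as ("oms1.example.com", 4889) and ("oms2.example.com", 4889), uploads alternate between them while both are healthy. If the probe reports oms1 as down, every call to `pick` returns oms2. This is why the pool needs at least two Management Services to provide any resiliency.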

9.4.2.1.4 Installing the Management Agent Software on Redundant Storage

The Management Agent persists its intermediate state and collected information using local files in the $AGENT_HOME/$HOSTNAME/sysman/emd sub tree under the Management Agent home directory.

In the event that these files are lost or corrupted before being uploaded to the Management Repository, a loss of monitoring data and any pending alerts not yet uploaded to the Management Repository occurs.

At a minimum, configure these sub-directories on striped redundant or mirrored storage. Availability would be further enhanced by placing the entire $AGENT_HOME on redundant storage. The Management Agent home directory is shown by entering the command 'emctl getemhome' on the command line, or from the Management Services and Repository tab and Agents tab in the Grid Control Console.

9.4.2.1.5 Configuring All Out-of-band Notifications

The Enterprise Manager Grid Control deployment is configured out of the box such that connection failures between the Management Service and the Management Agent are detected. This is through a process of heartbeats that the Management Agent performs against the Management Service. If the Management Service determines it has not heard back from the Management Agent, it pings it.

This mechanism does not, however, cover the case where the Management Agents are up and available but there are no Management Services to which to upload or process notifications. For this situation, the Management Agent has the capability of sending an emergency notification when it is still up but has lost contact with the Management Service.

This provides another mechanism to alert the administrator of a Management Service failure. For more information, see Section 9.4.3, "Configuration Within Grid Control".

In the emd.properties file located in the $AGENT_HOME/sysman/config directory, modify the property values for emd_email_address and emd_email_gateway to reflect a valid e-mail address and mail gateway in your system. The parameter emd_from_email_address should also be modified to reflect the name of the system sending the alert for faster root cause identification.

In addition, any custom notification script can be executed by the Management Agent in the event of a failure to communicate with the Management Service. This script can be set to execute by modifying the 'emdFailureScript' entry in the Management Agent emd.properties file.
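When many Management Agents need the same out-of-band settings, the edit to emd.properties can be scripted. The sketch below rewrites key=value lines in the text of a properties file. The helper name is hypothetical, it handles only simple single-line key=value entries, and the e-mail values shown in the usage note are placeholders. Back up the file before applying any change.

```python
def set_properties(text, updates):
    """Return `text` with each key in `updates` set to its new value.

    Existing key=value lines are replaced in place; keys that are not
    present are appended at the end. Comment lines (starting with '#')
    and unrelated lines are preserved unchanged.
    """
    remaining = dict(updates)
    output = []
    for line in text.splitlines():
        stripped = line.strip()
        key = None
        if "=" in stripped and not stripped.startswith("#"):
            key = stripped.split("=", 1)[0].strip()
        if key in remaining:
            output.append(f"{key}={remaining.pop(key)}")
        else:
            output.append(line)
    # Append any properties that were not already in the file.
    for key, value in remaining.items():
        output.append(f"{key}={value}")
    return "\n".join(output) + "\n"
```

A typical call would pass the file contents together with a dictionary such as {"emd_email_address": "oncall@example.com", "emd_email_gateway": "smtp.example.com"} and write the returned text back to emd.properties while the Management Agent is stopped.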

9.4.2.2 Management Service

The Management Service element of the Enterprise Manager Grid Control product both receives information from Management Agents and serves the user interface in the form of HTML pages. It does this by maintaining a connection to the Management Repository database and responding to requests over HTTP.

9.4.2.2.1 Configuring the Shared Filesystem Loader

Configure the Management Services to use the Shared Filesystem Loader. In the Shared Filesystem Loader, management data files received from Management Agents are stored temporarily on a common shared location called the shared receive directory. All Management Services are configured to use the same storage location for the shared receive directory. The Management Services coordinate internally and distribute amongst themselves the workload of uploading files into the Management Repository. Should a Management Service go down for some reason, its workload is taken up by surviving Management Services.

9.4.2.2.2 Configuring SLB to Abstract the Underlying Management Service Host Names for Easier Reconnect After Failure

A hardware server load balancer (SLB) such as F5 Networks Big-IP can be used as the front end to abstract the number and location of Management Services and appear as a single service. Under that abstraction, the SLB parcels the work to any number of Management Service processes that it has in its 'virtual pool.' For any Grid Control installation with an availability requirement there should be a minimum of two Management Service processes installed. Coupled with an SLB, this provides a method for constant communication to the Grid Control Console in the event of the failure of a Management Service.

For more details on configuring SLB for Shared Filesystem Loader, see Section 3.6.1, "Load Balancing Connections Between the Management Agent and the Management Service".

9.4.2.2.3 Management Service Installation Should Be Done to Non-Clustered Servers

Management Service processes cannot be installed on any machines running under a cluster, whether it is CRS or vendor cluster software. Install Management Services to single nodes and use the method described previously for failover and availability.

9.4.2.2.4 Configuring Management Service to Use Client Side Oracle Net Load Balancing for Failover and Load Balancing

When you use a RAC cluster, a standby system, or both to provide high availability for the Management Repository, the Management Service can be configured to use an Oracle Net connect string that will take advantage of redundancy in the Management Repository. Correctly configured, the Management service process will continue to process data from Management Agents even during a database node outage.

In the $OMS_HOME/sysman/config directory, modify the emdRepConnectDescriptor entry in the emoms.properties file to point to the appropriate Management Repository instances. The following example shows a connect string required to support a 2-node RAC configuration. Note the backslash (\) before each equal sign (=).

oracle.sysman.eml.mntr.emdRepConnectDescriptor= (DESCRIPTION\=(ADDRESS_LIST\=(FAILOVER\=ON) (ADDRESS\=(PROTOCOL\=TCP)(HOST\=haem1.us.oracle.com) (PORT\=1521))(ADDRESS\=(PROTOCOL\=TCP) (HOST\=haem2.us.oracle.com)(PORT\=1521))) (CONNECT_DATA\=(SERVICE_NAME\=em10)))
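Because every equal sign in the value must be escaped, the entry is easy to mistype by hand. The following sketch generates the escaped value from a list of hosts. The function names are hypothetical, and the descriptor layout simply mirrors the 2-node example above, so adjust it if your configuration needs additional Oracle Net parameters.

```python
PROPERTY_NAME = "oracle.sysman.eml.mntr.emdRepConnectDescriptor"


def build_rep_connect_descriptor(hosts, port, service_name):
    """Build a failover connect descriptor for the given repository hosts
    and escape every '=' with a backslash for use in the properties file."""
    addresses = "".join(
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))" for host in hosts
    )
    descriptor = (
        f"(DESCRIPTION=(ADDRESS_LIST=(FAILOVER=ON){addresses})"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )
    return descriptor.replace("=", "\\=")


def property_line(hosts, port, service_name):
    """Return the complete property line; the '=' separating the key from
    the value is the only unescaped equal sign."""
    value = build_rep_connect_descriptor(hosts, port, service_name)
    return f"{PROPERTY_NAME}={value}"
```

Calling `property_line(["haem1.us.oracle.com", "haem2.us.oracle.com"], 1521, "em10")` reproduces the example entry as a single line without embedded spaces.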

9.4.2.2.5 Install the Management Service Software on Redundant Storage

The Management Service contains results of the intermediate collected data before it is loaded into the Management Repository. The loader receive directory contains these files and is typically empty when the Management Service is able to load data as quickly as it is received. Once the files are received by the Management Service, the Management Agent considers them committed and therefore removes its local copy. In the event that these files are lost before being uploaded to the Management Repository, data loss will occur. At a minimum, configure these sub-directories on striped redundant or mirrored storage. When Management Services are configured for the Shared Filesystem Loader, all services share the same loader receive directory. It is recommended that the shared loader receive directory be on a clustered file system such as a NetApp Filer.

Similar to the Management Agent directories, availability would be further enhanced by placing the entire Management Service software tree on redundant storage. The Management Service home directory can be determined at the command line using 'emctl getemhome' or by using the Management Services and Repository tab in the Grid Control Console.

9.4.2.3 Management Repository

The Management Repository is the central location for all historical data managed by Grid Control. Redundancy at this tier is provided by standard database features and best practices.

9.4.2.3.1 Install Into an Existing RAC Management Repository

The Grid Control installation process does not directly support installation into a RAC Management Repository. The recommended installation method is to install the database software first and create a RAC database. When this is complete, install the Enterprise Manager software, selecting the Enterprise Manager Grid Control Using an Existing Database installation option.

The installation does not transparently support the installation of the Enterprise Manager 10g Grid Control into a RAC database. Specify the SID of one of the cluster instances when prompted during the installation. After the installation of the Enterprise Manager 10g Grid Control Management Service, you should modify the connection string the Management Service uses to take advantage of client failover in the event of a RAC host outage (refer to Section 9.4.2.2.4, "Configuring Management Service to Use Client Side Oracle Net Load Balancing for Failover and Load Balancing").

The installation process also does not allow modification of the size of the required Enterprise Manager tablespaces (although it does allow for specification of the name and location of data files that are to be used by the Enterprise Manager 10g Grid Control schema). The default sizes for the initial data file extents depend on using the AUTOEXTEND feature and as such are insufficient for a production installation. This is particularly problematic where storage for the RAC is on a raw device.

If the RAC database being used for the Management Repository is configured with raw devices there are two options for increasing the size of the Management Repository. You can create multiple raw partitions, with the first one equal to the default size of the tablespace as defined by the installation process. Alternatively, you can create the tablespace using the default size, create a dummy object that will increase the size of the tablespace to the end of the raw partition, then drop that object. Regardless, if raw devices are used, disable the default space management for these objects, which is to auto-extend.

9.4.2.3.2 Consider (Physical) Data Guard for Redundancy

Clients who require greater uptime or an off-site copy of the Management Repository can use Oracle Data Guard in conjunction with Grid Control. This alternative can be used regardless of whether or not you are using a RAC database. Currently, only the use of physical Data Guard is supported.

A Data Guard instance must be created manually using the steps documented in the Data Guard documentation.

9.4.3 Configuration Within Grid Control

Grid Control comes preconfigured with a series of default rules to monitor many common targets. These rules can be extended to monitor the Grid Control infrastructure as well as the other targets on your network to meet specific monitoring needs.

9.4.3.1 Console Warnings, Alerts, and Notifications

The following list is a set of recommendations that extend the default monitoring performed by Enterprise Manager. Use the Notification Rules link on the Preferences page to adjust the default rules provided on the Configuration/Rules page:

  • Ensure the Agent Unreachable rule is set to alert on all agent unreachable and agent clear errors.

  • Ensure the Repository Operations Availability rule is set to notify on any unreachable problems with the Management Service or Management Repository nodes. Also modify this rule to alert on the Targets Not Providing Data condition and any database alerts that are detected against the database serving as the Management Repository.

  • Modify the Agent Upload Problems rule to alert when the Management Service status has reached a warning or clear threshold.

9.4.3.2 Configure Additional Error Reporting Mechanisms

Enterprise Manager provides error reporting mechanisms through e-mail notifications, PL/SQL packages, and SNMP alerts. Configure these mechanisms based on the infrastructure of the production site. If using e-mail for notifications, configure the notification rule through the Grid Control Console to notify administrators using multiple SMTP servers if they are available. This can be done by modifying the default e-mail server setting on the Notification Methods option under Setup.

9.4.3.3 Component Backup

Backup procedures for the database are well-established standards. Configure backup for the Management Repository using the RMAN interface provided in the Grid Control Console. Refer to the RMAN documentation or the Maximum Availability Architecture document for detailed implementation instructions.

In addition to the Management Repository, the Management Service and Management Agent should also be backed up regularly. Perform a backup after any configuration change. Best practices for backing up these tiers are documented in Section 9.3, "Oracle Enterprise Manager Backup, Recovery, and Disaster Recovery Considerations".

9.4.3.4 Troubleshooting

In the event of a problem with Grid Control, the starting point for any diagnostic effort is the console itself. The Management System tab provides access to an overview of all Management Service operations and current alerts. Other pages summarize the health of Management Service processes and logged errors. These pages are useful for determining the causes of any performance problems, because the summary page shows a historical view of the number of files waiting to be loaded to the Management Repository and the amount of work waiting to be completed by Management Agents.

9.4.3.4.1 Upload Delay for Monitoring Data

When assessing the health and availability of targets through the Grid Control Console, information may be slow to appear in the UI, especially after a Management Service outage. The state of a target shown in the Grid Control Console may lag behind a state change on the monitored host. Use the Management System page to gauge the backlog of pending files to be processed.

9.4.3.4.2 Notification Delay of Target State Change

The Management Agent uses a poll-based model to assess the health of each monitored target. A Management Agent posts a notification to the Management Service as soon as it detects a change in state. Because detection depends on polling, however, there can be a delay between the actual state change on the target and the moment the Management Agent detects it.
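The size of that delay can be reasoned about with a short calculation. The Python sketch below is a simplified model that assumes a fixed polling interval; the 300-second interval is an illustrative value, not a documented Grid Control default, and the function name is hypothetical. It computes how long a state change goes unnoticed until the next poll.

```python
import math


def detection_delay(change_time, poll_interval, first_poll=0):
    """Return the seconds between a state change and the poll that sees it.

    Simplified model: polls occur at first_poll, first_poll + poll_interval,
    first_poll + 2 * poll_interval, and so on. All values are in seconds.
    """
    if poll_interval <= 0:
        raise ValueError("poll_interval must be positive")
    elapsed = change_time - first_poll
    if elapsed <= 0:
        # The change happened before polling started; the first poll sees it.
        return first_poll - change_time
    # Index of the first poll at or after the change.
    polls_needed = math.ceil(elapsed / poll_interval)
    next_poll = first_poll + polls_needed * poll_interval
    return next_poll - change_time


# Illustrative interval of 300 seconds (an assumption for this example).
INTERVAL = 300
just_after_poll = detection_delay(1, INTERVAL)
mid_interval = detection_delay(310, INTERVAL)
exactly_on_poll = detection_delay(300, INTERVAL)
print(just_after_poll, mid_interval, exactly_on_poll)
```

In this model the delay ranges from zero, when the change coincides with a poll, up to just under one full polling interval, when the change occurs immediately after a poll.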