Oracle9i OLAP Developer's Guide to the OLAP DML Release 2 (9.2) Part Number A95298-01
Aggregating Data, 11 of 12
Maintenance of an analytic workspace must usually be done within a restrictive batch window. For this reason, many DBAs perform partial aggregations rather than full aggregations each time they refresh the data. When all of the data is pre-aggregated, this does not present a problem. However, when partial aggregations are performed on data that uses both pre-aggregation and runtime aggregation, then steps must be taken to ensure that the results are correct. Errors in the data occur when the status list generated by the PRECOMPUTE keyword is outdated.
The PRECOMPUTE clause produces a status list that:

- Tells the AGGREGATE command which data should be pre-calculated, and
- Tells the AGGREGATE function what the AGGREGATE command has done.

If you never use the AGGREGATE command with the AGGREGATE function, you do not need this information.
You should read this information to address the following circumstances:

- Incremental data loading: You perform incremental data loads and you have changed a PRECOMPUTE clause in a RELATION command in an aggregation map.
- Data-dependent PRECOMPUTE clauses: When you use the PRECOMPUTE keyword in an aggregation map, that PRECOMPUTE clause can be data-dependent instead of simply identifying dimension values or levels.
- Changes to hierarchies: After aggregating data, you move one or more dimension values to a different parent in the hierarchy.

Incremental Data Loading

Incremental data loading refers to the process of loading new input data into an existing analytic workspace and then aggregating that data. This usually happens on a regular basis, whether it is on a monthly, weekly, or even daily basis.
For example, suppose you design a new analytic workspace. It contains two variables: sales and units. Suppose that when you build the analytic workspace for the first time, you have input data for one year for both variables. Because sales and units contain exactly the same dimensions in exactly the same order in their definitions, you define one aggregation map that will be shared by both sales and units. You load that input data into the analytic workspace, then use the AGGREGATE command to roll up that input data.
You know that you will be getting new input data for sales and units on the first day of every month. For example, suppose it is March 1. On this day, you expect to receive the sales data and units data for the previous month of February. Your responsibility is to load the February data into the existing analytic workspace and aggregate that input data. This is an incremental data load. The next incremental data load will take place on April 1, and so on.
Typically, when you aggregate this new data, you will use a LIMIT command to ensure that only the new input data will be aggregated. For example, to aggregate only the new input data that you have loaded for February, you might use the following commands:

LIMIT month TO 'FEB99'
AGGREGATE sales units USING salesunits.aggmap
This is acceptable as long as you do not change any of the PRECOMPUTE clauses in the aggregation map. If you do, then you must pre-aggregate all of the data.
If you change a PRECOMPUTE clause, then the status list will change. This means that although the data produced by the AGGREGATE command after you change the PRECOMPUTE clause will be correct, Oracle OLAP may not be able to return the data that is requested by a user using the AGGREGATE function. The status list might indicate that a value has already been calculated when in fact it has not.
If you make any changes to any PRECOMPUTE clause in one or more RELATION commands in an aggregation map, then you must pre-aggregate all of the data. Otherwise, the AGGREGATE function will use a PRECOMPUTE status list that is out of synchronization with the data, and thus may not generate all of the required values.
Use the following procedure to be sure the data will be aggregated correctly:

1. Make your changes to the PRECOMPUTE clauses in your aggregation map.
2. Load the new input data.
3. Set the status of every dimension in the aggregation map to ALL. (You can use one ALLSTAT command, or a LIMIT TO ALL command for every dimension in the aggregation map.)
4. Execute the AGGREGATE command.

(If you use the AGGREGATE function, specify the FUNCDATA keyword when you execute the AGGREGATE command in Step 4.)

Data-Dependent PRECOMPUTE Clauses

The clause that follows the PRECOMPUTE keyword is like a LIMIT command. You have the flexibility to specify the limit expression using the values of the data. For example, you can specify the five areas with the lowest sales figures in a time period. The RELATION command might look like this:
RELATION geography.parentrel PRECOMPUTE (BOTTOM 5 BASEDON sales)
Data-dependent limit expressions can vary in their results. In other words, the "bottom five" areas in the analytic workspace that you build in February will not necessarily be the same "bottom five" areas after performing an incremental data load in March. Furthermore, the "bottom five" areas in your March analytic workspace will not necessarily be the same "bottom five" areas after the April incremental data load.
In this situation, the PRECOMPUTE status list is out of synchronization, and the AGGREGATE function may not calculate a needed value because the status list indicates that it was precomputed.
Instead of using a data-dependent PRECOMPUTE clause, you can either:

- List specific dimension values in the PRECOMPUTE clause, or
- Create a valueset and use it in the PRECOMPUTE clause.

As you load and aggregate incremental data over the course of time, the status list that is generated by the PRECOMPUTE keyword remains constant when you use one of these methods. However, the five stores in the limit expression or valueset remain the same, regardless of whether or not they still represent the stores with the lowest sales figures.
To keep the limit phrase current, take the following steps each time you load incremental data:

1. Update the dimension values listed in the PRECOMPUTE clause, or the values in the valueset.
2. Re-aggregate all of the data, specifying the FUNCDATA keyword if you use the AGGREGATE function.

Refer to "Incremental Data Loading" for the general guidelines you should follow.
If you have changed the input data or your hierarchies, then replace any data that has been aggregated with NA values. These are the steps that you might take:

1. Limit the dimensions to the input data.
2. Create a new variable.
3. Copy the data from the original variable to the new variable.
4. Delete the original variable.
5. Rename the new variable to the name of the original variable.
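These steps might be sketched as follows in OLAP DML. This is only an illustration: the variable name newsales, the dimension list <geography product channel month>, and the geography.levelrel relation are assumptions, and the LIMIT commands must select only your detail-level input data. Because assignment copies only the cells in status, the aggregated cells of the new variable remain NA:

" Limit each dimension to the detail-level input data;
" the exact LIMIT commands depend on your hierarchies
LIMIT geography TO geography.levelrel EQ 'STORE'
" Define a new variable with the same definition as sales
DEFINE newsales VARIABLE DECIMAL <geography product channel month>
" Copy only the input cells; all other cells of newsales stay NA
newsales = sales
" Replace the original variable with the new one
DELETE sales
RENAME newsales sales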
Instead of using data-dependent PRECOMPUTE clauses, use specific dimension values in the PRECOMPUTE clause. After loading the data, issue a data-dependent LIMIT command to identify the dimension values. Then list those values in the PRECOMPUTE clause. For example,

LIMIT time TO '2001'
LIMIT channel TO 'TOTALCHANNEL'
LIMIT product TO 'TOTALPROD'
LIMIT geography TO BOTTOM 5 BASEDON sales
STATUS geography
The current status of GEOGRAPHY is:
BOGOTA, BORDEAUX, EDINBURGH, KYOTO, BRUSSELS
You would then change the PRECOMPUTE clause to list these areas:

RELATION geography.parentrel PRECOMPUTE ('BOGOTA' 'BORDEAUX' 'EDINBURGH' -
   'KYOTO' 'BRUSSELS')
If you want to use data-dependent PRECOMPUTE clauses, create and use a valueset with the PRECOMPUTE clause.

A valueset can be used to store a list of values. For example, the following commands create a valueset for the geography dimension. After performing an incremental update, you would need to update the valueset, but you would not need to edit the aggregation map.

The following commands create a valueset for geography:

DEFINE lowsales.geog VALUESET geography
LIMIT time TO '2001'
LIMIT channel TO 'TOTALCHANNEL'
LIMIT product TO 'TOTALPROD'
LIMIT lowsales.geog TO BOTTOM 5 BASEDON sales
The VALUES function returns the status list of the valueset:

SHOW VALUES(lowsales.geog)
BOGOTA
BORDEAUX
EDINBURGH
BRUSSELS
KYOTO
This RELATION command uses the valueset:
RELATION geography.parentrel PRECOMPUTE (lowsales.geog)
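With this approach, each incremental load only needs to refresh the valueset; the aggregation map itself is untouched. A sketch of the refresh, reusing the LIMIT commands shown earlier, might be:

" Refresh the valueset after loading the new data
LIMIT time TO '2001'
LIMIT channel TO 'TOTALCHANNEL'
LIMIT product TO 'TOTALPROD'
LIMIT lowsales.geog TO BOTTOM 5 BASEDON sales

After the valueset is refreshed, re-aggregate the data as described in "Incremental Data Loading".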
Changes to Hierarchies

Once you have defined a hierarchy and you have aggregated data, if you move one or more dimension values to a different parent in the hierarchy, then you have changed the hierarchy.

For example, suppose your geography hierarchy has input data for stores. The store data rolls up into cities. The cities roll up into regions, and so on.
You define your dimensions and variables. You define the hierarchies for your dimensions. You load data and roll it up. Several months later, after you have loaded and rolled up incremental data, one of the stores changes location. For example, STORE22 closes its location in Hull, Massachusetts and then reopens at a new location in Waltham, Massachusetts. Therefore, STORE22 is now part of the WEST BOSTON region instead of the SOUTH BOSTON region.
Therefore, you must move the STORE22 dimension value so that its data will roll up to different dimension values in the higher levels of the hierarchy. For example, you must move STORE22 from the HULL path to the WALTHAM path.
When you move one or more dimension values so that their data rolls up in a different path in the hierarchy, you have changed the hierarchy.
Suppose that you receive the most recent month's worth of data for STORE22. You load that data and aggregate it.
Today you find out that last month the store moved to a new city, as well as a new region. This means that you have already aggregated the STORE22 data into HULL, when the STORE22 data now should be aggregated into WALTHAM.
The problem is that you not only need to change the hierarchy, but you also need to correct the data so that the STORE22 data aggregates into WALTHAM instead of HULL.
When you change a hierarchy, you can re-aggregate the data in the analytic workspace (after you have changed the hierarchy) in one of two ways:

- Perform a full aggregation of all of the data.
- Perform a partial aggregation of only the data that is affected by the change to the hierarchy.

The advantage of a partial aggregation is that it takes a shorter period of time to complete than a full aggregation. However, the advantage of performing a full rollup is that you know the results will be correct.

Therefore, if you move one or two dimension values in your hierarchy, and you have a small window of time to roll up the analytic workspace, you can perform a partial aggregation; otherwise, perform a full aggregation.
Follow these steps to aggregate the data for the former parents and the current parents of the dimension value that moved in the hierarchy.

1. Identify the dimension value that has moved. In this example, STORE22 is the dimension value whose data now aggregates to WALTHAM instead of HULL.
2. Identify a previous sibling of that dimension value. In this example, STORE22 was previously grouped with STORE23 and STORE24; either one qualifies as a previous sibling of STORE22.
3. Limit the geography dimension to STORE22 and STORE23.

LIMIT geography TO 'STORE22' 'STORE23'

4. Aggregate the sales variable.

AGGREGATE sales USING sales.agg
By identifying the dimension value that has moved, you can recalculate its new ancestors (such as WALTHAM). By identifying a previous sibling of the dimension value that has moved, you can recalculate its previous ancestors (such as HULL).
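Putting the steps together, the whole partial re-aggregation might be sketched as follows. The relation assignment that reparents STORE22 is an assumption about how the geography.parentrel hierarchy is maintained; adjust it to match your own update procedure:

" Reparent STORE22 from the HULL path to the WALTHAM path
geography.parentrel(geography 'STORE22') = 'WALTHAM'
" Limit to the moved value and a previous sibling, then
" roll up only the affected paths of the hierarchy
LIMIT geography TO 'STORE22' 'STORE23'
AGGREGATE sales USING sales.agg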
Copyright © 2001, 2002 Oracle Corporation. All Rights Reserved.