72 DBMS_PREDICTIVE_ANALYTICS

Data mining can discover useful information buried in vast amounts of data. However, it is often the case that both the programming interfaces and the data mining expertise required to obtain these results are too complex for use by the wide audiences that can obtain benefits from using Oracle Data Mining (ODM).

The DBMS_PREDICTIVE_ANALYTICS package addresses both of these complexities by automating the entire data mining process from data preprocessing through model building to scoring new data. This package provides an important tool that makes data mining possible for a broad audience of users, in particular, business analysts.

Data used by ODM consists of tables or views stored in an Oracle database. Each column in a record (row) holds an item of information. Data mining models are often used to identify important columns or to predict column values.

The DBMS_PREDICTIVE_ANALYTICS package supports the following functionality:

EXPLAIN - Ranks attributes in order of influence in explaining a target column.
PREDICT - Predict the value of a column.

This chapter contains the following topics:

Using DBMS_PREDICTIVE_ANALYTICS
- Overview
Summary of DBMS_PREDICTIVE_ANALYTICS Subprograms

Using DBMS_PREDICTIVE_ANALYTICS

This section contains topics which relate to using the DBMS_PREDICTIVE_ANALYTICS package.

Overview

Overview

Data mining, according to a commonly used process model, requires the following steps:

Understand the business problem.
Understand the data.
Prepare the data for mining.
Create models using the prepared data.
Evaluate the models.
Deploy and use the model to score new data.

DBMS_PREDICTIVE_ANALYTICS automates parts of step 3 and steps 4 and 5 of this process.

The user provides input data in a single table or view. For EXPLAIN, the user identifies a column to explain; for PREDICT, the user identifies a column to predict and also identifies a case id column. The procedure accepts the input, analyzes the data, performs suitable preprocessing, builds and tests models, selects the best model, and applies the model to data.

Input for DBMS_PREDICTIVE_ANALYTICS is the name of a single table or view in an Oracle database. Each column in the table must have one of the following data types:

NUMBER
FLOAT
CHAR
VARCHAR2
DATE
TIMESTAMP

Summary of DBMS_PREDICTIVE_ANALYTICS Subprograms

Table 72-1 DBMS_PREDICTIVE_ANALYTICS Package Subprograms

Subprogram	Purpose
EXPLAIN Procedure	Ranks attributes in order of influence in explaining a target column
PREDICT Procedure	Predicts the value of a column based on values in the input data

EXPLAIN Procedure

The procedure ranks attributes in order of influence in explaining a target column.

The procedure analyzes the input table, performs data preprocessing, builds a model, analyzes the model to identify key columns, and creates a result table listing the important columns and quantifying the explanatory power of each important column.

Syntax

DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
     data_table_name     IN VARCHAR2,
     explain_column_name IN VARCHAR2,
     result_table_name   IN VARCHAR2,
     data_schema_name    IN VARCHAR2 DEFAULT NULL);

Parameters

Table 72-2 EXPLAIN Procedure Parameters

Parameter	Description
`data_table_name`	Name of input table or view
`explain_column_name`	Name of column to be explained
`result_table_name`	Name of table where results are saved
`data_schema_name`	Name of schema where the input table or view resides. Default: the current schema.

Usage Notes

The result table has the following definition:

column_name               VARCHAR2(30)
explanatory_value         NUMBER
rank                      NUMBER

Table 72-3 describes the columns in the result table.

Table 72-3 EXPLAIN Procedure Result Table

Column Name	Meaning
`column_name`	Name of a column in the input data; all columns except the explained column are listed in the result table.
`explanatory_value`	Value indicating how useful the column is for determining the value of the explained column. Higher values indicate greater explanatory power. Value can range from 0 to 1. An individual column's explanatory value is independent of other columns in the input table. Instead, the values are based on how strong each individual column correlates with the explained column. The value is affected by the number of records in the input table, and the relations of the values of the column to the values of the explain column. An explanatory power value of 0 implies there is no useful correlation between the column's values and the explain column's values. An explanatory power of 1 implies perfect correlation; such columns should be eliminated from consideration for `PREDICT`. In practice, an explanatory power equal to 1 is rarely returned.
`rank`	Ranking of explanatory power. Rows with equal values for `explanatory_power` have the same rank. Rank values are not skipped in the event of ties.

Example

The following example performs an EXPLAIN operation and views the results:

--Perform EXPLAIN operation 
BEGIN 
    DBMS_PREDICTIVE_ANALYTICS.EXPLAIN( 
        data_table_name      => 'census_dataset', 
        explain_column_name  => 'class', 
        result_table_name    => 'census_explain_result'); 
END; 
/ 
--View results 
SELECT * FROM census_explain_result; 

COLUMN_NAME     EXPLANATORY_VALUE  RANK 
-----------     -----------------  ---- 
RELATIONSHIP    .21234788             1 
MARITAL_STATUS  .195201808            2 
CAPITAL_GAIN    .102951498            3 
OCCUPATION      .06883765             4 
EDUCATION       .067517394            5 
EDUCATION_NUM   .067517394            5 
SEX             .055541542            6 
HOURS_PER_WEEK  .032476973            7 
AGE             .021933245            8 
CAPITAL_LOSS    .013083265            9 
RACE            .009670242           10 
WORKCLASS       0                    11 
NATIVE_COUNTRY  0                    11 
WEIGHT          0                    11 
PERSON_ID       0                    11 
 
15 rows selected.

PREDICT Procedure

This procedure is used to predict values of a specific column. The input consists of a table and a target column, the target column containing the values to predict. The input data must contain some cases in which the target value is known (that is, is not NULL). Cases where the target values are known are used to train models.

PREDICT returns a predicted value for every case, including those where the value is known.

Syntax

DBMS_PREDICTIVE_ANALYTICS.PREDICT (
    accuracy                  OUT NUMBER,
    data_table_name           IN VARCHAR2,
    case_id_column_name       IN VARCHAR2,
    target_column_name        IN VARCHAR2,
    result_table_name         IN VARCHAR2,
    data_schema_name          IN VARCHAR2 DEFAULT NULL);

Parameters

Table 72-4 PREDICT Procedure Parameters

Parameter	Description
`data_table_name`	Name of input table or view
`case_id_column_name`	Name of column that uniquely identifies each case in the input data (for example, the column containing the customer id or case id)
`target_column_name`	Name of the column containing the value to predict (the target)
`result_table_name`	Name of table where results will be saved
`data_schema_name`	Name of schema where the input table or view resides and where the result table is written. Default: the current schema.

Usage Notes

The result table has the following definition:

case_id_column_name      VARCHAR2 or NUMBER
prediction               VARCHAR2 or NUMBER
probability_number       NUMBER

Table 72-5 describes the result table of the PREDICT procedure.

Table 72-5 PREDICT Procedure Result Table

Column Name	Meaning
`case_id_column_name`	Each of the cases identified in the case_id column. This is the same as the name that was passed in. The data type is the same as input `case_id` type.
`prediction`	The predicted value of the target column for the given case. The data type is the same as the input `target_column_name` type.
`probability`	For classification (categorical target), the probability of the prediction. For regression problems (numerical target), this column contains `NULL`.

Predictions are returned for all cases in the input data.

Predicted values for known cases may be interesting in some situations, for example, to perform deviation analysis, that is, to compare predicted values and actual values.

Example

The following example performs a PREDICT operation and display the first 10 predictions. In this example, since the target column class is categorical, a probability is returned for each prediction:

--Perform PREDICT operation 
DECLARE 
    v_accuracy NUMBER(10,9); 
BEGIN 
    DBMS_PREDICTIVE_ANALYTICS.PREDICT( 
        accuracy             => v_accuracy, 
        data_table_name      => 'census_dataset', 
        case_id_column_name  => 'person_id', 
        target_column_name   => 'class', 
        result_table_name    => 'census_predict_result'); 
    DBMS_OUTPUT.PUT_LINE('Accuracy = ' || v_accuracy); 
END; 
/ 
--View first 10 predictions
SELECT * FROM census_predict_result where rownum < 10; 

PERSON_ID    PREDICTION      PROBABILITY 
----------   ----------      ----------- 
  2                   1      .418787003 
  7                   0      .922977991 
  8                   0      .99869723 
  9                   0      .999999605 
 10                   0      .9999009 
 11                   0      .999999996 
 12                   1      .953949094 
 15                   0      .99999997 
 16                   0      .999968961 
 
9 rows selected.