Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2) Part Number B14340-01
This chapter provides information to help you build data mining applications in PL/SQL. It includes sample code for building, testing, and scoring a classification model, and it illustrates the use of SQL functions for model scoring.
The examples included in this chapter are taken from the Data Mining sample applications available on the Database companion CD. When you install the companion CD, the Data Mining sample applications are copied to /rdbms/demo/ in the Oracle home directory.
The following directory listing command lists the sample data mining programs on a Linux system. Use an equivalent command to list the sample programs on other operating systems.
>ls $ORACLE_HOME/rdbms/demo/dm*
Table 4-1 lists the sample PL/SQL programs.
Table 4-1 Sample PL/SQL Programs

| Application | Description |
|---|---|
| dmabdemo.sql | Creates an Adaptive Bayes Network model (classification). |
| dmaidemo.sql | Creates an Attribute Importance model. |
| dmardemo.sql | Creates an Association Rules model. |
| dmdtdemo.sql | Creates a Decision Tree model (classification). |
| dmkmdemo.sql | Creates a k-Means model (clustering). |
| dmnbdemo.sql | Creates a Naive Bayes model (classification). |
| dmnmdemo.sql | Creates a Non-Negative Matrix Factorization model (feature extraction). |
| dmocdemo.sql | Creates an O-Cluster model (clustering). |
| dmsvcdem.sql | Creates a Support Vector Machine model (classification). |
| dmsvodem.sql | Creates a Support Vector Machine model (one-class classification). |
| dmsvrdem.sql | Creates a Support Vector Machine model (regression). |
| dmtxtfe.sql | Text mining: term extraction using CTX procedures. |
| dmtxtnmf.sql | Text mining using NMF feature extraction. |
| dmtxtsvm.sql | Text mining using SVM classification. |
See Also: Oracle Data Mining Administrator's Guide for information about installing, running, and viewing the sample programs.
The following types of mining activities are supported by the DBMS_DATA_MINING package:

Creating, dropping, and renaming a model: CREATE_MODEL, DROP_MODEL, RENAME_MODEL.

Scoring a model: APPLY.

Ranking APPLY results: RANK_APPLY.

Describing a model: GET_MODEL_DETAILS, GET_MODEL_SETTINGS, GET_DEFAULT_SETTINGS, GET_MODEL_SIGNATURE.

Computing test metrics for a model: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.

Exporting and importing models: EXPORT_MODEL, IMPORT_MODEL.
Of these, the first set represents DDL-like operations. The last set represents utilities. The rest are query-like operations in that they do not modify the model.
Note: Detailed information about the DBMS_DATA_MINING package is available in Oracle Database PL/SQL Packages and Types Reference.
You can view the models defined in your schema by querying the DM_USER_MODELS view. The following query on a Linux system lists the models in the schema of DMUSER. These models were created by the sample PL/SQL programs.
>sqlplus dmuser/dmuser_password
SQL> set linesize 200
SQL> set pagesize 100
SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE from DM_USER_MODELS;
NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE
--------------------- ---------------------- ------------------------------ -----------------
T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
AR_SH_SAMPLE ASSOCIATION_RULES APRIORI_ASSOCIATION_RULES
AI_SH_SAMPLE ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTH AFFINITY_CARD
ABN_SH_CLAS_SAMPLE CLASSIFICATION ADAPTIVE_BAYES_NETWORK AFFINITY_CARD
DT_SH_CLAS_SAMPLE CLASSIFICATION DECISION_TREE AFFINITY_CARD
NB_SH_CLAS_SAMPLE CLASSIFICATION NAIVE_BAYES AFFINITY_CARD
SVMC_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
OC_SH_CLUS_SAMPLE CLUSTERING O_CLUSTER
KM_SH_CLUS_SAMPLE CLUSTERING KMEANS
NMF_SH_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
SVMR_SH_REGR_SAMPLE REGRESSION SUPPORT_VECTOR_MACHINES AGE
The CREATE_MODEL procedure creates a mining model. The viewable contents of a mining model are provided to you through the GET_MODEL_DETAILS functions for each supported algorithm. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
The APPLY procedure creates and populates a pre-defined table. The columns of this table vary based on the particular mining function, algorithm, and target attribute type (numerical or categorical).
The RANK_APPLY procedure takes this results table as input and generates another table with results ranked based on a top-N input. Classification models can also be ranked based on cost. The column structure of this table varies based on the particular mining function, algorithm, and target attribute type (numerical or categorical).
The COMPUTE routines provided in DBMS_DATA_MINING calculate the most commonly used test metrics for classification. They are not tied to a particular model: they can compute the metrics from any meaningful data input, as long as the column structure of the input tables fits the specification of the apply results table and the targets table.
The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the SQL queries shown in the following sections to compute these metrics. Simply replace the italicized tokens with table and column names appropriate for your application.
SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A, targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;
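The arithmetic performed by the root mean square error query can be sketched outside the database. The following Python snippet is illustrative only (the case IDs and values are invented, not taken from the sample schema); it joins predictions to actuals by case ID and applies the same sqrt-of-average-squared-residual formula:

```python
import math

# Hypothetical apply results: case_id -> predicted target value
predictions = {"C1": 10.0, "C2": 20.0, "C3": 30.0}
# Hypothetical test targets: case_id -> actual target value
targets = {"C1": 12.0, "C2": 18.0, "C3": 33.0}

# Join on case id, average the squared residuals, then take the root,
# mirroring sqrt(avg((prediction - target)^2)) in the SQL above.
residuals = [predictions[cid] - targets[cid] for cid in targets]
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(round(rmse, 4))
```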
Given the targets_table generated from the test data with the following columns:

(case_id_column_name VARCHAR2, target_column_name NUMBER)

and the apply results table for regression with the following columns:

(case_id_column_name VARCHAR2, prediction NUMBER)

and an optional normalization table with the following columns:

(attribute_name VARCHAR2(30), scale NUMBER, shift NUMBER)

the query for mean absolute error is:
SELECT /*+ PARALLEL(T) PARALLEL(A) */
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B, normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;
You can fill in the italicized values with the actual column and table names chosen by you. If the data has not undergone normalization transformation, you can eliminate those references from the subquery. See the SVM regression sample program (dmsvrdem.sql) for an example.
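When the build data was normalized, the stored target must be denormalized (actual = stored * scale + shift) before it is compared with the prediction, which is what the subquery above does. A small Python sketch of the same mean absolute error computation, with invented scale, shift, and data values:

```python
# Hypothetical normalization parameters for the target attribute
scale, shift = 2.0, 5.0

# Stored (normalized) target values and model predictions, keyed by case id
normalized_targets = {"C1": 1.0, "C2": 2.0, "C3": 3.0}
predictions = {"C1": 7.5, "C2": 8.0, "C3": 11.5}

# Denormalize each target, then average the absolute errors,
# as in the mean absolute error query above.
errors = [abs((normalized_targets[cid] * scale + shift) - predictions[cid])
          for cid in normalized_targets]
mae = sum(errors) / len(errors)
print(mae)
```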
Given demographic data about a set of customers, this example predicts the customer response to an affinity card program using a classifier based on the Decision Tree algorithm.
Note: This example is taken from the sample program dmdtdemo.sql. See Oracle Data Mining Administrator's Guide for information about the sample programs.
The Decision Tree algorithm is capable of handling data that has not been specially prepared. This example uses data created from the base tables in the SH schema and presented through the following views:

MINING_DATA_BUILD_V (build data)
MINING_DATA_TEST_V (test data)
MINING_DATA_APPLY_V (scoring data)
Note: Data preparation techniques (using the DBMS_DATA_MINING_TRANSFORM package) are illustrated in many of the sample programs.
The following example creates a settings table and a cost matrix table for the model. The settings override the default classification algorithm (Naive Bayes) and specify the location of the cost matrix table.
set echo off
CREATE TABLE dt_sh_sample_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));
set echo on

-- CREATE AND POPULATE A COST MATRIX TABLE
--
-- A cost matrix is used to influence the weighting of misclassification
-- during model creation (and scoring).
--
CREATE TABLE dt_sh_sample_cost (
  actual_target_value    NUMBER,
  predicted_target_value NUMBER,
  cost                   NUMBER);
INSERT INTO dt_sh_sample_cost VALUES (0,0,0);
INSERT INTO dt_sh_sample_cost VALUES (0,1,1);
INSERT INTO dt_sh_sample_cost VALUES (1,0,8);
INSERT INTO dt_sh_sample_cost VALUES (1,1,0);
COMMIT;

BEGIN
  -- Populate settings table
  INSERT INTO dt_sh_sample_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
  INSERT INTO dt_sh_sample_settings VALUES
    (dbms_data_mining.clas_cost_table_name, 'dt_sh_sample_cost');
  COMMIT;
END;
/
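To see why the (actual=1, predicted=0) cost of 8 matters: a cost-sensitive classifier predicts the class with the lowest expected cost rather than the highest probability. The following Python sketch illustrates that decision rule with the cost matrix above; the class probabilities are invented for illustration:

```python
# Cost matrix as in the table above: cost[actual][predicted]
cost = {0: {0: 0, 1: 1},
        1: {0: 8, 1: 0}}

def min_cost_class(probs):
    """Pick the prediction p minimizing sum over actual classes a
    of probs[a] * cost[a][p] (the expected misclassification cost)."""
    expected = {p: sum(probs[a] * cost[a][p] for a in cost) for p in (0, 1)}
    return min(expected, key=expected.get)

# With P(class 1) = 0.2, plain max-probability scoring would predict 0,
# but the high cost of missing a true 1 flips the decision to 1:
# expected cost of predicting 0 is 0.2 * 8 = 1.6; of predicting 1, 0.8 * 1 = 0.8.
probs = {0: 0.8, 1: 0.2}
print(min_cost_class(probs))
```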
The following example creates the model using the predefined settings table.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DT_SH_Clas_sample',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'dt_sh_sample_settings');
END;
/

-- DISPLAY MODEL SETTINGS
-- This section illustrates the GET_MODEL_SETTINGS procedure.
-- It is not needed for Decision Tree models, because model
-- settings are present in the model details XML.
--
column setting_name  format a30
column setting_value format a30
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('DT_SH_Clas_sample'))
 ORDER BY setting_name;

-- DISPLAY MODEL SIGNATURE
-- This section illustrates the GET_MODEL_SIGNATURE procedure.
-- It is not needed for Decision Tree models, because the model
-- signature is present in the model details XML.
--
column attribute_name format a40
column attribute_type format a20
SELECT attribute_name, attribute_type
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SIGNATURE('DT_SH_Clas_sample'))
 ORDER BY attribute_name;

-- DISPLAY MODEL DETAILS
-- NOTE: The "&quot;" characters in this XML output are owing to
-- SQL*Plus behavior. Cut and paste this XML into a file,
-- and open the file in a browser to see correctly formatted XML.
--
SET long 2000000000
column dt_details format a320
SELECT dbms_data_mining.get_model_details_xml('DT_SH_Clas_sample').extract('/')
       AS DT_DETAILS
  FROM dual;
The following example computes a confusion matrix and accuracy using the PREDICTION function for Data Mining. It performs the computations both with and without the cost matrix. In this example, the cost matrix reduces the problematic misclassifications but also negatively impacts the overall model accuracy.
-- DISPLAY CONFUSION MATRIX WITHOUT APPLYING COST MATRIX
--
SELECT affinity_card AS actual_target_value,
       PREDICTION(DT_SH_Clas_sample USING *) AS predicted_target_value,
       COUNT(*) AS value
  FROM mining_data_test_v
 GROUP BY affinity_card, PREDICTION(DT_SH_Clas_sample USING *)
 ORDER BY 1, 2;

-- DISPLAY CONFUSION MATRIX APPLYING THE COST MATRIX
--
SELECT affinity_card AS actual_target_value,
       PREDICTION(DT_SH_Clas_sample COST MODEL USING *) AS predicted_target_value,
       COUNT(*) AS value
  FROM mining_data_test_v
 GROUP BY affinity_card, PREDICTION(DT_SH_Clas_sample COST MODEL USING *)
 ORDER BY 1, 2;

-- DISPLAY ACCURACY WITHOUT APPLYING COST MATRIX
--
SELECT ROUND(SUM(correct)/COUNT(*), 4) AS accuracy
  FROM (SELECT DECODE(affinity_card,
                      PREDICTION(DT_SH_Clas_sample USING *), 1, 0) AS correct
          FROM mining_data_test_v);

-- DISPLAY ACCURACY APPLYING THE COST MATRIX
--
SELECT ROUND(SUM(correct)/COUNT(*), 4) AS accuracy
  FROM (SELECT DECODE(affinity_card,
                      PREDICTION(DT_SH_Clas_sample COST MODEL USING *), 1, 0) AS correct
          FROM mining_data_test_v);
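The trade-off described above can be quantified directly from the two confusion matrices. The following Python sketch computes overall accuracy and total misclassification cost for each; the cell counts are invented for illustration, not taken from the sample run:

```python
# Confusion matrices as {(actual, predicted): count}; counts are invented.
without_cost = {(0, 0): 900, (0, 1): 54, (1, 0): 46, (1, 1): 100}
with_cost    = {(0, 0): 830, (0, 1): 124, (1, 0): 19, (1, 1): 127}

# Same cost matrix as in the model-build example above
cost = {(0, 0): 0, (0, 1): 1, (1, 0): 8, (1, 1): 0}

def accuracy(cm):
    """Fraction of cases on the diagonal (correct predictions)."""
    return (cm[(0, 0)] + cm[(1, 1)]) / sum(cm.values())

def total_cost(cm):
    """Sum of cell counts weighted by the cost matrix."""
    return sum(count * cost[cell] for cell, count in cm.items())

# Applying the cost matrix lowers accuracy but sharply reduces the
# expensive (actual=1, predicted=0) errors, so total cost drops.
print(accuracy(without_cost), total_cost(without_cost))
print(accuracy(with_cost), total_cost(with_cost))
```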
The following example illustrates several ways of scoring the Decision Tree model. It uses the PREDICTION, PREDICTION_COST, PREDICTION_SET, and PREDICTION_DETAILS functions to predict information for four different business cases:
Find the ten customers who live in Italy and could be convinced, with the least expense, to use an affinity card.
Find the average age of customers who are likely to use an affinity card, based on marital status, education, and household size.
List ten customers with the likelihood and cost that they will use or reject an affinity card.
Find the segmentation for customers who work in technical support and are under 25.
------------------
-- BUSINESS CASE 1
-- Find the 10 customers who live in Italy that are least expensive
-- to be convinced to use an affinity card.
--
WITH cust_italy AS (
  SELECT cust_id
    FROM mining_data_apply_v
   WHERE country_name = 'Italy'
   ORDER BY PREDICTION_COST(DT_SH_Clas_sample, 1 COST MODEL USING *) ASC, 1
)
SELECT cust_id
  FROM cust_italy
 WHERE rownum < 11;

------------------
-- BUSINESS CASE 2
-- Find the average age of customers who are likely to use an
-- affinity card.
-- Include the build-time cost matrix in the prediction.
-- Only take into account CUST_MARITAL_STATUS, EDUCATION, and
-- HOUSEHOLD_SIZE as predictors.
-- Break out the results by gender.
--
column cust_gender format a12
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
  FROM mining_data_apply_v
 WHERE PREDICTION(dt_sh_clas_sample COST MODEL
                  USING cust_marital_status, education, household_size) = 1
 GROUP BY cust_gender
 ORDER BY cust_gender;

------------------
-- BUSINESS CASE 3
-- List ten customers (ordered by their id) along with likelihood and cost
-- to use or reject the affinity card (Note: while this example has a
-- binary target, such a query is useful in multi-class classification,
-- for example Low, Med, High).
--
column prediction format 9;
SELECT T.cust_id, S.prediction, S.probability, S.cost
  FROM (SELECT cust_id,
               PREDICTION_SET(dt_sh_clas_sample COST MODEL USING *) pset
          FROM mining_data_apply_v
         WHERE cust_id < 100011) T,
       TABLE(T.pset) S
 ORDER BY cust_id, S.prediction;

------------------
-- BUSINESS CASE 4
-- Find the segmentation (resulting tree node) for customers who
-- work in Tech support and are under 25.
--
column education format a30;
column treenode  format a40;
SELECT cust_id, education,
       PREDICTION_DETAILS(dt_sh_clas_sample USING *) treenode
  FROM mining_data_apply_v
 WHERE occupation = 'TechSup' AND age < 25
 ORDER BY 1;