Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2) Part Number B14340-01
This chapter provides information to help you build data mining applications in PL/SQL. It includes sample code for building, testing, and scoring a classification model, and it illustrates the use of SQL functions for model scoring.
The examples included in this chapter are taken from the Data Mining sample applications available on the Database companion CD. When you install the companion CD, the Data Mining sample applications are copied to /rdbms/demo/ in the Oracle home directory.
The following directory listing command lists the sample data mining programs on a Linux system. Use an equivalent command to list the sample programs on other operating systems.
>ls $ORACLE_HOME/rdbms/demo/dm*
Table 4-1 lists the sample PL/SQL programs.
Table 4-1 Sample PL/SQL Programs

| Application | Description |
|---|---|
| dmabdemo.sql | Creates an Adaptive Bayes Network model (classification). |
| dmaidemo.sql | Creates an Attribute Importance model. |
| dmardemo.sql | Creates an Association Rules model. |
| dmdtdemo.sql | Creates a Decision Tree model (classification). |
| dmkmdemo.sql | Creates a k-Means model (clustering). |
| dmnbdemo.sql | Creates a Naive Bayes model (classification). |
| dmnmdemo.sql | Creates a Non-Negative Matrix Factorization model (feature extraction). |
| dmocdemo.sql | Creates an O-Cluster model (clustering). |
| dmsvcdem.sql | Creates a Support Vector Machine model (classification). |
| dmsvodem.sql | Creates a Support Vector Machine model (one-class classification). |
| dmsvrdem.sql | Creates a Support Vector Machine model (regression). |
| dmtxtfe.sql | Text mining: term extraction using CTX procedures. |
| dmtxtnmf.sql | Text mining using NMF feature extraction. |
| dmtxtsvm.sql | Text mining using SVM classification. |
See Also: Oracle Data Mining Administrator's Guide for information about installing, running, and viewing the sample programs.
The following types of mining activities are supported by the DBMS_DATA_MINING package:

Creating, dropping, and renaming a model: CREATE_MODEL, DROP_MODEL, RENAME_MODEL.

Scoring a model: APPLY.

Ranking APPLY results: RANK_APPLY.

Describing a model: GET_MODEL_DETAILS, GET_MODEL_SETTINGS, GET_DEFAULT_SETTINGS, GET_MODEL_SIGNATURE.

Computing test metrics for a model: COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, and COMPUTE_ROC.

Exporting and importing models: EXPORT_MODEL, IMPORT_MODEL.
Of these, the first set represents DDL-like operations. The last set represents utilities. The rest are query-like operations in that they do not modify the model.
Note: Detailed information about the DBMS_DATA_MINING package is available in Oracle Database PL/SQL Packages and Types Reference.
You can view the models defined in your schema by querying the DM_USER_MODELS view. The following query on a Linux system lists the models in the schema of DMUSER. These models were created by the sample PL/SQL programs.
>sqlplus dmuser/dmuser_password
SQL> set linesize 200
SQL> set pagesize 100
SQL> select NAME, FUNCTION_NAME, ALGORITHM_NAME, TARGET_ATTRIBUTE from DM_USER_MODELS;
NAME FUNCTION_NAME ALGORITHM_NAME TARGET_ATTRIBUTE
--------------------- ---------------------- ------------------------------ -----------------
T_NMF_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
AR_SH_SAMPLE ASSOCIATION_RULES APRIORI_ASSOCIATION_RULES
AI_SH_SAMPLE ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTH AFFINITY_CARD
ABN_SH_CLAS_SAMPLE CLASSIFICATION ADAPTIVE_BAYES_NETWORK AFFINITY_CARD
DT_SH_CLAS_SAMPLE CLASSIFICATION DECISION_TREE AFFINITY_CARD
NB_SH_CLAS_SAMPLE CLASSIFICATION NAIVE_BAYES AFFINITY_CARD
SVMC_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES AFFINITY_CARD
OC_SH_CLUS_SAMPLE CLUSTERING O_CLUSTER
KM_SH_CLUS_SAMPLE CLUSTERING KMEANS
NMF_SH_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR
SVMR_SH_REGR_SAMPLE REGRESSION SUPPORT_VECTOR_MACHINES AGE
The CREATE_MODEL procedure creates a mining model. The viewable contents of a mining model are provided to you through the GET_MODEL_DETAILS functions for each supported algorithm. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.
The APPLY procedure creates and populates a pre-defined table. The columns of this table vary based on the particular mining function, algorithm, and target attribute type (numerical or categorical).
The RANK_APPLY procedure takes this results table as input and generates another table with results ranked based on a top-N input. Classification models can also be ranked based on cost. The column structure of this table varies based on the particular mining function, algorithm, and target attribute type (numerical or categorical).
The COMPUTE routines provided in DBMS_DATA_MINING calculate the most commonly used test metrics for classification. They are not tied to a particular model: they can compute the metrics from any meaningful data input, as long as the column structure of the input tables fits the specification of the apply results table and the targets table.
The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the SQL queries shown in the following sections to compute these metrics. Simply replace the italicized tokens with table and column names appropriate for your application.
SELECT SQRT(AVG((A.prediction - B.target_column_name) *
                (A.prediction - B.target_column_name))) rmse
  FROM apply_results_table A, targets_table B
 WHERE A.case_id_column_name = B.case_id_column_name;
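The arithmetic performed by the root mean square error query can be sketched outside the database. The following Python snippet is illustrative only (the case IDs and values are invented, not taken from the sample schema); it joins predictions to actuals by case ID and applies the same sqrt-of-average-squared-residual formula:

```python
import math

# Hypothetical apply results: case_id -> predicted target value
predictions = {"C1": 10.0, "C2": 20.0, "C3": 30.0}
# Hypothetical test targets: case_id -> actual target value
targets = {"C1": 12.0, "C2": 18.0, "C3": 33.0}

# Join on case id, average the squared residuals, then take the root,
# mirroring sqrt(avg((prediction - target)^2)) in the SQL above.
residuals = [predictions[cid] - targets[cid] for cid in targets]
rmse = math.sqrt(sum(r * r for r in residuals) / len(residuals))
print(round(rmse, 4))
```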
Given the targets_table generated from the test data with the following columns:

(case_id_column_name VARCHAR2, target_column_name NUMBER)

and the apply results table for regression with the following columns:

(case_id_column_name VARCHAR2, prediction NUMBER)

and an optional normalization table with the following columns:

(attribute_name VARCHAR2(30), scale NUMBER, shift NUMBER)

the query for mean absolute error is:
SELECT /*+ PARALLEL(T) PARALLEL(A) */
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B, normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;
You can fill in the italicized values with the actual column and table names chosen by you. If the data has not undergone normalization transformation, you can eliminate those references from the subquery. See the SVM regression sample program (dmsvrdem.sql) for an example.
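When the build data was normalized, the stored target must be denormalized (actual = stored * scale + shift) before it is compared with the prediction, which is what the subquery above does. A small Python sketch of the same mean absolute error computation, with invented scale, shift, and data values:

```python
# Hypothetical normalization parameters for the target attribute
scale, shift = 2.0, 5.0

# Stored (normalized) target values and model predictions, keyed by case id
normalized_targets = {"C1": 1.0, "C2": 2.0, "C3": 3.0}
predictions = {"C1": 7.5, "C2": 8.0, "C3": 11.5}

# Denormalize each target, then average the absolute errors,
# as in the mean absolute error query above.
errors = [abs((normalized_targets[cid] * scale + shift) - predictions[cid])
          for cid in normalized_targets]
mae = sum(errors) / len(errors)
print(mae)
```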
Given demographic data about a set of customers, this example predicts the customer response to an affinity card program using a classifier based on the Decision Tree algorithm.
Note: This example is taken from the sample program dmdtdemo.sql. See Oracle Data Mining Administrator's Guide for information about the sample programs.
The Decision Tree algorithm is capable of handling data that has not been specially prepared. This example uses data created from the base tables in the SH schema and presented through the following views:

MINING_DATA_BUILD_V (build data)
MINING_DATA_TEST_V (test data)
MINING_DATA_APPLY_V (scoring data)
Note: Data preparation techniques (using the DBMS_DATA_MINING_TRANSFORM package) are illustrated in many of the sample programs.
The following example creates a settings table and a cost matrix table for the model. The settings override the default classification algorithm (Naive Bayes) and specify the location of the cost matrix table.
set echo off
CREATE TABLE dt_sh_sample_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));
set echo on

-- CREATE AND POPULATE A COST MATRIX TABLE
--
-- A cost matrix is used to influence the weighting of misclassification
-- during model creation (and scoring).
--
CREATE TABLE dt_sh_sample_cost (
  actual_target_value    NUMBER,
  predicted_target_value NUMBER,
  cost                   NUMBER);
INSERT INTO dt_sh_sample_cost VALUES (0,0,0);
INSERT INTO dt_sh_sample_cost VALUES (0,1,1);
INSERT INTO dt_sh_sample_cost VALUES (1,0,8);
INSERT INTO dt_sh_sample_cost VALUES (1,1,0);
COMMIT;

BEGIN
  -- Populate settings table
  INSERT INTO dt_sh_sample_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_decision_tree);
  INSERT INTO dt_sh_sample_settings VALUES
    (dbms_data_mining.clas_cost_table_name, 'dt_sh_sample_cost');
  COMMIT;
END;
/
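To see why the (actual=1, predicted=0) cost of 8 matters: a cost-sensitive classifier predicts the class with the lowest expected cost rather than the highest probability. The following Python sketch illustrates that decision rule with the cost matrix above; the class probabilities are invented for illustration:

```python
# Cost matrix as in the table above: cost[actual][predicted]
cost = {0: {0: 0, 1: 1},
        1: {0: 8, 1: 0}}

def min_cost_class(probs):
    """Pick the prediction p minimizing sum over actual classes a
    of probs[a] * cost[a][p] (the expected misclassification cost)."""
    expected = {p: sum(probs[a] * cost[a][p] for a in cost) for p in (0, 1)}
    return min(expected, key=expected.get)

# With P(class 1) = 0.2, plain max-probability scoring would predict 0,
# but the high cost of missing a true 1 flips the decision to 1:
# expected cost of predicting 0 is 0.2 * 8 = 1.6; of predicting 1, 0.8 * 1 = 0.8.
probs = {0: 0.8, 1: 0.2}
print(min_cost_class(probs))
```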
The following example creates the model using the predefined settings table.
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'DT_SH_Clas_sample',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'dt_sh_sample_settings');
END;
/

-- DISPLAY MODEL SETTINGS
-- This section illustrates the GET_MODEL_SETTINGS procedure.
-- It is not needed for Decision Tree models, because model
-- settings are present in the model details XML.
--
column setting_name  format a30
column setting_value format a30
SELECT setting_name, setting_value
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SETTINGS('DT_SH_Clas_sample'))
 ORDER BY setting_name;

-- DISPLAY MODEL SIGNATURE
-- This section illustrates the GET_MODEL_SIGNATURE procedure.
-- It is not needed for Decision Tree models, because the model
-- signature is present in the model details XML.
--
column attribute_name format a40
column attribute_type format a20
SELECT attribute_name, attribute_type
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_SIGNATURE('DT_SH_Clas_sample'))
 ORDER BY attribute_name;

-- DISPLAY MODEL DETAILS
-- NOTE: The "&quot;" characters in this XML output are owing to
-- SQL*Plus behavior. Cut and paste this XML into a file,
-- and open the file in a browser to see correctly formatted XML.
--
SET long 2000000000
column dt_details format a320
SELECT dbms_data_mining.get_model_details_xml('DT_SH_Clas_sample').extract('/')
       AS DT_DETAILS
  FROM dual;
The following example computes a confusion matrix and accuracy using the PREDICTION function for Data Mining. It performs the computations both with and without the cost matrix. In this example, the cost matrix reduces the problematic misclassifications but also negatively impacts the overall model accuracy.
-- DISPLAY CONFUSION MATRIX WITHOUT APPLYING COST MATRIX
--
SELECT affinity_card AS actual_target_value,
       PREDICTION(DT_SH_Clas_sample USING *) AS predicted_target_value,
       COUNT(*) AS value
  FROM mining_data_test_v
 GROUP BY affinity_card, PREDICTION(DT_SH_Clas_sample USING *)
 ORDER BY 1, 2;

-- DISPLAY CONFUSION MATRIX APPLYING THE COST MATRIX
--
SELECT affinity_card AS actual_target_value,
       PREDICTION(DT_SH_Clas_sample COST MODEL USING *) AS predicted_target_value,
       COUNT(*) AS value
  FROM mining_data_test_v
 GROUP BY affinity_card, PREDICTION(DT_SH_Clas_sample COST MODEL USING *)
 ORDER BY 1, 2;

-- DISPLAY ACCURACY WITHOUT APPLYING COST MATRIX
--
SELECT ROUND(SUM(correct)/COUNT(*), 4) AS accuracy
  FROM (SELECT DECODE(affinity_card,
                      PREDICTION(DT_SH_Clas_sample USING *), 1, 0) AS correct
          FROM mining_data_test_v);

-- DISPLAY ACCURACY APPLYING THE COST MATRIX
--
SELECT ROUND(SUM(correct)/COUNT(*), 4) AS accuracy
  FROM (SELECT DECODE(affinity_card,
                      PREDICTION(DT_SH_Clas_sample COST MODEL USING *), 1, 0) AS correct
          FROM mining_data_test_v);
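The trade-off described above can be quantified directly from the two confusion matrices. The following Python sketch computes overall accuracy and total misclassification cost for each; the cell counts are invented for illustration, not taken from the sample run:

```python
# Confusion matrices as {(actual, predicted): count}; counts are invented.
without_cost = {(0, 0): 900, (0, 1): 54, (1, 0): 46, (1, 1): 100}
with_cost    = {(0, 0): 830, (0, 1): 124, (1, 0): 19, (1, 1): 127}

# Same cost matrix as in the model-build example above
cost = {(0, 0): 0, (0, 1): 1, (1, 0): 8, (1, 1): 0}

def accuracy(cm):
    """Fraction of cases on the diagonal (correct predictions)."""
    return (cm[(0, 0)] + cm[(1, 1)]) / sum(cm.values())

def total_cost(cm):
    """Sum of cell counts weighted by the cost matrix."""
    return sum(count * cost[cell] for cell, count in cm.items())

# Applying the cost matrix lowers accuracy but sharply reduces the
# expensive (actual=1, predicted=0) errors, so total cost drops.
print(accuracy(without_cost), total_cost(without_cost))
print(accuracy(with_cost), total_cost(with_cost))
```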
The following example illustrates several ways of scoring the Decision Tree model. It uses the PREDICTION, PREDICTION_COST, PREDICTION_SET, and PREDICTION_DETAILS functions to predict information for four different business cases:
Find the ten customers who live in Italy and could be convinced, with the least expense, to use an affinity card.
Find the average age of customers who are likely to use an affinity card, based on marital status, education, and household size.
List ten customers with the likelihood and cost that they will use or reject an affinity card.
Find the segmentation for customers who work in technical support and are under 25.
------------------
-- BUSINESS CASE 1
-- Find the 10 customers who live in Italy that are least expensive
-- to be convinced to use an affinity card.
--
WITH cust_italy AS (
  SELECT cust_id
    FROM mining_data_apply_v
   WHERE country_name = 'Italy'
   ORDER BY PREDICTION_COST(DT_SH_Clas_sample, 1 COST MODEL USING *) ASC, 1
)
SELECT cust_id
  FROM cust_italy
 WHERE rownum < 11;

------------------
-- BUSINESS CASE 2
-- Find the average age of customers who are likely to use an
-- affinity card.
-- Include the build-time cost matrix in the prediction.
-- Only take into account CUST_MARITAL_STATUS, EDUCATION, and
-- HOUSEHOLD_SIZE as predictors.
-- Break out the results by gender.
--
column cust_gender format a12
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
  FROM mining_data_apply_v
 WHERE PREDICTION(dt_sh_clas_sample COST MODEL
                  USING cust_marital_status, education, household_size) = 1
 GROUP BY cust_gender
 ORDER BY cust_gender;

------------------
-- BUSINESS CASE 3
-- List ten customers (ordered by their id) along with likelihood and cost
-- to use or reject the affinity card (Note: while this example has a
-- binary target, such a query is useful in multi-class classification,
-- for example Low, Med, High).
--
column prediction format 9;
SELECT T.cust_id, S.prediction, S.probability, S.cost
  FROM (SELECT cust_id,
               PREDICTION_SET(dt_sh_clas_sample COST MODEL USING *) pset
          FROM mining_data_apply_v
         WHERE cust_id < 100011) T,
       TABLE(T.pset) S
 ORDER BY cust_id, S.prediction;

------------------
-- BUSINESS CASE 4
-- Find the segmentation (resulting tree node) for customers who
-- work in Tech support and are under 25.
--
column education format a30;
column treenode  format a40;
SELECT cust_id, education,
       PREDICTION_DETAILS(dt_sh_clas_sample USING *) treenode
  FROM mining_data_apply_v
 WHERE occupation = 'TechSup' AND age < 25
 ORDER BY 1;