Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2) Part Number B14340-01 |
|
|
View PDF |
Models created by ODM APIs or by Oracle Data Miner are stored in the Database. This chapter provides information about viewing, accessing, configuring, exporting and importing models.
This chapter contains the following topics:
A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage is opaque to the user. A model is not a database schema object.
You can view the contents of a model — that is, the patterns and rules that constitute a mining model — using algorithm-specific GET_MODEL_DETAILS
functions in the DBMS_DATA_MINING
PL/SQL package. These functions are documented in Oracle Database PL/SQL Packages and Types Reference. See "Exploring Model Details" for information on model details in the Java API.
You can view a list of the models in your schema by querying the DM_USER_MODELS
view. The columns of the DM_USER_MODELS
view are described in Table 3-1.
Column | Data Type | Description |
---|---|---|
|
|
Name of the model. |
|
|
The model function. See Chapter 1 for an overview of mining functions. |
|
|
The algorithm used by the model. See Chapter 1 for algorithms used by the mining functions. |
|
|
The date on which the model was created. |
|
|
The duration of the model build process. |
|
|
The attribute designated as the target of a classification model. |
|
|
The size of the model in megabytes. |
The following query lists the demo programs in the DM_USER
schema.
SQL> select 'NAME', 'FUNCTION_NAME', 'ALGORITHM_NAME' from DM_USER_MODELS; NAME FUNCTION_NAME ALGORITHM_NAME -------------------------------------------------------------------- ABN_SH_CLAS_SAMPLE CLASSIFICATION ADAPTIVE_BAYES_NETWORK AI_SH_SAMPLE ATTRIBUTE_IMPORTANCE MINIMUM_DESCRIPTION_LENGTH AR_SH_SAMPLE ASSOCIATION_RULES APRIORI_ASSOCIATION_RULES DT_SH_CLAS_SAMPLE CLASSIFICATION DECISION_TREE KM_SH_CLUS_SAMPLE CLUSTERING KMEANS NB_SH_CLAS_SAMPLE CLASSIFICATION NAIVE_BAYES NMF_SH_SAMPLE FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR OC_SH_CLUS_SAMPLE CLUSTERING O_CLUSTER SVMC_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES SVMO_SH_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES SVMR_SH_REGR_SAMPLE REGRESSION SUPPORT_VECTOR_MACHINES T_SVM_CLAS_SAMPLE CLASSIFICATION SUPPORT_VECTOR_MACHINES
See Also: Oracle Data Mining Administrator's Guide for information about installing, running, and viewing the demo programs. |
Although ODM models are not stored as Oracle schema objects, their names must conform to Database requirements for nonquoted identifiers. Additionally, model names must be less than 25 bytes long.
Oracle requires that nonquoted identifiers contain only alphanumeric characters, the underscore (_), dollar sign ($), and pound sign (#); the initial character must be alphabetic. Oracle strongly discourages the use of the dollar sign and pound sign in nonquoted literals.
Oracle Data Mining does not support a general privilege model that spans multiple users. GRANT
and REVOKE
of read and update privileges on a mining model across user schemas are not yet supported.
You can only read and update models in your own schema. If you want to modify the settings of a model or view its details, you must be logged in with the identity of the schema that owns the model. Results of all mining operations are generated in the schema that owns the model.
Models in one schema can be exported to other schemas. You can import a model into your own schema once it has been exported to an accessible location.
Mining models are included when a database or schema is exported or imported with the Oracle Data Pump utility. You can export and import individual models or groups of models using the ODM SQL and Java APIs.
You can use the EXPORT_MODEL
procedure in the DBMS_DATA_MINING
package to export a model or a group of models to a dump file. Models can be imported from the dump file using IMPORT_MODEL
.
The Java API uses the ExportTask
and ImportTask
standard JDM interfaces to provide the same export/import functionality.
See Also:
|
A settings table is a relational table that provides configuration information for a specific model. You must create a settings table if you want a model to have any nondefault characteristics. You will supply the name of the settings table when you create the model.
You must create the settings table in the schema of the model. You can choose the name of the settings table, but the column names and their types must be defined as shown.
Column Name | Data Type |
---|---|
setting_name |
VARCHAR2(30) |
setting_value |
VARCHAR2(128) |
The values inserted into the setting_name
column are one or more of several constants defined in the DBMS_DATA_MINING
package. Depending on what the setting name denotes, the value for the setting_value
column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value
column is defined to be VARCHAR2
. You can explicitly cast numerical inputs to string using the TO_CHAR()
function, or you can rely on the implicit type conversion provided by the Database.
The settings described in Table 3-2 apply to a mining function. Use these settings to specify the algorithm that the model will use, the location of cost matrix and prior probabilities tables, and other function-specific characteristics. See Table 1-1, "Predictive Data Mining Functions" and Table 1-2, "Descriptive Data Mining Functions" for information about mining functions.
Table 3-2 Data Mining Function Settings
Algorithm Settings | Setting Value (with Permissible Value Ranges) |
---|---|
|
|
The name of a relational table that specifies a cost matrix. The column requirements for this table are described in "Costs". This input is applicable only for Decision Tree algorithms, since this is the only algorithm that supports a cost matrix at build time. The cost matrix table must be present in the current user's schema. |
|
The name of a relational table that specifies prior probabilities. The column requirements for this table are described in "Priors". This input is applicable only for classification algorithms. Decision Tree is the only classification algorithm that does not use priors. The prior probabilities table must be present in the current user's schema. For SVM classification, this setting identifies a table of weights. |
|
Number of clusters generated by a clustering algorithm. Default is 10. |
|
Number of features to be extracted. Default value estimated from the data by the algorithm. |
|
Maximum rule length for AR algorithm. Default is 4. |
|
Minimum confidence value for AR algorithm Default is 0.1. |
|
Minimum support value for AR algorithm Default is 0.1. |
Table 3-3 through Table 3-9 provide algorithm-specific settings. You can use these settings to tune the behavior of the algorithm.
Table 3-3 Algorithm Settings for Adaptive Bayes Network
Table 3-4 Algorithm Settings for Naive Bayes
Setting Name | Setting Value (with Permissible Value Ranges) |
---|---|
|
Value of singleton threshold for NB algorithm Default value is 0.01 |
|
Value of pairwise threshold for NB algorithm Default is 0.01. |
Table 3-5 Algorithm Settings for Decision Tree
Setting Name | Setting Value (with Permissible Value Ranges) |
---|---|
|
Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity for the entities in the node. Homogeneity is measured in accordance with a metric. For classification (Binary or multi-class targets), the supported metrics are gini and entropy.
|
|
Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). Default is 7. |
|
No child shall have fewer records than this number, which is expressed as a percentage of the training rows. Default is 0.05, indicating 0.05%. |
|
Criteria for splits: minimum number of records in a parent node expressed as a percent of the total number of records used to train the model. No split is attempted if number of records is below this value. Default is 0.1, indicating 0.1%. |
|
No child shall have fewer records than this number. Default is 10. |
|
Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value. Default is 20. |
Table 3-6 Algorithm Settings for Support Vector Machines
Table 3-7 Algorithm Settings for Non-Negative Matrix Factorization
Table 3-8 Algorithm Settings for O-Cluster
Table 3-9 Algorithm Settings for k-Means
This example creates a settings table for an SVM classification model and edits the individual values using SQL DML.
CREATE TABLE drugstore_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(128)) BEGIN -- override the default for convergence tolerance for SVM Classification INSERT INTO drugstore_model_settings (setting_name, setting_value) VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081)); COMMIT; END;
The table function GET_DEFAULT_SETTINGS
provides you all the default settings for mining functions and algorithms. If you intend to override all the default settings, you can create a seed settings table and edit them using SQL DML.
BEGIN CREATE TABLE drug_store_settings AS SELECT setting_name, setting_value FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS WHERE setting_name LIKE 'SVMS_%'; -- update the values using appropriate DML END;
You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS
.
BEGIN CREATE TABLE my_new_model_settings AS SELECT setting_name, setting_value FROM TABLE (DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model')); END;
In classification models, you can specify a cost matrix to represent the costs associated with false positive and false negative predictions. A cost matrix can be used in testing and scoring most classification models.
The Decision Tree algorithm can use a cost matrix at build time.To specify the cost matrix, you must create a cost matrix table and provide its name in the clas_cost_table_name
setting for the Decision Tree model. See "Build Settings" for an example.
If you are using the Java API, instantiate a CostMatrix
object and specify the name of the table as a parameter to the dmeConn.saveObject
method for the object.
The cost matrix table must have these columns.
Column Name | Data Type |
---|---|
actual_target_value |
VARCHAR2(4000) for categorical targets
|
predicted_target_value |
VARCHAR2(4000)
|
cost |
NUMBER |
ODM enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.
The data input to each test computation (a COMPUTE
procedure in PL/SQL, or a TestMetrics
object in Java) is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the computation generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY
results, without rebuilding the model and applying it against the same test data for every iteration.
Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY
operation along with the results of APPLY
on your scoring data. RANK_APPLY
will provide your new data ranked by cost.
In most classification models, you can specify prior probabilities to offset differences in distribution between the build data and the real population (scoring data). Priors can be used in building any classification model that uses a Bayesian algorithm. Priors are not used by the Decision Tree algorithm.
To specify prior probabilities, you must create a priors table and provide its name in the clas_priors_table_name
setting for the model. If you are using the Java API, use a setPriorProbabilitiesMap
object in the classification function settings for the model.
SVM Classification uses weights to correct for differences in target distribution. Use the priors table to specify weights for SVM Classification models.
The priors table must have these columns.
Column Name | Data Type |
---|---|
target_value |
VARCHAR2 for categorical targets
|
prior_probability |
NUMBER |