Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2), Part Number B14340-01
This chapter introduces the Oracle Data Mining (ODM) Application Programming Interfaces (APIs). ODM supports comprehensive PL/SQL and Java APIs, SQL functions, and table functions that implement the Basic Local Alignment Search Tool (BLAST) for life sciences applications.
Oracle 10g Release 2 (10.2) introduces several significant new features in the ODM APIs. Among these are the Decision Tree algorithm for classification and the One-Class SVM algorithm for anomaly detection. New predictive analytics, which automate the process of predictive data mining, and new built-in scoring functions, which return mining results within the context of a standard SQL statement, are also new in Oracle 10.2.
Oracle 10.2 introduces a completely new Java API for data mining. The Java API is an Oracle implementation of the Java Data Mining (JDM) 1.0 standard. It replaces the proprietary Java API that was available in Oracle 10g.
The Java API is layered on the PL/SQL API, and the two APIs are fully interoperable. For example, you can run a SQL script to create a model and then test and apply the model from a Java application.
Note: Model interoperability is new in Oracle 10.2. In Oracle 10g, the Java API was incompatible with the PL/SQL API. See Chapter 8 for information on migrating ODM 10g Java applications to the new API.
See Also: Oracle Data Mining Concepts and Oracle Database New Features for a complete list of new features in Oracle 10g Release 2 (10.2) Data Mining.
ODM supports both predictive and descriptive mining functions. Predictive functions, known as supervised learning, use training data to predict a target value. Descriptive functions, known as unsupervised learning, identify relationships intrinsic to the data. Each mining function identifies a class of problems to be solved, and each can be implemented with one or more algorithms.
The predictive data mining functions are described in Table 1-1. The algorithm abbreviations introduced in the table are used throughout this manual.
Table 1-1 Predictive Data Mining Functions

Function | Description | Algorithms
---|---|---
Classification | Predicts a categorical target value for each case. | Naive Bayes (NB), Adaptive Bayes Network (ABN), Support Vector Machine (SVM), Decision Tree (DT)
Regression | Predicts a continuous numerical target value for each case. | Support Vector Machine (SVM)
Attribute Importance | Ranks attributes by their influence in predicting a target. | Minimum Description Length (MDL)
Anomaly Detection | Identifies rare or atypical cases, using data with no target. | One-Class Support Vector Machine (SVM)
The descriptive data mining functions are described in Table 1-2.
Table 1-2 Descriptive Data Mining Functions

Function | Description | Algorithms
---|---|---
Clustering | Groups similar cases together into segments. | Enhanced k-Means (KM), Orthogonal Partitioning Clustering (O-Cluster, OC)
Association | Finds items that tend to occur together and the rules that relate them. | Apriori (AP)
Feature Extraction | Combines existing attributes into a new, smaller set of features. | Non-Negative Matrix Factorization (NMF)
The first step in designing a data mining application is to analyze the business problem and determine the mining function and algorithm that best addresses it. The second step is to examine the data and determine how it should be prepared for mining.
Once you have identified the mining function and algorithm, and implemented the data transformations, you can develop a sample data mining application. A degree of intuition is involved in making these application choices. You might develop, test, and deploy your sample application and then determine that the results aren't quite what you are looking for. In this case, you might try different or additional data transformations, or you might try a different or additional algorithm.
In any case, the process of developing a data mining application is iterative. It involves testing the model, evaluating test metrics, making adjustments in the model, and re-evaluating.
See Also: Oracle Data Mining Concepts for information to help you approach a given data mining problem.
Although it is common to try different approaches to solving a data mining problem, each application must accomplish several basic tasks.
Prepare the data. One data set is needed for building the model; additional data sets may be necessary for testing and scoring the model, depending on the algorithm. In most cases, the data must be prepared with transformations that enhance or facilitate the effectiveness of the model. Each data set must be prepared in the same way.
Create a model using the build data.
Evaluate the model.
For classification and regression models, this is the application of the model to a set of test data, and the computation of various test metrics.
For clustering models, this is the examination of the clusters identified during model creation.
For feature extraction models, this is the examination of the features identified during model creation.
For attribute importance and association models, evaluation is the final step in the mining process. These models cannot be scored against new data.
Apply (score) the model. This is the process of deploying the model to the data of interest.
For classification and regression models, scoring is the application of the "trained" model to the actual population. The result is the best prediction for a target value in each record.
For clustering models, scoring is the application of clusters identified by the model to the actual population. The result is the probability of cluster membership for each record.
For feature extraction models, scoring is the mapping of features defined by the model to the actual population. The result is a reduced set of predictors in each record.
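As a sketch of the apply step in the PL/SQL API, the following call scores a classification model against new data. The model, table, and column names here are hypothetical, used only for illustration:

```sql
-- Apply a previously built classification model to new data.
-- Model, table, and column names below are hypothetical.
BEGIN
  DBMS_DATA_MINING.APPLY (
    model_name          => 'campaign_model',
    data_table_name     => 'new_customers',
    case_id_column_name => 'cust_id',
    result_table_name   => 'campaign_scores');
END;
/
```

The result table contains one row per case and predicted target value, along with the probability of that prediction.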
The basic mining steps for each algorithm are summarized in Table 1-3. Some steps, such as priors and costs and specific test metrics, are optional. The individual steps are described in later sections of this manual.
Table 1-3 Major Steps in a Data Mining Application
Function/Algorithm | Build | Evaluate | Apply
---|---|---|---
Classification (NB, ABN, SVM, DT) | Create the model from build data. | Apply the model to test data and compute test metrics. | Apply the model to new data; return the best prediction and its probability for each case.
Regression (SVM) | Create the model from build data. | Apply the model to test data and compute test metrics. | Apply the model to new data; return the best prediction for each case.
Attribute Importance (MDL) | Create the model from build data. | Retrieve model details, consisting of a list of attributes with their importance ranking. | Not supported.
Clustering (KM, OC) | Create the model from build data. | Retrieve model details to obtain information about clusters in the data. | Apply the model to new data; return the probability of cluster membership for each case.
Association (AP) | Create the model from build data. | Retrieve frequent item sets, and rules that define the item sets. | Not supported.
Feature Extraction (NMF) | Create the model from build data. | Retrieve model details, consisting of a list of features with their importance ranking. | Apply the model to new data; return a reduced set of predictors for each case.
Data sets used by Oracle Data Mining are stored in tables, which can be accessed through relational views. The rows are referred to as cases or records. A case ID column specifies a unique identifier for each case, for example the customer ID in a table of customer data.
Columns referred to as attributes or fields specify a set of predictors. Supervised models (with the exception of One-Class SVM) also use a target column. For example, a regression model might predict customer income level (the target), given customer date of birth and gender (the predictors). Unsupervised models use a set of predictors but no target.
ODM distinguishes between two types of attributes: categorical and numerical. Categorical attributes are values that belong to a given category or class, for example marital status or job title. Numerical attributes are values in a continuum, for example income or age.
Column attributes can have a scalar data type, or they can contain nested tables (collection types) of type DM_NESTED_NUMERICALS or DM_NESTED_CATEGORICALS. Some ODM algorithms support text columns. Text must be indexed and converted to one of the collection types prior to data mining (see Chapter 5).
In most cases, data sets must be specifically prepared before building, testing, or applying a model. Preparation includes transformations that improve model accuracy and performance. Common data transformations are:
Binning — grouping related values together to reduce the number of distinct values for an attribute.
Normalization — converting individual attribute values so that they fall within a range, typically 0.0 – 1.0 or -1 – +1.
Clipping — setting extreme attribute values to a single value (winsorizing) or causing extreme values to be ignored by the model (trimming).
Text transformation — converting text attributes to nested tables.
In addition to these data transformation techniques, you can improve the efficiency of a model by reducing the number of attributes in large data sets. You can create an Attribute Importance model to identify critical attributes or a Non-Negative Matrix Factorization model to combine similar attributes into a single feature. You can then build a model that uses only these attributes or features.
Note: Any transformations performed on the build data must also be performed on the test and scoring data. At each stage of the mining process, the data sets must be identical in structure.
If you are using SQL to prepare your data, you can use DBMS_DATA_MINING_TRANSFORM, an open-source package that provides a set of typical data transformation routines. You can use these routines or adapt them, or you can use some other SQL-based mechanism for preparing your data.
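For example, the following sketch uses the package's equi-width binning routines; the table, view, and column names are hypothetical, and the full parameter lists are documented in the package reference:

```sql
-- Bin the numerical columns of a hypothetical MINING_DATA table
-- into 10 equi-width bins, exposing the result as a view.
BEGIN
  -- Create a table to hold the bin boundary definitions
  DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM (
    bin_table_name => 'mining_data_bins');

  -- Compute equi-width bin boundaries from the data,
  -- leaving the case ID column unbinned
  DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH (
    bin_table_name  => 'mining_data_bins',
    data_table_name => 'mining_data',
    bin_num         => 10,
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST ('cust_id'));

  -- Create a view that presents the binned data
  DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM (
    bin_table_name  => 'mining_data_bins',
    data_table_name => 'mining_data',
    xform_view_name => 'mining_data_binned');
END;
/
```

The resulting view can then be used as the build data, and the same bin table can be applied to the test and scoring data to keep all data sets identical in structure.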
See "Preparing the Data" for information on data transformations in the Java API.
When you create a new model, you specify its function. Each function has a default algorithm, and each algorithm has certain default behaviors. To specify any nondefault characteristics, you must create a settings table for the model.
Create the settings table in the schema of the model owner. The settings table must have these columns.
Column Name | Data Type
---|---
setting_name | VARCHAR2(30)
setting_value | VARCHAR2(128)
If you are using the PL/SQL API, specify the name of the settings table as a parameter to the DBMS_DATA_MINING.CREATE_MODEL procedure. See "Build Settings" for information on model settings in the Java API.
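As a sketch, the following creates a settings table that overrides the default classification algorithm, then builds a model. All object names here are hypothetical:

```sql
-- Create the settings table and choose the Decision Tree algorithm
CREATE TABLE campaign_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  INSERT INTO campaign_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_DECISION_TREE);
  COMMIT;

  -- Build a classification model using those settings
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'campaign_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'mining_data_build',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'campaign_settings');
END;
/
```

If no settings table is supplied, the function's default algorithm is used.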
Model details refer to tabular information that can be generated dynamically after a model has been created in the database. Model details provide a complete description of the model. The kind of information provided by model details depends on the algorithm used by the model.
Details of classification and regression models provide extensive statistics that you can capture and examine before testing and scoring the model.
Details of a Decision Tree model are the XML representation of the model in standard PMML format, enabling any application that supports this standard to import the model.
Details of clustering models describe groups of cases that share certain characteristics.
Details of Attribute Importance models and Association models essentially provide the results of the model. For example, the details of an Attribute Importance model are a set of attributes with their importance value and rank. Details of an Association model consist of associated items (item sets) and the rules that define each association.
Model details can be retrieved using the PL/SQL table functions GET_MODEL_DETAILS_x, where x refers to the algorithm used by the model. See "Exploring Model Details" for information about model details in the Java API.
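For example, the details of a Naive Bayes model could be retrieved as follows (the model name is hypothetical):

```sql
-- Retrieve the details of a Naive Bayes model as a table
SELECT *
  FROM TABLE (DBMS_DATA_MINING.GET_MODEL_DETAILS_NB ('campaign_model'));
```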
The DBMS_PREDICTIVE_ANALYTICS PL/SQL package provides a high-level interface to data mining. It provides much of the power of predictive data mining, while masking its underlying complexity.

DBMS_PREDICTIVE_ANALYTICS automates the process of predictive data mining, from data preparation to model building to scoring new data. In addition to generating predictions, Predictive Analytics can explain the relative influence of specific attributes on the prediction.

DBMS_PREDICTIVE_ANALYTICS provides a PREDICT routine and an EXPLAIN routine.

When using Predictive Analytics, you do not need to prepare the data. Both the PREDICT and EXPLAIN routines analyze the data and automatically perform transformations to optimize the model.
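As a sketch, the two routines could be invoked as follows; the table and column names are hypothetical:

```sql
-- Predict a target and explain attribute influence in one step.
-- Table and column names below are hypothetical.
DECLARE
  v_accuracy NUMBER;  -- predictive accuracy returned by PREDICT
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PREDICT (
    accuracy            => v_accuracy,
    data_table_name     => 'customers',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    result_table_name   => 'predict_results');

  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
    data_table_name     => 'customers',
    explain_column_name => 'affinity_card',
    result_table_name   => 'explain_results');
END;
/
```

PREDICT writes per-case predictions to its result table and returns an accuracy measure; EXPLAIN writes each attribute's relative influence on the specified column to its result table.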
See "Using Automated Prediction and Explain Tasks" for information on Predictive Analytics in the Java API.
Predictive Analytics are also available in the Oracle Spreadsheet Add-In for Predictive Analytics.
The built-in SQL functions for Data Mining implement scoring operations for models that have already been created in the database. They provide the following benefits:
Models can be easily deployed within the context of existing SQL applications.
Scoring performance is greatly improved, especially in single row scoring cases, since these functions take advantage of existing query execution functionality.
Scoring results are pipelined, enabling some of the results to be returned quickly to the user.
Note: SQL functions are built into the Oracle Database and are available for use within SQL statements. SQL functions should not be confused with functions defined in PL/SQL packages.
When applied to a given row of scoring data, classification and regression models provide the best predicted value for the target and the associated probability of that value occurring. The predictive functions for Data Mining are described in Table 1-4.
Table 1-4 SQL Scoring Functions for Classification and Regression Models
Function | Description
---|---
PREDICTION | Returns the best prediction for the target.
PREDICTION_COST | Returns a measure of the cost of false negatives and false positives on the predicted target.
PREDICTION_DETAILS | Returns an XML string containing details that help explain the scored row.
PREDICTION_PROBABILITY | Returns the probability of a given prediction.
PREDICTION_SET | Returns a list of objects containing all classes in a binary or multi-class classification model along with the associated probability (and cost, if applicable).
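For example, assuming a classification model named campaign_model and a customers table (both hypothetical), each row can be scored directly in a query:

```sql
-- Score each customer with the model, using all columns as predictors
SELECT cust_id,
       PREDICTION (campaign_model USING *)             AS prediction,
       PREDICTION_PROBABILITY (campaign_model USING *) AS probability
  FROM customers;
```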
Applying a cluster model to a given row of scoring data returns the cluster ID and the probability of that row's membership in the cluster. The clustering functions for data mining are described in Table 1-5.
Table 1-5 SQL Scoring Functions for Clustering Models
Function | Description
---|---
CLUSTER_ID | Returns the ID of the predicted cluster.
CLUSTER_PROBABILITY | Returns the probability of a case belonging to a given cluster.
CLUSTER_SET | Returns a list of all possible clusters to which a given case belongs along with the associated probability of inclusion.
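For example, assuming a clustering model named km_model and a customers table (both hypothetical):

```sql
-- Assign each customer to its most likely cluster
SELECT cust_id,
       CLUSTER_ID (km_model USING *)          AS cluster_id,
       CLUSTER_PROBABILITY (km_model USING *) AS probability
  FROM customers;
```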
Applying a feature extraction model involves the mapping of features (sets of attributes) to columns in the scoring dataset. The feature extraction functions for data mining are described in Table 1-6.
Table 1-6 SQL Scoring Functions for Feature Extraction Models
Function | Description
---|---
FEATURE_ID | Returns the ID of the feature with the highest coefficient value.
FEATURE_SET | Returns a list of objects containing all possible features along with the associated coefficients.
FEATURE_VALUE | Returns the value of a given feature.
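For example, assuming a feature extraction model named nmf_model and a customers table (both hypothetical):

```sql
-- Return the dominant feature for each customer
SELECT cust_id,
       FEATURE_ID (nmf_model USING *) AS feature_id
  FROM customers;
```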