Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2), Part Number B14340-01
This chapter introduces the Oracle Data Mining (ODM) Application Programming Interfaces (APIs). ODM supports comprehensive PL/SQL and Java APIs, SQL functions, and table functions that implement the Basic Local Alignment Search Tool (BLAST) for life sciences applications.
Oracle 10g Release 2 (10.2) introduces several significant new features in the ODM APIs. Among these are the Decision Tree algorithm for classification and the One-Class SVM algorithm for anomaly detection. New predictive analytics, which automate the process of predictive data mining, and new built-in scoring functions, which return mining results within the context of a standard SQL statement, are also new in Oracle 10.2.
Oracle 10.2 introduces a completely new Java API for data mining. The Java API is an Oracle implementation of the Java Data Mining (JDM) 1.0 standard. It replaces the proprietary Java API that was available in Oracle 10g.
The Java API is layered on the PL/SQL API, and the two APIs are fully interoperable. For example, you can run a SQL script to create a model and then test and apply the model from a Java application.
Note: Model interoperability is new in Oracle 10.2. In Oracle 10g, the Java API was incompatible with the PL/SQL API. See Chapter 8 for information on migrating ODM 10g Java applications to the new API.
See Also: Oracle Data Mining Concepts and Oracle Database New Features for a complete list of new features in Oracle 10g Release 2 (10.2) Data Mining.
ODM supports both predictive and descriptive mining functions. Predictive functions, known as supervised learning, use training data to predict a target value. Descriptive functions, known as unsupervised learning, identify relationships intrinsic to the data. Each mining function identifies a class of problems to be solved, and each can be implemented with one or more algorithms.
The predictive data mining functions are described in Table 1-1. The algorithm abbreviations introduced in the table are used throughout this manual.
Table 1-1 Predictive Data Mining Functions

Function | Description | Algorithms
---|---|---
Classification | Predicts a categorical target value for each case. | Naive Bayes (NB), Adaptive Bayes Network (ABN), Support Vector Machine (SVM), Decision Tree (DT)
Regression | Predicts a continuous numerical target value for each case. | Support Vector Machine (SVM)
Attribute Importance | Ranks attributes by their influence in predicting a target. | Minimum Description Length (MDL)
Anomaly Detection | Identifies rare or atypical cases, using data with no target. | One-Class Support Vector Machine (SVM)
The descriptive data mining functions are described in Table 1-2.
Table 1-2 Descriptive Data Mining Functions

Function | Description | Algorithms
---|---|---
Clustering | Groups similar cases together into segments. | Enhanced k-Means (KM), Orthogonal Partitioning Clustering (O-Cluster, OC)
Association | Finds items that tend to occur together and the rules that relate them. | Apriori (AP)
Feature Extraction | Combines existing attributes into a new, smaller set of features. | Non-Negative Matrix Factorization (NMF)
The first step in designing a data mining application is to analyze the business problem and determine the mining function and algorithm that best addresses it. The second step is to examine the data and determine how it should be prepared for mining.
Once you have identified the mining function and algorithm, and implemented the data transformations, you can develop a sample data mining application. A degree of intuition is involved in making these application choices. You might develop, test, and deploy your sample application and then determine that the results aren't quite what you are looking for. In this case, you might try different or additional data transformations, or you might try a different or additional algorithm.
In any case, the process of developing a data mining application is iterative. It involves testing the model, evaluating test metrics, making adjustments in the model, and re-evaluating.
See Also: Oracle Data Mining Concepts for information to help you approach a given data mining problem.
Although it is common to try different approaches to solving a data mining problem, each application must accomplish several basic tasks.
Prepare the data. One data set is needed for building the model; additional data sets may be necessary for testing and scoring the model, depending on the algorithm. In most cases, the data must be prepared with transformations that enhance or facilitate the effectiveness of the model. Each data set must be prepared in the same way.
Create a model using the build data.
Evaluate the model.
For classification and regression models, this is the application of the model to a set of test data, and the computation of various test metrics.
For clustering models, this is the examination of the clusters identified during model creation.
For feature extraction models, this is the examination of the features identified during model creation.
For attribute importance and association models, evaluation is the final step in the mining process. These models cannot be scored against new data.
Apply (score) the model. This is the process of deploying the model to the data of interest.
For classification and regression models, scoring is the application of the "trained" model to the actual population. The result is the best prediction for a target value in each record.
For clustering models, scoring is the application of clusters identified by the model to the actual population. The result is the probability of cluster membership for each record.
For feature extraction models, scoring is the mapping of features defined by the model to the actual population. The result is a reduced set of predictors in each record.
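As a sketch of the apply step in the PL/SQL API, the following call scores a classification model against new data. The model, table, and column names here are hypothetical, used only for illustration:

```sql
-- Apply a previously built classification model to new data.
-- Model, table, and column names below are hypothetical.
BEGIN
  DBMS_DATA_MINING.APPLY (
    model_name          => 'campaign_model',
    data_table_name     => 'new_customers',
    case_id_column_name => 'cust_id',
    result_table_name   => 'campaign_scores');
END;
/
```

The result table contains one row per case and predicted target value, along with the probability of that prediction.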
The basic mining steps for each algorithm are summarized in Table 1-3. Some steps, such as priors and costs and specific test metrics, are optional. The individual steps are described in later sections of this manual.
Table 1-3 Major Steps in a Data Mining Application
Function/Algorithm | Build | Evaluate | Apply
---|---|---|---
Classification (NB, ABN, SVM, DT) | Create the model from build data. | Apply the model to test data and compute test metrics. | Apply the model to new data; return the best prediction and its probability for each case.
Regression (SVM) | Create the model from build data. | Apply the model to test data and compute test metrics. | Apply the model to new data; return the best prediction for each case.
Attribute Importance (MDL) | Create the model from build data. | Retrieve model details, consisting of a list of attributes with their importance ranking. | Not supported.
Clustering (KM, OC) | Create the model from build data. | Retrieve model details to obtain information about clusters in the data. | Apply the model to new data; return the probability of cluster membership for each case.
Association (AP) | Create the model from build data. | Retrieve frequent item sets, and rules that define the item sets. | Not supported.
Feature Extraction (NMF) | Create the model from build data. | Retrieve model details, consisting of a list of features with their importance ranking. | Apply the model to new data; return a reduced set of predictors for each case.
Data sets used by Oracle Data Mining are stored in tables, which can be accessed through relational views. The rows are referred to as cases or records. A case ID column specifies a unique identifier for each case, for example the customer ID in a table of customer data.
Columns referred to as attributes or fields specify a set of predictors. Supervised models (with the exception of One-Class SVM) also use a target column. For example, a regression model might predict customer income level (the target), given customer date of birth and gender (the predictors). Unsupervised models use a set of predictors but no target.
ODM distinguishes between two types of attributes: categorical and numerical. Categorical attributes are values that belong to a given category or class, for example marital status or job title. Numerical attributes are values in a continuum, for example income or age.
Column attributes can have a scalar data type, or they can contain nested tables (collection types) of type DM_NESTED_NUMERICALS or DM_NESTED_CATEGORICALS. Some ODM algorithms support text columns. Text must be indexed and converted to one of the collection types prior to data mining (see Chapter 5).
In most cases, data sets must be specifically prepared before building, testing, or applying a model. Preparation includes transformations that improve model accuracy and performance. Common data transformations are:
Binning — grouping related values together to reduce the number of distinct values for an attribute.
Normalization — converting individual attribute values so that they fall within a range, typically 0.0 – 1.0 or -1 – +1.
Clipping — setting extreme attribute values to a single value (winsorizing) or causing extreme values to be ignored by the model (trimming).
Text transformation — converting text attributes to nested tables.
In addition to these data transformation techniques, you can improve the efficiency of a model by reducing the number of attributes in large data sets. You can create an Attribute Importance model to identify critical attributes or a Non-Negative Matrix Factorization model to combine similar attributes into a single feature. You can then build a model that uses only these attributes or features.
Note: Any transformations performed on the build data must also be performed on the test and scoring data. At each stage of the mining process, the data sets must be identical in structure.
If you are using SQL to prepare your data, you can use DBMS_DATA_MINING_TRANSFORM, an open-source package that provides a set of typical data transformation routines. You can use these routines or adapt them, or you can use some other SQL-based mechanism for preparing your data.
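For example, the following sketch uses the package's equi-width binning routines; the table, view, and column names are hypothetical, and the full parameter lists are documented in the package reference:

```sql
-- Bin the numerical columns of a hypothetical MINING_DATA table
-- into 10 equi-width bins, exposing the result as a view.
BEGIN
  -- Create a table to hold the bin boundary definitions
  DBMS_DATA_MINING_TRANSFORM.CREATE_BIN_NUM (
    bin_table_name => 'mining_data_bins');

  -- Compute equi-width bin boundaries from the data,
  -- leaving the case ID column unbinned
  DBMS_DATA_MINING_TRANSFORM.INSERT_BIN_NUM_EQWIDTH (
    bin_table_name  => 'mining_data_bins',
    data_table_name => 'mining_data',
    bin_num         => 10,
    exclude_list    => DBMS_DATA_MINING_TRANSFORM.COLUMN_LIST ('cust_id'));

  -- Create a view that presents the binned data
  DBMS_DATA_MINING_TRANSFORM.XFORM_BIN_NUM (
    bin_table_name  => 'mining_data_bins',
    data_table_name => 'mining_data',
    xform_view_name => 'mining_data_binned');
END;
/
```

The resulting view can then be used as the build data, and the same bin table can be applied to the test and scoring data to keep all data sets identical in structure.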
See "Preparing the Data" for information on data transformations in the Java API.
When you create a new model, you specify its function. Each function has a default algorithm, and each algorithm has certain default behaviors. To specify any nondefault characteristics, you must create a settings table for the model.
Create the settings table in the schema of the model owner. The settings table must have these columns.
Column Name | Data Type
---|---
setting_name | VARCHAR2(30)
setting_value | VARCHAR2(128)
If you are using the PL/SQL API, specify the name of the settings table as a parameter to the DBMS_DATA_MINING.CREATE_MODEL procedure. See "Build Settings" for information on model settings in the Java API.
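As a sketch, the following creates a settings table that overrides the default classification algorithm, then builds a model. All object names here are hypothetical:

```sql
-- Create the settings table and choose the Decision Tree algorithm
CREATE TABLE campaign_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
  INSERT INTO campaign_settings VALUES
    (DBMS_DATA_MINING.ALGO_NAME, DBMS_DATA_MINING.ALGO_DECISION_TREE);
  COMMIT;

  -- Build a classification model using those settings
  DBMS_DATA_MINING.CREATE_MODEL (
    model_name          => 'campaign_model',
    mining_function     => DBMS_DATA_MINING.CLASSIFICATION,
    data_table_name     => 'mining_data_build',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'campaign_settings');
END;
/
```

If no settings table is supplied, the function's default algorithm is used.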
Model details refer to tabular information that can be generated dynamically after a model has been created in the database. Model details provide a complete description of the model. The kind of information provided by model details depends on the algorithm used by the model.
Details of classification and regression models provide extensive statistics that you can capture and examine before testing and scoring the model.
Details of a Decision Tree model are the XML representation of the model in standard PMML format, enabling any application that supports this standard to import the model.
Details of clustering models describe groups of cases that share certain characteristics.
Details of Attribute Importance models and Association models essentially provide the results of the model. For example, the details of an Attribute Importance model are a set of attributes with their importance value and rank. Details of an Association model consist of associated items (item sets) and the rules that define each association.
Model details can be retrieved using the PL/SQL table functions GET_MODEL_DETAILS_x, where x refers to the algorithm used by the model. See "Exploring Model Details" for information about model details in the Java API.
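For example, the details of a Naive Bayes model could be retrieved as follows (the model name is hypothetical):

```sql
-- Retrieve the details of a Naive Bayes model as a table
SELECT *
  FROM TABLE (DBMS_DATA_MINING.GET_MODEL_DETAILS_NB ('campaign_model'));
```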
The DBMS_PREDICTIVE_ANALYTICS PL/SQL package provides a high-level interface to data mining. It provides much of the power of predictive data mining, while masking its underlying complexity.

DBMS_PREDICTIVE_ANALYTICS automates the process of predictive data mining, from data preparation to model building to scoring new data. In addition to generating predictions, Predictive Analytics can explain the relative influence of specific attributes on the prediction.

DBMS_PREDICTIVE_ANALYTICS provides a PREDICT routine and an EXPLAIN routine.

When using Predictive Analytics, you do not need to prepare the data. Both the PREDICT and EXPLAIN routines analyze the data and automatically perform transformations to optimize the model.
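As a sketch, the two routines could be invoked as follows; the table and column names are hypothetical:

```sql
-- Predict a target and explain attribute influence in one step.
-- Table and column names below are hypothetical.
DECLARE
  v_accuracy NUMBER;  -- predictive accuracy returned by PREDICT
BEGIN
  DBMS_PREDICTIVE_ANALYTICS.PREDICT (
    accuracy            => v_accuracy,
    data_table_name     => 'customers',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    result_table_name   => 'predict_results');

  DBMS_PREDICTIVE_ANALYTICS.EXPLAIN (
    data_table_name     => 'customers',
    explain_column_name => 'affinity_card',
    result_table_name   => 'explain_results');
END;
/
```

PREDICT writes per-case predictions to its result table and returns an accuracy measure; EXPLAIN writes each attribute's relative influence on the specified column to its result table.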
See "Using Automated Prediction and Explain Tasks" for information on Predictive Analytics in the Java API.
Predictive Analytics are also available in the Oracle Spreadsheet Add-In for Predictive Analytics.
The built-in SQL functions for Data Mining implement scoring operations for models that have already been created in the database. They provide the following benefits:
Models can be easily deployed within the context of existing SQL applications.
Scoring performance is greatly improved, especially in single row scoring cases, since these functions take advantage of existing query execution functionality.
Scoring results are pipelined, enabling some of the results to be returned quickly to the user.
Note: SQL functions are built into the Oracle Database and are available for use within SQL statements. SQL functions should not be confused with functions defined in PL/SQL packages.
When applied to a given row of scoring data, classification and regression models provide the best predicted value for the target and the associated probability of that value occurring. The predictive functions for Data Mining are described in Table 1-4.
Table 1-4 SQL Scoring Functions for Classification and Regression Models
Function | Description
---|---
PREDICTION | Returns the best prediction for the target.
PREDICTION_COST | Returns a measure of the cost of false negatives and false positives on the predicted target.
PREDICTION_DETAILS | Returns an XML string containing details that help explain the scored row.
PREDICTION_PROBABILITY | Returns the probability of a given prediction.
PREDICTION_SET | Returns a list of objects containing all classes in a binary or multi-class classification model along with the associated probability (and cost, if applicable).
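For example, assuming a classification model named campaign_model and a customers table (both hypothetical), each row can be scored directly in a query:

```sql
-- Score each customer with the model, using all columns as predictors
SELECT cust_id,
       PREDICTION (campaign_model USING *)             AS prediction,
       PREDICTION_PROBABILITY (campaign_model USING *) AS probability
  FROM customers;
```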
Applying a cluster model to a given row of scoring data returns the cluster ID and the probability of that row's membership in the cluster. The clustering functions for data mining are described in Table 1-5.
Table 1-5 SQL Scoring Functions for Clustering Models
Function | Description
---|---
CLUSTER_ID | Returns the ID of the predicted cluster.
CLUSTER_PROBABILITY | Returns the probability of a case belonging to a given cluster.
CLUSTER_SET | Returns a list of all possible clusters to which a given case belongs along with the associated probability of inclusion.
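For example, assuming a clustering model named km_model and a customers table (both hypothetical):

```sql
-- Assign each customer to its most likely cluster
SELECT cust_id,
       CLUSTER_ID (km_model USING *)          AS cluster_id,
       CLUSTER_PROBABILITY (km_model USING *) AS probability
  FROM customers;
```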
Applying a feature extraction model involves the mapping of features (sets of attributes) to columns in the scoring dataset. The feature extraction functions for data mining are described in Table 1-6.
Table 1-6 SQL Scoring Functions for Feature Extraction Models
Function | Description
---|---
FEATURE_ID | Returns the ID of the feature with the highest coefficient value.
FEATURE_SET | Returns a list of objects containing all possible features along with the associated coefficients.
FEATURE_VALUE | Returns the value of a given feature.
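For example, assuming a feature extraction model named nmf_model and a customers table (both hypothetical):

```sql
-- Return the dominant feature for each customer
SELECT cust_id,
       FEATURE_ID (nmf_model USING *) AS feature_id
  FROM customers;
```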