2 Managing Data

This chapter describes data requirements and options for Oracle Data Mining. This information applies to data sets used to build, test, and score models.

You should ensure that a data set is properly defined before applying transformations to optimize it for a particular model. Data transformation techniques are not addressed in this chapter.

See Also:

Oracle Data Mining Concepts for information about data transformations.
DBMS_DATA_MINING_TRANSFORM in Oracle Database PL/SQL Packages and Types Reference for information about data transformations in PL/SQL.
"Preparing the Data" for information about data transformations in Java.
Oracle Database SQL Reference for information about Oracle schema objects and data types.

This chapter contains the following topics:

Data Types
Columns and Attributes
Nested Tables
Data Storage Optimization

2.1 Data Types

The input to ODM is a table or a view. The columns can have numeric or character data types: NUMBER, FLOAT, VARCHAR2, or CHAR.

2.1.1 Collection Types

Additionally, ODM supports columns of type DM_NESTED_CATEGORICALS and DM_NESTED_NUMERICALS. These are collection types that define nested tables.

The ODM collection types define tables of attribute name/value pairs. ODM data sets can include any number of these nested table columns in addition to scalar columns with built-in numeric or character data types. See "Nested Tables" for more information.

2.1.2 Text

ODM uses features of Oracle Text to transform unstructured text columns to structured columns of type DM_NESTED_NUMERICALS for mining. The ODM Java API provides the OraTextTransform interface to manage the text transformation process for you. However, if you are using the PL/SQL API, you must use Oracle Text routines directly (See Chapter 5).

Structured text columns are supported by several ODM algorithms (Support Vector Machine for classification and regression, Non-Negative Matrix Factorization, Association, and k-Means clustering).

See Also:

"Using Text Transformation" for information on text transformation in the Java API
Oracle Data Mining Administrator's Guide for information on sample programs that illustrate text transformation and text mining

2.1.3 Date and Time Data

ODM Predictive Analytics supports columns with DATE and TIMESTAMP data types. These types are not supported by the ODM PL/SQL and Java APIs.

2.2 Columns and Attributes

ODM interprets the columns of the input table as attributes for data mining. Attributes are the predictors or descriptors on which the model is based.

A model may additionally identify a case ID column, a target column, or both.

Case ID

A case ID column holds a unique identifier for each record (row) of data. The case ID must be specified at model build time for all algorithms except Decision Tree. If a case ID is present in a Decision Tree model, it is not considered a possible predictor.

In the PL/SQL API and Java APIs, the case ID must be specified at apply time for all algorithms. The SQL scoring functions do not use a case ID.

The case ID column can be of type VARCHAR2, CHAR, or NUMBER, and its maximum length is 128 bytes.
Target

Predictive algorithms (Classification, Regression, and Attribute Importance) require that one column be designated as a target. The name of the target column is supplied as an argument when the model is created. The target column holds the predictions generated by the model. The target column must be of type VARCHAR2, CHAR, NUMBER, or FLOAT. SVM Regression supports only numeric targets. One-Class SVM does not use a target.

2.2.1 Attribute Data Types

ODM interprets attributes as either categorical or numerical.

Categorical attributes are values, such as gender or job title, that belong to a category or domain. Values of a categorical attribute do not have a meaningful order. Categorical attributes have character data types.

Numerical attributes are values, such as age or income, that fall within a continuum. Numerical attributes represent interval data that has a measurable order. Numerical attributes have numeric data types.

2.2.1.1 Converting Column Data Types

If the column data type is incompatible with the attribute type, you must convert the data type. For example, an application might use postal codes as a categorical attribute, but the data is actually stored in a numeric column. In this case, you would use the TO_CHAR function to convert the column to a character data type.

If your mining data includes DATE and TIMESTAMP columns, and you are not using Predictive Analytics, you must convert those columns to numeric or character data types. In most cases, these data types should be converted to NUMBER, but you should evaluate each case individually. If, for example, the date serves as a timestamp indicating when a transaction occurred, converting the date to VARCHAR2 makes it categorical with unique values, one in each record. This kind of column is known as an identifier and is not useful in model building. However, if the date values are coarse and significantly fewer than the number of records (for example, they might indicate the week or month when an item was purchased), it may be useful to use character values.

You can convert dates to numbers by selecting a starting date and subtracting it from each date value. This process results in a NUMBER column. Another approach would be to parse the date and distribute its components over several columns. This is the conversion method used by Predictive Analytics.

2.2.1.2 DATE and TIMESTAMP Columns with Predictive Analytics

Predictive Analytics interprets DATE data and all forms of TIMESTAMP data, including TIMESTAMP WITH TIMEZONE and TIMESTAMP WITH LOCAL TIMEZONE, as a set of numerical attributes. For example, a column named TIMECOL would be transformed into attributes for year, month, week, day of year, day of month, day of week, hour, and minute. Each attribute would be named TIMECOL_x, where x is the suffix used to convert the date into a number. For example, the name of the year attribute would be TIMECOL_YYYY.

The attributes resulting from DATE and TIMESTAMP data are visible in the results of an EXPLAIN operation. They are not visible in the results of a PREDICT operation.

2.2.2 Attribute Names

The names of ODM attributes must be valid column names. Naming requirements for columns are the same as the naming requirements for Oracle schema objects.

Schema object names can be quoted or nonquoted identifiers from one to thirty bytes long. Nonquoted identifiers are not case sensitive; Oracle converts them to uppercase. Nonquoted identifiers can consist of alphanumeric characters and the underscore (_), dollar sign ($), and pound sign (#). The initial character must be alphabetic. Quoted identifiers are case sensitive and can contain most characters.

See Also:

Oracle Database SQL Reference for information on schema object naming requirements.

2.3 Nested Tables

ODM accepts data in single-record case format, where all the information (attributes) concerning an individual is contained in one row. Single-record case, also known as non-transactional format, is illustrated in the table in Example 2-1. This table contains descriptive information about customers. CUSTOMER_ID is the case ID column.

Example 2-1 Non-Transactional Format

CUSTOMER_ID      GENDER      AGE     MARITAL_STATUS
------------     ------      ---     --------------
1                Male        30       Married
2                Female      35       Single
3                Male        21       Single
4                Female      20       Single
5                Female      35       Married

Sometimes data is organized in multi-record case, also known as transactional, format. A typical example is market basket data. In transactional format, the data pertaining to an individual is distributed across multiple records. The table in Example 2-2 illustrates transactional format. This table contains information about products purchased by a group of customers on a given day. CUSTOMER_ID is the case ID column.

Example 2-2 Transactional Format

CUST_ID    PROD_ID  PROD_NAME 
--------   -------  --------- 
1          524      brand x icecream
1          530      brand y frozen dinners
1          109      brand z dog food     
2          578      brand a orange juice 
2          191      brand x frozen dinners

ODM does not support multi-record case format. However, there could be circumstances in which you want to construct a model using transactional data. For example, you might want to use transactional data like that in Example 2-2 to predict the products that each customer is likely to buy on his next visit to the store. Discount coupons for these or similar products could then be generated with the checkout receipt.

If you want to construct a model using transactional data, you must first convert the data to single-record case. You must do this by defining columns of nested tables using the ODM fixed collection types, DM_NESTED_NUMERICALS and DM_NESTED_CATEGORICALS. These types define collections of numerical attributes and categorical attributes respectively. The data type descriptions are shown as follows.

SQL> describe dm_nested_numerical
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ATTRIBUTE_NAME                                     VARCHAR2(30)
 VALUE                                              NUMBER
 
SQL> describe dm_nested_numericals
 DM_NESTED_NUMERICALS TABLE OF DMSYS.DM_NESTED_NUMERICAL
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ATTRIBUTE_NAME                                     VARCHAR2(30)
 VALUE                                              NUMBER
 
SQL> describe dm_nested_categorical
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ATTRIBUTE_NAME                                     VARCHAR2(30)
 VALUE                                              VARCHAR2(4000)
 
SQL> describe dm_nested_categoricals
 DM_NESTED_CATEGORICALS TABLE OF DMSYS.DM_NESTED_CATEGORICAL
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 ATTRIBUTE_NAME                                     VARCHAR2(30)
 VALUE                                              VARCHAR2(4000)

For a given case identifier, attribute names must be unique across all the collections and individual columns. The fixed collection types enforce this requirement. However, the attribute naming requirements, described in "Attribute Names", do not apply to the attribute_name column of a nested table.

The attributes in Example 2-2 could be stored in nested table columns, as illustrated in Example 2-3. The column PRODUCT_IDENTIFIERS is of type DM_NESTED_NUMERICALS, and the column PRODUCT_NAMES is of type DM_NESTED_CATEGORICALS.

Example 2-3 Nested Tables

CUST_ID PRODUCT_IDENTIFIERS PRODUCT_NAMES

attribute_name value attribute_name value

CUST_ID	PRODUCT_IDENTIFIERS	PRODUCT_NAMES
	attribute_name value	attribute_name value
1	`PROD_ID` `524` `PROD_ID` `530` `PROD_ID` `109`	`PROD_NAME` `brand x ice cream` `PROD_NAME` `brand y frozen dinners` `PROD_NAME` `brand z dog food`
2	`PROD_ID` `578` `PROD_ID` `191`	`PROD_NAME` `brand a orange juice` `PROD_NAME` `brand x frozen dinners`

PROD_ID 524

PROD_ID 530

PROD_ID 109

PROD_NAME brand x ice cream

PROD_NAME brand y frozen dinners

PROD_NAME brand z dog food

PROD_ID 578

PROD_ID 191

PROD_NAME brand a orange juice

PROD_NAME brand x frozen dinners

2.3.1 Object Views and Multi-Record Collections

You can create an object view that presents several sources of transactional data (implemented with nested table columns) as a single data set for data mining. See "Example: Multi-Record Collections With an Object View".

Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.

Note:

Oracle recommends that you perform any necessary data transformations on the base tables before building object views. In this way, all attributes are transformed in a similar way. In most cases, attributes in transactional format are of the same scale, and thus this approach works. Otherwise, you can split the data into sets of similar items and then transform them separately.

See DBMS_DATA_MINING_TRANSFORM in Oracle Database PL/SQL Packages and Types Reference for information about data transformations using PL/SQL. See "Preparing the Data" for information about data transformations using Java.

2.3.2 Example: Multi-Record Collections With an Object View

A real-world example of an analytical pipeline for brain tumor research illustrates multi-case collections with an object view. The underlying tables store gene expression data and clinical data about the patient.

The fact table, GENE_EXPRESSION_DATA, stores gene expression data. It has the following columns.

case_ID      NUMBER
gene         VARCHAR2(30)
expr         NUMBER

The dimension table, CLINICAL_DATA_TABLE, stores clinical patient data. It has the following columns.

case_ID      NUMBER
name         VARCHAR2(30)
type         VARCHAR2(30)
subtype      VARCHAR2(30)
gender       CHAR(1)
age          NUMBER
status       VARCHAR2(30)

In this example, we want to create a model that predicts status based on gender, age, and gene expression. The build data for the model is an object view that uses columns of clinical patient data and a nested column of gene expression data. The view will have the following columns.

case_id      NUMBER
gender       CHAR(1)
age          NUMBER
gene_expr    DM_NESTED_NUMERICALS
status       VARCHAR2(30)

The following statement constructs the object view gene_expr_build, which can be used as build data for the model.

CREATE OR REPLACE VIEW gene_expr_build AS
SELECT C.case_id,
       C.gender,
       C.age,
       CAST(MULTISET(
       SELECT gene, expr
         FROM gene_expression_data
        WHERE case_id = C.case_id) AS DM_NESTED_NUMERICALS
       ) gene_expr,
       C.status
 FROM  clinical_data_table C

2.4 Data Storage Optimization

If there are a few hundred mining attributes and your application requires the attributes to be represented as columns in the same row of the table, data storage must be carefully designed.

For a table with several columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intrablock chaining.

Oracle stores multiple row pieces in the same block, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows don't fit in a single data block, you may consider using a larger database block size (or use multiple block sizes in the same database).

See Also:

Oracle Database Performance Tuning Guide for more details.