1 Introduction to Oracle Ultra Search

This chapter contains the following topics:

Overview of Oracle Ultra Search
Oracle Ultra Search Components
Oracle Ultra Search Features
Oracle Ultra Search System Configuration

Overview of Oracle Ultra Search

Oracle Ultra Search is built on Oracle Database and Oracle Text technology. It provides uniform search-and-locate capabilities over multiple repositories: Oracle databases, other ODBC-compliant databases, IMAP mail servers, HTML documents served by a Web server, files on disk, and more.

Oracle Ultra Search uses a 'crawler' to collect documents. You can schedule the crawler to suit the Web sites that you want to search. The documents stay in their own repositories, and the collected information is used to build an index that stays within your firewall in a designated Oracle database. Oracle Ultra Search also provides APIs for building content management solutions.

In addition, Oracle Ultra Search offers the following:

A complete text query language for text search inside the database
Full integration with the Oracle Database and the SQL query language
Advanced features like concept searching and theme analysis
Attribute mapping to facilitate attribute search across disparate repositories
Indexing of more than 150 file formats
Full globalization, including support for Chinese, Japanese, and Korean (CJK), and Unicode

Oracle Ultra Search Components

Oracle Ultra Search is made up of the following components:

Oracle Ultra Search Crawler
Oracle Ultra Search Backend
Oracle Ultra Search Middle Tier

Oracle Ultra Search Crawler

The Oracle Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns a configurable number of processor threads that fetch documents from various data sources and index them using Oracle Text. This index is used for querying. Data sources can be Web sites, database tables, files, mailing lists, Oracle Application Server Portal page groups, or user-defined data sources.
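
The idea of a configurable number of fetching threads can be sketched in plain Java. The following is a minimal illustration only, not the crawler's actual implementation: the class name CrawlPoolSketch, the crawl method, and the fetcher function are all hypothetical, and the "fetch" is whatever function the caller supplies.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical illustration; not Oracle Ultra Search code.
final class CrawlPoolSketch {

    /**
     * Fetches every source with a fixed pool of worker threads and
     * returns a map from source name to fetched content.
     */
    static Map<String, String> crawl(List<String> sources,
                                     int threadCount,
                                     Function<String, String> fetcher) {
        Map<String, String> fetched = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (String source : sources) {
            pool.execute(() -> fetched.put(source, fetcher.apply(source)));
        }
        pool.shutdown(); // accept no new tasks; queued tasks still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow(); // give up on tasks that are still running
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return fetched;
    }
}
```

The thread count is a parameter so that it can be tuned to the data sources being fetched, which mirrors the "configurable number of processor threads" described above.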

The crawler maps links and analyzes relationships. The crawler schedule is integrated with and driven by the DBMS_JOB queue mechanism. Whenever the crawler encounters embedded, non-HTML documents during the crawling, it uses Oracle Text filters to automatically detect the document type and to filter and index the document.

Oracle Ultra Search Backend

The Oracle Ultra Search backend consists of an Oracle Ultra Search repository and Oracle Text. Oracle Text provides the text indexing and search capabilities required to index and query data retrieved from your data sources. The backend is not visible to users; it indexes information from the crawler and serves up the query results.

Oracle Ultra Search Middle Tier

The Oracle Ultra Search middle tier components are Web applications. The middle tier includes the Oracle Ultra Search administration tool, the APIs, and the query applications.

In the Oracle Database release, the Oracle Ultra Search middle tier and backend can reside in the same Oracle home. However, in the OracleAS and Oracle Collaboration Suite releases, the middle tier is located in a different Oracle home than the backend.

Oracle Ultra Search Administration Tool

The administration tool is a J2EE-compliant Web application. You can use it to manage Oracle Ultra Search instances, and you can access it from any browser in your intranet. The administration tool is independent from the Oracle Ultra Search query application. Therefore, the administration tool and query application can be hosted on different computers to enhance security and scalability.

Oracle Ultra Search APIs and Query Applications

Oracle Ultra Search provides the following APIs:

The query API works with indexed data. The Java API does not impose any HTML rendering elements. The application can completely customize the HTML interface.
The crawler agent API crawls and indexes proprietary document repositories.
The e-mail Java API accesses archived e-mails and is used by the query application to display e-mails. It can also be used when building your own custom query application.
The URL rewriter API is used by the crawler to filter and rewrite extracted URL links before they are inserted into the URL queue.
The Document Service crawler agent API allows generation of attribute data based on the document contents. It accepts robot metatag instructions from the agent for the target document, and it transforms the original document contents for indexing control.

Oracle Ultra Search includes highly functional query applications to query and display search results. The query applications are based on JSP and work with any JSP 1.1-compliant engine.

Web Application Concepts

The Oracle Ultra Search administration tool and the Oracle Ultra Search query applications are J2EE-compliant Web applications with a three-tier architecture. Figure 1-1 shows the relationship between the browser (the first tier), the Web server and the servlet engine (the middle tier), and the Oracle Database (the third tier).

The Web server accepts requests from the browser and forwards them to the servlet engine for processing. The Oracle Ultra Search middle tier then communicates with the Oracle database through JDBC, as shown in Figure 1-1.

You can use any browser to access the Oracle Ultra Search administration tool or Oracle Ultra Search query application. The URLs are described in the following section.

Figure 1-1 Oracle Ultra Search Architecture

Oracle Ultra Search Features

This section explains some features in Oracle Ultra Search. It includes the following topics:

Oracle Ultra Search Instance
Document and Search Attributes
Metadata Loader
Internationalization in Oracle Ultra Search
Oracle Ultra Search Crawler Features
Query API
Secure Search
Document Relevancy Boosting
Query Syntax Expansion
Display URL Support
Federated Search
Single Sign-On Authentication
Integration with Oracle Internet Directory
Query Applications
Integration with Oracle Application Server

Oracle Ultra Search Instance

You can create an Oracle Ultra Search instance to isolate the data collections that have been crawled.

You can create a read-only snapshot of a master Oracle Ultra Search instance. This is useful for query processing or for a backup. You can also make a snapshot instance updatable. This is useful when the master instance is corrupted and you want to use a snapshot as a new master instance.

See Also:

"Instances Page"

Document and Search Attributes

Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. The value is retrieved during the crawling process and then mapped to one of the search attributes and stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.
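
The mapping from per-source document attributes to shared search attributes can be pictured as a lookup table. The following is a minimal sketch under stated assumptions: the class AttributeMappingSketch, its methods, and the sample attribute names used with it are hypothetical and are not part of any Oracle Ultra Search API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; not Oracle Ultra Search code.
final class AttributeMappingSketch {
    // Key: "dataSource/documentAttribute"; value: search attribute name.
    private final Map<String, String> mapping = new HashMap<>();

    void map(String dataSource, String documentAttribute, String searchAttribute) {
        mapping.put(dataSource + "/" + documentAttribute, searchAttribute);
    }

    /** Converts one document's source attributes into search attributes. */
    Map<String, String> toSearchAttributes(String dataSource,
                                           Map<String, String> documentAttributes) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> entry : documentAttributes.entrySet()) {
            String searchAttribute = mapping.get(dataSource + "/" + entry.getKey());
            if (searchAttribute != null) { // unmapped attributes are skipped
                result.put(searchAttribute, entry.getValue());
            }
        }
        return result;
    }
}
```

With such a table, a "From" attribute in one source and an "owner" attribute in another can both be mapped to Author, so a single Author query covers both sources.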

Oracle Ultra Search has the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. They can be incorporated in search applications for a more detailed search and richer presentation. The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes the attribute value, the attribute value display name, and its translation.

See Also:

"Synchronizing Data Sources"

Metadata Loader

Oracle Ultra Search provides a command-line tool to load metadata into an Oracle Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool.

The metadata loader is a Java application. To use the tool, you must put the metadata in an XML file. It supports the following types of metadata:

Search attribute list of values (LOVs) and display names
Document relevance boosting and document loading

Internationalization in Oracle Ultra Search

Translators can enter the following translation strings:

Search attribute names
Attribute LOVs
Data group names
Federated data source names

At query time, these strings can be displayed according to the user's language preference.

Oracle Ultra Search Crawler Features

You can define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository, such as Lotus Notes or Documentum, which contain their own databases and interfaces. The proprietary repository is called a user-defined data source. The module that enables the crawler to access the data source is called a crawler agent.

Robots

You can control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.oracle.com/, it checks for http://www.oracle.com/robots.txt. If it finds it, the crawler analyzes its contents to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.
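
The robots.txt check can be illustrated with a small parser. This is a deliberately simplified sketch, not the crawler's actual logic: it reads only the Disallow lines of the "User-agent: *" group, treats each value as a plain path prefix, and ignores Allow lines and wildcards. The class name RobotsSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified illustration of robots exclusion.
final class RobotsSketch {
    private final List<String> disallowedPrefixes = new ArrayList<>();

    /** Collects Disallow prefixes from the "User-agent: *" group. */
    static RobotsSketch parse(String robotsTxt) {
        RobotsSketch rules = new RobotsSketch();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\\R")) {
            String line = rawLine.replaceFirst("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                rules.disallowedPrefixes.add(value);
            }
        }
        return rules;
    }

    /** Returns true if no collected Disallow prefix matches the path. */
    boolean isAllowed(String path) {
        for (String prefix : disallowedPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A crawler that honors robots exclusion would call a check such as isAllowed before fetching each path and skip any path that is disallowed.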

See Also:

"Web Sources"

Data Harvesting

For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing. You can set the crawling mode to one of the following:

Automatically accept all URLs for indexing
Examine URLs before indexing
Index only

See Also:

"Schedules Page"

URL Rewrite

The URL rewriter is a user-supplied Java module that implements the Oracle Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used.
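
The filter-or-rewrite step can be sketched as follows. The interface shown here, LinkRewriterSketch, is a hypothetical stand-in and is not the actual UrlRewriter interface, whose method names and signatures are not given in this chapter. The convention that a null return value drops the link is likewise an assumption made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in; the real UrlRewriter interface may differ.
interface LinkRewriterSketch {
    /** Returns the URL to enqueue, or null to drop the link. */
    String rewrite(String extractedUrl);
}

// Hypothetical illustration of a crawler applying the rewriter.
final class UrlQueueSketch {

    /** Applies the rewriter to each extracted link before queueing it. */
    static List<String> enqueue(List<String> extractedLinks, LinkRewriterSketch rewriter) {
        List<String> queue = new ArrayList<>();
        for (String link : extractedLinks) {
            String rewritten = rewriter.rewrite(link);
            if (rewritten != null) { // null means "filter this link out"
                queue.add(rewritten);
            }
        }
        return queue;
    }
}
```

For example, a rewriter could drop every link that contains "/logout" and replace an internal host name with the public one.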

Query API

Oracle Ultra Search offers a flexible query API to incorporate search functionality to your sites. The query API includes the following functionality:

Three attribute types: string, number, and date
Multivalued attributes
Display name support for attributes, attribute list of values (LOV), and data groups
Document relevancy boosting
Arbitrary grouping of attribute query conditions using the AND and OR operators, with control over the evaluation order
Selection of metadata returned in query result
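
Grouping of attribute conditions with AND and OR can be illustrated with standard Java predicates. This sketch does not use the Oracle Ultra Search query API: the class AttributeQuerySketch and its methods are hypothetical, and a document is modeled simply as a map from attribute name to value.

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration; not the Oracle Ultra Search query API.
final class AttributeQuerySketch {

    /** Builds a condition that tests one string attribute for equality. */
    static Predicate<Map<String, String>> attributeEquals(String name, String value) {
        return document -> value.equals(document.get(name));
    }

    /** Author = "smith" AND (Language = "en" OR Language = "fr"). */
    static Predicate<Map<String, String>> exampleQuery() {
        // The OR group is built first, so it is evaluated as one unit.
        Predicate<Map<String, String>> language =
                attributeEquals("Language", "en").or(attributeEquals("Language", "fr"));
        return attributeEquals("Author", "smith").and(language);
    }
}
```

Building the inner group as a separate object is what gives the caller control over evaluation order, in the same way that parentheses do in a written expression.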

Secure Search

Oracle Ultra Search supports secure searches, which return only documents satisfying the search criteria that the search user is allowed to view. For secure searches, each indexed document should be protected by an access control list (ACL). During searches, the ACL is evaluated. If the user performing the search has permission to read the protected document, then the document is returned by the query API. Otherwise, it is not returned.
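
The effect of ACL evaluation on a result list can be sketched as a filter over candidate hits. This is an illustration only: the class SecureSearchSketch is hypothetical, an ACL is reduced to a set of user names with read permission, and a document with no ACL entry is treated as not visible, which is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; real ACL evaluation is done by the Oracle server.
final class SecureSearchSketch {

    /** Returns only the hits whose ACL grants read access to the user. */
    static List<String> visibleHits(List<String> candidateHits,
                                    Map<String, Set<String>> readersByDocument,
                                    String user) {
        List<String> visible = new ArrayList<>();
        for (String hit : candidateHits) {
            Set<String> readers = readersByDocument.get(hit);
            if (readers != null && readers.contains(user)) {
                visible.add(hit);
            }
        }
        return visible;
    }
}
```

Because every candidate hit may need its own check, the cost of a secure search grows with the number of candidates, not only with the number of results returned.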

There are two ways to secure a data source:

Specify a single ACL for protecting all documents of a data source.

The administrator specifies the permissions of the single ACL in the Oracle Ultra Search administration tool. The resulting ACL is used to protect all documents belonging to that data source.
Crawl ACLs from the data source.

The data source is expected to provide the ACL together with the document. This lets each document be protected by its own unique ACL.

Oracle Ultra Search only supports this mode for user-defined data source types where the crawler agent retrieves the ACL from the data source along with other document attributes. You cannot get an ACL from a data source if it is a Web, table, portal, e-mail, or file type. With agent APIs, the URL property "UrlData.ACL" lets the agent set the ACL of the submitted URL. Also, the AclHelper class in the agent APIs generates the ACL string to make sure that the ACL string format is correct. Only Distinguished Name (DN) and Global User Id (GUID) can be used as the principal of an ACL.

Oracle Ultra Search performs ACL duplicate detection. This means that if a crawled document's ACL already exists in the Oracle Ultra Search system, then the existing ACL is used to protect the document, instead of creating a new ACL within Oracle Ultra Search. This policy reduces storage space and increases performance.
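
Duplicate detection of this kind can be sketched as a store keyed by the ACL's content. The class AclStoreSketch and its methods are hypothetical, and comparing ACLs as plain strings is an assumption of this illustration; the chapter does not describe how Oracle Ultra Search compares them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of ACL duplicate detection.
final class AclStoreSketch {
    private final Map<String, Integer> idByAclText = new HashMap<>();
    private int nextId = 1;

    /** Returns the id of an existing identical ACL, or stores a new one. */
    int idFor(String aclText) {
        Integer existing = idByAclText.get(aclText);
        if (existing != null) {
            return existing; // reuse: no new ACL is created
        }
        int id = nextId++;
        idByAclText.put(aclText, id);
        return id;
    }

    int storedAclCount() {
        return idByAclText.size();
    }
}
```

When many documents share the same permissions, only one ACL is stored and every document refers to it by id.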

Oracle Ultra Search supports only a single LDAP domain. The LDAP users and groups specified in the ACL must belong to the same LDAP domain.

Caution:

If ACLs are crawled from data sources, then it is the responsibility of the administrator to ensure that the data sources being crawled belong to the same LDAP domain. Otherwise, it is possible that search users can inadvertently be granted permissions to documents that they should not be able to access.

Searches run against a secure-search enabled Oracle Ultra Search instance are slower than those run against a non-secure-search enabled instance. This is because each candidate result could require an ACL evaluation. ACLs are evaluated natively by the Oracle server for optimum performance. Nevertheless, the time taken to return hits in a secure search varies depending on the number of ACL evaluations that must be made.

Dependency on Oracle XML DB

Oracle Ultra Search stores ACLs in the Oracle XML DB repository. Oracle Ultra Search also uses Oracle XML DB functionality to evaluate ACLs. (This dependency only exists for those users who are making use of secure searching.)

The ACLs are managed by Oracle Ultra Search. ACLs are uniquely referenced by documents from a single Oracle Ultra Search instance. ACLs are not shared by multiple Oracle Ultra Search instances. For acceptable performance, the ACL cache size must be large enough to contain all ACLs evaluated at run time.

ACLs in the XML DB repository are protected by other ACLs (known as protector ACLs). Oracle Ultra Search ensures that the protector ACLs grant appropriate privileges for Oracle Ultra Search to invoke the XML DB ACL evaluation mechanism. The evaluation performance is primarily affected by the total number of ACLs used by all XML DB client applications that also utilize its ACL evaluation mechanism. This set of applications includes Oracle Ultra Search.

Security Considerations When Restricting Access to a Data Source

An Oracle Ultra Search data source can be protected by a single administrator-specified ACL. This ACL specifies which users and groups are allowed to view the documents belonging to that data source.

Oracle Ultra Search uses the Oracle Server's ACL evaluation engine to evaluate permissions when queries are performed by search users. This ACL evaluation engine is a feature of Oracle XML DB. If an Oracle Ultra Search query attempts to retrieve a document that is protected by an administrator-specified ACL, the ACL is evaluated and subsequently cached.

How long an ACL is cached is controlled by the XML DB configuration parameter acl-max-age. To change the cache duration, modify this parameter. Its value is the number of seconds for which ACLs are cached.

Because ACLs are cached, it is important to remember that changes to an administrator-specified ACL may not propagate immediately. This only applies to database sessions that existed before the change was made.
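
The delayed propagation can be illustrated with a small time-limited cache. This sketch models a single session only and is not how Oracle XML DB is implemented: the class AclCacheSketch is hypothetical, the clock and the ACL evaluator are supplied by the caller, and maxAgeSeconds plays the role that acl-max-age plays in the text above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.function.Predicate;

// Hypothetical illustration of a cache with a maximum entry age.
final class AclCacheSketch {
    private static final class Entry {
        final boolean allowed;
        final long cachedAtSeconds;

        Entry(boolean allowed, long cachedAtSeconds) {
            this.allowed = allowed;
            this.cachedAtSeconds = cachedAtSeconds;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long maxAgeSeconds;
    private final LongSupplier clockSeconds;
    private final Predicate<String> evaluator;

    AclCacheSketch(long maxAgeSeconds, LongSupplier clockSeconds, Predicate<String> evaluator) {
        this.maxAgeSeconds = maxAgeSeconds;
        this.clockSeconds = clockSeconds;
        this.evaluator = evaluator;
    }

    /** Returns the cached decision, re-evaluating once the entry is too old. */
    boolean isAllowed(String aclId) {
        long now = clockSeconds.getAsLong();
        Entry entry = cache.get(aclId);
        if (entry == null || now - entry.cachedAtSeconds >= maxAgeSeconds) {
            entry = new Entry(evaluator.test(aclId), now);
            cache.put(aclId, entry);
        }
        return entry.allowed;
    }
}
```

In this model, a permission that is revoked after it has been cached continues to be honored until the entry reaches the maximum age, at which point the next lookup re-evaluates it.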

See Also:

Oracle XML DB Developer's Guide and "Enabling Secure Search"

Document Relevancy Boosting

With document relevancy boosting, you can override the search results and influence the order in which documents are ranked in the query result list. This can promote important documents to higher scores and make them easier to find.

Relevancy boosting assigns a score to a document for specific queries entered by the search user.

Note:

The document still has a score computed by Oracle Text if you enter a query that is not one of the boosted queries.

Relevancy boosting has the following limitations:

Comparison of the user's query against the boosted queries uses exact string matching. This means that the comparison is case-sensitive and space-aware. Therefore, a document with a boosted score for "Ultra Search" is not boosted when you enter "ultrasearch".
Relevancy boosting requires that the query application pass in the search term using the query API getResult method call. The applications are designed to pass the basic search terms as the boost term. Advanced search criteria based on search attributes are ignored.
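
The exact-match limitation can be illustrated with a lookup table keyed by the query string. This is a sketch of the behavior described above, not of the actual implementation: the class BoostSketch, its methods, and the sample scores used with it are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of exact-match relevancy boosting.
final class BoostSketch {
    // Boosted query string -> (document URL -> boosted score).
    private final Map<String, Map<String, Integer>> boostedScores = new HashMap<>();

    void boost(String query, String documentUrl, int score) {
        boostedScores.computeIfAbsent(query, q -> new HashMap<>()).put(documentUrl, score);
    }

    /** Returns the boosted score on an exact query match, else the text score. */
    int score(String userQuery, String documentUrl, int textScore) {
        Map<String, Integer> forQuery = boostedScores.get(userQuery); // case-sensitive
        if (forQuery != null && forQuery.containsKey(documentUrl)) {
            return forQuery.get(documentUrl);
        }
        return textScore;
    }
}
```

Because the lookup key is the raw query string, "Ultra Search", "ultra search", and "ultrasearch" are three different keys, and only the first matches a boost defined for "Ultra Search".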

See Also:

"Queries Page"

Query Syntax Expansion

Oracle Ultra Search translates each user query into a database query. This process is called query syntax expansion. The Oracle Ultra Search default expansion logic boosts the relevancy of those documents that match the user's query. The query syntax expansion can be customized with the query API.

Display URL Support

When gathering information from a database Web application, Oracle Ultra Search lets you specify a URL to display the retrieved data on a browser. The URL points to a screen in the Web application corresponding to the data in the database. This is available for table data sources, file data sources, and user-defined data sources.

See Also:

"Using Crawler Agents"

Federated Search

Traditionally, Oracle Ultra Search has used centralized search to gather data on a regular basis and to update one index that cataloged all searchable data. This provided fast searching, but it required that the data source be crawlable before it could be searched.

Oracle Ultra Search now also provides federated search, which lets a single search run against multiple indexes. Each index can be maintained separately. Because the data source is queried at search time, search results are always the latest results. User credentials can be passed to the data source and authenticated by the data source itself. Queries can be processed efficiently using the data's native format.

To use federated search, you must deploy an Oracle Ultra Search search adapter, or searchlet, and create an Oracle source. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.
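
The delegation of one query to several sources at search time can be sketched as follows. This illustration does not use the JCA or searchlet APIs: the interface FederatedSourceSketch and the class FederatedSearchSketch are hypothetical, the credential is modeled as a plain string, and results are simply concatenated in source order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a source that is queried at search time.
interface FederatedSourceSketch {
    /** Runs the query on behalf of the user and returns matching hits. */
    List<String> search(String query, String userCredential);
}

// Hypothetical illustration of federated search.
final class FederatedSearchSketch {

    /** Delegates the query to every source and merges the results. */
    static List<String> search(String query,
                               String userCredential,
                               List<FederatedSourceSketch> sources) {
        List<String> merged = new ArrayList<>();
        for (FederatedSourceSketch source : sources) {
            // Each source authenticates the credential itself.
            merged.addAll(source.search(query, userCredential));
        }
        return merged;
    }
}
```

Each source decides for itself what the credential allows, which corresponds to the statement above that credentials are authenticated by the data source.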

See Also:

"Federated Sources"

Single Sign-On Authentication

The Oracle Ultra Search administration tool supports the following modes of logging on, depending on the type of user. You can log on as:

A single sign-on user managed in the Oracle Internet Directory and authenticated with Oracle Application Server Single Sign-On
A local database schema user in the Oracle Ultra Search database (that is, not using single sign-on)
A Portal user
An Enterprise Manager user

Note:

Single sign-on is available only with the Oracle Identity Management infrastructure.

See Also:

"Logging On to Oracle Ultra Search"

Integration with Oracle Internet Directory

Oracle Internet Directory is Oracle's native LDAP v3-compliant directory service, built as an application on top of the Oracle Database. Oracle Internet Directory hosts the Oracle common identity. All Oracle Web-based products integrate with Oracle Application Server Single Sign-On.

Oracle Ultra Search Administration Groups in Oracle Internet Directory

An Oracle Ultra Search administration group contains a set of users. Each user can belong to one or multiple groups. All groups are created using the groupOfUniqueNames and orclGroup object classes.

The only way to grant a user administration privileges is to assign them to an administration group. Oracle Ultra Search authorizes the user administration privileges based on the administration groups to which the user belongs. The following groups are created for each Oracle Ultra Search instance:

Super-users: Users in this group can create or drop Oracle Ultra Search instances and can administer Oracle Ultra Search instances within the installation. Super-users must obey the rules for document relevancy boosting and ACLs defined for each of the documents associated with the Oracle Ultra Search instance. For example, if a document ACL does not grant access to the super-user or group, then the super-user cannot search and browse the document.
Instance administrators: Users in this group can administer the Oracle Ultra Search instance. Only the instance database schema user and members in the super-users group can drop the instance.

Authorization of the Administration Privileges

The authorization of the administration user is performed in the following steps:

After the administration user is successfully authenticated by Oracle Application Server Single Sign-On or the Oracle Ultra Search database, the Oracle Ultra Search GUI brings up the first screen for the user to choose an Oracle Ultra Search instance.
The Oracle Ultra Search GUI looks up the Oracle Internet Directory server or Oracle Ultra Search repository to find all Oracle Ultra Search instances that the administration user has privileges to administer.
The administration user chooses the Oracle Ultra Search instance from the list.

Query Applications

Oracle Ultra Search includes fully functional query applications to query and display search results. The query applications include a search portlet.

The Oracle Ultra Search portlet demonstrates how to write a search portlet for use in Oracle Application Server Portal. It is implemented as a JavaServer Page application. This same portlet is installed as a feature of the Oracle Application Server Portal product.

See Also:

"Oracle Ultra Search Query API"
The Oracle Application Server Portal documentation for more information about portlets
Oracle Ultra Search Query Applications Readme for more information about the query API application

Integration with Oracle Application Server

Although Oracle Ultra Search in the Oracle Application Server is the same product as Oracle Ultra Search in Oracle Collaboration Suite and Oracle Ultra Search in the Oracle Database, there are a few functional differences:

The Oracle Database is not integrated with OracleAS Portal. However, with OracleAS and Oracle Collaboration Suite installations, Oracle Ultra Search lets Portal users add powerful multi-repository search to their Portal pages. OracleAS and Oracle Collaboration Suite can also crawl Portal's own repository and make it searchable. The Portal crawler recognizes Portal page groups as data sources.
OracleAS Single Sign-On users can log on once for all components of the Oracle Application Server product, and the Oracle Ultra Search administrative interface allows user management operations on either database users or single sign-on users. Authenticated single sign-on users never see the Oracle Ultra Search logon screen. Instead, they can immediately choose an instance. If the single sign-on user does not have permissions to manage Oracle Ultra Search (set in the Users Page), then the single sign-on user receives an error. Single sign-on is available only with the Oracle Identity Management infrastructure.

See Also:

http://portalstudio.oracle.com

Oracle Ultra Search System Configuration

Oracle Ultra Search runs as a client program to the Oracle server. It can be deployed in the backend or in the middle tier of a server configuration.

The Oracle Ultra Search query interface and the administration tool can be accessed from any HTML browser client. The administration tool relies on certain Java classes in the middle tier. This logical middle tier can run on the same physical computer as the database server or on a different computer running Oracle Application Server. The Oracle Ultra Search database backend consists of the Oracle Ultra Search data dictionary, which stores metadata on all the different repositories, as well as the schedules and Java classes needed to drive the crawler. The crawler itself can run either on the database server computer or remotely on another computer.

See Also:

Chapter 3, "Installing Oracle Ultra Search" for more information about the components

Figure 1-2 illustrates the Oracle Ultra Search system configuration.

Figure 1-2 Oracle Ultra Search System Configuration
