7 Understanding the Oracle Ultra Search Crawler and Data Sources

This chapter contains the following topics:

Overview of the Oracle Ultra Search Crawler
Crawler Settings
Crawler Data Sources
Document Attributes
Crawling Process for the Schedule
Data Synchronization
Web Crawling Boundary Control
Oracle Ultra Search Remote Crawler
Oracle Ultra Search Crawler Status Codes

Overview of the Oracle Ultra Search Crawler

The Oracle Ultra Search crawler is a Java process activated by your Oracle server according to a set schedule. When activated, the crawler spawns processor threads that fetch documents from various data sources. These documents are cached in the local file system. When the cache is full, the crawler indexes the cached files using Oracle Text. This index is used for querying.

Note:

An empty index is created when an Oracle Ultra Search instance is created. You can alter the index using SQL. The existing preferences, such as language-specific parameters, are defined in the $ORACLE_HOME/ultrasearch/admin/wk0pref.sql file.

Crawler Settings

Before you can use the crawler, you must set its operating parameters, such as the number of crawler threads, the crawler timeout threshold, the database connect string, and the default character set. To do so, use the Crawler Settings Page in the administration tool.

At installation, the Oracle Installer automatically sets the shared library path environment variable to include $ORACLE_HOME/ctx/lib. However, if you restart the database after the installation, then you must manually set this variable to include $ORACLE_HOME/ctx/lib before starting the Oracle process. For filtering to work, you must restart the database so that it picks up the new value.

For example, on UNIX, set the LD_LIBRARY_PATH environment variable to include $ORACLE_HOME/ctx/lib; on Windows, set the PATH environment variable to include %ORACLE_HOME%\bin.
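Before restarting the database, it can help to confirm that the variable really contains the directory. The following Python sketch is only an illustration and is not an Oracle utility; the function name `library_path_includes` and the example paths are assumptions.

```python
import os

def library_path_includes(env, oracle_home, var="LD_LIBRARY_PATH"):
    """Return True if <oracle_home>/ctx/lib is listed in the given variable.

    env is a mapping such as os.environ. The variable name defaults to the
    UNIX one; pass a different name for other platforms.
    """
    wanted = os.path.join(oracle_home, "ctx", "lib")
    entries = env.get(var, "").split(os.pathsep)
    return wanted in entries
```

For example, `library_path_includes(os.environ, os.environ.get("ORACLE_HOME", ""))` checks the environment of the current shell.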

See Also:

"Crawler Page" for more information on crawler settings, and "Web Sources" for more information on settings such as robots exclusion, the UrlRewriter, indexing dynamic Web pages, and HTTP cookies

Crawler Data Sources

In addition to the Web access parameters, you can define specific data sources on the Sources page in the administration tool. You can define one or more of the following data sources:

Web sites
Database tables
Files
Mailing lists
Oracle Application Server Portal page groups
User-defined data sources (requires crawler agent)

Using Crawler Agents

If you are defining a user-defined data source to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, then you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Oracle Ultra Search crawler, which enqueues it for later crawling. For more information on defining a new data source type, see the User-Defined subtab on the Sources page in the administration tool.

Synchronizing Data Sources

You can create synchronization schedules, each with one or more data sources attached to it. A synchronization schedule defines the frequency at which the Oracle Ultra Search index is kept up to date with existing information in the associated data sources. To define a synchronization schedule, use the Sources page in the administration tool.

Display URL and Access URL

For some applications, for security reasons, the URL crawled is different from the one seen by the end user. For example, crawling on an internal Web site inside a firewall might be done without security checking, but when queried by the user, a corresponding mirror URL outside the firewall must be used. This mirror URL is called the display URL.

By default, the display URL is treated as the access URL unless a separate access URL is provided. The display URL must be unique within a data source, so two different access URLs cannot share the same display URL.
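The relationship between the two URLs can be pictured as a small registry. The following Python sketch is illustrative only and is not part of Oracle Ultra Search; the class name `UrlRegistry` and the example URLs are assumptions. It models the defaulting rule and the uniqueness constraint described above.

```python
class UrlRegistry:
    """Toy model of display/access URL pairs within one data source."""

    def __init__(self):
        self._access_by_display = {}

    def register(self, display_url, access_url=None):
        # With no separate access URL, the display URL doubles as the access URL.
        access_url = access_url or display_url
        existing = self._access_by_display.get(display_url)
        if existing is not None and existing != access_url:
            # Two different access URLs may not share one display URL.
            raise ValueError(f"display URL already in use: {display_url}")
        self._access_by_display[display_url] = access_url
        return access_url

    def access_url_for(self, display_url):
        return self._access_by_display[display_url]
```

Registering `http://www.example.com/b.html` with the access URL `http://internal.example.com/b.html` succeeds once; a second registration of the same display URL with a different access URL raises `ValueError`.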

See Also:

"Sources Page"

Document Attributes

Document attributes, or metadata, describe the properties of a document. Each data source has its own set of document attributes. Attribute values are retrieved during the crawling process, then mapped to one of the search attributes, and then stored and indexed in the database. This lets you query documents based on their attributes. Document attributes in different data sources can be mapped to the same search attribute. Therefore, you can query documents from multiple data sources based on the same search attribute.

If the document is a Web page, the attribute can come from the HTTP header or it can be embedded inside the HTML in metatags. Document attributes can be used for many things, including document management, access control, or version control. Different data sources can use different attribute names for the same idea; for example, "version" and "revision". They can also use the same attribute name for different ideas; for example, "language" can mean natural language in one data source but programming language in another.

Oracle Ultra Search has the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. They can be incorporated in search applications for a more detailed search and richer presentation.

Search attributes can also be created in the following ways:

System-defined search attributes, such as title, author, description, subject, and mimetype
Search attributes created by the system administrator
Search attributes created by the crawler. (During crawling, the crawler agent maps the document attribute to a search attribute with the same name and data type. If not found, then the crawler creates a new search attribute with the same name and type as the document attribute defined in the crawler agent.)

The list of values (LOV) for a search attribute can help you specify a search query. If an attribute LOV is available, then the crawler registers the LOV definition, which includes the attribute value, the attribute value display name, and its translation.
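The mapping from document attributes to search attributes can be sketched as follows. This Python example is an illustration under stated assumptions, not the crawler's implementation: the source names, the `Version` attribute, and the function `map_attribute` are all hypothetical.

```python
# Search attributes known to the (hypothetical) instance, as (name, type) pairs.
search_attributes = {("Title", "string"), ("Author", "string")}

# Per-source mapping from document attribute name to search attribute name.
mappings = {
    "source_a": {"version": "Version"},
    "source_b": {"revision": "Version"},
}

def map_attribute(source, doc_attr, data_type):
    """Return the search attribute name used for a document attribute."""
    # Without an explicit mapping, fall back to the document attribute's own name.
    target = mappings.get(source, {}).get(doc_attr, doc_attr)
    # If no search attribute with this name and type exists, create one.
    if (target, data_type) not in search_attributes:
        search_attributes.add((target, data_type))
    return target
```

Both `map_attribute("source_a", "version", "string")` and `map_attribute("source_b", "revision", "string")` resolve to the same search attribute, so a single query on that attribute covers both data sources.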

Crawling Process for the Schedule

The first time the crawler runs, it must fetch Web pages, table rows, files, and so on, depending on the data source. It then adds each document to the Oracle Ultra Search index. The crawling process for the schedule is broken into two phases:

Queuing and Caching Documents
Indexing Documents

Queuing and Caching Documents

Figure 7-1 and Figure 7-2 illustrate an instance of the crawling cycle in a sequence of eight steps. The example uses a Web data source, although the crawler can also crawl other data source types.

Figure 7-1 illustrates how the crawler and its crawling threads are activated. It also shows how the crawler queues hypertext links to control its navigation. This figure corresponds to Steps 1 through 5.

Figure 7-2 illustrates how the crawler caches Web pages. This figure corresponds to Steps 6 through 8.

The steps are the following:

1. Oracle spawns the crawler according to the schedule you specify with the administration tool. When crawling is initiated for the first time, the URL queue is populated with the seed URLs. See Figure 7-1.
2. The crawler initiates multiple crawling threads.
3. The crawler thread removes the next URL in the queue.
4. The crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
5. The crawler thread scans the HTML file for hypertext links and inserts new links into the URL queue. Duplicate links already in the document table are discarded.
6. The crawler caches the HTML file in the local file system. See Figure 7-2.
7. The crawler registers the URL in the document table.
8. The crawler thread starts over by repeating Step 3.

Fetching a document, as described in Step 4, can be time-consuming because of network traffic or slow Web sites. For maximum throughput, multiple threads fetch pages at any given time.
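The queuing and caching loop can be sketched in a few lines. The following Python example is a simplified, single-threaded illustration and not the Oracle Ultra Search crawler itself: the in-memory `FAKE_WEB` dictionary stands in for HTTP fetches, the regular expression is a crude link extractor, and all names are assumptions. For simplicity, the sketch records a URL in the document table when the link is first discovered rather than after caching.

```python
import re
from collections import deque

# Hypothetical in-memory "Web": URL -> HTML. A real crawler fetches over HTTP.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(seed_urls):
    queue = deque(seed_urls)            # Step 1: queue starts with the seed URLs
    document_table = set(seed_urls)     # URLs already known to the crawler
    cache = {}                          # stands in for the file system cache
    while queue:
        url = queue.popleft()           # Step 3: remove the next URL
        html = FAKE_WEB.get(url)        # Step 4: fetch the document
        if html is None:
            continue                    # unreachable URL: nothing to cache
        for link in LINK_RE.findall(html):   # Step 5: queue links not seen before
            if link not in document_table:
                document_table.add(link)
                queue.append(link)
        cache[url] = html               # Step 6: cache the document
    return cache
```

Calling `crawl(["http://example.com/"])` visits each of the three pages exactly once, even though the pages link to one another in a cycle.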

Note:

URLs remain visible until the next crawling run. When the crawler detects that a URL no longer exists, it removes the URL from the wk$doc table, and Oracle Text automatically marks the document as deleted even though the index data still exists. Cleanup is done through index optimization, which can be scheduled separately.

Figure 7-1 Queuing URLs


Figure 7-2 Caching URLs


Indexing Documents

When the file system cache is full (default maximum size is 20 MB), document caching stops and indexing begins. In this phase, Oracle Ultra Search augments the Oracle Text index using the cached files referred to by the document table. See Figure 7-3.

Figure 7-3 Indexing Documents
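The switch from caching to indexing can be modeled as a size-bounded batch. The Python sketch below is an illustration only: the class `DocumentCache` and the `index_batch` callback are hypothetical, and the real crawler caches files on disk and hands them to Oracle Text.

```python
class DocumentCache:
    """Collects documents until a size limit is reached, then indexes them."""

    def __init__(self, max_bytes, index_batch):
        self.max_bytes = max_bytes        # for example 20 * 1024 * 1024
        self.index_batch = index_batch    # callback that indexes one batch
        self.docs = {}
        self.size = 0

    def add(self, url, content):
        self.docs[url] = content
        self.size += len(content)
        if self.size >= self.max_bytes:   # cache full: stop caching, start indexing
            self.flush()

    def flush(self):
        # Also called when the URL queue is empty, to index the remainder.
        if self.docs:
            self.index_batch(dict(self.docs))
            self.docs.clear()
            self.size = 0
```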


Data Synchronization

After the initial crawl, a Web page is crawled and indexed only if it has changed since the last crawl. The crawler determines whether a page has changed by using the HTTP If-Modified-Since header field or the checksum of the page. URLs that no longer exist are marked and removed from the index.

To update changed documents, the crawler uses an internal checksum to compare new Web pages with cached Web pages. Changed Web pages are cached and marked for reindexing.

The steps involved in data synchronization are the following:

1. Oracle spawns the crawler according to the synchronization schedule you specify with the administration tool. The URL queue is populated with the data source URLs assigned to the schedule.
2. The crawler initiates multiple crawling threads.
3. Each crawler thread removes the next URL in the queue.
4. Each crawler thread fetches the document from the Web. The document is usually an HTML file containing text and hypertext links.
5. Each crawler thread calculates a checksum for the newly retrieved page and compares it with the checksum of the cached page. If the checksum is the same, then the page is discarded and the crawler goes to Step 3. Otherwise, the crawler moves to the next step.
6. Each crawler thread scans the document for hypertext links and inserts new links into the URL queue. Links that are already in the document table are discarded.
7. The crawler caches the document in the local file system. See Figure 7-2.
8. The crawler registers the URL in the document table.
9. If the file system cache is full or if the URL queue is empty, then Web page caching stops and indexing begins. Otherwise, the crawler thread starts over at Step 3.
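The checksum comparison in Step 5 can be sketched as follows. This Python example is a minimal illustration: the choice of SHA-256 is an assumption (this chapter does not say which checksum the crawler uses), and the function names are hypothetical.

```python
import hashlib

def checksum(content: bytes) -> str:
    # SHA-256 is an assumption made for this sketch.
    return hashlib.sha256(content).hexdigest()

def needs_reindex(url, new_content, stored_checksums):
    """Return True, and record the new checksum, if the page is new or changed."""
    new_sum = checksum(new_content)
    if stored_checksums.get(url) == new_sum:
        return False                  # unchanged: discard and take the next URL
    stored_checksums[url] = new_sum
    return True                       # new or changed: cache and mark for reindexing
```

A page fetched twice with identical content is reindexed only the first time; any change to the content produces a different checksum and triggers reindexing.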

Web Crawling Boundary Control

Oracle Ultra Search provides the following mechanisms to control the scope of crawling for a Web data source:

URL boundary rule (domain rule and path rule)
Robots.txt file and robots META tag
Crawling depth
URL Rewriter API

URL Boundary Rule

The URL boundary rule consists of domain rules and path rules. A domain rule specifies the set of Web sites allowed using a host name prefix or suffix. A path rule specifies the URL file path allowed or disallowed for a particular host. You can specify an inclusion or exclusion rule for both a domain rule and a path rule. Exclusion rules always override inclusion rules. Path rules are always host-specific.

For example, an inclusion domain ending with oracle.com limits the Oracle Ultra Search crawler to hosts belonging to Oracle worldwide. Any host name ending with oracle.com is crawled, but http://www.oracle.com.tw is not crawled because its host name ends with .tw. If you change the inclusion domain to someurl.com with a new seed http://www.someurl.com, then all oracle.com URLs are dropped by the crawler.

An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both.

All URLs must pass domain rules before being checked for path rules. Path rules let you further restrict the crawling space. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include the path /host/doc and exclude the path /host/doc/private. Note that path rules are prefix-based.

Regular expression-based domain and path rules are not supported in the current release.

The following rules restrict the crawler to www.oracle.com and otn.oracle.com. Furthermore, on otn.oracle.com, only URLs under /products/database/ and /products/ias/, but not under /products/ias/web_cache/, are crawled.

Domain inclusion: www.oracle.com
Domain inclusion: otn.oracle.com
Path inclusion for otn.oracle.com:         /products/database/
                                           /products/ias/
Path exclusion for otn.oracle.com:         /products/ias/web_cache/
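The way these rules combine can be sketched as a single predicate. The Python example below is an illustration and not Ultra Search code: the function `in_boundary` and the rule tables are hypothetical, host matching is a plain suffix test, and a host with no path inclusion rule is assumed to allow all paths.

```python
from urllib.parse import urlsplit

DOMAIN_INCLUDE = ["www.oracle.com", "otn.oracle.com"]
DOMAIN_EXCLUDE = []
PATH_INCLUDE = {"otn.oracle.com": ["/products/database/", "/products/ias/"]}
PATH_EXCLUDE = {"otn.oracle.com": ["/products/ias/web_cache/"]}

def in_boundary(url):
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"
    # Domain rules are checked first; exclusion overrides inclusion.
    if any(host.endswith(d) for d in DOMAIN_EXCLUDE):
        return False
    if not any(host.endswith(d) for d in DOMAIN_INCLUDE):
        return False
    # Path rules are host-specific and prefix-based.
    if any(path.startswith(p) for p in PATH_EXCLUDE.get(host, [])):
        return False
    includes = PATH_INCLUDE.get(host)
    return includes is None or any(path.startswith(p) for p in includes)
```

With these tables, `http://otn.oracle.com/products/ias/index.html` is inside the boundary, while `http://otn.oracle.com/products/ias/web_cache/index.html` and `http://www.oracle.com.tw/` are not.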

robots.txt Protocol and robots Metatag

The robots.txt protocol is the webmaster's path rule for any spider or crawler that visits his or her Web site.

The following sample /robots.txt file specifies that no robots should visit any URL starting with /cyberworld/map/ or /tmp/, or /foo.html:

# robots.txt for http://www.acme.com/
 
User-agent: *
Disallow: /cyberworld/map/ 
Disallow: /tmp/ 
Disallow: /foo.html

By default, the Oracle Ultra Search crawler observes the robots.txt protocol, but it also allows the user to override it. If the Web site is under the user's control, then a specific robots rule can be tailored for the crawler by specifying the Oracle Ultra Search crawler agent name, "User-agent: Ultra Search". For example:

User-agent: Ultra Search
Disallow: /tmp/

The robots metatag can instruct the crawler whether to index a Web page and whether to follow the links within it.
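To see how a robots.txt file such as the sample for www.acme.com is interpreted, the rules can be evaluated with the `urllib.robotparser` module from the Python standard library. This is a generic robots.txt parser used here for illustration; it is not the parser inside the Oracle Ultra Search crawler, and the helper name `is_allowed` is an assumption.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# robots.txt for http://www.acme.com/

User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html
"""

def is_allowed(robots_txt, user_agent, url):
    """Return True if the given agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these rules, `is_allowed(ROBOTS_TXT, "Ultra Search", "http://www.acme.com/index.html")` is true, while any URL whose path starts with /tmp/ is refused for every agent.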

Crawling Depth

Crawling depth controls how deep the crawler follows a link starting from the given seed URL. Because crawling is multi-threaded, this is not a deterministic control, as there may be different routes to a particular page.

The crawling depth limit applies to all Web sites in a given Web data source.
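A depth limit can be illustrated with a breadth-first walk over a link graph. The following Python sketch is an assumption-laden illustration, not the crawler's algorithm: the seed is counted as depth 0, the graph is a plain dictionary, and the walk is single-threaded, so unlike the multi-threaded crawler it always reaches a page by its shortest route.

```python
from collections import deque

def urls_within_depth(links, seed, max_depth):
    """Return {url: depth} for every URL reachable within max_depth links."""
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue                      # do not follow links any deeper
        for target in links.get(url, []):
            if target not in depth:       # first route found is the shortest
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth
```

For the chain `{"s": ["a"], "a": ["b"], "b": ["c"]}` and a limit of 2, the walk reaches `s`, `a`, and `b` but never visits `c`.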

URL Rewriter

You implement the URL rewriter API as a Java class to perform link filtering or rewriting. Extracted links within a crawled Web page are passed to this module for checking. This gives you full control over which links extracted from a Web page are allowed and which ones are discarded.
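The idea of filtering and rewriting links can be shown with a small function. The Python sketch below is purely conceptual: it is not the Oracle Ultra Search URL Rewriter API (which is a Java interface), and the function name, the `sessionid` parameter, and the logout rule are hypothetical policy choices.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_link(link):
    """Return the link to crawl, or None to discard it (hypothetical policy)."""
    parts = urlsplit(link)
    if parts.path.endswith("/logout"):
        return None                       # filtering: never follow logout links
    # Rewriting: drop a session parameter so the same page is not crawled twice.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "sessionid"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```

Here `http://example.com/doc?id=7&sessionid=abc` is rewritten to `http://example.com/doc?id=7`, and `http://example.com/logout` is discarded.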

See Also:

"Oracle Ultra Search URL Rewriter API"

URL Redirection and Boundary Rule Enforcement

Earlier Oracle Ultra Search releases (9.0.2, 9.0.3, and 9.2.0.4) applied the same boundary checking to a redirected URL. Thus, a redirected URL would be rejected if it was outside the boundary rule. If the redirected URL was to be crawled, you had to make sure it was covered by the boundary rule.

In 9.2.0.5, Oracle Application Server 10g, and Oracle Database 10g, the redirected URL is always allowed if it is a temporary redirection (HTTP status 302 or 307). For permanent redirection (status 301), the redirected URL is still subject to boundary rules.

HTTP metatag redirection is always checked against boundary rules.
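The redirection behavior described for 9.2.0.5 and later can be summarized as a decision function. This Python sketch is an illustration, not crawler code: the function name is hypothetical, `in_boundary` is a caller-supplied predicate, and a status of `None` is used here to represent a metatag redirection. Any status other than 302 or 307 is treated like 301, which is an assumption.

```python
TEMPORARY_REDIRECTS = {302, 307}

def redirect_allowed(target_url, in_boundary, status=None):
    """Decide whether a redirect target may be crawled.

    status is the HTTP redirect code, or None for a metatag redirection.
    """
    if status in TEMPORARY_REDIRECTS:
        return True                       # temporary redirection: always allowed
    # Permanent (301) and metatag redirections remain subject to boundary rules.
    return in_boundary(target_url)
```

With a boundary that admits only `http://example.com/`, a 302 redirection to another host is followed, whereas a 301 or metatag redirection to the same target is rejected.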

Oracle Ultra Search Remote Crawler

To increase crawling performance, set up the Oracle Ultra Search crawler to run on one or more computers separate from your database. These computers are called remote crawlers. However, each computer must share log and mail archive directories with the database computer.

To configure a remote crawler, you must first install the Oracle Ultra Search middle tier on a computer other than the database host. During installation, the remote crawler is registered with the Oracle Ultra Search system, and a profile is created for the remote crawler. After installing the Oracle Ultra Search middle tier, you must log on to the Oracle Ultra Search administration tool and edit the remote crawler profile. You can then assign a remote crawler to a crawling schedule. To edit remote crawler profiles, use the Crawler Settings page in the administration tool.

See Also:

"Using the Remote Crawler"

Oracle Ultra Search Crawler Status Codes

The crawler uses a set of codes to indicate the crawling result of the crawled URL. Besides the standard HTTP status codes, it uses its own codes for non-HTTP related situations. Only URLs with status 200 will be indexed.

See Also:

Appendix D, "URL Crawler Status Codes"