Oracle® Data Mining Application Developer's Guide, 10g Release 2 (10.2) Part Number B14340-01 |
|
|
View PDF |
This chapter describes table functions included with ODM that permit you to perform similarity searches against nucleotide and amino acid sequence data stored in an Oracle database. You can use the table functions described in this chapter for ad hoc searches or you can embed them in applications. The inclusion of these table functions in ODM positions Oracle as a platform for bioinformatics.
This chapter discusses the following topics:
The National Center for Biotechnology Information (NCBI) developed one of the commonly used versions of the Basic Local Alignment Search Tool (BLAST).
Sequence alignments provide a way to compare new sequences with previously characterized sequences. Both functional and evolutionary information can be inferred from well-designed queries and alignments. BLAST provides a method for searching of both nucleotide and protein databases. Since the BLAST algorithm detects local alignments, regions of similarity embedded in otherwise unrelated sequences can be detected.
The BLAST algorithm searches nucleotide and amino acid query sequences against databases of nucleotide and amino acid sequences. Based on the nature of the query and the database sequences, NCBI BLAST provides the following variants:
BLASTP compares an amino acid query sequence against an amino acid sequence database.
BLASTN compares a nucleotide query sequence against a nucleotide sequence database.
BLASTX compares a nucleotide query sequence translated along all six reading frames (both strands) against a amino acid sequence database.
TBLASTN compares an amino acid query sequence against a nucleotide sequence database translated along all six reading frames (both strands).
TBLASTX compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
For more information about NCBI BLAST, see the NCBI BLAST Home Page at
http://www.ncbi.nlm.nih.gov/BLAST
The table functions described in this chapter implement some of the variants of NCBI BLAST version 2.0.
This section contains several examples of using the ODM BLAST table functions to perform searches on nucleotide or amino acid sequences.
Most table function parameters have defaults. The defaults were carefully chosen so that users who have limited experience with BLAST should obtain good results.
The BLAST table functions accept the CLOB (Character Large OBject) data type as the query sequence. It is not possible to construct a CLOB in an ad hoc SQL query. One way to construct a CLOB is to create a table and insert the query sequence into the table. Another option is to construct a CLOB using the programmatic interface if the BLAST query is part of a larger program. Suppose that the table query_db
has the schema (sequence_id VARCHAR2(32), sequence CLOB)
. The following SQL query inserts the query sequence into query_db
:
INSERT INTO query_db VALUES ('1', 'AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGT');
Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id
, publication date
, modification date
, organism
, and sequence
, among other attributes. There is no required schema for the table that stores the sequences. The only requirement is that the table contain an identifier and the sequence and any number of other optional attributes.
The portion of the database to be used for the search can be specified using SQL. The full power of SQL can be used to perform sophisticated selections.
The following query does a BLAST search of the given query sequence against the human genome and returns the seq_id,
score
, and expect
value of matches that score > 25:
SELECT t.t_seq_id, t.score, t.expect FROM TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE organism = 'human'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.score > 25;
Note: The parameter value of 0 invokes the default values in most cases. See the syntax for details.
The following query does the BLAST search against all sequences published after Jan 01, 2000:
SELECT t.t_seq_id, t.score, t.expect FROM TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.score > 25;
You can obtain other attributes of the matching sequence by joining the BLAST result with the original sequence table as follows:
SELECT t.t_seq_id, t.score, t.expect, g.publication_date, g.organism FROM GENE_DB g, TABLE ( BLASTN_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.t_seq_id = g.seq_id AND t.score > 25;
Suppose that the table PROT_DB stores protein sequences. Insert the protein query sequence to be used for the search into query_db
.
The following query does a BLASTP search of the given query sequence against protein sequences in PROT_DB and returns the identifier
, score
, name
, and expect
value of matches that score > 25:
SELECT t.t_seq_id, t.score, t.expect, p.name FROM PROT_DB p, TABLE( BLASTP_MATCH ( (SELECT sequence FROM query_db WHERE sequence_id = '2'), CURSOR(SELECT seq_id, sequence FROM PROT_DB), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) )t WHERE t.t_seq_id = p.seq_id AND t.score > 25 ORDER BY t.expect;
Suppose that the table GENE_DB stores DNA sequences. Suppose that GENE_DB has attributes seq_id
, publication date
, modification date
, organism
, and sequence
among other attributes.
The following query does a BLAST search and alignment of the given query sequence against the human genes and returns the publication_date
, organism
, and the alignment attributes of the matching sequences that score > 25 and where more than 50% of the sequence is conserved in the match:
SELECT t.t_seq_id, t.alignment_length, t.pct_identity, t.q_seq_start, t.q_seq_end, t.t_seq_start, t.t_seq_end, t.score, t.expect, g.publication_date, g.organism FROM GENE_DB g, TABLE ( BLASTN_ALIGN ( (SELECT sequence FROM query_db WHERE sequence_id = '1'), CURSOR (SELECT seq_id, sequence FROM GENE_DB WHERE publication_date > '01-JAN-2000'), 1, -1, 0, 0, 10, 0, 0, 0, 0, 11, 0, 0) ) t WHERE t.t_seq_id = g.seq_id AND t.score > 25 AND t.pct_identity > 50;
You can use BLASTP_ALIGN and TBLAST_ALIGN in a similar way.
The output of a BLAST query is a table; the output table is described as the output table for the specific query.
Here are two examples of queries and the resulting output tables.
Query 1 is as follows:
select T_SEQ_ID AS seq_id, score, EXPECT as evalue from TABLE( BLASTP_MATCH ( (select sequence from query_db), CURSOR(SELECT seq_id, seq_data FROM swissprot WHERE organism = 'Homo sapiens (Human)'), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) );
The output for query 1 is as follows:
SEQ_ID SCORE EVALUE -------- ---------- ---------- P31946 205 5.8977E-18 Q04917 198 3.8228E-17 P31947 169 8.8130E-14 P27348 198 3.8228E-17 P58107 49 7.24297332
Query 2 is as follows:
select T_SEQ_ID AS seq_id, ALIGNMENT_LENGTH as len, Q_SEQ_START as q_strt, Q_SEQ_END as q_end, Q_FRAME, T_SEQ_START as t_strt, T_SEQ_END as t_end, T_FRAME, score, EXPECT as evalue from TABLE( BLASTP_ALIGN ( (select sequence from query_db), CURSOR(SELECT seq_id, seq_data FROM swissprot WHERE organism = 'Homo sapiens (Human)' AND creation_date > '01-Jan-90'), 1, -1, 0, 0, 'BLOSUM62', 10, 0, 0, 0, 0, 0) );
The output for Query 2 is as follows:
SEQ_ID LEN Q_STRT Q_END Q_FRAME T_STRT T_END T_FRAME SCORE EVALUE -------- ---- ------ ----- ------- ------ ----- ------- ------- ---------- P31946 50 0 50 0 13 63 0 205 5.1694E-18 Q04917 50 0 50 0 12 62 0 198 3.3507E-17 P31947 50 0 50 0 12 62 0 169 7.7247E-14 P27348 50 0 50 0 12 62 0 198 3.3507E-17 P58107 21 30 51 0 792 813 0 49 6.34857645
If you perform frequent BLAST searches on nucleotide sequences, performance improves significantly when the data set of sequences is transformed into a compressed binary format, and the compressed data is used in the searches. The BLASTN_COMPRESS() function transforms a nucleotide data set represented as CLOBs into compressed binary format represented as BLOBs.
Suppose that the table GENE_DB contains DNA sequences upon which you will perform frequent searches. Suppose that GENE_DB has attributes (seq_id, publication date, modification date, organism, sequence)
among other attributes. The following query stores all human DNA sequences in compressed binary format, in the table COMPRESSED_HUMAN_GENES.
create table COMPRESSED_HUMAN_GENES as select seq_id, seq_data from Table(BLASTN_COMPRESS ( from GENE_DB where organism = 'human')))
The portion of the database to be compressed can be specified using SQL. The full power of SQL can be used to perform more sophisticated selections involving joins.
The compressed sequences can be directly passed to BLAST match and align functions as shown in the following example.
select t.t_seq_id, t.alignment_length, t.pct_identity, t.q_start, t.q_end, t.s_start, t.s_end, t.score, t.expect, g.publication_date, g.organism from GENE_DB g, Table(BLASTN_ALIGN ( select sequence from QUERY_SEQ where id = '1'), seqdb_cursor => cursor(select seq_id, seq_data from Table(BLASTN_COMPRESS ( cursor(select seq_id, sequence from GENE_DB where organism = 'human')))), expect_value => 5, word_size => 12)) t where t.t_seq_id = g.identifier AND t.score > 25 AND t.pct_identity > 50;
We provide a few sample data sets and queries that can be used to verify that the BLAST functions work correctly after ODM is installed.
The DM_USER
schema contains the following sequence data tables:
SWISSPROT
PROT_DB
ECOLI10
The SWISSPROT
table contains the sequences in Release 40 of the SwissProt database. This table has the sequence identifier, creation_date
, organism
, and sequence_data
attributes. It has 101,602 protein sequences.
SQL> describe SWISSPROT; Name Null? Type --------------------------------------- ------- ------------- SEQ_ID VARCHAR2(32) CREATION_DATE DATE ORGANISM VARCHAR2(256) SEQ_DATA CLOB
The PROT_DB
table consists of 19 protein sequences from Release 40 of the SwissProt data set.
SQL> describe prot_db; Name Null? Type ---------------------------------------- ------- ------------- SEQ_ID VARCHAR2(32) SEQ_DATA CLOB
The ECOLI10
table contains 10 nucleotide sequences from the Escherichia coli data set.
SQL> describe ECOLI10; Name Null? Type ----------------------------------------- -------- --------------- SEQ_ID VARCHAR2(32) SEQ_DATA CLOB
Table 9-1 lists genetic codes and associated names.
Table 9-1 Table of Genetic Codes
Genetic Code | Name |
---|---|
1 |
Standard |
2 |
Vertebrate Mitochondrial |
3 |
Yeast Mitochondrial |
4 |
Mold Mitochondrial, Protozoan Mitochondrial, Coelenterate Mitochondrial, Mycoplasma, Spiroplasm |
5 |
Invertebrate Mitochondrial |
6 |
Ciliate Nuclear, Dasycladacean Nuclear, Hexamita Nuclear |
9 |
Echinoderm Mitochondrial |
10 |
Euplotid Nuclear |
11 |
Bacterial and Plant Plastid |
12 |
Alternative Yeast Nuclear |
13 |
Ascidian Mitochondrial |
14 |
Flatworm Mitochondrial |
15 |
Blepharisma Macronuclear |
16 |
Chlorophycean Mitochondrial |
21 |
Trematode Mitochondrial |
22 |
Scenedesmus Obliquus Mitochondrial |
23 |
Thraustochytrium Mitochondrial Code |
There are several public domain sequence databases available. One of them is the SwissProt database, which is a highly curated collection of protein sequences. SwissProt has recently been combined with other databases to create UniProt. The last release of the SwissProt database can be downloaded from
ftp://ftp.ebi.ac.uk/pub/databases/swissprot/release/sprot45.dat
In addition to the raw sequence data, the SwissProt database contains several other attributes of the sequence including organism, date published, date modified, published literature references, annotations, and so on. BLAST requires only the sequence identifier and the sequence data to be stored to perform searches.
Depending on the needs of your specific application, different sets of these attributes may be important. Therefore, the database schema required to store the data needs to be appropriately designed. You can use a scripting language to parse the required fields from the SwissProt data and format the fields so that they can be loaded into an Oracle database.
The following Perl script outputs the sequence identifier, creation_date, organism, and sequence data in the required format for SQL*Loader. (SQL*Loader is the utility that loads data into an Oracle database; it is described in detail in Oracle Database Utilities.)
#!/bin/perl #swissprot.pl < input > output #Input: protein db as provided by SWISSPROT # my $string = ""; my $indicator = ""; $sq = 0; $ac = 0; while(<>) { #chop; if ( /^\/\// ) { print "\n"; $sq = 0; $ac = 0; next; } if ($sq == 1) { @words = split; foreach $word (@words) { print "$word"; } next; } if( /^AC(\s+)(\w+);/ ) { if ($ac == 0) { $indicator = $2; print "$indicator|"; $sq = 0; $dt = 0; $ac = 1; next; } } if ( /^OS(\s+)(.*)\./ ) { $organism = $2; print "$organism|"; next; } if ( /^DT(\s+)(\S+)/ ) { if ($dt == 0) { print "$2|"; $dt = 1; } } if ( /^SQ(\s+)/ ) { $sq = "1"; next; } }
Follow these steps to download, parse, and save the SwissProt data in an Oracle database:
Download SwisProt data to the file sprot45.dat
.
Save the perl script in a file named swissprot.pl
, type the command
swissprot.pl sprot45.dat > sprot_formatted.txt
This command will read the SwissProt data stored in sprot45.dat
, format it, and write it out to sprot_formatted.txt
.
In order to load the data using SQL*Loader, you must create a table to hold the data and a control file. Create the table swissprot
using the following SQL statement:
create table swissprot (SEQ_ID VARCHAR2(32), CREATION_DATE DATE, ORGANISM VARCHAR2(256), SEQ_DATA CLOB);
Create a control file named sprot.ctl
with the following contents:
LOAD DATA INFILE sprot40_formatted.txt INTO TABLE swissprot REPLACE FIELDS TERMINATED BY '|' TRAILING NULLCOLS ( seq_id, creation_date, organism, seq_data char(100000) )
Finally, load the data:
sqlldr userid=<user_name>/<passwd> control=sprot.ctl log=sprot.log direct=TRUE data=sprot40_formatted.txt
The SwisProt data is now stored in the Oracle table swissprot
.
The BLAST functionality is available as table functions; these table functions can be used in the FROM clause of a SQL query.
Table 9-2 BLAST Table Functions
Table Function | Description |
---|---|
BLASTN_COMPRESS Table Function |
Compress nucleotide sequence data to improve performance of sequence searches. |
|
Perform a search of the given nucleotide sequence against the selected portion of the nucleotide database |
Perform a search of the given amino acid sequence against the selected portion of the protein database |
|
Perform a search involving translations of either the query sequence or the database of sequences |
|
Perform an alignment of the given nucleotide sequence against the selected portion of the nucleotide database |
|
Perform an alignment of the given amino acid sequence against the selected portion of the protein database |
|
Perform alignments involving translations of either the query sequence or the database of sequences |
This table function compresses nucleotide sequence data. It takes as input a cursor of sequence identifier and sequence data represented as a CLOB and returns the sequence identifier and a BLOB representing the sequence data in compressed binary format. The result of BLASTN_COMPRESS can be either materialized in a table for future use or passed into the BLAST search functions that accept nucleotide sequence data
Syntax
function BLASTN_COMPRESS ( sequence_cursor REF CURSOR) return table of row (seq_id VARCHAR2, seq_data BLOB);
Parameters
Table 9-3 describes the input parameters for BLASTN_COMPRESS; Table 9-4, the output parameters.
Table 9-3 Input Parameters for BLASTN_COMPRESS Table Function
Parameter | Description |
---|---|
|
The cursor of the sequences to be compressed. The cursor has two columns the sequence identifier and the sequence string. |
Table 9-4 Output Parameters for BLASTN_MATCH Table Function
Attribute | Description |
---|---|
|
The sequence identifier of the sequence. The value returned is the same as the sequence identifier in the input cursor. |
|
The compressed sequence represented as a BLOB. |
This table function performs a BLASTN search of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id
) (for example, the NCBI accession number), the score of the match, and the expect value.
Syntax
function BLASTN_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, expect_value NUMBER default 10, open_gap_cost NUMBER default 5, extend_gap_cost NUMBER default 2, mismatch_cost NUMBER default -3, match_reward NUMBER default 1, word_size NUMBER default 11, xdropoff NUMBER default 30, final_x_dropoff NUMBER default 50) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Parameters
Table 9-5 describes the input parameters for BLASTN_MATCH; Table 9-6, the output parameters.
Table 9-5 Input Parameters for BLASTN_MATCH Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
|
|
|
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 5. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 2. Specifying 0 invokes default behavior. |
|
The penalty for nucleotide mismatch. The default value is -3. Specifying 0 invokes default behavior. |
|
The reward for a nucleotide match. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 11. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 30. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 50. Specifying 0 invokes default behavior. |
Table 9-6 Output Parameters for BLASTN_MATCH Table Function
Attribute | Description |
---|---|
|
The sequence identifier of the returned match. |
|
The score of the returned match. |
|
The expect value of the returned match. |
This table function performs a BLASTP search of the given amino acid sequence against the portion of the selected protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. We also accept the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id
) (for example, the NCBI accession number), the score of the match, and the expect value.
Syntax
function BLASTP_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Parameters
Table 9-7 describes the input parameters for BLASTN_MATCH; Table 9-8, the output parameters.
Table 9-7 Input Parameters for BLASTP_MATCH Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
|
|
|
|
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior. |
Table 9-8 Output Parameters for BLASTP_MATCH Table Function
Attribute | Description |
---|---|
|
The sequence identifier of the returned match. |
|
The score of the returned match. |
|
The expect value of the returned match. |
For each substitution matrix (sub_matrix
), only certain combinations of (open_gap_cost, extend_gap_cost) values are supported. Table 9-9 shows the supported combinations of values for each substitution matrix.
Table 9-9 Supported Combinations of (open_gap_cost, extend_gap cost)
Substitution Matrix Name | Supported (open_gap_cost, extend_gap_cost) Values |
---|---|
BLOSUM45 |
(13,3), (12,3), (11,3), (10,3), (16,2), (15,2), (14,2), (13,2), (12,2), (19,1), (18,1), (17,1), (16,1) |
BLOSUM62 |
(11,2), (10,2), (9,2), (8,2), (7,2), (6,2), (13,1), (12,1), (11,1), (10,1), (9,1) |
BLOSUM80 |
(25,2), (13,2), (9,2), (8,2), (7,2), (6,2), (11,1),(10,1), (9,1) |
PAM30 |
(7,2), (6,2), (5,2), (10,1), (9,1), (8,1) |
PAM70 |
(8,2), (7,2), (6,2), (11,1), (10,1), (9,1) |
This table function performs BLAST searches involving translations of either the query sequence or the database of sequences. The available options are:
BLASTX: The query nucleotide sequence is translated and compared against a protein database.
TBLASTN: The query amino acid sequence is compared against a translated nucleotide database.
TBLASTX: The query nucleotide sequence is translated and compared against a translated nucleotide database.
The database can be selected using a standard SQL select and passed into the function as a cursor. We also accept the standard BLAST parameters that are listed in the following section. The match returns the identifier of the matched (target) sequence (t_seq_id
) (for example, the NCBI accession number), the score of the match, and the expect value.
Syntax
function TBLAST_MATCH ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, translation_type VARCHAR2 default 'BLASTX', genetic_code NUMBER default 1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row (t_seq_id VARCHAR2, score NUMBER, expect NUMBER);
Parameters
Table 9-10 describes the input parameters for TBLAST_MATCH; Table 9-11, the output parameters.
Table 9-10 Input Parameters for TBLAST_MATCH Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BKLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX. |
|
Used for translating nucleotide sequences to amino acid sequences. |
|
|
|
|
|
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior. |
Table 9-11 Output Parameters for TBLAST_MATCH Table Function
Attribute | Description |
---|---|
|
The sequence identifier of the returned match. |
|
The score of the returned match. |
|
The expect value of the returned match. |
This table function performs a BLASTN alignment of the given nucleotide sequence against the selected portion of the nucleotide database. The database can be selected using a standard SQL select and passed into the function as a cursor. It accepts the standard BLAST parameters that are listed in the following section.
BLASTN_MATCH returns only the score and expect value of the match. It does not return information about the alignment. BLASTN_MATCH is typically used when a BLAST search will be followed up with h a more compute intensive alignment, such as the Smith-Waterman alignment.
BLASTN_ALIGN does the BLAST alignment and returns the information about the alignment.
Syntax
function BLASTN_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, expect_value NUMBER default 10, open_gap_cost NUMBER default 5, extend_gap_cost NUMBER default 2, mismatch_cost NUMBER default -3, match_reward NUMBER default 1, word_size NUMBER default 11, xdropoff NUMBER default 30, final_x_dropoff NUMBER default 50) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Parameters
Table 9-12 describes the input parameters for BLASTN_ALIGN; Table 9-13, the output parameters.
Table 9-12 Input Parameters for BLASTN_ALIGN Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
|
|
|
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 5. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 2. Specifying 0 invokes default behavior. |
|
The penalty for nucleotide mismatch. The default value is -3. Specifying 0 invokes default behavior. |
|
The reward for a nucleotide match. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 11. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 30. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 50. Specifying 0 invokes default behavior. |
Table 9-13 Output Parameters for BLASTN_ALIGN Table Function
Parameter | Description |
---|---|
|
Identifier (for example, the NCBI accession number) of the matched (target) sequence |
|
Percentage of the query sequence that identically matches with the database sequence. |
|
Length of the alignment. |
|
Number of base-pair mismatches between the query and the database sequence. |
|
Number of base-pairs with a positive match score between the query and the database sequence. |
|
Number of gaps opened in gapped alignment. |
|
List of offsets where a gap is opened. |
|
The indexes of the portion of the query sequence that is aligned |
|
Translation frame number of the query. |
|
The indexes of the portion of the target sequence that is aligned. |
|
Translation frame number of the target sequence. |
|
Expect value of the alignment. |
score |
Score corresponding to the alignment. |
This table function performs a BLASTP alignment of the given amino acid sequences against the selected portion of the protein database. The database can be selected using a standard SQL select and passed into the function as a cursor. You can also use the standard BLAST parameters that are listed in the following section.
BLASTP_MATCH function returns only the score and expect value of the match. It does not return information about the alignment. The BLASTP_MATCH is typically used when a BLAST search will be followed up with h a more compute intensive alignment, such as the Smith-Waterman alignment or a full FASTA alignment.
The BLASTP_ALIGN function does the BLAST alignment and returns the information about the alignment. The schema of the returned alignment is the same as that of BLASTN_ALIGN.
Syntax
function SYS_BLASTP_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default -1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Parameters
Table 9-14 describes the input parameters for BLASTP_ALIGN; Table 9-15, the output parameters.
Table 9-14 Input Parameters for BLASTP_ALIGN Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BKLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
TRUE or FALSE. If TRUE, the search masks off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting regions, leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence. The default is FALSE. |
|
TRUE or FALSE. If TRUE, you can specify a sequence in upper case characters as the query sequence and denote areas to be filtered out with lower case. This customizes what is filtered from the sequence. The default is FALSE. |
|
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
|
X-dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 25 Specifying 0 invokes default behavior. |
Table 9-15 Output Parameters for BLASTP_ALIGN Table Function
Parameter | Description |
---|---|
|
Identifier (for example, the NCBI accession number) of the matched (target) sequence |
|
Percentage of the query sequence that identically matches with the database sequence. |
|
Length of the alignment. |
|
Number of base-pair mismatches between the query and the database sequence. |
|
Number of base-pairs with a positive match score between the query and the database sequence. |
|
Number of gaps opened in gapped alignment. |
|
List of offsets where a gap is opened. |
|
The indexes of the portion of the query sequence that is aligned. |
|
Translation frame number of the query. |
|
The indexes of the portion of the target sequence that is aligned. |
|
Translation frame number of the target sequence. |
|
Score corresponding to the alignment. |
This table function performs BLAST alignments involving translations of either the query sequence or the database of sequences or both the query sequence and the database of sequences. The available translation options are BLASTX, TBLASTN, and TBLASTX. The schema of the returned alignment is the same as that of BLASTN_ALIGN and BLASTP_ALIGN.
Syntax
function TBLAST_ALIGN ( query_seq CLOB, seqdb_cursor REF CURSOR, subsequence_from NUMBER default 1, subsequence_to NUMBER default 0, translation_type VARCHAR2 default 'BLASTX', genetic_code NUMBER default 1, filter_low_complexity BOOLEAN default false, mask_lower_case BOOLEAN default false, sub_matrix VARCHAR2 default 'BLOSUM62', expect_value NUMBER default 10, open_gap_cost NUMBER default 11, extend_gap_cost NUMBER default 1, word_size NUMBER default 3, x_dropoff NUMBER default 15, final_x_dropoff NUMBER default 25) return table of row ( t_seq_id VARCHAR2, pct_identity NUMBER, alignment_length NUMBER, mismatches NUMBER, positives NUMBER, gap_openings NUMBER, gap_list [Table of NUMBER], q_seq_start NUMBER, q_frame NUMBER, q_seq_end NUMBER, t_seq_start NUMBER, t_seq_end NUMBER, t_frame NUMBER, score NUMBER, expect NUMBER);
Parameters
Table 9-16 describes the input parameters for TBLAST_ALIGN; Table 9-17, the output parameters.
Table 9-16 Input Parameters for TBLAST_ALIGN Table Function
Parameter | Description |
---|---|
|
The query sequence to search. This version of ODM BKLAST accepts bare sequences only. A bare sequence is just lines of sequence data. Blank lines are not allowed in the middle of bare sequence input. |
|
The cursor parameter supplied by the user when calling the function. It should return two columns in its returning row, the sequence identifier and the sequence string. |
|
Start position of a region of the query sequence to be used for the search. The default is 1. |
|
End position of a region of the query sequence to be used for the search. If -1 is specified, the sequence length is taken as |
|
Type of the translation involved. The options are BLASTX, TBLASTN, and TBLASTX. The default is BLASTX. |
|
Used for translating nucleotide sequences to amino acid sequences. |
|
|
|
|
|
Specifies the substitution matrix used to assign a score for aligning any possible pair of residues. The different options are |
|
The statistical significance threshold for reporting matches against database sequences. The default value is 10. Specifying 0 invokes default behavior. |
|
The cost of opening a gap. The default value is 11. Specifying 0 invokes default behavior. |
|
The cost of extending a gap. The default value is 1. Specifying 0 invokes default behavior. |
|
The word size used for dividing the query sequence into subsequences during the search. The default value is 3. Specifying 0 invokes default behavior. |
|
Dropoff for BLAST extensions in bits. The default value is 15. Specifying 0 invokes default behavior. |
|
The final X dropoff value for gapped alignments in bits. The default value is 25. Specifying 0 invokes default behavior. |
Table 9-17 Output Parameters for TBLAST_ALIGN Table Function
Parameter | Description |
---|---|
|
Identifier (for example, the NCBI accession number) of the matched (target) sequence |
|
Percentage of the query sequence that identically matches with the database sequence. |
|
Length of the alignment. |
|
Number of base-pair mismatches between the query and the database sequence. |
|
Number of base-pairs with a positive match score between the query and the database sequence. |
|
Number of gaps opened in gapped alignment. |
|
List of offsets where a gap is opened. |
|
The indexes of the portion of the query sequence that is aligned. |
|
Translation frame number of the query. |
|
The indexes of the portion of the target sequence that is aligned. |
|
Translation frame number of the target sequence. |
|
Score corresponding to the alignment. |
|
Expect value of the alignment. |