Obsolete BLAST 2.0a19 Description

Obsolete BLAST 2.0a19 TOPICS

Licensing
Installation
Citing WU-BLAST
Historical Notes
References

Licensing

The binaries posted here are available to download without restrictions, but they are:

Copyright © 1994-1998 by Warren R. Gish.
All rights reserved.

THIS SOFTWARE IS PROVIDED ``AS IS'' WITHOUT WARRANTY OF ANY KIND, INCLUDING FITNESS FOR ANY PURPOSE WHATSOEVER, EXPRESS OR IMPLIED.

Some of the more prominent features missing from WU BLAST 2.0a19 but supported by AB-BLAST are:

support for databases of virtually unlimited size — without splitting — as well as the other unique and powerful features of the XDF database format, such as complete identifier indexing and rapid appending/updates;
support for so-called virtual databases composed of multiple real databases that are combined at run time;
significantly improved speed and greatly improved scalability;
64-bit virtual addressing (on supporting platforms);
more efficient 32-bit virtual addressing on many 64-bit platforms;
two-hit BLAST option in all search modes;
many new command line options, which improve the speed, sensitivity and selectivity of the software;
uniform support for large files (>2 GB) by all programs in the suite, for any and all input and output files.

Installation

Users of the licensed version of BLAST 2.0 should refer to the README.html file that accompanies the software distribution for relevant instructions. The following information is specifically for users of the freely available version 2.0a19.

To install WU BLAST 2.0a19, first download the UNIX tar archive of executables appropriate for your computing platform from here. Scoring matrix files are included in each package, but sequence complexity filters are not. (Several common complexity filters are, however, included with AB-BLAST.) Unpack the archive in a new, empty directory.
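As a minimal sketch of the unpacking step, the helper below refuses to unpack into a directory that already has contents. The function name, archive name and destination path are hypothetical, and the sketch assumes a tar that accepts -C (GNU and BSD tar both do).

```shell
# Unpack archive $1 into directory $2, creating $2 if needed.
# Fails without unpacking if $2 already exists and is not empty.
unpack_fresh() {
    archive=$1
    dest=$2
    if [ -d "$dest" ] && [ -n "$(ls -A "$dest")" ]; then
        echo "refusing to unpack: $dest is not empty" >&2
        return 1
    fi
    mkdir -p "$dest" && tar -xf "$archive" -C "$dest"
}

# Example (hypothetical archive name and destination):
#   unpack_fresh blast2.tar /opt/wublast
```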

The executable programs from the tar archive may be placed in any directory listed in users' PATH environment variable, whether this means adding the new directory to their PATH or moving the executables into an existing directory already listed in their PATH.

Unpacking the tar archive creates a matrix/ subdirectory containing scoring matrix files. Wherever this directory ultimately resides, the BLASTMAT environment variable should be set to point there. If BLASTMAT is not set, the programs look for scoring matrix files in /usr/ncbi/blast/matrix.
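For Bourne-style shells, the PATH and BLASTMAT settings described above might look like the following sketch. The install directory /opt/wublast and the names WUBLAST_HOME and blast_matrix_dir are assumptions for illustration; only BLASTMAT and the default path come from the text.

```shell
# Hypothetical install location; substitute the directory where the archive was unpacked.
WUBLAST_HOME="${WUBLAST_HOME:-/opt/wublast}"

# Make the executables findable and point BLASTMAT at the unpacked matrix/ subdirectory.
PATH="$WUBLAST_HOME:$PATH"
BLASTMAT="$WUBLAST_HOME/matrix"
export PATH BLASTMAT

# Report where scoring matrices will be looked for:
# $BLASTMAT when set, otherwise the documented default.
blast_matrix_dir() {
    printf '%s\n' "${BLASTMAT:-/usr/ncbi/blast/matrix}"
}
```

Placing the first five lines in a login profile makes the settings persistent; calling blast_matrix_dir is a quick check of which directory is in effect.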

Low-complexity sequence filters or masking programs (e.g., seg, xnu and dust) are included with AB-BLAST. These filter programs are not required for running the search programs, but their use is highly recommended, as they can greatly reduce the amount of uninteresting output, the memory used, and the search time required. You will need to build (compile and link) the filter programs from source code. Whatever directory you install the filter programs in, the BLASTFILTER environment variable should be set to point there. If BLASTFILTER is not set, the programs look for masking programs in /usr/ncbi/blast/filter. Note: unlike more recent NCBI BLAST search programs, the WU BLAST search programs do not employ sequence filtering by default.
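One way to check a filter installation is sketched below. It lists which of the three filters named above are not present as executables in the filter directory; the function name missing_filters is hypothetical.

```shell
# Directory searched for filter programs: $BLASTFILTER when set,
# otherwise the documented default.
filter_dir="${BLASTFILTER:-/usr/ncbi/blast/filter}"

# Print the name of each filter program that is not an executable
# regular file in directory $1.
missing_filters() {
    for prog in seg xnu dust; do
        if [ ! -f "$1/$prog" ] || [ ! -x "$1/$prog" ]; then
            printf '%s\n' "$prog"
        fi
    done
}

missing_filters "$filter_dir"
```

An empty result means all three programs were found.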

The databases themselves are also not included in the tar archives. Once the databases have been downloaded from any of the many sources on the Internet, the database files are typically uncompressed and processed into FASTA format. Included in the tar archives are several utility programs for converting textual database files:

gb2fasta converts the nucleotide sequences in GenBank flat files into FASTA format.
gt2fasta converts the CDS translations (peptide sequences) in GenBank flat files into FASTA format.
sp2fasta converts EMBL or SWISS-PROT flat files into FASTA format.

The NCBI software Toolbox also contains parsers, including asn2fast, a program that converts both nucleotide and peptide sequences in GenBank ASN.1 format into FASTA format files.

All of the above parsers can read from standard input (sometimes signified by a single dash, "-"), so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed directly into the parsers, thus saving the time and storage required for the uncompressed data. Because a dash is often used to signify the start of each command line option, if a dash is needed to specify standard input for the required input filename argument, some of these programs require that a double-dash (--) be specified before the single-dash. This double-dash signifies the end of the command line options and the start of the required arguments.
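The streaming approach can be sketched as a small helper that decompresses a file on the fly and hands it to a parser on standard input. The helper name to_fasta and the file names are hypothetical, and the sketch assumes the parser writes FASTA to standard output and accepts the double-dash, which the text says only some of the programs require.

```shell
# Decompress $1 on the fly and feed it to parser $2 on standard input.
# "--" ends option parsing, so the lone "-" that follows is taken as the
# input filename argument (meaning standard input), not as an option.
to_fasta() {
    gunzip -c "$1" | "$2" -- -
}

# Example (hypothetical file names; requires gb2fasta on PATH):
#   to_fasta gbpri1.seq.gz gb2fasta > gbpri1.fasta
```

The uncompressed flat file never touches the disk; only the compressed input and the FASTA output are stored.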

Once the databases are in FASTA format, the setdb and pressdb programs are used to convert them into blastable format. Simple usage instructions for these programs can be obtained by invoking them without command line arguments. When producing a blastable database, each program creates 3 output files whose names are derived from the name of the input FASTA-format file. The 3 output files are given distinct filename extensions and together comprise the blastable database. For nucleotide sequences containing ambiguity codes (e.g., ESTs which often contain many Ns), the FASTA file will be referenced later (if still accessible) by the search programs, to obtain ambiguity codes for matching sequences that contain such codes. More information about the blastable database file formats is available here.
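The text does not say which program handles which sequence type. The sketch below assumes the BLAST 1.4 convention, in which setdb formats protein databases and pressdb formats nucleotide databases; confirm this against the usage message each program prints when run without arguments. The function name formatter_for is hypothetical.

```shell
# Print the name of the formatting program for a sequence type.
# Assumption: setdb handles protein, pressdb handles nucleotide.
formatter_for() {
    case "$1" in
        protein)    echo setdb ;;
        nucleotide) echo pressdb ;;
        *)          echo "unknown sequence type: $1" >&2; return 1 ;;
    esac
}

# Example (hypothetical file name; requires the utilities on PATH):
#   "$(formatter_for nucleotide)" est.fasta
```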

The blastable database files can be placed anywhere, but the BLASTDB environment variable should point to their directory location. If the BLASTDB environment variable is not set, the programs look for their databases in /usr/ncbi/blast/db and in the current working directory. If the search programs are to find them, nucleotide sequence FASTA files must be located in the same directory as the blastable databases. Sometimes it is more convenient to maintain the FASTA files in a separate directory on another disk partition, with UNIX soft links in the BLASTDB directory pointing to FASTA files stored elsewhere. In addition, on systems where NCBI BLAST will not be in use, blastable databases can be maintained in multiple directories listed in the BLASTDB environment variable, delimiting the directory names with colons just as directory names are delimited in the PATH environment variable.
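A sketch of the layout described above follows: BLASTDB lists two directories, and a FASTA file stored elsewhere is soft-linked into a database directory. All directory and function names are hypothetical, and the multi-directory form applies only where NCBI BLAST is not in use, as noted.

```shell
# Hypothetical database directories, colon-delimited like PATH.
BLASTDB="/data/blast/db:/data2/blast/db"
export BLASTDB

# List the configured database directories, one per line. When BLASTDB is
# unset, this shows only the documented default directory; the programs
# also search the current working directory in that case.
blastdb_dirs() {
    printf '%s\n' "${BLASTDB:-/usr/ncbi/blast/db}" | tr ':' '\n'
}

# Soft-link FASTA file $1 into database directory $2 under the same name,
# so the search programs find it beside the blastable files.
link_fasta() {
    ln -s "$1" "$2/$(basename "$1")"
}

# Example (hypothetical paths):
#   link_fasta /bigdisk/fasta/est.fasta /data/blast/db
```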

On multi-processor computer systems, the search programs will by default employ as many CPUs as are installed (up to 4 CPUs in the case of BLASTN, unless more are requested), but this may make inefficient use of the computer when more than about 4 CPUs are used. Depending on how many processors are in your box, you may want to wrap the search programs in a shell script that sets a lower number of CPUs via the cpus=# (or the deprecated P=#) command line option. Another approach to changing the default number of CPUs follows below, for BLAST managers possessing "root" or "SuperUser" privileges.
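The wrapper idea can be sketched as a shell function that appends a cpus=# option to whatever arguments the caller supplies. The function name blast_limited, the BLAST_CPUS variable and the default of 2 are assumptions; how the programs resolve a cpus= option given twice is not covered here, so callers should not pass their own.

```shell
# Run search program $1 with the remaining arguments, appending a CPU limit.
# BLAST_CPUS is a hypothetical variable; the default of 2 is arbitrary.
blast_limited() {
    prog=$1
    shift
    "$prog" "$@" "cpus=${BLAST_CPUS:-2}"
}

# Example (hypothetical database and query names):
#   blast_limited blastn nt query.fasta
```

The same body, saved as an executable script ahead of the real programs in PATH, gives every user the lower default without changing how they invoke the search.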

For further information, the outdated manual page for the BLAST version 1.4 (ungapped) search programs is still sometimes useful for its description of procedures and parameters that have not changed.

Citing WU-BLAST

Citations or acknowledgements of WU-BLAST usage are greatly appreciated, as are any personal accounts of how the software is being used that you might wish to share. When URLs are acceptable, please cite with:

   Gish, W. (1996-2004) http://blast.wustl.edu

When URLs are not acceptable, please use:

   Gish, W., personal communication.

The WU-BLAST unified search program may also be referred to by the name BLASTA.

In scientific communications, it is important to report the program name, as well as the specific version(s) used. In the case of WU-BLAST or BLASTA, the version is a combination of the "2.0" moniker and the release date. The release date can be found on the first line of output, and it is the first date displayed. For example, consider this introductory line of output:

  BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]

In the above, the software release date is April 2, 2002, whereas the build date of the Solaris 8 UltraSPARC binary executable was April 3, 2002, at 1:25 AM.
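When recording versions for a methods section, the program name and release date can be pulled from that first line with standard tools. This is a minimal sketch that assumes the banner has the layout shown in the example above; the function names are hypothetical.

```shell
banner='BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]'

# Program name: the first space-delimited word of the banner.
blast_program() {
    printf '%s\n' "${1%% *}"
}

# Release date: the contents of the first [...] group.
blast_release_date() {
    printf '%s\n' "$1" | cut -d'[' -f2 | cut -d']' -f1
}

echo "$(blast_program "$banner") $(blast_release_date "$banner")"
```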

Historical Notes

Historical notes and additional citation information for some earlier versions of NCBI and WU BLAST include:

The first description of the classical ungapped BLAST algorithm was published by Altschul et al. (1990). This paper focuses on BLASTP and BLASTN, and makes mention of TBLASTN.
The NCBI Experimental BLAST Network Service was opened to the public in December 1989, providing Internet access to the latest, parallelized search programs and sequence databases updated on a daily basis. Around the same time, the "nr" (quasi-non-redundant) databases were established (W. Gish, unpublished). The experimental service was ultimately discontinued more than a decade later in March 2000. At the request of NCBI upper management, a report on the experimental service was never published, and the work remains attributable to W. Gish (unpublished). Awareness of the service spread by word-of-mouth, much as is the case with WU BLAST.
BLASTX first appeared in BLAST 1.1 in July 1990, and was later described by Gish and States (1993). The BLAST3 program (Altschul and Lipman, 1990) was also folded into the 1.1 release and parallelized. The use of Poisson statistics, as suggested by Karlin and Altschul (1990) to evaluate the joint probability of multiple HSPs, was also first featured in BLAST 1.1.
BLASTC, a version of BLASTX that considered codon usage information in addition to sequence similarity (States and Gish, 1994), only appeared in the BLAST 1.3 distribution. The BLAST 1.3 distribution was also the last to include the BLAST3 program.
The first version of BLAST to use Karlin and Altschul (1993) "Sum" statistics to evaluate the joint probability of multiple HSPs was BLAST 1.4 (W. Gish, unpublished).
The TBLASTX search mode first appeared in BLAST version 1.4 and remains attributable to W. Gish (unpublished).
The first release of WU BLAST was version 1.4, which was virtually identical to NCBI BLAST 1.4, save for a few bug fixes. The WU BLAST Archives (http://blast.wustl.edu) first appeared on the Internet in 1995, to provide continued support for the work begun at the NCBI, as well as to provide a central location where BLAST-related software, information, and earlier software versions could be obtained.
Starting in late 1994, Stephen Altschul and I engaged in a collaboration to provide support for my conjecture that fixed estimates for λ, K and H, along with Sum statistics, could be practically applied to the evaluation of locally optimal gapped alignment scores. This work eventually appeared in Altschul and Gish (1996) and provides much of the foundation for all later versions of BLAST from Washington University in St. Louis and the NCBI.
The first complete implementation of gapped BLAST (BLASTP, BLASTN, BLASTX, TBLASTN and TBLASTX) with statistical significance estimates (both Poisson and Sum statistics) was put into limited distribution in August 1995, and was publicly released as WU BLAST version 2.0d1 (W. Gish, unpublished), in time for presentation at the Cold Spring Harbor Genome Mapping and Sequencing conference in May 1996.
The NCBI published its BLAST version 2, or Gapped BLAST, including a description of the 2-hit BLAST and PSI-BLAST algorithms, in Altschul et al. (1997), in September 1997.
The NCBI published a description of PHI-BLAST in Zhang et al. (1998).

References

Altschul, SF, and W Gish (1996). Local alignment statistics. ed. R. Doolittle. Methods in Enzymology 266:460-80.

Altschul, SF, and DJ Lipman (1990). Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509-13.

Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Altschul, SF, Madden, TL, Schaffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402.

Claverie, JM, and DJ States (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191-201.

Gish, W, and DJ States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Hancock, JM, and JS Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67-70.

Karlin, S, and SF Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-8.

Karlin, S, and SF Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

Smith, TF, and MS Waterman (1981). Identification of common molecular subsequences. J. Mol. Biol. 147:195-7.

States, DJ, and W Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comp. Biol. 1:39-50.

Wootton, JC, and S Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149-63.

Wootton, JC, and S Federhen (1996). Analysis of compositionally biased regions in sequence databases. ed. R. Doolittle. Methods in Enzymology 266:554-71.

Zhang, Z, Schaffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV, and SF Altschul (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986-90.
