Obsolete BLAST 2.0a19 TOPICS

Licensing

The binaries posted here are available to download without restrictions, but they are:

Copyright © 1994-1998 by Warren R. Gish.
All rights reserved.

THIS SOFTWARE IS PROVIDED ``AS IS'' WITHOUT WARRANTY OF ANY KIND, INCLUDING FITNESS FOR ANY PURPOSE WHATSOEVER, EXPRESS OR IMPLIED.

Some of the more prominent features missing from WU BLAST 2.0a19 but supported by AB-BLAST are:

Installation

Users of the licensed version of BLAST 2.0 should refer to the README.html file that accompanies the software distribution for relevant instructions. The following information is specifically for users of the freely available version 2.0a19.

To install WU BLAST 2.0a19, first download the UNIX tar archive of executables appropriate for your computing platform. Scoring matrix files are included in each package, but sequence complexity filters are not. (Several common complexity filters are, however, included with AB-BLAST.) It is best to unpack the archive in a new, empty directory.

The executable programs from the tar archive may be placed in any directory listed in users' PATH environment variable, whether this means adding the new directory to their PATH or moving the executables into an existing directory already listed in their PATH.

Unpacking the tar archive creates a matrix/ subdirectory containing scoring matrix files. Wherever this directory ultimately resides, the BLASTMAT environment variable should be set to point there. If this environment variable is not set, the programs look for scoring matrix files in /usr/ncbi/blast/matrix.

Low-complexity sequence filters or masking programs (e.g., seg, xnu and dust) are included with AB-BLAST. These filter programs are not required for running the search programs, but their use is highly recommended, as they can greatly reduce the amount of uninteresting output, the amount of memory used, and the search time required. You will need to build (compile and link) the filter programs from source code. Whatever directory you install the filter programs in, the BLASTFILTER environment variable should be set to point there. If this environment variable is not set, the programs look for masking programs in /usr/ncbi/blast/filter. Note: unlike more recent NCBI BLAST search programs, the WU BLAST search programs do not employ sequence filtering by default.
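The fallback rule described for BLASTMAT and BLASTFILTER can be sketched in a few lines of Python. This is only an illustration of the lookup order, not the search programs' actual code. The function name resolve_dir is hypothetical, and treating an empty value the same as an unset variable is an assumption.

```python
import os

# Default directories quoted in the text above.
DEFAULT_DIRS = {
    "BLASTMAT": "/usr/ncbi/blast/matrix",
    "BLASTFILTER": "/usr/ncbi/blast/filter",
}


def resolve_dir(var, environ=None):
    """Return the directory named by `var`, or its documented default.

    Assumption: an empty value is treated the same as an unset variable.
    """
    environ = os.environ if environ is None else environ
    value = environ.get(var)
    return value if value else DEFAULT_DIRS[var]
```

Calling resolve_dir("BLASTMAT") before launching a search is a quick way to check which matrix directory would be used in the current environment.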

The databases themselves are missing from the tar archives, too. Once the databases have been downloaded from any of many sources on the Internet, the database files are typically uncompressed and processed into FASTA format. Included in the tar archives are several utility programs for converting textual database files:

The NCBI software Toolbox also contains parsers, including asn2fast, a program that converts both nucleotide and peptide sequences in GenBank ASN.1 format into FASTA format files.

All of the above parsers can read from standard input (sometimes signified by a single dash, "-"), so their input files can be maintained on disk in compressed format and dynamically zcat-ed or gunzip-ed directly into the parsers, thus saving the time and storage required for the uncompressed data. Because a dash is often used to signify the start of each command line option, if a dash is needed to specify standard input for the required input filename argument, some of these programs require that a double-dash (--) be specified before the single-dash. This double-dash signifies the end of the command line options and the start of the required arguments.
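As a rough sketch of this streaming pattern, the Python below decompresses a gzip file on the fly and feeds it to a program's standard input, so the uncompressed data never touches the disk. The helper names stdin_argv and stream_gzip_to are hypothetical, and no real parser's options are assumed: the caller supplies the program name and whether it needs the double-dash.

```python
import gzip
import shutil
import subprocess


def stdin_argv(program, options=(), needs_double_dash=True):
    """Build an argument list that asks `program` to read standard input."""
    argv = [program, *options]
    if needs_double_dash:
        argv.append("--")  # marks the end of the command line options
    argv.append("-")       # a single dash stands for standard input
    return argv


def stream_gzip_to(argv, gz_path, out_path):
    """Decompress gz_path on the fly into argv's stdin; write stdout to out_path.

    Returns the child's exit status. Sending output straight to a file
    avoids a pipe deadlock on large inputs.
    """
    with open(out_path, "wb") as out, gzip.open(gz_path, "rb") as src:
        proc = subprocess.Popen(argv, stdin=subprocess.PIPE, stdout=out)
        try:
            shutil.copyfileobj(src, proc.stdin)
        finally:
            proc.stdin.close()
        return proc.wait()
```

The shell equivalent is a pipeline that sends the output of zcat or gunzip into the parser, with "-- -" as the final arguments where the parser requires it.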

Once the databases are in FASTA format, the setdb and pressdb programs are used to convert them into blastable format. Simple usage instructions for these programs can be obtained by invoking them without command line arguments. When producing a blastable database, each program creates 3 output files whose names are derived from the name of the input FASTA-format file. The 3 output files are given distinct filename extensions and together comprise the blastable database. For nucleotide sequences containing ambiguity codes (e.g., ESTs which often contain many Ns), the FASTA file will be referenced later (if still accessible) by the search programs, to obtain ambiguity codes for matching sequences that contain such codes. More information about the blastable database file formats is available here.
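A small Python sketch of choosing between the two formatting programs follows. It rests on two assumptions that the text above does not state: that pressdb handles nucleotide FASTA files while setdb handles protein files, and that the FASTA filename can be passed as the sole argument. The function name format_command is hypothetical. Check the usage message each program prints when invoked without arguments before relying on this.

```python
def format_command(fasta_path, is_nucleotide):
    """Return an argument list for building a blastable database.

    Assumptions (not confirmed by the text): pressdb is for nucleotide
    sequences, setdb is for protein sequences, and the FASTA file is
    the only required argument.
    """
    program = "pressdb" if is_nucleotide else "setdb"
    return [program, fasta_path]
```

The returned list can be handed to subprocess.run once the programs are on the PATH.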

The blastable database files can be placed anywhere, but the BLASTDB environment variable should point to their directory location. If the BLASTDB environment variable is not set, the programs look for their databases in /usr/ncbi/blast/db and in the current working directory. If the search programs are to find them, nucleotide sequence FASTA files must be located in the same directory as the blastable databases. Sometimes it is more convenient to maintain the FASTA files in a separate directory on another disk partition, with UNIX soft links in the BLASTDB directory pointing to FASTA files stored elsewhere. In addition, on systems where NCBI BLAST will not be in use, blastable databases can be maintained in multiple directories listed in the BLASTDB environment variable, delimiting the directory names with colons just as directory names are delimited in the PATH environment variable.
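The directory lookup described above can be sketched in Python as follows. This illustrates the rule as stated, not the programs' own implementation. The names blastdb_dirs and locate are hypothetical, and two details are assumptions: the order in which the default directory and the working directory are tried, and that only the listed directories are searched when BLASTDB is set.

```python
import os

DEFAULT_DB_DIR = "/usr/ncbi/blast/db"  # default quoted in the text above


def blastdb_dirs(environ=None, cwd=None):
    """Return the directories to search for blastable databases."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    value = environ.get("BLASTDB", "")
    if value:
        # Colon-delimited, like PATH; empty entries are skipped.
        return [d for d in value.split(":") if d]
    # Assumption: the default directory is tried before the working directory.
    return [DEFAULT_DB_DIR, cwd]


def locate(filename, dirs):
    """Return the first existing path for `filename` in `dirs`, else None."""
    for d in dirs:
        candidate = os.path.join(d, filename)
        if os.path.exists(candidate):  # follows soft links to their targets
            return candidate
    return None
```

Because os.path.exists follows soft links, a FASTA file kept on another partition and linked into the BLASTDB directory is found in the same way as a regular file.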

On multi-processor computer systems, the search programs will by default employ as many CPUs as are installed (up to 4 CPUs in the case of BLASTN, unless more are requested), but using more than about 4 CPUs may make inefficient use of the computer. Depending on how many processors your system has, you may want to wrap the search programs in a shell script that sets a lower number of CPUs via the cpus=# (or the deprecated P=#) command line option. Another approach to changing the default number of CPUs follows below, for BLAST managers possessing "root" or "SuperUser" privileges.
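A wrapper of the kind suggested above might look like the Python sketch below, which builds a command line that caps the CPU count with the cpus=# option. The function name blast_command and the limit of 4 are illustrative choices. The positional order (program, database, query file) is an assumption about the command line that should be checked against the program's usage message.

```python
import os


def blast_command(program, database, query, max_cpus=4):
    """Return an argument list that caps the CPU count via cpus=#.

    Assumption: the database and query file are passed positionally,
    in that order, before the options.
    """
    available = os.cpu_count() or 1
    cpus = max(1, min(max_cpus, available))
    return [program, database, query, f"cpus={cpus}"]
```

For example, blast_command("blastp", "nr", "query.fa", max_cpus=2) yields a command that uses at most 2 CPUs. The database and file names here are placeholders.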

For further information, the outdated manual page for the BLAST version 1.4 (ungapped) search programs is still sometimes useful for its description of procedures and parameters that have not changed.

Citing WU-BLAST

Citations or acknowledgements of WU-BLAST usage are greatly appreciated, as are any personal accounts of how the software is being used that you might wish to share. When URLs are acceptable, please cite with:

   Gish, W. (1996-2004) http://blast.wustl.edu

When URLs are not acceptable, please use:

   Gish, W., personal communication.

The WU-BLAST unified search program may also be referred to by the name BLASTA.

In scientific communications, it is important to report the program name, as well as the specific version(s) used. In the case of WU-BLAST or BLASTA, the version is a combination of the "2.0" moniker and the release date. The release date can be found on the first line of output, and it is the first date displayed. For example, consider this introductory line of output:

  BLASTN 2.0MP-WashU [02-Apr-2002] [sol8-ultra-ILP32F64 2002-04-03T01:25:46]

In the above, the software release date is April 2, 2002, whereas the build date of the Solaris 8 UltraSPARC binary executable was April 3rd at 1:25 AM.
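For scripts that record which version produced a result, the introductory line can be split into its parts. The Python sketch below assumes every banner has the same layout as the single example above. Other releases may differ, in which case the function raises an error. The name parse_banner is hypothetical.

```python
import re
from datetime import date, datetime

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Layout assumed from the one example line: program, version,
# [release date], then [platform build-timestamp].
BANNER = re.compile(
    r"^(\S+)\s+(\S+)\s+\[(\d{2})-([A-Za-z]{3})-(\d{4})\]\s+\[(\S+)\s+(\S+)\]$")


def parse_banner(line):
    """Split the first line of output into its named parts."""
    match = BANNER.match(line.strip())
    if match is None:
        raise ValueError("unrecognized banner line: %r" % line)
    program, version, day, month, year, platform, built = match.groups()
    return {
        "program": program,
        "version": version,
        "release_date": date(int(year), MONTHS[month], int(day)),
        "platform": platform,
        "build_time": datetime.fromisoformat(built),
    }
```

Applied to the example line, the release_date field is April 2, 2002, which is the date to report alongside the program name.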

Historical Notes

Historical notes and additional citation information for some earlier versions of NCBI and WU BLAST include:

References

Altschul, SF, and W Gish (1996). Local alignment statistics. ed. R. Doolittle. Methods in Enzymology 266:460-80.

Altschul, SF, and DJ Lipman (1990). Protein database searches for multiple alignments. Proc. Natl. Acad. Sci. USA 87:5509-13.

Altschul, SF, Gish, W, Miller, W, Myers, EW, and DJ Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-10.

Altschul, SF, Madden, TL, Schaffer, AA, Zhang, J, Zhang, Z, Miller, W, and DJ Lipman (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402.

Claverie, JM, and DJ States (1993). Information enhancement methods for large scale sequence analysis. Computers in Chemistry 17:191-201.

Gish, W, and DJ States (1993). Identification of protein coding regions by database similarity search. Nature Genetics 3:266-72.

Hancock, JM, and JS Armstrong (1994). SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci. 10:67-70.

Karlin, S, and SF Altschul (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-8.

Karlin, S, and SF Altschul (1993). Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA 90:5873-7.

Smith, TF, and MS Waterman (1981). Identification of common molecular subsequences. J. Mol. Biol. 147:195-7.

States, DJ, and W Gish (1994). Combined use of sequence similarity and codon bias for coding region identification. J. Comp. Biol. 1:39-50.

Wootton, JC, and S Federhen (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry 17:149-63.

Wootton, JC, and S Federhen (1996). Analysis of compositionally biased regions in sequence databases. ed. R. Doolittle. Methods in Enzymology 266:554-71.

Zhang, Z, Schaffer, AA, Miller, W, Madden, TL, Lipman, DJ, Koonin, EV, and SF Altschul (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res. 26:3986-90.



Copyright © 2008 Warren R. Gish, Saint Louis, Missouri 63141 USA. All rights reserved.