Modification history:
02/05/1998
Posted version 2.0a19

02/04/1998
Fixed a bug Mike Cherry reported that sometimes produced a FATAL error in
TBLASTN (and TBLASTX) on the very last sequence in a nt. database, if that
sequence contains any ambiguity codes.  It's conceivable that this same bug
could cause a segmentation fault under some conditions when examining the
longest sequence in the database.

Small amount of cruft removed.

02/02/1998
Posted version 2.0a18

01/31/1998
The included "pam" program can optionally report floating point (fractional)
values.

01/28/1998
The "Searching" crash problem under Linux might be fixed -- we shall see!

01/16/1998
Scoring matrix files may now contain floating point values.  Scoring of
alignments is still performed using integral values.  Fractional values are
rounded to the nearest integer, e.g. 1.5 is rounded up to 2 and -1.5 is rounded
down to -2.
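
A minimal sketch of the rounding described above.  It assumes that all exact
halves go away from zero, which is consistent with the two examples given but
is not otherwise stated; the function name is hypothetical and this is not
the code used by the programs.

```python
import math

def round_half_away(value: float) -> int:
    """Round to the nearest integer, sending exact halves away from zero."""
    magnitude = math.floor(abs(value) + 0.5)
    return int(math.copysign(magnitude, value))

for raw in (1.5, -1.5, 0.4, -2.6):
    print(raw, "->", round_half_away(raw))
```

Python's built-in round() sends exact halves to the nearest even integer
(round(2.5) is 2), hence the explicit helper.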

01/08/1998
Fixed HSP list truncation procedure when there are more HSPs than hspmax
allows.  In the programs that search more than one strand, HSPs on the minus
strand were sometimes discarded when they were more significant than the HSPs
on the plus strand that were being retained.

12/7/1997
Fixed buffer over-run in gt2fasta.  Fixed empty database bug in setdb.
gb2fasta now parses PID lines, in case input is "GenPept".  Fixed cosmetic
bug in the display of "V=#" value in a WARNING text for DEC Alpha platforms.

11/12/1997
Fixed the accumulation of matches beyond the number reportable, which
consumed unnecessary memory.

11/10/1997
Added tests for maximum achievable score in each context or reading frame.
Searches are not attempted if the cutoff score can not be achieved.
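
The entry does not say how the maximum achievable score is computed.  One
possible upper bound, sketched here under that caveat, sums the best positive
score available to each query residue.  The function names and the toy matrix
are hypothetical.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, int]]

def max_achievable_score(query: str, matrix: Matrix) -> int:
    """Upper bound on the score of any ungapped segment of the query."""
    total = 0
    for residue in query:
        best = max(matrix[residue].values())
        total += max(best, 0)  # a residue that cannot score above 0 adds nothing
    return total

def search_is_worthwhile(query: str, matrix: Matrix, cutoff: int) -> bool:
    """Skip the search when even a perfect match could not reach the cutoff."""
    return max_achievable_score(query, matrix) >= cutoff

# Toy two-letter matrix, for illustration only.
TOY = {"A": {"A": 5, "C": -4}, "C": {"A": -4, "C": 5}}
print(search_is_worthwhile("ACCA", TOY, 25))
```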

Value specified for gapH on the command line was erroneously being plugged
in for gapK -- fixed.


10/30/1997
Added knowledge of the "dust" low-complexity filter to BLASTN, so user can
specify "filter=dust" command line option.  This filter program must still be
installed in the /usr/ncbi/blast/filter directory -- or in whatever directory
is pointed to by the BLASTFILTER environment variable -- just like all other
filters (i.e., seg, nseg, and xnu).

10/30/1997
Top combinations of HSPs are now sorted by their Group when topcombon feature
is used.


10/21/1997
Posted version 2.0a17

Deleted a straggling test left behind from debugging that could cause BLASTN,
TBLASTN, and TBLASTX to abort searches -- "Non-positive score returned from
ExpandX" -- particularly when searching ambiguity-code-containing sequences
like ESTs.


10/15/1997
Posted version 2.0a16

Fixed bug in alignment span detection when comparing gapped vs. ungapped
alignments.


10/14/1997
When unacceptable nt. codes were encountered in the input FASTA file,
pressdb wasn't reporting the proper error.


10/13/1997
Made fixes to POSIX threads support, which may improve threads performance
under Digital UNIX 4.0.

10/11/1997
Speed tweak to BLASTP.  Speed tweak-ette to the other search programs.

10/05/1997
Expanded pressdb error messages.

Added a platform description to the "Build" string in the introductory output
from the search programs -- e.g., "sol2.5-x86" -- and reordered the month,
day, and year in the build date.


9/27/97
Optimized BLASTN a little.  Added double-hit method to BLASTN.

Cleaned up the tabular display of Parameters a little.

Fixed pattern recognition of some string=string command line parameters,
e.g. "nogap" or "nogaps" are now acceptable.


9/23/97
Fixed the behavior of Z parameter in BLASTP.  It was being ignored.


9/22/97
Made the search programs better able to work in some obscure cases with
scoring matrix files that are incompletely specified, in that scores
are not provided for absolutely all acceptable letter pairs.

9/21/97
Posted version 2.0a14.

9/18/97
Sped up BLASTX, TBLASTN and TBLASTX a little.

9/12/97
Fixed error in HSP linked list management that on rare occasions caused crashes
in the code introduced 6/12/97.

In some rare instances, BLASTX was crashing in Solaris qsort, and Purify
reported UMR errors in the Solaris qsort() library function.  Crashing
and UMR errors went away when HeapSort was substituted.  PureAtria staff
say Solaris qsort() is safe, but my experience says otherwise, so I'm going
back to using Old Reliable, HeapSort.

9/11/97
Fixed minor error in Smith-Waterman score test.

Added berror() function for reporting non-fatal ERROR messages, in addition
to the existing WARNING messages and FATAL errors.  Some internal tests that
formerly would have produced FATAL error reports will now simply report
the ERROR and continue execution.  New "-errors" command line option
suppresses ERROR messages, in case they get in someone's way.

Got rid of the annoying copyright notice being sent to /dev/tty.

6/12/97
Eliminated reports of superfluous, inferior alignments.


6/11/97
Modified memfile.c for HP/Convex SPP compatibility.

6/10/97
Posted version 2.0a13
Added "postsw" option for Smith-Waterman algorithm to be applied to pairs of
sequences that will be reported by BLASTP.  The S-W score and alignment, if
different from the 2-d BLAST score and alignment, are used to re-rank the
database matches before output.

Eliminated reports of some superfluous, inferior alignments contained within
longer ones.

Added error checks to all read, write, and seek operations in pressdb
and setdb.

6/9/97
Posted version 2.0a12

5/31/97
Fixed interactions between gapE2/gapS2 and E2/S2 command line parameters.

5/29/97
Speed bump for BLASTP, BLASTX, TBLASTN, and TBLASTX (not BLASTN).

5/22/97
Speed tweak for BLASTP, BLASTX, TBLASTN, and TBLASTX (not BLASTN).

5/15/97
Posted version 2.0a10
Speed tweak.

5/12/97
Posted version 2.0a9
Word-hit statistics gathering is now OFF by default, since it consumes about 2%
of total cpu time and most users never use the results.  Use the -stats option
to turn this feature back on.  (This reverses the usage of the -stats option,
which formerly was used to turn OFF the statistics gathering).

In BLASTP, the full-diagonal search for ungapped alignments is skipped when the
gapped alignment procedure is in effect -- saves a few % cpu time.

Made the number of blank lines output between ungapped and gapped HSP alignments
consistently 1.

Fixed an inconsistency in the mid-lines of BLASTN alignments.  Residue codes
instead of vertical bars (|) were sometimes being displayed when no gaps were
present in the alignment.  The convention is supposed to be that residue codes
appear only when there are one or more gaps in the alignment.

4/14/97
Added nonnegok option for permitting nonnegative expected score cases
to halt without exiting nonzero.

2/25/97
Posted version 2.0a8

2/20/97
Fixed HSP memory management bug that tended to cause crashes after 100% search
completion when the list of database matches needed to be truncated.
Removed HSP memory management bug in HSPTruncate related to fwdptr/revptr.
Removed duplicate free of a KarlinBlk at end of blastn.
Removed memory leak of scoring matrix name info.
Added more timing statistics to the end of output.

2/6/97
Eliminated any reports of exact duplicate gapped alignments when "span"
option is used.

Added -s option for simple sequence identifiers to gb2fasta, gt2fasta,
sp2fasta, and pir2fasta programs.  Added -g option to omit NCBI gi
identifiers in output from gb2fasta and gt2fasta.

1/23/97
Posted version 2.0a7
Fixed GSP (gapped alignment segment pair) consistency check, which worked
inconsistently when -span or -span1 command line options were used.  No effect
on HSP consistency in best P-value calculations, and no effect when span
and span1 options were not used.

12/13/96
Posted version 2.0a6
Fixed a minor file permissions error in setdb and pressdb.

12/04/96
Added "noseqs" option to produce abbreviated output that may still be parseable
by legacy parsers.

12/03/96
Posted version 2.0a5
Added a "compat1.4" option to revert easily to version 1.4-like behavior,
but with relevant bugs fixed.
Improved the distribution of database sequences to the threads.
Tweaked the search progress indicator so it always goes to "100%"
even for databases of less than 100 sequences.

11/27/96
Initial posting of version 2.0a4.

Found and fixed another file addressing bug that could occasionally cause
BLASTP and BLASTN to crash.

11/24/96
Fixed a file addressing error that could yield segmentation faults with
the initial 2.0a3 release, particularly when searching small databases.
Slip-stream revision posted.

11/22/96
Fixed a long-standing, occasional inconsistency in the sum statistics reported
(since version 1.4).

11/19/96
Initial posting of version 2.0a3

11/19/96
When the BLASTDB environment variable has been set, which is a path of
database-containing directories, the current working directory is automatically
appended to the path.  This provides some backward compatibility with previous
versions of BLAST software, which looked in the current working directory by
default.
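
The path handling described above might look like the following sketch.  The
colon separator, the behavior when BLASTDB is unset, and the function name
are assumptions made for illustration.

```python
import os
from typing import List, Mapping, Optional

def database_search_path(env: Optional[Mapping[str, str]] = None,
                         cwd: Optional[str] = None) -> List[str]:
    """Directories to search for databases, in order."""
    env = os.environ if env is None else env
    cwd = os.getcwd() if cwd is None else cwd
    raw = env.get("BLASTDB", "")
    # Assumption: the path is a colon-separated list of directories.
    directories = [d for d in raw.split(":") if d]
    # The current working directory is appended, so it is searched last.
    directories.append(cwd)
    return directories

print(database_search_path({"BLASTDB": "/db/nt:/db/aa"}, cwd="/home/user"))
```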

11/19/96
Incorrect bounding diagonals were often being used to constrain alignments
with database subsequences for display.  This affected the appearance of
the alignments reported by those programs that search nucleotide sequence
databases (BLASTN, TBLASTN, and TBLASTX) -- the programs that buffer database
sequences in pieces for display.  SCORE_ERROR messages would be seen when the
error arose, but the scores reported as "Score = #" and used in the statistics
were not affected.

11/14/96
sp2fasta parses NCBI gi identifiers from the SWISS-PROT 34 flat file.

11/13/96
Decreased the granularity of the threads.

11/12/96
Minor rework of database access routines, to reduce virtual address space
requirements.

11/12/96
Removed two sources of slowness in BLASTN 2.0 relative to version 1.4.  First,
a high default value of 0.5 was being used for E2, which is 10-fold higher than
the default value used in BLASTN version 1.4.  Worst case, this could slow the
program down by a factor of 10.  Second, the default word length W has been
increased to 11, restoring it to the same default value used by BLASTN 1.4.
While these changes reduce the sensitivity of the program, they make it easier
to compare the relative performance of versions 1.4 and 2 directly.

11/12/96
Fixed a bug in sequence numbering (in BLAST version 2.0 ONLY) that caused the
right-side coordinate numbers to be in error by 2 nucleotide positions in
alignments of translated sequence.  This bug could affect both the Query and
Subject coordinate numbering, but only on the right side, not the coordinates
displayed immediately following the "Query:" and "Sbjct:" strings.
Coordinates were only wrong when the alignment contained one or more gaps;
and the bug only affected the numbering of sequences that had undergone
translation prior to being compared -- e.g., only the query sequence in a
BLASTX search.

10/29/96
Fixed the display of gapped alignments involving long sequences.  With
coordinate numbers greater than 5 digits in length, the alignments were
skewed to the right.

10/28/96
Sped up the gapping version of BLASTN and verified that it works properly
when wordlength W is varied.  SCORE_ERROR bug/feature (sometimes seen
with database sequences that contain ambiguity codes) is now history.
Increased BLASTN's default value for W from 10 to 11, so it is the same default
value used by BLASTN version 1.4, to facilitate and equalize the inevitable
comparisons to be made between the two versions.  For additional speed, W can
now be increased up to 32, albeit at a significant decrease in sensitivity
and increase in memory use; the time saved during the search can also be lost
in setting up for the search with long word lengths.

10/27/96
Implemented gapK, gapL and gapH command line options to enable the user
to manually set values for the Karlin-Altschul statistics' K, lambda and H
parameters used in evaluating the significance of gapped alignment scores.
The units of gapL and gapH are nats/score and nats/alignment position,
respectively.  (1 nat ~= 1.443 bits; 1 bit ~= 0.693 nats)  For any
of the 3 parameters' values that are not set on the command line,
their default values will be obtained from precomputed tables as before.
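
For reference, the unit conversion quoted above can be sketched as follows
(helper names are hypothetical):

```python
import math

LN2 = math.log(2.0)

def nats_to_bits(nats: float) -> float:
    """Convert a quantity expressed in nats to bits."""
    return nats / LN2

def bits_to_nats(bits: float) -> float:
    """Convert a quantity expressed in bits to nats."""
    return bits * LN2

print(round(nats_to_bits(1.0), 3))  # 1.443
print(round(bits_to_nats(1.0), 3))  # 0.693
```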

10/20/96
Added -mmio option to turn off memory-mapped I/O in all of the *BLAST*
programs.  For some users, this means the programs may coexist better with
other programs or with other users on a shared system (e.g., on a system that
is not a dedicated blast server).  As a part of using this option, consumption
of virtual memory address space is also reduced, which is becoming increasingly
important as database files grow in size; some operating systems or system
administrators will not necessarily allow per-process memory needs to increase
concordantly; but frequently the shell's "limit" command can be used to
increase "memorysize" and "datasize" limits, rather than resorting to turning
off memory-mapped I/O.  The potential for a problem arises most often with
nucleotide sequence database files, when the original FASTA-format file is
available.  When holding all of the nt. sequences of GenBank, a single FASTA
file is currently about 1 GB in size.  Memory-mapped I/O is still used by all
of the programs by default, as it is faster and doesn't seem to be a problem
for most users.

10/18/96
Added Lambda, K, H entries for gapped alignments with BLOSUM80 scoring matrix.
Precomputed values exist for Q=7, 5<=R<=7; Q=8, 4<=R<=8; Q=9, 3<=R<=9;
Q=10, 2<=R<=10; Q=11, 2<=R<=11; Q=12, 2<=R<=12.

9/17/96
Fixed an anomaly that arose at low frequency with the gapped blast heuristic.

9/10/96
Changed blast sort routine to avoid possible arithmetic overflow on some
platforms (e.g., Solaris for x86).

9/3/96
Brought all genetic codes into synchrony with the NCBI Version 3.3.

7/9/96
Fixed crashing of pressdb when the FASTA input file was zero-length.

5/9/96
Added an "identity" scoring matrix for BLASTN searches.  It is not perfect,
though: it ascribes a penalty of only -10000 to mismatches.  It's possible then
to have one mismatch every 10 KB or so and still achieve a positive score.
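
A small arithmetic sketch of that caveat.  The match reward of +1 is an
assumption inferred from the "one mismatch every 10 KB" remark; only the
-10000 penalty is stated in the entry.

```python
MATCH = 1           # assumed reward per identical pair (not stated above)
MISMATCH = -10000   # mismatch penalty quoted in the entry

def identity_score(matches: int, mismatches: int) -> int:
    """Score of an ungapped alignment under the assumed identity matrix."""
    return matches * MATCH + mismatches * MISMATCH

print(identity_score(10001, 1))  # still positive despite one mismatch
print(identity_score(9999, 1))   # negative
```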

4/29/96
Fixed statistical calculation in the case of multiple consistent HSPs and sum
statistics.  When r consistent alignments were combined, the p-values computed
were too low by a factor of about r!.

2/13/96
Added "Edegrade" command line parameter for regulating
the quality of HSP combinations reported per database sequence.

11/4/95
Fixed a bug in the parsing of sequence identifiers that could yield incorrectly
justified text in the initial, one-line summary section of blast program
output.  When this bug arose, there were 25 columns of white space at the
beginning of each line.

11/3/95
Updated the list of built-in genetic codes in blast/blast/gcode.h using the
latest NCBI Toolbox ASN.1 data (toolbox/data/gc.prt).

10/26/95
Fixed a multiprocessing bug in the blast programs that could arise when
searching small databases (<500 sequences).

10/3/95
Added support for NCBI (Wootton & Federhen) "nseg" program on the BLASTN
command line, using "-filter seg" option.

9/27/95
Added "-WashU" tag to the program version numbers, to ensure there is no
mistaking the WashU distribution of these programs for the NCBI distribution.

9/26/95
Fixed a long-standing bug in pressdb regarding which sequences are tagged
as having "ambiguous" nucleotide codes. Thanks to Colin Watanabe at Genentech
for pointing this out.

9/18/95
The PRESSDB program (pressdb.c) can now append sequences to an existing
BLAST database, using the -a option.  (The SETDB program has not been
so modified yet).

8/22/95
The file locking described on 6/7/95 has been disabled at least
temporarily because it is not functioning in the intended manner
with files that reside on NFS-mounted partitions.

8/14/95
gb2fasta now parses NCBI "gi" identifiers from the GenBank flat files.

6/7/95
See note on 8/22/95!
Database file locking has been added to the BLAST search programs and to the
database maintenance programs setdb and pressdb, to eliminate (or optionally
reduce) the opportunity for collisions between database search and database
maintenance activities.  Previously, a setdb or pressdb invocation would cause
active BLAST searches of the same database to fail.  File locking now prevents
the blastable database files from being modified by setdb/pressdb until they
are no longer in use by a search program.  This doesn't necessarily come
without some risk.  With strict file locking in force (the default), deadlock
or near-deadlock may now be a concern within a production environment, as
multiple simultaneous BLAST search production lines involving one database
can effectively block setdb or pressdb forever -- unless all production lines
happen to finish their searches at the same time.   Having all production
lines finish at virtually the same time may be an infrequent event if more than
just a couple are running.  This new situation seems more desirable, though,
than not using file locks and unwittingly allowing setdb and pressdb to blow
away databases out from under any searches.  As an aid to diagnosing deadlock
situations should they arise, when blocked, setdb and pressdb report their
blocked status every 60 seconds.  If deadlock is a real problem, one can revert
to the former, ungoverned situation by completely disabling file locking with
the new -l option to the setdb/pressdb programs.  Significant file lock
protection can still be obtained, though -- and without the risk of deadlock --
by using the -b option to setdb/pressdb instead of completely disabling it with
-l.  The -b option simply blocks any subsequently invoked BLAST searches until
the current setdb/pressdb operation is finished; however, any search that
happened to be in progress when setdb/pressdb was invoked will get trashed.
Through the use of locks, it is possible to update databases that are actively
being searched or that reside on-line in a production area, without the need
for off-line, ancillary working storage equivalent to a full copy of the
database.  N.B. One area not addressed by the present file locking is that of
the FASTA-format nt. sequence file accessed by BLASTN, TBLASTN, and TBLASTX,
which still causes problems if updated in the middle of a search.

6/1/95
Fixed a long-standing deadlock problem in the Solaris multithreaded
executables (and more recently the OSF/1 executables).

5/28/95
Removed the link between X & S that existed in blastapp/lib/context.c.

5/24/95
Threads support (parallel processing) added for DEC OSF/1 3.0 (Digital UNIX).

5/20/95
Switched to using Robinson & Robinson (PNAS 1991) amino acid residue
frequencies.
Fixed a minor slowness problem in BLASTN, TBLASTN, and TBLASTX (all of the
programs that would access the FASTA-format database file, doing so more often
than necessary).
Changed the name of the recently added "pgsper" command line option to the
simpler name "progress".  It's now described in the documentation file,
blast.1, too.

4/26/95
Added "-pgsper #" command line option to adjust the time-out period
in progress messages.  Alarm clock errors when using Solaris threads
prompted the creation of this parameter.  To avoid any possibility of
the alarm clock error, set a time-out of 0.

Changed basename() to misc_basename() for Linux compatibility.

3/30/95
Made memory management a little more flexible and robust.  V & B command
line options are supported in the ASN.1 form of the output now.
Made changes for VMS compatibility kindly suggested by Scott Rose (GCG,
Madison, WI).

3/8/95
pressdb and setdb now parse arbitrarily large FASTA input databases,
expanding their memory buffers as much as necessary.  No more need to modify
ENTRY_MAX.

3/7/95
I lied on 2/1/95.  Solaris threads support promises to be robust now.
Famous last words.

2/13/95
The dfa library was consolidated into the gish library.

2/1/95
Too optimistic on 1/24/95 -- the Solaris threads/alarm problem was not fixed
then.  It truly seems to be fixed now.  Also, fixed a bug in BLASTN's
calculation of the Karlin-Altschul K value.  Plus some slight performance
improvements to BLASTN, TBLASTN and TBLASTX, related to the FASTA file
access;  because of this improvement, BLASTN is set to use up to 4 processors
by default instead of the previous default of 3.

1/24/95
Fixed (for the last time?!) the interaction between Solaris threads
and SIGALRM signals in the "gish" library.

12/19/94
Fixed a multiprocessing bug in all of the programs.  The bug would often
produce crashes (segmentation faults) when searching tiny databases.

hsp_max is now used to truncate HSP lists _after_ statistical significance
estimates have been made and after the list has been sorted for output.


12/16/94
Fixed handling of gap characters in the query sequence by blastx, tblastn,
and tblastx.

12/15/94
blastp was stripping gap characters (-) from the query sequence.  Fixed.

10/16/94
Fixed a severe bug in the support for multiprocessing under Solaris 2.
Some of the code involved in this bug fix is in the "gish" library.
Program version numbers are unchanged by this fix; but the code release
date displayed in the programs' introductory output is updated to today's date.

10/6/94
First "final copy" release of BLAST 1.4 software.

10/4/94
Changed "-overlap", "-overlap1", and "-overlap2" command line option names
to "-span", "-span1", and "-span2", respectively.  "-span2" is the default.

9/30/94
I'm now employed by the Department of Genetics, Washington University School of
Medicine, St. Louis, MO 63108.

9/3/93
Fixed bug in gb2fasta's concatenation of long definitions.

8/8/93
Added -qoffset option to BLASTP, BLASTX, TBLASTN, and BLASTN, to permit
segments of long sequences to be used as queries and still have their residues
numbered correctly in alignments.

7/28/93
Changed the format of substitution matrix files read by BLASTP, BLASTX, TBLASTN
and BLAST3.  Substitution scores in the matrix files can now properly have
non-integral values.  The blast programs still do their scoring using integral
data types.  Upon being read by the blast programs, each score value is rounded
to the nearest integer.  Matrices in the new format are generated by the pam
program.

Fixed the display of query sequence segments in BLASTX when its -codoninfo
option is invoked.


7/7/93
Prompted by Erik Sonnhammer, a "-overlap2" command line option (also available
as simply "-over2") was added to make the criteria for HSP overlap detection
tighter.  This option has a positive effect on the number of HSPs reported
(fewer of them will satisfy the overlap2 criteria) for sequences that contain
internal repeats, but will have a negative effect on their associated
statistics.  The additionally reported HSPs may have Poisson statistics
inappropriately applied, because the HSPs may be incompatible with others
in the same global alignment and hence can not be considered as independent
events.

For query sequences too short to satisfy the cutoffs or expectation thresholds,
the minimum acceptable expect values that were reported by BLASTP, BLASTN, and
TBLASTN were incorrect; now fixed.

7/2/93
Changed the way the cutoff score, S, and expectation cutoff, E, are reported.
All output is now filtered based on its estimated statistical significance (E
value), rather than using cutoff scores directly.

6/22/93
Fixed bug in consistp.c's implementation of R(i,3) found by Phil Green.
Followed another suggestion of Phil Green's for making Poisson probability
calculations more efficient.

6/21/93
Fixed bug in the calculation of "consistent N counts" for those HSPs found
on minus strands in BLASTN, BLASTX, and TBLASTN.  Plus strand hit counts were
not affected.

Pressdb on 64-bit platforms now produces databases that are readable on
all platforms.


6/16/93
Fixed a conflict between static and global variables in bldaa.c and bldxa.c.
This produced a bug in the blast software under DEC Alpha OSF/1.

6/9/93
Added "-gapdecayrate" parameter (default=0.5), as suggested by Phil Green
(Washington University, St. Louis).  This parameter defines a geometric
progression used to adjust Poisson probabilities upward, to account for the
fact that many values for the N parameter in Poisson P(N) are considered when
choosing the "best" alignments.  If r is the decay rate (0 < r < 1) for the
progression and n is the number of segments under consideration, then the
number of gaps is n-1 and the Poisson probabilities will be _divided_ by the
quantity:

                     n-1
              (1-r) r

For n=1 (one HSP) and the default r=0.5, the adjustment is by a factor
of 1/(1-0.5) = 2.
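
The adjustment above can be sketched as follows.  The function names are
hypothetical, and capping the adjusted probability at 1 is an assumption not
stated in the entry.

```python
def gap_decay_divisor(n_segments: int, decay_rate: float = 0.5) -> float:
    """Return (1 - r) * r**(n - 1) for n segments and decay rate r."""
    if n_segments < 1:
        raise ValueError("need at least one segment")
    if not 0.0 < decay_rate < 1.0:
        raise ValueError("decay rate must lie strictly between 0 and 1")
    return (1.0 - decay_rate) * decay_rate ** (n_segments - 1)

def adjusted_probability(p_value: float, n_segments: int,
                         decay_rate: float = 0.5) -> float:
    """Divide a Poisson probability by the divisor (capped at 1: assumption)."""
    return min(1.0, p_value / gap_decay_divisor(n_segments, decay_rate))

for n in (1, 2, 3):
    print(n, gap_decay_divisor(n), adjusted_probability(0.01, n))
```

Summed over n = 1, 2, 3, ... the divisors form a geometric series that adds
up to 1, so the penalty is spread across all of the values of N considered.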


Fixed a bug in lib/consistp.c that produced undetected overflows in factorial
calculations.  This was occasionally problematic in TBLASTN queries with hits
against extremely long database sequences.


5/9/93
In TBLASTN, fixed discrepancies in alignments when a database sequence
contained one or more ambiguity (non-ACGT) codes.  Previously, the original
FASTA format database sequence was only examined at the end of the search; now
it is examined during the search, so that it is known up front what the real
alignment score and extent of alignment is.

The HSP cutoff score in TBLASTN is now S2.  Previously, there had to be at
least one match scoring at least as high as S, after which the database
sequence was re-scanned using a cutoff of S2.  Now each database sequence is
scanned only once, using the lower cutoff.  Better sensitivity results for
short exons.  Something not done now, however, is to scan the entire diagonal
on which an HSP is found.


5/8/93
Fixed severe bug in BLASTN.  Word hits on the plus- and minus- strands were
being managed in a single pool, rather than separate pools.  Consequence:  hits
on one strand could obscure hits on the other strand.  In typical use, this
would rarely cause a problem because of the improbably long wordlength used by
BLASTN (W=12) and the requirement for the word hits to appear in a particular
order.  This bug was present since BLASTN's inception.

In BLASTN, fixed discrepancies in alignments when a database sequence contained
one or more ambiguity (non-ACGT) codes.  Previously, the original FASTA format
database sequence was only examined at the end of the search; now it is
examined during the search, so that it is known up front what the real
alignment score and extent of alignment is.

5/6/93
Fixed a bug introduced to BLASTN on 5/4/93, wherein the first residue in the
complementary strand (i.e., the complementary residue to the last residue on
the "plus" strand) was not initialized.  This bug would reveal itself iff the
query contained one or more non-ACGT codes and the first residue on the
complementary strand should have continued a match with a database sequence.

Tweaked the default value of E2 upward from 0.1 to 0.15, in reaction to the
bug-fix on 5/5/93 which had raised the value of S2 calculated from E2.


5/5/93
Stupid bug fixed in all blast programs.  The units that had been assumed for
the Karlin-Altschul H statistic in the function stolen() were "nats per
position", whereas the karlin() function was calculating H in units of "bits
per position".  The karlin() function was modified to calculate H in nats, and
all equations that were functions of H and had been (correctly) assuming H was
in units of bits were modified to account for the change to nats.  H is still
reported in units of bits, because of the automated parsers in the world.

The consequences of this error were (1) that the expected length estimated
for an alignment of any particular score was too short by a factor of log(2);
and (2) the probability estimates reported by the programs were often higher
(lower in statistical significance) than they should have been.
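
The factor in consequence (1) can be checked numerically.  The sketch assumes
the standard Karlin-Altschul relation, expected length = lambda * S / H with
H in nats per position; the parameter values and the function name are
hypothetical.

```python
import math

def expected_length(score: float, lam: float, h_nats: float) -> float:
    """Expected alignment length: normalized score (nats) over H (nats/position)."""
    return lam * score / h_nats

lam, h_nats, score = 0.32, 0.40, 100.0   # hypothetical values
h_bits = h_nats / math.log(2)            # the same H expressed in bits

correct = expected_length(score, lam, h_nats)
buggy = expected_length(score, lam, h_bits)  # bits passed where nats are expected

print(buggy / correct)  # about 0.693, i.e. log(2)
```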


5/4/93
In BLASTN, ambiguous nucleotides in the query sequence are handled consistently
throughout the program as mismatching all other letters, so that, e.g., strings
of N's can be used to mask a query sequence.  In addition, gap letters
(hyphens) in the query sequence will never appear in an alignment (although
they may appear in the database sequence half of an alignment).  Ambiguity
codes in the database sequences (only) can still lead to discrepancies between
the scores obtained during the search and the scores reported after the
search.

4/23/93
Recently, in all of the blast programs, a "consistent" N parameter was used in
the Poisson statistics, to reflect the number of HSPs likely to be consistent
with one another in the same gapped alignment.  Now, all of the blast programs
build upon this by using another enhancement of Stephen Altschul's, which is to
adjust the Poisson probabilities downwards (making them more significant) to
account for the consistency requirement.  There is no effect on single-HSP
probabilities.  Some reordering of the database sequences will be observed in
the output, with multiple-hit cases often moving up a few notches relative to
the single-hit cases.

With the consistency-adjusted Poisson P-values, sensitivity is expected to be
marginally improved, being practically confined to matches which would anyway
come close to satisfying the statistical significance threshold.  If the
threshold is set at a point within or just above background, it will be more
common to see the new program report false positives than the previous
version.  Improved sensitivity will also be noticed more often with longer
sequences, which provide greater opportunity to accumulate multiple hits with a
single database sequence.

The consistency feature (which includes both the consistent N and consistent
Poisson statistics) can be turned off with the "-consistency" command line
option.

The statistics of consistent HSPs is discussed by Karlin and Altschul in a
manuscript recently submitted to Proc. Natl. Acad. Sci. USA.


4/6/93
HSP == high-scoring segment pair, the unit of BLAST output

In all of the BLAST programs, the Poisson event count (or the N parameter used
in the Poisson statistics) assigned to each HSP is now estimated more
accurately, using positional information as well as scores.  A simple midpoint
rule of Stephen Altschul's design is used to estimate the number of HSPs that
would be consistent with each other in the same gapped alignment.  Let (x,y)
represent the location in 2-dimensional space of the midpoint of an HSP.  In a
"consistent" set of HSPs, if the HSPs are sorted in increasing order of their x
coordinates, then the y coordinates of the sorted list also produce a strictly
increasing sequence.  For any given HSP, the maximum number of other HSPs that
can be made consistent with it (plus 1 for the HSP under consideration) becomes
the Poisson N parameter.  The effect of this change is to reduce the number
of false positives reported (improved selectivity), which sets the stage for
the following...

In BLASTP and TBLASTN, a much lower cutoff score (S2 instead of S) for
reporting HSPs is used in conjunction with the consistent event count.  HSPs
are filtered from the output based on their statistical significance as
estimated using Poisson statistics.  Due to Altschul's consistency rule, a
lower cutoff score can be used without introducing too much extra noise in the
output, while providing increased sensitivity in detecting homologs in the
presence of insertion/deletion errors and mutations.  This change has not yet
been documented in the blast manual page, and the values of S2 and E2 (E2
defined to be the number of chance matches expected when comparing two random
sequences each 300 amino acids in length) can not currently be modified from
their default values through the NCBI BLAST E-mail Service.
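
The midpoint consistency rule described above can be sketched as a
longest-chain computation.  This is an illustrative reading of the rule, not
the code used by the programs; it additionally assumes that x coordinates
must also increase strictly, and the function name is hypothetical.

```python
from typing import List, Tuple

def consistent_counts(midpoints: List[Tuple[float, float]]) -> List[int]:
    """For each HSP midpoint (x, y), size of the largest consistent set holding it."""
    order = sorted(range(len(midpoints)), key=lambda i: midpoints[i])
    pts = [midpoints[i] for i in order]
    n = len(pts)

    # ending[k]: longest chain, strictly increasing in x and y, ending at pts[k]
    ending = [1] * n
    for k in range(n):
        for j in range(k):
            if pts[j][0] < pts[k][0] and pts[j][1] < pts[k][1]:
                ending[k] = max(ending[k], ending[j] + 1)

    # starting[k]: longest such chain starting at pts[k]
    starting = [1] * n
    for k in range(n - 1, -1, -1):
        for j in range(k + 1, n):
            if pts[k][0] < pts[j][0] and pts[k][1] < pts[j][1]:
                starting[k] = max(starting[k], starting[j] + 1)

    counts = [0] * n
    for k, original_index in enumerate(order):
        counts[original_index] = ending[k] + starting[k] - 1
    return counts

print(consistent_counts([(10, 10), (20, 25), (30, 15), (40, 40), (5, 50)]))
```

In the example, the midpoints (20, 25) and (30, 15) cannot share a set, but
each fits between (10, 10) and (40, 40), so those four HSPs get N = 3; the
midpoint (5, 50) is consistent with nothing else and gets N = 1.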

With previous versions of BLASTP and TBLASTN, a database sequence had to
produce at least one segment (HSP) scoring at least as high as the cutoff
score, S, in order to be reported.  And if this high threshold was met, the
database sequence was scanned a second time using a lower cutoff, S2.  This
repeat scanning no longer occurs--all database sequences are scanned using the
lower cutoff.  The former cutoff score parameter, S, and expect parameter, E,
now establish a threshold of statistical significance that must be satisfied by
the Poisson P-values of the HSPs regardless of their individual scores.  The
evaluation of HSPs works like this:  if a single database sequence yields one
or more HSPs each scoring S2 or higher with the query, the list of HSPs is
first sorted by score just as before; consistent event counts are then
assigned; Poisson probabilities are calculated; and finally the list is
truncated after the last HSP having a Poisson P-value that satisfies the S or E
significance threshold.  If no Poisson P-values satisfy the threshold, then
the whole list is thrown away and none of the HSPs is reported.  S might be
thought of as the score that must be achieved by an HSP observed in isolation
(Poisson event count = 1) for it to be reported.
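
The truncation step described above can be sketched in Python as follows.
This is only an illustration:  the function name and the (score, p_value)
tuples are hypothetical, and the Poisson P-values are assumed to have been
computed already.

```python
def truncate_hsps(hsps, p_threshold):
    """hsps: list of (score, p_value) tuples for one database sequence.

    Sort by decreasing score, then keep everything up to and including
    the last HSP whose P-value satisfies the threshold.  If no HSP
    satisfies it, the whole list is discarded.
    """
    ordered = sorted(hsps, key=lambda h: h[0], reverse=True)
    last = -1  # index of the last HSP that satisfies the threshold
    for i, (_score, p_value) in enumerate(ordered):
        if p_value <= p_threshold:
            last = i
    return ordered[:last + 1]
```

Note that in this sketch an HSP ranked above the last satisfying HSP is
retained even if its own P-value does not meet the threshold.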

While use of a lower cutoff score is the default for BLASTP and TBLASTN, a
similar low cutoff has been made an option for BLASTX, which may become the
future default.  It is presently only an option because it is feared that some
automated parsers of BLASTX output might break if the lower cutoff method were
suddenly instituted as the default.  To invoke the option in BLASTX, specify a
value for either E2 or S2 on the BLASTX command line.  E2 is the number of HSPs
expected to be observed by chance when comparing a random sequence 100 codons
in length against another random sequence 300 amino acids in length.  A
suggested starting choice for E2 is 0.1.  This change to BLASTX has not yet
been documented in the blast manual page, and the option is also not presently
selectable through the NCBI BLAST E-mail Service.
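
The relationship between an expectation such as E2 and a cutoff score such as
S2 can be sketched with the standard Karlin-Altschul formula
E = K*m*n*exp(-Lambda*S).  The Python below is an illustration only:  the
function names are hypothetical, the K and Lambda values are placeholders
rather than values taken from BLAST, and it is an assumption that BLASTX
derives S2 from E2 in exactly this way.

```python
import math

def expected_hsps(score, k, lam, m, n):
    """Expected number of chance HSPs scoring at least `score` when
    comparing random sequences of lengths m and n."""
    return k * m * n * math.exp(-lam * score)

def cutoff_for_expect(expect, k, lam, m, n):
    """Smallest integer score whose expectation does not exceed `expect`."""
    return math.ceil(math.log(k * m * n / expect) / lam)

# Placeholder Karlin parameters (assumed, for illustration only).
K, LAMBDA = 0.134, 0.318
# E2 = 0.1 for 100 codons compared against 300 amino acids.
s2 = cutoff_for_expect(0.1, K, LAMBDA, 100, 300)
```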

A lower cutoff was not introduced to BLASTN, because the sensitivity of this
program with its fixed wordlength W=12 is low.  BLAST3 has always used a low
cutoff.

Symmetric multiprocessing can now be employed by the BLAST programs under
SunSoft's Solaris 2.2 operating system, as well as the previous Silicon
Graphics' IRIX operating system.  The code has only been tested under a beta
release of Solaris 2.2.  Code is also included to putatively use threads in an
OSF/1 environment such as Digital's OSF/1 on the Alpha AXP platform; however,
it has not been possible to test this code.

Many more enhancements in the software are included, not all of which are
documented yet or bundled here--e.g., support for the low-compositional
complexity SEG filter of Wootton and Federhen (wootton@ncbi.nlm.nih.gov) and
the short-periodicity repeat XNU filter of Claverie and States
(jmc@ncbi.nlm.nih.gov).  BLAST can also optionally use codon bias information
read from *.cdi files (States and Gish, manuscript submitted).  The interfaces
to these features are not well developed, subject to change, and are presently
provided "as is" in an effort to expedite moving the earlier-mentioned
improvements into users' hands.



3/25/93
The default neighborhood word score threshold (T parameter) was raised a notch
in TBLASTN only, to obtain a roughly compensatory increase in speed for the
performance hit that was incurred in the switch to using the new default
BLOSUM62 matrix on 3/19/93.

3/19/93
Changed the default substitution matrix used by BLASTP, BLASTX, TBLASTN and
BLAST3 from PAM120 to BLOSUM62.  Speed declines by about 30-40% as a result.

3/5/93
Changed the format of the sequence identifiers output by the programs
gb2fasta, gt2fasta, pir2fasta, and sp2fasta.  LOCUS and ACCESSION identifiers
are now included.

9/19/91
Removed one last dependency of the software on the alphabetical case
of residues in the FASTA databases.  This change was localized to one
line in blastn.c.

9/20/91
Better compatibility with Cray UNICOS (version 7.0)

9/23/91
Marginal improvement in speed of BLASTP and TBLASTN (re: zero-ing of diagonal
hit structures in search_aa()), with a concomitant correction to the
hit statistics reported by these programs.  Only a minor change was made
with respect to BLAST3, but since all three of these programs include
the same searcha.inc file, the version number on BLAST3 was bumped up one.

9/25/91
Improved reporting of individual HSP statistics (including the number of bits
of information associated with the alignment scores), and a more consistent
report style across all blast programs.

9/27/91
BLASTN is now rigid in its interpretation of matching/mismatching.
Residues must be A, C, G, or T(U) to match with any other residue.
And T now matches U.  There is no concept of a partial match with
BLASTN.  For example, R (purine) does not half-match with a G or A,
but rather is scored as a complete MISMATCH.
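
The matching rule can be sketched in Python as follows.  The function name
and lookup table are hypothetical and are not taken from blastn.c; folding
case with upper() is likewise an assumption.

```python
# Only the four unambiguous nucleotides can match; U is treated as T.
_CANONICAL = {"A": "A", "C": "C", "G": "G", "T": "T", "U": "T"}

def blastn_match(a, b):
    """Return True only if both residues are unambiguous and identical.

    Any ambiguity code (R, Y, N, ...) is a complete mismatch, even
    against itself; there is no partial match.
    """
    ca = _CANONICAL.get(a.upper())
    cb = _CANONICAL.get(b.upper())
    return ca is not None and ca == cb
```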

The blast.1 manual page is better.

10/4/91
Hits on opposite strands of a query or database sequence are now considered
to be distinguishable events, and so are counted separately in the Poisson
statistics calculations.

The default value for E used by BLASTP, BLASTN, BLASTX, and TBLASTN
has been reduced from 25 down to 10, to avoid reporting quite so many
hits which are statistically insignificant under the random sequence model.
The experienced user may well want to routinely use an even lower value
for E, e.g. E=1 or E=2.

10/23/91
Fixed frame reference bug in blastx.print_parms.

11/11/91
Neglected to initialize the pts[] array to NULL pointers in blast3.c.

11/13/91
The mode parameter of mfile.mfil_open() was not being passed to fopen()
when USE_SHM was undefined.

12/11/91
Fixed bug in blast3.print_p which arose if USE_MPROC was _not_ defined
and the database was not resident in shared memory.

Fixed semaphore SETVAL bug in shmutil.c and minor bug in memfile.c.

12/18/91
Improved signal handling in multiprocessing situations.

12/23/91
Improved command line parsing.  New -overlap option added to all blast
programs to turn off HSP overlap detection and removal.

12/24/91
Fixed filesize bug in shmutil.c.  Only applicable to users of shared memory.

12/29/91
Fixed bug in blastx.c and others, in vicinity of isspace() macro usage.

12/30/91
Added sp2fasta utility for converting SWISS-PROT text format into FASTA format.

12/31/91
In searchn.inc, which is used by BLASTN, the strand (frame) of each HSP was
not being set.

1/2/92
Fixed severe multiprocessing bug in TBLASTN--has no effect on uniprocessing.

1/6/92
Only the frequencies of occurrence of unambiguous letters (non-X for protein
and non-N for nucleotide sequences) are used to calculate the Karlin
parameters K and Lambda (and H).  This change can lead to occasional warning
messages (usually not fatal errors and not serious) about the score
probabilities not adding up to 1.0.

The "pam" v1.0.3 utility program now calculates a weighted average substitution
score against the ambiguity letter X; a command line option permits the user
to set a constant substitution score instead.

Several .h and .c files had some ANSI-incompatibilities fixed; in particular
"Boolean" parameters were changed to "int" because of the use of old-style
function declarations.

1/17/92
Minor bug fix in lib/mfile.c and a major bug fix in BLAST3's out3.c.
Both bugs were introduced recently; the former one prevented compilation
of mfile.c; the latter one sent the 3-way search phase of BLAST3 into
an infinite loop on single-processor architectures.  Version numbers
are not being incremented.

1/23/92
Fixed bug in sp2fasta.c that caused the last character of each DE line
to be omitted.

2/10/92
Changed SGI IRIX compiler optimization flag from -O3 to -O2 in main copy
of Makefile.sgi, for compatibility with IRIX 4.0.

2/18/92
Switched the BLAST application programs over to using a new version
of the dfa library.  The new dfa library is required.

2/20/92
Made changes to the Makefiles.  Verified that all required libraries
(ncbi, gish, dfa) and programs can be built.  New copies of all dependent
source code should be obtained.

3/9/92
Faster K calculations now performed.  Accuracy is 2+ decimal places for
the PAM120 and 2- places for PAM250.  This generally translates into
only a small error (<1%) in the dependent P-values, expectations, and
bit scores, which seems acceptable for an approximate 20-fold improvement
in the speed of calculating K.  Furthermore, the error in K is on the
high side, so P-values etc. tend to be conservative.  The speed is achieved
by performing fewer iterations in the main K loop and compensating for
this by adding in several corrective terms from a geometric progression
of Altschul's design.

3/27/92
Better handling by BLASTN of cases where the database sequence contains
ambiguity letters.  BLASTN now does not require the original FASTA-format
nucleotide sequence database file.  (TBLASTN still does, however).

3/28/92
Better handling by TBLASTN of cases where the database sequence contains
nucleotide ambiguity codes.  Now neither BLASTN nor TBLASTN requires the
original FASTA-format nucleotide sequence database file.
Long strings that had been static are now allocated dynamically.

3/29/92
blastp, blastn, blastx, tblastn, and blast3 have no theoretical limit
on the line length in the query sequence file; setdb and pressdb have
no theoretical limit on the length of lines in the input FASTA database files.
Several programs were modified to accommodate a change in the gish
library's misc/basename() function--an updated copy of the gish library
must be obtained for compatibility.

3/30/92
Fixed bug in blastn's overlap checking function, ovlap_n(), that caused
minus-strand HSPs to be reported that were intended to be filtered out.  Merged
versions of pvals_a(), pvals_n(), and pvals_t() into a single pvals() function.
Fixed a bug in pressdb that would appear only if each sequence in the input
FASTA-format database file resided on a single (possibly very long) line.

3/31/92
Added a "gap" character, '-', to the amino acid alphabet used by BLASTP,
BLASTX, TBLASTN, and BLAST3, which breaks alignments into separate segments.
BLASTN does not support gap characters.

Fixed a severe bug in the multiprocessing version of TBLASTN:  the
translate() function failed to set s_len, the database sequence length,
in frame 1.  Until the gap letter was introduced to the amino acid alphabet
today, it is not clear that this deficiency caused any problems.  It certainly
did not affect the results on uniprocessing platforms.

Default value for the H (histogram) parameter is now 0 to omit reporting the
histogram.

4/2/92
Added function etop(), which uses new function fct_expm1() in the gish
library, to calculate probabilities from expect values.
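
For a Poisson-distributed number of chance events with mean E, the
probability of observing at least one event is P = 1 - exp(-E).  The Python
sketch below shows why an expm1-style function is useful for this
conversion.  It is an illustration only and is not the code of etop() or
fct_expm1(); that etop() computes exactly this quantity is an assumption.

```python
import math

def etop(expect):
    """Convert an expect value E into the probability 1 - exp(-E).

    math.expm1(x) computes exp(x) - 1 without the loss of precision
    that the naive expression suffers when x is close to zero.
    """
    return -math.expm1(-expect)

# For very small E the naive form underflows to 0.0, while the
# expm1-based form preserves the value (P is approximately E).
naive = 1.0 - math.exp(-1e-20)
precise = etop(1e-20)
```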

Changed the letter 'X' in the nucleotide alphabet to '-', which is supposed
to represent a gap (as it does in the amino acid alphabet), but currently
is treated by BLASTN like a mismatch character.

4/8/92
Pressdb still requires sequence lines to be of equal length (except for
the last line of each sequence, which can be shorter), but it now tolerates
one or more blank lines at the end of each sequence.

4/17/92
Fixed a bug in the single-processor version of blast3 (out3.c) that produced
an infinite loop.

5/15/92
Fixed a bug in blast3 that caused it to produce an unexpected number
of pair-wise alignments.  Often no pairwise alignments were displayed
at all.  This bug had no effect on the 3-way alignments produced.

6/16/92
Added several Hitlist sorting options to each of the BLAST programs
except BLAST3.  -sort_by_pvalue is the default for all.  -sort_by_count
sorts by the number of HSPs in each database sequence's hitlist.
-sort_by_highscore sorts by the highest HSP score in a hitlist.
-sort_by_totalscore sorts by the total of all HSP scores in a hitlist.

Example:

   blastp pir myquery -sort_by_totalscore
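
The four orderings can be sketched in Python as follows.  The dictionary
layout and function name are hypothetical, and the sort directions (ascending
P-value; descending count and scores) are assumptions rather than documented
behavior.

```python
def sort_hitlists(hitlists, mode="pvalue"):
    """Sort hitlists, each a dict with a "pvalue" and a list of HSP
    "scores", according to one of the four sorting modes."""
    keys = {
        "pvalue": lambda h: h["pvalue"],             # most significant first
        "count": lambda h: -len(h["scores"]),        # most HSPs first
        "highscore": lambda h: -max(h["scores"]),    # best single HSP first
        "totalscore": lambda h: -sum(h["scores"]),   # largest score sum first
    }
    return sorted(hitlists, key=keys[mode])

hits = [
    {"name": "a", "pvalue": 0.01, "scores": [90]},
    {"name": "b", "pvalue": 0.001, "scores": [60, 50, 40]},
    {"name": "c", "pvalue": 0.05, "scores": [80, 30]},
]
by_total = [h["name"] for h in sort_hitlists(hits, "totalscore")]
```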

6/18/92
In blastx, corrected the statistic reported for the highest observed
score in each reading frame.

6/25/92
Corrected the way averaging was performed to calculate substitution scores
against letters B and Z in the matrices produced by the pam program (pam.c).
Standard Dayhoff PAM-250 matrix is now included in the distribution,
under the filename "dayhoff".

7/1/92
Corrected a bug in lib/getseq.c that would cause BLASTN and TBLASTN to crash
when reporting hits on single-processor platforms when the compressed
nucleotide database file *.csq was loaded in shared memory.  No effect
if shared memory was not actively in use.

8/5/92
Fixed a bug in the single-processor version of blast3 (out3.c) that produced
an infinite loop.  (How does this bug keep reappearing??)

8/14/92
Changed one fatal error message to what should have been merely a warning
in BLASTN.  Added a warning message to BLASTP and TBLASTN.  No change in
version numbers.

8/25/92
Made the software compatible with DEC Ultrix and other operating systems
running on "little endian" platforms.  BLAST databases, which contain
binary encoded integers, can be shared between big and little endian platforms.
Big endian platforms will be only marginally more efficient.
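
One common way to make binary integers portable is to fix the byte order in
the file and convert on platforms whose native order differs.  The Python
sketch below illustrates the idea.  The helper names and the 4-byte width are
assumptions, and the choice of big-endian storage is only inferred from the
efficiency remark above; none of this is taken from the BLAST source.

```python
import struct

def write_u32(value):
    """Encode an unsigned 32-bit integer in big-endian byte order."""
    return struct.pack(">I", value)

def read_u32(buf, offset=0):
    """Decode a big-endian unsigned 32-bit integer.  The result is the
    same regardless of the byte order of the host platform."""
    return struct.unpack_from(">I", buf, offset)[0]
```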

9/3/92
Corrected the substitution scores for B-X and Z-X reported by pam program.
Current version of pam is 1.0.5.

9/4/92
Added several BLOSUM matrix files to the distribution.  Moved all matrix
files into a new "matrix" subdirectory.  Renamed BLASTPAM environment
variable to BLASTMAT, and changed its default value from "/usr/ncbi/blast/pam"
to "/usr/ncbi/blast/matrix".

9/4/92
Corrected a bug in lib/hsppool.c that caused occasional bus errors and
segmentation violations.

9/7/92
Moved bulk of the low-level multiprocessing support into the "gish" library.

10/1/92
Added gt2fasta program for extracting coding sequence (CDS) feature
translations from files in the GenBank(R) flat file format, saving the
results in a FASTA format file.

10/2/92
Made code compatible with architectures having 8-byte long integers,
e.g. DEC Alpha.

10/26/92
Fixed a bug in searcha.inc regarding the handling of segmented sequences in
BLASTP and TBLASTN.  During examination of a diagonal for hits while ignoring
X, the programs had been halting the diagonal search when a gap character was
encountered in either the query or the database sequence.

11/4/92
Renamed include/blast.h to include/blastapp.h, to prepare for migration
to using a blast function library which contains blast.h.

11/5/92
Moved lib/shmutil.c and lib/mfile.c into the "gish" library, and removed
the USE_SHM macro.

11/16/92
BLASTP prunes its hitlists at the point where the expectation E/S is no
longer satisfied.  E2/S2 is now the cutoff for saving HSPs for subsequent
pruning by the E/S criterion; it is possible that no HSPs remain after
pruning.  Noise is reduced by the pruning, and better sensitivity is obtained
by using a lower cutoff score followed by filtering on Poisson P-values.

12/8/92
sp2fasta now strips carriage-return characters from the definition lines,
so the program now works well when parsing sequence files on the EMBL CD-ROM.

