1. What is bioinformatics?

The history of bioinformatics

Although the first bioinformatics database was established as early as the 1960s and the first computational algorithms for analyzing biological sequences appeared in the 1970s, the concept of bioinformatics was not generally adopted by the scientific community until the end of the 1980s. The establishment of bioinformatics as a separate branch of science was driven primarily by the growing volume of results from sequencing projects: the data had to be not only properly maintained, but also adequately documented (annotated) and then efficiently analyzed. Statistical models and computational algorithms that had previously been of purely academic interest suddenly acquired considerable practical importance.

Margaret O. Dayhoff
The first database of biological sequences was the Atlas of Protein Sequence and Structure, set up in 1965 by Margaret Dayhoff and her colleagues. They not only collected known protein sequences but also sorted them into groups (families and superfamilies) according to sequence similarity. They were subsequently able to derive phylogenetic trees from the aligned sequences and to illustrate graphically the relationships among them. From the sequence alignments they also derived a table of the likelihoods of mutual amino acid substitutions (the accepted evolutionary mutations of amino acids), and so compiled the first substitution matrix, PAM (point accepted mutation, often glossed as percent accepted mutations).

The initiative to found the first database of DNA sequences came, perhaps surprisingly, from a group of physicists around Walter Goad in the USA in 1974. It led to the establishment of the GenBank database in 1982. In the meantime (1980), a similar database had been created in Europe: the European Molecular Biology Laboratory (EMBL) Data Library. Finally, the DNA Data Bank of Japan (DDBJ) was established in 1984. These three large primary databases (GenBank, EMBL and DDBJ) are now united in the International Nucleotide Sequence Database Collaboration, within which they exchange data on a daily basis.

As the number of DNA sequences grew during the 1970s, so did interest in developing computer programs for their analysis. In 1970 Saul Needleman and Christian Wunsch proposed the first algorithm for determining the similarity between two sequences that could accommodate substitutions and deletions at individual sequence positions. The algorithm, which is named after them, was the first bioinformatics application based on the principles of dynamic programming. The Needleman-Wunsch algorithm finds an optimal alignment of two sequences (pairwise sequence alignment) over their entire length; it is therefore a global sequence alignment method.
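
The dynamic programming idea behind global alignment can be illustrated with a minimal sketch. The scoring scheme (match = +1, mismatch = -1, linear gap penalty = -2) and the function name are assumptions chosen for illustration; they are not taken from the original publication, and real tools typically use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, `needleman_wunsch("ACGT", "AGT")` aligns `ACGT` with `A-GT`. Because every position of both sequences must appear in the result, the alignment is global.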


Temple F. Smith and Michael Waterman
However, the biologically significant (functional) regions of DNA and protein sequences are individual sections separated by less important (variable) segments. Based on this observation, Temple Smith and Michael Waterman proposed in 1981 a modification of the original algorithm for aligning two sequences. Their algorithm is also based on dynamic programming, but to find the optimal alignment it compares segments of all possible lengths within the original sequences. The Smith-Waterman algorithm is therefore a local sequence alignment method.
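
A minimal sketch of local alignment follows. It assumes a simple scoring scheme (match = +2, mismatch = -1, linear gap penalty = -2) and an illustrative function name; neither comes from the original publication. Compared with global alignment, scores are not allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a new local alignment here
                score[i - 1][j - 1] + pair,  # extend along the diagonal
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `smith_waterman("AAAGATTACATTT", "CCGATTACAGG")` reports only the shared segment `GATTACA` and ignores the dissimilar flanking regions.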

In the following period, programs were developed to compare more than two sequences simultaneously. Programs for multiple sequence alignment are computationally more demanding and are generally based on the progressive alignment of the most similar sequence pairs. Multiple alignment of sequences belonging to one family made it possible to detect sequence motifs characteristic of that family. Methods of multiple sequence alignment also contributed significantly to the development of molecular phylogenetics.

Stephen F. Altschul
Although algorithms for global and local alignment of two sequences provide a very effective way to assess their mutual similarity, they are too slow to be practical when an arbitrary sequence is to be compared with every sequence in a database. In 1988 Bill Pearson and David Lipman developed the program FASTA, which could scan an entire database and find the most similar sequences in a sufficiently short time. About two years later, Stephen Altschul and his team proposed a similar algorithm that was even faster. On this basis they built the program BLAST, probably the most widely used bioinformatics program today.
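
The text does not describe how these programs achieve their speed. A commonly described ingredient of such heuristics is to index short words (k-mers) in the database and to examine only sequences that share words with the query. The sketch below illustrates that idea alone; it is not the FASTA or BLAST algorithm, and the function names, the word length `k=3`, and the toy database are assumptions made for illustration.

```python
from collections import defaultdict


def build_kmer_index(database, k=3):
    """Map every k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def rank_by_shared_kmers(query, index, k=3):
    """Rank database sequences by the number of k-mer hits with the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        # Look up each query word; sequences without shared words are never touched.
        for name, _ in index.get(query[pos:pos + k], ()):
            hits[name] += 1
    # Most hits first; ties broken alphabetically by name.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database of protein fragments.
toy_database = {
    "seq1": "MKTAYIAKQR",
    "seq2": "GGGGGGGG",
    "seq3": "TAYIAK",
}
toy_index = build_kmer_index(toy_database)
candidates = rank_by_shared_kmers("KTAYIA", toy_index)
```

Here `candidates` lists `seq1` and `seq3` but not `seq2`, which shares no word with the query. In a real search, only such candidates would then be passed on to a more careful (and slower) alignment step.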

Amos Bairoch
The growing number of nucleotide and amino acid sequences also led to a new type of database. These secondary databases aim to process information from the primary databases at a higher level. The first secondary database was built in 1988 by Amos Bairoch of the Swiss Institute of Bioinformatics: PROSITE, a database of protein sequence patterns and motifs. Today there are hundreds of secondary databases. At the same time, databases were created to maintain and make available biological data other than sequences. For example, the PDB (Protein Data Bank), a database of 3D structures of proteins and nucleic acids obtained by X-ray crystallography and nuclear magnetic resonance, was established as early as 1971.
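
To give an idea of what a sequence pattern in a secondary database looks like, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression. The function name is hypothetical and the converter is deliberately simplified: it handles the basic elements (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` as sequence-end anchors outside brackets) and ignores rarer cases of the syntax. The test sequences are invented for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regular expression string."""
    regex = pattern.rstrip(".")                          # patterns may end with a period
    regex = regex.replace("-", "")                       # '-' only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} means any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) means 2 to 4 repetitions
    regex = regex.replace("x", ".")                      # x means any residue
    regex = regex.replace("<", "^").replace(">", "$")    # anchors at the sequence ends
    return regex


# The N-glycosylation site pattern: N, not P, S or T, not P.
glyco = re.compile(prosite_to_regex("N-{P}-[ST]-{P}."))

match_found = glyco.search("AANGSAA")   # contains N-G-S-A, which fits the pattern
no_match = glyco.search("AANPSAA")      # P after N is excluded by {P}
```

In this sketch `match_found` is a match object covering `NGSA`, while `no_match` is `None`.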

Frederick Sanger
The growing number of complete genome sequences from different organisms had a significant impact on the development of bioinformatics and its infrastructure. The first to be sequenced was the RNA genome of bacteriophage MS2, by Walter Fiers in 1976. The first complete DNA genome sequence (bacteriophage phiX174) was obtained by Frederick Sanger and his colleagues only a few months later (1977). The first genome sequence of a free-living organism came from the bacterium Haemophilus influenzae (1995), the first eukaryotic genome was that of baker’s yeast Saccharomyces cerevisiae (1996), and the first multicellular organism with a known genome was another model organism, the nematode Caenorhabditis elegans (1998). In 2003 the complete human genome was published. Today, thousands of genome sequences from various organisms are known, most of them from single-celled organisms and viruses.

The development of bioinformatics would not have been possible without parallel progress in computer technology and Internet communication. Only because the power of computers doubles roughly every two years (Moore’s law) has bioinformatics been able to cope with the exponential growth of biological data that we have been witnessing for more than two decades. The onset of high-throughput experimental methods in molecular biology increases the demands on computing capacity even further.

 

 
 
Last update: 08.12.2022 Authors: Matej Stano and Lubos Klucar, Institute of Molecular Biology, SAS Bratislava
Creative Commons License