Introduction to Classification Techniques in Bioinformatics
Bioinformatics is the application of computer science techniques to biology. Its aims are to help life scientists organize biological data and to develop the computational tools needed to discover new scientific hypotheses. Classification techniques, used here in the sense of clustering techniques, are important in bioinformatics because they separate biological data into distinct sets whose members share similar attributes.
-
History
-
The volume of biological data has been growing exponentially, with the amount of information observed to double every 15 months. As a result, computer science and informatics techniques are used intensively to process and manage biological data. A central idea in bioinformatics is that much biological data shares similar characteristics and can therefore be separated into clusters. For instance, the genes of an organism can be classified by functional group or by metabolic pathway. Proteins can also be classified based on the genes that are expressed. Classification, or clustering, techniques are necessary for managing huge databases of genetic and biological data. Two primary types of classification technique are used in bioinformatics: hierarchical classification and k-Means classification.
Hierarchical Classification
-
The hierarchical classification technique organizes biological data into a tree data structure. Genes are represented as nodes in the tree, and each sub-tree represents a cluster, or grouping, of genes. The tree can be either rooted or unrooted. A rooted tree has a single topmost node, the root, from which every other node descends. In contrast, an unrooted tree has no designated root: it shows how the nodes are related to one another without singling out a topmost node.
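As a rough illustration of how such a tree can be built, the sketch below merges the two closest clusters repeatedly until a single rooted tree remains. It is a minimal example under stated assumptions, not a reference implementation: the choice of average linkage and Euclidean distance, the function names, and the toy expression profiles are all invented for illustration.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leaves(tree):
    """Return the gene labels found under a tree node (nested tuples)."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def hierarchical_cluster(profiles):
    """Agglomerative clustering with average linkage (an assumed choice).

    profiles maps a gene label (a string) to a list of numbers.
    Returns a rooted binary tree written as nested 2-tuples of labels.
    """
    def linkage(c1, c2):
        # Average distance over all pairs of genes drawn from the two clusters.
        pairs = [(a, b) for a in leaves(c1) for b in leaves(c2)]
        return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    clusters = list(profiles)  # every gene starts as its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge it into one sub-tree.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append((c1, c2))
    return clusters[0]


# Hypothetical toy data: two measurements per gene.
toy_profiles = {
    "geneA": [1.0, 1.1],
    "geneB": [1.2, 0.9],
    "geneC": [5.0, 5.2],
    "geneD": [5.1, 4.9],
}
print(hierarchical_cluster(toy_profiles))
# (('geneA', 'geneB'), ('geneC', 'geneD'))
```

Each nested tuple is a sub-tree, so the two inner tuples are the two clusters and the outer tuple is the root. This naive version recomputes every pairwise distance on each pass, which is acceptable for a handful of genes but not for large data sets.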
-
k-Means Classification
-
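A more complicated classification technique is k-Means classification, which attempts to find a set of k centers that minimizes the squared error distortion, that is, the mean squared distance from each data point in multidimensional space to its nearest center. A cluster is formed by grouping the points that share the same nearest center. The Lloyd algorithm is often used for k-Means classification. In this algorithm, the data points are first arranged arbitrarily into k clusters. Two steps are then repeated: each center is moved to the mean of the points in its cluster, and each point is reassigned to its nearest center. The process stops when the assignments no longer change, at which point the squared error distortion has reached a local, though not necessarily global, minimum.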
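The Lloyd algorithm can be sketched in a few lines. The version below is a minimal illustration rather than a definitive implementation: the function names and parameters are invented here, and it starts from k randomly chosen data points as initial centers (an assumed variant) instead of from a random arrangement of points into clusters.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def distortion(points, centers):
    """Squared error distortion: mean squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)


def assign(points, centers):
    """Group every point with its nearest center; returns one list per center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: squared_distance(p, centers[i]))
        clusters[nearest].append(p)
    return clusters


def lloyd_kmeans(points, k, initial_centers=None, max_iter=100, seed=0):
    """Lloyd's heuristic; finds a local minimum of the distortion.

    If initial_centers is given it is used as-is and k is ignored;
    otherwise k distinct data points are drawn at random.
    """
    rng = random.Random(seed)
    centers = [list(c) for c in (initial_centers or rng.sample(points, k))]
    for _ in range(max_iter):
        clusters = assign(points, centers)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        new_centers = [
            [sum(coord) / len(cluster) for coord in zip(*cluster)] if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the assignment is stable
            break
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [[1.0, 1.0], [1.5, 1.0], [1.0, 1.5], [8.0, 8.0], [8.5, 8.0], [8.0, 8.5]]
centers, clusters = lloyd_kmeans(toy_points, k=2)
print(centers, distortion(toy_points, centers))
```

Because the result depends on the starting centers, a common practice is to run the algorithm several times with different seeds and keep the run with the lowest distortion.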
Significance
-
After related proteins have been classified into groups, life scientists can use what is known about the well-studied members of a group to predict the properties of less-studied proteins in the same group. The same approach is applicable to other aspects of the structure of proteins. Another use of classification techniques is to determine the evolutionary tree of a set of organisms from their genetic sequences. The evolutionary tree is constructed from the DNA sequences of the organisms, most naturally with hierarchical classification, because that technique produces a tree directly; k-Means classification groups the sequences into clusters but does not by itself arrange them into a tree.
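Building a tree from genetic sequences first requires a measure of how different two sequences are. The sketch below shows one simple possibility, the fraction of differing positions between aligned sequences of equal length. This measure, the function names, and the sequences themselves are assumptions made for illustration; real analyses normally use alignment tools and more refined evolutionary distance models.

```python
def p_distance(seq_a, seq_b):
    """Fraction of positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def distance_matrix(sequences):
    """Pairwise distances for a dict of name -> aligned sequence."""
    names = sorted(sequences)
    return {(a, b): p_distance(sequences[a], sequences[b]) for a in names for b in names}


# Hypothetical aligned sequences, invented for illustration only.
toy_sequences = {
    "organism1": "ACGTACGT",
    "organism2": "ACGTACGA",
    "organism3": "TCGAACCA",
}
for pair, dist in distance_matrix(toy_sequences).items():
    print(pair, dist)
```

A matrix of this kind is the usual input to a tree-building procedure: the organisms with the smallest distances are grouped first.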
Considerations
-
Hierarchical classification is a relatively simple and effective way of clustering biological data. In contrast, no efficient algorithm is known, at the time of writing, that finds the optimal k-Means clustering as the size of the biological data increases; practical methods such as the Lloyd algorithm reach only a locally optimal solution. As a result, substantial computational power is often required to perform k-Means classification well, which is an important factor to consider when selecting a classification technique for bioinformatics applications.
-