

Over the past decade, we have observed a massive increase in the amount of information describing protein sequences from a variety of organisms. (1, 2) While this may reflect the diversity in sequence space, and possibly also in function space, (3) a large proportion of the sequences lacks any useful function annotation. (4, 5) Often these sequences are annotated as putative or hypothetical proteins, and for the majority their functions remain unknown. (6, 7) Suggestions about potential protein function, primarily molecular function, often come from computational analysis of their sequences. For instance, homology detection allows for the transfer of information from well-characterized protein segments to those with similar sequences that lack annotation of molecular function. (8-10) Other aspects of function, such as the biological processes proteins participate in, may come from genetic- and disease-association studies, expression and interaction network data, and comparative genomics approaches that investigate genomic context. (11-17) Characterization of unannotated and uncharacterized protein segments is expected to lead to the discovery of novel functions as well as provide important insights into existing biological processes. In addition, it is likely to shed new light on molecular mechanisms of diseases that are not yet fully understood. Thus, uncharacterized protein segments are likely to be a large source of functional novelty relevant for discovering new biology.

Traditionally, protein function has been viewed as critically dependent on the well-defined and folded three-dimensional structure of the polypeptide chain. This classical structure–function paradigm (Figure 1, left panel) has mainly been based on concepts explaining the specificity of enzymes, and on structures of folded proteins that have been determined primarily using X-ray diffraction on protein crystals. The classical concept implies that protein sequence defines structure, which in turn determines function; that is, function can be inferred from the sequence and its structure. Even when protein sequences diverge during evolution, for example, after gene duplication, the overall fold of their structures remains roughly the same. Therefore, structural similarity between proteins can reveal distant evolutionary relationships that are not easily detectable using sequence-based methods. (18, 19) Structural genomics efforts such as the Protein Structure Initiative (PSI) have been set up to enlarge the space of known protein folds and their functions, thereby complementing sequence-based methods in an attempt to fill the gap of sequences for which there is no function annotation. (20, 21) Specifically, phase two of the PSI aimed to structurally characterize proteins and protein domains of unknown function, often providing the first hypothesis about their function and serving as a starting point for their further characterization.

Classification schemes provide a guideline for systematic function assignment to proteins. Generally, proteins are made up of one or more domains that can have distinct molecular functions. These domains, which are referred to as structured domains, often fold independently, make precise tertiary contacts, and adopt a specific three-dimensional structure to carry out their function. The sequences that compose structured domains can be organized into families of homologous sequences, whose members are likely to share a common evolutionary relationship and molecular function. The Pfam database classifies known protein sequences and contains almost 15 000 such families, for most of which there is some understanding about the function. (22) Nevertheless, Pfam also contains more than 3000 families annotated as domains of unknown function, or DUFs. (23) These families are largely made up of hypothetical proteins and await function annotation. Another powerful example of a protein classification scheme is the Structural Classification of Proteins (SCOP), which provides a means of grouping proteins with known structure together, based on their structural and evolutionary relationships. (24, 25) SCOP utilizes a hierarchical classification consisting of four levels, (i) family, (ii) superfamily, (iii) fold, and (iv) class, with each level corresponding to different degrees of structural similarity and evolutionary relatedness between members.
