Dubbed the United Protein Database, or UniProt, the new, public database will combine the resources of three existing protein databases: SWISS-PROT, TrEMBL and the Protein Information Resource (PIR). The action is aimed at ensuring that researchers around the world will have free, unrestricted access to a comprehensive and non-redundant source of protein information, as well as creating a powerful tool for the study of human disease.
"One of the great challenges facing scientists today is interpreting the tremendous amount of data being generated by the Human Genome Project and related research," said Dr. Francis S. Collins, the director of NHGRI. "The UniProt database will become a resource for all scientists to use, both to develop a better understanding of biology and to translate that basic science into clinical applications. This award demonstrates that NHGRI and NIH continue to be committed to funding bioinformatics infrastructure for the international scientific community."
Proteomics research, the large-scale study of proteins and their interactions, has accelerated in recent years because of technological advances in protein science and the large amounts of genomic data pouring out of the Human Genome Project (HGP). These advances have strained the ability of one protein database, SWISS-PROT, a hand-curated and annotated protein-sequence database, to keep pace with the needs of the world's scientists. SWISS-PROT was established in 1986 by Amos Bairoch, Ph.D., now a group leader of SWISS-PROT at the Swiss Institute of Bioinformatics (SIB) in Geneva. Its hands-on, high-quality information processes could not cope with the rising tide of new information coming out of high-throughput DNA sequencing, which has generated information about hundreds of thousands of new proteins from all species being studied, including human.
Enter TrEMBL, a computer-annotated supplement to SWISS-PROT created and maintained by SIB and the European Bioinformatics Institute (EBI), the same European teams that built and maintained SWISS-PROT. TrEMBL was developed to handle the increasing amounts of data being generated by large-scale genomic projects, allowing scientists quicker access to new protein sequences before they were hand-curated and entered into SWISS-PROT.
About the same time, centralizing protein database information became a goal at NHGRI, which, with the National Institute of General Medical Sciences (NIGMS) and the National Library of Medicine (NLM), originally proposed the grant in 2001 as a single award of $4.5 million a year for three years, starting in fiscal year 2002 or 2003.
To achieve this vision of a centralized protein database, NHGRI decided to fund the UniProt project, which will consolidate and build upon the strengths of SWISS-PROT and TrEMBL, as well as the U.S.-operated database, Protein Information Resource. PIR is a joint effort between Georgetown University Medical Center and the National Biomedical Research Foundation in Washington, D.C. PIR was established in 1984 and resulted from the work of Margaret Dayhoff, Ph.D. Her Atlas of Protein Sequence and Structure, published from 1965 to 1978, was the first comprehensive collection of protein sequences. In 1974, Dr. Dayhoff devised the concept of the protein family and super-family, defined by sequence similarity, as a means of organizing and classifying proteins. PIR's entries are organized in this manner and computer-annotated with functional and structural data.
Specifically, the new UniProt database will consist of two parts: the SWISS-PROT section, which will contain fully annotated entries, and the TrEMBL section, which will contain those computer-annotated records that are awaiting hands-on analysis. The PIR group will no longer maintain its database, but will assist in elevating the annotation of TrEMBL records to the SWISS-PROT standard. All existing PIR entries will be integrated into UniProt. Currently, SWISS-PROT holds entries on 114,000 proteins; TrEMBL, 700,000; and PIR, 283,000. By the end of the grant's three-year span, EBI scientists estimate that the total number of proteins in the UniProt database should reach well above the 2 million mark.
Rolf Apweiler, Ph.D., who has led the SWISS-PROT group as coordinator from EBI since 1994, will be the principal investigator of the project.
"With the increasing volume and variety of protein sequences and functional information, UniProt, as the central database of protein sequence, will function as a cornerstone for a wide range of scientists active in modern biological research, especially in the field of proteomics," said Dr. Apweiler.
Dr. Apweiler's co-investigators will be Dr. Bairoch and Cathy Wu, Ph.D., who oversees PIR. Dr. Wu is also a professor of biochemistry and molecular biology at Georgetown University Medical Center, as well as vice president and director of bioinformatics for the National Biomedical Research Foundation, both in Washington, D.C.
"Combining the resources will give us even more new tools to automate and improve the process of annotation," Dr. Bairoch said. "It will give us the chance to combine everything into a unique central resource, and we're particularly excited that we'll be able to integrate some of the tools developed by Cathy Wu."
NHGRI is the primary funding institute for UniProt, contributing $3 million. Other NIH participants, in order of their funding levels, are NIGMS, $1 million; NLM, $460,000; the National Institute of Mental Health, $300,000; the National Center for Research Resources, $100,000; and the National Institute of Dental and Craniofacial Research, $50,000.
As a publicly funded project, UniProt's data will be freely accessible and will be released in a timely manner. A new web site will be created for UniProt at http://www.uniprot.org.
NHGRI is one of the 27 institutes and centers at the NIH, which is an agency of the Department of Health and Human Services (DHHS). The NHGRI Division of Extramural Research supports grants for research and for training and career development at sites nationwide.