One of the major unsolved problems in molecular biology is the protein folding problem,
i.e. the acquisition of the functional three-dimensional structure of a protein from its
linear sequence of amino acids, ultimately determined by the sequence of bases in a gene.
In recent years, with the identification of several diseases as protein folding or unfolding
disorders and the advent of many genomics projects, protein folding has become a central
issue in molecular sciences research. A detailed understanding of the forces and molecular
mechanisms driving the protein folding process in vivo and in vitro is essential in areas as
diverse as the therapeutics of neurodegenerative diseases and biocatalysis in organic solvents.
Many experimental and computational approaches have been used over the years to tackle
the protein folding problem. More recently, fuelled by the ever-increasing computational
power available to a wider range of scientists, computational approaches to the study of protein folding
and unfolding have taken on increased importance.
Although protein folding and unfolding simulations are expensive (in time and resources),
their results are today not generally available outside the groups that perform them. It is thus
becoming apparent that the creation of a platform where computer simulations of protein folding and unfolding
can be compared and analyzed could be of the utmost importance to the progress of the field.
To address the goal of comparison, analysis and sharing of information and data on protein folding and
unfolding simulations, the project P-found has been developed. The overall aim of this project is to create
a public repository of data that will enable researchers around the globe to share, analyze and compare
simulation data:
- from different simulation methods (MD, Go models, simplified, full atom, etc.)
- from different proteins (WT vs. mutant; different structural classes or different topologies)
- from different simulation details (different temperature, solvent, force field, etc).
The two main functional requirements in this project are sharing of the data and analysis of the data.
In the initial stage of project development, sharing of the data arising from protein folding and unfolding
simulations means that users around the world can contribute and store (i.e. upload) data from their
simulations, and can access, select and retrieve (i.e. download) such data from the
repository. Analysis of the protein folding and unfolding simulation data stored in the
repository means that users can apply a range of data processing and analysis methods (classification,
clustering, time-series modeling, etc.) to the data in order to address their scientific questions. However,
the final goal of the project is to construct a GRID-enabled, truly distributed, data warehouse of protein
folding and unfolding simulations. At that stage of project development, we will make use of GRID-based
procedures and protocols in order to construct a distributed repository of the data, minimizing uploads or
downloads of large data volumes. In this case, analysis tools provided by the data warehouse itself or newly
developed by the users to address their scientific questions would operate on the data where they are
stored and report the results back to the users.
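As a rough illustration of the two requirements described above (selecting shared data by its
metadata, and running an analysis where the data reside so that only the results travel back),
consider the following Python sketch. All names here (`SimulationRecord`, `select`,
`analyze_in_place`, `mean_rmsd`), the metadata fields and the toy values are hypothetical and are
not part of P-found; the per-frame RMSD series is only a stand-in for whatever property a real
analysis would compute.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SimulationRecord:
    """Hypothetical metadata for one deposited trajectory (not the P-found schema)."""
    protein: str          # protein name or identifier
    variant: str          # "WT" or a mutant label
    method: str           # e.g. "MD" or "Go model"
    temperature_k: float  # simulation temperature in kelvin
    solvent: str
    force_field: str
    site: str             # where the trajectory data physically reside
    rmsd_series: tuple    # stand-in for a per-frame property of the trajectory


def select(records, **criteria):
    """Return the records whose metadata match every field=value pair given."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


def analyze_in_place(records, analysis):
    """Apply `analysis` to each record and return only the (small) results,
    mimicking tools that run at the place where the data are stored."""
    return {(r.protein, r.variant, r.temperature_k): analysis(r) for r in records}


def mean_rmsd(record):
    """Toy analysis: average of the stand-in per-frame series."""
    return mean(record.rmsd_series)


# Made-up entries, for illustration only.
repository = [
    SimulationRecord("protein_A", "WT", "MD", 300.0, "water", "ff_X", "site_1",
                     (1.0, 2.0, 3.0)),
    SimulationRecord("protein_A", "mutant_1", "MD", 500.0, "water", "ff_X", "site_2",
                     (2.0, 4.0, 6.0)),
    SimulationRecord("protein_B", "WT", "Go model", 300.0, "implicit", "ff_Y", "site_1",
                     (0.5, 1.5)),
]

md_runs = select(repository, method="MD")        # compare across proteins/variants
summary = analyze_in_place(md_runs, mean_rmsd)   # only summaries are returned
```

In a distributed setting, `analyze_in_place` would be dispatched to each `site` rather than
executed locally; that dispatch mechanism is deliberately left out of this sketch.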
We believe that the development and implementation of a publicly available repository of protein folding and
unfolding simulations will allow a more efficient dissemination of the scientific results provided by these
simulations and, perhaps even more importantly, will allow the development of effective analysis tools to compare
and characterize these simulations. This may have a very positive impact on the understanding of the
molecular mechanisms of protein folding, misfolding and aggregation, on protein structure prediction,
and even on areas such as de novo protein design in aqueous or organic solution.