P-found Data Repository




Overview

One of the unsolved problems in molecular biology is the protein folding problem, i.e. how a protein acquires its functional three-dimensional structure from its linear sequence of amino acids, which is ultimately determined by the sequence of bases in a gene. In recent years, with the identification of several diseases as protein folding or unfolding disorders and the advent of many genomics projects, protein folding has become a central issue in molecular sciences research. A detailed understanding of the forces and molecular mechanisms driving the protein folding process in vivo and in vitro is essential in areas as diverse as the therapeutics of neurodegenerative diseases and biocatalysis in organic solvents.

Many experimental and computational approaches have been used over the years to tackle the protein folding problem. More recently, fuelled by the ever-increasing computational power available to a wider range of scientists, computational approaches to the study of protein folding and unfolding have taken on increased importance.

Although protein folding and unfolding simulations are expensive in both time and resources, their results are today not generally available outside the groups that perform them. It is therefore becoming apparent that a platform where computer simulations of protein folding and unfolding can be compared and analyzed could be of great importance to the progress of the field.

The P-found project has been developed to support the comparison, analysis and sharing of information and data on protein folding and unfolding simulations. The overall aim of this project is to create a public repository that will enable researchers around the globe to share, analyze and compare simulation data:

  • from different simulation methods (MD, Go models, simplified, full atom, etc.)
  • from different proteins (WT vs. mutant; different structural classes or different topologies)
  • from different simulation details (different temperature, solvent, force field, etc.).
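The three axes of comparison listed above could be captured as metadata attached to each deposited simulation. The following is a minimal sketch only: the `SimulationRecord` class, its field names, the `select` helper and the sample values are all hypothetical and are not taken from the P-found repository itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationRecord:
    """Hypothetical metadata for one deposited simulation (not the P-found schema)."""
    protein: str          # protein name or identifier
    variant: str          # "WT" or a mutant label
    method: str           # simulation method, e.g. "MD" or "Go model"
    temperature_k: float  # simulation temperature in kelvin
    solvent: str
    force_field: str


def select(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Placeholder entries, invented purely for illustration.
records = [
    SimulationRecord("protein A", "WT", "MD", 298.0, "water", "force field X"),
    SimulationRecord("protein A", "mutant 1", "MD", 498.0, "water", "force field X"),
    SimulationRecord("protein B", "WT", "Go model", 350.0, "implicit", "none"),
]

wt_runs = select(records, variant="WT")
hot_md = select(records, method="MD", temperature_k=498.0)
```

With metadata of this kind, comparing WT against mutant runs, or runs at different temperatures, reduces to filtering on the corresponding field.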

The two main functional requirements of this project are sharing of the data and analysis of the data. In the initial stage of project development, sharing means that users around the world can contribute and store (i.e. upload) data from protein folding and unfolding simulations, and can access, select and retrieve (i.e. download) such data from the repository. Analysis means that users can apply a range of data processing and analysis methods (classification, clustering, time-series modeling, etc.) to the simulation data stored in the repository in order to address their scientific questions.

The final goal of the project, however, is to construct a GRID-enabled, truly distributed data warehouse of protein folding and unfolding simulations. At that stage of project development, we will make use of GRID-based procedures and protocols to construct a distributed repository of the data, minimizing uploads and downloads of large data volumes. Analysis tools, whether provided by the data warehouse itself or newly developed by users to address their scientific questions, would then operate on the data where it is stored and report only the results back to the users.
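The idea of running analysis tools on the data where it is stored can be illustrated with a short sketch. Everything here is an assumption made for illustration: the `StorageSite` class, the `run_everywhere` function and the sample values are hypothetical, the sites are simulated in a single process, and no actual GRID protocol is involved.

```python
from statistics import mean


class StorageSite:
    """Hypothetical storage node that keeps its trajectories local."""

    def __init__(self, name, trajectories):
        self.name = name
        # Maps a simulation id to one numeric value per frame.
        self._trajectories = trajectories

    def run(self, analysis):
        """Apply `analysis` to each local trajectory and return only the results."""
        return {
            sim_id: analysis(frames)
            for sim_id, frames in self._trajectories.items()
        }


def run_everywhere(sites, analysis):
    """Send one analysis function to every site and collect the small results."""
    results = {}
    for site in sites:
        for sim_id, value in site.run(analysis).items():
            results[(site.name, sim_id)] = value
    return results


# Placeholder data, invented purely for illustration.
sites = [
    StorageSite("site-1", {"sim-a": [1.0, 2.0, 3.0]}),
    StorageSite("site-2", {"sim-b": [4.0, 6.0]}),
]

summary = run_everywhere(sites, mean)
```

In this arrangement the per-frame data never leaves its site; only one number per simulation is returned, which is the property that keeps transfers of large data volumes to a minimum.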

We believe that a publicly available repository of protein folding and unfolding simulations will allow more efficient dissemination of the scientific results these simulations provide. Perhaps even more importantly, it will allow the development of effective analysis tools to compare and characterize these simulations. This may have a very positive impact on the understanding of the molecular mechanisms of protein folding, misfolding and aggregation, on protein structure prediction, and even on areas such as de novo protein design in aqueous or organic solution.