Data Cleansing of Genome Data
Increasing interest in genome data has led to a multitude of publicly available genome databases. By genome data we mean nucleic acid (DNA and RNA) and amino acid (protein) sequence data, together with their structural and functional classification (annotation). Genome annotation is the process of assigning meaning to sequence data by identifying regions of interest and determining their function. The abundance of errors in genome databases is well known, and errors in genome annotation are a major problem [Müller, Naumann, Freytag 2003].
Data cleansing is the process of (semi-)automatically detecting and correcting errors in data collections. Cleansing data of impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy and thereby the usability of existing data.
Problems, Methods, and Challenges in Comprehensive Data Cleansing
Problems, approaches, and methods for comprehensive data cleansing are surveyed in [Müller, Freytag, 2003]. We classify the types of anomalies and errors that occur in data and have to be eliminated, and we define a set of six basic quality criteria that comprehensively cleansed data has to satisfy. This enables the evaluation and comparison of existing data cleansing approaches with respect to the types of errors they handle and eliminate and the quality criteria they affect. We also describe the general steps of data cleansing, specify the methods used within the cleansing process, and give an outlook on research directions that could complement existing systems.
The evaluation of existing data cleansing approaches reveals that the main focus until now has been on syntactical inconsistencies, handled by transformation, normalization, and standardization, and on duplicate elimination. Verification and assurance of value correctness within a database is handled only marginally. Still, semantic errors in databases pose a major hindrance to the gainful application of data. This is a major issue, especially in genome research.
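To illustrate the syntactic side of cleansing mentioned above, the following is a minimal Python sketch of normalization followed by duplicate elimination. The record layout (fields "id" and "sequence") and the rule that two records are duplicates when their normalized sequences match are assumptions made for illustration only; they are not taken from the surveyed systems.

```python
def normalize_record(record):
    """Standardize a raw record (hypothetical layout: 'id' and 'sequence')."""
    return {
        # Trim surrounding whitespace and unify the case of the identifier.
        "id": record["id"].strip().upper(),
        # Remove all whitespace inside the sequence and uppercase it.
        "sequence": "".join(record["sequence"].split()).upper(),
    }


def eliminate_duplicates(records):
    """Keep the first record for each distinct normalized sequence.

    Treating identical normalized sequences as duplicates is an
    assumption of this sketch, not a rule from the source.
    """
    first_seen = {}
    for rec in map(normalize_record, records):
        first_seen.setdefault(rec["sequence"], rec)
    return list(first_seen.values())


raw_records = [
    {"id": " p1 ", "sequence": "acg t"},
    {"id": "P2", "sequence": "ACGT"},
    {"id": "p3", "sequence": "ttga"},
]
cleaned = eliminate_duplicates(raw_records)
```

Note that such a procedure only removes syntactic variation. It cannot tell whether the values that remain are correct, which is the gap that semantic cleansing addresses.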
Semantic Data Cleansing in Genome Databases
Errors in genome data can result in improper target selection for biological experiments or pharmaceutical research, which in turn results in a loss of money. Missing, incomplete, or erroneous information hinders the automatic processing and analysis of data. This leads to a loss of confidence and a rise in effort and frustration for the biologist. Several studies show the existence of errors in genome databases. The main causes of poor data quality are experimental errors, misannotation, and outdated data.
We define semantic cleansing of genome data as the process of assuring correctness of annotations for genome sequences [Müller 2003]. This is performed by identifying erroneous annotations and re-annotating them. Using a simple example, we validated the applicability of this approach and identified open problems and challenges for reliable cleansing of genome data.
Semantic cleansing of genome data is closely related to genome annotation. Both require domain-dependent evidence functions. The definition of a set of general evidence functions for the domain of genome annotation will enable us to build a formal model to specify the annotation and cleansing process. The intrinsic properties of these individual functions can then be used to detect erroneous annotations without the necessity of complete re-annotation.
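The role of an evidence function can be sketched in Python as follows. This is only an illustration under stated assumptions: the Annotation type, the threshold, the MOTIFS table, and the toy motif_evidence function are all hypothetical and are not part of the formal model described here. The sketch checks each stored annotation against an evidence function and flags those with weak support, rather than re-annotating every sequence.

```python
from typing import Callable, Dict, List, NamedTuple


class Annotation(NamedTuple):
    """Hypothetical annotation: a function assigned to a sequence."""
    sequence_id: str
    function: str


# An evidence function maps (sequence, claimed function) to a score in [0, 1].
EvidenceFunction = Callable[[str, str], float]

# Invented motif table, used only to make the toy evidence function runnable.
MOTIFS: Dict[str, str] = {"kinase": "GKS"}


def motif_evidence(sequence: str, function: str) -> float:
    """Toy evidence: 1.0 if the invented motif for the function occurs."""
    motif = MOTIFS.get(function)
    return 1.0 if motif is not None and motif in sequence else 0.0


def find_suspect_annotations(
    annotations: List[Annotation],
    sequences: Dict[str, str],
    evidence_fn: EvidenceFunction,
    threshold: float = 0.5,
) -> List[Annotation]:
    """Return annotations whose evidence falls below the threshold."""
    return [
        ann
        for ann in annotations
        if evidence_fn(sequences[ann.sequence_id], ann.function) < threshold
    ]


sequences = {"s1": "MAGKSTT", "s2": "MLLPPA"}
annotations = [Annotation("s1", "kinase"), Annotation("s2", "kinase")]
suspects = find_suspect_annotations(annotations, sequences, motif_evidence)
```

In this toy setting only the annotation for s2 is flagged, so only that entry would be passed on for re-annotation. A real evidence function would of course rest on domain knowledge such as sequence similarity or experimental support.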
In those cases where alternative solutions and their evidence values are managed, it is desirable to include them in the annotation and cleansing process to obtain results of higher quality. Some genome databases are also beginning to manage such evidence for their entries. Credible annotations can be derived by excluding invalid or unreliable entries from processing. The formal model for genome annotation has to take this evidence into account.
Including the management of cleansing lineage within the model further enables efficient detection and re-annotation of affected annotations when changes in external data sources occur.
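One way to picture cleansing lineage is as an index from external source entries to the annotations derived from them. The Python sketch below assumes exactly that; the class name LineageIndex, its methods, and the identifier strings are hypothetical and serve only to show how affected annotations could be found without scanning the whole database.

```python
from collections import defaultdict


class LineageIndex:
    """Hypothetical lineage store: source entry -> dependent annotations."""

    def __init__(self):
        self._dependents = defaultdict(set)

    def record(self, annotation_id, source_entries):
        """Remember which external entries an annotation was derived from."""
        for entry in source_entries:
            self._dependents[entry].add(annotation_id)

    def affected_by(self, changed_entries):
        """Return all annotations derived from any of the changed entries."""
        affected = set()
        for entry in changed_entries:
            # .get avoids creating empty sets for unknown entries.
            affected |= self._dependents.get(entry, set())
        return affected


lineage = LineageIndex()
lineage.record("ann1", ["srcA:1", "srcB:7"])
lineage.record("ann2", ["srcB:7"])
lineage.record("ann3", ["srcC:2"])
to_recheck = lineage.affected_by(["srcB:7"])
```

When the entry "srcB:7" changes, only ann1 and ann2 need to be re-examined, while ann3 is left untouched.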
Last update: Tuesday, November 27, 2007