BioHDF Update

An overview of the BioHDF storage format and updates on recent work.

BioHDF

Todd Smith, Geospiza, Inc. (www.geospiza.com)
Mike Folk, The HDF Group (www.hdfgroup.org)

Next Generation (“Next Gen”) DNA sequencing platforms, which combine molecular resolution with massively parallel data throughput, are dramatically lowering sequencing costs while increasing sensitivity and specificity, allowing researchers to think about DNA sequencing as a quantitative assay. For example, in a single instrument run, an Expressed Sequence Tag (EST) experiment can yield millions of sequences and detect rare transcripts that cannot be found any other way [1-3]. In cancer research, high sampling rates will allow for the detection of rare sequence variants in populations of tumor cells that could be prognostic indicators or provide insights for new therapeutics [1, 4, 5]. In viral assays, it will be possible to determine the sequence of individual viral genomes and detect drug-resistant strains as they appear [6, 7]. Next Gen sequencing has considerable appeal, in part because the large numbers of sequences that can be obtained for analysis will make statistical calculations more reliable and improve diagnostic assays.

A significant challenge to working with Next Gen sequencing data lies in the inability of current bioinformatics programs to scale, due to inefficiencies in their data storage, processing, and memory utilization. The archetypical phred/phrap/polyphred [8-12] system, for example, creates no fewer than four copies of each base and quality value as data are processed from sequence traces, to reads, to assembled data, to variant predictions. Such programs also suffer inefficiencies because they cannot read their own output as input. This constraint means that each assembly or set of alignments must be repeated every time a new analysis is needed. Consequently, incremental analysis is impossible without writing additional programs. Finally, sequence assembly and alignment programs typically require that all data be held in random access memory (RAM) as the program runs. The current state-of-the-art Next Gen algorithms repeat these patterns of the past and require expensive high-memory, large-storage computing architectures to operate.

The scale of Next Gen sequencing is only going to increase, so we need to fundamentally change the way we work with data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by applications that analyze DNA sequence data. Geospiza and The HDF Group (THG) are combining their expertise in laboratory information management systems and high-performance scalable scientific data technologies to address data management issues that must be overcome if the full potential of next and future generation DNA sequencing platforms is to be realized. Our proposed work will deliver these capabilities by building on a recognized technology (HDF5) that has proven its ability to meet similar scalability demands in other areas of science. We call the extensible domain-specific data technologies that will be built "BioHDF." In the remaining sections we describe HDF, its features and benefits, and close with a brief description of the BioHDF project.

HDF


When one considers ways to work with data on a computer, three problems must be solved: defining a data model, determining a data format, and implementing the data model and format in software.

The data model describes data entities, the attributes of an entity (time, length, type) and the relationships among entities.  The data format is how the data will be represented in a computer and in storage (simple text files, XML, binary files, relational tables).  The implementation is how the data model and format are instantiated in software.  Implementation is accomplished in a variety of ways, such as relational databases, object databases, Excel, and software utilities and libraries.  Implementation determines how users will interact with the data to access data, perform calculations, and create information. The factors that affect data model, format and implementation choices include ease of use, scalability (time, space, and complexity), and application requirements (reads, writes, data persistence, updates). 
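
The separation between model, format, and implementation can be made concrete with a small sketch. The example below is only an illustration in plain Python, not BioHDF or HDF5 code: the `Read` class, the FASTQ-like text layout, and the binary layout are all assumptions chosen for brevity. It shows one data model (a read with bases and per-base quality values) written in two different formats by one implementation.

```python
import struct
from dataclasses import dataclass
from typing import List

# Data model (hypothetical): a read has a name, bases, and one quality per base.
@dataclass
class Read:
    name: str
    bases: str
    quals: List[int]

# Format A: FASTQ-like text, with qualities written as Phred+33 characters.
def to_text(read: Read) -> str:
    qual_chars = "".join(chr(q + 33) for q in read.quals)
    return f"@{read.name}\n{read.bases}\n+\n{qual_chars}\n"

# Format B: an assumed binary layout: two little-endian uint16 lengths,
# then the name, the bases, and one unsigned byte per quality value.
def to_binary(read: Read) -> bytes:
    name = read.name.encode("ascii")
    bases = read.bases.encode("ascii")
    header = struct.pack("<HH", len(name), len(bases))
    return header + name + bases + bytes(read.quals)

# Implementation: code that turns the stored bytes back into the model.
def from_binary(buf: bytes) -> Read:
    name_len, n_bases = struct.unpack_from("<HH", buf, 0)
    pos = 4
    name = buf[pos:pos + name_len].decode("ascii")
    pos += name_len
    bases = buf[pos:pos + n_bases].decode("ascii")
    pos += n_bases
    quals = list(buf[pos:pos + n_bases])
    return Read(name, bases, quals)
```

The model stays the same in both cases; only the representation changes. A program that reads the binary form gets the quality values back as numbers without parsing any text.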

In DNA sequencing, traditional bioinformatics practices have sacrificed ease of use and scalability to achieve data production and research objectives. Traditional data models, formats and implementations focus on the particular data that one or a few applications work with, and fail to encompass the wide range of data types that must be dealt with in the course of an experiment. The result is that, in a typical workflow, a great deal of redundancy occurs as data is passed from one stage to another. The same narrow focus typically limits the size and complexity of data that these models, formats and implementations are designed to handle, as most use simple, text-based models and formats. Until recently, these practices have been extremely successful. However, Next Gen sequencing has acute scalability and ease of use requirements, so new ways of working with data need to be considered.

For these and other reasons, Geospiza felt it would be worthwhile to explore general purpose, open source, binary file storage technologies, and looked to other scientific communities to learn how similar problems were being addressed. That search identified HDF (hierarchical data format) as a candidate technology. Initially developed in 1988 for storing scientific data, HDF is well established in many scientific fields, and bioinformatics applications utilizing HDF can benefit from its long history and an infrastructure of existing tools.

HDF technologies address the problems of how to manage, preserve and extract maximum value from scientific data in the face of enormous growth in size and complexity. There are two versions of HDF: HDF4 and HDF5. First released in 1998, HDF5 is the successor to HDF4 and is the focus of our interest. HDF5 supports all types of data stored digitally, regardless of origin or size. HDF5 technologies are relevant when the data challenges being faced push the limits of what can be addressed by traditional database systems, XML documents, or in-house data formats. By leveraging the HDF5 products and the expertise of The HDF Group, organizations can realize substantial cost savings while solving challenges that seemed intractable using other data management technologies.

Many HDF5 adopters have very large datasets, very fast access requirements, or very complex datasets. Others turn to HDF5 because it allows them to easily share data across a wide variety of computational platforms using applications written in different programming languages. Some use HDF5 to take advantage of the many open-source and commercial tools that understand HDF. Similar to XML documents, HDF5 files are self-describing and allow users to specify complex data relationships and dependencies. In contrast to XML documents, HDF5 files can contain binary data (in many representations) and allow direct access to parts of the file without first parsing the entire contents.
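
The following sketch illustrates the two properties named above, self-description and direct access, using only the Python standard library. It is not the HDF5 file format or API; the layout (a length-prefixed JSON header followed by raw bytes) and the function names are assumptions made for illustration.

```python
import io
import json
import struct

def write_dataset(stream, values, description):
    """Write a JSON header that describes the payload, then the payload itself."""
    header = json.dumps({
        "description": description,
        "dtype": "uint8",
        "count": len(values),
    }).encode("utf-8")
    stream.write(struct.pack("<I", len(header)))  # 4-byte header length
    stream.write(header)
    stream.write(bytes(values))                   # one byte per value

def read_header(stream):
    """Return the header and the byte offset at which the payload starts."""
    stream.seek(0)
    (header_len,) = struct.unpack("<I", stream.read(4))
    header = json.loads(stream.read(header_len).decode("utf-8"))
    return header, 4 + header_len

def read_slice(stream, start, stop):
    """Return values[start:stop] by seeking; the rest of the payload is not read."""
    header, data_offset = read_header(stream)
    stop = min(stop, header["count"])
    stream.seek(data_offset + start)
    return list(stream.read(max(0, stop - start)))
```

A reader that knows nothing about the file beyond this convention can discover what it contains from the header, then fetch a handful of values from the middle without touching the remainder. An XML document holding the same values would have to be parsed up to the point of interest.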

HDF5 allows hierarchical data objects to be expressed in a very natural manner, in contrast, for example, to relational databases. Whereas relational databases support tables, HDF5 provides a wide variety of ways to aggregate data, including as tables. HDF5 datasets are n-dimensional arrays, where the elements in a dataset array may themselves be complex objects. Relational databases offer excellent support for queries based on field matching, but are not well-suited for sequentially processing all records in the database or for subsetting the data based on coordinate-style lookup.

An important aspect of the HDF5 technologies is that they form a complete system for implementing scalable domain-specific data models. HDF5 does not limit the size of files or the size or number of objects in a file and is portable across virtually all computing platforms. The open source HDF5 I/O library includes C, C++, Java, and Fortran90 programming interfaces. Its general data model supports complex data relationships and dependencies through its grouping and linking mechanisms and can accommodate many common types of metadata and arbitrary user-defined metadata. A rich set of predefined datatypes is supported, and applications can create an unlimited variety of complex user-defined datatypes.

Many HDF5 features address the kinds of problems that later become costly re-implementation issues in "home grown" systems. For instance, metadata in HDF5 files fully describe how data elements are stored, including information such as byte order (endian), size, and floating point representation, ensuring portability among platforms.
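
To see why recording byte order matters, consider the minimal sketch below. It is plain Python using the standard `struct` module, not HDF5 itself, and the one-byte tag convention is an assumption made for this example. The writer stores a marker saying how the integers are laid out, and the reader uses that marker instead of guessing from the machine it happens to run on.

```python
import struct

def encode(values, byte_order):
    """Pack unsigned 32-bit integers; byte_order is '<' (little) or '>' (big)."""
    if byte_order not in ("<", ">"):
        raise ValueError("byte_order must be '<' or '>'")
    tag = byte_order.encode("ascii")        # 1 byte of metadata: payload layout
    count = struct.pack("<I", len(values))  # the count is always little-endian here
    payload = struct.pack(f"{byte_order}{len(values)}I", *values)
    return tag + count + payload

def decode(blob):
    """Recover the integers using the byte order recorded by the writer."""
    byte_order = blob[0:1].decode("ascii")
    (count,) = struct.unpack_from("<I", blob, 1)
    return list(struct.unpack_from(f"{byte_order}{count}I", blob, 5))
```

The same values written in big-endian and little-endian order produce different bytes on disk, yet both decode to identical numbers because the layout travels with the data. A home-grown format that omits this detail tends to work until the first file is moved to a different architecture.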

HDF5 offers I/O transformation, storage and subsetting options that make it well-suited for the coming challenges in working with Next Gen data. A "virtual file driver" makes it possible to perform I/O in a variety of ways. Standard (POSIX), parallel, and other I/O file drivers are provided with the HDF5 library. The HDF5 parallel I/O driver, for instance, can reduce access times on parallel systems by reading and writing multiple data streams simultaneously. Application developers can write additional file drivers to implement customized data storage or transport capabilities.

HDF5's flexible storage options make it possible to match the storage layout of data to the needs of an application and the characteristics of the data.  A dataset can be compressed, saving storage space and transfer time.  It can be stored as a series of chunks, improving subsetting access time, and making it possible to extend a dataset in any dimension without having to rewrite it.  HDF5 also provides for external storage of raw data, allowing raw data to be shared among HDF5 files and/or applications, and often saving disk space.
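
The benefit of combining chunking with compression can be sketched in a few lines. The class below is a simplified stand-in written with Python's standard `zlib` module; it is not how the HDF5 library is implemented, and the class name, chunk size, and counter are assumptions for illustration. Each chunk is compressed independently, so reading a small range decompresses only the chunks that overlap it, and appending rewrites only the last, partially filled chunk.

```python
import zlib

class ChunkedBytes:
    """A 1-D byte array stored as independently compressed chunks."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size
        self.chunks = []          # compressed chunks, in order
        self.length = 0           # number of logical elements stored
        self.decompressions = 0   # how many chunks reads have had to unpack

    def append(self, data):
        """Extend the array; earlier full chunks are left untouched."""
        new_length = self.length + len(data)
        if self.length % self.chunk_size:
            # The last chunk is only partly filled: unpack it and refill it.
            data = zlib.decompress(self.chunks.pop()) + data
        for i in range(0, len(data), self.chunk_size):
            self.chunks.append(zlib.compress(data[i:i + self.chunk_size]))
        self.length = new_length

    def read(self, start, stop):
        """Return elements [start, stop), unpacking only the overlapping chunks."""
        start = max(start, 0)
        stop = min(stop, self.length)
        if start >= stop:
            return b""
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        out = bytearray()
        for index in range(first, last + 1):
            out += zlib.decompress(self.chunks[index])
            self.decompressions += 1
        offset = start - first * self.chunk_size
        return bytes(out[offset:offset + (stop - start)])
```

With a chunk size of 1,000 and 10,000 stored values, a 20-element read that straddles one chunk boundary unpacks two chunks and leaves the other eight untouched. The chunk size is the tuning knob: larger chunks usually compress better, while smaller chunks make small subset reads cheaper.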

The HDF5 format and I/O library support complex subsetting and data transformation, which can be very useful for working with aligned data sets. Here HDF5 enables datatype and spatial transformation during I/O operations, and HDF5 data I/O functions can operate on selected subsets of the data, reducing transferred data volume and improving access speed.
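
As an illustration of coordinate-style subsetting, the sketch below selects a regular sub-grid from a two-dimensional array using start, stride, and count values, the same vocabulary the HDF5 library uses for its hyperslab selections (`H5Sselect_hyperslab` in the C interface). The code itself is plain Python over a flat list, not a call into HDF5. The function name and the optional `convert` argument, which stands in for datatype conversion during I/O, are assumptions for this example.

```python
def select_hyperslab(flat, shape, start, stride, count, convert=None):
    """Pick a regular sub-grid from a row-major 2-D array stored as a flat sequence.

    start, stride and count are (row, column) pairs: begin at `start`, take
    `count` elements along each dimension, stepping by `stride`.
    """
    n_rows, n_cols = shape
    rows = [start[0] + i * stride[0] for i in range(count[0])]
    cols = [start[1] + j * stride[1] for j in range(count[1])]
    if (rows and rows[-1] >= n_rows) or (cols and cols[-1] >= n_cols):
        raise IndexError("selection extends past the dataset")
    if convert is None:
        convert = lambda value: value  # no datatype change requested
    # Only the selected elements are touched; everything else is skipped.
    return [[convert(flat[r * n_cols + c]) for c in cols] for r in rows]
```

For aligned data one might treat rows as reads and columns as alignment positions. Asking for every second read at every third position then returns a small grid without materializing the full matrix, and the caller can receive the values in a different numeric type than the one stored.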

BioHDF Project


Many HDF5 applications implement domain-specific data models using HDF5 as the underlying format, and the HDF5 library as the platform for managing data.  By doing this, the features of HDF5 can be exploited, while preserving views of that data that are appropriate to the particular application domain. 

For instance, NASA's Earth Observing System (EOS), which is the primary source of data used to study global climate change, manages earth science data in HDF with a data model and implementation called "HDF-EOS."  This is done by specifying how HDF files should be organized to store and access earth science data, then implementing an HDF-EOS I/O library and tools for efficient and easy access, storage, query, and analysis of HDF-EOS data.  EOS collects about three terabytes of new data per day in this way, and now has an archive of several petabytes.  By having a common platform to manage and access earth science data, an estimated 1.6 million users can readily access and share EOS data and tools.

The goals of BioHDF are similar to those of HDF-EOS, but in a completely different domain. The overall goal of the BioHDF project is to create highly scalable and efficient bioinformatics software technologies and applications that will increase the utility of future generations of DNA sequencing platforms, and to enable a smooth transition of their use from the genome centers to the basic research laboratory and clinic. To be scalable and efficient, applications need a bioinformatics infrastructure that consists of data models, application programming interfaces (APIs), software tools, and viewers to support the very large and complex data sets being created by multiple instruments. 

We will meet these needs by developing extensible domain-specific data technologies we call "BioHDF." BioHDF will extend HDF5 with features (indexes, additional compression algorithms, graphs) to support the extreme data storage and computation requirements of Next Gen sequencing. The project is broken into logical phases that follow the flow of data from acquisition to analysis to applications. Funding for this work is expected to begin in the fall.

•    Phase 1: basic data.  In the first phase, we will develop a data model to directly support the data types being created by DNA sequencing platforms and implement it in the form of a BioHDF API, library and research-level tools that support data access and viewing. The implementation will address data volume, interchange, and performance requirements. As part of this phase, we will work with the SRF group to develop interchange tools and access methods to interoperate with existing and emerging standards.

•    Phase 2: analysis.  The second phase will focus on the analysis aspects of DNA sequencing. We will develop a data model to support the computational phases of data analysis and implement it in BioHDF. The implementation will take advantage of HDF5’s features to manage data complexity, reduce redundancy, and deliver high performance. An API will be developed, and we will work with the community to adapt sequence analysis algorithms and viewers for assembly, alignment, and variation detection.

•    Phase 3: enterprise applications.  The final phase will explore how to incorporate BioHDF into enterprise applications for clinical research and diagnostics. We will examine ways to partition data between RDBMS-based laboratory information management systems (LIMS) and BioHDF. To support clinical requirements, we will add methods to secure and version data in BioHDF that minimally impact performance.

Over the coming months I will post new articles discussing additional details of the project, our progress, and work with collaborators.

References


1.    Meyer, M., U. Stenzel, S. Myles, K. Prufer, and M. Hofreiter, "Targeted high-throughput sequencing of tagged nucleic acid samples." Nucleic Acids Res, 2007. 35(15): p. e97. doi:10.1093/nar/gkm566

2.    Korbel, J.O., A.E. Urban, J.P. Affourtit, et al., "Paired-end mapping reveals extensive structural variation in the human genome." Science, 2007. 318(5849): p. 420-6. doi:10.1126/science.1149504

3.    Wicker, T., E. Schlagenhauf, A. Graner, T.J. Close, B. Keller, and N. Stein, "454 sequencing put to the test using the complex genome of barley." BMC Genomics, 2006. 7: p. 275. doi:10.1186/1471-2164-7-275

4.    Taylor, K.H., R.S. Kramer, J.W. Davis, J. Guo, D.J. Duff, D. Xu, C.W. Caldwell, and H. Shi, "Ultradeep bisulfite sequencing analysis of DNA methylation patterns in multiple gene promoters by 454 sequencing." Cancer Res, 2007. 67(18): p. 8511-8. doi:10.1158/0008-5472.CAN-07-1016

5.    Highlander, S.K., K.G. Hulten, X. Qin, et al., "Subtle genetic changes enhance virulence of methicillin resistant and sensitive Staphylococcus aureus." BMC Microbiol, 2007. 7(1): p. 99. doi:10.1186/1471-2180-7-99

6.    Wang, G.P., A. Ciuffi, J. Leipzig, C.C. Berry, and F.D. Bushman, "HIV integration site selection: analysis by massively parallel pyrosequencing reveals association with epigenetic modifications." Genome Res, 2007. 17(8): p. 1186-94. doi:10.1101/gr.6286907

7.    Hoffmann, C., N. Minkah, J. Leipzig, G. Wang, M.Q. Arens, P. Tebas, and F.D. Bushman, "DNA bar coding and pyrosequencing to identify rare HIV drug resistance mutations." Nucleic Acids Res, 2007. 35(13): p. e91. doi:10.1093/nar/gkm435

8.    Rieder, M.J., S.L. Taylor, V.O. Tobe, and D.A. Nickerson, "Automating the identification of DNA variations using quality-based fluorescence re-sequencing: analysis of the human mitochondrial genome." Nucleic Acids Res, 1998. 26(4): p. 967-73. PMID: 9461455

9.    Gordon, D., C. Abajian, and P. Green, "Consed: a graphical tool for sequence finishing." Genome Res, 1998. 8(3): p. 195-202. PMID: 9521923

10.    Ewing, B., L. Hillier, M.C. Wendl, and P. Green, "Base-calling of automated sequencer traces using phred. I. Accuracy assessment." Genome Res, 1998. 8(3): p. 175-85. PMID: 9521921

11.    Ewing, B. and P. Green, "Base-calling of automated sequencer traces using phred. II. Error probabilities." Genome Res, 1998. 8(3): p. 186-94. PMID: 9521922

12.    Nickerson, D.A., V.O. Tobe, and S.L. Taylor, "PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing." Nucleic Acids Res, 1997. 25(14): p. 2745-51. PMID: 9207020
