Periscope: Declarative and Efficient Querying for Biological Databases

An interesting parallel can be drawn between the data management methods currently used in life sciences and those used by business applications, such as banking applications, about three decades ago. Prior to the advent of the relational data model, querying business data required writing and executing customized programs, which encoded the detailed steps for executing the query. Reusing query programs and algorithms involved rewriting the application program and logic, which was time-consuming and expensive. Furthermore, writing complex queries, such as querying over multiple data sets or posing complex analytical queries, was a daunting task. One of the critical contributions of the relational data model was the introduction of a declarative querying paradigm for business data management, instead of the previously used procedural paradigm. In a declarative querying paradigm, the user expresses the query in a high-level language, like SQL, and the database management system (DBMS) determines the best strategy for evaluating the query. With this paradigm, the user focuses only on what query to pose, rather than on both what query to pose and how to evaluate it. The declarative paradigm naturally results in programs that are compact and easier to understand. This paradigm also insulates the user queries from changes in the physical layout of the data on disk. For example, no changes to the user queries are required if a new index is created on a data set; instead, the DBMS takes charge of computing a query plan for each query based on the current physical schema of the data - new indices may simply result in faster execution of old queries. A critical additional benefit of the declarative paradigm is that the DBMS can employ sophisticated query optimization and evaluation methods that allow the system to scale to large data sets.
This declarative querying paradigm has been a huge success for relational data management, and commercial relational DBMSs today easily manage large volumes of data and allow very complex querying on these databases.
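
The contrast between the two paradigms can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a stand-in relational DBMS; it is not the Periscope system. The protein table, its columns, the index name, and the sample rows are all hypothetical and exist only for illustration. The sketch runs the same question once as a hand-written scan and once as an SQL query, then adds an index and reruns the unchanged SQL text.

```python
import sqlite3

# Hypothetical toy data: (protein id, organism, sequence length).
ROWS = [
    ("P1", "human", 120),
    ("P2", "mouse", 340),
    ("P3", "human", 560),
    ("P4", "yeast", 410),
]


def procedural_query(rows, organism, min_length):
    """Procedural style: the program spells out how to scan and filter."""
    result = []
    for pid, org, length in rows:
        if org == organism and length >= min_length:
            result.append(pid)
    return sorted(result)


# Declarative style: the query states what is wanted, not how to find it.
QUERY = "SELECT id FROM protein WHERE organism = ? AND length >= ? ORDER BY id"


def declarative_query(conn, organism, min_length):
    """Run the SQL query and return the matching ids."""
    return [row[0] for row in conn.execute(QUERY, (organism, min_length))]


def query_plan(conn, organism, min_length):
    """Return the plan text the DBMS chose for the same query."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN " + QUERY, (organism, min_length)
    ).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT, organism TEXT, length INTEGER)")
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", ROWS)

scan_result = procedural_query(ROWS, "human", 200)
before = declarative_query(conn, "human", 200)
plan_before = query_plan(conn, "human", 200)

# Change the physical schema only; the query text is left untouched.
conn.execute("CREATE INDEX idx_protein_organism ON protein (organism)")

after = declarative_query(conn, "human", 200)
plan_after = query_plan(conn, "human", 200)

print("procedural result:  ", scan_result)
print("declarative results:", before, after)
print("plan before index:  ", plan_before)
print("plan after index:   ", plan_after)
```

The SQL string is identical before and after the index is created, and the answers match; only the plan reported by the DBMS is free to change. The procedural function, by contrast, would have to be rewritten by hand to take advantage of any new access path.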

Life sciences applications today largely employ the procedural querying paradigm that was used for enterprise data management three decades ago. Currently, data analysis in life sciences is often carried out by custom Perl, Python, or Java programs. Complex data analyses using this procedural paradigm result in large programs that are difficult to reuse and share across users. Furthermore, many of the data sets that are used in life sciences are growing at an astonishing rate, and the queries that are posed against these data sets are also increasing in complexity. The key motivation behind the Periscope project is that DBMSs employing a declarative paradigm can play a significant role in managing these data sets and efficiently answering these complex queries. As part of this project we are currently investigating declarative methods and efficient algorithms for querying biological sequences, protein secondary and tertiary structures, and protein interaction networks.

Publications

TALE: A Tool for Approximate Large Graph Matching, Yuanyuan Tian and Jignesh M. Patel, ICDE 2008. An extended version is also available.

SAGA: A Subgraph Matching Tool for Biological Graphs, Y. Tian, R. C. McEachin, C. Santos, D. J. States and J. M. Patel, Bioinformatics, 2007.

A Framework for Protein Structure Classification and Identification of Novel Protein Structures, Y. J. Kim and J. M. Patel, BMC Bioinformatics 2006, 7:456.

Declarative Querying for Biological Sequence Databases, S. Tata, J. M. Patel, J. S. Friedman, and A. Swaroop, ICDE 2006.

miBLAST: Scalable Evaluation of a Batch of Nucleotide Sequence Queries with BLAST, Y. J. Kim, A. Boyd, B. D. Athey, and J. M. Patel, Nucleic Acids Research, 2005 33: 4335-4344.

Practical Suffix Tree Construction, S. Tata, R. A. Hankins, and J. M. Patel, VLDB 2004. A journal version is also available.

OASIS: An Online and Accurate Technique for Local-alignment Searches on Biological Sequences, C. Meek, J. M. Patel, and S. Kasetty, VLDB 2003.

The Role of Declarative Querying in Bioinformatics, J. M. Patel, OMICS, 7(1), 2003.

PiQA: An Algebra for Querying Protein Data Sets, S. Tata and J. M. Patel, SSDBM 2003.

Searching on the Secondary Structure of Protein Sequences, L. Hammel and J. M. Patel, VLDB 2002.

Software

People

Funding

This project is supported in part by funding from NSF, NIH, MEDC, and Microsoft. Any opinions, findings, and conclusions or recommendations expressed anywhere on this web page or in publications related to this project are those of the author(s) and do not necessarily reflect the views of the supporting corporations and agencies.
