With the ever-increasing number of biomedical articles, keeping up with new information has become a major challenge for biomedical researchers. Much of the information biologists need resides in semi-structured biomedical text, making it difficult for researchers to realize the full benefits of published findings. Information retrieval (IR) and information extraction (IE) have been the central technologies for seeking information in large corpora of unstructured text, and advances in these technologies can have a direct impact on research methodology in areas such as biomedicine. While the fields of IR and IE have matured over the past decade, current technologies have yet to fulfill the promise of supporting biomedical research. In particular, traditional IE technologies adopt a 'black-box' approach, in which biologists have no means of expressing their extraction needs. In addition, typical automated IE technologies rely on manually curated data to learn syntactic patterns for extraction. Curating such data is known to be labor-intensive, which limits the applicability of IE in the biomedical domain. While linguistic structures have been used successfully for IE, they have yet to be adopted in current IR technologies. Syntactic parsing over a large corpus of text is known to be computationally expensive, which is not ideal for IR, where users expect timely responses. However, the lack of linguistic structures leads to suboptimal performance for certain queries in the biomedical domain. In this thesis, these issues in IR and IE are tackled by proposing a novel framework called IR+PTQL. The core idea of the framework is to model and store the syntactic and semantic information of the text corpora in a specialized database called the parse tree database. Extraction is then expressed in the form of database queries.
A core component of the framework is automated query generation, which produces extraction patterns without training data. The evaluation results demonstrate that the query generation component contributes positively to the performance of both IR and IE. The applicability of the framework is illustrated with various applications in the genomics domain.
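
The idea of storing parse information in a database and phrasing extraction as a query can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `node` table schema, the `SUBJ`/`OBJ`/`VERB` labels, the `GENE` semantic type, the placeholder sentences, and the use of SQLite and plain SQL. It is not the parse tree database or the PTQL query language developed in the thesis, only a toy analogue of the same principle.

```python
import sqlite3

# Hypothetical toy data: each sentence is a list of parse-tree nodes given as
# (node_id, parent_id, label, word, sem_type). Gene names are placeholders.
SENTENCES = [
    [  # "GeneA activates GeneB"
        (0, None, "VERB", "activates", None),
        (1, 0, "SUBJ", "GeneA", "GENE"),
        (2, 0, "OBJ", "GeneB", "GENE"),
    ],
    [  # "GeneC binds membranes"
        (0, None, "VERB", "binds", None),
        (1, 0, "SUBJ", "GeneC", "GENE"),
        (2, 0, "OBJ", "membranes", None),
    ],
]


def build_parse_tree_db(sentences):
    """Store every parse-tree node as one row (assumed schema, not the thesis's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE node (
               sent_id INTEGER, node_id INTEGER, parent_id INTEGER,
               label TEXT, word TEXT, sem_type TEXT)"""
    )
    for sent_id, nodes in enumerate(sentences):
        conn.executemany(
            "INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)",
            [(sent_id, *n) for n in nodes],
        )
    return conn


def extract_gene_interactions(conn):
    """Extraction as a query: a verb whose subject and object are both genes."""
    rows = conn.execute(
        """SELECT s.word, v.word, o.word
           FROM node v
           JOIN node s ON s.sent_id = v.sent_id AND s.parent_id = v.node_id
                      AND s.label = 'SUBJ'
           JOIN node o ON o.sent_id = v.sent_id AND o.parent_id = v.node_id
                      AND o.label = 'OBJ'
           WHERE v.label = 'VERB'
             AND s.sem_type = 'GENE' AND o.sem_type = 'GENE'
           ORDER BY v.sent_id"""
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = build_parse_tree_db(SENTENCES)
    print(extract_gene_interactions(db))
```

In this sketch the corpus is parsed and stored once, and each new extraction need is a new query rather than a new extraction program, which is the property the framework relies on to let users state what they want extracted.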
Overview