Data Mining for Bioinformatics

Overview

Covering theory, algorithms, and methodologies, as well as data mining technologies, Data Mining for Bioinformatics provides a comprehensive discussion of data-intensive computations used in data mining with applications in bioinformatics. It supplies a broad, yet in-depth, overview of the application domains of data mining for bioinformatics to help readers from both biology and computer science backgrounds gain an enhanced understanding of this cross-disciplinary field.

The book offers authoritative coverage of data mining techniques, technologies, and frameworks used for storing, analyzing, and extracting knowledge from large databases in the bioinformatics domains, including genomics and proteomics. It begins by describing the evolution of bioinformatics and highlighting the challenges that can be addressed using data mining techniques. The text then introduces the various data mining techniques that can be employed in biological databases and is organized into four sections:

Supplies a complete overview of the evolution of the field and its intersection with computational learning
Describes the role of data mining in analyzing large biological databases, explaining the breadth of the various feature selection and feature extraction techniques that data mining has to offer
Focuses on concepts of unsupervised learning using clustering techniques and its application to large biological data
Covers supervised learning using classification techniques most commonly used in bioinformatics—addressing the need for validation and benchmarking of inferences derived using either clustering or classification

The book describes the various biological databases prominently referred to in bioinformatics and includes a detailed list of the applications of advanced clustering algorithms used in bioinformatics. Highlighting the challenges encountered during the application of classification on biological databases, it considers systems of both single and ensemble classifiers and shares effort-saving tips for model selection and performance estimation strategies.

Product Details

ISBN-13: 9780849328015
Publisher: CRC Press
Publication date: 11/9/2012
Edition number: 1
Pages: 348
Product dimensions: 6.40 (w) x 9.20 (h) x 1.00 (d)

Meet the Author

Sumeet Dua is an Upchurch endowed professor of computer science and interim director of computer science, electrical engineering, and electrical engineering technology in the College of Engineering and Science at Louisiana Tech University. He obtained his PhD in computer science from Louisiana State University in 2002. He has coauthored/edited 3 books, has published over 50 research papers in leading journals and conferences, and has advised over 22 graduate theses and dissertations in the areas of data mining, knowledge discovery, and computational learning in high-dimensional datasets. NIH, NSF, AFRL, AFOSR, NASA, and LA-BOR have supported his research. He frequently serves as a panelist for the NSF and NIH (over 17 panels) and has presented over 25 keynotes, invited talks, and workshops at international conferences and educational institutions. He has also served as the overall program chair for three international conferences and as a chair for multiple conference tracks in the areas of data mining applications and information intelligence. He is a senior member of the IEEE and the ACM. His research interests include information discovery in heterogeneous and distributed datasets, semisupervised learning, content-based feature extraction and modeling, and pattern tracking.

Pradeep Chowriappa is a research assistant professor in the College of Engineering and Science at Louisiana Tech University. His research focuses on the application of data mining algorithms and frameworks to biological and clinical data. Before obtaining his PhD in computer analysis and modeling from Louisiana Tech University in 2008, he pursued a yearlong internship at the Indian Space Research Organization (ISRO), Bangalore, India. He received his master’s in computer applications from the University of Madras, Chennai, India, in 2003 and his bachelor’s in science and engineering from Loyola Academy, Secunderabad, India, in 2000. His research interests include design and analysis of algorithms for knowledge discovery and modeling in high-dimensional data domains in computational biology, distributed data mining, and domain integration.

Table of Contents

Section I

Introduction to Bioinformatics
Introduction
Transcription and Translation
The Central Dogma of Molecular Biology
The Human Genome Project
Beyond the Human Genome Project
Sequencing Technology
Dideoxy Sequencing
Cyclic Array Sequencing
Sequencing by Hybridization
Microelectrophoresis
Mass Spectrometry
Nanopore Sequencing
Next-Generation Sequencing
Challenges of Handling NGS Data
Sequence Variation Studies
Kinds of Genomic Variations
SNP Characterization
Functional Genomics
Splicing and Alternative Splicing
Microarray-Based Functional Genomics
Comparative Genomics
Functional Annotation
Function Prediction Aspects
Conclusion
References

Biological Databases and Integration
Introduction: Scientific Work Flows and Knowledge Discovery
Biological Data Storage and Analysis
Challenges of Biological Data
Classification of Bioscience Databases
Primary versus Secondary Databases
Deep versus Broad Databases
Point Solution versus General Solution Databases
Gene Expression Omnibus (GEO) Database
The Protein Data Bank (PDB)
The Curse of Dimensionality
Data Cleaning
Problems of Data Cleaning
Challenges of Handling Evolving Databases
Problems Associated with Single-Source Techniques
Problems Associated with Multisource Integration
Data Argumentation: Cleaning at the Schema Level
Knowledge-Based Framework: Cleaning at the Instance Level
Data Integration
Ensembl
Sequence Retrieval System (SRS)
IBM’s DiscoveryLink
Wrappers: Customizable Database Software
Data Warehousing: Data Management with Query Optimization
Data Integration in the PDB
Conclusion
References

Knowledge Discovery in Databases
Introduction
Analysis of Data Using Large Databases
Distance Metrics
Data Cleaning and Data Preprocessing
Challenges in Data Cleaning
Models of Data Cleaning
Proximity-Based Techniques
Parametric Methods
Nonparametric Methods
Semiparametric Methods
Neural Networks
Machine Learning
Hybrid Systems
Data Integration
Data Integration and Data Linkage
Schema Integration Issues
Field Matching Techniques
Character-Based Similarity Metrics
Token-Based Similarity Metrics
Data Linkage/Matching Techniques
Data Warehousing
Online Analytical Processing
Differences between OLAP and OLTP
OLAP Tasks
Life Cycle of a Data Warehouse
Conclusion
References

Section II

Feature Selection and Extraction Strategies in Data Mining
Introduction
Overfitting
Data Transformation
Data Smoothing by Discretization
Discretization of Continuous Attributes
Normalization and Standardization
Min-Max Normalization
z-Score Standardization
Normalization by Decimal Scaling
Features and Relevance
Strongly Relevant Features
Weakly Relevant to the Dataset/Distribution
Pearson Correlation Coefficient
Information Theoretic Ranking Criteria
Overview of Feature Selection
Filter Approaches
Wrapper Approaches
Filter Approaches for Feature Selection
FOCUS Algorithm
Relief Method: Weight-Based Approach
Feature Subset Selection Using Forward Selection
Gram-Schmidt Forward Feature Selection
Other Nested Subset Selection Methods
Feature Construction and Extraction
Matrix Factorization
LU Decomposition
QR Factorization to Extract Orthogonal Features
Eigenvalues and Eigenvectors of a Matrix
Other Properties of a Matrix
A Square Matrix and Matrix Diagonalization
Symmetric Real Matrix: Spectral Theorem
Singular Value Decomposition (SVD)
Principal Component Analysis (PCA)
Jordan Decomposition of a Matrix
Principal Components
Partial Least-Squares-Based Dimension Reduction (PLS)
Factor Analysis (FA)
Independent Component Analysis (ICA)
Multidimensional Scaling (MDS)
Conclusion
References

Feature Interpretation for Biological Learning
Introduction
Normalization Techniques for Gene Expression Analysis
Normalization and Standardization Techniques
Expression Ratios
Intensity-Based Normalization
Total Intensity Normalization
Intensity-Based Filtering of Array Elements
Identification of Differentially Expressed Genes
Selection Bias of Gene Expression Data
Data Preprocessing of Mass Spectrometry Data
Data Transformation Techniques
Baseline Subtraction (Smoothing)
Normalization
Binning
Peak Detection
Peak Alignment
Application of Dimensionality Reduction Techniques for MS Data Analysis
Feature Selection Techniques
Univariate Methods
Multivariate Methods
Data Preprocessing for Genomic Sequence Data
Feature Selection for Sequence Analysis
Ontologies in Bioinformatics
The Role of Ontologies in Bioinformatics
Description Logics
Gene Ontology (GO)
Open Biomedical Ontologies (OBO)
Conclusion
References

Section III

Clustering Techniques in Bioinformatics
Introduction
Clustering in Bioinformatics
Clustering Techniques
Distance-Based Clustering and Measures
Mahalanobis Distance
Minkowski Distance
Pearson Correlation
Binary Features
Nominal Features
Mixed Variables
Distance Measure Properties
k-Means Algorithm
k-Modes Algorithm
Genetic Distance Measure (GDM)
Applications of Distance-Based Clustering in Bioinformatics
New Distance Metric in Gene Expressions for Coexpressed Genes
Gene Expression Clustering Using Mutual Information Distance Measure
Gene Expression Data Clustering Using a Local Shape-Based Clustering
Exact Similarity Computation
Approximate Similarity Computation
Implementation of k-Means in WEKA
Hierarchical Clustering
Agglomerative Hierarchical Clustering
Cluster Splitting and Merging
Calculate Distance between Clusters
Applications of Hierarchical Clustering Techniques in Bioinformatics
Hierarchical Clustering Based on Partially Overlapping and Irregular Data
Cluster Stability Estimation for Microarray Data
Comparing Gene Expression Sequences Using Pairwise Average Linking
Implementation of Hierarchical Clustering
Self-Organizing Maps Clustering
SOM Algorithm
Application of SOM in Bioinformatics
Identifying Distinct Gene Expression Patterns Using SOM
SOTA: Combining SOM and Hierarchical Clustering for Representation of Genes
Fuzzy Clustering
Fuzzy c-Means (FCM)
Application of Fuzzy Clustering in Bioinformatics
Clustering Genes Using Fuzzy J-Means and VNS Methods
Fuzzy k-Means Clustering on Gene Expression
Comparison of Fuzzy Clustering Algorithms
Implementation of Expectation Maximization Algorithm
Conclusion
References

Advanced Clustering Techniques
Graph-Based Clustering
Graph-Based Cluster Properties
Cut in a Graph
Intracluster and Intercluster Density
Measures for Identifying Clusters
Identifying Clusters by Computing Values for the Vertices or Vertex Similarity
Distance and Similarity Measure
Adjacency-Based Measures
Connectivity Measures
Computing the Fitness Measure
Density Measure
Cut-Based Measures
Determining a Split in the Graph
Cuts
Spectral Methods
Edge-Betweenness
Graph-Based Algorithms
Chameleon Algorithm
CLICK Algorithm
Application of Graph-Based Clustering in Bioinformatics
Analysis of Gene Expression Data Using Shortest Path (SP)
Construction of Genetic Linkage Maps Using Minimum Spanning Tree of a Graph
Finding Isolated Groups in a Random Graph Process
Implementation in Cytoscape
Seeding Method
Kernel-Based Clustering
Kernel Functions
Gaussian Function
Application of Kernel Clustering in Bioinformatics
Kernel Clustering
Kernel-Based Support Vector Clustering
Analyzing Gene Expression Data Using SOM and Kernel-Based Clustering
Model-Based Clustering for Gene Expression Data
Gaussian Mixtures
Diagonal Model
Model Selection
Relevant Number of Genes
A Resampling-Based Approach for Identifying Stable and Tight Patterns
Overcoming the Local Minimum Problem in k-Means Clustering
Tight Clustering
Tight Clustering of Gene Expression Time Courses
Higher-Order Mining
Clustering for Association Rule Discovery
Clustering of Association Rules
Clustering Clusters
Conclusion
References

Section IV

Classification Techniques in Bioinformatics
Introduction
Bias-Variance Trade-Off in Supervised Learning
Linear and Nonlinear Classifiers
Model Complexity and Size of Training Data
Dimensionality of Input Space
Supervised Learning in Bioinformatics
Support Vector Machines (SVMs)
Hyperplanes
Large Margin of Separation
Soft Margin of Separation
Kernel Functions
Applications of SVM in Bioinformatics
Gene Expression Analysis
Remote Protein Homology Detection
Bayesian Approaches
Bayes’ Theorem
Naive Bayes Classification
Handling of Prior Probabilities
Handling of Posterior Probability
Bayesian Networks
Methodology
Capturing Data Distributions Using Bayesian Networks
Equivalence Classes of Bayesian Networks
Learning Bayesian Networks
Bayesian Scoring Metric
Application of Bayesian Classifiers in Bioinformatics
Binary Classification
Multiclass Classification
Computational Challenges for Gene Expression Analysis
Decision Trees
Tree Pruning
Ensemble Approaches
Bagging
Unweighted Voting Methods
Confidence Voting Methods
Ranked Voting Methods
Boosting
Seeking Prospective Classifiers to Be Part of the Ensemble
Choosing an Optimal Set of Classifiers
Assigning Weight to the Chosen Classifier
Random Forest
Application of Ensemble Approaches in Bioinformatics
Computational Challenges of Supervised Learning
Conclusion
References

Validation and Benchmarking
Introduction: Performance Evaluation Techniques
Classifier Validation
Model Selection
Challenges of Model Selection
Performance Estimation Strategies
Holdout
Three-Way Split
k-Fold Cross-Validation
Random Subsampling
Performance Measures
Sensitivity and Specificity
Precision, Recall, and f-Measure
ROC Curve
Cluster Validation Techniques
The Need for Cluster Validation
External Measures
Internal Measures
Performance Evaluation Using Validity Indices
Silhouette Index (SI)
Davies-Bouldin and Dunn’s Index
Calinski Harabasz (CH) Index
Rand Index
Conclusion
References
