Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data / Edition 1
by Bing Liu
Overview
Web mining aims to discover useful information and knowledge from Web hyperlink structures, page contents, and usage data. Although Web mining uses many conventional data mining techniques, it is not purely an application of traditional data mining due to the semistructured and unstructured nature of the Web data and its heterogeneity. The field has also developed many of its own algorithms and techniques.
Liu has written a comprehensive text on Web data mining. Key topics of structure mining, content mining, and usage mining are covered both in breadth and in depth. His book brings together all the essential concepts and algorithms from related areas such as data mining, machine learning, and text processing to form an authoritative and coherent text.
The book offers a rich blend of theory and practice, addressing seminal research ideas, as well as examining the technology from a practical point of view. It is suitable for students, researchers and practitioners interested in Web mining both as a learning text and as a reference book. Lecturers can readily use it for classes on data mining, Web mining, and Web search. Additional teaching materials such as lecture slides, datasets, and implemented algorithms are available online.
Product Details
- ISBN-13: 9783642072376
- Publisher: Springer-Verlag New York, LLC
- Publication date: 11/23/2010
- Series: Data-Centric Systems and Applications Series
- Edition description: Softcover reprint of hardcover 1st ed. 2007
- Pages: 552
Table of Contents
Introduction
What is the World Wide Web?
A Brief History of the Web and the Internet
Web Data Mining
What is Data Mining?
What is Web Mining?
Summary of Chapters
How to Read this Book
Bibliographic Notes
Data Mining Foundations
Association Rules and Sequential Patterns
Basic Concepts of Association Rules
Apriori Algorithm
Frequent Itemset Generation
Association Rule Generation
Data Formats for Association Rule Mining
Mining with Multiple Minimum Supports
Extended Model
Mining Algorithm
Rule Generation
Mining Class Association Rules
Problem Definition
Mining Algorithm
Mining with Multiple Minimum Supports
Basic Concepts of Sequential Patterns
Mining Sequential Patterns Based on GSP
GSP Algorithm
Mining with Multiple Minimum Supports
Mining Sequential Patterns Based on PrefixSpan
PrefixSpan Algorithm
Mining with Multiple Minimum Supports
Generating Rules from Sequential Patterns
Sequential Rules
Label Sequential Rules
Class Sequential Rules
Bibliographic Notes
Supervised Learning
Basic Concepts
Decision Tree Induction
Learning Algorithm
Impurity Function
Handling of Continuous Attributes
Some Other Issues
Classifier Evaluation
Evaluation Methods
Precision, Recall, F-score and Breakeven Point
Rule Induction
Sequential Covering
Rule Learning: Learn-One-Rule Function
Discussion
Classification Based on Associations
Classification Using Class Association Rules
Class-Association Rules as Features
Classification Using Normal Association Rules
Naive Bayesian Classification
Naive Bayesian Text Classification
Probabilistic Framework
Naive Bayesian Model
Discussion
Support Vector Machines
Linear SVM: Separable Case
Linear SVM: Non-Separable Case
Nonlinear SVM: Kernel Functions
K-Nearest Neighbor Learning
Ensemble of Classifiers
Bagging
Boosting
Bibliographic Notes
Unsupervised Learning
Basic Concepts
K-means Clustering
K-means Algorithm
Disk Version of the K-means Algorithm
Strengths and Weaknesses
Representation of Clusters
Common Ways of Representing Clusters
Clusters of Arbitrary Shapes
Hierarchical Clustering
Single-Link Method
Complete-Link Method
Average-Link Method
Strengths and Weaknesses
Distance Functions
Numeric Attributes
Binary and Nominal Attributes
Text Documents
Data Standardization
Handling of Mixed Attributes
Which Clustering Algorithm to Use?
Cluster Evaluation
Discovering Holes and Data Regions
Bibliographic Notes
Partially Supervised Learning
Learning from Labeled and Unlabeled Examples
EM Algorithm with Naive Bayesian Classification
Co-Training
Self-Training
Transductive Support Vector Machines
Graph-Based Methods
Discussion
Learning from Positive and Unlabeled Examples
Applications of PU Learning
Theoretical Foundation
Building Classifiers: Two-Step Approach
Building Classifiers: Direct Approach
Discussion
Derivation of EM for Naive Bayesian Classification
Bibliographic Notes
Web Mining
Information Retrieval and Web Search
Basic Concepts of Information Retrieval
Information Retrieval Models
Boolean Model
Vector Space Model
Statistical Language Model
Relevance Feedback
Evaluation Measures
Text and Web Page Pre-Processing
Stopword Removal
Stemming
Other Pre-Processing Tasks for Text
Web Page Pre-Processing
Duplicate Detection
Inverted Index and Its Compression
Inverted Index
Search Using an Inverted Index
Index Construction
Index Compression
Latent Semantic Indexing
Singular Value Decomposition
Query and Retrieval
An Example
Discussion
Web Search
Meta-Search: Combining Multiple Rankings
Combination Using Similarity Scores
Combination Using Rank Positions
Web Spamming
Content Spamming
Link Spamming
Hiding Techniques
Combating Spam
Bibliographic Notes
Link Analysis
Social Network Analysis
Centrality
Prestige
Co-Citation and Bibliographic Coupling
Co-Citation
Bibliographic Coupling
PageRank
PageRank Algorithm
Strengths and Weaknesses of PageRank
Timed PageRank
HITS
HITS Algorithm
Finding Other Eigenvectors
Relationships with Co-Citation and Bibliographic Coupling
Strengths and Weaknesses of HITS
Community Discovery
Problem Definition
Bipartite Core Communities
Maximum Flow Communities
Email Communities Based on Betweenness
Overlapping Communities of Named Entities
Bibliographic Notes
Web Crawling
A Basic Crawler Algorithm
Breadth-First Crawlers
Preferential Crawlers
Implementation Issues
Fetching
Parsing
Stopword Removal and Stemming
Link Extraction and Canonicalization
Spider Traps
Page Repository
Concurrency
Universal Crawlers
Scalability
Coverage vs Freshness vs Importance
Focused Crawlers
Topical Crawlers
Topical Locality and Cues
Best-First Variations
Adaptation
Evaluation
Crawler Ethics and Conflicts
Some New Developments
Bibliographic Notes
Structured Data Extraction: Wrapper Generation
Preliminaries
Two Types of Data Rich Pages
Data Model
HTML Mark-Up Encoding of Data Instances
Wrapper Induction
Extraction from a Page
Learning Extraction Rules
Identifying Informative Examples
Wrapper Maintenance
Instance-Based Wrapper Learning
Automatic Wrapper Generation: Problems
Two Extraction Problems
Patterns as Regular Expressions
String Matching and Tree Matching
String Edit Distance
Tree Matching
Multiple Alignment
Center Star Method
Partial Tree Alignment
Building DOM Trees
Extraction Based on a Single List Page: Flat Data Records
Two Observations about Data Records
Mining Data Regions
Identifying Data Records in Data Regions
Data Item Alignment and Extraction
Making Use of Visual Information
Some Other Techniques
Extraction Based on a Single List Page: Nested Data Records
Extraction Based on Multiple Pages
Using Techniques in Previous Sections
RoadRunner Algorithm
Some Other Issues
Extraction from Other Pages
Disjunction or Optional
A Set Type or a Tuple Type
Labeling and Integration
Domain Specific Extraction
Discussion
Bibliographic Notes
Information Integration
Introduction to Schema Matching
Pre-Processing for Schema Matching
Schema-Level Match
Linguistic Approaches
Constraint Based Approaches
Domain and Instance-Level Matching
Combining Similarities
1:m Match
Some Other Issues
Reuse of Previous Match Results
Matching a Large Number of Schemas
Schema Match Results
User Interactions
Integration of Web Query Interfaces
A Clustering Based Approach
A Correlation Based Approach
An Instance Based Approach
Constructing a Unified Global Query Interface
Structural Appropriateness and the Merge Algorithm
Lexical Appropriateness
Instance Appropriateness
Bibliographic Notes
Opinion Mining
Sentiment Classification
Classification Based on Sentiment Phrases
Classification Using Text Classification Methods
Classification Using a Score Function
Feature-Based Opinion Mining and Summarization
Problem Definition
Object Feature Extraction
Feature Extraction from Pros and Cons of Format 1
Feature Extraction from Reviews of Formats 2 and 3
Opinion Orientation Classification
Comparative Sentence and Relation Mining
Problem Definition
Identification of Gradable Comparative Sentences
Extraction of Comparative Relations
Opinion Search
Opinion Spam
Objectives and Actions of Opinion Spamming
Types of Spam and Spammers
Hiding Techniques
Spam Detection
Bibliographic Notes
Web Usage Mining
Data Collection and Pre-Processing
Sources and Types of Data
Key Elements of Web Usage Data Pre-Processing
Data Modeling for Web Usage Mining
Discovery and Analysis of Web Usage Patterns
Session and Visitor Analysis
Cluster Analysis and Visitor Segmentation
Association and Correlation Analysis
Analysis of Sequential and Navigational Patterns
Classification and Prediction Based on Web User Transactions
Discussion and Outlook
Bibliographic Notes
References
Index