Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data / Edition 1

by Bing Liu
     
 

Overview

Web mining aims to discover useful information and knowledge from Web hyperlink structures, page contents, and usage data. Although Web mining uses many conventional data mining techniques, it is not purely an application of traditional data mining due to the semistructured and unstructured nature of the Web data and its heterogeneity. The field has also developed many of its own algorithms and techniques.

Liu has written a comprehensive text on Web data mining. Key topics of structure mining, content mining, and usage mining are covered both in breadth and in depth. His book brings together all the essential concepts and algorithms from related areas such as data mining, machine learning, and text processing to form an authoritative and coherent text.

The book offers a rich blend of theory and practice, addressing seminal research ideas, as well as examining the technology from a practical point of view. It is suitable for students, researchers and practitioners interested in Web mining both as a learning text and as a reference book. Lecturers can readily use it for classes on data mining, Web mining, and Web search. Additional teaching materials such as lecture slides, datasets, and implemented algorithms are available online.

Product Details

ISBN-13: 9783642072376
Publisher: Springer-Verlag New York, LLC
Publication date: 11/23/2010
Series: Data-Centric Systems and Applications Series
Edition description: Softcover reprint of hardcover 1st ed. 2007
Pages: 552

Table of Contents


Introduction
What is the World Wide Web?
A Brief History of the Web and the Internet
Web Data Mining
What is Data Mining?
What is Web Mining?
Summary of Chapters
How to Read this Book
Bibliographic Notes
Data Mining Foundations
Association Rules and Sequential Patterns
Basic Concepts of Association Rules
Apriori Algorithm
Frequent Itemset Generation
Association Rule Generation
Data Formats for Association Rule Mining
Mining with Multiple Minimum Supports
Extended Model
Mining Algorithm
Rule Generation
Mining Class Association Rules
Problem Definition
Mining Algorithm
Mining with Multiple Minimum Supports
Basic Concepts of Sequential Patterns
Mining Sequential Patterns Based on GSP
GSP Algorithm
Mining with Multiple Minimum Supports
Mining Sequential Patterns Based on PrefixSpan
PrefixSpan Algorithm
Mining with Multiple Minimum Supports
Generating Rules from Sequential Patterns
Sequential Rules
Label Sequential Rules
Class Sequential Rules
Bibliographic Notes
Supervised Learning
Basic Concepts
Decision Tree Induction
Learning Algorithm
Impurity Function
Handling of Continuous Attributes
Some Other Issues
Classifier Evaluation
Evaluation Methods
Precision, Recall, F-score and Breakeven Point
Rule Induction
Sequential Covering
Rule Learning: Learn-One-Rule Function
Discussion
Classification Based on Associations
Classification Using Class Association Rules
Class-Association Rules as Features
Classification Using Normal Association Rules
Naive Bayesian Classification
Naive Bayesian Text Classification
Probabilistic Framework
Naive Bayesian Model
Discussion
Support Vector Machines
Linear SVM: Separable Case
Linear SVM: Non-Separable Case
Nonlinear SVM: Kernel Functions
K-Nearest Neighbor Learning
Ensemble of Classifiers
Bagging
Boosting
Bibliographic Notes
Unsupervised Learning
Basic Concepts
K-means Clustering
K-means Algorithm
Disk Version of the K-means Algorithm
Strengths and Weaknesses
Representation of Clusters
Common Ways of Representing Clusters
Clusters of Arbitrary Shapes
Hierarchical Clustering
Single-Link Method
Complete-Link Method
Average-Link Method
Strengths and Weaknesses
Distance Functions
Numeric Attributes
Binary and Nominal Attributes
Text Documents
Data Standardization
Handling of Mixed Attributes
Which Clustering Algorithm to Use?
Cluster Evaluation
Discovering Holes and Data Regions
Bibliographic Notes
Partially Supervised Learning
Learning from Labeled and Unlabeled Examples
EM Algorithm with Naive Bayesian Classification
Co-Training
Self-Training
Transductive Support Vector Machines
Graph-Based Methods
Discussion
Learning from Positive and Unlabeled Examples
Applications of PU Learning
Theoretical Foundation
Building Classifiers: Two-Step Approach
Building Classifiers: Direct Approach
Discussion
Derivation of EM for Naive Bayesian Classification
Bibliographic Notes
Web Mining
Information Retrieval and Web Search
Basic Concepts of Information Retrieval
Information Retrieval Models
Boolean Model
Vector Space Model
Statistical Language Model
Relevance Feedback
Evaluation Measures
Text and Web Page Pre-Processing
Stopword Removal
Stemming
Other Pre-Processing Tasks for Text
Web Page Pre-Processing
Duplicate Detection
Inverted Index and Its Compression
Inverted Index
Search Using an Inverted Index
Index Construction
Index Compression
Latent Semantic Indexing
Singular Value Decomposition
Query and Retrieval
An Example
Discussion
Web Search
Meta-Search: Combining Multiple Rankings
Combination Using Similarity Scores
Combination Using Rank Positions
Web Spamming
Content Spamming
Link Spamming
Hiding Techniques
Combating Spam
Bibliographic Notes
Link Analysis
Social Network Analysis
Centrality
Prestige
Co-Citation and Bibliographic Coupling
Co-Citation
Bibliographic Coupling
PageRank
PageRank Algorithm
Strengths and Weaknesses of PageRank
Timed PageRank
HITS
HITS Algorithm
Finding Other Eigenvectors
Relationships with Co-Citation and Bibliographic Coupling
Strengths and Weaknesses of HITS
Community Discovery
Problem Definition
Bipartite Core Communities
Maximum Flow Communities
Email Communities Based on Betweenness
Overlapping Communities of Named Entities
Bibliographic Notes
Web Crawling
A Basic Crawler Algorithm
Breadth-First Crawlers
Preferential Crawlers
Implementation Issues
Fetching
Parsing
Stopword Removal and Stemming
Link Extraction and Canonicalization
Spider Traps
Page Repository
Concurrency
Universal Crawlers
Scalability
Coverage vs Freshness vs Importance
Focused Crawlers
Topical Crawlers
Topical Locality and Cues
Best-First Variations
Adaptation
Evaluation
Crawler Ethics and Conflicts
Some New Developments
Bibliographic Notes
Structured Data Extraction: Wrapper Generation
Preliminaries
Two Types of Data Rich Pages
Data Model
HTML Mark-Up Encoding of Data Instances
Wrapper Induction
Extraction from a Page
Learning Extraction Rules
Identifying Informative Examples
Wrapper Maintenance
Instance-Based Wrapper Learning
Automatic Wrapper Generation: Problems
Two Extraction Problems
Patterns as Regular Expressions
String Matching and Tree Matching
String Edit Distance
Tree Matching
Multiple Alignment
Center Star Method
Partial Tree Alignment
Building DOM Trees
Extraction Based on a Single List Page: Flat Data Records
Two Observations about Data Records
Mining Data Regions
Identifying Data Records in Data Regions
Data Item Alignment and Extraction
Making Use of Visual Information
Some Other Techniques
Extraction Based on a Single List Page: Nested Data Records
Extraction Based on Multiple Pages
Using Techniques in Previous Sections
RoadRunner Algorithm
Some Other Issues
Extraction from Other Pages
Disjunction or Optional
A Set Type or a Tuple Type
Labeling and Integration
Domain Specific Extraction
Discussion
Bibliographic Notes
Information Integration
Introduction to Schema Matching
Pre-Processing for Schema Matching
Schema-Level Match
Linguistic Approaches
Constraint Based Approaches
Domain and Instance-Level Matching
Combining Similarities
1:m Match
Some Other Issues
Reuse of Previous Match Results
Matching a Large Number of Schemas
Schema Match Results
User Interactions
Integration of Web Query Interfaces
A Clustering Based Approach
A Correlation Based Approach
An Instance Based Approach
Constructing a Unified Global Query Interface
Structural Appropriateness and the Merge Algorithm
Lexical Appropriateness
Instance Appropriateness
Bibliographic Notes
Opinion Mining
Sentiment Classification
Classification Based on Sentiment Phrases
Classification Using Text Classification Methods
Classification Using a Score Function
Feature-Based Opinion Mining and Summarization
Problem Definition
Object Feature Extraction
Feature Extraction from Pros and Cons of Format 1
Feature Extraction from Reviews of Formats 2 and 3
Opinion Orientation Classification
Comparative Sentence and Relation Mining
Problem Definition
Identification of Gradable Comparative Sentences
Extraction of Comparative Relations
Opinion Search
Opinion Spam
Objectives and Actions of Opinion Spamming
Types of Spam and Spammers
Hiding Techniques
Spam Detection
Bibliographic Notes
Web Usage Mining
Data Collection and Pre-Processing
Sources and Types of Data
Key Elements of Web Usage Data Pre-Processing
Data Modeling for Web Usage Mining
Discovery and Analysis of Web Usage Patterns
Session and Visitor Analysis
Cluster Analysis and Visitor Segmentation
Association and Correlation Analysis
Analysis of Sequential and Navigational Patterns
Classification and Prediction Based on Web User Transactions
Discussion and Outlook
Bibliographic Notes
References
Index
