Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

Overview

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

  • Introduces fundamental concepts of the main architecture of the web and databases and covers HTTP, HTML, XML, JSON, and SQL.
  • Provides basic techniques to query web documents and data sets (XPath and regular expressions).
  • An extensive set of exercises is presented to guide the reader through each technique.
  • Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management.
  • Case studies are featured throughout along with examples for each technique presented.
  • R code and solutions to exercises featured in the book are provided on a supporting website.

Product Details

  • ISBN-13: 9781118834817
  • Publisher: Wiley
  • Publication date: 12/31/2014
  • Edition number: 1
  • Pages: 472

Table of Contents

Dedication

Table of Contents

List of Figures

List of Tables

Preface

1 Introduction

1.1 Case Study: World Heritage Sites in Danger

1.2 Some Remarks on Web Data Quality

1.3 Technologies for Disseminating, Extracting and Storing Web Data

1.3.1 Technologies for disseminating content on the Web

1.4 Structure of the Book

Part One A Primer on Web and Data Technologies

2 HTML

2.1 Browser Presentation and Source Code

2.2 Syntax Rules

2.3 Tags and Attributes

2.4 Parsing

Summary

Further Reading

Problems

3 XML and JSON

3.1 A Short Example XML Document

3.2 XML Syntax Rules

3.3 When Is an XML Document Well-formed or Valid?

3.4 XML Extensions and Technologies

3.5 XML and R in Practice

3.6 A Short Example JSON Document

3.7 JSON Syntax Rules

3.8 JSON and R in Practice

Summary

Further Reading

Problems

4 XPath

4.1 XPath - a Querying Language for Web Documents

4.2 Identifying Node Sets with XPath

4.3 Extracting Node Elements

Summary

Further Reading

Problems

5 HTTP

5.1 HTTP Fundamentals

5.2 Advanced Features of HTTP

5.3 Protocols beyond HTTP

5.4 HTTP in Action

Summary

Further Reading

Problems

6 AJAX

6.1 JavaScript

6.2 XHR

6.3 Exploring AJAX with Web Developer Tools

Summary

Further Reading

Problems

7 SQL and Relational Databases

7.1 Overview and Terminology

7.2 Relational Databases

7.3 SQL: a Language to Communicate with Databases

7.4 Databases in Action

Summary

Further Reading

Problems

8 Regular Expressions and String Functions

8.1 Regular Expressions

8.2 String Processing

8.3 A Word on Character Encodings

Summary

Further Reading

Problems

Part Two A Practical Toolbox for Web Scraping and Text Mining

9 Scraping the Web

9.1 Retrieval Scenarios

9.2 Extraction Strategies

9.3 Web Scraping: Good Practice

9.4 Valuable Sources of Inspiration

Summary

Further Reading

Problems

10 Statistical Text Processing

10.1 The running example: classifying press releases of the British government

10.2 Processing Textual Data

10.3 Supervised Learning Techniques

10.4 Unsupervised Learning Techniques

Summary

Further Reading

11 Managing Data Projects

11.1 Interacting with the File System

11.2 Processing Multiple Documents/Links

11.3 Organizing Scraping Procedures

11.4 Executing R Scripts on a Regular Basis

Part Three A Bag of Case Studies

12 Collaboration Networks in the U.S. Senate

12.1 Information on the Bills

12.2 Information on the Senators

12.3 Analyzing the network structure

12.4 Conclusion

13 Parsing Information from Semi-Structured Documents

13.1 Downloading Data from the FTP Server

13.2 Parsing Semi-Structured Text Data

13.3 Visualizing station and temperature data

14 Predicting the 2014 Academy Awards using Twitter

14.1 Twitter APIs: Overview

14.2 Twitter-based Forecast of the 2014 Academy Awards

14.3 Conclusion

15 Mapping the Geographic Distribution of Names

15.1 Developing a Data Collection Strategy

15.2 Web Site Inspection

15.3 Data Retrieval and Information Extraction

15.4 Mapping Names

15.5 Automating the Process

15.6 Summary

16 Gathering Data on Mobile Phones

16.1 Page Exploration

16.2 Scraping Procedure

16.3 Graphical Analysis

16.4 Data storage

17 Analyzing Sentiments of Product Reviews

17.1 Introduction

17.2 Collecting the data

17.3 Analyzing the Data

17.4 Conclusion

References

Bibliography

Indices

General Index

Package Index

Function Index
