Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining

Overview

A hands-on guide to web scraping and text mining for both beginners and experienced users of R

Introduces the fundamental concepts behind the architecture of the web and of databases, covering HTTP, HTML, XML, JSON, and SQL.
Provides basic techniques to query web documents and data sets (XPath and regular expressions).
An extensive set of exercises is presented to guide the reader through each technique.
Explores both supervised and unsupervised learning techniques for text, as well as advanced techniques such as data scraping and text management.
Case studies are featured throughout along with examples for each technique presented.
R code and solutions to exercises featured in the book are provided on a supporting website.

Product Details

ISBN-13: 9781118834817
Publisher: Wiley
Publication date: 12/31/2014
Edition number: 1
Pages: 472

Dedication

Table of Contents

List of Figures

List of Tables

Preface

1 Introduction

1.1 Case Study: World Heritage Sites in Danger

1.2 Some Remarks on Web Data Quality

1.3 Technologies for Disseminating, Extracting and Storing Web Data

1.3.1 Technologies for disseminating content on the Web

1.4 Structure of the Book

Part One A Primer on Web and Data Technologies

2 HTML

2.1 Browser Presentation and Source Code

2.2 Syntax Rules

2.3 Tags and Attributes

2.4 Parsing

Summary

Further Reading

Problems

3 XML and JSON

3.1 A Short Example XML Document

3.2 XML Syntax Rules

3.3 When Is an XML Document Well-formed or Valid?

3.4 XML Extensions and Technologies

3.5 XML and R in Practice

3.6 A Short Example JSON Document

3.7 JSON Syntax Rules

3.8 JSON and R in Practice

Summary

Further Reading

Problems

4 XPath

4.1 XPath - a Querying Language for Web Documents

4.2 Identifying Node Sets with XPath

4.3 Extracting Node Elements

Summary

Further Reading

Problems

5 HTTP

5.1 HTTP Fundamentals

5.2 Advanced Features of HTTP

5.3 Protocols beyond HTTP

5.4 HTTP in Action

Summary

Further Reading

Problems

6 AJAX

6.1 JavaScript

6.2 XHR

6.3 Exploring AJAX with Web Developer Tools

Summary

Further Reading

Problems

7 SQL and Relational Databases

7.1 Overview and Terminology

7.2 Relational Databases

7.3 SQL: a Language to Communicate with Databases

7.4 Databases in Action

Summary

Further Reading

Problems

8 Regular Expressions and String Functions

8.1 Regular Expressions

8.2 String Processing

8.3 A Word on Character Encodings

Summary

Further Reading

Problems

Part Two A Practical Toolbox for Web Scraping and Text Mining

9 Scraping the Web

9.1 Retrieval Scenarios

9.2 Extraction Strategies

9.3 Web Scraping: Good Practice

9.4 Valuable Sources of Inspiration

Summary

Further Reading

Problems

10 Statistical Text Processing

10.1 The running example: classifying press releases of the British government

10.2 Processing Textual Data

10.3 Supervised Learning Techniques

10.4 Unsupervised Learning Techniques

Summary