# Lecture Notes
📗 Part II Outline
➩ Data collection:
(1) Scraping data from websites.
(2) Creating websites to collect visitor information.
➩ Data visualization:
(1) Low dimensional data sets.
(2) High dimensional data sets (more in Part III).
(3) Graph data sets.
(4) Map data sets.
➩ Data pre-processing for machine learning:
(1) Images and image features.
(2) Text representation and search (regex).
➩ Data Analysis and Machine Learning (Part III)
📗 TopHat Question
➩ Select the choice you think is the least popular (the one you think the fewest people will select).
➩ A
➩ B
➩ C
➩ D
📗 Document Object Model
➩ Every web page is a tree.
➩ Elements (nodes of the DOM (document object model) tree) may contain attributes, text, and other elements.
➩ JavaScript can directly edit the DOM tree.
➩ The browser renders (displays) the DOM tree, based on the original HTML (hyper-text markup language) file and any JavaScript changes.
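To make the tree concrete, here is a tiny hypothetical HTML fragment, annotated with the DOM tree it produces:

```python
# A tiny, hypothetical HTML fragment and the DOM tree it produces
html = """
<html>
  <body>
    <p id="greeting">Hello <b>world</b></p>
  </body>
</html>
"""
# Tree: html -> body -> p (with attribute id="greeting");
# the p element contains the text "Hello " plus a child element b,
# and b contains the text "world".
```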
📗 Web Scraping
➩ The requests module (CS220) can fetch .html, .js, etc. files.
➩ The selenium module (CS320) can fetch .html, .js, etc. files, run the JavaScript in a browser, and grab the HTML version of the DOM after JavaScript has modified it.
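For contrast, a minimal sketch of the requests side (the URL is hypothetical); note that any JavaScript on the fetched page is not executed:

```python
import requests

# requests downloads the raw file only; JavaScript on the page is NOT run
resp = requests.get("https://example.com")   # hypothetical URL
print(resp.status_code)    # 200 on success
print(resp.text[:200])     # start of the raw HTML, before any JS changes
```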
📗 Install Selenium
➩ Firefox GeckoDriver: Link.
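A minimal launch sketch, assuming a recent Selenium 4 (whose Selenium Manager can locate GeckoDriver automatically) and Firefox installed; the headless option just hides the browser window:

```python
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("--headless")           # run without opening a window
driver = webdriver.Firefox(options=options)  # starts Firefox via GeckoDriver
# ... scrape here ...
driver.quit()                                # close the browser when done
```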
📗 Finding Elements on Webpages
➩ WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc
➩ WebDriver.find_element(a, b) returns the first WebElement whose attribute a equals b, and WebDriver.find_elements(a, b) returns a list of all matching WebElements: Doc
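A short sketch of both calls, assuming a driver created as above; the URL is hypothetical:

```python
driver.get("https://example.com")                  # load the page (hypothetical URL)
first_link = driver.find_element("tag name", "a")  # first matching element
all_links = driver.find_elements("tag name", "a")  # list of every match
print(first_link.text, len(all_links))
```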
📗 Common Attributes
➩ WebDriver.find_element("id", id)
locates the element by its unique ID.
➩ WebDriver.find_element("name", id)
locates the element by its names, but multiple elements can have the same name.
➩ WebDriver.find_element("tag name", id)
locates the element by its tag, some of the common tags include:
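The three locator strategies side by side (the ids and names are hypothetical):

```python
driver.find_element("id", "main")        # at most one element per unique id
driver.find_element("name", "q")         # first of possibly many elements named "q"
driver.find_elements("tag name", "td")   # every <td> cell on the page
```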
📗 Scraping Tables
➩ Tables can be scraped with find_element("tag name", "table"), then iterating over find_elements("tag name", "tr") and find_elements("tag name", "td"); see the sketch below.
➩ Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table on the web page.
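A sketch of both approaches, assuming driver has already loaded a page containing a table; io.StringIO wraps the HTML string for pandas:

```python
import io
import pandas as pd

# Selenium: iterate over rows (<tr>) and cells (<td>) of the first table
table = driver.find_element("tag name", "table")
rows = []
for tr in table.find_elements("tag name", "tr"):
    cells = [td.text for td in tr.find_elements("tag name", "td")]
    if cells:                       # skip rows with no <td> (e.g., <th> headers)
        rows.append(cells)

# pandas: parse every <table> in the HTML at once (static content only)
dfs = pd.read_html(io.StringIO(driver.page_source))  # one DataFrame per table
```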
📗 Table Example
📗 TopHat Discussion
📗 Why are the numbers in the table different from what you see? How would you fix it?
📗 Screenshots
➩ WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file named file.png.
➩ Sometimes the screenshot captures only a part of the page: Firefox has the option to take a screenshot of the full page using WebDriver.save_full_screenshot("file.png").
➩ Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
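The three screenshot calls together (the element id is hypothetical; save_full_screenshot assumes GeckoDriver):

```python
driver.save_screenshot("page.png")          # visible portion of the page
driver.save_full_screenshot("full.png")     # entire page (Firefox/GeckoDriver)
chart = driver.find_element("id", "chart")  # hypothetical element id
chart.screenshot("chart.png")               # just that one element
```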
📗 Screenshot Example
➩ To take a screenshot with ChromeDriver: Notebook.
➩ To take a full-page screenshot with GeckoDriver: Notebook.
📗 Polling
➩ Sometimes, loading the page and running the JavaScript takes time.
➩ Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen before accessing or interacting with the updated elements.
➩ Avoid infinite loops by setting a maximum amount of time to wait.
➩ selenium.webdriver.support has implicit and explicit waits, so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc (see the sketch below).
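A sketch of both styles of waiting, manual polling with a time limit and an explicit wait; the element id is hypothetical:

```python
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Manual polling with a maximum wait (never loop forever)
start = time.time()
while time.time() - start < 10:               # wait at most 10 seconds
    if driver.find_elements("id", "result"):  # hypothetical element id
        break
    time.sleep(0.5)                           # poll twice per second

# The same wait as an explicit wait from selenium.webdriver.support
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "result")))
```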
📗 Interact with Elements
➩ Selenium can interact with elements and update the DOM tree.
➩ WebElement.send_keys(t) enters text t into the element (input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
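A short interaction sketch (the element names and ids are hypothetical):

```python
box = driver.find_element("name", "username")  # hypothetical input element
box.clear()                                    # remove any existing text
box.send_keys("alice")                         # type text into the input
driver.find_element("id", "submit").click()    # click a hypothetical button
```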
📗 Multiple Choice Example
➩ Code to loop through all choices: Notebook.
📗 Password Example
➩ Code to loop through all 4-digit passwords (running the full version takes several minutes): Notebook.
📗 Robots Exclusion Protocol
➩ Some websites disallow web crawling. The rules are specified in a robots.txt file: Link.
➩ Examples: Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.
➩ urllib.robotparser can be used to check whether a website allows scraping: Doc.
➩ RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url; a sketch follows.
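A minimal sketch using urllib.robotparser from the standard library (the site is hypothetical):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # hypothetical site
rp.read()                                         # download and parse the rules
print(rp.can_fetch("*", "https://www.example.com/some/page"))  # True if allowed
```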
📗 Additional Examples
➩ Suppose element is an HTML table WebElement with \(n\) rows and \(m\) columns. What is the code to find:
(1) first cell of first row? element.find_element("tag name", "tr").find_element("tag name", "td").text
(2) \(j\)th cell of first row? element.find_element("tag name", "tr").find_elements("tag name", "td")[j - 1].text
(3) first cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_element("tag name", "td").text
(4) \(j\)th cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_elements("tag name", "td")[j - 1].text
➩ The following table is filled by JavaScript (with a one-second delay); scrape it and put the numbers in an array, as in the sketch below.
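One possible sketch, using a fixed wait for the delayed JavaScript (an explicit wait would also work) and assuming the cells contain integers:

```python
import time

time.sleep(2)  # crude wait for the one-second JavaScript delay
table = driver.find_element("tag name", "table")
numbers = [int(td.text)
           for tr in table.find_elements("tag name", "tr")
           for td in tr.find_elements("tag name", "td")
           if td.text.strip()]
print(numbers)
```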
Notes and code adapted from the courses taught by Yiyin Shen: Link, and Tyler Caraza-Harter: Link.