# Lecture Notes
📗 Part II Outline
➩ Data collection:
(1) Scraping data from websites.
(2) Creating websites to collect visitor information.
➩ Data visualization:
(1) Low dimensional data sets.
(2) High dimensional data sets (more in Part III).
(3) Graph data sets.
(4) Map data sets.
➩ Data pre-processing for machine learning:
(1) Images and image features.
(2) Text representation and search (regex).
➩ Data Analysis and Machine Learning (Part III)
TopHat Question
➩ Select the choice you think will be the least popular (the one you expect the fewest people to select).
➩ A
➩ B
➩ C
➩ D
Document Object Model
➩ Every web page is a tree.
➩ Elements (nodes of the DOM (document object model) tree) may contain attributes, text, and other elements.
➩ JavaScript can directly edit the DOM tree.
➩ The browser renders (displays) the DOM tree, based on the original HTML (hyper-text markup language) file and any JavaScript changes.
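➩ To make the tree structure concrete, the sketch below prints one indented line per element or text node, using only the standard library. The PAGE string, the TreePrinter class, and the dom_outline function are hypothetical names made up for this illustration, and the list of void tags is deliberately incomplete.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not increase the depth.
# (Illustrative subset only.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreePrinter(HTMLParser):
    """Collects one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs: the element's attributes.
        attr_text = " ".join(f'{k}="{v}"' for k, v in attrs)
        label = f"{tag} [{attr_text}]" if attr_text else tag
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1  # children are one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


# Hypothetical page used only for illustration.
PAGE = '<html><body><h1 id="title">Hello</h1><p>Some <b>bold</b> text</p></body></html>'


def dom_outline(html):
    """Return the indented outline of html as a list of lines."""
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


print("\n".join(dom_outline(PAGE)))
```

Each level of indentation corresponds to one level of nesting in the tree: h1 and p are children of body, and the text "bold" is a child of b.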
📗 Web Scraping
➩ requests module (CS220).
➩ requests can fetch .html, .js, and other files.
➩ selenium module (CS320).
➩ Selenium can fetch .html, .js, and other files, run the JavaScript in a browser, and grab the HTML version of the DOM after JavaScript has modified it.
📗 Install Selenium
➩ Firefox GeckoDriver: Link.
Finding Elements on Webpages
➩ WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc
➩ WebDriver.find_element(a, b) returns the first WebElement whose attribute a equals b, and WebDriver.find_elements(a, b) returns a list of all matching WebElements: Doc
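➩ A minimal sketch of the difference between the two calls. The function name heading_and_link_texts is hypothetical, and driver is assumed to be a Selenium WebDriver created elsewhere (for example with selenium.webdriver.Firefox()), so the sketch does not start a browser on its own.

```python
def heading_and_link_texts(driver, url):
    """Load url, then return (text of the first h1, texts of all links).

    driver is assumed to be a Selenium WebDriver supplied by the caller.
    """
    driver.get(url)
    # find_element returns only the first matching element.
    heading = driver.find_element("tag name", "h1")
    # find_elements returns a list of every match (empty if there are none).
    links = driver.find_elements("tag name", "a")
    return heading.text, [link.text for link in links]
```

With a real driver, find_element raises an exception when nothing matches, while find_elements returns an empty list, so find_elements is often the safer call when an element may be missing.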
📗 Common Attributes
➩ WebDriver.find_element("id", id) locates the element by its unique ID.
➩ WebDriver.find_element("name", name) locates the element by its name, but multiple elements can have the same name.
➩ WebDriver.find_element("tag name", tag) locates the element by its tag, some of the common tags include:
Scraping Tables
➩ Tables can be scraped by calling find_element("tag name", "table"), then iterating over the rows returned by find_elements("tag name", "tr") and, within each row, the cells returned by find_elements("tag name", "td").
➩ Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table, on the web page.
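➩ The row-by-row iteration described above could be sketched as follows. The function name table_to_rows is hypothetical, and table is assumed to be the WebElement returned by find_element("tag name", "table").

```python
def table_to_rows(table):
    """Return the text of every td cell, one list per table row.

    table is assumed to be a Selenium WebElement for a <table>.
    """
    rows = []
    for tr in table.find_elements("tag name", "tr"):
        cells = tr.find_elements("tag name", "td")
        # Header rows typically use th instead of td, so they have no td cells.
        if cells:
            rows.append([td.text for td in cells])
    return rows
```

The cell values come back as strings; convert them (for example with int or float) before doing arithmetic on them.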
Table Example
TopHat Discussion
📗 Why are the numbers in the table different from what you see? How to fix it?
📗 Screenshots
➩ WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file with name file.png.
➩ Sometimes the screenshot captures only part of the page. Firefox has the option to take a screenshot of the full page using WebDriver.save_full_page_screenshot("file.png").
➩ Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
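➩ Because the full-page method exists only on the Firefox driver (where current Selenium releases name it save_full_page_screenshot), code that should work with several browsers can check for it first. This is a sketch; the function name save_page_image is hypothetical, and driver is assumed to be a Selenium WebDriver supplied by the caller.

```python
def save_page_image(driver, path):
    """Save a screenshot to path and report which method was used."""
    # Only the Firefox driver offers a full-page screenshot method,
    # so fall back to the visible-area screenshot for other browsers.
    full_page = getattr(driver, "save_full_page_screenshot", None)
    if full_page is not None:
        full_page(path)
        return "full page"
    driver.save_screenshot(path)
    return "viewport"
```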
Screenshot Example
➩ To take a screenshot with ChromeDriver: Notebook.
➩ To take a full-page screenshot with GeckoDriver: Notebook.
Polling
➩ Sometimes, loading the page and running the JavaScript takes time.
➩ Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen, before accessing or interacting with the updated elements.
➩ Avoid infinite loops by setting a maximum amount of time to wait.
➩ Selenium also provides implicit waits (WebDriver.implicitly_wait) and explicit waits (WebDriverWait in selenium.webdriver.support.ui), so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
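➩ A hand-written polling loop with a maximum wait time could look like the sketch below. The function name wait_until and its parameters are hypothetical; condition can be any function that returns a truthy value once the page is ready.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() every interval seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds have
    passed, so the loop can never run forever.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

For example, wait_until(lambda: driver.find_elements("tag name", "td")) keeps checking until at least one table cell exists, because find_elements returns an empty (falsy) list while there are no matches; driver here is assumed to be a WebDriver created elsewhere.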
📗 Interact with Elements
➩ Selenium can interact with elements and update the DOM tree.
➩ WebElement.send_keys(t) enters text t into the element (input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
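➩ The three interactions are often used together to fill in and submit a form. In this sketch the function name fill_and_submit and the element IDs passed to it are hypothetical, and driver is assumed to be a Selenium WebDriver with the page already loaded.

```python
def fill_and_submit(driver, input_id, button_id, text):
    """Type text into one element and click another, both located by ID."""
    box = driver.find_element("id", input_id)
    box.clear()          # remove any text already in the box
    box.send_keys(text)  # type the new text
    driver.find_element("id", button_id).click()
```

If the click triggers JavaScript that updates the page, wait for the update before reading the result.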
Multiple Choice Example
➩ Code to loop through all choices: Notebook.
Password Example
➩ Code to loop through all 4-digit passwords (running the full version takes several minutes): Notebook.
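➩ This is not the notebook's code, only a sketch of how the 10,000 candidates could be generated with leading zeros kept. The names four_digit_codes, find_password, and is_correct are hypothetical; in a Selenium version, is_correct would enter the code on the page, click the button, and check the response.

```python
def four_digit_codes():
    """Yield "0000", "0001", ..., "9999" in order."""
    for n in range(10_000):
        # :04d pads with zeros so that 7 becomes "0007".
        yield f"{n:04d}"


def find_password(is_correct):
    """Return the first code accepted by is_correct, or None if none is."""
    for code in four_digit_codes():
        if is_correct(code):
            return code
    return None
```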
📗 Robots Exclusion Protocol
➩ Some websites disallow web crawling. The rules are specified in a robots.txt: Link.
➩ Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.
➩ urllib.robotparser can be used to check whether a website allows scraping: Doc.
➩ RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url.
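➩ A small sketch of can_fetch in use. The rules in RULES and the example.com URLs are made up for illustration and are written inline so that the example needs no network access; the function name allowed is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, one rule per line.
RULES = """\
User-agent: *
Disallow: /private/
""".splitlines()


def allowed(url, useragent="*"):
    """Return True if useragent may fetch url under the rules in RULES."""
    parser = RobotFileParser()
    parser.parse(RULES)  # parse rules given as a list of lines
    return parser.can_fetch(useragent, url)


print(allowed("https://example.com/index.html"))
print(allowed("https://example.com/private/data.html"))
```

To check a real site instead, call set_url with the address of its robots.txt and then read() in place of parse(RULES).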
Additional Examples
➩ Suppose element is an HTML table WebElement with \(n\) rows and \(m\) columns. What is the code to find:
(1) first cell of first row? element.find_element("tag name", "tr").find_element("tag name", "td").text
(2) \(j\)th cell of first row? element.find_element("tag name", "tr").find_elements("tag name", "td")[j - 1].text
(3) first cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_element("tag name", "td").text
(4) \(j\)th cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_elements("tag name", "td")[j - 1].text
➩ The following table is filled by JavaScript (with a one-second delay). Scrape it and put the numbers in an array.
1
Notes and code adapted from the course taught by Yiyin Shen (Link) and Tyler Caraza-Harter (Link).