Young Wu's Homepage


# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.

📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.

📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.

# Lecture Notes

📗 Part II Outline

➩ Data collection:

(1) Scraping data from websites.
(2) Create websites to collect visitor information.

➩ Data visualization:

(1) Low dimensional data sets.
(2) High dimensional data sets (more in Part III).
(3) Graph data sets.
(4) Map data sets.

➩ Data pre-processing for machine learning

(1) Images and image features.
(2) Text representation and search (regex).

➩ Data Analysis and Machine Learning (Part III)

TopHat Question

➩ Select the choice you think is the least popular (the one you think the least number of people will select).

➩ A

➩ B

➩ C

➩ D

Document Object Model

➩ Every web page is a tree.

➩ Elements (nodes of the DOM (document object model) tree) may contain attributes, text, and other elements.

➩ JavaScript can directly edit the DOM tree.

➩ The browser renders (displays) the DOM tree, based on the original HTML (hyper-text markup language) file and any changes made by JavaScript.
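To make the "every web page is a tree" idea concrete, here is a minimal sketch that prints an indented outline of the elements in a small HTML string using only the standard library. The names `TreePrinter`, `dom_outline`, and `sample` are made up for illustration, and the parser is much simpler than what a browser does (for example, it does not repair badly nested tags).

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they cannot contain children.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TreePrinter(HTMLParser):
    """Collects one indented line per element or text node."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Attributes belong to the element node itself.
        attr_text = "".join(f' {key}="{value}"' for key, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        if tag not in VOID_TAGS:
            self.depth += 1  # following nodes are children of this element

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1  # back to the parent element

    def handle_data(self, data):
        # Text is a child of the element that encloses it.
        if data.strip():
            self.lines.append("  " * self.depth + repr(data.strip()))


def dom_outline(html):
    """Return the outline of html as a list of indented lines."""
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = '<html><body><h1 id="title">Hello</h1><p>Some <b>bold</b> text</p></body></html>'
print("\n".join(dom_outline(sample)))
```

Each level of indentation corresponds to one level of the tree: `h1` and `p` are children of `body`, and `b` is a child of `p`.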

📗 Web Scraping

➩ requests module (CS220).

➩ requests can fetch .html, .js, and other files, but it does not run any JavaScript.

➩ selenium module (CS320).

➩ Selenium can fetch .html, .js, and other files, run the .js files in a browser, and grab the HTML version of the DOM after JavaScript has modified it.
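The difference between the two modules can be sketched with two small helper functions. The function names are made up for illustration; `driver` is assumed to be a Selenium WebDriver that was already created (for example with `webdriver.Chrome()` or `webdriver.Firefox()`), and the `requests` package is assumed to be installed.

```python
def static_html(url):
    """Return the HTML exactly as the server sent it (no JavaScript is run)."""
    import requests  # third-party package, imported here so the sketch loads without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx / 5xx responses
    return response.text


def rendered_html(driver, url):
    """Return the HTML of the DOM as it is right now in the browser."""
    driver.get(url)  # the browser loads the page and runs its JavaScript
    return driver.page_source
```

If a page builds its content with JavaScript, `static_html` would typically return the page without that content, while `rendered_html` returns whatever the DOM contains at the moment `page_source` is read (so scripts that run with a delay may still need some waiting).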

📗 Install Selenium

➩ Selenium: Link.

➩ ChromeDriver: Link.

➩ Firefox GeckoDriver: Link.

Finding Elements on Webpages

➩ WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc

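➩ WebDriver.find_element(a, b) returns the first WebElement for which locator a (for example, "id" or "tag name") has value b, and raises NoSuchElementException if there is none. WebDriver.find_elements(a, b) returns a list of all matching WebElements (possibly empty): Doc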

📗 Common Attributes

➩ WebDriver.find_element("id", id) locates the element by its unique ID.

➩ WebDriver.find_element("name", name) locates the element by its name attribute, but multiple elements can have the same name.

➩ WebDriver.find_element("tag name", tag) locates the element by its tag. Some of the common tags include:

Tag	Element	Example
`a`	hyperlink	Link
`button`	button
`code`	code	`code`
`div` or `span`	section	span
`h1` to `h6`	headings
`img` and `video`	image and video
`input` and `textarea`	text fields
`ol` or `ul`	ordered or unordered list	`li` is an item
`p`	paragraph
`select`	drop-down list
`table`	table	`tr` is a row, and `td` is a cell in the table
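A minimal sketch of how these locators might be combined on one page. The function name `describe_page` is hypothetical, and `driver` is assumed to be an existing Selenium WebDriver. The locator strings are the same values that Selenium exposes as `By.ID`, `By.NAME`, and `By.TAG_NAME`.

```python
def describe_page(driver, url):
    """Return the first heading and all link targets found on a page."""
    driver.get(url)
    # find_element returns only the first match, here the first <h1>.
    heading = driver.find_element("tag name", "h1")
    # find_elements returns a list with every match, here every <a>.
    links = driver.find_elements("tag name", "a")
    return {
        "heading": heading.text,
        "links": [link.get_attribute("href") for link in links],
    }
```

The same pattern works with `"id"` or `"name"` in place of `"tag name"` when the element of interest has one of those attributes.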

Scraping Tables

➩ Tables can be scraped by calling find_element("tag name", "table"), then iterating over the rows from find_elements("tag name", "tr") and, within each row, over the cells from find_elements("tag name", "td").

➩ Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table, on the web page.
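Both approaches can be sketched as short helper functions. The names `scrape_table` and `static_tables` are made up for illustration; `table` is assumed to be the WebElement of a `table` tag, and the pandas version assumes pandas and one of its HTML parsers (such as lxml) are installed.

```python
def scrape_table(table):
    """Return the text of every cell of a table WebElement as a list of rows."""
    rows = []
    for tr in table.find_elements("tag name", "tr"):
        cells = tr.find_elements("tag name", "td")
        if cells:  # skip rows without td cells, such as header rows made of th
            rows.append([td.text for td in cells])
    return rows


def static_tables(html):
    """Return one DataFrame per table found in an HTML string."""
    import io
    import pandas as pd  # imported here so the sketch loads without pandas

    # read_html expects a URL, path, or file-like object, so wrap the string.
    return pd.read_html(io.StringIO(html))
```

With Selenium, `scrape_table(driver.find_element("tag name", "table"))` would return the cells as strings, so numeric columns still need a conversion such as `int` or `float`.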

Table Example

➩ Code to scrape a table (without pandas): Notebook, Notebook.

TopHat Discussion

📗 Why are the numbers in the table different from what you see? How to fix it?

📗 Screenshots

➩ WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file with name file.png.

➩ Sometimes the screenshot captures only a part of the page: Firefox has the option to take a screenshot of the full page using WebDriver.save_full_page_screenshot("file.png").

➩ Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
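A minimal sketch of the three screenshot options. The function names are hypothetical, and `driver` is assumed to be an existing Selenium WebDriver. The full-page method is assumed to be the Firefox-only `save_full_page_screenshot` found in recent Selenium releases, so the sketch falls back to the regular screenshot when the driver does not have it.

```python
def save_page_screenshot(driver, path):
    """Save a full-page screenshot if the driver supports it, else the visible part."""
    full_page = getattr(driver, "save_full_page_screenshot", None)
    if full_page is not None:
        return full_page(path)  # Firefox (GeckoDriver) only
    return driver.save_screenshot(path)  # available on every driver


def save_element_screenshot(driver, element_id, path):
    """Save a screenshot of only the element with the given id."""
    element = driver.find_element("id", element_id)
    return element.screenshot(path)
```

Checking for the method instead of the browser name keeps the same code usable with both ChromeDriver and GeckoDriver.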

Screenshot Example

➩ To take a screenshot with ChromeDriver: Notebook.

➩ To take a full-page screenshot with GeckoDriver: Notebook.

Polling

➩ Sometimes, loading the page and running the JavaScript takes time.

➩ Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen, before accessing or interacting with the updated elements.

➩ Avoid infinite loops by setting a maximum amount of time to wait.

➩ Selenium also has implicit waits (WebDriver.implicitly_wait) and explicit waits (WebDriverWait with expected_conditions in selenium.webdriver.support), so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
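The manual polling loop described above can be sketched as follows. The function name `wait_until` and its parameters are made up for illustration; `condition` is assumed to be any function with no arguments that returns a truthy value once the page is ready.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() every interval seconds until it is truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # the event happened, hand back what was found
        if time.monotonic() >= deadline:
            # The maximum waiting time prevents an infinite loop.
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

For example, `wait_until(lambda: driver.find_elements("tag name", "td"))` would return the list of cells as soon as at least one exists, assuming `driver` is a Selenium WebDriver. Selenium's own `WebDriverWait(driver, 10).until(...)` follows the same idea.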

📗 Interact with Elements

➩ Selenium can interact with elements and update the DOM tree.

➩ WebElement.send_keys(t) enters text t into the element (input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
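A minimal sketch that combines the three methods to fill in a text field and press a button. The function name `fill_and_submit` and the element ids passed to it are hypothetical, and `driver` is assumed to be an existing Selenium WebDriver.

```python
def fill_and_submit(driver, field_id, text, button_id):
    """Type text into an input or textarea, then click a button."""
    field = driver.find_element("id", field_id)
    field.clear()          # remove whatever text is already in the field
    field.send_keys(text)  # type the new text
    driver.find_element("id", button_id).click()
```

Clearing first matters when the function is called repeatedly, because `send_keys` appends to the existing text instead of replacing it.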

Multiple Choice Example

➩ Code to loop through all choices: Notebook.

Password Example

➩ Code to look through all 4-digit passwords (running the full version takes several minutes): Notebook.
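The loop over all 4-digit passwords can be sketched without any browser code. The names `four_digit_codes`, `find_code`, and `is_correct` are made up for illustration; `is_correct` is assumed to be a function that enters one code on the page (for example with `send_keys` and `click`) and returns True when the page accepts it.

```python
def four_digit_codes():
    """Yield every code from "0000" to "9999", keeping the leading zeros."""
    for number in range(10_000):
        yield f"{number:04d}"


def find_code(is_correct):
    """Return the first code accepted by is_correct, or None if none is accepted."""
    for code in four_digit_codes():
        if is_correct(code):
            return code
    return None
```

The format `04d` is what keeps codes such as "0042" four characters long; looping over plain integers would try "42" instead.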

📗 Robots Exclusion Protocol

➩ Some websites disallow web crawling. The rules are specified in a robots.txt: Link.

➩ Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.

➩ urllib.robotparser can be used to check whether a website allows scraping: Doc.

➩ RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url.
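A minimal sketch of checking rules with urllib.robotparser. To keep it runnable without a network connection, the rules come from a string instead of a real website; the function name `make_parser`, the rules in `EXAMPLE_RULES`, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = make_parser(EXAMPLE_RULES)
print(parser.can_fetch("*", "https://example.com/index.html"))         # True
print(parser.can_fetch("*", "https://example.com/private/data.html"))  # False
```

For a live website, the parser can instead be pointed at the file with `set_url(...)` followed by `read()`, which downloads and parses the robots.txt.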

Additional Examples

➩ Suppose element is an HTML table WebElement with \(n\) rows and \(m\) columns. What is the code to find:

(1) first cell of first row? element.find_element("tag name", "tr").find_element("tag name", "td").text
(2) \(j\)th cell of first row? element.find_element("tag name", "tr").find_elements("tag name", "td")[j - 1].text
(3) first cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_element("tag name", "td").text
(4) \(j\)th cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_elements("tag name", "td")[j - 1].text

➩ The following table is filled by JavaScript (with a one-second delay). Scrape it and put the numbers in an array.

3	2	0
2	2	0
0	0	0
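One possible approach to this exercise, as a sketch only: keep re-reading the table until every cell contains text, then convert the cells to integers. The function name `scrape_numbers` and the `table_id` parameter are hypothetical, and the sketch assumes that the table element already exists on the page (only its cells start out empty) and that `driver` is an existing Selenium WebDriver.

```python
import time


def scrape_numbers(driver, table_id, timeout=5.0, interval=0.2):
    """Wait until every cell of the table has text, then return the cells as integers."""
    deadline = time.monotonic() + timeout
    while True:
        table = driver.find_element("id", table_id)
        rows = [
            [td.text.strip() for td in tr.find_elements("tag name", "td")]
            for tr in table.find_elements("tag name", "tr")
        ]
        rows = [row for row in rows if row]  # drop rows without td cells
        cells = [cell for row in rows for cell in row]
        if cells and all(cells):  # no cell is still empty
            return [[int(cell) for cell in row] for row in rows]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"table not filled within {timeout} seconds")
        time.sleep(interval)
```

The result is a list of lists, which can be passed to `numpy.array` if a NumPy array is needed.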

➩ What is the passcode?


Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link

Last Updated: April 23, 2025 at 2:49 AM