
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

📗 Part II Outline
➭ Data collection:
(1) Scraping data from websites.
(2) Creating websites to collect visitor information.
➭ Data visualization:
(1) Low dimensional data sets.
(2) High dimensional data sets (more in Part III).
(3) Graph data sets.
(4) Map data sets.
➭ Data pre-processing for machine learning:
(1) Images and image features.
(2) Text representation and search (regex).
➭ Data Analysis and Machine Learning (Part III)

TopHat Question ➭ Select the choice you think is the least popular (the one you think the least number of people will select).
➭ A
➭ B
➭ C
➭ D



📗 Document Object Model
➭ Every web page is represented as a tree.
➭ Elements, the nodes of the DOM (document object model) tree, may contain attributes, text, and other elements.
➭ JavaScript can directly edit the DOM tree.
➭ The browser renders (displays) the DOM tree, which is built from the original HTML (HyperText Markup Language) file plus any JavaScript changes.
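
To make the tree structure concrete, here is a minimal sketch that uses Python's built-in html.parser to print the nesting of a small HTML string. The sample page and the names TreePrinter and outline are made up for illustration, and the sketch assumes every element is explicitly closed; a real browser also handles void elements such as img and repairs malformed markup.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect an indented outline of the elements in an HTML string."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Show the attributes next to the tag, indented by nesting depth.
        attr_text = "".join(f' {key}="{value}"' for key, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up one level in the tree.
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + repr(text))


def outline(html):
    """Return one line per element or text node, indented by depth."""
    parser = TreePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


# Hypothetical page used only for this illustration.
page = (
    '<html><body><h1 id="title">Hello</h1>'
    '<p>Some <a href="x.html">link</a></p></body></html>'
)
print("\n".join(outline(page)))
```

Each extra level of indentation in the printed outline corresponds to one level deeper in the DOM tree.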

📗 Web Scraping
➭ requests module (CS220).
➭ requests can fetch .html, .js, and other files exactly as the server sends them.
➭ selenium module (CS320).
➭ Selenium can fetch .html, .js, and other files, run the JavaScript in a browser, and grab the HTML version of the DOM after JavaScript has modified it.
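
The difference between the two modules can be sketched as two small helpers. The function names are hypothetical, and the driver argument is assumed to be an already-created Selenium WebDriver (for example webdriver.Chrome()). The requests import is placed inside the function only so that the sketch loads even if that package is not installed.

```python
def fetch_static_html(url):
    """Return the HTML exactly as the server sent it (no JavaScript is run)."""
    import requests  # third-party package; imported lazily (an assumption of this sketch)

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx / 5xx responses
    return response.text


def fetch_rendered_html(driver, url):
    """Return the HTML of the DOM as the browser currently sees it."""
    driver.get(url)
    # page_source reflects the DOM at the moment of the call, so JavaScript
    # that runs later (for example after a timer) may not be included yet.
    return driver.page_source
```

For a page whose content is filled in by JavaScript, the first helper would typically return the unfilled template, while the second returns the modified DOM, provided the script has finished running.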

📗 Install Selenium
➭ Selenium: Link.
➭ ChromeDriver: Link.
➭ Firefox GeckoDriver: Link.



📗 Finding Elements on Webpages
➭ WebDriver.get(u) loads the web page at URL (uniform resource locator) u: Doc
➭ WebDriver.find_element(a, b) returns the first WebElement whose locator a (for example "id", "name", or "tag name") matches the value b, and WebDriver.find_elements(a, b) returns a list of all matching WebElements (an empty list if there are none): Doc

📗 Common Attributes
➭ WebDriver.find_element("id", id) locates the element by its unique ID.
➭ WebDriver.find_element("name", name) locates the element by its name attribute; multiple elements can share the same name, in which case the first match is returned.
➭ WebDriver.find_element("tag name", tag) locates the element by its tag. Some common tags include:

| Tag | Element | Example |
| --- | --- | --- |
| a | hyperlink | Link |
| button | button | |
| code | code | code |
| div or span | section | span |
| h1 to h6 | headings | |
| img and video | image and video | |
| input and textarea | text fields | |
| ol or ul | ordered or unordered list | li is an item |
| p | paragraph | |
| select | drop-down list | |
| table | table | tr is a row, and td is a cell in the table |
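
As a small illustration of locating elements by tag and by ID, the sketch below collects the target of every hyperlink on the loaded page and reads the text of one element. The helper names are hypothetical, and driver is assumed to be a Selenium WebDriver that has already loaded a page with WebDriver.get.

```python
def link_targets(driver):
    """Return the href of every <a> element on the currently loaded page."""
    hrefs = (a.get_attribute("href") for a in driver.find_elements("tag name", "a"))
    # Anchors without an href attribute give None and are skipped.
    return [href for href in hrefs if href]


def text_by_id(driver, element_id):
    """Return the visible text of the element with the given ID."""
    return driver.find_element("id", element_id).text
```

The locator strings "tag name" and "id" are the same ones used in the list above.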




📗 Scraping Tables
➭ Tables can be scraped by calling find_element("tag name", "table"), then iterating over the rows with find_elements("tag name", "tr") and over the cells of each row with find_elements("tag name", "td").
➭ Alternatively, for static pages (without JavaScript modifications), pandas.read_html returns a list of DataFrames, one for each table on the web page.
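
The row-and-cell iteration described above can be sketched as follows. The function name is hypothetical, driver is assumed to be a Selenium WebDriver with the page already loaded, and the sketch only reads the first table on the page.

```python
def scrape_table(driver):
    """Return the first table on the page as a list of rows of cell text."""
    table = driver.find_element("tag name", "table")
    rows = []
    for tr in table.find_elements("tag name", "tr"):
        cells = [td.text for td in tr.find_elements("tag name", "td")]
        if cells:  # header rows built from <th> contain no <td> cells
            rows.append(cells)
    return rows
```

All values come back as strings, so numeric columns still need to be converted (for example with int or float) before analysis.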

Table Example
➭ Code to scrape a table (without pandas): Notebook, Notebook.

TopHat Discussion 📗 Why are the numbers in the table different from what you see? How to fix it?

📗 Screenshots
➭ WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file named file.png.
➭ Sometimes the screenshot captures only part of the page. Firefox has the option to take a screenshot of the full page using WebDriver.save_full_page_screenshot("file.png").
➭ Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
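
One way to write code that works with either browser is to check whether the driver offers a full-page method and fall back to the regular screenshot otherwise. This is a sketch under the assumption that Selenium 4 is used, where the Firefox driver exposes the full-page method as save_full_page_screenshot; the helper name is hypothetical.

```python
def save_page_screenshot(driver, path):
    """Save a full-page screenshot if the driver supports it, else the visible part."""
    full_page = getattr(driver, "save_full_page_screenshot", None)
    if callable(full_page):
        return full_page(path)  # Firefox (GeckoDriver) only
    return driver.save_screenshot(path)  # available on every WebDriver
```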

Screenshot Example ➭ To take a screenshot with ChromeDriver: Notebook.
➭ To take a full-page screenshot with GeckoDriver: Notebook.



📗 Polling
➭ Sometimes, loading the page and running the JavaScript takes time.
➭ Use time.sleep(s) to wait s seconds; call it inside a while loop to wait for an event to happen before accessing or interacting with the updated elements.
➭ Avoid infinite loops by setting a maximum amount of time to wait.
➭ Selenium also has implicit waits (WebDriver.implicitly_wait) and explicit waits (WebDriverWait in selenium.webdriver.support.ui), so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
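
The manual polling loop with a time limit can be sketched as a small generic helper. The name wait_until and its default timeout and interval are assumptions for illustration; condition is any zero-argument function that returns a truthy value once the page is ready.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # Give up instead of looping forever.
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a Selenium driver, a call could look like wait_until(lambda: driver.find_elements("id", "result")), which returns the non-empty list of matching elements as soon as one appears; the ID "result" is a placeholder. Selenium's own explicit wait, WebDriverWait(driver, 10).until(...), follows the same pattern.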

📗 Interact with Elements
➭ Selenium can interact with elements and update the DOM tree.
➭ WebElement.send_keys(t) enters text t into the element (input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
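
Putting the three calls together, a typical interaction clears a text field, types into it, and clicks a button. The helper name and the element IDs passed to it are hypothetical; driver is assumed to be a Selenium WebDriver with the page already loaded.

```python
def submit_text(driver, input_id, button_id, text):
    """Type text into the field with ID input_id, then click the button with ID button_id."""
    box = driver.find_element("id", input_id)
    box.clear()  # remove any text left over from a previous attempt
    box.send_keys(text)
    driver.find_element("id", button_id).click()
```

Calling clear() first matters when the same field is filled repeatedly in a loop, because send_keys appends to whatever text is already there.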

Multiple Choice Example
➭ Code to loop through all choices: Notebook.

Password Example
➭ Code to loop through all 4-digit passwords (running the full version takes several minutes): Notebook.

📗 Robots Exclusion Protocol
➭ Some websites disallow web crawling. The rules are specified in a robots.txt file at the root of the site: Link.
➭ Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.
➭ urllib.robotparser can be used to check whether a website allows a given URL to be fetched: Doc.
➭ RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url.
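
A minimal sketch of checking rules with urllib.robotparser is shown below. To avoid any network access, the rules are a made-up robots.txt passed to RobotFileParser.parse; for a real site one would instead call set_url with the address of its robots.txt and then read(). The helper name and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
SAMPLE_RULES = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, useragent="*"):
    """Return True if useragent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(useragent, url)


print(is_allowed(SAMPLE_RULES, "https://example.com/public/page.html"))
print(is_allowed(SAMPLE_RULES, "https://example.com/private/page.html"))
```

Checking before each request keeps a scraper within the rules the site has published.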

Additional Examples ➭ Suppose element is a WebElement for an HTML table with \(n\) rows and \(m\) columns. What is the code to find the text of the:
(1) first cell of first row? element.find_element("tag name", "tr").find_element("tag name", "td").text
(2) \(j\)th cell of first row? element.find_element("tag name", "tr").find_elements("tag name", "td")[j - 1].text
(3) first cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_element("tag name", "td").text
(4) \(j\)th cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_elements("tag name", "td")[j - 1].text


➭ The following table is filled by JavaScript (with a one-second delay). Scrape it and put the numbers in an array.
3 2 0
2 2 0
0 0 0
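
One possible approach to this exercise combines polling with table scraping: keep re-reading the cells until none of them is empty, then convert the text to integers. This is a sketch, not the notebook solution; the function name and the timeout values are assumptions, and table is assumed to be the WebElement of the table above.

```python
import time


def read_int_table(table, timeout=5.0, interval=0.2):
    """Poll until every cell of table has text, then return the rows as lists of ints."""
    deadline = time.monotonic() + timeout
    while True:
        rows = [
            [td.text.strip() for td in tr.find_elements("tag name", "td")]
            for tr in table.find_elements("tag name", "tr")
        ]
        rows = [row for row in rows if row]  # skip rows without <td> cells
        if rows and all(cell for row in rows for cell in row):
            return [[int(cell) for cell in row] for row in rows]
        if time.monotonic() >= deadline:
            raise TimeoutError("table was not filled in time")
        time.sleep(interval)
```

The returned nested list can be passed to numpy.array if a NumPy array is needed.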


➭ Two of the choices are correct. Which ones?
Choices:
A
B
C
D
E
F

➭ What is the passcode?





📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: April 29, 2024 at 1:10 AM