📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.
📗 Part II Outline
➭ Data collection:
(1) Scraping data from websites.
(2) Create websites to collect visitor information.
➭ Data visualization:
(1) Low dimensional data sets.
(2) High dimensional data sets (more in Part III).
(3) Graph data sets.
(4) Map data sets.
➭ Data pre-processing for machine learning:
(1) Images and image features.
(2) Text representation and search (regex).
➭ Data Analysis and Machine Learning (Part III)
TopHat Question
➭ Select the choice you think is the least popular (the one you think the least number of people will select).
➭ A
➭ B
➭ C
➭ D
📗 Document Object Model
➭ Every web page is a tree.
➭ Elements (nodes of the DOM (document object model) tree) may contain attributes, text, and other elements.
➭ JavaScript can directly edit the DOM tree.
➭ The browser renders (displays) the DOM tree, based on the original HTML (hyper-text markup language) file and any JavaScript changes.
📗 Web Scraping
➭ requests module (CS220).
➭ requests can fetch .html, .js, and other files.
➭ selenium module (CS320).
➭ Selenium can fetch .html, .js, and other files, run the JavaScript in a browser, and grab the HTML version of the DOM after JavaScript has modified it.
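The difference between the two modules can be sketched as two small helpers. This is only an illustration: the function names fetch_static and fetch_rendered are hypothetical, the requests package must be installed separately, and driver is assumed to be an already created Selenium WebDriver.

```python
def fetch_static(url):
    """Return the HTML exactly as the server sent it (no JavaScript is run)."""
    import requests  # third-party package, imported here so the sketch loads without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def fetch_rendered(driver, url):
    """Return the HTML of the DOM after the browser has loaded the page."""
    driver.get(url)
    # page_source reflects the current DOM, including changes made by JavaScript so far.
    return driver.page_source


# Possible usage (assumes Chrome and a matching driver are installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# html = fetch_rendered(driver, "https://example.com")
# driver.quit()
```

Scripts that run after a delay may not have finished when page_source is read, so the rendered HTML can still be incomplete right after the page loads.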
📗 Finding Elements on Webpages
➭ WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc
➭ WebDriver.find_element(a, b) returns the first WebElement whose attribute a has value b, and WebDriver.find_elements(a, b) returns a list of all such WebElements: Doc
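A minimal sketch of the two calls, assuming driver is an existing Selenium WebDriver. The helper names link_targets and heading_text are invented for this example. One practical difference: find_element raises NoSuchElementException when nothing matches, while find_elements returns an empty list, so the second helper uses the list form to handle a missing element.

```python
def link_targets(driver, url):
    """Load url and return (text, href) for every <a> element on the page."""
    driver.get(url)
    links = driver.find_elements("tag name", "a")  # a list, possibly empty
    return [(a.text, a.get_attribute("href")) for a in links]


def heading_text(driver):
    """Return the text of the first <h1> on the current page, or None if there is none."""
    found = driver.find_elements("tag name", "h1")
    return found[0].text if found else None
```

The strings "id", "name", and "tag name" are the values of By.ID, By.NAME, and By.TAG_NAME in selenium.webdriver.common.by, so either spelling can be passed as the first argument.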
📗 Common Attributes
➭ WebDriver.find_element("id", id) locates the element by its unique ID.
➭ WebDriver.find_element("name", name) locates the element by its name attribute, but multiple elements can have the same name.
➭ WebDriver.find_element("tag name", tag) locates the element by its tag; some of the common tags include:
📗 Scraping Tables
➭ Tables can be scraped by calling find_element("tag name", "table"), then iterating over the rows from find_elements("tag name", "tr") and the cells in each row from find_elements("tag name", "td").
➭ Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table, on the web page.
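The row-by-row approach can be sketched as follows, assuming table is the WebElement returned by find_element("tag name", "table"). The function name scrape_table is made up for this example, and cell values are kept as strings.

```python
def scrape_table(table):
    """Return the table as a list of rows, each a list of cell strings."""
    rows = []
    for tr in table.find_elements("tag name", "tr"):
        cells = [td.text for td in tr.find_elements("tag name", "td")]
        if cells:  # header rows built from <th> have no <td> cells, so skip them
            rows.append(cells)
    return rows


# Possible usage with a live driver:
# table = driver.find_element("tag name", "table")
# data = scrape_table(table)
```

Converting the strings to numbers (for example with int or float) is left to the caller, since it depends on what the table contains.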
Table Example
➭ Code to scrape a table (without pandas): Notebook, Notebook.
TopHat Discussion
📗 Why are the numbers in the table different from what you see? How to fix it?
📗 Screenshots
➭ WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file with name file.png.
➭ Sometimes the screenshot captures only a part of the page: Firefox has the option to take a screenshot of the full page using WebDriver.save_full_page_screenshot("file.png").
➭ Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
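Since the full-page option exists only on the Firefox driver (as save_full_page_screenshot in Selenium 4, to the best of my knowledge), one way to write browser-independent code is to fall back to the ordinary screenshot when that method is missing. This is a sketch; capture_page is a hypothetical helper name.

```python
def capture_page(driver, path):
    """Save a full-page screenshot when the driver supports it, else the visible part."""
    full_page = getattr(driver, "save_full_page_screenshot", None)
    if full_page is not None:
        return full_page(path)  # Firefox (GeckoDriver) only
    return driver.save_screenshot(path)  # available on every WebDriver


# Possible usage with a live driver:
# capture_page(driver, "file.png")
```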
Screenshot Example
➭ To take screenshot with ChromeDriver: Notebook.
➭ To take a full-page screenshot with GeckoDriver: Notebook.
📗 Polling
➭ Sometimes, loading the page and running the JavaScript takes time.
➭ Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen, before accessing or interacting with the updated elements.
➭ Avoid infinite loops by setting a maximum amount of time to wait.
➭ Selenium has implicit waits (WebDriver.implicitly_wait(s)) and explicit waits (WebDriverWait in selenium.webdriver.support.ui), so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
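The manual polling loop described above can be sketched with only the standard library. The name wait_until and its default timeout and interval are choices made for this example, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            # The maximum wait is what prevents an infinite loop.
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Possible usage with a live driver: wait until at least one table cell exists.
# cells = wait_until(lambda: driver.find_elements("tag name", "td"), timeout=5)
```

Because find_elements returns an empty list (which is falsy) until the cells appear, it works directly as the condition, and the list of cells is returned once it is non-empty.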
📗 Interact with Elements
➭ Selenium can interact with elements and update the DOM tree.
➭ WebElement.send_keys(t) enters text t into the element (input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
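Put together, the three calls are enough to fill in and submit a simple form. In this sketch the element IDs "answer" and "submit" are hypothetical, as is the helper name submit_answer; a real page will use its own IDs.

```python
def submit_answer(driver, text, box_id="answer", button_id="submit"):
    """Type text into the input with ID box_id, then click the button with ID button_id."""
    box = driver.find_element("id", box_id)
    box.clear()  # remove any text left over from a previous attempt
    box.send_keys(text)
    driver.find_element("id", button_id).click()
```

Clearing before typing matters when the same input is reused in a loop, because send_keys appends to whatever text is already in the element.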
Multiple Choice Example
➭ Code to loop through all choices: Notebook.
Password Example
➭ Code to loop through all 4-digit passwords (running the full version takes several minutes): Notebook.
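The enumeration part of such a search can be sketched without a browser. The names all_passcodes and find_passcode are invented here, and is_correct stands for whatever check the page requires (for example, typing the code, clicking a button, and reading the response text), which is an assumption about how the page behaves.

```python
def all_passcodes(digits=4):
    """Yield every passcode of the given length as a zero-padded string."""
    for n in range(10 ** digits):
        yield f"{n:0{digits}d}"  # 7 becomes "0007" when digits is 4


def find_passcode(is_correct, digits=4):
    """Return the first passcode accepted by is_correct, or None if none is accepted."""
    for code in all_passcodes(digits):
        if is_correct(code):
            return code
    return None
```

Zero padding is the detail that is easy to miss: looping over range(10000) and converting with str would never try codes such as "0042".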
📗 Robots Exclusion Protocol
➭ Some websites disallow web crawling. The rules are specified in a robots.txt: Link.
➭ Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.
➭ urllib.robotparser can be used to check whether a website allows scraping: Doc.
➭ RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url.
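A small sketch of can_fetch, using rules supplied as a string so that it runs without network access. EXAMPLE_RULES and make_parser are made up for this illustration, and example.com is a placeholder domain; for a live site, RobotFileParser.set_url(...) followed by read() downloads the real robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every user agent is barred from paths under /private/.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def make_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(EXAMPLE_RULES)
print(parser.can_fetch("*", "https://example.com/index.html"))
print(parser.can_fetch("*", "https://example.com/private/data.html"))
```

Checking can_fetch before each driver.get or requests.get call is a simple way to keep a scraper within the published rules.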
Additional Examples
➭ Suppose element is an HTML table WebElement with \(n\) rows and \(m\) columns, what is the code to find:
(1) first cell of first row? element.find_element("tag name", "tr").find_element("tag name", "td").text
(2) \(j\)th cell of first row? element.find_element("tag name", "tr").find_elements("tag name", "td")[j - 1].text
(3) first cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_element("tag name", "td").text
(4) \(j\)th cell of \(i\)th row? element.find_elements("tag name", "tr")[i - 1].find_elements("tag name", "td")[j - 1].text
➭ The following table is filled by JavaScript (with a one-second delay). Scrape it and put the numbers in an array.
Table values: 3, 2, 0, 2, 2, 0, 0, 0, 0, 1
➭ Two of the choices are correct, which ones?
Choices: A B C D E F
➭ What is the passcode?
📗 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link