Prev: W5, Next: W7

Zoom: Link, TopHat: Link (936525), GoogleForm: Link, Piazza: Link, Feedback: Link, GitHub: Link, Sec1&2: Link



# Slides and Notes

📗 From sections 1 and 2:
➩ Web slides: Link.
➩ Web notes: Link
➩ Scraping: Link

# Document Object Model

➩ Every web page is a tree.
➩ Elements, the nodes of the DOM (document object model) tree, may contain attributes, text, and other elements.
➩ JavaScript can directly edit the DOM tree.
➩ The browser renders (displays) the DOM tree, based on the original HTML (hyper-text markup language) file and any JavaScript changes.

# Install Selenium

📗 Selenium: Link.
📗 ChromeDriver: Link.
📗 Firefox GeckoDriver: Link.

# Finding Elements on Webpages

📗 WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc
📗 WebDriver.find_element(a, b) returns the first WebElement whose attribute a has value b, and WebDriver.find_elements(a, b) returns a list of all matching WebElements: Doc
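The same attribute-based lookup can be sketched without a browser. The snippet below mimics find_element and find_elements on a small, invented (X)HTML string using only the standard library; the helper names and the page content are made up for illustration, not part of Selenium's API:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page standing in for a real DOM tree (invented content).
HTML = """
<html>
  <body>
    <h1 id="title">Example</h1>
    <p class="note">first paragraph</p>
    <p class="note">second paragraph</p>
  </body>
</html>
"""

root = ET.fromstring(HTML)

def find_element(attr, value):
    """Return the first element whose attribute attr equals value,
    analogous to WebDriver.find_element(attr, value)."""
    for node in root.iter():
        if node.get(attr) == value:
            return node
    return None

def find_elements_by_tag(tag):
    """Return all elements with the given tag,
    analogous to WebDriver.find_elements("tag name", tag)."""
    return list(root.iter(tag))

print(find_element("id", "title").text)   # prints "Example"
print(len(find_elements_by_tag("p")))     # prints 2
```

On a live page, Selenium performs this traversal inside the browser over the rendered DOM, including any JavaScript changes, which a static parse cannot see.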

# Common Attributes

📗 WebDriver.find_element("id", id) locates the element by its unique ID.
📗 WebDriver.find_element("name", name) locates the element by its name, but multiple elements can share the same name.
📗 WebDriver.find_element("tag name", tag) locates the element by its tag; some of the common tags include:

| Tag | Element | Notes |
| --- | --- | --- |
| a | hyperlink | |
| button | button | |
| code | code | |
| div or span | section | |
| h1 to h6 | headings | |
| img and video | image and video | |
| input and textarea | text fields | |
| ol or ul | ordered or unordered list | li is an item |
| p | paragraph | |
| select | drop-down list | |
| table | table | tr is a row, and td is a cell in the table |


# Scraping Tables

📗 Tables can be scraped by locating the table with find_element("tag name", "table"), then iterating over its find_elements("tag name", "tr") rows and, within each row, its find_elements("tag name", "td") cells.
📗 Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table on the web page.
➩ The following table is filled by JavaScript (with a one-second delay); scrape it and put the numbers into an array.
3 2 0
2 2 0
0 0 0
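For a static table, the row-and-cell iteration can be sketched with the standard library alone. The HTML string below is a hand-written stand-in for the table above; with Selenium, the same tr/td structure would come from find_elements on the live page:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect td text into rows, mirroring the tr/td iteration done with Selenium."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []            # start a new row of cells
        elif tag == "td":
            self.in_td = True        # the next text data is a cell value

    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.row.append(int(data.strip()))

html = ("<table>"
        "<tr><td>3</td><td>2</td><td>0</td></tr>"
        "<tr><td>2</td><td>2</td><td>0</td></tr>"
        "<tr><td>0</td><td>0</td><td>0</td></tr>"
        "</table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [[3, 2, 0], [2, 2, 0], [0, 0, 0]]
```

Because this table is filled by JavaScript with a delay, a static parse of the original HTML would see empty cells; on the live page the scrape must happen after the script has run (see Polling below).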



# Screenshots

📗 WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file named file.png.
📗 Sometimes the screenshot captures only part of the page: Firefox has the option to take a screenshot of the full page using WebDriver.save_full_screenshot("file.png").
📗 Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").

# Polling

📗 Sometimes, loading the page and running the JavaScript takes time.
📗 Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen, before accessing or interacting with the updated elements.
📗 Avoid infinite loops by setting a maximum amount of time to wait.
📗 selenium.webdriver.support has implicit and explicit waits, so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
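A minimal version of the sleep-inside-a-while-loop pattern, with a timeout to avoid an infinite loop, might look like the following. The condition function here is a stand-in; in practice it would check find_elements for the element being waited on:

```python
import time

def poll_until(condition, timeout=10, interval=0.5):
    """Repeatedly call condition() until it returns True or timeout seconds pass.

    Returns True if the condition was met, False if the wait timed out,
    so the caller never loops forever."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False

# Stand-in condition: pretend the element appears on the third check.
state = {"polls": 0}
def ready():
    state["polls"] += 1
    return state["polls"] >= 3

print(poll_until(ready, timeout=5, interval=0.01))  # prints True
```

Selenium's own WebDriverWait with expected_conditions does the same polling with less code, and is usually preferable on real pages.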

# Interact with Elements

📗 Selenium can interact with elements and update the DOM tree.
📗 WebElement.send_keys(t) enters text t into the element (for input and textarea), WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (for buttons): Doc.


# Robots Exclusion Protocol

📗 Some websites disallow web crawling. The rules are specified in a robots.txt: Link.
📗 Google: Txt, YouTube: Txt, Facebook: Txt, Instagram: Txt, Twitter: Txt, UW Madison: Txt.
📗 urllib.robotparser can be used to check whether a website allows scraping: Doc.
📗 RobotFileParser.can_fetch(useragent, url) returns True if the useragent (for example, "*") is allowed to fetch url.
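urllib.robotparser is usually pointed at a live robots.txt with set_url and read, but the same checks can be demonstrated offline by feeding rules to parse directly. The rules and domain below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched with
# rp.set_url("https://example.com/robots.txt") followed by rp.read().
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/index.html"))  # prints True
print(rp.can_fetch("*", "https://example.com/private/x"))   # prints False
```

Checking can_fetch before each WebDriver.get keeps a crawler within the site's stated rules.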

# Access Links

📗 WebDriver.find_elements("tag name", "a") finds all the hyperlinks on the page.
📗 Use url = WebElement.get_attribute("href") to get the URL of the hyperlink, then use WebDriver.get(url) to load that page.
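The href-collection step can also be sketched offline: the parser below gathers every hyperlink target from a static HTML string (the string is invented; with Selenium, the equivalent is get_attribute("href") on each element found by tag "a"):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag,
    like WebElement.get_attribute("href") on each found link."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

page = '<p><a href="a.html">A</a> and <a href="b.html">B</a></p>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['a.html', 'b.html']
```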
➩ Follow this link: Link, and find the page with an image (two images, since there is also a UW Madison logo). What search strategy will you use?

# Infinite Graph Search

📗 If the pages are nodes, and links from one page to another are edges, the digraph formed by the pages can have infinite depth and may contain cycles.
📗 To find a specific (goal) page, or to discover all pages reachable from the initial page, breadth first search should be used (depth first search may not terminate on trees with infinite depth).
📗 Since there can be cycles, a set of visited pages should be kept so that no page is visited twice.
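On a small made-up link graph, breadth first search with a visited set looks like this (the graph and page names are invented; on real pages the neighbor list would come from the scraped hyperlinks):

```python
from collections import deque

# Hypothetical link structure: page -> pages it links to (note the A <-> C cycle).
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["A", "D"],
    "D": [],
}

def bfs(start, goal):
    """Breadth first search; returns the shortest path from start to goal, or None."""
    visited = {start}
    queue = deque([[start]])          # queue of paths, shortest first
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:    # skip pages already seen, so cycles terminate
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(bfs("A", "D"))  # prints ['A', 'B', 'D']
```

Without the visited set, the A to C to A cycle would make the search loop forever.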
➩ Start from the "Data science" Wikipedia page: Link, following the links on the page and try to get to the "Cat" Wikipedia page: Link.
➩ The daily Wikipedia game: Link, and solution? Link.

# Search Heuristics

📗 A search heuristic is an estimate of how close the current node is to the goal node in the search tree.
📗 The heuristic may not be an accurate estimate of the true distance from the current node to the goal node.
📗 A heuristic that never overestimates the true distance is called an admissible heuristic.

# Informed Search

📗 Searching the nodes in the order according to the heuristic is called best first greedy search (GBS, since BFGS is reserved for the non-linear optimization method, Link).
📗 Since the heuristic could be incorrect, it might not find the shortest path to the goal node.
📗 Searching the nodes in the order according to the current distance from the initial node plus the heuristic is called A search (the name of the search algorithm is "A").
📗 If the heuristic is admissible, A search is called A* search (A-star search).

# Priority Queue

📗 For GBS search, use a Priority Queue with the priority based on the heuristic: Doc.
📗 For A search, use a Priority Queue with the priority based on current distance plus the heuristic.
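Both searches can be sketched with heapq, the only difference being the priority. The weighted graph, heuristic values, and names below are invented for illustration (the heuristic shown happens to be admissible, so the non-greedy variant is A* here):

```python
import heapq

# Hypothetical weighted link graph and heuristic estimates of distance to "G".
graph = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
h = {"S": 4, "A": 5, "B": 1, "G": 0}

def a_search(start, goal, greedy=False):
    """GBS if greedy (priority = h); A search otherwise (priority = distance + h)."""
    frontier = [(h[start], 0, start, [start])]   # (priority, distance, node, path)
    visited = set()
    while frontier:
        _, dist, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, dist
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph[node]:
            priority = h[nxt] if greedy else dist + w + h[nxt]
            heapq.heappush(frontier, (priority, dist + w, nxt, path + [nxt]))
    return None

print(a_search("S", "G"))  # A search finds S -> B -> G with total cost 5
```

With an admissible heuristic, the first time the goal is popped from the priority queue, the path found is guaranteed shortest; GBS offers no such guarantee.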
➩ Follow this link: Link, and find the page with an image (two images, since there is also a UW Madison logo). What is the image?


# Questions?



📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or through this Google form: Link, at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link






Last Updated: March 31, 2026 at 12:33 AM