📗 WebDriver.get(u) loads the web page with URL (uniform resource locator) u: Doc
📗 WebDriver.find_element(a, b) returns the first WebElement located by strategy a (for example "id", "name", or "tag name") with value b, and WebDriver.find_elements(a, b) returns the list of all matching WebElements: Doc
📗 Tables can be scraped by calling find_element("tag name", "table"), then iterating over its rows with find_elements("tag name", "tr") and over each row's cells with find_elements("tag name", "td").
📗 Alternatively, for static pages (without JavaScript modifications), pandas.read_html can return a list of DataFrames, one for each table on the web page.
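The row-and-cell iteration above can be sketched as a helper. Since the Selenium calls need a live browser, the parse_table_html fallback below uses only the standard library's html.parser on a static HTML string; all function and class names here are illustrative, not part of Selenium or pandas:

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell, self.cell = True, ""
    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)
        elif tag in ("td", "th"):
            self.in_cell = False
            self.row.append(self.cell.strip())
    def handle_data(self, data):
        if self.in_cell:
            self.cell += data

def parse_table_html(html):
    """Parse a static HTML table into a 2D list of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows

def scrape_table(driver):
    """Mirror the nested find_elements calls, assuming a live WebDriver."""
    table = driver.find_element("tag name", "table")
    return [[td.text for td in tr.find_elements("tag name", "td")]
            for tr in table.find_elements("tag name", "tr")]
```

On a static snippet, parse_table_html("&lt;table&gt;...&lt;/table&gt;") plays the same role as pandas.read_html, returning the rows as lists instead of a DataFrame.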
➩ The following table is filled by JavaScript (with a one-second delay), scrape it in and put the numbers in an array.
📗 WebDriver.save_screenshot("file.png") saves a screenshot of the web page to a file with name file.png.
📗 Sometimes the screenshot captures only a part of the page: Firefox has the option to take a screenshot of the full page using WebDriver.save_full_screenshot("file.png").
📗 Alternatively, a screenshot of a specific element can be saved using WebElement.screenshot("file.png").
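A small helper combining the two screenshot calls might look like the sketch below; save_page_image is a hypothetical name, and using hasattr to detect Firefox's extra method is an assumption, not an official Selenium idiom:

```python
def save_page_image(driver, filename):
    """Save a full-page screenshot when the driver supports it (Firefox),
    otherwise fall back to the standard viewport-only screenshot."""
    if hasattr(driver, "save_full_screenshot"):
        driver.save_full_screenshot(filename)  # entire page
    else:
        driver.save_screenshot(filename)       # visible viewport only
```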
📗 Sometimes, loading the page and running the JavaScript takes time.
📗 Use time.sleep(s) to wait s seconds, and use it inside a while loop to wait for an event to happen before accessing or interacting with the updated elements.
📗 Avoid infinite loops by setting a maximum amount of time to wait.
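The bounded polling loop described above could be sketched as follows; wait_for, timeout, and interval are illustrative names, not a Selenium API:

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value, sleeping
    interval seconds between checks; raise TimeoutError once timeout
    seconds have passed, so the loop can never run forever."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1f seconds" % timeout)
```

Passing a lambda such as lambda: driver.find_elements("tag name", "td") makes the loop wait until the table cells appear, since find_elements returns an empty (falsy) list until then.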
📗 selenium.webdriver.support has implicit and explicit waits, so that the driver keeps polling until a certain condition is met or a certain time has passed: Doc
📗 Selenium can interact with elements and update the DOM tree.
📗 WebElement.send_keys(t) enters text t into the element (input and textarea), and WebElement.clear() clears the text in the element, and WebElement.click() clicks the element (buttons): Doc.
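For example, filling in a search box and clicking a submit button might be sketched as below, assuming a live driver and a page with an input named "q" and a single button (both hypothetical):

```python
def fill_and_submit(driver, query):
    """Clear a text input, type a query into it, and click the button."""
    box = driver.find_element("name", "q")   # locate the input element
    box.clear()                              # remove any existing text
    box.send_keys(query)                     # type the query
    driver.find_element("tag name", "button").click()  # submit the form
```

After the click, the DOM tree may change, so any previously stored WebElement references on the page may become stale.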
📗 If the pages are nodes, and the links from one page to another are edges, the digraph formed by the pages may have unbounded depth and may contain cycles.
📗 To find a specific (goal) page, or to discover all pages reachable from the initial page, breadth first search should be used, since depth first search may not terminate on graphs with unbounded depth.
📗 Since there may be cycles, a set of visited pages should be kept so that no page is processed twice.
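A minimal sketch of such a crawl, assuming a get_links(page) function that returns the pages linked from page (with Selenium this would collect the href attributes of the &lt;a&gt; elements):

```python
from collections import deque

def bfs_path(start, goal, get_links):
    """Return a shortest list of pages from start to goal, or None.
    The visited set prevents revisiting pages when the graph has cycles."""
    visited = {start}
    queue = deque([[start]])          # each queue entry is a full path
    while queue:
        path = queue.popleft()        # FIFO order makes this breadth first
        if path[-1] == goal:
            return path
        for nxt in get_links(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None                       # goal not reachable from start
```

Replacing the deque with a stack (pop from the same end) would turn this into depth first search, which may never return on an unbounded graph.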
➩ Start from the "Data science" Wikipedia page: Link, following the links on the page and try to get to the "Cat" Wikipedia page: Link.
➩ The daily Wikipedia game: Link, and solution? Link.
📗 Searching the nodes in increasing order of the heuristic value is called greedy best first search (GBS, since the acronym BFGS is reserved for the non-linear optimization method, Link).
📗 Since the heuristic could be inaccurate, greedy best first search might not find the shortest path to the goal node.
📗 Searching the nodes in increasing order of the current distance from the initial node plus the heuristic value is called A search (the name of the search algorithm is "A").
📗 If the heuristic is admissible (it never overestimates the true remaining distance), A search finds a shortest path and is called A* search (A-star search).
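A sketch of A search with unit edge costs, assuming a neighbors(node) function and a heuristic h(node) (all names illustrative): with h always 0 it reduces to ordinary shortest-path search, and with an admissible h it behaves as A*:

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Expand nodes in order of (cost so far + heuristic estimate),
    assuming every edge has cost 1; return a path list or None."""
    frontier = [(h(start), 0, start, [start])]   # (priority, cost, node, path)
    best = {start: 0}                            # cheapest known cost per node
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt in neighbors(node):
            new_cost = cost + 1
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier,
                               (new_cost + h(nxt), new_cost, nxt, path + [nxt]))
    return None
```

Dropping the cost term from the priority (pushing just h(nxt)) would turn this into greedy best first search, which is faster to the goal but may return a longer path.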
📗 Notes and code adapted from the course taught by Professors Gurmail Singh, Yiyin Shen, Tyler Caraza-Harter.
📗 If there is an issue with TopHat during the lectures, please submit your answers on paper (include your Wisc ID and answers) or this Google form Link at the end of the lecture.
📗 Anonymous feedback can be submitted to: Form. Non-anonymous feedback and questions can be posted on Piazza: Link