
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

Note on Table Scraping
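Passing a literal HTML string to pandas.read_html, as in pandas.read_html(WebDriver.page_source), is deprecated since pandas 2.1: Doc.
➩ To scrape a table, either find it with find_element("tag name", "table") and loop over its tr and td elements, or use pandas.read_html(url) where url is the link to the page (may need to WebDriver.quit() first).
➩ It is also possible to wrap the page source in a "fake" file, using io.StringIO for a string or io.BytesIO for bytes, and pass that to pandas.read_html: "fake" files will be covered in the first visualization lecture.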
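
The "fake" file option can be sketched as follows. This is a minimal illustration, not the course's notebook code: the HTML string is a made-up stand-in for WebDriver.page_source, and it assumes pandas has an HTML parser such as lxml installed.

```python
import io

import pandas as pd

# Hypothetical page content standing in for WebDriver.page_source.
page_source = """
<html><body>
<table>
  <tr><th>name</th><th>visits</th></tr>
  <tr><td>Alice</td><td>3</td></tr>
  <tr><td>Bob</td><td>5</td></tr>
</table>
</body></html>
"""

# Wrap the string in a file-like object instead of passing it directly.
tables = pd.read_html(io.StringIO(page_source))  # one DataFrame per <table>
visitors = tables[0]
print(visitors)
```

pandas.read_html returns a list, so the first table on the page is at index 0.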

Flask
➩ Flask will be used to create or modify web pages. It is useful for collecting data from visitors as they interact with the web pages and for displaying that data on the pages: Link
➩ Flask is a simpler web framework than Django. Django is a more popular package.

📗 Flask Basics
app = flask.Flask(...) to create a web app: Link
@app.route("/") binds a function to the root URL (front page of the website).
@app.route("/abc") binds a function to a specific URL path on the site (one page on the website, or a file).
app.run(host="0.0.0.0", debug=False, threaded=False) to run the app. host="0.0.0.0" makes the server externally visible.
➩ For deployment, see Link.
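
A minimal sketch of these pieces put together. The function names home, abc and start_server and the page contents are made up for illustration. app.run is wrapped in a function here only so that defining the app does not start a server.

```python
import flask

app = flask.Flask(__name__)  # create the web app


@app.route("/")
def home():
    # Shown on the front page of the website.
    return "<h1>Front page</h1>"


@app.route("/abc")
def abc():
    # Shown on the page at the path /abc.
    return "This is the page at /abc"


def start_server():
    # Hypothetical helper: call this to serve the app until interrupted.
    # host="0.0.0.0" makes the server externally visible.
    app.run(host="0.0.0.0", debug=False, threaded=False)
```

The routes can be tried without starting a server by using Flask's test client, for example app.test_client().get("/abc").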

Binding a Function to a URL Path
@app.route("/index") placed directly above def index(): return "Hello World!" binds the index function to the path /index, meaning that visiting IP address/index displays a web page that says "Hello World!".
➩ "Hello World!" can be replaced by any text or HTML string.
➩ The HTML string can be read from an existing HTML file and then modified inside the index() function, for example, with open("index.html") as f: return f.read().
➩ It can also be generated by packages such as pandas, for example, pandas.read_csv("data.csv").to_html().
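
Both ways of producing the HTML string can be sketched as below. The file contents, the NAME placeholder, the /table path and the DataFrame values are assumptions made up for this example. The HTML file is written to a temporary folder so the sketch is self-contained.

```python
import pathlib
import tempfile

import flask
import pandas as pd

app = flask.Flask(__name__)

# Hypothetical HTML file, created here so the example runs on its own.
html_path = pathlib.Path(tempfile.mkdtemp()) / "index.html"
html_path.write_text("<html><body><h1>Hello NAME!</h1></body></html>")


@app.route("/index")
def index():
    # Read the HTML file, then modify the string before returning it.
    with open(html_path) as f:
        html = f.read()
    return html.replace("NAME", "World")


@app.route("/table")
def table():
    # Generate an HTML table from a DataFrame (made-up data).
    df = pd.DataFrame({"name": ["Alice", "Bob"], "visits": [3, 5]})
    return df.to_html(index=False)
```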

📗 Binding Multiple Paths
➩ To bind multiple paths to one function, add a variable rule: @app.route("/index/<x>") above def index(x): return f"Hello {x}" displays a web page that says "Hello World" when the path IP address/index/World is visited.
➩ The variable x can also be converted to another type before it is passed to index(x) by adding a converter to the rule.

| Route | Type | Description |
| --- | --- | --- |
| @app.route("/index/<x>") | string | Default |
| @app.route("/index/<int:x>") | int | Convert to Integer |
| @app.route("/index/<float:x>") | float | Convert to Float |
| @app.route("/index/<path:x>") | path | String but allows / |
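
The converters in the table can be tried with a sketch like the one below. The paths /square, /half and /file and their functions are made up for illustration. markupsafe.escape (installed with Flask) is used so that text from the URL is not interpreted as HTML.

```python
import flask
from markupsafe import escape

app = flask.Flask(__name__)


@app.route("/index/<x>")
def index(x):
    # x arrives as a string; /index/World returns "Hello World".
    return f"Hello {escape(x)}"


@app.route("/square/<int:n>")
def square(n):
    # n is already an int, so arithmetic works directly.
    return str(n * n)


@app.route("/half/<float:v>")
def half(v):
    # The float converter only matches values with a decimal point.
    return str(v / 2)


@app.route("/file/<path:p>")
def file(p):
    # p may contain slashes, for example a/b/c.txt.
    return f"Requested {escape(p)}"
```

A path that does not match its converter, such as /square/abc, gets a 404 response instead of calling the function.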


📗 Redirecting Pages
return flask.redirect(url) redirects the visitor from the current path to the URL url.
return flask.redirect(flask.url_for("index")) redirects the visitor to the path bound to the function index().
return flask.redirect(flask.url_for("index", x = "World")) redirects the visitor to the path that calls index("World").
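
A short sketch of both kinds of redirect. The paths /old and /docs and the external URL are chosen only for illustration.

```python
import flask

app = flask.Flask(__name__)


@app.route("/index/<x>")
def index(x):
    return f"Hello {x}"


@app.route("/old")
def old():
    # url_for builds the path for index("World"), which is /index/World.
    return flask.redirect(flask.url_for("index", x="World"))


@app.route("/docs")
def docs():
    # Redirect to a full URL on another site.
    return flask.redirect("https://flask.palletsprojects.com/")
```

By default the response has status 302 and a Location header holding the target, which the browser then loads.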

Simple Web Example
➩ Build a simple website to keep track of visitors and display visitor names in a table.
➩ Code to create the website: Notebook.
➩ Code to scrape that website: Notebook.

TopHat Question
➩ Visit the Flask website and enter the visitor name or number on TopHat.

Collecting Visitor Information
flask.request.remote_addr can be used to find the remote IP address of the visitor: Doc.
➩ IP addresses can then be used to look up visitor information such as the service provider and location (country, state, city): Link.
flask.request.user_agent.string can be used to find the user agent of the visitor.
➩ User agent information can be used to identify the visitor's browser, operating system, and device: Link
➩ The visitors' IP addresses can then be stored in a global variable or saved to a file on the server to keep track of who visited each page and when.

📗 Rate Limiting
➩ One use of such visitor information is for rate limiting: preventing visitors from loading the pages too often, for example, to prevent web scraping.
➩ In this case, the visitor's IP address and visit time can be stored in a list: if the next visit comes too soon after the previous one, the visitor can be redirected to another page or, more commonly, sent an error response. For example, return flask.Response("...", status = 429, headers = {"Retry-After": "60"}) tells the visitor to retry after 60 seconds.
➩ Lists of response status codes and header fields can be found here: Link Link. Here status = 429 means "Too Many Requests".
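
A minimal sketch of this idea, not the code from the lecture notebook. It keeps the last accepted visit time per IP address in a dictionary instead of a list, and the 3-second interval and the names MIN_INTERVAL and last_visit are assumptions.

```python
import math
import time

import flask

app = flask.Flask(__name__)

MIN_INTERVAL = 3  # assumed minimum number of seconds between visits
last_visit = {}   # IP address -> time of the last accepted visit


@app.route("/")
def home():
    ip = flask.request.remote_addr
    now = time.time()
    previous = last_visit.get(ip)
    if previous is not None and now - previous < MIN_INTERVAL:
        # Too soon: reject and say how long to wait (whole seconds).
        wait = math.ceil(MIN_INTERVAL - (now - previous))
        return flask.Response(
            "Too many requests, please retry later.",
            status=429,
            headers={"Retry-After": str(wait)},
        )
    last_visit[ip] = now  # only accepted visits reset the timer
    return "Welcome!"
```

A scraper can respect the limit by reading the Retry-After header of a 429 response and sleeping for that many seconds before the next request.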

Rate Limiting Example
➩ Build a website and rate limit each visitor to one page load every 3 seconds.
➩ Code to create the website: Notebook.
➩ Code to scrape that website: Notebook.


Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: November 30, 2024 at 4:34 AM