# Lecture Notes
Note on Table Scraping
➩ pandas.read_html(WebDriver.page_source) is deprecated, because passing a literal HTML string to pandas.read_html is deprecated: Doc.
➩ To scrape a table, either find the table with find_element("tag name", "table") and loop over its tr and td elements, or use pandas.read_html(url), where url is the link to the page (may need to call WebDriver.quit() first).
➩ It is also possible to create a "fake" file using io.BytesIO: this will be covered in the first visualization lecture.
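The loop-based approach can be sketched as follows. This is a minimal sketch, assuming driver is a Selenium WebDriver that has already loaded a page containing at least one table; the function name table_to_rows is hypothetical.

```python
def table_to_rows(driver):
    """Return the first table on the current page as a list of rows of cell text."""
    # "tag name" is the locator strategy string (the same value as By.TAG_NAME).
    table = driver.find_element("tag name", "table")
    rows = []
    for tr in table.find_elements("tag name", "tr"):
        # Header rows use th cells instead of td, so they produce an empty
        # list here and are skipped.
        cells = [td.text for td in tr.find_elements("tag name", "td")]
        if cells:
            rows.append(cells)
    return rows

# Usage with a real browser session (url is a placeholder):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(url)
#   rows = table_to_rows(driver)
#   driver.quit()
```

The result is a plain list of lists, which can be passed to pandas.DataFrame(rows) if a data frame is needed.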
Flask
➩ Flask will be used to create or modify web pages. It is useful for collecting data from visitors as they interact with the pages and for displaying that data on the pages: Link
➩ Flask is a simpler web framework than Django. Django is the more popular package.
📗 Flask Basics
➩ app = flask.Flask(...) creates a web app: Link
➩ @app.route("/") binds a function to the root URL (front page of the website).
➩ @app.route("/abc") binds a function to a specific URL path on the site (one page on the website, or a file).
➩ app.run(host="0.0.0.0", debug=False, threaded=False) runs the app. host="0.0.0.0" makes the server externally visible.
➩ For deployment, see Link.
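Putting these pieces together, a minimal app might look like the sketch below. The function names home and abc and the returned strings are placeholders, and the app.run call is left as a comment because it blocks until the server is stopped; Flask's test client is used instead to show what each route returns.

```python
import flask

app = flask.Flask(__name__)

@app.route("/")
def home():
    # Bound to the root URL (the front page).
    return "Front page"

@app.route("/abc")
def abc():
    # Bound to the path /abc.
    return "Page abc"

# To serve the site, call app.run once at the end of the script:
#   app.run(host="0.0.0.0", debug=False, threaded=False)

# The routes can be tried without starting a server by using the test client.
client = app.test_client()
print(client.get("/").get_data(as_text=True))     # Front page
print(client.get("/abc").get_data(as_text=True))  # Page abc
```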
Binding a Function to a URL Path
➩ @app.route("/index")
def index():
    return "Hello World!"
binds the index function to the page IP address/index, meaning it will display a web page that says "Hello World!".
➩ "Hello World!" can be replaced by any text or HTML string.
➩ The HTML string can be read from an existing HTML file and then modified in the index() function, for example,
with open("index.html") as f:
    return f.read()
➩ It can also be generated by packages such as pandas, for example, pandas.read_csv("data.csv").to_html().
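As an illustration of reading an HTML file and modifying it before returning it, here is a small sketch. To stay self-contained it first writes a stand-in index.html to a temporary folder; the file contents and the NAME placeholder are assumptions made for this example, not part of the course site.

```python
import pathlib
import tempfile

import flask

app = flask.Flask(__name__)

# Stand-in for an existing index.html (a real site would already have this file).
page_path = pathlib.Path(tempfile.mkdtemp()) / "index.html"
page_path.write_text("<html><body><h1>Hello NAME!</h1></body></html>")

@app.route("/index")
def index():
    with open(page_path) as f:
        html = f.read()
    # Modify the HTML string before sending it to the visitor.
    return html.replace("NAME", "World")
```

Plain string replacement is enough for small changes; anything more involved is usually easier with a template.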
📗 Binding Multiple Paths
➩ To bind multiple paths, variable rules can be added:
@app.route("/index/<x>")
def index(x):
    return f"Hello {x}"
will display a web page that says "Hello World" when the path IP address/index/World is used.
➩ The variable x can also be converted to another type before it is passed to the index(x) function.
| Route | Type | Description |
| --- | --- | --- |
| @app.route("/index/<x>") | string | Default |
| @app.route("/index/<int:x>") | int | Convert to Integer |
| @app.route("/index/<float:x>") | float | Convert to Float |
| @app.route("/index/<path:x>") | path | String but allows / |
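The converters in the table can be tried with a sketch like the one below. Each converter is given its own path here (/square, /half, /files are hypothetical names) so that the rules do not compete for the same URL. View functions must return a string or response, so numeric results are wrapped in str().

```python
import flask

app = flask.Flask(__name__)

@app.route("/index/<x>")
def index(x):
    # x arrives as a str (the default).
    return f"Hello {x}"

@app.route("/square/<int:x>")
def square(x):
    # x is already an int, so arithmetic works without calling int(x).
    return str(x * x)

@app.route("/half/<float:x>")
def half(x):
    # x is a float; write the URL with a decimal point, e.g. /half/3.0
    return str(x / 2)

@app.route("/files/<path:x>")
def files(x):
    # x is a str that may contain slashes, e.g. /files/a/b/c.txt
    return f"Requested {x}"
```

A URL that does not match the converter, such as /square/abc, is not routed to the function and results in a 404 response.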
📗 Redirecting Pages
➩ return flask.redirect(url) redirects the visitor from the current path to the URL url.
➩ return flask.redirect(flask.url_for("index")) redirects the visitor to the path bound to the function index().
➩ return flask.redirect(flask.url_for("index", x = "World")) redirects the visitor to the path bound to the function index(x) with x = "World", that is, index("World").
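A short sketch of both kinds of redirect, assuming an index(x) route like the one in the previous section. The paths /old and /away and the address https://example.com/ are placeholders.

```python
import flask

app = flask.Flask(__name__)

@app.route("/index/<x>")
def index(x):
    return f"Hello {x}"

@app.route("/old")
def old():
    # url_for builds the URL bound to index with x="World", i.e. /index/World.
    return flask.redirect(flask.url_for("index", x="World"))

@app.route("/away")
def away():
    # Redirect to an explicit URL (placeholder address).
    return flask.redirect("https://example.com/")
```

Using url_for instead of a hard-coded path means the redirect keeps working if the route string for index is changed later.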
Simple Web Example
➩ Build a simple website to keep track of visitors and display visitor names in a table.
➩ Code to scrape that website: Notebook.
TopHat Question
➩ Visit the Flask website and enter the visitor name or number on TopHat.
Collecting Visitor Information
➩ flask.request.remote_addr can be used to find the remote IP address of the visitor: Doc.
➩ IP addresses can then be used to look up visitor information such as service provider and location (country, state, city): Link.
➩ flask.request.user_agent.string can be used to find the user agent of the visitor.
➩ User agent information can be used to find the browser, operating system, and device: Link
➩ The visitors' IP addresses can then be stored in a global variable or saved to a file on the server to keep track of who visited each page and when.
📗 Rate Limiting
➩ One use of such visitor information is for rate limiting: preventing visitors from loading the pages too often, for example, to prevent web scraping.
➩ In this case, the visitor's IP address and visit time can be stored in a list: if the next visit comes too soon after the previous one, the visitor can be redirected to another page or, more commonly, sent an error response. For example, return flask.Response("...", status = 429, headers = {"Retry-After": "60"}) tells the visitor to retry after 60 seconds.
➩ Lists of response status codes and header fields can be found here: Link, Link. status = 429 means "Too Many Requests".
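One way to sketch this idea is shown below. It uses a dictionary from IP address to last visit time rather than a list, which is a simplification chosen for this example, and a 3 second interval; the names MIN_INTERVAL and last_visit are hypothetical.

```python
import time

import flask

app = flask.Flask(__name__)

MIN_INTERVAL = 3  # minimum number of seconds between visits from one IP
last_visit = {}   # maps IP address -> time of the most recent accepted visit

@app.route("/")
def home():
    ip = flask.request.remote_addr
    now = time.time()
    previous = last_visit.get(ip)
    if previous is not None and now - previous < MIN_INTERVAL:
        # Too soon: refuse with 429 and tell the visitor when to retry.
        return flask.Response(
            "Too many requests, please slow down.",
            status=429,
            headers={"Retry-After": str(MIN_INTERVAL)},
        )
    last_visit[ip] = now
    return "Welcome"
```

Only accepted visits update the stored time, so a visitor who keeps retrying too early is not pushed further back each time. Updating the time on every request would be a stricter alternative.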
Rate Limiting Example
➩ Build a website and rate limit to 3 seconds.
➩ Code to scrape that website: Notebook.
Notes and code adapted from the course taught by Yiyin Shen (Link) and Tyler Caraza-Harter (Link).