
# Lecture

📗 The lecture is in person, but you can join Zoom: 8:50-9:40 or 11:00-11:50. Zoom recordings can be viewed on Canvas -> Zoom -> Cloud Recordings. They will be moved to Kaltura over the weekends.
📗 The in-class (participation) quizzes should be submitted on TopHat (Code:741565), but you can submit your answers through Form at the end of the lectures too.
📗 The Python notebooks used during the lectures can also be found on: GitHub. They will be updated weekly.


# Lecture Notes

TopHat Game Again
➩ There will be 20 questions on the exam: 10 of them from past exams and quizzes, and 10 of them new questions (see Link for details). I will post \(n\) more questions next Monday that are identical or similar to \(n\) of the new questions on the exam.
➩ A: \(n = 0\)
➩ B: \(n = 1\) if more than 50 percent of you choose B.
➩ C: \(n = 2\) if more than 75 percent of you choose C.
➩ D: \(n = 3\) if more than 95 percent of you choose D.
➩ E: \(n = 0\)

📗 Supervised Machine Learning
➩ Supervised learning (data is labeled): use the data to figure out the relationship between the features and labels of the items, and apply the relationship to predict the label of a new item.
➩ If the labels are discrete (categories): classification.
➩ If the labels are continuous: regression.

 Classification and Regression Models
| Item | Input (Features) | Output (Labels) | Notes |
|---|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2}\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3}\) | - |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y'\) | guess \(y' = \hat{f}\left(x'\right)\) |


📗 Examples of Supervised Learning
| Input | Output (Labels) | Task |
|---|---|---|
| Images | cat or dog, turtle or rifle | object classification |
| Images | characters | handwriting recognition |
| Sound recordings | words (recognize speech vs wreck a nice beach) | voice recognition |
| Medical records | diagnosis | medical diagnosis |
| Email texts | spam or ham, offensive or not | spam detection |
| Review texts | positive or negative | sentiment analysis |
| Essays | A, AB, B, ..., F | - |


 Unsupervised Learning
➩ Unsupervised learning (data is unlabeled): use the data to find patterns and put items into groups.
➩ If the groups are discrete: clustering
➩ If the groups are continuous (lower dimensional representation): dimensionality reduction

| Item | Input (Features) | Notes |
|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |


📗 Examples of Unsupervised Learning
| Input | Groups | Example |
|---|---|---|
| News articles | similar articles | Google News |
| Photos | similar photos | Google Photos |
| Images | objects in the image | image segmentation |
| Words | similar words have similar representations | word embeddings |


 Feature Representation
➩ Items need to be represented by vectors of numbers: item \(i\) is \(\left(x_{i1}, x_{i2}, ..., x_{im}\right)\).
➩ Some features are already real numbers, but they sometimes need to be rescaled to \(\left[0, 1\right]\).
➩ Some features are categories, which can be converted to numbers using one-hot encoding.
➩ Image features and text features require additional preprocessing.
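
The rescaling and one-hot encoding steps can be sketched in plain Python. This is a minimal illustration, not the code used in lecture: the helper names `min_max_rescale` and `one_hot_encode` and the height/color data are made up for the example, and min-max scaling is assumed as the way to map values to \(\left[0, 1\right]\).

```python
def min_max_rescale(values):
    """Rescale a list of numbers linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(categories):
    """Return (sorted list of levels, one-hot rows) for a list of category labels."""
    levels = sorted(set(categories))
    index = {level: k for k, level in enumerate(levels)}
    rows = []
    for c in categories:
        row = [0] * len(levels)
        row[index[c]] = 1  # exactly one entry is 1, at the position of this category
        rows.append(row)
    return levels, rows


# Hypothetical items: (height in cm, color)
heights = [150.0, 160.0, 170.0, 200.0]
colors = ["red", "green", "red", "blue"]

scaled = min_max_rescale(heights)
levels, encoded = one_hot_encode(colors)

# Item i becomes one numeric vector: the rescaled height followed by the one-hot color.
items = [[s] + row for s, row in zip(scaled, encoded)]
print(levels)
for item in items:
    print(item)
```

Each item ends up with \(m = 4\) features here: one rescaled number and three one-hot entries, one per color.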

📗 Tokenization
➩ A text document needs to be split into words.
➩ Split the string on spaces and punctuation; regular expression rules can sometimes be used to do the splitting.
➩ Remove stop words: "the", "of", "a", "with", ...
➩ Stemming and lemmatization: reduce "looks", "looked", "looking" to "look".
nltk has functions for these steps: Link. For example, nltk.corpus.stopwords.words("english") returns a list of stop words in English, and nltk.stem.WordNetLemmatizer().lemmatize(word) lemmatizes the token word.
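
The three steps above (split, remove stop words, reduce to a base form) can be sketched without nltk. Everything here is an assumption made for illustration: the short `STOP_WORDS` set stands in for nltk's much longer list, and `crude_stem` is a toy suffix-stripping rule, not a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop word list (an assumption; nltk's English list is much longer).
STOP_WORDS = {"the", "of", "a", "with", "and", "is", "at"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Toy suffix stripping; NOT a real stemmer or lemmatizer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


text = "She looks at the sky, looked at a map, and is looking with care."
tokens = tokenize(text)
content = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content]
print(stems)
```

In this sentence "looks", "looked", and "looking" all map to "look". A rule this simple fails on many real words, which is why a library lemmatizer is normally used instead.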

Tokenization Example
➩ Read the course evaluations from a PDF file (use Link or Link), and find the most frequently used words in the course evaluations.
➩ Code to create the strings: Notebook.

📗 Vocabulary
➩ A word token is an occurrence of a word.
➩ A word type is a unique word as a dictionary entry.
➩ A vocabulary is a list of word types, typically 10000 or more word types. Sometimes <s> (start of sentence), </s> (end of sentence), and <unk> (out of vocabulary words) are included in the vocabulary.
➩ A corpus is a large collection of text documents (like a DataFrame).
➩ A document is a single text item in the corpus (like a row in a DataFrame).
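
To make the token/type distinction concrete, here is a small sketch on a made-up two-document corpus. The corpus, the whitespace-only tokenization, and the helper name `to_known` are assumptions for the example.

```python
from collections import Counter

# Hypothetical corpus: each string is one document.
corpus = [
    "the cat sat on the mat",
    "the dog sat",
]

# Tokens are occurrences, so repeated words are counted every time.
tokens = [word for document in corpus for word in document.split()]

# Types are the unique words; the vocabulary lists them in a fixed order.
vocabulary = sorted(set(tokens))


def to_known(word):
    """Map words outside the vocabulary to the special <unk> type."""
    return word if word in vocabulary else "<unk>"


print(len(tokens))                     # number of word tokens
print(len(vocabulary))                 # number of word types
print(Counter(tokens).most_common(1))  # most frequent word type and its token count
print([to_known(w) for w in "the bird sat".split()])
```

The corpus has 9 word tokens but only 6 word types, because "the" and "sat" occur more than once.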

 Bag of Words Features
➩ Bag of words features represent documents as an unordered collection of words: Link.
➩ Each document is represented by a row containing the number of occurrences of each word type in the vocabulary.
➩ For word type \(j\) and document \(i\), the feature is \(c\left(i, j\right)\), the number of times word \(j\) appears in document \(i\).
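
The count matrix \(c\left(i, j\right)\) can be built directly with a counter. This is a minimal sketch under simple assumptions (lowercasing, whitespace tokenization, alphabetical vocabulary order); the function name `bag_of_words` and the sample documents are made up for the example.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, matrix) where matrix[i][j] = c(i, j)."""
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted list of word types seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


documents = ["the cat sat on the mat", "the dog sat"]
vocabulary, counts = bag_of_words(documents)
print(vocabulary)
for row in counts:
    print(row)

# Word order does not matter: shuffled documents give the same rows.
_, shuffled_counts = bag_of_words(["mat the on sat cat the", "sat dog the"])
print(shuffled_counts == counts)
```

The last line illustrates the "unordered collection" property: only the counts are kept, not the positions of the words.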

📗 TF-IDF Features
➩ TF-IDF or Term Frequency Inverse Document Frequency features adjust for the fact that some words appear more frequently in all documents: Link.
➩ The term frequency of word type \(j\) in document \(i\) is defined as \(\text{tf} \left(i, j\right) = \dfrac{c\left(i, j\right)}{\displaystyle\sum_{j'} c\left(i, j'\right)}\) where \(c\left(i, j\right)\) is the number of times word \(j\) appears in document \(i\), and \(\displaystyle\sum_{j'} c\left(i, j'\right)\) is the total number of words in document \(i\).
➩ The inverse document frequency of word type \(j\) is defined as \(\text{idf} \left(i, j\right) = \log \left(\dfrac{n + 1}{n\left(j\right) + 1}\right)\) where \(n\left(j\right)\) is the number of documents containing word \(j\), and \(n\) is the total number of documents. It takes the same value for every document \(i\).
➩ For word type \(j\) and document \(i\), the feature is \(\text{tf} \left(i, j\right) \cdot \text{idf} \left(i, j\right)\).
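
The two formulas above translate almost line by line into code. This sketch takes a count matrix as input and assumes the natural logarithm, since the notes do not state the base; the function name `tf_idf` and the sample counts are made up for the example.

```python
import math


def tf_idf(counts):
    """counts[i][j] = c(i, j). Return the matrix of tf(i, j) * idf(i, j)."""
    n = len(counts)     # total number of documents
    m = len(counts[0])  # number of word types in the vocabulary
    # n(j): number of documents that contain word j at least once
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(m)]
    # Natural log is an assumption; the lecture formula does not name the base.
    idf = [math.log((n + 1) / (doc_freq[j] + 1)) for j in range(m)]
    features = []
    for row in counts:
        total = sum(row)  # total number of words in this document
        features.append(
            [(row[j] / total) * idf[j] if total else 0.0 for j in range(m)]
        )
    return features


# Hypothetical counts for the vocabulary ["cat", "dog", "mat", "on", "sat", "the"]
counts = [
    [1, 0, 1, 1, 1, 2],  # "the cat sat on the mat"
    [0, 1, 0, 0, 1, 1],  # "the dog sat"
]
features = tf_idf(counts)
for row in features:
    print([round(v, 4) for v in row])
```

Words that appear in every document ("sat", "the") get the value 0 because \(\log\left(\dfrac{n + 1}{n + 1}\right) = 0\), no matter how often they occur.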

📗 Scikit Learn
sklearn (scikit-learn) is a Python package for machine learning: Link.
sklearn.feature_extraction.text.CountVectorizer can be used to convert text documents to a bag of words matrix: Doc.
sklearn.feature_extraction.text.TfidfVectorizer can be used to convert text documents to a TF-IDF matrix: Doc.

Groot Example
➩ Vin Diesel's Guardians of the Galaxy scripts are summarized in the following table. Compute the bag of words features and TF-IDF features.

| Document | Phrase | Number of times |
|---|---|---|
| Guardians of the Galaxy | "I am Groot" | 13 |
| - | "We are Groot" | 1 |
| Guardians of the Galaxy Vol. 2 | "I am Groot" | 17 |
| Guardians of the Galaxy Vol. 3 | "I am Groot" | 13 |
| - | "I love you guys" | 1 |


➩ The bag of words features are:
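| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 13 | 13 | 14 | 1 | 1 | 0 | 0 | 0 |
| 2 | 17 | 17 | 17 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14 | 13 | 13 | 0 | 0 | 1 | 1 | 1 |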


➩ The TF-IDF features are:
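| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | \(\log(2) / 42\) | \(\log(2) / 42\) | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | \(\log(2) / 43\) | \(\log(2) / 43\) | \(\log(2) / 43\) |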


➩ For example, the TF-IDF of "Groot" in document 1 is \(\dfrac{14}{42} \log\left(\dfrac{4}{4}\right) = \dfrac{14}{42} \log\left(1\right) = 0\), and the TF-IDF of "love" in document 3 is \(\dfrac{1}{43} \log\left(\dfrac{4}{2}\right) = \dfrac{\log\left(2\right)}{43}\).
➩ Since "Groot" appears in all documents, its appearance in document 1 is not informative, so the feature has value 0. In contrast, "love" appears only in document 3, so its term frequency is multiplied by \(\log\left(2\right)\).
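
The Groot tables can be checked with a few lines of Python. This is a sketch, not the lecture notebook: the phrase counts are copied from the script table above, words are lowercased (so "I" becomes "i"), and the natural logarithm is assumed.

```python
import math
from collections import Counter

# (phrase, number of times) for each document, copied from the script table.
scripts = [
    [("I am Groot", 13), ("We are Groot", 1)],
    [("I am Groot", 17)],
    [("I am Groot", 13), ("I love you guys", 1)],
]
# Lowercased word types in the same column order as the feature tables.
vocabulary = ["i", "am", "groot", "we", "are", "love", "you", "guys"]

# Bag of words: add `times` to the count of every word in the phrase.
counts = []
for script in scripts:
    c = Counter()
    for phrase, times in script:
        for word in phrase.lower().split():
            c[word] += times
    counts.append([c[w] for w in vocabulary])

# TF-IDF with idf = log((n + 1) / (n(j) + 1)); natural log is an assumption.
n = len(counts)
m = len(vocabulary)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(m)]
tfidf = [
    [row[j] / sum(row) * math.log((n + 1) / (doc_freq[j] + 1)) for j in range(m)]
    for row in counts
]

for row in counts:
    print(row)
for row in tfidf:
    print([round(v, 4) for v in row])
```

The nonzero TF-IDF entries are exactly the words that appear in a single document: "we" and "are" in document 1, and "love", "you", and "guys" in document 3.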


 Notes and code adapted from the course taught by Yiyin Shen Link and Tyler Caraza-Harter Link






Last Updated: November 30, 2024 at 4:34 AM