# Lecture Notes
TopHat Game Again
➭ There will be 20 questions on the exam, 10 of them from past exams and quizzes, and 10 of them new questions (see Link for details). I will post \(n\) more questions next Monday that are identical or similar to \(n\) of the new questions on the exam.
➭ A: \(n = 0\)
➭ B: \(n = 1\) if more than 50 percent of you choose B.
➭ C: \(n = 2\) if more than 75 percent of you choose C.
➭ D: \(n = 3\) if more than 95 percent of you choose D.
➭ E: \(n = 0\)
📗 Supervised Machine Learning
➭ Supervised learning (data is labeled): use the data to figure out the relationship between the features and labels of the items, and apply the relationship to predict the label of a new item.
➭ If the labels are discrete (categories): classification.
➭ If the labels are continuous: regression.
📗 Classification and Regression Models
| Item | Input (Features) | Output (Labels) | Notes |
|---|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2}\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3}\) | - |
| ... | ... | ... | ... |
| \(n\) | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y'\) | guess \(y' = \hat{f}\left(x'\right)\) |
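➭ The workflow in the table above can be sketched in a few lines with `sklearn` (introduced later in these notes); this is a minimal sketch with made-up data, and the nearest neighbor classifier is just one possible choice of \(\hat{f}\), not part of the lecture.

```python
# A minimal sketch of the supervised workflow with toy data:
# learn the relationship on labeled items, then predict a new item's label.
from sklearn.neighbors import KNeighborsClassifier

x = [[0, 0], [0, 1], [1, 0], [1, 1]]  # n = 4 items, m = 2 features each
y = ["cat", "cat", "dog", "dog"]      # discrete labels -> classification

model = KNeighborsClassifier(n_neighbors=1)
model.fit(x, y)                       # figure out y ~ f-hat(x) from training data
print(model.predict([[0.9, 0.2]]))    # guess y' = f-hat(x') for a new item -> ['dog']
```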
📗 Examples of Supervised Learning
| Input | Output (Labels) | Task |
|---|---|---|
| Images | cat or dog, turtle or rifle | object classification |
| Images | characters | handwriting recognition |
| Sound recordings | words ("recognize speech" vs "wreck a nice beach") | voice recognition |
| Medical records | diagnosis | medical diagnosis |
| Email texts | spam or ham, offensive or not | spam detection |
| Review texts | positive or negative | sentiment analysis |
| Essays | A, AB, B, ..., F | - |
📗 Unsupervised Learning
➭ Unsupervised learning (data is unlabeled): use the data to find patterns and put items into groups.
➭ If the groups are discrete: clustering.
➭ If the groups are continuous (lower dimensional representation): dimensionality reduction.
| Item | Input (Features) | Notes |
|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| \(n\) | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |
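➭ A minimal clustering sketch matching the table above, assuming `sklearn` and made-up two-dimensional items:

```python
# A minimal sketch of clustering: similar x end up in the same group.
from sklearn.cluster import KMeans

x = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]  # unlabeled items
kmeans = KMeans(n_clusters=2, n_init=10).fit(x)
print(kmeans.labels_)  # e.g. [0 0 1 1]: the first two items form one group
```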
📗 Examples of Unsupervised Learning
| Input | Groups | Application |
|---|---|---|
| News articles | similar articles | Google News |
| Photos | similar photos | Google Photos |
| Images | objects in the image | image segmentation |
| Words | similar words have similar representation | word embeddings |
📗 Feature Representation
➭ Items need to be represented by a vector of numbers: item \(i\) is \(\left(x_{i1}, x_{i2}, ..., x_{im}\right)\).
➭ Some features are already real numbers; they sometimes need to be rescaled to \(\left[0, 1\right]\).
➭ Some features are categories; they can be converted to numbers using one-hot encoding.
➭ Image features and text features require additional preprocessing.
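➭ A minimal sketch of both conversions, assuming the `sklearn` preprocessing tools (the feature values are made up):

```python
# Rescale a numeric feature to [0, 1] and one-hot encode a categorical one.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

ages = np.array([[20.0], [35.0], [60.0]])        # numeric feature
print(MinMaxScaler().fit_transform(ages))        # [[0.], [0.375], [1.]]

colors = np.array([["red"], ["green"], ["red"]]) # categorical feature
# sparse_output=False requires scikit-learn >= 1.2 (older versions use sparse=False)
print(OneHotEncoder(sparse_output=False).fit_transform(colors))  # [[0, 1], [1, 0], [0, 1]]
```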
📗 Tokenization
➭ A text document needs to be split into words.
➭ Split the string on spaces and punctuation; sometimes regular expression rules are used.
➭ Remove stop words: "the", "of", "a", "with", ...
➭ Stemming and lemmatization: "looks", "looked", "looking" to "look".
➭ `nltk` has functions to do these: Link. For example, `nltk.corpus.stopwords.words("english")` returns a list of stop words in English, and `nltk.stem.WordNetLemmatizer().lemmatize(word)` lemmatizes the token `word`.
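➭ A minimal sketch putting these steps together; the example sentence is made up, and the data packages only need to be downloaded once:

```python
# Tokenize, remove stop words, and lemmatize with nltk.
import nltk
nltk.download("stopwords")  # stop word lists
nltk.download("wordnet")    # lemmatizer dictionary

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

text = "The cats looked at a turtle with the rifles."
tokens = text.lower().strip(".").split()  # split on spaces (no punctuation rules here)
stop = set(stopwords.words("english"))    # "the", "of", "a", "with", ...
tokens = [t for t in tokens if t not in stop]

lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])  # nouns: "cats" -> "cat", "rifles" -> "rifle"
print(lemmatizer.lemmatize("looked", pos="v"))    # verbs need pos="v": "looked" -> "look"
```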
Tokenization Example
➭ Read the course evaluations from a PDF file (use Link or Link), and find the most frequently used words in the course evaluations.
➭ Code to create the strings: Notebook.
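➭ A minimal sketch of the word counting part, assuming the PDF is read with the `pypdf` package (the Links and Notebook above may use a different library) and a hypothetical file name:

```python
# Read all pages of a PDF, tokenize crudely, and count word frequencies.
from collections import Counter
from pypdf import PdfReader

reader = PdfReader("evaluations.pdf")  # hypothetical file name
text = " ".join(page.extract_text() for page in reader.pages)
tokens = [t for t in text.lower().split() if t.isalpha()]  # drop numbers and punctuation
print(Counter(tokens).most_common(10))  # the ten most frequent words
```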
📗 Vocabulary
➭ A word token is an occurrence of a word.
➭ A word type is a unique word as a dictionary entry.
➭ A vocabulary is a list of word types, typically 10000 or more word types. Sometimes `<s>` (start of sentence), `</s>` (end of sentence), and `<unk>` (out of vocabulary words) are included in the vocabulary.
➭ A corpus is a larger collection of text (like a `DataFrame`).
➭ A document is a single item of text (like a row in a `DataFrame`).
📗 Bag of Words Features
➭ Bag of words features represent documents as an unordered collection of words: Link.
➭ Each document is represented by a row containing the number of occurrences of each word type in the vocabulary.
➭ For word type \(j\) and document \(i\), the feature is \(c\left(i, j\right)\), the number of times word \(j\) appears in document \(i\).
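➭ A minimal sketch of computing the counts \(c\left(i, j\right)\) with plain Python (toy documents, whitespace tokenization):

```python
# Build the bag of words matrix: one row per document, one column per word type.
from collections import Counter

docs = ["I am Groot I am Groot", "We are Groot"]
vocab = sorted(set(" ".join(docs).lower().split()))      # the word types
counts = [Counter(doc.lower().split()) for doc in docs]  # c(i, j) for each document
matrix = [[c[w] for w in vocab] for c in counts]
print(vocab)   # ['am', 'are', 'groot', 'i', 'we']
print(matrix)  # [[2, 0, 2, 2, 0], [0, 1, 1, 0, 1]]
```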
📗 TF-IDF Features
➭ TF-IDF or Term Frequency Inverse Document Frequency features adjust for the fact that some words appear frequently in all documents: Link.
➭ The term frequency of word type \(j\) in document \(i\) is defined as \(\text{tf} \left(i, j\right) = \dfrac{c\left(i, j\right)}{\displaystyle\sum_{j'} c\left(i, j'\right)}\) where \(c\left(i, j\right)\) is the number of times word \(j\) appears in document \(i\), and \(\displaystyle\sum_{j'} c\left(i, j'\right)\) is the total number of words in document \(i\).
➭ The inverse document frequency of word type \(j\) in document \(i\) is defined as \(\text{idf} \left(i, j\right) = \log \left(\dfrac{n + 1}{n\left(j\right) + 1}\right)\) where \(n\left(j\right)\) is the number of documents containing word \(j\), and \(n\) is the total number of documents.
➭ For word type \(j\) and document \(i\), the feature is \(\text{tf} \left(i, j\right) \cdot \text{idf} \left(i, j\right)\).
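➭ A direct implementation of these formulas (the lecture's version, not sklearn's variant, which is noted in the next section):

```python
# tf-idf(i, j) = c(i, j) / (total words in document i) * log((n + 1) / (n(j) + 1))
import math

def tf_idf(counts):
    """counts[i][j] = c(i, j); returns the matrix of tf-idf features."""
    n = len(counts)      # total number of documents
    m = len(counts[0])   # number of word types in the vocabulary
    n_j = [sum(1 for row in counts if row[j] > 0) for j in range(m)]  # document frequency
    idf = [math.log((n + 1) / (n_j[j] + 1)) for j in range(m)]
    return [[row[j] / sum(row) * idf[j] for j in range(m)] for row in counts]
```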
📗 Scikit Learn
➭ `sklearn` is the Python package for machine learning: Link.
➭ `sklearn.feature_extraction.text.CountVectorizer` can be used to convert text documents to a bag of words matrix: Doc.
➭ `sklearn.feature_extraction.text.TfidfVectorizer` can be used to convert text documents to a TF-IDF matrix: Doc.
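➭ A minimal usage sketch of both classes on toy documents; note that sklearn's TF-IDF differs slightly from the formulas above (by default it adds 1 to the idf and L2-normalizes each row):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I am Groot. I am Groot.", "We are Groot."]

bow = CountVectorizer()             # bag of words counts
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())  # the vocabulary; the default tokenizer drops one-letter words like "I"

tfidf = TfidfVectorizer()           # TF-IDF weights
print(tfidf.fit_transform(docs).toarray())
```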
Groot Example
➭ Vin Diesel's Guardians of the Galaxy scripts are summarized in the following table. Compute the bag of words features and TF-IDF features.
| Document | Phrase | Number of times |
|---|---|---|
| Guardians of the Galaxy | "I am Groot" | 13 |
| Guardians of the Galaxy | "We are Groot" | 1 |
| Guardians of the Galaxy Vol. 2 | "I am Groot" | 17 |
| Guardians of the Galaxy Vol. 3 | "I am Groot" | 13 |
| Guardians of the Galaxy Vol. 3 | "I love you guys" | 1 |
➭ The bag of words features are:
| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 13 | 13 | 14 | 1 | 1 | 0 | 0 | 0 |
| 2 | 17 | 17 | 17 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14 | 13 | 13 | 0 | 0 | 1 | 1 | 1 |
➭ The TF-IDF features are:
| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | log(2) / 42 | log(2) / 42 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | log(2) / 43 | log(2) / 43 | log(2) / 43 |
➭ For example, the TF-IDF of "Groot" in document 1 is \(\dfrac{13}{42} \log\left(\dfrac{4}{4}\right) = \dfrac{13}{42} \log\left(1\right) = 0\), and the TF-IDF of "love" in document 3 is \(\dfrac{1}{43} \log\left(\dfrac{4}{2}\right) = \dfrac{\log\left(2\right)}{43}\).
➭ Since "Groot" appears in all documents, its appearance in document 1 is not informative, so the feature has value 0, whereas "love" appears only in document 3, so it receives the multiplier \(\log\left(2\right)\).
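➭ These two entries can be checked numerically:

```python
import math

print(13 / 42 * math.log(4 / 4))  # TF-IDF of "Groot" in document 1 -> 0.0
print(1 / 43 * math.log(4 / 2))   # TF-IDF of "love" in document 3 -> ~0.0161
```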
📗 Notes and code adapted from the course taught by Yiyin Shen (Link) and Tyler Caraza-Harter (Link).