# Lecture Notes
TopHat Game Again
➩ There will be 20 questions on the exam: 10 of them from past exams and quizzes, and 10 of them new questions (see Link for details). I will post \(n\) more questions next Monday that are identical or similar to \(n\) of the new questions on the exam.
➩ A: \(n = 0\)
➩ B: \(n = 1\) if more than 50 percent of you choose B.
➩ C: \(n = 2\) if more than 75 percent of you choose C.
➩ D: \(n = 3\) if more than 95 percent of you choose D.
➩ E: \(n = 0\)
📗 Supervised Machine Learning
➩ Supervised learning (data is labeled): use the data to figure out the relationship between the features and labels of the items, and apply the relationship to predict the label of a new item.
➩ If the labels are discrete (categories): classification.
➩ If the labels are continuous: regression.
📗 Classification and Regression Models

| Item | Input (Features) | Output (Labels) | Notes |
|---|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | \(y_{1}\) | training data |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | \(y_{2}\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | \(y_{3}\) | - |
| ... | ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | \(y_{n}\) | used to figure out \(y \approx \hat{f}\left(x\right)\) |
| new | \(\left(x'_{1}, x'_{2}, ..., x'_{m}\right)\) | \(y'\) | guess \(y' = \hat{f}\left(x'\right)\) |
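➩ A minimal sketch of this workflow (the tiny dataset and the choice of a nearest-neighbor classifier are illustrative assumptions, not part of the original notes): fit \(\hat{f}\) on the labeled training rows, then guess \(y'\) for a new \(x'\).

```python
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]   # n = 6 items, m = 2 features each
y_train = ["cat", "cat", "cat", "dog", "dog", "dog"]         # discrete labels: classification

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)              # figure out y ≈ f̂(x) from the training data

x_new = [[2, 2]]                         # a new, unlabeled item
print(model.predict(x_new))              # guess y' = f̂(x'), prints ['cat']
```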
📗 Examples of Supervised Learning
| Input | Output (Labels) | Task |
|---|---|---|
| Images | cat or dog, turtle or rifle | object classification |
| Images | characters | handwriting recognition |
| Sound recordings | words ("recognize speech" vs "wreck a nice beach") | voice recognition |
| Medical records | diagnosis | medical diagnosis |
| Email texts | spam or ham, offensive or not | spam detection |
| Review texts | positive or negative | sentiment analysis |
| Essays | A, AB, B, ..., F | - |
📗 Unsupervised Learning
➩ Unsupervised learning (data is unlabeled): use the data to find patterns and put items into groups.
➩ If the groups are discrete: clustering.
➩ If the groups are continuous (lower dimensional representation): dimensionality reduction.
| Item | Input (Features) | Notes |
|---|---|---|
| 1 | \(\left(x_{11}, x_{12}, ..., x_{1m}\right)\) | no label |
| 2 | \(\left(x_{21}, x_{22}, ..., x_{2m}\right)\) | - |
| 3 | \(\left(x_{31}, x_{32}, ..., x_{3m}\right)\) | - |
| ... | ... | ... |
| n | \(\left(x_{n1}, x_{n2}, ..., x_{nm}\right)\) | similar \(x\) in the same or similar groups |
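➩ A minimal clustering sketch (toy data and a hand-picked number of clusters, for illustration only): items with similar feature vectors end up in the same group without any labels being given.

```python
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]    # features only, no labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)    # e.g. [0 0 0 1 1 1]: similar x assigned to the same group
```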
📗 Examples of Unsupervised Learning
| Input | Groups | Application |
|---|---|---|
| News articles | similar articles | Google News |
| Photos | similar photos | Google Photos |
| Images | objects in the image | image segmentation |
| Words | similar words have similar representation | word embeddings |
📗 Feature Representation
➩ Items need to be represented by a vector of numbers: item \(i\) is \(\left(x_{i1}, x_{i2}, ..., x_{im}\right)\).
➩ Some features are already real numbers, but sometimes they need to be rescaled to \(\left[0, 1\right]\).
➩ Some features are categories, which can be converted using one-hot encoding (see the sketch below).
➩ Image features and text features require additional preprocessing.
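➩ A short sketch of the two conversions mentioned above, rescaling to \(\left[0, 1\right]\) and one-hot encoding, using made-up values (the `sparse_output` argument assumes scikit-learn 1.2 or newer):

```python
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

ages = [[18], [30], [66]]                    # a numeric feature
print(MinMaxScaler().fit_transform(ages))    # rescaled to [0, 1]: [[0.], [0.25], [1.]]

colors = [["red"], ["green"], ["red"]]       # a categorical feature
enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(colors))             # one-hot: [[0. 1.], [1. 0.], [0. 1.]]
print(enc.categories_)                       # column order: ['green', 'red']
```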
📗 Tokenization
➩ A text document needs to be split into words.
➩ Split the string by spaces and punctuation; sometimes regular expression rules are used.
➩ Remove stop words: "the", "of", "a", "with", ...
➩ Stemming and lemmatization: "looks", "looked", "looking" to "look".
➩ `nltk` has functions for these steps: Link. For example, `nltk.corpus.stopwords.words("english")` returns a list of stop words in English, and `nltk.stem.WordNetLemmatizer().lemmatize(word)` lemmatizes the token `word`.
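➩ A small sketch of these steps (assuming the `punkt`, `stopwords`, and `wordnet` resources have already been downloaded with `nltk.download`):

```python
import nltk

text = "The cats looked at the dogs."
tokens = nltk.word_tokenize(text.lower())                  # split into word tokens
stop = set(nltk.corpus.stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop]
print(tokens)                                              # ['cats', 'looked', 'dogs']

lemmatizer = nltk.stem.WordNetLemmatizer()
print([lemmatizer.lemmatize(t) for t in tokens])           # ['cat', 'looked', 'dog']
print(lemmatizer.lemmatize("looked", pos="v"))             # 'look' (verbs need pos="v")
```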
📗 Tokenization Example
➩ Read the course evaluations from a PDF file (use Link or Link), and find the most frequently used words in the course evaluations.
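➩ One possible way to do this (the PDF library in the linked example may differ; `pypdf` and the file name `evals.pdf` are assumptions here):

```python
from collections import Counter
from pypdf import PdfReader

reader = PdfReader("evals.pdf")                                    # placeholder file name
text = " ".join(page.extract_text() or "" for page in reader.pages)
words = [w for w in text.lower().split() if w.isalpha()]           # crude tokenization
print(Counter(words).most_common(10))                              # 10 most frequent words
```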
📗 Vocabulary
➩ A word token is an occurrence of a word.
➩ A word type is a unique word as a dictionary entry.
➩ A vocabulary is a list of word types, typically 10000 or more word types. Sometimes `<s>` (start of sentence), `</s>` (end of sentence), and `<unk>` (out-of-vocabulary words) are included in the vocabulary.
➩ A corpus is a larger collection of text (like a `DataFrame`).
➩ A document is one unit of text (like a row in a `DataFrame`).
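➩ A tiny illustration (made-up sentence) of the token and type counts, with the optional special entries added to the vocabulary:

```python
tokens = "i am groot i am groot we are groot".split()   # 9 word tokens
types = sorted(set(tokens))                             # 5 word types
vocabulary = ["<s>", "</s>", "<unk>"] + types           # optional special entries included
print(len(tokens), len(types))                          # 9 5
print(vocabulary)
```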
📗 Bag of Words Features
➩ Bag of words features represent documents as an unordered collection of words: Link.
➩ Each document is represented by a row containing the number of occurrences of each word type in the vocabulary.
➩ For word type \(j\) and document \(i\), the feature is \(c\left(i, j\right)\), the number of times word \(j\) appears in document \(i\).
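➩ A minimal sketch of computing the counts \(c\left(i, j\right)\) for two made-up documents:

```python
from collections import Counter

docs = ["i am groot i am groot", "we are groot"]
vocab = sorted(set(" ".join(docs).split()))                        # word types in the vocabulary
counts = [[Counter(d.split())[w] for w in vocab] for d in docs]    # c(i, j)
print(vocab)     # ['am', 'are', 'groot', 'i', 'we']
print(counts)    # [[2, 0, 2, 2, 0], [0, 1, 1, 0, 1]]
```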
📗 TF-IDF Features
➩ TF-IDF, or Term Frequency Inverse Document Frequency, features adjust for the fact that some words appear frequently in all documents: Link.
➩ The term frequency of word type \(j\) in document \(i\) is defined as \(\text{tf} \left(i, j\right) = \dfrac{c\left(i, j\right)}{\displaystyle\sum_{j'} c\left(i, j'\right)}\) where \(c\left(i, j\right)\) is the number of times word \(j\) appears in document \(i\), and \(\displaystyle\sum_{j'} c\left(i, j'\right)\) is the total number of words in document \(i\).
➩ The inverse document frequency of word type \(j\) in document \(i\) is defined as \(\text{idf} \left(i, j\right) = \log \left(\dfrac{n + 1}{n\left(j\right) + 1}\right)\) where \(n\left(j\right)\) is the number of documents containing word \(j\), and \(n\) is the total number of documents.
➩ For word type \(j\) and document \(i\), the feature is \(\text{tf} \left(i, j\right) \cdot \text{idf} \left(i, j\right)\).
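➩ A direct translation of these formulas into code (natural log is assumed for \(\log\)):

```python
import math

def tf_idf(counts):
    """counts[i][j] is c(i, j); returns the matrix of tf(i, j) * idf(i, j)."""
    n = len(counts)                                        # total number of documents
    n_j = [sum(1 for row in counts if row[j] > 0)          # n(j): documents containing word j
           for j in range(len(counts[0]))]
    return [[(c / sum(row)) * math.log((n + 1) / (n_j[j] + 1))
             for j, c in enumerate(row)] for row in counts]

print(tf_idf([[2, 0], [1, 1]]))   # [[0.0, 0.0], [0.0, 0.2027...]] since idf of word 0 is log(1) = 0
```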
📗 Scikit Learn
➩ `sklearn` is the Python package for machine learning: Link.
➩ `sklearn.feature_extraction.text.CountVectorizer` can be used to convert text documents to a bag of words matrix: Doc.
➩ `sklearn.feature_extraction.text.TfidfVectorizer` can be used to convert text documents to a matrix of TF-IDF features: Doc.
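➩ A short usage sketch; note that `TfidfVectorizer` uses a slightly different (smoothed, L2-normalized) formula than the one defined above, so its numbers will not match the hand computation exactly, and the default tokenizer drops one-letter words such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I am Groot", "We are Groot", "I love you guys"]

cv = CountVectorizer()                         # bag of words counts
print(cv.fit_transform(docs).toarray())
print(cv.get_feature_names_out())              # the vocabulary (word types)

tv = TfidfVectorizer()                         # TF-IDF features (sklearn's variant)
print(tv.fit_transform(docs).toarray().round(2))
```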
📗 Groot Example
➩ Vin Diesel's Guardians of the Galaxy scripts are summarized in the following table. Compute the bag of words features and TF-IDF features.

| Document | Phrase | Number of times |
|---|---|---|
| Guardians of the Galaxy | "I am Groot" | 13 |
| Guardians of the Galaxy | "We are Groot" | 1 |
| Guardians of the Galaxy Vol. 2 | "I am Groot" | 17 |
| Guardians of the Galaxy Vol. 3 | "I am Groot" | 13 |
| Guardians of the Galaxy Vol. 3 | "I love you guys" | 1 |
➩ The bag of words features are:
| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 13 | 13 | 14 | 1 | 1 | 0 | 0 | 0 |
| 2 | 17 | 17 | 17 | 0 | 0 | 0 | 0 | 0 |
| 3 | 14 | 13 | 13 | 0 | 0 | 1 | 1 | 1 |
➩ The TF-IDF features are:
| Items \ Features | "I" | "am" | "Groot" | "we" | "are" | "love" | "you" | "guys" |
|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | log(2) / 42 | log(2) / 42 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | log(2) / 43 | log(2) / 43 | log(2) / 43 |
➩ For example, the TF-IDF of "Groot" in document 1 is \(\dfrac{13}{42} \log\left(\dfrac{4}{4}\right) = \dfrac{13}{42} \log\left(1\right) = 0\), and the TF-IDF of "love" in document 3 is \(\dfrac{1}{43} \log\left(\dfrac{4}{2}\right) = \dfrac{\log\left(2\right)}{43}\).
➩ Since "Groot" appears in all documents, its appearances in document 1 are not informative, so the feature has value 0, whereas "love" appears only in document 3, so its term frequency is multiplied by \(\log\left(2\right)\).
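➩ A sketch that reproduces both tables using the lecture formulas (natural log assumed; unlike `sklearn`'s defaults, "I" is kept as a word type here):

```python
import math
from collections import Counter

docs = [("i am groot " * 13 + "we are groot " * 1).split(),     # Guardians of the Galaxy
        ("i am groot " * 17).split(),                           # Vol. 2
        ("i am groot " * 13 + "i love you guys " * 1).split()]  # Vol. 3
vocab = ["i", "am", "groot", "we", "are", "love", "you", "guys"]

counts = [[Counter(d)[w] for w in vocab] for d in docs]         # bag of words features
print(counts)   # [[13, 13, 14, 1, 1, 0, 0, 0], [17, 17, 17, 0, ...], [14, 13, 13, 0, 0, 1, 1, 1]]

n = len(docs)
n_j = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
tfidf = [[(c / sum(row)) * math.log((n + 1) / (n_j[j] + 1))
          for j, c in enumerate(row)] for row in counts]
print([round(v, 4) for v in tfidf[0]])   # log(2)/42 ≈ 0.0165 for "we" and "are", 0 elsewhere
```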
Notes and code adapted from the course taught by Yiyin Shen (Link) and Tyler Caraza-Harter (Link).