If you haven’t seen HBO’s sci-fi/western/dystopian hit series Westworld, you shouldn’t have trouble finding a friend or coworker who is crazy about it. I, for one, would certainly be put in that category by my friends. The show has a way of drawing people in, whether through a plot that weaves complex themes (good and evil, human and machine, ethics in the face of new technology) together with allusions to classic biblical and mythological stories, through the beautiful cinematography, or through the sheer “acting Olympics” it takes to do the script justice. But underlying these attributes is one more way that Westworld attracts its loyal audience: constant repetition.

From the very first episode, viewers of Westworld are introduced to a variety of hosts who, day in and day out, follow tightly woven loops. They repeat many of the same phrases and follow similar actions until the day or week resets, or until they get killed by one of the park’s many guests. These loops certainly intrigued me on my first viewing of the show, perhaps, as Dr. Robert Ford quips, because we humans exhibit behavior consistent with this looping pattern. But the series goes much deeper into the idea of repetition than simply showing the hosts’ loops; one striking feature is how much repetition Westworld exhibits in scripted lines and visual shots across episodes and even between characters. In the wake of Season 2’s finale, I wanted to explore this repetition.

What follows is my attempt to scrape the text of Westworld episodes from the web, transform this raw data into a more usable form, and analyze it for repeated lines. I will caveat this post by saying this was a learning experience for me (I have essentially no prior experience in Natural Language Processing), but I hope that most users of the programming language R can find some new insight in my code. At the end of the day, we get a few cool data visualizations that provide some answers and, in true Westworld fashion, even more questions.

The poster for Westworld Season 2.

Get the data!

First step: we need to grab all of the text from Westworld episodes. Thankfully, I found a wonderful site that has this information. Even more conveniently, each line of the show is separate and easily scraped. Side note: web scraping is a wonderful thing that I have used quite frequently since learning it. You’d be surprised how convenient it is to be able to grab almost anything off the internet with a few lines of code. I learned to do it with CSS selectors (found with the Google Chrome plug-in “SelectorGadget”) and the R package rvest, following this tutorial on Analytics Vidhya.

Since the site has each episode as a separate link, I first grab all of the links on its Westworld page that correspond to episodes, save those URLs as a vector, and then parse the data into a list with the help of purrr::map. In case of future trouble with the website, I save the data (and I highly recommend this step!).

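## load in packages (and one github repo)
# library(devtools)
# install_github("andreacirilloac/paletter")
library(rvest)     # read_html(), html_nodes(), html_attr(), html_text()
library(dplyr)     # %>%, mutate(), filter(), bind_rows(), as_tibble()
library(purrr)     # map()
library(stringr)   # str_count()
library(ggplot2)   # plotting
library(tm)        # removeWords(), removePunctuation(), stopwords()
library(paletter)  # create_palette()

## Webscrape data from links at
url_home <- ""

## grab urls of episodes from home url
urls_of_episodes <- url_home %>% 
  read_html() %>%
  html_nodes(".topictitle") %>%
  html_attr("href") %>%
  sub(pattern = "^\\./", replacement = "", x = .) %>%  # drop the leading "./" (escaped and anchored, so only that prefix matches)
  sub(pattern = "&sid.*", replacement = "", x = .)     # drop the trailing session id

## scrape lines from urls (19 episodes, 12446 total lines)
data <- urls_of_episodes %>%
  map(~ html_text(html_nodes(read_html(.x), "#pagecontent p"))) 
## save in case of future corruption  
save(data, urls_of_episodes, file="westworld.RData")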
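
To make the two link clean-up substitutions concrete, here is a small base-R illustration. The hrefs are made up for the example (the real ones are not shown in this post), and `clean_href` is a hypothetical helper name. It also shows why an unescaped `"./"` pattern is risky: in a regular expression a bare dot matches any character.

```r
# Hypothetical hrefs, in the style a forum index page might return
hrefs <- c("./viewtopic.php?f=1&t=100&sid=abc123",
           "./viewtopic.php?f=1&t=101&sid=def456")

# Strip the leading "./" and everything from "&sid" onward
clean_href <- function(x) {
  x <- sub("^\\./", "", x)   # anchored, with the dot escaped
  sub("&sid.*", "", x)
}

cleaned <- clean_href(hrefs)

# An unanchored, unescaped "./" matches ANY character followed by "/",
# so it can also eat part of a path segment
risky <- gsub("./", "", "./forum/viewtopic.php")
```

Here `cleaned` holds the two links without the prefix or session id, while `risky` loses the "m/" of "forum/" as well.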

Great, now that we have the data in a nice list format, what more could be left to do? Turns out, a lot. To do any analysis on text, we first need our data in a data frame (or a tibble, as I tend to prefer), and we need a text column with speaker tags, punctuation, and other non-spoken text removed. I generally use the dplyr package for data frame manipulation, which along with a few gsub calls does the trick. Base R also has some very useful functions for text manipulation, such as trimws for removing leading and trailing white space and tolower for making a string all lowercase; the tm package supplies removeWords, removePunctuation, and stopwords, which I use later on.
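
As a quick illustration of those cleaning steps, here is a minimal base-R sketch applied to a few toy lines written in the style of the transcript. The wrapper name `clean_line` is hypothetical; the regular expressions mirror the ones used in this post.

```r
# Toy input in the style of the transcript (not the scraped data itself)
raw <- c("(Theme music playing)",
         "Dolores: Yes. I'm terrified.",
         "Dolores: Yes.")

clean_line <- function(line) {
  # A line containing a parenthetical becomes NA as a whole,
  # because the replacement is NA rather than ""
  out <- gsub("\\([^\\)]+\\)", NA_character_, line)
  out <- gsub(".*: ", "", out)             # drop the speaker tag
  out <- gsub("[\\.\\',!\\?-]", "", out)   # drop punctuation
  tolower(trimws(out))                     # trim, then lowercase
}

cleaned <- clean_line(raw)
```

The stage direction ends up as NA, and the two spoken lines become "yes im terrified" and "yes".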

The most time-consuming part of this project came when I realized that the transcripts also contained lines from some of the pre-episode recaps. This necessitated a manual rewatch of the beginning of every episode to determine where the actual episode began, since keeping any recaps would heavily inflate the repetition present in the show. I also couldn’t think of any way to remove these lines except with a for loop (shudder), but I suppose they do have their uses.

## make data into a tibble (dataframe)
data.dt <- data %>% 
  lapply(., FUN = data.frame, stringsAsFactors = FALSE) %>%
  setNames(c(paste0("s01e",1:10), paste0("s02e",1:9)) ) %>%
  bind_rows(.id = "episode") %>%
  as_tibble() %>%
  setNames(c("episode", "line"))

## add columns: a cleaned line (no speaker tag, punctuation, or capitalization) and its word count
## note: replacement = NA turns any line that contains a parenthetical into NA
data.dt <- data.dt %>%
  mutate(line_clean = gsub(pattern = "\\([^\\)]+\\)", x = as.character(line), replacement = NA)) %>%
  mutate(line_clean = gsub(pattern = ".*: ", x = line_clean, replacement = "")) %>%
  mutate(line_clean = gsub("[\\.\\',!\\?-]", x=line_clean, replacement = "")) %>%
  mutate(line_clean = trimws(line_clean)) %>%
  mutate(line_clean = tolower(line_clean)) %>% 
  mutate(n_words = str_count(line_clean, "\\S+"))

## Manually code line corresponding to the end of recap
recap_info <- tibble(season.episode = unique(data.dt$episode),
                     end_of_recap = c(0,  0,  17, 0,  0,  0, 0, 0,  0, 0,
                                  79, 18, 22, 23, 38, 9, 9, 10, 8)  )

## add line number within each episode to the tibble
data.dt.norecap.pre <- data.dt %>% 
  group_by(episode) %>% 
  mutate(ep_line_num = row_number()) %>%
  ungroup()

## Remove beginning of episode recaps via filtering
## (start from NULL, which bind_rows() skips; seeding with integer(0) in my
##  original run is probably what left the all-NA first row in the printout below)
data.dt.norecap <- NULL
for(row in 1:nrow(recap_info)) {
  piece <- data.dt.norecap.pre %>%
    filter(episode == recap_info$season.episode[row]) %>%
    filter(ep_line_num > recap_info$end_of_recap[row])
  data.dt.norecap <- bind_rows(data.dt.norecap, piece)
}

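For what it's worth, the recap-removal loop above could probably also be written without a loop, by joining the recap table onto the lines and filtering once. This is only a sketch on toy data: the tibbles `lines` and `recap` are invented for the example, and with the real tables the join key would be `by = c("episode" = "season.episode")`.

```r
library(dplyr)

# Toy stand-ins for the transcript tibble and the manual recap table
lines <- tibble(
  episode = c("s01e3", "s01e3", "s01e3", "s02e1", "s02e1"),
  line    = c("recap a", "recap b", "real 1", "recap c", "real 2")
)
recap <- tibble(
  episode      = c("s01e3", "s02e1"),
  end_of_recap = c(2L, 1L)
)

no_recap <- lines %>%
  group_by(episode) %>%
  mutate(ep_line_num = row_number()) %>%   # line number within each episode
  ungroup() %>%
  left_join(recap, by = "episode") %>%     # attach each episode's cutoff
  filter(ep_line_num > end_of_recap) %>%   # keep only post-recap lines
  select(-end_of_recap)
```

On the toy data this keeps only "real 1" and "real 2".
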
And now, a quick peek at the data before moving on to the real heavy lifting.

## # A tibble: 12,214 x 5
##    episode line                 line_clean             n_words ep_line_num
##    <chr>   <chr>                <chr>                    <int>       <int>
##  1 <NA>    <NA>                 <NA>                        NA          NA
##  2 s01e1   (Theme music playin~ <NA>                        NA           1
##  3 s01e1   Man: Bring her back~ bring her back online~       8           2
##  4 s01e1   Woman, Western acce~ yes im sorry im not f~       8           3
##  5 s01e1   Man: You can lose t~ you can lose the acce~      11           4
##  6 s01e1   Woman, standard acc~ im in a dream                4           5
##  7 s01e1   Man: That's right, ~ thats right dolores y~      16           6
##  8 s01e1   Dolores: Yes. I'm t~ yes im terrified             3           7
##  9 s01e1   Man: There's nothin~ theres nothing to be ~      16           8
## 10 s01e1   Dolores: Yes.        yes                          1           9
## # ... with 12,204 more rows

Step into Analysis

In my head, Westworld had so many phrases that come up over and over again. Sometimes this comes from flashbacks, but much of it does not. To test this theory, I had to add a little extra information to the above data frame. First, to qualify as a phrase, a line needs at least 3 words, so I filter out any episode lines with fewer. Then, I summarize the data into a new data frame, which has one row per unique line, along with information on how often the line was said, how many episodes that line has appeared in, and how many ‘interesting’ words each phrase contains (i.e. after removing words like “a”, “him”, “don’t”, and “no”, to name a few).

While counts of occurrences, episodes, and ‘interesting’ words are useful, they don’t naturally provide a way to rank phrases on combinations. Ideally, the most repetitive phrase in Westworld’s episodes will occur often, occur in many different episodes, and contain many uncommon words (with the last criterion being the least important in my mind). The simplest way to combine these stats was to turn each count into a rating from 0% to 100%. For occurrences and episodes, this is achieved by dividing each count by the maximum count and multiplying by 100. For ‘interesting’ words, since some phrases were quite long, I applied a similar procedure but first took natural logs of the counts to lower the skew of the data. Lastly, two combined ratings are produced: one as the average of all three ratings (rating1), and the other as the average of the occurrence rating and episode rating (rating2). Both of these combined ratings have sqrt transformations applied to push the significances closer to 100% (and honestly to make the following graph easier to read).
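
To see how the ratings behave, here is a small worked example in base R. The three phrases and their counts are invented for illustration; the formulas follow the description above.

```r
# Made-up counts for three phrases (not results from the show)
toy <- data.frame(
  line       = c("phrase a", "phrase b", "phrase c"),
  number     = c(10, 4, 2),   # times the line is said
  n_episodes = c(5, 2, 1),    # episodes it appears in
  n_words_ns = c(2, 3, 1)     # 'interesting' (non-stop) words
)

# Each count scaled to the 0-1 range
toy$rating_episodes <- toy$n_episodes / (diff(range(toy$n_episodes)) + 1)
toy$rating_number   <- toy$number / max(toy$number)
toy$rating_words_ns <- log(toy$n_words_ns) / max(log(toy$n_words_ns))

# Combined ratings: average, then sqrt to stretch values toward 100
toy$rating1 <- sqrt((toy$rating_episodes + toy$rating_number +
                       toy$rating_words_ns) / 3) * 100
toy$rating2 <- sqrt((toy$rating_episodes + toy$rating_number) / 2) * 100
```

The phrase with the most occurrences and the most episodes gets a rating2 of 100, and the sqrt lifts the others: a raw average of 0.2 becomes about 44.7. One thing to watch for: a phrase made up entirely of stop words has n_words_ns = 0, so log(0) is -Inf and its rating1 comes out as NaN, while rating2 is unaffected.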

frequency <- data.dt.norecap %>%
  filter(n_words >= 3) %>% 
  group_by(line_clean, n_words) %>%
  summarise(number = n(),   # how many times the line is said
            line = gsub(pattern = ".*: ", "", dplyr::first(line)),
            n_episodes=n_distinct(episode)) %>%
  arrange(desc(number)) %>%
  ungroup() %>%
  mutate(line_clean_nsws = removePunctuation(removeWords(tolower(line), stopwords("english")))) %>%
  mutate(n_words_ns = str_count(line_clean_nsws, "\\S+") )

## keep only repeated lines and scale each count to 0-1
## (diff(range(x)) + 1 equals max(x) whenever the minimum is 1)
frequency2 <- frequency %>% 
  filter(number > 1) %>%
  mutate(rating_episodes = n_episodes / (diff(range(n_episodes))+1) ,
         rating_number = number / range(number)[2] ,
         rating_words_ns = log(n_words_ns) / max(log(n_words_ns))  ) %>%
  mutate(rating1 = sqrt( (rating_episodes + rating_number + rating_words_ns)/3)*100,
         rating2 = sqrt((rating_episodes + rating_number)/2)*100 ) %>%
  dplyr::select(line, rating1, rating_episodes, rating_number, rating_words_ns, number, n_words_ns, rating2, n_episodes)


Now we could certainly stop here and make some nice charts of the data we have just analyzed. But let’s choose to really see the beauty in these charts. I found a cool package called paletter, which can be installed following my first few lines of code. There’s an excellent post on how this package works here, but essentially you feed in a picture, and the package uses k-means clustering and some filtering to output a color palette based on that image! How cool!

Naturally, I fed in the Season 2 poster for Westworld, and was quite happy with the results.

ww_color_scheme <- create_palette(image_path = "",
                                  number_of_colors = 20,
                                  type_of_variable = "categorical")
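
The clustering idea behind a palette like this can be sketched in a few lines of base R. To be clear, this is not paletter's implementation, just an illustration of k-means on pixel colors; the synthetic "pixels" below stand in for a real image, and all names are made up for the example.

```r
set.seed(42)  # k-means starts from random centers

# Synthetic RGB pixels (values in [0, 1]): one reddish and one bluish region
n <- 200
reddish <- cbind(r = runif(n, 0.8, 1), g = runif(n, 0, 0.2), b = runif(n, 0, 0.2))
bluish  <- cbind(r = runif(n, 0, 0.2), g = runif(n, 0, 0.2), b = runif(n, 0.8, 1))
pixels  <- rbind(reddish, bluish)

# Cluster the pixels in RGB space; each cluster center is one palette color
km <- kmeans(pixels, centers = 2, nstart = 5)

# Convert the centers to hex codes usable in a ggplot2 scale
palette_hex <- rgb(km$centers[, "r"], km$centers[, "g"], km$centers[, "b"])
```

With two well-separated color regions, the two centers land near red and near blue, and `palette_hex` holds their hex codes.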

First, let’s see the most repeated phrases based on the number of occurrences and the number of episodes in which each occurs. Because these metrics are highly correlated, the number of episodes tends to act as more of a tie-breaker, since many phrases occur only a few times. Unfortunately, this approach only captures a line when it is repeated exactly. I’d really like to redo this analysis in a way that can count, for example, S1E2’s “You must be William. Welcome to Westworld” and S1E6’s “Welcome to Westworld” as a repeated phrase, along with the many times “the center of the maze” comes up in different contexts. Motivation for me to learn more language processing techniques!
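
One possible starting point for that kind of partial matching is to count word n-grams instead of whole lines. The following is a rough base-R sketch, not part of the analysis in this post: the helper `ngrams` is hypothetical, and the input is a toy vector of already-cleaned lines (the third one is invented).

```r
# Toy cleaned lines; the first two are the examples mentioned above
lines_clean <- c("you must be william welcome to westworld",
                 "welcome to westworld",
                 "what lies at the center of the maze")

# All runs of n consecutive words in one line
ngrams <- function(line, n = 3) {
  words <- strsplit(line, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count every 3-word phrase across all lines, most frequent first
all_ngrams <- unlist(lapply(lines_clean, ngrams))
counts <- sort(table(all_ngrams), decreasing = TRUE)
```

Here "welcome to westworld" is counted twice even though the two lines differ. A real version would still need to decide which n to use and how to avoid counting overlapping n-grams from the same repeated line several times.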

## ordered by rating2
frequency2 %>% 
  mutate(lab = paste0("Repeated ",number, " times over ", n_episodes, " episodes" )) %>%
  arrange(desc(rating2), desc(rating1)) %>%
  .[1:20,] %>%
  ggplot(aes(x = reorder(line, rating2), y = rating2, fill = line)) +
  geom_bar(stat = 'identity') +
  geom_text(aes(label=line), hjust=1.01, color="grey95", fontface="bold") + 
  geom_text(aes(label=lab), hjust=-0.05, color="grey5", fontface="italic", size = 3) + 
  ylim(0,130) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face="bold")) +
  guides(size = "none") +
  labs(title = "Loops in Westworld: Important, Repeated Lines",
       x = "", y = "Significance % (out of 100%)")+