If asked for two words to describe me, my friends would probably say “Survivor superfan”. As someone who had just finished a Coursera course about machine learning, I was eager to apply what I had learned in a personal project. And as an avid Survivor fan for the last 10 years, what better application than to try to predict Survivor winners? Thus, this research project was born.
Every season I watch the show, I make a bet with my friends on who’ll win, although none of us has ever predicted the winner correctly! This led me to wonder: are there certain aspects of a player’s experience or performance in the game that can predict whether they’ll win, or are human social bonds too complex to predict accurately? My goal for this project was to explore whether various controllable and uncontrollable quantitative factors could predict whether a finalist would win Survivor.
Background:
The game of Survivor takes place in two main phases, with the ultimate objective of becoming the Sole Survivor and winner of a million dollars. The first phase of the game is the Tribe phase, when play is team-based. Tribes compete in challenges against each other to determine who will win rewards and immunity, which keeps them safe from Tribal Council. The losing tribe attends Tribal Council, where they must vote off one of their members. This team phase continues until the Merge, when everybody becomes part of the same Tribe and challenges become individual.
At the end of the season, two or three finalists remain. At this point, the players who were voted off since the Merge (the jury) vote for the finalist they believe is most deserving of the win.
Research Process:
Before starting my project, I did some research on the internet to see what had already been done. Very few people had combined machine learning and Survivor, but one who had was Uygar Sozer, a (now graduated) Master of Science in Analytics student at Northwestern University. I connected with him on LinkedIn and bounced some ideas off him about suitable variables to examine. For the project, I used data from the public GitHub repository doehm/survivoR.
Although I was curious to examine how factors such as how a contestant is shown on screen or their likeability affect their chances of winning, I decided to stick with purely objective, quantifiable factors in my research. A player’s edit or appearance is inherently subjective (though many members of the edgic, or “editing logic”, community would beg to differ!), and I wanted to make sure that only numerical variables were used to train my model. As a result, I decided to look only at historical data from the show, whether contestant (castaway) demographics or voting history.
Choosing features:
I separated my variables into two groups: uncontrollable and controllable. Uncontrollable variables are aspects of the player or their game that they have little to no control over. These include factors such as the percentage of Tribe challenge wins, the number of seasons played, and age.
Percentage of Tribe challenge wins (percent_tribe): I treated this as uncontrollable because I assumed an individual castaway’s contribution to whether the tribe wins is negligible. There are exceptions when it comes to contestants like Joe or Jonathan, who can single-handedly win a tribe challenge, but most of the time one person’s effort, or lack thereof, does not make a huge difference in the challenge outcome.
Number of seasons played (appearance): I decided to look at this variable because I was curious if being more experienced (playing more seasons) increased the chance of a player winning. Additionally, sometimes returnees are viewed favorably by others, as was the case with Boston Rob in season 22.
Age: Especially in older seasons of Survivor, age seems to have a negative correlation with winning the show. Most Survivor players are younger than 40 years old, leading to a skew in ages in most seasons. This seems to be consistent with my kernel density plot.
I was surprised to see that the plot for losers was bimodal, with peaks around 25 and 45 years old. Perhaps there is a “golden” age range for winning the show?
Controllable factors include aspects of gameplay that are within a player’s influence. I looked at four factors here: individual immunity wins (winning a challenge and being safe from being voted out that round), being chosen for reward, immunity idols found (a type of advantage that can nullify all votes cast against you), and voting correctly.
Individual immunity wins (percent_immunity): Fairly self-explanatory variable, but I was curious to see if winning immunity challenges increased respect from the jury. In some challenges, individual immunity can also reflect a castaway’s effort (Andrea from Season 34 immediately comes to mind).
Being chosen for reward (percent_reward): Sometimes, when a castaway wins an individual reward challenge, they will be permitted to bring another person or two along. I wanted to examine if being chosen for many rewards was indicative of strong social bonds or likeability.
Immunity idols found (total_idols): Finding idols takes a lot of determination (especially in recent seasons as it has become riskier!). Putting in the effort to look for idols and play them successfully (blocking votes that would have sent you home without the idol) typically warrants a lot of admiration from the jury.
Voting correctly (percent_vote): It’s hard to always “be in on the vote”, that is, to vote for the person who actually ends up being voted off the island at Tribal Council. Voting correctly could indicate that the player is orchestrating the vote, is in the majority, or has strong enough connections to the majority.
It is important to note that I calculated percentages whenever necessary to account for differences in the number of challenges or votes each season. All plots were made using the ggplot2 library in R.
Data Cleaning:
Once my features were decided, it was time to start cleaning the data! I used the castaways, vote, tribe, immunity, rewards, season_summary, and idols datasets from the GitHub repository. I initially started cleaning data in R, since the data came in .rda format, but soon realized that Python’s pandas library provided much better functionality for manipulating data frames.
In my cleaning, I had to account for many discrepancies in how data was organized across datasets: some were organized by season, some by castaway, and some by specific events (a vote or a challenge). Additionally, many castaways, like Parvati and Amanda, were finalists multiple times, so I identified each finalist appearance using both season and name.
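To give a flavor of this step, here is a minimal sketch of loading and joining two of the datasets in pandas, using the pyreadr library to read the .rda files. The object and column names here (vote_history, finalist, vote_correct, and so on) are placeholders for illustration; the actual survivoR datasets use different names, and the real cleaning involved several more tables.

```python
import pyreadr  # reads .rda files into pandas DataFrames

# Load two of the datasets (the object names inside the .rda files are assumed here)
castaways = pyreadr.read_r("castaways.rda")["castaways"]
votes = pyreadr.read_r("vote_history.rda")["vote_history"]

# A castaway can reach the finals in more than one season, so every row is
# keyed on (season, castaway) rather than name alone.
finalists = castaways[castaways["finalist"]]

# Example feature: fraction of "correct" votes per finalist per season
percent_vote = (
    votes.groupby(["season", "castaway"])["vote_correct"]
         .mean()
         .rename("percent_vote")
         .reset_index()
)

features = finalists.merge(percent_vote, on=["season", "castaway"], how="left")
```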
What surprised me about the data cleaning process was how long it took. It took me more time to clean data than to construct the actual machine learning algorithm! It was definitely challenging at times but a lot of fun to debug.
Modeling:
First, I randomly split the data into a 70% training group and a 30% test group using scikit-learn’s train_test_split function. This split ratio is generally considered good practice, and holding out a test set makes it possible to catch overfitting. I also scaled all features to improve my model, as some values were percentages and others were whole numbers; this prevents any one feature from dominating the others simply because of its scale.
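Here is a rough sketch of the split-and-scale step, assuming a feature matrix X and a 0/1 winner label y have already been built from the cleaned data:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 70/30 split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so percentage features and raw counts end up on a comparable scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```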
I used a multivariate logistic regression model to classify each finalist as either a winner (1) or a loser (0). I used statsmodels with the training set to construct an initial model and examined the resulting summary table.
Something to note is the p-values (P>|z|) in the summary: none is less than 0.05, indicating that none of the variables I examined is statistically significant at that level.
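A minimal sketch of what that initial fit might look like, assuming the scaled training arrays from the previous step:

```python
import statsmodels.api as sm

# statsmodels does not add an intercept automatically, so add a constant column
X_train_const = sm.add_constant(X_train_scaled)

# Fit the logistic regression on the training set
logit_model = sm.Logit(y_train, X_train_const)
result = logit_model.fit()

# The summary includes the coefficient, standard error, z-score,
# and p-value (P>|z|) for each feature
print(result.summary())
```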
Then, I calculated the variance inflation factor (VIF) for each variable. This is a measure of multicollinearity, or how correlated the variables are with each other; in other words, a high VIF indicates that a variable is not independent of the rest. High multicollinearity makes coefficient estimates unstable and can negatively impact a model.
From my examination of both the p-values and the VIFs, I decided to drop the percent_vote variable, as it had a high value of both.
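Here is a sketch of how the VIFs can be computed with statsmodels, assuming X_train is a pandas DataFrame of the training features:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Each VIF measures how well one feature is explained by all the others
vif = pd.DataFrame({
    "feature": X_train.columns,
    "VIF": [
        variance_inflation_factor(X_train.values, i)
        for i in range(X_train.shape[1])
    ],
})
print(vif.sort_values("VIF", ascending=False))
```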
Model Evaluation:
Now that I had constructed my model, it was time to test its accuracy! Using the remaining 30% of the data I had set aside, I predicted the probability that each finalist had won. I used a decision boundary of 0.5 to separate predicted winners from predicted losers.
After testing on all finalists, my model reached a maximum accuracy of 60%, even as I tried different variable combinations and decision boundaries.
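A sketch of this evaluation step, assuming result is the final fitted model (with percent_vote dropped) and X_test_scaled and y_test come from the earlier split:

```python
import statsmodels.api as sm
from sklearn.metrics import accuracy_score

# Predicted probability of winning for each finalist in the test set
X_test_const = sm.add_constant(X_test_scaled)
probs = result.predict(X_test_const)

# Apply the 0.5 decision boundary: at or above it, predict winner (1); below, loser (0)
preds = (probs >= 0.5).astype(int)
print("Accuracy:", accuracy_score(y_test, preds))
```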
I was slightly disappointed to come to this conclusion, as I was hoping for a higher accuracy rate. However, I think this project did illustrate an important point about the nature of Survivor, which I will delve deeper into in the conclusion.
Conclusion:
In my quest to quantifiably predict Survivor winners, I realized that measurable factors can only account for so much. In my analysis of quantitative factors, I couldn’t account for arguably the most impactful and valuable part of Survivor: the human connections. Human connection and emotion are unpredictable, irrational, and immeasurable. Some of my favorite moments of Survivor can never be captured by a machine learning algorithm, like seeing Cirie finish a challenge after falling countless times or watching Aubry turn from a self-proclaimed nervous wreck into a strategic powerhouse.
My project concluded that, yes, there might be some factors that give a finalist a higher chance of winning, but part of me also believes that a large component of what makes Survivor worth watching is the unpredictability and underdog stories. If we could predict who would win with 100% accuracy, there would be no thrill in watching the show!
This project also helped me see that there’s no one “right way” to win Survivor. Everyone plays a different game, and no one aspect is mandatory to win. Take Sandra, for example, one of two players to win Survivor twice. She never even found an immunity idol!
And with that, my project concludes, although I’ll continue to hone this model for the upcoming season. I might also consider including data from other Survivor spin-offs such as Survivor Australia or Survivor South Africa! By the way, Survivor Season 44 premieres on March 1st at 8 pm :) I’ll definitely be watching then!
References:
Gholamy, Afshin. “Why 70/30 or 80/20 Relation Between Training and Testing Sets: A Pedagogical Explanation.” ScholarWorks@UTEP, El Paso, Texas, 2018, pp. 1–6.
Krishnan, Sowmya. “Multivariate Logistic Regression in Python.” Medium, Towards Data Science, 11 June 2020, https://towardsdatascience.com/multivariate-logistic-regression-in-python-7c6255a286ec.
“Multicollinearity in Multiple Regression.” GraphPad by Dotmatics, 2011, https://www.graphpad.com/support/faq/multicollinearity-in-multiple-regression/.
Sozer, Uygar. “Making of a Sole Survivor.” MSiA Student Research, Northwestern University, 30 Apr. 2021, https://sites.northwestern.edu/msia/2021/04/30/making-of-a-sole-survivor/.
“Survivor (Official Site) Season 42 – Watch on CBS.” Paramount+, 14 Dec. 2022, https://www.cbs.com/shows/survivor/.