The goal of this assignment was to prove that we are able to collect, clean, reshape and visualise data. We were instructed to conduct an exploratory data analysis on a topic of our choice. The interpretation of the results and their scientific significance were not the focus of this project.
I chose to look at only four categories: Best Actor, Best Actress, Best Supporting Actor, and Best Supporting Actress. I combined several datasets for my analysis: I took the nominations from 1927-2015 from an existing dataset and obtained the remaining nominations by scraping the Academy Awards website. In addition, I got detailed information about the actors and the movies they played in via the IMDb API, and information about the actors' places of birth via the Google Maps API.
Altogether, I examined 1728 nominations for 926 different actors who played in 1151 different movies.
I structured my analysis in the following way: In a first step, I looked at the characteristics of the actors that were nominated for an Academy Award. Secondly, I attempted to identify the types of films and roles that resulted in nominations.
The Gender Age Gap
The left plot above reveals two things: First, the average age for both men and women to be nominated for an Academy Award has increased over the years. Second, women are consistently younger than men at the time of nomination, although the size of the age gap varies over the years.
The boxplots confirm the observation made in the first plot: Women are on average younger when they are nominated for an Academy Award.
I wanted to take a closer look at the controversy that has been brought up again and again in recent years: The lack of racial diversity among nominees. However, I could not find a dataset about the race of actors, and it would be inappropriate to classify them myself. Therefore, I decided to use another variable to measure diversity: place of birth. Of course, that is only a weak proxy for race, but it was the best I could do with the data at hand.
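The comparison behind these two plots can be sketched in a few lines of plain Python. This is not the code from the project: the records, the field names (`gender`, `birth_year`, `ceremony_year`) and the helper `ages_by_gender` are made up for illustration, and age is approximated as ceremony year minus birth year.

```python
from collections import defaultdict
from statistics import mean

# Toy records (hypothetical, not taken from the dataset).
nominations = [
    {"gender": "female", "birth_year": 1970, "ceremony_year": 2000},
    {"gender": "female", "birth_year": 1985, "ceremony_year": 2010},
    {"gender": "male", "birth_year": 1960, "ceremony_year": 2000},
    {"gender": "male", "birth_year": 1965, "ceremony_year": 2010},
]


def ages_by_gender(records):
    """Group approximate ages at nomination by gender."""
    ages = defaultdict(list)
    for rec in records:
        # Approximation: ignores the exact birth date and ceremony date.
        ages[rec["gender"]].append(rec["ceremony_year"] - rec["birth_year"])
    return dict(ages)


ages = ages_by_gender(nominations)
mean_age = {gender: mean(values) for gender, values in ages.items()}

# Positive value: men are older than women at nomination, on average.
age_gap = mean_age["male"] - mean_age["female"]
```

Grouping the same ages by decade in addition to gender would give the numbers behind a plot of the gap over time.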
It is immediately apparent that a large share of the actors and actresses were born in North America. There is a second, significantly smaller cluster in Europe. Hardly any actors were born in the Global South.
This pattern is not too surprising, as the movie industry is biased towards native English speakers. However, even accepting this language bias, it is worth noting that nearly 30% of the U.S. population self-identified as POC in 2010.
The IMDb maintains a list of "trademarks" for actors and actresses. For example, the trademark for Angelina Jolie is simply "Full lips" - for Christian Bale it is "Often portrays obsessive and detached or loner characters".
I wanted to see what trademarks are important for actors that were nominated and if they differ by gender.
For both genders, the voice often seems to be an important trademark. Apparently, women are known for having husky voices, while men are known for having raspy, commanding or baritone voices.
The trademarks of women are often physical features like hair color, eye color, and descriptions of their body shape (e.g. petite, thick, thin). Physical characteristics (e.g. moustache, hair, hat, nose) are also listed as trademarks of men, but they are less frequently mentioned.
When it comes to non-physical characteristics, there is a big difference between the genders: Women tend to be described as strong-willed, vulnerable, and emotional, while men are described as authoritative, tough and leading. 🤷♀️
How are nominees connected?
The following plot shows the network of actors, with an edge between two actors if they were nominated for an Academy Award for roles in the same movie, and node size determined by their degree in the network. Granted, this plot is not very informative, but the focus of this assignment was on data visualisation, and I wanted to include a network plot.
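To make the construction concrete, here is a minimal sketch of how such a network could be built, using only the standard library. It is not the code from the project: the (actor, movie) pairs are placeholders and the helpers `build_edges` and `degrees` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

# Toy (actor, movie) nomination pairs (hypothetical placeholders).
nominations = [
    ("Actor A", "Movie 1"),
    ("Actor B", "Movie 1"),
    ("Actor C", "Movie 1"),
    ("Actor A", "Movie 2"),
    ("Actor D", "Movie 2"),
    ("Actor E", "Movie 3"),
]


def build_edges(pairs):
    """Return one undirected edge per pair of actors nominated for the same movie."""
    actors_by_movie = defaultdict(set)
    for actor, movie in pairs:
        actors_by_movie[movie].add(actor)

    edges = set()
    for cast in actors_by_movie.values():
        # Sorting makes (a, b) and (b, a) collapse into a single edge.
        for a, b in combinations(sorted(cast), 2):
            edges.add((a, b))
    return edges


def degrees(pairs):
    """Count the distinct co-nominees of every actor (the node degree)."""
    degree = {actor: 0 for actor, _ in pairs}
    for a, b in build_edges(pairs):
        degree[a] += 1
        degree[b] += 1
    return degree


edges = build_edges(nominations)
degree = degrees(nominations)
```

The resulting `degree` values can then be scaled and passed as node sizes to whatever plotting library draws the graph. Actors who were the only nominee for their movie end up with degree 0.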
The Genre Gender Gap
The IMDb distinguishes between 20 genres, but a movie can belong to multiple genres. In total, there are 249 unique genre combinations in the dataset.
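Counting genre combinations is easy to get wrong if the order of the genres matters, so here is a small sketch of one way to do it. The movie titles and genre lists are made up, and treating a combination as an unordered set is my assumption about what a "unique combination" means.

```python
from collections import Counter

# Toy mapping from movie to its genres (hypothetical placeholders).
movies = {
    "Movie 1": ["Drama", "Romance"],
    "Movie 2": ["Romance", "Drama"],  # same combination, different order
    "Movie 3": ["Drama"],
    "Movie 4": ["Comedy", "Drama"],
}

# A frozenset ignores order, so Drama+Romance equals Romance+Drama.
combo_counts = Counter(frozenset(genres) for genres in movies.values())
n_unique_combos = len(combo_counts)

# How many movies carry each individual genre.
genre_counts = Counter(
    genre for genres in movies.values() for genre in set(genres)
)
```

`combo_counts.most_common()` then lists the combinations from most to least frequent, which is the ranking needed to see how concentrated the movies are in a few combinations.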
Drama is by far the biggest genre, and in combination with Romance and Comedy, these genres account for nearly 40% of all movies.
The plot above complements the Venn diagram: Even though there are many combinations of genres, there is a large concentration of movies in a small number of genre combinations, which indicates that there are "typical" genre combinations.
The plots above show that men are nominated for roles in a wider range of genres than women. This could indicate that the Academy only likes to see women in some genres. However, there are generally more roles for men than for women in movies. So the lack of women in some genres might just reflect the overall lack of gender parity in the movie industry.
Other Attributes of Movies That Got Their Actors Nominated for Oscars
The plots above clearly show two things: First, movies became longer over the years; their average runtime increased by about 20 minutes between 1927 and 2019. Second, the average runtime of movies that got actresses nominated is shorter than that of movies that got actors nominated.
It seems as if there might be a slight correlation between rating and runtime. What is obvious, however, is that movies with greater budgets are longer, which is of course not surprising.
Interestingly, men are nominated and win awards for roles in higher-rated movies than women. This might again stem from the general underrepresentation of women in movies: The Academy might have less choice when nominating actresses.
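One way to put a number on "a slight correlation" would be the Pearson correlation coefficient. The sketch below implements it in plain Python. The function name `pearson` and the runtime and rating values are made up for illustration, and nothing here reproduces a result from the analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy values (hypothetical): runtime in minutes and rating per movie.
runtimes = [90, 120, 150, 180]
ratings = [6.5, 7.0, 7.2, 8.0]
r = pearson(runtimes, ratings)
```

A value of `r` close to 0 would mean no linear relationship and a value close to 1 a strong positive one. Pearson only captures linear relationships, so a rank-based measure might be the safer choice if the relationship is not linear.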
The exploratory data analysis revealed some interesting patterns about who exactly is nominated for and wins an Academy Award.
Characteristics of actors & actresses: On average, women are younger than men when they are nominated for Academy Awards. With few exceptions, nominated actors come from North America and Europe. Actors' and actresses' trademarks - the traits they are known for - reflect existing gender stereotypes.
Characteristics of movies: Actors are nominated for movies in a wider range of genres than actresses. There are typical genres and genre combinations, such as drama, romantic drama or comedy-drama, which account for a large number of nominations. On average, movies became longer. Women tend to be nominated for roles in shorter movies than men. There might be a connection between rating and runtime, but more analysis is needed to confirm this.
This project could be extended in various ways. For instance, it would be interesting to add information about actors who were not nominated for an Academy Award to see how they differ from the nominees.
Remarks and Code
Please keep in mind that I started this project a year ago (November 2019), and that it was one of my first projects in Python. I would definitely do a lot of things differently today. In particular, my code would be much more efficient, elegant, and better structured.
The code for this project can be found on my GitHub page.