Natural Language Processing for Movie Roles


The representation of character roles and genders in movies has significant social impact. To study the distribution of roles and genders in movies, we need to first extract this information. How can we effectively extract characters’ roles and genders from movie summaries and credit lists?


The dataset comprises credit lists and summaries for around 34,000 US movies, each with 10 or more IMDb reviews and an IMDb and/or Wikipedia summary.


  • We found 114,922 character name variants in the summaries, for which we were able to extract 71,216 candidate character roles (2.1 roles per movie).

  • On a semi-random evaluation set of 10 movies, our algorithm achieved 54% precision (the proportion of matched character roles that were correct and descriptive) and 46% recall (the proportion of character names in the summaries that were correctly matched with IMDb credits).

  • View our project report here.
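
The precision and recall figures above follow the definitions given in the parentheses. As a minimal sketch of that arithmetic, the snippet below computes both ratios from raw counts. The function names and the counts in the usage lines are hypothetical and chosen only for illustration; they are not the project's evaluation data.

```python
def precision(correct_roles: int, matched_roles: int) -> float:
    """Share of matched character roles that are correct and descriptive."""
    if matched_roles == 0:
        return 0.0  # nothing was matched, so precision is reported as 0 here
    return correct_roles / matched_roles


def recall(correctly_matched_names: int, summary_names: int) -> float:
    """Share of character names in the summaries correctly matched with credits."""
    if summary_names == 0:
        return 0.0  # no names to find, so recall is reported as 0 here
    return correctly_matched_names / summary_names


# Illustrative counts only (invented for this example).
example_precision = precision(correct_roles=12, matched_roles=20)
example_recall = recall(correctly_matched_names=20, summary_names=40)
print(f"precision={example_precision:.2f}, recall={example_recall:.2f}")
```

Returning 0 for an empty denominator is a convention assumed here; an evaluation script could just as reasonably skip such movies.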


Information Architecture, Data Cleaning, POS and NER tagger, Name Extraction, Name Filtering, Alias Association, Regex and Chunking
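
To make the name extraction and alias association steps more concrete, here is a minimal regex-based sketch. It is not the project's actual algorithm: the function names, the `TITLES` set, and the subset-of-tokens matching rule are all assumptions made for illustration, and the sample credits and summary are invented.

```python
import re

# Hypothetical list of honorifics to ignore when comparing names.
TITLES = {"mr", "mrs", "ms", "dr", "sir", "captain"}


def name_tokens(name: str) -> set[str]:
    """Lowercase word tokens of a name, with honorifics removed."""
    tokens = re.findall(r"[A-Za-z']+", name.lower())
    return {t for t in tokens if t not in TITLES}


def associate_aliases(credit_names: list[str], summary: str) -> dict[str, set[str]]:
    """Map each credited character name to the name variants found in a summary.

    Candidate variants are runs of capitalized words. A candidate is treated
    as an alias of a credited name when all of its tokens appear in that name.
    """
    candidates = set(re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*", summary))
    aliases: dict[str, set[str]] = {name: set() for name in credit_names}
    for candidate in candidates:
        candidate_tokens = name_tokens(candidate)
        if not candidate_tokens:
            continue
        for name in credit_names:
            if candidate_tokens <= name_tokens(name):
                aliases[name].add(candidate)
    return aliases


# Invented sample data for illustration.
credits = ["Ellen Ripley", "Arthur Dallas"]
summary = (
    "Ripley warns Dallas about the creature. "
    "In the end, Ellen Ripley escapes alone."
)
result = associate_aliases(credits, summary)
```

A rule this simple over-matches shared surnames and misses nicknames or mixed-case names, which is where filtering with POS and NER tags would presumably come in.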