and prediction of movie ratings
M. Saraee, S. White & J. Eccleston
University of Salford, England
This paper details our analysis of the Internet Movie Database (IMDb), a free, user-maintained, online resource of production details for over 390,000 movies, television series and video games, which contains information such as title, genre, box-office takings, cast credits and user ratings.
We gather a series of interesting facts and relationships using a variety of data mining techniques. In particular, we concentrate on attributes relevant to the user ratings of movies, such as whether big-budget films are more popular than their low-budget counterparts, whether any relationship between movies produced during the "golden age" (e.g. Citizen Kane, It’s A Wonderful Life, etc.) can be proved, and whether any particular actors or actresses are likely to help a movie to succeed. The paper also reports on the techniques used, describing their implementation and usefulness.
We found that the IMDb is difficult to mine because of the format of the source data. We also found some interesting facts: the budget of a film is no indication of how well rated it will be, there is a downward trend in the quality of films over time, and the director and actors/actresses involved in a film are the most important factors in its success or lack thereof. The data used in this paper is not freely distributable and remains the copyright of the Internet Movie Database Inc. It is used here within the terms of their copying policy. Further distribution of the source data used in this paper may be prohibited.
Keywords: IMDb, Internet Movie Database, data mining, classification, movies, films.
Data Mining V, A. Zanasi, N. F. F. Ebecken & C. A. Brebbia (Editors) © 2004 WIT Press, www.witpress.com, ISBN 1-85312-729-9
The IMDb is an excellent resource to find detailed information about almost any film ever made. It contains a vast amount of data, which undoubtedly contains much valuable information about general trends in films.
Data mining techniques enable us to uncover information that will confirm or disprove common assumptions about movies, and allow us to predict the success of a future film given selected information about the film before its release. The main difficulty in attempting to use data mining to extract useful information from the IMDb is the format of the source data: it is only available in a number of inconsistently structured text files.
The outcome of this research is therefore twofold: it provides tools and techniques to transform the IMDb data into a format suitable for data mining, and it provides a selection of information mined from this refined data, presented in Section 4.2, Experimental results.
The organisation of the paper is as follows: Section 2 provides more details about the problem domain and the particular problems in attempting data mining of the IMDb. Section 3 gives an overview of the techniques we use to perform our analysis. Section 4 describes the actual analysis performed, and then presents the results and a discussion thereof. Section 5 gives the conclusions reached and a note about possible further work.
As mentioned in Section 1, the main problem encountered when attempting to mine the IMDb data is the source format. The data is provided as forty-nine separate text files. The common factor linking the information in these files is the title of the movie, which is in fact the title with the production year appended in brackets, to distinguish between different versions, e.g. Godzilla (1954), Godzilla (1998).
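The role of the title-plus-year key can be illustrated with a minimal Python sketch. It is not the tooling used in this paper: the function names `split_movie_key` and `merge_on_key`, the assumption that the year is always four digits in round brackets at the end of the key, and the field values in the usage note are all hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed key shape, based only on the examples in the text:
# "<title> (<four-digit year>)", e.g. "Godzilla (1954)".
_KEY_PATTERN = re.compile(r"^(?P<title>.+?)\s*\((?P<year>\d{4})\)\s*$")


def split_movie_key(key: str) -> Optional[Tuple[str, int]]:
    """Split a key such as 'Godzilla (1954)' into (title, year).

    Returns None when the key does not match the assumed shape, so that
    malformed entries can be logged instead of silently mis-joined.
    """
    match = _KEY_PATTERN.match(key)
    if match is None:
        return None
    return match.group("title"), int(match.group("year"))


def merge_on_key(*tables: Dict[str, dict]) -> Dict[str, dict]:
    """Combine per-file records that share the same title-plus-year key.

    Each table maps a movie key to the fields parsed from one source file.
    Later tables overwrite earlier ones if they repeat a field name.
    """
    merged: Dict[str, dict] = {}
    for table in tables:
        for key, fields in table.items():
            merged.setdefault(key, {}).update(fields)
    return merged
```

For example, `merge_on_key({"Godzilla (1954)": {"genre": "g1"}}, {"Godzilla (1954)": {"rating": "r1"}})` yields a single record for the 1954 film holding both placeholder fields, while a record keyed `Godzilla (1998)` would remain separate. Keeping the year inside the join key is what prevents the two versions from being conflated.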
The files themselves are in a variety of formats, and follow no convention such as Comma Separated Values (CSV): the data is laid out to be human-readable, not machine-readable. The data is generally consistent, but some errors are present. Much of the...