Search News Posts
General Inquiries 1-888-555-5555
•
Support 1-888-555-5555
Rating Prediction and Movie Recommendation
This case study will explore how a movie production company used these tools to gain a competitive advantage. The aim is to predict the ratings and do a top-N recommendation.
A company specializing in online movie streaming wants to increase user engagement on their platform. One way to achieve this goal is by providing personalized movie recommendations to each user based on their viewing history and preferences.
The company has a database of over 10,000 movies, each with various attributes such as genre, director, actors, release year, etc. They also have data on each user's viewing history and movie ratings.
My Case study project uses the Movielens dataset (~ 250 MB) which has 25 million rows and 62423 movies for model training. I also used another smaller version of this Movielens dataset which has 100 thousand rows. Sometimes the dataset is too big for the resources and needs to be subsampled. Here is the link for the datasets: https://grouplens.org/datasets/movielens/
MovieCo decided to invest in intelligent video analysis with data analytics. They partnered with a tech company that specialized in this area and began using their tools to analyze movies in development. The tools used artificial intelligence and machine learning algorithms to analyze various aspects of a movie, including the script, the cast, and the genre.
The tools were able to identify patterns and trends in successful movies and make predictions about how a particular movie would perform at the box office. For example, the tools could analyze the script and predict whether the story was likely to appeal to a particular audience. They could also analyze the cast and predict whether the actors would be able to draw in a large audience.
There are 5 parts to this case study.
We first start analytics with a genre map as below.
knn_map[2294] gives an output like this: [(1.0, 1), (1.0, 3114), (1.0, 3754), (1.0, 4016), (1.0, 4886)], It can be observed that the movies with ids 1, 3114, 3754, 4016 and 4886 are similar to the movie with id 2294.
knn_map[3754] gives an output like this: [(1.0, 1), (1.0, 2294), (1.0, 3114), (1.0, 4016), (1.0, 4886)], It can be observed that the movies with ids 1, 2294, 3114, 4016 and 4886 are similar to the movie with id 3754.
_____
User Id
MovieId
Rating
Timestamp
0
1
296
5.0
1147880044
1
1
306
3.5
1147880817
2
1
307
5.0
1147880828
3
1
665
5.0
1147880820
4
1
899
3.5
1147880510
_____
MovieId
Title
Genres
0
1
Toy Story (1995)
Adventure/Animation/Children/Comedy/Fantasy
1
2
Jumanji (1995)
Adventure/Children/Fantasy
2
3
Grumpier Old Men (1995)
Adventure/Children/Fantasy
3
4
Waiting to Exhale (1995)
Adventure/Children/Fantasy
4
5
Father of the Bride Part II
Adventure/Children/Fantasy
Here the Movielens dataset with 100 thousand rows is used for rating prediction. A rating map is created by using the dictionary data structure, which maps ratings to the concatenation of userId and movieId. Also a movie genres map is created which maps genres to the movieId. Movies and ratings are merged on movieId.
Making all the rating predictions for the 100K dataset takes between 3 minutes and 5 minutes in Google Colab, when content_based_rating_prediction() function is used in a for loop. The length of the 'predicted' list is 100836.
As an example, predicted ratings at indices 0, 3, 27, and 444 are 4.4, 3.1, 3.0, and 4.2 respectively.
MSE, MAE, and RMSE are calculated: MSE: 0.773, MAE: 0.682, RMSE: 0.879
Here datasketch library is used. MinHash, and MinHashLSHForest are imported. tokenize() function separates the input string into tokens according to the white spaces. get_forest(data, perms) function creates a MinHashLSHForest with the given data and number of permutations.
These permutations are used to create signatures for the data. predict_by_title(title, database, perms, num_results, forest) function makes the prediction. It takes 130 seconds to build the forest for the 100K rows dataset. The type of the built forest is datasketch.lshforest.MinHashLSHForest.
For the 100K rows dataset, it takes a very short time to query the forest, like 0.02 seconds. When experimented with different parameters for the 25M dataset, with permutations set to 40, 20 and 10 the session crashed due to insufficient memory. For the 5M dataset with permutations set to 10, building the forest took 30 minutes (1806 seconds).
It was not possible to save the hashtables of the forest to file because of the existence of both binary and text data. For the 5M dataset, it took 5.3 seconds to query the forest. Here are 15 recommendations for the movie title ‘Star Wars’:
_____
MovieId
Title
Genre
Rating
1
3526280
Spider-Man (2002)
Adventure/Children/Fantasy
1
2
3526280
Deep Blue Sea (1999)
Action/Horror/Sci-Fi/Thriller
2
3
3526280
Iron Man (1999)
Action/Adventure/Sci-Fi
8
4
3526280
Aliens (1986)
Action/Adventure/Horror/Sci-Fi
4
5
3526280
Start Wars Episode II (1978)
Action/Adventure/Sci-Fi/IMAX
5
6
3526280
Jurassic Park: Lost World(?)
Action/Adventure/Sci-Fi/Thriller
6
7
3526280
Men in Black II (?)
Action/Comedy/Sci-Fi
7
8
3526280
Green Lantern: First Flight (2009)
Action/Adventure/Animation/Fantasy/Sci-Fi
1
9
3526280
Timecrimes (Cronocimenes, Los) (2007)
Action/Adventure/Animation/Fantasy/Sci-Fi
1
10
3526280
King Kong (1976)
Adventure/Fantasy/Romance/Animation/Sci-Fi/Thriller
1
11
3526280
Short Circuit (1986)
Comedy/Sci-Fi
1
12
3526280
Back to the Future III (1999)
Adventure/Comedy/Sci-Fi/Western
1
13
3526280
Saint, The (1997)
Action/Romance/Sci-Fi/Thriller
1
Here popularity-based approach is used for the 25M dataset. Popularity of a movie is calculated by the number of ratings it gets from the users. Ratings are first grouped by movieId and title and then aggregated by count: movie_counts25M = ratings.groupby(['movieId','title']).agg({'userId': 'count'}).reset_index()
Movies that are already rated by the user are removed from the top-N recommendation list of movies. In order to display the ratings of the movies as well in the top-N recommendation list, movie_ratings is merged to the recommendation list:
Here is a sample of top-15 recommendations for a user:
_____
MovieId
Title
Score
0
356
Forrest Gump (1994)
81491
1
2
Shawshank Redemption (1994)
81491
2
3
Pulp Fiction (1994)
81491
3
4
Silence of the Lmabs (1991)
81491
4
5
Matrix, The
81491
Here, KNeighborsRegressor is used for the 25M rows dataset. For the K parameter several numbers are experimented with, from 3 to 80. KNN regressor here is convenient because it doesn't need training, and it uses low memory resources.
One example of the tools in action was with the movie "Superhero Showdown." Initially, the movie was set to be a standard superhero movie, but the tools identified a potential problem. They found that the audience for superhero movies was becoming oversaturated, and the market was starting to get tired of the genre. Based on this analysis, MovieCo made the decision to pivot the movie and make it a comedy instead. The movie was a huge success, and MovieCo was able to reap the benefits of their data-driven decision-making.
Using these tools, MovieCo was able to make more informed decisions about which movies to invest in. They were able to identify potential problems early on and make adjustments before investing too much money in a project. This led to a significant increase in the success rate of their movies.
Intelligent movie recommendation with data analytics can be a powerful tool for movie production companies. By using these tools, companies can make more informed decisions about which movies to invest in and identify potential problems before they become too costly. MovieCo was able to significantly increase their success rate and gain a competitive advantage in the industry by leveraging this technology.
We provide realtime decision-making with our deeply trained AI-based Recommendation engine solutions. Contact us for a customized solution.