  • General Inquiries 1-888-555-5555

  • Support 1-888-555-5555

anaplatform Data Consultancy

Media Solutions

Discover. Delight. Devour. Your Ultimate Movie Recommendation Guide!

Case Study: Rating Prediction and Movie Recommendation

Problem

This case study explores how a movie company used data analytics and recommendation tools to gain a competitive advantage. The aim is to predict movie ratings and to produce top-N recommendations.

Background

A company specializing in online movie streaming wants to increase user engagement on their platform. One way to achieve this goal is by providing personalized movie recommendations to each user based on their viewing history and preferences.

Data

The company has a database of over 10,000 movies, each with various attributes such as genre, director, actors, release year, etc. They also have data on each user's viewing history and movie ratings.

My case study project uses the Movielens dataset (~250 MB), which has 25 million rows and 62,423 movies, for model training. I also used a smaller version of the Movielens dataset, which has 100 thousand rows. Sometimes the dataset is too big for the available resources and needs to be subsampled. The datasets are available here: https://grouplens.org/datasets/movielens/
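As a rough illustration of subsampling a file that is too large to load at once, the sketch below uses reservoir sampling, which keeps a uniform random sample of k rows while reading the data in a single pass. The case study does not say how its subsample was drawn, so the technique, the function name `reservoir_sample`, and the `seed` parameter are all assumptions.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return up to k rows chosen uniformly at random in one pass (Algorithm R).

    Hypothetical helper: only k rows are held in memory at any time.
    """
    rng = random.Random(seed)
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a stored row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Toy usage: draw 10 "rows" out of 1000 without materializing the input.
subsample = reservoir_sample(range(1000), k=10, seed=42)
print(len(subsample))
```

In practice `rows` could be a file iterator such as a `csv.reader`, so the full ratings file never has to fit in memory.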

Solution:

MovieCo decided to invest in intelligent video analysis with data analytics. They partnered with a tech company that specialized in this area and began using their tools to analyze movies in development. The tools used artificial intelligence and machine learning algorithms to analyze various aspects of a movie, including the script, the cast, and the genre.

The tools were able to identify patterns and trends in successful movies and make predictions about how a particular movie would perform at the box office. For example, the tools could analyze the script and predict whether the story was likely to appeal to a particular audience. They could also analyze the cast and predict whether the actors would be able to draw in a large audience.

Implementation

There are 5 parts to this case study.

  • Part 1 evaluates content-based recommendation on the 25M-row dataset.
  • Part 2 makes content-based rating predictions using the 100K Movielens dataset; MAE and RMSE are calculated.
  • Part 3 makes content-based recommendations using Locality Sensitive Hashing on a 5M-row subsample of the 25M-row dataset; subsampling was required because the 25M-row dataset is too big for the available memory.
  • Part 4 makes top-N recommendations on the 25M-row dataset according to movie popularity, which is calculated by counting the number of ratings users gave to the movie.
  • Part 5 makes rating predictions on the 25M-row dataset using the KNN regressor method. Several values of K from 3 to 60 were tried: K=60 does not fit into memory, but K=40 does. MSE and RMSE are calculated.

We first start analytics with a genre map as below.

Part 1 - Top-N Recommendation with Movielens 25M Dataset Using Content-Based Recommendation

knn_map[2294] gives an output like this: [(1.0, 1), (1.0, 3114), (1.0, 3754), (1.0, 4016), (1.0, 4886)]. Each pair is a similarity score followed by a movie id, so the movies with ids 1, 3114, 3754, 4016 and 4886 are similar to the movie with id 2294.

knn_map[3754] gives an output like this: [(1.0, 1), (1.0, 2294), (1.0, 3114), (1.0, 4016), (1.0, 4886)]. Likewise, the movies with ids 1, 2294, 3114, 4016 and 4886 are similar to the movie with id 3754.
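One way a map shaped like knn_map could be built is sketched below. It assumes Jaccard similarity over genre sets, which the case study does not confirm as its similarity measure. The function names and the movies with ids 9001 and 9002 are hypothetical and exist only for this example.

```python
from typing import Dict, List, Set, Tuple

# Toy genre map: movieId -> set of genres. Ids 9001 and 9002 are made up.
genre_map: Dict[int, Set[str]] = {
    1: {"Adventure", "Animation", "Children", "Comedy", "Fantasy"},
    2: {"Adventure", "Children", "Fantasy"},
    9001: {"Adventure", "Animation", "Children", "Comedy", "Fantasy"},
    9002: {"Comedy", "Romance"},
}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_knn_map(
    genres: Dict[int, Set[str]], k: int
) -> Dict[int, List[Tuple[float, int]]]:
    """For every movie, keep the k most similar other movies as (score, id)."""
    result: Dict[int, List[Tuple[float, int]]] = {}
    for movie_id, movie_genres in genres.items():
        scored = [
            (jaccard(movie_genres, other_genres), other_id)
            for other_id, other_genres in genres.items()
            if other_id != movie_id
        ]
        # Highest similarity first; ties are broken by the smaller movie id.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        result[movie_id] = scored[:k]
    return result


knn_map = build_knn_map(genre_map, k=2)
print(knn_map[1])  # [(1.0, 9001), (0.6, 2)]
```

A score of 1.0 means the two movies have exactly the same genre set, which matches the all-1.0 neighbor lists shown above. Comparing every pair of movies is quadratic in the number of movies, so this brute-force form only suits small catalogs.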

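|   | User Id | MovieId | Rating | Timestamp  |
|---|---------|---------|--------|------------|
| 0 | 1       | 296     | 5.0    | 1147880044 |
| 1 | 1       | 306     | 3.5    | 1147880817 |
| 2 | 1       | 307     | 5.0    | 1147880828 |
| 3 | 1       | 665     | 5.0    | 1147880820 |
| 4 | 1       | 899     | 3.5    | 1147880510 |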



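|   | MovieId | Title                       | Genres                                      |
|---|---------|-----------------------------|---------------------------------------------|
| 0 | 1       | Toy Story (1995)            | Adventure/Animation/Children/Comedy/Fantasy |
| 1 | 2       | Jumanji (1995)              | Adventure/Children/Fantasy                  |
| 2 | 3       | Grumpier Old Men (1995)     | Adventure/Children/Fantasy                  |
| 3 | 4       | Waiting to Exhale (1995)    | Adventure/Children/Fantasy                  |
| 4 | 5       | Father of the Bride Part II | Adventure/Children/Fantasy                  |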

Part 2 - Rating Prediction Using 100K Movielens Dataset

Here the Movielens dataset with 100 thousand rows is used for rating prediction. A rating map is created as a dictionary whose keys are the concatenation of userId and movieId and whose values are the ratings. A movie genres map is also created, which maps each movieId to its genres. Movies and ratings are merged on movieId.
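The two lookup structures might look roughly like the sketch below. The key format with an underscore separator, the helper name `rating_key`, and the "/" genre separator (taken from the tables in this article) are assumptions, and the rows are toy data.

```python
from typing import Dict, List, Set, Tuple

# Toy (userId, movieId, rating) rows.
rating_rows: List[Tuple[int, int, float]] = [
    (1, 296, 5.0),
    (1, 306, 3.5),
    (1, 307, 5.0),
]

# Toy (movieId, genres) rows; genres are "/"-separated as in this article.
movie_rows: List[Tuple[int, str]] = [
    (1, "Adventure/Animation/Children/Comedy/Fantasy"),
    (2, "Adventure/Children/Fantasy"),
]


def rating_key(user_id: int, movie_id: int) -> str:
    """Concatenate userId and movieId into one dictionary key.

    The underscore is an assumed separator: without it, user 1 with movie 23
    and user 12 with movie 3 would both produce the key "123".
    """
    return f"{user_id}_{movie_id}"


# Rating map: "userId_movieId" -> rating.
rating_map: Dict[str, float] = {
    rating_key(user_id, movie_id): rating
    for user_id, movie_id, rating in rating_rows
}

# Genres map: movieId -> set of genre names.
genres_map: Dict[int, Set[str]] = {
    movie_id: set(genres.split("/")) for movie_id, genres in movie_rows
}

print(rating_map[rating_key(1, 296)])  # 5.0
print(sorted(genres_map[2]))  # ['Adventure', 'Children', 'Fantasy']
```

Dictionary lookups take constant time on average, which helps when a prediction function is called once per rating in a loop.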

Making all the rating predictions for the 100K dataset takes between 3 and 5 minutes in Google Colab when the content_based_rating_prediction() function is called in a for loop. The length of the 'predicted' list is 100836.

As an example, predicted ratings at indices 0, 3, 27, and 444 are 4.4, 3.1, 3.0, and 4.2 respectively.

MSE, MAE, and RMSE are calculated: MSE: 0.773, MAE: 0.682, RMSE: 0.879
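For reference, the three error measures can be computed as in the minimal sketch below. The function name `regression_errors` and the toy ratings are illustrative only and are not the case study's data.

```python
import math
from typing import Sequence, Tuple


def regression_errors(
    actual: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (MSE, MAE, RMSE) for two equally long rating sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)   # mean squared error
    mae = sum(abs(e) for e in errors) / len(errors)  # mean absolute error
    rmse = math.sqrt(mse)                            # root of the MSE
    return mse, mae, rmse


# Toy example with four ratings.
mse, mae, rmse = regression_errors(
    actual=[4.0, 3.0, 5.0, 2.0],
    predicted=[3.5, 3.0, 4.0, 3.0],
)
print(mse, mae, rmse)  # 0.5625 0.625 0.75
```

Because RMSE is the square root of MSE, the reported values are consistent with each other: the square root of 0.773 is approximately 0.879.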

Part 3 - LSH with Content-Based Recommendation on 5M Dataset

Here the datasketch library is used, from which MinHash and MinHashLSHForest are imported. The tokenize() function splits the input string into tokens on whitespace. The get_forest(data, perms) function creates a MinHashLSHForest from the given data and number of permutations.

These permutations are used to create signatures for the data. The predict_by_title(title, database, perms, num_results, forest) function makes the prediction. Building the forest for the 100K-row dataset takes 130 seconds. The type of the built forest is datasketch.lshforest.MinHashLSHForest.
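To make the idea of permutation-based signatures concrete, here is a minimal standard-library sketch. It is not datasketch's implementation: the salted BLAKE2 hashing scheme and the helpers `minhash_signature` and `estimate_jaccard` are assumptions for illustration, and this stand-in `tokenize` also lowercases the text, which the original function may not do.

```python
import hashlib
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Split on whitespace; lowercasing is an added assumption."""
    return set(text.lower().split())


def _seeded_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; each seed plays the role of one permutation."""
    digest = hashlib.blake2b(
        token.encode("utf8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: Set[str], perms: int) -> List[int]:
    """Keep the minimum hash value of the token set for each permutation."""
    if not tokens:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(perms)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


sig_a = minhash_signature(tokenize("Star Wars Episode IV"), perms=10)
sig_b = minhash_signature(tokenize("episode iv star wars"), perms=10)
print(estimate_jaccard(sig_a, sig_b))  # 1.0, since the token sets are identical
```

The signature length equals the number of permutations, so fewer permutations reduce the memory needed per title at the cost of a noisier similarity estimate.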

For the 100K-row dataset, querying the forest is very fast, about 0.02 seconds. For the 25M dataset, the session crashed due to insufficient memory with permutations set to 40, 20 and 10. For the 5M dataset with permutations set to 10, building the forest took about 30 minutes (1806 seconds).

It was not possible to save the hash tables of the forest to a file because they contain both binary and text data. For the 5M dataset, it took 5.3 seconds to query the forest. Here are 15 recommendations for the movie title ‘Star Wars’:

|    | MovieId | Title                                 | Genre                                               | Rating |
|----|---------|---------------------------------------|-----------------------------------------------------|--------|
| 1  | 3526280 | Spider-Man (2002)                     | Adventure/Children/Fantasy                          | 1      |
| 2  | 3526280 | Deep Blue Sea (1999)                  | Action/Horror/Sci-Fi/Thriller                       | 2      |
| 3  | 3526280 | Iron Man (1999)                       | Action/Adventure/Sci-Fi                             | 8      |
| 4  | 3526280 | Aliens (1986)                         | Action/Adventure/Horror/Sci-Fi                      | 4      |
| 5  | 3526280 | Star Wars Episode II (1978)           | Action/Adventure/Sci-Fi/IMAX                        | 5      |
| 6  | 3526280 | Jurassic Park: Lost World (?)         | Action/Adventure/Sci-Fi/Thriller                    | 6      |
| 7  | 3526280 | Men in Black II (?)                   | Action/Comedy/Sci-Fi                                | 7      |
| 8  | 3526280 | Green Lantern: First Flight (2009)    | Action/Adventure/Animation/Fantasy/Sci-Fi           | 1      |
| 9  | 3526280 | Timecrimes (Cronocimenes, Los) (2007) | Action/Adventure/Animation/Fantasy/Sci-Fi           | 1      |
| 10 | 3526280 | King Kong (1976)                      | Adventure/Fantasy/Romance/Animation/Sci-Fi/Thriller | 1      |
| 11 | 3526280 | Short Circuit (1986)                  | Comedy/Sci-Fi                                       | 1      |
| 12 | 3526280 | Back to the Future III (1999)         | Adventure/Comedy/Sci-Fi/Western                     | 1      |
| 13 | 3526280 | Saint, The (1997)                     | Action/Romance/Sci-Fi/Thriller                      | 1      |

Part 4 - Using Popularity-Based Recommender Model on 25M Dataset

Here a popularity-based approach is used for the 25M dataset. The popularity of a movie is calculated as the number of ratings it receives from users. Ratings are first grouped by movieId and title and then aggregated by count: movie_counts25M = ratings.groupby(['movieId','title']).agg({'userId': 'count'}).reset_index()

Movies that the user has already rated are removed from the top-N recommendation list. To display the movies' ratings in the top-N recommendation list as well, movie_ratings is merged into the recommendation list.
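The popularity logic described above can be sketched without pandas as follows. This standard-library version is a stand-in for the groupby-based code, and the function name `top_n_popular` and the toy ratings are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Toy (userId, movieId, rating) rows; not taken from the real dataset.
ratings: List[Tuple[int, int, float]] = [
    (1, 10, 5.0),
    (2, 10, 4.0),
    (3, 10, 3.0),
    (2, 20, 4.0),
    (3, 20, 2.0),
    (3, 30, 5.0),
]


def top_n_popular(
    rows: List[Tuple[int, int, float]], user_id: int, n: int
) -> List[Tuple[int, int]]:
    """Return up to n (movieId, rating_count) pairs the user has not rated."""
    # Popularity score: how many ratings each movie received.
    counts = Counter(movie_id for _, movie_id, _ in rows)
    # Movies this user has already rated are excluded.
    seen = {movie_id for uid, movie_id, _ in rows if uid == user_id}
    candidates = [(m, c) for m, c in counts.items() if m not in seen]
    # Most-rated first; ties are broken by the smaller movie id.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return candidates[:n]


print(top_n_popular(ratings, user_id=1, n=2))  # [(20, 2), (30, 1)]
```

Movie 10 is the most rated overall, but user 1 has already rated it, so it is left out of that user's list.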

Here is a sample of top-15 recommendations for a user:

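|   | MovieId | Title                       | Score |
|---|---------|-----------------------------|-------|
| 0 | 356     | Forrest Gump (1994)         | 81491 |
| 1 | 2       | Shawshank Redemption (1994) | 81491 |
| 2 | 3       | Pulp Fiction (1994)         | 81491 |
| 3 | 4       | Silence of the Lambs (1991) | 81491 |
| 4 | 5       | Matrix, The                 | 81491 |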

Part 5 - KNN Regressor - Rating Prediction

Here, KNeighborsRegressor is used for the 25M-row dataset. Several values of the K parameter are tried, from 3 to 80. The KNN regressor is convenient here because it does not need a separate training phase, and it uses few memory resources.
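The idea behind K-nearest-neighbors regression can be shown with a plain-Python stand-in for KNeighborsRegressor. The feature vectors below are hypothetical, since the case study does not list the features it used, and the function name `knn_predict` is made up for this sketch.

```python
import math
from typing import List, Sequence


def knn_predict(
    train_x: List[Sequence[float]],
    train_y: List[float],
    query: Sequence[float],
    k: int,
) -> float:
    """Predict a rating as the mean rating of the k nearest training points."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training points")
    # Euclidean distance from the query to every stored training point.
    distances = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(y for _, y in nearest) / k


# Hypothetical two-dimensional features with known ratings.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [3.0, 4.0, 5.0, 1.0]

print(knn_predict(train_x, train_y, query=(0.1, 0.1), k=3))  # 4.0
print(knn_predict(train_x, train_y, query=(0.1, 0.1), k=1))  # 3.0
```

There is no fitting step: the stored training points are the model, and each prediction compares the query against them. Larger K averages over more neighbors, which smooths the prediction.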

PART 5 - 25M K=3 MSE=5.15 RMSE=2.27

PART 5 - 25M K=4 MSE=4.89 RMSE=2.21

PART 5 - 25M K=5 MSE=4.73 RMSE=2.18

One example of the tools in action was with the movie "Superhero Showdown." Initially, the movie was set to be a standard superhero movie, but the tools identified a potential problem. They found that the audience for superhero movies was becoming oversaturated, and the market was starting to get tired of the genre. Based on this analysis, MovieCo made the decision to pivot the movie and make it a comedy instead. The movie was a huge success, and MovieCo was able to reap the benefits of their data-driven decision-making.

Conclusion:

Using these tools, MovieCo was able to make more informed decisions about which movies to invest in. They were able to identify potential problems early on and make adjustments before investing too much money in a project. This led to a significant increase in the success rate of their movies.

Intelligent movie recommendation with data analytics can be a powerful tool for movie production companies. By using these tools, companies can make more informed decisions about which movies to invest in and identify potential problems before they become too costly. MovieCo was able to significantly increase their success rate and gain a competitive advantage in the industry by leveraging this technology.

We provide real-time decision-making with our deeply trained AI-based recommendation engine solutions. Contact us for a customized solution.

Alper Yılmaz

Expert Data Scientist