Final Report

Recommendation System for Anime

Aaron Gokaslan, Ahmad Najeeb,

Haris Chaudhary, Nicholas Hartmann

Abstract

Our objective for this project was to build a recommendation system specifically for Anime TV programs. The end goal was for users to provide their ratings or preferences, and for our recommender to use that data to offer Anime recommendations that are relevant to them. To build an effective and accurate recommender, we needed a large dataset. We obtained data covering approximately 12,000 animes, over 3 million users, and approximately 170 million user ratings, each of which is between 1 and 10.

Using an Incremental SVD Julia package, we built a model that uses the ratings and anime characteristics to predict what score each user would give each of the animes that s/he has not yet rated. Our final product is a web application that operates in two modes: 1) User-based: users provide their MyAnimeList.net username, and our recommender looks up their profile and compiles a list of Animes that it predicts the user will like (based on their previous ratings). 2) Cold-Start: if a user doesn’t have a MyAnimeList.net account, they can simply provide up to 5 Animes that they’ve previously seen and liked, and our recommender will use this list to prepare recommendations for them.

Dataset Information

Our dataset originates from MyAnimeList.net, which is roughly the IMDB equivalent for Anime. The dataset we obtained could be broken down into two main categories:

Anime information: Information relating to specific Anime shows, such as their age rating (i.e. G, PG, PG-13, etc.), their runtime, etc.
User ratings: Ratings which users have given to a specific Anime show.

All of the above data originated as JSON files, which we had to clean extensively and eventually convert to CSV. When cleaning, we had to remove Manga information from the Anime information (since Manga is data that we’re conceptually not interested in). We also removed Anime entries that had no ratings or were missing other vital data. Correspondingly, all entries in the user information that referred to the removed Anime entries also had to be removed.

Obtaining the Data

Due to an unfortunate history of DDoS attacks, MyAnimeList.net is very careful about who they give API keys to. Only a handful of third parties have them, and the website no longer gives them out. However, this project was not the first of its kind: other recommenders had been built using data from the website. After looking around, our group contacted the owner of one such recommendation engine, who mentioned that they still had an API key and could actively pull ratings. The owner also had a sophisticated scraper to pull data from the website that is not covered by the limited API. Through this contact, we were able to receive a dump of the data.

Initial Processing

We wanted to start our research into Anime trends with some basic preliminary work to get a sense of what our data looks like and what we can expect from it going forward. One of the questions we asked ourselves was how relevant our work actually is (i.e. is Anime even worth exploring?). We created a line graph of new Anime releases over time and saw a staggering increase over the past two decades. The graph below was adapted from here.

We were also interested in how the average ratings varied per genre, to get a sense of what to expect from the underlying data in terms of recommendation. That is, what is the general consensus among users about what is popular and what is not? We were surprised to find that average ratings were more or less consistent across genres. The genres that did have a low average rating were those that contained very few Anime. We reasoned that such genres are niches, and that creating good Anime in those categories is a hard task that gets judged more critically by viewers. The results can be seen below; they were adapted and modified from here:

Lastly, we wanted to visualize how ratings were distributed across the Anime themselves. For this purpose we used a Python script to create a visualization showing how many Anime are clustered around each score. The graph can be seen below. We found that most Anime were rated around 7, with no significant concentration of very high or very low ratings. This suggests that the overall ratings are not polarized.

First Steps with Recommendation

With our initial research out of the way, we started researching how to build an accurate recommendation model, which took a decent amount of time. To evaluate the effectiveness of any model we came up with, we compared the ratings that it predicted to the actual user ratings. Each time we did this, we split our ratings data into training and testing subsets. After training the model on the training set, we attempted to predict the ratings in the testing set. We used the mean squared error (MSE) as our measure of accuracy. To establish a benchmark to beat, we predicted the ratings using the methods in the table below and obtained their respective MSEs:

Prediction Method	MSE
Assign Random Rating Between 0 and 10	25.10
Assign Rating of 5 (“middle” rating)	15.17
Predict average rating given for that anime in the training set	13.66
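
The three benchmark predictors in the table can be sketched roughly as follows. This is a minimal illustration, not our actual evaluation script: the function names, the tuple layout, and the fallback to the global training mean for animes missing from the training set are all assumptions.

```python
import random
from collections import defaultdict


def mse(predicted, actual):
    """Mean squared error between two equal-length lists of scores."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def baseline_mses(train, test, seed=0):
    """Return the MSE of three naive predictors on the test ratings.

    train and test are lists of (user_id, anime_id, score) tuples.
    """
    rng = random.Random(seed)
    actual = [score for _, _, score in test]

    # Baseline 1: a uniformly random rating between 0 and 10.
    random_preds = [rng.uniform(0, 10) for _ in test]

    # Baseline 2: always predict the "middle" rating of 5.
    constant_preds = [5.0] * len(test)

    # Baseline 3: the anime's average score in the training set.
    # Assumption: animes unseen in training get the global training mean.
    totals = defaultdict(lambda: [0.0, 0])
    for _, anime_id, score in train:
        totals[anime_id][0] += score
        totals[anime_id][1] += 1
    global_mean = sum(score for _, _, score in train) / len(train)
    mean_preds = [
        totals[a][0] / totals[a][1] if a in totals else global_mean
        for _, a, _ in test
    ]

    return {
        "random": mse(random_preds, actual),
        "constant_5": mse(constant_preds, actual),
        "anime_mean": mse(mean_preds, actual),
    }
```

Any model worth keeping should score below all three of these numbers on the same test set.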

Recommender #1: pyFM

Our first attempt was to implement factorization machines with the help of the pyFM library (https://github.com/coreylynch/pyFM). A factorization machine is a model that combines ideas from support vector machines and factorization models, and it handles high levels of sparsity well. With so many dimensions in our data, factorization machines seemed to be a good approach. However, we ran into issues with speed and memory, and could only run the code successfully on a small subset of the data. Additionally, this method only took user ratings into account, and we wanted to incorporate anime characteristics to improve our predictive power. We moved on from this approach, but we achieved an MSE of 8.99 when building the model with a random subset of the ratings, so we kept that as a target benchmark to beat.

Recommender #2: MLlib via Spark running on AWS

We also set up an AWS instance and got Spark running on the cluster, with the aim of testing the MLlib library for training a recommendation engine. We tested some recommendations based on user profiles we received from people we know. The initial results were not very satisfactory, and on further scrutiny we considered changing our technique. We were initially using an explicit feedback system that only took user scores into account. This ignored factors such as whether the series had actually been watched when it was rated. Another major problem was that ratings were mainly available for relatively popular shows (because those are the shows that most people watched and rated highly), so the recommendation engine preferred recommending popular shows, even when recommending a lesser-known show made sense. We chose to adopt an implicit feedback system with a scaled score between 0 and 1, formulated from factors such as the rating of the show, whether it was actually watched, and potentially which genres it belonged to. We experimented with different implicit feedback values to see whether they decreased our Root Mean Square Error (RMSE) and how the recommendations changed. We achieved a very satisfactory result, but it came at the cost of running an expensive Spark cluster. We therefore looked for other options that would give results that were just as satisfactory without needing the processing power of a costly AWS setup. With this mission in mind, we plunged forward.
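
To make the idea of a scaled implicit score concrete, here is a minimal sketch that blends a rating with a watched flag into a value between 0 and 1. The function name, the weighted-average form, and the weights are hypothetical placeholders chosen for illustration; they are not the formula or values we actually experimented with.

```python
def implicit_score(rating, watched, rating_weight=0.8, watched_weight=0.2):
    """Combine a 0-10 rating and a watched flag into a score in [0, 1].

    The weights are hypothetical placeholders, not tuned values.
    """
    if not 0 <= rating <= 10:
        raise ValueError("rating must be between 0 and 10")
    rating_part = rating / 10.0              # scale the explicit score to [0, 1]
    watched_part = 1.0 if watched else 0.0   # reward shows that were actually watched
    total = rating_weight + watched_weight
    return (rating_weight * rating_part + watched_weight * watched_part) / total
```

Under these assumed weights, a show rated 5 that was actually watched scores about 0.6, while the same rating on an unwatched show scores 0.4.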

Recommender #3: RecDB

We tried to use RecDB as a suitable alternative for recommendation. It is an open source recommendation system built on top of PostgreSQL and has a number of options for different recommendation algorithms. The primary advantage of this approach was the ease of requesting recommendations (they could be written in the form of SQL queries). The database-centered implementation also promised speed, particularly because we could train a model a priori to give faster recommendations. However, after a lot of tinkering we could not get it to work due to some issues with the implementation. After mailing the developer list a few times, we decided to move forward with other options.

Recommender #4: Simon Funk SVD (our final model)

Our final model uses the Simon Funk Singular Value Decomposition Model, implemented in Julia. This model allows us to combine user ratings and anime characteristics (genre, etc.) to make recommendations. Using the IncrementalSVD package in Julia, we were able to make predictions about how individual users might rate new items, as well as identify different features in our data that have the most predictive power.

Our recommendation engine has two functions:

Recommending Animes to existing MyAnimeList.net users: Given a user’s ID and his ratings for various animes, we predict his score on all animes that he has not yet seen, and recommend the animes with the highest predicted ratings.
Recommending Animes to a user based on Animes that they already like: Using the item-item similarity function in the Julia package, we recommend the most similar Animes to the ones the user specified.
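
For readers unfamiliar with Funk-style SVD, the core idea can be sketched in a few lines of Python. This is a simplified illustration and not the IncrementalSVD Julia package we used: it updates all factors together by stochastic gradient descent, and the function names, learning rate, regularization, and factor count are assumptions chosen for the example.

```python
import random


def train_funk_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit user and item factor vectors by stochastic gradient descent.

    ratings is a list of (user_id, item_id, score) tuples.
    Returns (user_factors, item_factors) as dicts of lists.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Start from small random factors so the first predictions are near zero.
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(p[u], q[i]))
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Move both factors along the error gradient, with L2 shrinkage.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def predict(p, q, user, item):
    """Predicted score is the dot product of the user and item factors."""
    return sum(pu * qi for pu, qi in zip(p[user], q[item]))


def recommend(p, q, user, seen, top_n=10):
    """Rank the items the user has not seen by predicted score."""
    candidates = [i for i in q if i not in seen]
    candidates.sort(key=lambda i: predict(p, q, user, i), reverse=True)
    return candidates[:top_n]
```

The first mode of the engine corresponds to calling something like `recommend` with the set of animes the user has already rated. The second mode instead compares the learned item factor vectors with one another.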

Results and Evaluation

Using the Julia IncrementalSVD package described in the above section, we achieved an MSE of 7.29 (an RMSE of 2.7). This means that the rating our model predicts a user will give to an anime is typically off by about 2.7 points. The heatmap below visualizes our results. It represents a confusion matrix for a random subset of our ratings/predictions, where the vertical axis is the rounded score that our model predicts a user will give to an anime, and the horizontal axis is the actual score that the user gave. The “hotter” the square, the more frequent the intersection:

Ideally, we would like to see a diagonal trend from the upper left corner to the lower right. This trend is somewhat present in our heatmap, but we have some unexpected ‘heat’ in the upper right hand corner. It is difficult to say what might be causing this.

This appears to be an innate bias in our data set that we struggle to account for. One possible explanation is that certain shows have cult-like followings, so that a show the model anticipates to be rated poorly actually turns out to be rated highly. The most likely explanation, however, has to do with the scale itself. Most recommenders rely on a consistent and easily parsable 5-point scale: a show can be very good (5), good (4), ok (3), bad (2) or very bad (1). This scale works well for recording general sentiment because it leaves the user much less room for confusion than a 10-point one. Some users will rate a show they find average as an 8, while other users will rate such shows as low as 6. Simply put, there is no well-defined expected value for an unknown show, and this makes the machine learning task quite difficult. Furthermore, this is the minimum RMSE we could obtain using Netflix’s algorithms. One interesting conclusion to draw is that Netflix’s algorithms do not work well on this particular dataset, as seen from this bias, which we were unable to resolve by reformatting the data or modifying the hyperparameters.

One possible solution is to rescale the scores to a range of 1 to 5, but even for humans this is difficult. One user’s seven could serve as another user’s eight.

Overall, we are happy to have an RMSE of 2.7. We have learned that ratings are difficult to predict; after trying several models and experimenting with optimizing parameters, this was the best we achieved.

It is difficult to find a metric to evaluate the performance of the ‘cold-start’ mode in our application (Cold-Start is when a user manually specifies which Animes they already like without specifying a MyAnimeList.net username). For this mode of operation, we simply find those Animes that are the most similar to those that the user inputs, based on the cosine similarity function applied to find similar users and items. We did, however, decide to compare our recommendations with user-provided recommendations on MyAnimeList.net. The site allows users to suggest shows that one might like if s/he likes another given Anime. We were able to scrape the site to get the list of the most frequently recommended Animes (given by users) for each Anime on the site. We then compared our list of recommendations (similar items) with the list of user-provided recommendations scraped from the site. The graph below shows the distribution of the percentage overlap (i.e. what percentage of our recommendations appeared in the list of user recommendations) for a set of 60 of the most frequently rated animes:

[Figure: distribution of percentage overlap with user-provided recommendations]

We see from the graph that in general there is a non-trivial amount of overlap between our recommendations and the recommendations that users gave on the site. Although this is not exactly a measure of “accuracy” of our recommendations, it is reassuring to see that users knowledgeable about the Animes on the site often suggest some of the same shows that our ‘Cold-Start’ model does.
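
The overlap measure itself is simple. A minimal sketch, assuming both recommendation lists are given as lists of titles (the function name is illustrative):

```python
def percent_overlap(ours, theirs):
    """Percentage of our recommendations that also appear in the users' list."""
    if not ours:
        return 0.0
    theirs_set = set(theirs)
    hits = sum(1 for title in ours if title in theirs_set)
    return 100.0 * hits / len(ours)
```

For example, if two of our four recommendations for a given Anime also appear among the user-provided recommendations, the overlap is 50%.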

Our Application in Action

On the main screen, choose “I’m a MyAnimeList.net user” or “Cold-Start” mode and enter the appropriate information:

Hit Submit and view your recommendations:

The graph at the bottom shows the top 5 genres that we recommended for the user. Mousing over one of the bars reveals a tooltip of the number of animes we recommended within that genre.

Conclusion

We are satisfied with our Data Science project. We believe that we met our 100% deliverable standard, which was to use all of our data to make a recommendation engine incorporating both user ratings and anime characteristics, and to present our recommendations in an application. We also made a step towards our 125% deliverable goal, by creating the top genre interactive visualization which accompanies the recommendations.

An idea for further improvement would be to enhance our cold-start feature. We could allow the user to enter not just titles of animes, but also their personal ratings. Incorporating the ratings we could perhaps weight the different titles differently and adjust our recommendations or at least sort them by appropriateness. We could also attempt to figure out how to treat the cold-start user as an existing user, and try to make recommendations by predicting his ratings on new titles (as done in the “existing user” mode).

APPENDIX: Code and Data Descriptions

user_extractor.py parses the JSON anime file to remove Manga and other unwanted anime data and produce a concise CSV file.

anime.csv is produced by user_extractor.py. It is a CSV file containing characteristics of the animes, including genre, image, average user rating, etc. to display as part of the results in the application. This file was loaded into SQL as the “animes” table.

cold_start.jl finds the 10 most similar animes for each anime and produces a text file, which is processed into a csv by similar.py. The output file is similar_animes.csv, which contains an anime followed by its 10 cold-start recommendations separated by pipes (“|”). This CSV was loaded into SQL as the “similar” table, and used for making cold-start recommendations.
Website code: Our website was made using PHP and MySQL. All the files used for our website are inside the Website/files/ folder (inside the ‘Code & Partial Data’ RAR file). Additionally, inside the Website/database_backup/ folder (inside the ‘Code & Partial Data’ RAR file), you’ll find a file called anime.sql. This is a complete backup of our MySQL database that is used for ‘lookup’ features of our website (e.g. to map a MyAnimeList.net username to a user ID, to provide drop-down texts for text fields, to get ‘Cold-Start’ recommendations, etc.).

Status Update – 3rd May 2016

BlogPost #3

Group Members:

Aaron Gokaslan (agokasla), Ahmad Najeeb (anajeeb), Haris Choudhary (hchoudha), Nicholas Hartmann (nhartman)

________________________________________________________________

RecDB

This week we have been dealing with issues with RecDB. Currently, it is not responding to any of our queries. We have posted the issue on GitHub, but have not received a response. The query response is always the following:

select * from relations
recommend anime_id to user_id on score using ItemPearCF
where user_id = 499231
order by score desc
limit 50;
LOG:  server process (PID 21008) was terminated by signal 11: Segmentation fault
DETAIL:  Failed process was running: select * from relations
        recommend anime_id to user_id on score using ItemPearCF
        where user_id = 499231
        order by score desc
        limit 50;
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2016-05-2 05:39:33 UTC
FATAL:  the database system is in recovery mode
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  record with zero length at C/22592BF0
LOG:  redo is not required
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
^CLOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down

We are currently trying different ways to get past the segmentation fault (training other classifiers, using different algorithms and mailing the dev list).

Building Our Application

While we deal with our RecDB issues, we are also building the application through which we will present the project. We have loaded all of the anime data (characteristics of the animes themselves, such as title, genres, id, etc.) into a SQL database to be used in the application. Using PHP and Ajax, our first step was to create a search bar with an auto-complete feature: the user types in part of a title, and auto-complete options appear, drawn from the animes in the SQL database. The idea is to allow the user to search for animes to rate, and the ratings will then be used to make recommendations.

Content Filtering

As mentioned in the previous week’s blog post, we have been exploring the method of content filtering in order to incorporate anime genres and other features into our recommendation engine. Our current model uses a user’s ratings of animes to estimate preferences for each genre–a preference for genre x is calculated as the average of all the user’s scores for animes within genre x. Next we loop through all of the animes that the user has not seen (rated). We give each anime a score that equals the sum of the user’s preferences for each of the genres that the anime comprises. This method of content filtering is modeled after what is described in the tutorial here. We may consider some slight modifications. One potential change might be computing each anime’s score as the average preference of the genres of the anime, rather than the sum, so that we do not over-weight animes with many different genres. Another modification we could consider is allowing users to specify their preferences for genres, rather than trying to infer them from their ratings, which may give us more reliable numbers to work with. Once we have the collaborative filtering portion (RecDB) of the recommendation engine working, we will experiment and decide what will work best.
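
The genre-preference scoring described above can be sketched as follows. This is a minimal illustration under assumed data structures (plain dictionaries keyed by anime id) with hypothetical function names. It covers both the sum variant and the averaging variant we are considering, and it assumes that genres the user has never rated contribute nothing to a score.

```python
from collections import defaultdict


def genre_preferences(user_ratings, anime_genres):
    """Average the user's scores per genre.

    user_ratings: dict mapping anime_id -> score
    anime_genres: dict mapping anime_id -> list of genre names
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for anime_id, score in user_ratings.items():
        for genre in anime_genres.get(anime_id, []):
            sums[genre] += score
            counts[genre] += 1
    return {genre: sums[genre] / counts[genre] for genre in sums}


def score_unseen(user_ratings, anime_genres, average=False):
    """Score every unrated anime from the user's genre preferences.

    With average=False the score is the sum of the preferences for the
    anime's genres; with average=True it is their mean, which avoids
    over-weighting animes that carry many genres.
    """
    prefs = genre_preferences(user_ratings, anime_genres)
    scores = {}
    for anime_id, genres in anime_genres.items():
        if anime_id in user_ratings:
            continue  # only score animes the user has not rated
        # Assumption: genres with no known preference are skipped.
        known = [prefs[g] for g in genres if g in prefs]
        if not known:
            scores[anime_id] = 0.0
        elif average:
            scores[anime_id] = sum(known) / len(known)
        else:
            scores[anime_id] = sum(known)
    return scores
```

Sorting the returned dictionary by value in descending order gives the content-based ranking for that user.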

Status Update – 26th April 2016

BlogPost #2

Group Members:

Aaron Gokaslan (agokasla), Ahmad Najeeb (anajeeb), Haris Choudhary (hchoudha), Nicholas Hartmann (nhartman)

________________________________________________________________

Last week we mentioned that we were exploring RecDB as an alternative to using MLlib + Spark for our anime recommendation needs. We are still investigating whether RecDB is a good alternative, and we spent most of this past week setting up the RecDB database.

Setting up and testing RecDB has taken a decent amount of time because we first had to build a database and import all of our cleaned anime data into it. The import took around 20 hours of computation time, primarily because we had to set up RecDB inside an Ubuntu VM. Unfortunately, sample queries were running for upwards of 2 hours without completing. We are therefore migrating RecDB to an Amazon EC2 instance, where we can request much more compute power and significantly cut query times. There we will build a recommender a priori, which can then be migrated back to a local computer and used for faster response times. We are also investigating different algorithms for the recommender.

RecDB Algorithms:

Although we have not yet been able to run any of the algorithms, we have been doing research and weighing the pros and cons of using the different recommendation algorithms that RecDB allows:

Item-item collaborative filtering using cosine similarity measure
Item-item collaborative filtering using Pearson correlation
User-user collaborative filtering using cosine similarity measure
User-user collaborative filtering using Pearson correlation

For a few reasons, we are leaning towards item-based filtering instead of user-based filtering. One reason is scale; the number of animes on the site is much less volatile than the number of users, so item-item similarities could be computed offline and accessed when needed. We anticipate that there would be greater scale challenges with user-based filtering. Another point in favor of item-based filtering is that item similarities are more likely to converge over time than user similarities. It is not immediately clear whether Pearson correlation or cosine similarity will be a better weight measure for our purposes, so we will likely explore both and compare their performances.
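
To illustrate the difference between the two weight measures, here is a minimal sketch that treats each anime as the list of scores given by the same set of co-rating users. The function names are illustrative and this is not RecDB's implementation. Pearson correlation is computed as cosine similarity after mean-centering each vector.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length score vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine_similarity(
        [a - mean_x for a in x],
        [b - mean_y for b in y],
    )
```

The practical difference shows up when two animes are rated in opposite orders by the same users. For scores [1, 2, 3] and [3, 2, 1], cosine similarity is still about 0.71 because all scores are positive, while Pearson correlation is -1.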

Shortcoming of RecDB:

A notable shortcoming of RecDB is that it will restrict us to using one of its pre-written recommendation algorithms. This limits our ability to incorporate genre, actors, and other features of the animes in our recommendation engine. To get around this issue, we might also use a content-based filtering algorithm (which we could write ourselves) to make recommendations based only on features of the anime. We could look at the results of the collaborative filtering model from RecDB and the results of the content-based filtering model together to determine what to recommend to the user. This is an idea we plan to explore this coming week. Building the content-based system could be done rather easily because we would be relying only on the data from our anime csv (containing roughly 11,000 animes and their features), which is of a much more manageable size than the user data we are working with.

Status Update – 17th April 2016

Group Members:

Aaron Gokaslan (agokasla), Ahmad Najeeb (anajeeb), Haris Choudhary (hchoudha), Nicholas Hartmann (nhartman)

————-

We set up an AWS instance and managed to get Spark running on the cluster. The library we are currently using to train our recommendation engine is MLlib. We tested some recommendations based on user profiles we received from people we know. The initial results were not very satisfactory, and on further scrutiny we considered changing our technique. We were initially using an explicit feedback system that only took user scores into account. This ignored factors such as whether the series had actually been watched when it was rated, as well as other dimensions of the data. Another major problem was that ratings were mainly available for relatively popular shows (because those are the shows that most people watched and rated highly), so the recommendation engine preferred recommending popular shows, even when recommending a lesser-known show made sense. We chose to adopt an implicit feedback system with a scaled score between 0 and 1, formulated from factors such as the rating of the show, whether it was actually watched, and potentially what genre it is. We currently hypothesize that if all the other factors, which are easier to scale between 0 and 1, are correctly incorporated, the genres of the recommendations should come out right as a consequence. We are in the process of experimenting with different implicit feedback values to see whether they decrease our Root Mean Square Error (RMSE) and how the recommendations change.

Some recommendation outputs via Spark based on Aaron’s tastes are as follows:

1: Clannad: After Story 2: Clannad 3: Fate/stay night: Unlimited Blade Works 2nd Season 4: Hello Kitty to Issho 5: Neon Genesis Evangelion: The End of Evangelion 6: Mahou Shoujo Madoka★Magica Movie 3: Hangyaku no Monogatari 7: Clannad: Mou Hitotsu no Sekai, Tomoyo-hen 8: Mahou Shoujo Madoka★Magica 9: Fate/stay night: Unlimited Blade Works 10: Clannad: After Story – Mou Hitotsu no Sekai, Kyou-hen 11: Bishoujo Senshi Sailor Moon: Sailor Stars 12: Fate/Zero 2nd Season 13: Bishoujo Senshi Sailor Moon S 14: Shigatsu wa Kimi no Uso 15: Pikmin Short Movies 16: Mahou Shoujo Madoka★Magica Movie 2: Eien no Monogatari 17: Bishoujo Senshi Sailor Moon R 18: Fate/stay night: Unlimited Blade Works – Prologue 19: Neon Genesis Evangelion 20: Mahou Shoujo Madoka★Magica Movie 1: Hajimari no Monogatari 21: Code Geass: Hangyaku no Lelouch R2 22: Bishoujo Senshi Sailor Moon SuperS 23: Fullmetal Alchemist: Brotherhood 24: Xi You Ji 25: Digimon Adventure: Bokura no War Game! 26: Dragon Ball Z 27: Bishoujo Senshi Sailor Moon S: Kaguya Hime no Koibito 28: Angel Beats! 29: Monogatari Series: Second Season 30: TV-ban Pocket Monsters Special Masara Town Hen Soushuuhen

The recommendations are decent at best. Some of the recommended series, such as Hello Kitty at #4, are ones that Aaron insists he wouldn’t be caught dead watching. We approached other people we knew for their MyAnimeList profiles, which include their tastes, and tried to generate recommendations for them. One such result is as follows:

Anime recommended for you:
1: Fullmetal Alchemist: Brotherhood
2: Tengen Toppa Gurren Lagann
3: JoJo no Kimyou na Bouken (TV)
4: JoJo no Kimyou na Bouken: Stardust Crusaders 2nd Season
5: JoJo no Kimyou na Bouken: Stardust Crusaders
6: Ghost in the Shell: Stand Alone Complex 2nd GIG
7: Kill la Kill
8: Monogatari Series: Second Season
9: Cowboy Bebop
….

There were 50 results, but they have been truncated for brevity. (A full list of recommendations can be obtained by contacting us.) As you can see, multiple seasons or forms of the same series are included. We will need to find a way around this by manipulating the data or the algorithm so that it does not recommend different versions (TV, movie, etc.) of the same series.

We’ve been playing around with the parameters in our implicit rankings recommendation engine code and have managed to reduce RMSE to as low as 0.49 (compared to 0.79 which was the baseline RMSE we got without any parameter tweaking). However we still have to explore this further to make sure we’re using the best parameter values, and to also avoid overfitting.

Our tinkering and the use of implicit feedback have really increased the quality of our recommendations. This time, when we generated recommendations for Aaron, they were all high-quality, watchable series that Aaron either felt he’d be interested in or, for the most part, had already watched and enjoyed.

The improved recommendations are listed as follows:

Anime recommended for you: 1: Higurashi no Naku Koro ni 2: Higurashi no Naku Koro ni Kai 3: Elfen Lied 4: Neon Genesis Evangelion 5: Mahou Shoujo Madoka★Magica 6: Code Geass: Hangyaku no Lelouch 7: Code Geass: Hangyaku no Lelouch R2 8: Another 9: Steins;Gate 10: Death Note 11: Shiki 12: Fate/Zero 13: Fate/Zero 2nd Season 14: Psycho-Pass 15: Neon Genesis Evangelion: The End of Evangelion 16: Darker than Black: Kuro no Keiyakusha 17: Durarara!! 18: Ergo Proxy 19: Baccano! 20: Claymore

We are also inviting other people who have a MyAnimeList profile and want to see some Anime recommendations to contact us, so we can see what results we generate for them and potentially get feedback on how good they are. Currently, generation is limited to MyAnimeList users (unless someone wants to come in and manually select a bunch of Anime to base recommendations on).

Another issue we’re currently exploring is whether we want to continue using AWS. AWS is great and offers a lot of scalability features that are useful to us, but we only have a limited amount of AWS credit, and at this point we’re not sure it will be sufficient for the entirety of our project. Because of this, we might switch to a department machine with enough RAM (about 30GB) for our use, or use a department supercomputer cluster. This isn’t decided yet and we’re still looking into it. In line with this, we’re also exploring RecDB as a potential recommendation engine, which also supports collaborative filtering. At this point we’re just trying to get it running.

Midterm Report

Introduction

The goal of this project is to make a recommendation engine for anime shows. We are using the data from the website myanimelist.net, which allows users to give ratings (scores from 0 to 10) to anime shows. We have processed our data to create two csv files.

The first csv is of anime shows and their characteristics. Variables include genre, date of release, title, actors, and a unique identifier. This file includes 11,517 unique anime shows.

The second csv contains users and their ratings. Each row of the csv contains a user id, anime id, time stamp (time of rating), and score out of 10. There are approximately 169 million rows.

Visualizations

The visualization below shows the average rating of all anime within each genre. When we initially thought about this, we expected a large deviation between genres, since a huge number of titles and anime series are attributed to only a small number of genres. We thought we could use the knowledge of which Anime are rated low to some advantage in our recommendation engine. The visualization, however, shows a more or less high and consistent trend across almost all genres. Some genres are rated low overall, but this can be attributed to the fact that these genres contain few shows (some have only a couple of series). Another interesting trend we observed is that adult-themed anime (primarily pornographic categories) seem to score consistently lower than their counterparts. The graph was adapted and changed from this link.

[Figure: average rating per genre]

Another interesting aspect we thought to plot was the overall trend of released Anime, to see whether our recommendation engine is still viable and in demand. Anime releases have increased consistently: more and more series are being produced per year, which shows high demand in the industry. This graph was adapted from here.

Figure: Number of anime released per year.

First Steps with Machine Learning

Since the goal of our project is to build a recommendation engine, we rely heavily on machine learning. We are implementing factorization machines with the help of the pyFM library (https://github.com/coreylynch/pyFM). A factorization machine is a predictor that, like a support vector machine with a polynomial kernel, models interactions between variables, but it learns those interactions through factorized (low-rank) parameters, which lets it handle high levels of sparsity well. With so many dimensions in our data, factorization machines seem to be the best approach.
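To make the model concrete, here is a minimal sketch of how a second-order factorization machine scores a single feature vector. It illustrates the standard model equation only and is not pyFM's implementation; the function name fm_predict and the toy weights are hypothetical, and we assume each rating is encoded as a one-hot user id followed by a one-hot anime id.

```python
from typing import Sequence


def fm_predict(x: Sequence[float], w0: float, w: Sequence[float],
               v: Sequence[Sequence[float]]) -> float:
    """Second-order factorization machine prediction for one feature vector.

    x  : feature values (mostly zeros for one-hot user/anime ids)
    w0 : global bias
    w  : one linear weight per feature
    v  : one k-dimensional latent vector per feature
    """
    n = len(x)
    k = len(v[0])
    linear = sum(w[i] * x[i] for i in range(n))
    # Sum over all pairs i < j of <v_i, v_j> * x_i * x_j, computed in
    # O(k * n) with the identity 0.5 * ((sum a)^2 - sum a^2) per factor.
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(n))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


# Toy example (all numbers made up): 2 users and 2 anime, one-hot encoded.
x = [1.0, 0.0, 1.0, 0.0]                   # user 0 rates anime 0
w0 = 5.0
w = [0.5, -0.5, 1.0, -1.0]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]
prediction = fm_predict(x, w0, w, v)
print(prediction)
```

Because only the user and anime features are non-zero, the pairwise term reduces to the dot product of that user's and that anime's latent vectors, which is why the model copes with very sparse rating data.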

With our data set being so enormous (see the Discussion section), running the factorization machine algorithm on the whole data set is extremely time consuming and requires an exorbitant amount of memory. Therefore, for our preliminary steps with building our model, we have been working with a subset of the data which includes a random sample of 10,000 users and their ratings on the top 100 most frequently rated anime shows on the site. This results in a set of 331,025 ratings stored in a file called “users_top_100.csv”. (On average, each of the users in the sample has rated about 33 of the top 100 most frequently rated animes.) The format of the CSV is as follows:

user_id, anime_id, score, time_stamp
14090, 199, 8, 1190287278
……………………………….
……………………………….

Although we recognize that there may be some biases in the set of the top 100 most frequently rated animes, we believe that using this subset is a good starting point for building and testing models. Going forward we will try to incorporate all (or at least more) of the data, as well as more variables.
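The subset described above could be built roughly as follows. This is a sketch under assumptions, not our actual preprocessing script: the function name build_subset is hypothetical, we assume users are sampled uniformly from all users, and the toy version holds every rating in memory, which would not be practical for the full 169 million rows.

```python
import random
from collections import Counter
from typing import List, Tuple

# (user_id, anime_id, score, time_stamp), matching the CSV columns above
Rating = Tuple[int, int, int, int]


def build_subset(ratings: List[Rating], n_users: int, n_anime: int,
                 seed: int = 0) -> List[Rating]:
    """Keep ratings made by a random sample of users on the most-rated anime."""
    # Rank anime by how many ratings they received and keep the top n_anime.
    counts = Counter(anime_id for _, anime_id, _, _ in ratings)
    top_anime = {anime_id for anime_id, _ in counts.most_common(n_anime)}
    # Sample users uniformly at random (assumption: from all users).
    users = sorted({user_id for user_id, _, _, _ in ratings})
    rng = random.Random(seed)
    sampled = set(rng.sample(users, min(n_users, len(users))))
    return [r for r in ratings if r[0] in sampled and r[1] in top_anime]


# Toy data; only the first row comes from the sample format shown above.
toy = [
    (14090, 199, 8, 1190287278),
    (2, 199, 7, 0),
    (3, 199, 9, 0),
    (14090, 20, 6, 0),
    (2, 20, 5, 0),
    (3, 5, 10, 0),
]
subset = build_subset(toy, n_users=2, n_anime=1)
print(subset)
```

For our data, the call would use n_users=10000 and n_anime=100.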

Process:

We start by randomly splitting the ratings in “users_top_100.csv” into two subsets: a training set and a testing set. We train the model on the training set. Then we use the model to predict the ratings in the testing set, compare the predictions to the actual ratings, and compute the mean squared error (MSE) as a measure of accuracy.

Because of the randomness in splitting our data into training and testing sets, the mean squared error varies slightly with each execution of the algorithm. The following table gives the average MSE over 5 executions for several prediction methods: our factorization machines method as well as some benchmark methods for comparison.
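The split and the error measure can be sketched as below. The function names and the 20% test fraction are assumptions made for illustration; the model training itself is left out.

```python
import random
from typing import List, Sequence, Tuple

# (user_id, anime_id, score, time_stamp)
Rating = Tuple[int, int, int, int]


def train_test_split(ratings: Sequence[Rating], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Rating], List[Rating]]:
    """Randomly split ratings into (training set, testing set)."""
    rng = random.Random(seed)
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(actual: Sequence[float],
                       predicted: Sequence[float]) -> float:
    """Average of the squared differences between actual and predicted."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy data: 10 made-up ratings, split into 8 for training and 2 for testing.
toy = [(user, 199, 5 + user % 5, 0) for user in range(10)]
train, test = train_test_split(toy, test_fraction=0.2, seed=1)
print(len(train), len(test))
print(mean_squared_error([8, 6], [7, 9]))  # (1 + 9) / 2
```

Changing the seed changes which ratings land in the testing set, which is why the error varies from run to run.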

Prediction Method	Average MSE
Factorization Machines	8.99
Assign Random Rating Between 0 and 10	25.10
Assign Rating of 5 (“middle” rating)	15.17
Predict average rating given for that anime in the training set	13.66

As the table shows, our factorization machine method handily beats all of the benchmarks. Moving forward, we might treat the MSE of 8.99 as the score to beat. We hope to improve our accuracy by including more data points and incorporating more features of the data that may be predictive, such as genres and time of rating.
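The three benchmark methods in the table can be sketched as follows. The function names are hypothetical, and two details are assumptions on our part: the random benchmark draws integer ratings uniformly from 0 to 10, and the per-anime benchmark falls back to the overall training mean for an anime that does not appear in the training set.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# (user_id, anime_id, score, time_stamp)
Rating = Tuple[int, int, int, int]


def constant_baseline(test: Sequence[Rating], value: float = 5.0) -> List[float]:
    """Predict the same "middle" rating for every test row."""
    return [value for _ in test]


def random_baseline(test: Sequence[Rating], seed: int = 0) -> List[float]:
    """Predict a random integer rating between 0 and 10 (inclusive)."""
    rng = random.Random(seed)
    return [float(rng.randint(0, 10)) for _ in test]


def anime_mean_baseline(train: Sequence[Rating],
                        test: Sequence[Rating]) -> List[float]:
    """Predict each anime's average rating in the training set."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for _, anime_id, score, _ in train:
        sums[anime_id] += score
        counts[anime_id] += 1
    # Fallback for anime never seen in training (an assumption).
    global_mean = sum(score for _, _, score, _ in train) / len(train)
    return [sums[a] / counts[a] if a in counts else global_mean
            for _, a, _, _ in test]


# Toy data (made up): anime 5 appears only in the test set.
train = [(1, 199, 8, 0), (2, 199, 6, 0), (1, 20, 4, 0)]
test = [(3, 199, 9, 0), (3, 5, 10, 0)]
print(constant_baseline(test))
print(random_baseline(test))
print(anime_mean_baseline(train, test))
```

Each function returns one prediction per test row, so the output of any of them can be compared against the actual test scores with the same mean squared error calculation used for the factorization machine.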

Discussion

What is the hardest part of the project that you’ve encountered so far?

The biggest problem we have faced so far is the raw size of our dataset. We have over 169 million ratings covering over 11,000 anime shows, which is an enormous amount of data to load and process in any Python script we write. We also have dozens of attributes for each user and rating, which increases the size even further. Parsing, cleaning, filtering, and processing this much data takes a lot of time, disk space, and memory.

What are your initial insights?

From the visualizations, we have seen that nearly all genres share a consistently high average score, which is something we did not predict. The number of anime produced also seems to rise each year, which increases the relevance of our recommendation engine.

Another insight from analyzing the data is that the users in our random sample of 10,000 have each rated, on average, about 33 of the top 100 most frequently rated anime. We can explain this intuitively: there are certain shows that nearly every newcomer watches, and these draw him/her into the hobby and on to more obscure anime. So a certain set of top anime is far better known than all the others and acts as a “Gateway Series” to other anime series.

Are there any concrete results you can show at this point? If not, why not?

As mentioned in the Machine Learning section, our baseline model outperforms the benchmarks we proposed. Using the data subset “users_top_100.csv”, we achieved a mean squared error on our test data of around 9. To interpret this, the root mean squared error is about 3, meaning our predicted ratings (out of 10) typically differ from the actual ratings by roughly 3 points. We hope to improve our accuracy by incorporating more data and more variables/features. We may also explore other evaluation methods in addition to mean squared error.
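The “about 3” figure comes from taking the square root of the MSE, as this short calculation shows (the variable names are ours):

```python
import math

mse = 8.99              # average test MSE reported in the table above
rmse = math.sqrt(mse)   # root mean squared error, in rating points
print(f"RMSE = {rmse:.2f} rating points")
```

Strictly speaking this is the root mean squared error rather than the average absolute difference. The mean absolute error can never exceed the RMSE, so 3 points is an upper bound on the average absolute miss.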

Going forward, what are the current biggest problems you’re facing?

One of our major concerns going forward is that the Python library we’re using for recommendations, pyFM, doesn’t work properly when given very large input files to base a model on. Because of this, we had to limit the number of users to just a fraction of our total user base so that pyFM can work. Going forward, we may have to investigate this further and possibly switch to another Python library for recommendations.

Do you think you are on track with your project? If not, what parts do you need to dedicate more time to?

We feel that overall, we are on track and are not lagging behind in any area.

Given your initial exploration of the data, is it worth proceeding with your project?

We definitely think that it’s worth exploring this project further. Based on our aforementioned findings, it’s clear that our recommendation engine is already able to predict ratings much better than a random guess, and it’s exciting to think how we can further improve on this.