Steam Review Analysis

Brief Overview: This project examines a dataset of more than 100 million Steam reviews and aims to build two machine learning models: one that predicts how helpful a user review is, and one that groups Steam reviewers into distinct clusters.

Data was processed and stored on the Expanse system at the San Diego Supercomputer Center (SDSC), primarily using Spark through PySpark.

The initial data from Kaggle includes statistics about each review and its reviewer, including the reviewer's playtime, the number of games they own, and the number of positive ratings the review received. Steam also provides a parameter described as a “weighted vote score” that indicates how useful a review is.
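
To make the inputs described above concrete, here is a minimal sketch of how one review record might be turned into a numeric feature vector and a binary helpfulness label. It is written in plain Python rather than PySpark so it runs without a cluster. The field names, the assumption that the weighted vote score lies between 0 and 1, and the 0.5 cutoff are all illustrative assumptions, not the project's actual schema or method.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Review:
    # Hypothetical field names; the real dataset's column names may differ.
    playtime_hours: float       # reviewer's playtime
    games_owned: int            # number of games the reviewer owns
    votes_up: int               # positive ratings the review received
    weighted_vote_score: float  # Steam's usefulness score, assumed to be in [0, 1]


# Assumed cutoff for calling a review "helpful"; chosen only for illustration.
HELPFUL_THRESHOLD = 0.5


def to_features(review: Review) -> List[float]:
    """Build a numeric feature vector from the reviewer statistics."""
    return [float(review.playtime_hours), float(review.games_owned)]


def is_helpful(review: Review) -> bool:
    """Derive a binary label from the weighted vote score."""
    return review.weighted_vote_score >= HELPFUL_THRESHOLD


example = Review(playtime_hours=12.5, games_owned=40, votes_up=3,
                 weighted_vote_score=0.62)
features = to_features(example)
label = is_helpful(example)
```

Here the weighted vote score serves as the prediction target, while the positive-rating count is carried along but left out of the features because it is closely tied to that target. This is one possible design choice, not necessarily the one the project uses.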

This project is still ongoing as additional predictive and clustering models are explored.

View Full Repository on GitHub

Khanh Phan