Predicting The Success of Video Games
Kevin Rathbun
Video game developers and companies develop their games with one major goal in mind: to produce a successful game.
But what makes a game successful and how would we measure its success? In this data exploration project, I will
explore these questions. This will be useful for video game developers in determining what they could be doing to
improve their chance of creating a successful game. This will also serve to show what players value in a video game.
18 Columns:
0: appid:
The unique appid associated with the game
1: name:
The name of the game
2: release_date:
When the game was first released on Steam in YYYY-MM-DD format
3: english:
1 if the game is in English, 0 otherwise
4: developer:
Name(s) of developer(s) separated by semicolon if multiple
5: publisher:
Name(s) of publisher(s) separated by semicolon if multiple
6: platforms:
The supported platform(s) separated by semicolons if multiple.
Possible platforms are windows;mac;linux.
7: required_age:
Minimum age required according to PEGI UK standards.
Entries of 0 are unrated or unsupplied
8: categories:
Categories associated with the game separated by semicolons e.g., Single-player;Multi-player
9: genres:
The genres associated with the game separated by semicolons e.g., Action;Indie
10: steamspy_tags:
The steamspy_tags associated with the game separated by semicolons.
Similar to genres, but are community voted e.g., Action;Indie
11: achievements:
Number of achievements in the game
12: positive_ratings:
Number of positive ratings on Steam
13: negative_ratings:
Number of negative ratings on Steam
14: average_playtime:
Average playtime of users in minutes
15: median_playtime:
Median playtime of users in minutes
16: owners:
Estimated number of owners (lower and upper bound) e.g., 20000-50000
17: price:
Price of game in GBP
It is important to note that this data was gathered around May of 2019, so it is not entirely representative of
the games on Steam today or the metrics associated with them.
It should also be noted that Steam is for PC games only so this data is not representative of console games
or their playerbase.
To begin, let's gather the data we need and transform it to suit our goals.
Here we gather the two datasets we need: the dataset described in the introduction
as well as the hardware requirements dataset (also available at https://www.kaggle.com/datasets/nikdavis/steam-store-games).
# Imports
import pandas as pd
import numpy as np
import re
from sklearn.preprocessing import MultiLabelBinarizer
import matplotlib.pyplot as plt
from statistics import mean, median
from sklearn.linear_model import LinearRegression
from scipy.stats import spearmanr
# Increase the max number of columns and rows which can be displayed
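# (the full "display." prefix is used because the short names are ambiguous in newer pandas versions)
pd.set_option("display.max_columns", 500)
pd.set_option("display.max_rows", 500)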
# Load the steam_games csv into a pandas dataframe
games_df = pd.read_csv('steam_games.csv')
# Load the steam_requirements_data into a pandas dataframe
requirements_df = pd.read_csv('steam_requirements_data.csv')
print("Steam games data:")
display(games_df.head(10))
print("Steam game requirements data:")
display(requirements_df.head(10))
print("Value counts of required_age")
print(games_df["required_age"].value_counts())
Steam games data:
| | appid | name | release_date | english | developer | publisher | platforms | required_age | categories | genres | steamspy_tags | achievements | positive_ratings | negative_ratings | average_playtime | median_playtime | owners | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | 2000-11-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | 0 | 124534 | 3339 | 17612 | 317 | 10000000-20000000 | 7.19 |
| 1 | 20 | Team Fortress Classic | 1999-04-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | 0 | 3318 | 633 | 277 | 62 | 5000000-10000000 | 3.99 |
| 2 | 30 | Day of Defeat | 2003-05-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Multi-player;Valve Anti-Cheat enabled | Action | FPS;World War II;Multiplayer | 0 | 3416 | 398 | 187 | 34 | 5000000-10000000 | 3.99 |
| 3 | 40 | Deathmatch Classic | 2001-06-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Multi-player;Online Multi-Player;Local Multi-P... | Action | Action;FPS;Multiplayer | 0 | 1273 | 267 | 258 | 184 | 5000000-10000000 | 3.99 |
| 4 | 50 | Half-Life: Opposing Force | 1999-11-01 | 1 | Gearbox Software | Valve | windows;mac;linux | 0 | Single-player;Multi-player;Valve Anti-Cheat en... | Action | FPS;Action;Sci-fi | 0 | 5250 | 288 | 624 | 415 | 5000000-10000000 | 3.99 |
| 5 | 60 | Ricochet | 2000-11-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Multi-player;Online Multi-Player;Valve Anti-Ch... | Action | Action;FPS;Multiplayer | 0 | 2758 | 684 | 175 | 10 | 5000000-10000000 | 3.99 |
| 6 | 70 | Half-Life | 1998-11-08 | 1 | Valve | Valve | windows;mac;linux | 0 | Single-player;Multi-player;Online Multi-Player... | Action | FPS;Classic;Action | 0 | 27755 | 1100 | 1300 | 83 | 5000000-10000000 | 7.19 |
| 7 | 80 | Counter-Strike: Condition Zero | 2004-03-01 | 1 | Valve | Valve | windows;mac;linux | 0 | Single-player;Multi-player;Valve Anti-Cheat en... | Action | Action;FPS;Multiplayer | 0 | 12120 | 1439 | 427 | 43 | 10000000-20000000 | 7.19 |
| 8 | 130 | Half-Life: Blue Shift | 2001-06-01 | 1 | Gearbox Software | Valve | windows;mac;linux | 0 | Single-player | Action | FPS;Action;Sci-fi | 0 | 3822 | 420 | 361 | 205 | 5000000-10000000 | 3.99 |
| 9 | 220 | Half-Life 2 | 2004-11-16 | 1 | Valve | Valve | windows;mac;linux | 0 | Single-player;Steam Achievements;Steam Trading... | Action | FPS;Action;Sci-fi | 33 | 67902 | 2419 | 691 | 402 | 10000000-20000000 | 7.19 |
Steam game requirements data:
| | steam_appid | pc_requirements | mac_requirements | linux_requirements | minimum | recommended |
|---|---|---|---|---|---|---|
| 0 | 10 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 1 | 20 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 2 | 30 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 3 | 40 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 4 | 50 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 5 | 60 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 6 | 70 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 7 | 80 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | [] | [] | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 8 | 130 | {'minimum': '\r\n\t\t\t<p><strong>Minimum:</st... | {'minimum': 'Minimum: OS X Snow Leopard 10.6.... | {'minimum': 'Minimum: Linux Ubuntu 12.04, Dual... | 500 mhz processor, 96mb ram, 16mb video card, ... | NaN |
| 9 | 220 | {'minimum': '<strong>Minimum:</strong><br><ul ... | {'minimum': '<strong>Minimum:</strong><br><ul ... | [] | OS: Windows 7, Vista, XP Processor: 1.7 Ghz Me... | NaN |
Value counts of required_age
0     26479
18      308
16      192
12       73
7        12
3        11
Name: required_age, dtype: int64
Now, let's make the DataFrame better suited for our analysis by removing unnecessary columns and transforming others.
The columns we will be removing are:
english: Even if games in English are associated with more or less success, it doesn't really tell us anything
about the game itself.
achievements: The number of achievements in a game is most likely not a good indicator of success.
required_age: As shown above, the dataset contains mostly 0s for this metric so it will not be very useful.
positive_ratings and negative_ratings: Instead, we will have one rating column holding the fraction of ratings that are positive.
This makes it easier to compare across different games.
steamspy_tags: Instead, we will just focus on genres. The genres are less nuanced than the tags
making for easier comparisons. There are 300+ tags making it hard to do comparisons between them all.
The columns we will be transforming are:
release_date : Only the year is important
platforms, categories, genres: One-hot encode for easier analysis
owners: Currently a range; we will just take the midpoint of this range
minimum: Currently holds all the minimum hardware requirements. Instead, we will just look at the minimum RAM
requirement (in GB) so we have a metric which can easily be compared across games.
We will also add a new column est_revenue which is a product of the number of owners and the price to estimate the revenue of the game
(in GBP)
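As a quick sanity check of the arithmetic described above, the sketch below applies the three derived quantities (owners midpoint, rating, and estimated revenue) to the Counter-Strike row shown earlier. The helper name `midpoint` is illustrative only and is not part of the notebook.

```python
# Values copied from the Counter-Strike row of the raw games table
owners_range = "10000000-20000000"
price_gbp = 7.19
positive, negative = 124534, 3339

def midpoint(owner_range):
    """Return the middle of a 'low-high' owners range (hypothetical helper)."""
    low, high = (int(part) for part in owner_range.split("-"))
    return (low + high) // 2

owners = midpoint(owners_range)
# Fraction of all ratings that are positive (between 0 and 1)
rating = positive / (positive + negative)
# Rough revenue estimate: midpoint of owners times the listed price
est_revenue = owners * price_gbp

print(owners)
print(round(rating, 6))
print(round(est_revenue))
```

Because the estimate multiplies an owners midpoint by a single listed price, it should be read as a rough indicator rather than an actual sales figure.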
def transform_owners(x):
    # x is a range such as "20000-50000"; return its midpoint
    split = x.split("-")
    low = int(split[0])
    high = int(split[1])
    return int((low + high) * 0.5)
def transform_reqs(reqs):
    # Missing requirements are NaN (a float), so only strings are parsed
    if not isinstance(reqs, str):
        return np.nan
    # Remove spaces and make the string lowercase
    reqs = reqs.replace(" ", "")
    reqs = reqs.lower()
    # Find how much ram and the unit (mb or gb)
    res = re.search('[0-9]+(mbram|gbram)', reqs)
    # If nothing was found, the ram requirement is unknown
    if not res:
        return np.nan
    # The string found (e.g., 96mbram, 1gbram)
    res_str = res.group()
    # Extract the number
    number = int(re.search('[0-9]+', res_str).group())
    # If mb, convert to gb (using 1 gb = 1000 mb); otherwise it is already in gb
    if 'mb' in res_str:
        return number / 1000
    return number
games_df["release_date"] = games_df["release_date"].apply(lambda x: int(x[0:4]))
games_df["owners"] = games_df["owners"].apply(transform_owners)
games_df["rating"] = games_df["positive_ratings"] / (games_df["positive_ratings"] + games_df["negative_ratings"])
# Join the two dataframes on appid
games_df = games_df.join(requirements_df.set_index("steam_appid"), on=["appid"])
# Drop unneeded columns mentioned above
games_df.drop(columns=['english', 'achievements', 'required_age', 'positive_ratings', 'negative_ratings', 'steamspy_tags'], inplace=True)
# Drop unneeded columns resulting from joining the 2 dataframes
games_df.drop(columns=['pc_requirements', 'mac_requirements', 'linux_requirements', 'recommended'], inplace=True)
games_df["minimum"] = games_df["minimum"].apply(transform_reqs)
# Rename "minimum" to "min_req_ram"
games_df.rename(columns={'minimum': 'min_req_ram'}, inplace=True)
games_df["est_revenue"] = games_df["owners"] * games_df["price"]
display(games_df.head(10))
| | appid | name | release_date | developer | publisher | platforms | categories | genres | average_playtime | median_playtime | owners | price | rating | min_req_ram | est_revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | 2000 | Valve | Valve | windows;mac;linux | Multi-player;Online Multi-Player;Local Multi-P... | Action | 17612 | 317 | 15000000 | 7.19 | 0.973888 | 0.096 | 107850000.0 |
| 1 | 20 | Team Fortress Classic | 1999 | Valve | Valve | windows;mac;linux | Multi-player;Online Multi-Player;Local Multi-P... | Action | 277 | 62 | 7500000 | 3.99 | 0.839787 | 0.096 | 29925000.0 |
| 2 | 30 | Day of Defeat | 2003 | Valve | Valve | windows;mac;linux | Multi-player;Valve Anti-Cheat enabled | Action | 187 | 34 | 7500000 | 3.99 | 0.895648 | 0.096 | 29925000.0 |
| 3 | 40 | Deathmatch Classic | 2001 | Valve | Valve | windows;mac;linux | Multi-player;Online Multi-Player;Local Multi-P... | Action | 258 | 184 | 7500000 | 3.99 | 0.826623 | 0.096 | 29925000.0 |
| 4 | 50 | Half-Life: Opposing Force | 1999 | Gearbox Software | Valve | windows;mac;linux | Single-player;Multi-player;Valve Anti-Cheat en... | Action | 624 | 415 | 7500000 | 3.99 | 0.947996 | 0.096 | 29925000.0 |
| 5 | 60 | Ricochet | 2000 | Valve | Valve | windows;mac;linux | Multi-player;Online Multi-Player;Valve Anti-Ch... | Action | 175 | 10 | 7500000 | 3.99 | 0.801278 | 0.096 | 29925000.0 |
| 6 | 70 | Half-Life | 1998 | Valve | Valve | windows;mac;linux | Single-player;Multi-player;Online Multi-Player... | Action | 1300 | 83 | 7500000 | 7.19 | 0.961878 | 0.096 | 53925000.0 |
| 7 | 80 | Counter-Strike: Condition Zero | 2004 | Valve | Valve | windows;mac;linux | Single-player;Multi-player;Valve Anti-Cheat en... | Action | 427 | 43 | 15000000 | 7.19 | 0.893871 | 0.096 | 107850000.0 |
| 8 | 130 | Half-Life: Blue Shift | 2001 | Gearbox Software | Valve | windows;mac;linux | Single-player | Action | 361 | 205 | 7500000 | 3.99 | 0.900990 | 0.096 | 29925000.0 |
| 9 | 220 | Half-Life 2 | 2004 | Valve | Valve | windows;mac;linux | Single-player;Steam Achievements;Steam Trading... | Action | 691 | 402 | 15000000 | 7.19 | 0.965601 | 0.512 | 107850000.0 |
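To make the min_req_ram values above easier to interpret, here is a minimal standalone sketch of the same RAM extraction idea. The function name `min_ram_gb` and all sample strings except the first are assumptions for illustration; the first sample is only the visible prefix of a requirements string from the table earlier. Like the notebook, the sketch treats 1 GB as 1000 MB.

```python
import math
import re

def min_ram_gb(text):
    """Return the first '<n> mb ram' / '<n> gb ram' amount in GB, or NaN (hypothetical helper)."""
    if not isinstance(text, str):
        return math.nan
    # Same normalization as the notebook: drop spaces, lowercase
    compact = text.replace(" ", "").lower()
    match = re.search(r"([0-9]+)(mb|gb)ram", compact)
    if match is None:
        return math.nan
    amount, unit = int(match.group(1)), match.group(2)
    # Decimal convention (1 GB = 1000 MB), matching the values in the table above
    return amount / 1000 if unit == "mb" else amount

samples = [
    "500 mhz processor, 96mb ram, 16mb video card",  # visible prefix of a real row
    "Memory: 512 MB RAM",   # hypothetical
    "Memory: 4 GB RAM",     # hypothetical
    "Memory: 1.5 GB RAM",   # hypothetical: only the digits after the dot are matched
    "No memory listed",     # hypothetical
]
for s in samples:
    print(repr(s), "->", min_ram_gb(s))
```

One limitation this makes visible: because the pattern only accepts whole digits directly before the unit, a fractional requirement such as "1.5 GB RAM" is read as 5 GB. Whether such strings occur often enough in the dataset to matter has not been checked here.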
# A helper function used to see the unique elements over an entire column in games_df.
# Each element of the column should be a list e.g., ['Action', 'Indie']
def unique_col_elts(column_name):
    col_as_list = [x for x in games_df[column_name]]
    new = []
    for e in col_as_list:
        for e1 in e:
            if e1 not in new:
                new.append(e1)
    print("Unique entries:")
    print(new)
    print("Length:")
    print(len(new))
# Turn platforms column into a list splitting on ';' in order to do one-hot encoding
games_df["platforms"] = games_df["platforms"].apply(lambda x: x.split(';'))
# One-hot encode platforms
mlb = MultiLabelBinarizer()
mlb.fit(games_df['platforms'])
new_col_names = ["platform_%s" % c for c in mlb.classes_]
# Create new DataFrame with one-hot encoded platforms
platforms = pd.DataFrame(mlb.fit_transform(games_df['platforms']), columns=new_col_names, index=games_df['platforms'].index)
# Join platforms into games_df
games_df = games_df.join(platforms)
# Turn genres column into a list splitting on ';' in order to do one-hot encoding
games_df["genres"] = games_df["genres"].apply(lambda x: x.split(';'))
# One-hot encode genres
mlb = MultiLabelBinarizer()
mlb.fit(games_df['genres'])
new_col_names = ["genre_%s" % c for c in mlb.classes_]
# Create new DataFrame with one-hot encoded genres
genres = pd.DataFrame(mlb.fit_transform(games_df['genres']), columns=new_col_names, index=games_df['genres'].index)
# Join genres into games_df
games_df = games_df.join(genres)
# Turn categories column into a list splitting on ';' in order to do one-hot encoding
games_df["categories"] = games_df["categories"].apply(lambda x: x.split(';'))
# One-hot encode categories
mlb = MultiLabelBinarizer()
mlb.fit(games_df['categories'])
new_col_names = ["category_%s" % c for c in mlb.classes_]
# Create new DataFrame with one-hot encoded categories
categories = pd.DataFrame(mlb.fit_transform(games_df['categories']), columns=new_col_names, index=games_df['categories'].index)
# Join categories into games_df
games_df = games_df.join(categories)
display(games_df.head(10))
print("Dimensionality of DataFrame:")
print(games_df.shape)
| | appid | name | release_date | developer | publisher | platforms | categories | genres | average_playtime | median_playtime | owners | price | rating | min_req_ram | est_revenue | platform_linux | platform_mac | platform_windows | genre_Accounting | genre_Action | genre_Adventure | genre_Animation & Modeling | genre_Audio Production | genre_Casual | genre_Design & Illustration | genre_Documentary | genre_Early Access | genre_Education | genre_Free to Play | genre_Game Development | genre_Gore | genre_Indie | genre_Massively Multiplayer | genre_Nudity | genre_Photo Editing | genre_RPG | genre_Racing | genre_Sexual Content | genre_Simulation | genre_Software Training | genre_Sports | genre_Strategy | genre_Tutorial | genre_Utilities | genre_Video Production | genre_Violent | genre_Web Publishing | category_Captions available | category_Co-op | category_Commentary available | category_Cross-Platform Multiplayer | category_Full controller support | category_In-App Purchases | category_Includes Source SDK | category_Includes level editor | category_Local Co-op | category_Local Multi-Player | category_MMO | category_Mods | category_Mods (require HL2) | category_Multi-player | category_Online Co-op | category_Online Multi-Player | category_Partial Controller Support | category_Shared/Split Screen | category_Single-player | category_Stats | category_Steam Achievements | category_Steam Cloud | category_Steam Leaderboards | category_Steam Trading Cards | category_Steam Turn Notifications | category_Steam Workshop | category_SteamVR Collectibles | category_VR Support | category_Valve Anti-Cheat enabled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | Counter-Strike | 2000 | Valve | Valve | [windows, mac, linux] | [Multi-player, Online Multi-Player, Local Mult... | [Action] | 17612 | 317 | 15000000 | 7.19 | 0.973888 | 0.096 | 107850000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 20 | Team Fortress Classic | 1999 | Valve | Valve | [windows, mac, linux] | [Multi-player, Online Multi-Player, Local Mult... | [Action] | 277 | 62 | 7500000 | 3.99 | 0.839787 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 30 | Day of Defeat | 2003 | Valve | Valve | [windows, mac, linux] | [Multi-player, Valve Anti-Cheat enabled] | [Action] | 187 | 34 | 7500000 | 3.99 | 0.895648 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 40 | Deathmatch Classic | 2001 | Valve | Valve | [windows, mac, linux] | [Multi-player, Online Multi-Player, Local Mult... | [Action] | 258 | 184 | 7500000 | 3.99 | 0.826623 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 50 | Half-Life: Opposing Force | 1999 | Gearbox Software | Valve | [windows, mac, linux] | [Single-player, Multi-player, Valve Anti-Cheat... | [Action] | 624 | 415 | 7500000 | 3.99 | 0.947996 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 60 | Ricochet | 2000 | Valve | Valve | [windows, mac, linux] | [Multi-player, Online Multi-Player, Valve Anti... | [Action] | 175 | 10 | 7500000 | 3.99 | 0.801278 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 6 | 70 | Half-Life | 1998 | Valve | Valve | [windows, mac, linux] | [Single-player, Multi-player, Online Multi-Pla... | [Action] | 1300 | 83 | 7500000 | 7.19 | 0.961878 | 0.096 | 53925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 80 | Counter-Strike: Condition Zero | 2004 | Valve | Valve | [windows, mac, linux] | [Single-player, Multi-player, Valve Anti-Cheat... | [Action] | 427 | 43 | 15000000 | 7.19 | 0.893871 | 0.096 | 107850000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 130 | Half-Life: Blue Shift | 2001 | Gearbox Software | Valve | [windows, mac, linux] | [Single-player] | [Action] | 361 | 205 | 7500000 | 3.99 | 0.900990 | 0.096 | 29925000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 220 | Half-Life 2 | 2004 | Valve | Valve | [windows, mac, linux] | [Single-player, Steam Achievements, Steam Trad... | [Action] | 691 | 402 | 15000000 | 7.19 | 0.965601 | 0.512 | 107850000.0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Dimensionality of DataFrame: (27075, 76)
We now have our DataFrame with all the necessary data and in the form we want it.
We can now move on to the next step.
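For readers unfamiliar with the one-hot encoding step used above, the following sketch reproduces it on a small, made-up three-row frame (the `toy` data is hypothetical). It also shows pandas' `Series.str.get_dummies` as a possible one-call alternative; this is offered as an option, not as the approach the notebook takes.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame using the same semicolon-separated format as the dataset
toy = pd.DataFrame({"genres": ["Action", "Action;Indie", "RPG;Indie"]})

# Approach used in the notebook: split into lists, then binarize
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(toy["genres"].str.split(";")),
    columns=["genre_%s" % c for c in mlb.classes_],
    index=toy.index,
)

# Alternative: let pandas split on ';' and encode in one call
alt = toy["genres"].str.get_dummies(sep=";").add_prefix("genre_")

print(encoded)
# Both approaches order the labels alphabetically, so the values line up
print((encoded.values == alt.values).all())
```

Each resulting column holds 1 where the game carries that label and 0 otherwise, which is what allows the later filters of the form `games_df["genre_" + genre] == 1`.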
We will now work to visualize the data in order to find trends in how genres, categories,
price, and hardware requirements relate to success (revenue, playtime, and rating).
Let's begin by looking at how the genre of a game relates to its success.
We will look at how the genre relates to the revenue, playtime, and rating.
unique_col_elts('genres')
Unique entries:
['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG', 'Animation & Modeling', 'Video Production', 'Casual', 'Simulation', 'Racing', 'Violent', 'Massively Multiplayer', 'Nudity', 'Sports', 'Early Access', 'Gore', 'Utilities', 'Design & Illustration', 'Web Publishing', 'Education', 'Software Training', 'Sexual Content', 'Audio Production', 'Game Development', 'Photo Editing', 'Accounting', 'Documentary', 'Tutorial']
Length:
29
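As shown above, there are a lot of different genres and many of them are not associated with video games.
From the above list, we will only look at 17: Action, Free to Play, Strategy, Adventure, Indie, RPG, Casual,
Simulation, Racing, Violent, Massively Multiplayer, Nudity, Sports, Early Access, Gore, Education, and Sexual Content.
Let's first graph the revenue of all the games to see if there are any outliers we should drop.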
plt.figure(figsize=(5,5))
plt.xlabel("appid", fontsize=16)
plt.ylabel("Revenue (in GBP)", fontsize=16)
plt.title("Revenue Per Game", fontsize=16)
plt.scatter(x=games_df['appid'], y=games_df['est_revenue'])
plt.show()
genres = ['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG', 'Casual', 'Simulation', 'Racing', 'Violent', \
'Massively Multiplayer', 'Nudity', 'Sports', 'Early Access', 'Gore', 'Education', 'Sexual Content']
plt.figure(figsize=(30,10))
plt.xlabel("Genre", fontsize=16)
plt.ylabel("Average Revenue (in GBP)", fontsize=16)
plt.title("Average Revenue Per Genre", fontsize=16)
genre_to_rev = {}
# For each genre we will plot the average revenue
for genre in genres:
    # Filter the df to just include the rows/games which are the genre 'genre'
    filtered_df = games_df.loc[games_df["genre_" + genre] == 1]
    # There is only one game with a revenue of more than 0.5*10^9 (has value 2*10^9)
    # We can drop this outlier.
    filtered_df = filtered_df.drop(filtered_df[filtered_df["est_revenue"] > 0.5*10**9].index)
    # Round to avoid long chart labels
    genre_to_rev.update({genre : round(mean(filtered_df['est_revenue']), 2)})
barplot = plt.bar(genre_to_rev.keys(), genre_to_rev.values(), color='green')
plt.bar_label(barplot, labels=genre_to_rev.values(), fontsize=14)
plt.show()
plt.figure(figsize=(30,10))
plt.xlabel("Genre", fontsize=16)
plt.ylabel("Median Revenue (in GBP)", fontsize=16)
plt.title("Median Revenue Per Genre", fontsize=16)
genre_to_rev = {}
# For each genre we will plot the median revenue
for genre in genres:
    # Filter the df to just include the rows/games which are the genre 'genre'
    filtered_df = games_df.loc[games_df["genre_" + genre] == 1]
    # There is only one game with a revenue of more than 0.5*10^9 (has value 2*10^9)
    # We can drop this outlier.
    filtered_df = filtered_df.drop(filtered_df[filtered_df["est_revenue"] > 0.5*10**9].index)
    # Round to avoid long chart labels
    genre_to_rev.update({genre : round(median(filtered_df['est_revenue']), 2)})
barplot = plt.bar(genre_to_rev.keys(), genre_to_rev.values(), color='green')
plt.bar_label(barplot, labels=genre_to_rev.values(), fontsize=14)
plt.show()
As we can see, in terms of average revenue over all games, Action, RPG, and Massively Multiplayer games
are the most successful. These three genres have very similar average revenues and are significantly ahead
of the next closest genre (Strategy). All of these genres have an average revenue of over £1,000,000.
The picture is very different when we instead look at the median revenue. For all 17 genres, the median revenue
is significantly less than the average revenue.
This gap indicates that a small percentage of the games in each genre have a very high revenue.
The effect is especially pronounced in the RPG, Massively Multiplayer, Action, and Strategy genres,
since they show the most dramatic drop.
This is not too surprising. Many games go unnoticed or get little traction, while a few
become very popular. The RPG, Massively Multiplayer, Action, and Strategy genres
experience this the most: some of the games in these genres become extremely popular (more so than
in other genres).
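To illustrate why a large gap between the average and the median points to a few very high earners, here is a small sketch with entirely made-up revenue figures (nine modest titles and one hit). The numbers are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical revenues (GBP) for ten games in one genre
revenues = [0, 500, 1_000, 2_000, 3_000, 5_000, 8_000, 12_000, 20_000, 5_000_000]

avg = mean(revenues)
mid = median(revenues)

# The single hit pulls the mean far above what a typical game earns
print("mean:  ", avg)
print("median:", mid)
print("games above the mean:", sum(r > avg for r in revenues))
```

In this toy genre only one of ten games earns more than the mean, which is the same pattern the two bar charts suggest for the real genres.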
print("The number of Education games:")
print(sum(games_df['genre_Education']))
print("The number of Massively Multiplayer games:")
print(sum(games_df["genre_Massively Multiplayer"]))
print("The number of Free Massively Multiplayer games:")
print(len(games_df.loc[(games_df["genre_Massively Multiplayer"] == 1) & (games_df["price"] == 0)]))
The number of Education games:
51
The number of Massively Multiplayer games:
723
The number of Free Massively Multiplayer games:
375
A couple of other interesting results from these graphs are the median revenue of 0 for Massively Multiplayer
games and the relatively high Education median. The median revenue of 0 for the Massively Multiplayer games
can be explained by the fact that over half of them (375 of 723) are free, so their estimated revenue is 0.
The high median revenue for the Education games is likely a consequence of there being so few of them (51):
with such a small sample, a handful of titles can shift the median considerably.
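The reasoning about the Massively Multiplayer median can be checked with a short sketch. The two counts come from the output above; the revenue assigned to the paid games is a placeholder, since any positive value leads to the same median.

```python
from statistics import median

total_mmo = 723  # Massively Multiplayer games (from the output above)
free_mmo = 375   # of which have a price of 0

share_free = free_mmo / total_mmo
print(round(share_free, 3))

# Free games have est_revenue = owners * 0 = 0.
# The revenue of the paid games is a placeholder value (assumption).
revenues = [0.0] * free_mmo + [1_000.0] * (total_mmo - free_mmo)

# More than half of the entries are 0, so the middle entry must be 0
print(median(revenues))
```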
Now let's look at the average playtimes for each of these genres.
genres = ['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG', 'Casual', 'Simulation', 'Racing', 'Violent', \
'Massively Multiplayer', 'Nudity', 'Sports', 'Early Access', 'Gore', 'Education', 'Sexual Content']
plt.figure(figsize=(30,10))
plt.xlabel("Genre", fontsize=16)
plt.ylabel("Average Playtime (in Minutes)", fontsize=16)
plt.title("Average Playtime Per Genre (in Minutes)", fontsize=16)
genre_to_playtime = {}
# For each genre we will plot the average playtime
for genre in genres:
    # Filter the df to just include the rows/games which are the genre 'genre'
    filtered_df = games_df.loc[games_df["genre_" + genre] == 1]
    # Round to avoid long chart labels
    genre_to_playtime.update({genre : round(mean(filtered_df['average_playtime']), 2)})
barplot = plt.bar(genre_to_playtime.keys(), genre_to_playtime.values(), color='teal')
plt.bar_label(barplot, labels=genre_to_playtime.values(), fontsize=14)
plt.show()
We can very clearly see that there are two genres which lead by a substantial margin in terms
of average playtime: Massively Multiplayer games and Free to Play games. Massively Multiplayer
games have an average playtime of 725 minutes (just over 12 hours), Free to Play games have an
average playtime of 554 minutes (9.23 hours), and the next closest genre is RPG with an average
playtime of 277 minutes (4.62 hours) (half of the Free to Play playtime).
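The per-genre figures above all come from the same pattern: filter on a one-hot genre column, then aggregate one metric. A compact sketch of that pattern is shown below on a made-up four-game frame; the helper `stat_per_genre` and the `toy` data are hypothetical and only mirror the notebook's column naming.

```python
import pandas as pd

def stat_per_genre(df, genres, column, func="mean"):
    """Aggregate `column` over the rows flagged for each genre (hypothetical helper)."""
    return pd.Series(
        {g: df.loc[df["genre_" + g] == 1, column].agg(func) for g in genres}
    )

# Hypothetical frame with the same one-hot column naming as games_df
toy = pd.DataFrame({
    "average_playtime": [600, 300, 120, 60],
    "genre_RPG": [1, 1, 0, 0],
    "genre_Casual": [0, 0, 1, 1],
})

playtime = stat_per_genre(toy, ["RPG", "Casual"], "average_playtime")
print(playtime)
# Convert minutes to hours, as done in the text above
print(playtime / 60)
```

As in the notebook, a game tagged with several genres would count toward each of them, so the per-genre groups overlap.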
plt.figure(figsize=(30,10))
plt.xlabel("Free Genres", fontsize=16)
plt.ylabel("Average Playtime (in Minutes)", fontsize=16)
plt.title("Average Playtime Per Free Genre (in Minutes)", fontsize=16)
free_games = games_df.loc[games_df["genre_Free to Play"] == 1]
genre_to_playtime = {}
for genre in genres:
    # Filter the free games to just the rows/games which are the genre 'genre'
    # (the mask is built from free_games itself so that the indexes match)
    free_genre_games = free_games.loc[free_games["genre_" + genre] == 1]
    # Round to avoid long chart labels
    genre_to_playtime.update({genre : round(mean(free_genre_games['average_playtime']), 2)})
barplot = plt.bar(genre_to_playtime.keys(), genre_to_playtime.values(), color='teal')
plt.bar_label(barplot, labels=genre_to_playtime.values(), fontsize=14)
plt.show()
Above is a graph of the average playtimes for each genre restricted to free-to-play games (e.g., free action games).
We can see that free adventure, RPG, and massively multiplayer games are what give the Free to Play genre
a high average playtime of 554 minutes (all other genres lower this average). We have already seen the high
playtime of massively multiplayer games, but what makes free adventure and free RPG games so replayable?
RPG means Role Playing Game. In an RPG, you play as a fictional character in a fictional world. The world is
often large with many quests/tasks to do. They also tend to have a sense of progression (e.g., leveling) in which
you improve your character. Much of what the player does is up to the player: what you want your character to be,
what quests you want to do, where you want to explore, etc. This provides many hours of content. Adventure games
are a broader genre. RPGs can be thought of as a subset of adventure games. Adventure games often feature a large
open world, a story, fantasy elements, action (some combination of these), and many other elements. These elements
give the game a lot of content which the player can enjoy for many hours.
(sources and more info:
https://www.techopedia.com/definition/27052/role-playing-game-rpg
https://store.steampowered.com/category/adventure)
Now let's look at the average ratings of games for each of these genres.
genres = ['Action', 'Free to Play', 'Strategy', 'Adventure', 'Indie', 'RPG', 'Casual', 'Simulation', 'Racing', 'Violent', \
'Massively Multiplayer', 'Nudity', 'Sports', 'Early Access', 'Gore', 'Education', 'Sexual Content']
plt.figure(figsize=(30,10))
plt.xlabel("Genre", fontsize=16)
plt.ylabel("Average Rating", fontsize=16)
plt.title("Average Rating Per Genre", fontsize=16)
genre_to_rating = {}
# For each genre we will plot the average rating
for genre in genres:
    # Filter the df to just include the rows/games which are the genre 'genre'
    filtered_df = games_df.loc[games_df["genre_" + genre] == 1]
    # Round to avoid long chart labels
    genre_to_rating.update({genre : round(mean(filtered_df['rating']), 4)})
# Sort genre_to_rating in descending order
sorted_dict = dict(sorted(genre_to_rating.items(), key=lambda x:x[1], reverse=True))
barplot = plt.bar(sorted_dict.keys(), sorted_dict.values(), color='gold')
plt.bar_label(barplot, labels=sorted_dict.values(), fontsize=14)
plt.ylim([0, 1])
plt.show()
There is not nearly as much variation between genres as there was when we looked at revenue or playtime.
However, there are still some surprising and interesting results. The first is the average rating of
Massively Multiplayer games, which is the lowest by a fairly substantial margin: 2% lower than the
next lowest rated genre, Violent games. This is surprising given how strongly favored Massively Multiplayer games
are in terms of revenue and playtime. What might be the reason for this?
We will now look at how the game category relates to its success. Categories and genres are
not too different from one another, so we should expect similar results to that of genres
(e.g., massively multiplayers are a successful genre, so 'multiplayer' or 'online' categories
will likely be successful as well).
Again, we will look at revenue, playtime, and ratings to determine success.
print(unique_col_elts('categories'))
categories = ['Multi-player', 'Online Multi-Player', 'Local Multi-Player', 'Single-player', 'Partial Controller Support',
'Cross-Platform Multiplayer', 'Includes level editor', 'In-App Purchases', 'Co-op', 'Full controller support',
'Online Co-op', 'Shared/Split Screen', 'Local Co-op', 'MMO', 'VR Support', 'Mods', 'Mods (require HL2)']
for cat in categories:
    print("Num entries of category " + cat + ":", len(games_df.loc[games_df["category_" + cat] == 1]))
Unique entries: ['Multi-player', 'Online Multi-Player', 'Local Multi-Player', 'Valve Anti-Cheat enabled', 'Single-player', 'Steam Cloud', 'Steam Achievements', 'Steam Trading Cards', 'Captions available', 'Partial Controller Support', 'Includes Source SDK', 'Cross-Platform Multiplayer', 'Stats', 'Commentary available', 'Includes level editor', 'Steam Workshop', 'In-App Purchases', 'Co-op', 'Full controller support', 'Steam Leaderboards', 'SteamVR Collectibles', 'Online Co-op', 'Shared/Split Screen', 'Local Co-op', 'MMO', 'VR Support', 'Mods', 'Mods (require HL2)', 'Steam Turn Notifications']
Length: 29
None
Num entries of category Multi-player: 3974
Num entries of category Online Multi-Player: 2487
Num entries of category Local Multi-Player: 1615
Num entries of category Single-player: 25678
Num entries of category Partial Controller Support: 4234
Num entries of category Cross-Platform Multiplayer: 1081
Num entries of category Includes level editor: 1036
Num entries of category In-App Purchases: 690
Num entries of category Co-op: 1721
Num entries of category Full controller support: 5695
Num entries of category Online Co-op: 1071
Num entries of category Shared/Split Screen: 2152
Num entries of category Local Co-op: 1059
Num entries of category MMO: 421
Num entries of category VR Support: 231
Num entries of category Mods: 2
Num entries of category Mods (require HL2): 1
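Many of these categories describe Steam platform features rather than the game itself (e.g., Steam Cloud,
Steam Trading Cards, Steam Achievements), so we will ignore them. We also drop the two Mods categories
from here on, since they have only 2 and 1 entries.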
Let's begin by looking at how much a game in each of the different categories makes on average.
categories = ['Multi-player', 'Online Multi-Player', 'Local Multi-Player', 'Single-player', 'Partial Controller Support',
'Cross-Platform Multiplayer', 'Includes level editor', 'In-App Purchases', 'Co-op', 'Full controller support',
'Online Co-op', 'Shared/Split Screen', 'Local Co-op', 'MMO', 'VR Support']
plt.figure(figsize=(38,10))
plt.xlabel("Category", fontsize=16)
plt.ylabel("Average Revenue", fontsize=16)
plt.title("Average Revenue Per Category", fontsize=16)
cat_to_rev = {}
# For each category we will plot the average revenue
for cat in categories:
    # Filter the df to just include the rows/games which are the category 'cat'
    filtered_df = games_df.loc[games_df["category_" + cat] == 1]
    # There is only one game with a revenue of more than 0.5*10^9 (has value 2*10^9)
    # We can drop this outlier.
    filtered_df = filtered_df.drop(filtered_df[filtered_df["est_revenue"] > 0.5*10**9].index)
    # Round to avoid long chart labels
    cat_to_rev.update({cat : round(mean(filtered_df['est_revenue']), 2)})
barplot = plt.bar(cat_to_rev.keys(), cat_to_rev.values(), color='green')
plt.bar_label(barplot, labels=cat_to_rev.values(), fontsize=14)
plt.show()
plt.figure(figsize=(38,10))
plt.xlabel("Category", fontsize=16)
plt.ylabel("Median Revenue", fontsize=16)
plt.title("Median Revenue Per Category", fontsize=16)
cat_to_rev = {}
# For each category we will plot the median revenue
for cat in categories:
    # Filter the df to just include the rows/games which are the category 'cat'
    filtered_df = games_df.loc[games_df["category_" + cat] == 1]
    # There is only one game with a revenue of more than 0.5*10^9 (has value 2*10^9)
    # We can drop this outlier.
    filtered_df = filtered_df.drop(filtered_df[filtered_df["est_revenue"] > 0.5*10**9].index)
    # Round to avoid long chart labels
    cat_to_rev.update({cat : round(median(filtered_df['est_revenue']), 2)})
barplot = plt.bar(cat_to_rev.keys(), cat_to_rev.values(), color='green')
plt.bar_label(barplot, labels=cat_to_rev.values(), fontsize=14)
plt.show()
print("Number of 'Includes level editor' games: ", len(games_df.loc[games_df["category_Includes level editor"] == 1]))
Number of 'Includes level editor' games: 1036
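The mean and median cells above repeat the same filter-then-aggregate loop and differ only in the aggregation function. Below is a minimal sketch of how that loop could be factored into one helper. It uses plain Python on a handful of made-up records rather than games_df, and the helper name aggregate_by_category and the toy values are assumptions for illustration, not part of the original notebook.

```python
from statistics import mean, median

# Toy records standing in for rows of games_df; the values are made up for illustration.
games = [
    {"name": "A", "categories": "Single-player;Co-op", "est_revenue": 120.0},
    {"name": "B", "categories": "Single-player", "est_revenue": 10.0},
    {"name": "C", "categories": "Co-op;Includes level editor", "est_revenue": 300.0},
    {"name": "D", "categories": "Single-player;Includes level editor", "est_revenue": 50.0},
]


def aggregate_by_category(rows, categories, value_key, agg):
    """Apply `agg` to `value_key` over the rows tagged with each category (hypothetical helper)."""
    result = {}
    for cat in categories:
        # Same idea as filtering games_df on "category_" + cat == 1
        values = [row[value_key] for row in rows if cat in row["categories"].split(";")]
        # Round to avoid long chart labels
        result[cat] = round(agg(values), 2)
    return result


cats = ["Single-player", "Co-op", "Includes level editor"]
mean_rev = aggregate_by_category(games, cats, "est_revenue", mean)
median_rev = aggregate_by_category(games, cats, "est_revenue", median)
print(mean_rev)
print(median_rev)
```

Passing the aggregation function as an argument means the mean and median charts come from the same code path, so a fix to the filtering only has to be made once. Note that a category with no matching rows would make mean or median raise an error, so this sketch assumes every category has at least one game.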
For both the mean and median revenue per category, the top 3 categories are Co-op, Multi-player, and Includes level editor.
Two of these three directly involve playing with other players. This suggests that players highly value being able to play
with and against other players. The other category, Includes level editor, suggests that players value being able to customize
the game to suit their desires. An example of such a game is Trackmania Turbo, a racing game where players can also
create their own tracks, share them, and play them with others (examples of games with level editors:
https://www.slant.co/topics/6445/~games-on-steam-with-a-level-editor).
Now let's look at the average playtime for each category.
plt.figure(figsize=(38,10))
plt.xlabel("Category", fontsize=16)
plt.ylabel("Average Playtime (in Minutes)", fontsize=16)
plt.title("Average Playtime Per Category", fontsize=16)
cat_to_playtime = {}
# For each category we will plot the average playtime
for cat in categories:
    # Filter the df to just include the rows/games which are the category 'cat'
    filtered_df = games_df.loc[games_df["category_" + cat] == 1]
    # Round to avoid long chart labels
    cat_to_playtime.update({cat : round(mean(filtered_df['average_playtime']), 2)})
barplot = plt.bar(cat_to_playtime.keys(), cat_to_playtime.values(), color='teal')
plt.bar_label(barplot, labels=cat_to_playtime.values(), fontsize=14)
plt.show()
There is a clear outlier here: MMO (Massively Multiplayer Online) games have double the average playtime of any other
category. This is perhaps not too surprising given the success of the Massively Multiplayer genre we discussed in 3.1.
These results are another indication that multiplayer games are played a lot more compared to other categories.
As we can see, 2 of the top 3 categories in terms of playtime are multiplayer (MMO and Cross-Platform Multiplayer).
The other category in the top 3 is in-app purchases. This is a mostly free-to-play category (as we can see from the
median revenue of this category being 0), so this is another indicator that free-to-play games are played a lot
compared to other categories.
plt.figure(figsize=(38,10))
plt.xlabel("Category", fontsize=16)
plt.ylabel("Average Rating", fontsize=16)
plt.title("Average Rating Per Category", fontsize=16)
cat_to_rating = {}
# For each category we will plot the average rating
for cat in categories:
    # Filter the df to just include the rows/games which are the category 'cat'
    filtered_df = games_df.loc[games_df["category_" + cat] == 1]
    # Round to avoid long chart labels
    cat_to_rating.update({cat : round(mean(filtered_df['rating']), 4)})
# Sort cat_to_rating in descending order
sorted_dict = dict(sorted(cat_to_rating.items(), key=lambda x:x[1], reverse=True))
barplot = plt.bar(sorted_dict.keys(), sorted_dict.values(), color='gold')
plt.bar_label(barplot, labels=sorted_dict.values(), fontsize=14)
plt.ylim([0, 1])
plt.show()
Most of the categories have an average rating in the 70% - 80% range. However, the two lowest are well below the rest
with average ratings of 64% and 62%. These two categories are in-app purchases and MMO. The low rating of the in-app
purchase category suggests that players do not like being urged to pay for in-game currency or items.
In-app purchases are used to improve your character or overall experience, and many games push users toward these
purchases because the games are free-to-play and in-app purchases are how they make money. These low ratings suggest that this
may be a hindrance to the user experience (source and more info:
https://www.androidauthority.com/in-app-purchases-good-bad-ugly-truth-324604/).
Let's now see how the price of a game relates to its success.
Are more accessible (cheaper) games typically more successful? Or are expensive games more successful due to
(what should be) a higher quality product?
The data may or may not be linearly related. We will fit a Linear Regression line for each set of variables we compare
(e.g., Revenue of game vs Price), and we will measure how well this line fits by calculating the R-Squared value
of the linear model. To account for the fact that the data may not be linearly related, we will also look at Spearman Rank Correlation.
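To see why both measures are worth reporting, here is a minimal sketch on made-up data in which y always increases with x, but exponentially rather than linearly. It relies on two standard facts: for a linear fit with a single predictor, R-Squared equals the squared Pearson correlation, and the Spearman Rank Correlation is the Pearson correlation computed on the ranks. The helpers pearson and ranks are written here for illustration only; the notebook itself uses LinearRegression and spearmanr.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


# Made-up data: monotonic but strongly non-linear.
x = list(range(1, 21))
y = [math.exp(v) for v in x]

r_squared = pearson(x, y) ** 2          # equals R-Squared of a one-variable linear fit
spearman = pearson(ranks(x), ranks(y))  # Spearman = Pearson on the ranks

print("R Squared:", round(r_squared, 3))
print("Spearman Rank Correlation:", round(spearman, 3))
```

In this toy case the straight line explains well under half of the variance even though the ordering of y follows the ordering of x perfectly, which is the kind of situation where R-Squared alone would understate the relationship.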
We will graph the revenue of each game vs the price of the game and also fit a linear regression line to the scatter plot
to see how price relates to revenue in general.
# Drop the major outlier
filtered_df = games_df.drop(games_df[games_df["est_revenue"] > 0.5*10**9].index)
# Drop those with a price > 100 as there are only a few games with that price and it makes our graph unreadable
filtered_df = filtered_df.drop(filtered_df[filtered_df["price"] > 100].index)
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("Price (in GBP)", fontsize=12)
plt.ylabel("Revenue (in GBP)", fontsize=12)
plt.title("Revenue of Game vs Price", fontsize=12)
# Scatter plot of est_revenue vs price
plt.scatter(filtered_df['price'], filtered_df['est_revenue'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of prices and revenues
x_train = np.array(filtered_df["price"]).reshape(-1, 1)
y_train = np.array(filtered_df["est_revenue"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (revenue) from x (price)
plt.plot(filtered_df["price"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[335584.10169458]
R Squared: 0.07756478770143238
Spearman Rank Correlation: 0.8367535231200439
p-value: 0.0
By looking at the scatter plot, we can see that revenue appears to increase with price. This can be seen in
the several increasing spokes of data points. Our linear regression line is not a good fit for the data, however (R-Squared
of about 0.08), so the relationship is not strongly linear. The Spearman Rank Correlation of about 0.84 and its p-value,
on the other hand, show a strong positive monotonic relationship: as price increases, revenue tends to increase as well,
confirming our initial thoughts. This is not too surprising given that, for example, a game that is £20 will have to sell
3x as many copies as a £60 game to make the same revenue. Also, in general, more work goes into higher priced games,
which may lead to a higher quality or more desirable game on average. A lower price, and the accessibility that comes with it,
does not seem to be enough for a game to generate more revenue than a higher priced game on average.
Let's see how playtime relates to the price of the game.
# There are a few games with an average playtime of over 50,000 minutes. These are extreme outliers
# and may even be errors.
filtered_df = games_df.drop(games_df[games_df["average_playtime"] > 50_000].index)
# Drop those with a price > 100 as there are only a few games with that price and it makes our graph unreadable
filtered_df = filtered_df.drop(filtered_df[filtered_df["price"] > 100].index)
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("Price (in GBP)", fontsize=12)
plt.ylabel("Playtime (in minutes)", fontsize=12)
plt.title("Playtime of Game vs Price", fontsize=12)
# Scatter plot of playtime vs price
plt.scatter(filtered_df['price'], filtered_df['average_playtime'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of prices and playtimes
x_train = np.array(filtered_df["price"]).reshape(-1, 1)
y_train = np.array(filtered_df["average_playtime"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (average_playtime) from x (price)
plt.plot(filtered_df["price"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[17.21861957]
R Squared: 0.013777200440742932
Spearman Rank Correlation: 0.08010711260585777
p-value: 9.048357111488325e-40
There do not appear to be any meaningful results from this graph. There are no obvious trends from just
observing the graph, and neither the R-Squared value nor the Spearman Rank Correlation indicates one.
Both values are close to 0, meaning there is little to no relationship between price and playtime. The p-value
is tiny, but with this many games even a very weak correlation is statistically significant.
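The Spearman value above is close to 0 while its p-value is extremely small, which can look contradictory. The sketch below shows how a weak correlation still produces a tiny p-value when the sample is large. It uses the common large-sample t-statistic for a correlation coefficient together with a normal approximation; this is an illustration, not necessarily the exact computation spearmanr performs. The sample size n is an assumption, since the number of games left after filtering is not stated in this section.

```python
import math

# Spearman value reported for playtime vs price above.
rho = 0.0801
# ASSUMPTION: the number of games after filtering is not stated here; 27,000 is a placeholder.
n = 27_000

# Large-sample t-statistic for testing whether a correlation differs from zero.
t_stat = rho * math.sqrt((n - 2) / (1 - rho ** 2))
# Two-sided p-value using the normal approximation, which is reasonable for large n.
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print("t statistic:", round(t_stat, 2))
print("approximate p-value:", p_value)
print("rho squared:", round(rho ** 2, 4))
```

The p-value only says that the correlation is unlikely to be exactly zero. The squared correlation is the better guide to how much the ranking of playtime actually follows the ranking of price, and here it is well under 1%.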
Now let's see how price relates to the ratings of the games.
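# Start from games_df and apply the same filters as the previous cell. (The original line reused
# that cell's filtered_df, so the playtime outliers were already excluded here.)
filtered_df = games_df.drop(games_df[games_df["average_playtime"] > 50_000].index)
# Drop those with a price > 100 as there are only a few games with that price and it makes our graph unreadable
filtered_df = filtered_df.drop(filtered_df[filtered_df["price"] > 100].index)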
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("Price (in GBP)", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.title("Rating of Game vs Price", fontsize=12)
# Scatter plot of rating vs price
plt.scatter(filtered_df['price'], filtered_df['rating'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of prices and ratings
x_train = np.array(filtered_df["price"]).reshape(-1, 1)
y_train = np.array(filtered_df["rating"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (rating) from x (price)
plt.plot(filtered_df["price"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[0.00306605]
R Squared: 0.007689622315858458
Spearman Rank Correlation: 0.12216057444578153
p-value: 1.8246289754049238e-90
Again, there do not appear to be any meaningful results from this graph. There are no obvious trends from observation,
and this is confirmed by the close-to-zero values for R-Squared and Spearman Rank. However, there may be a slight
positive correlation between price and rating since the Spearman Rank Correlation is 0.12.
For the last metric, we will explore how the hardware requirements of a game relate to its
success. Do more hardware-intensive games have more success? This would suggest that better graphics
of a game increase the chance of success. Or instead do less hardware-intensive games have more success
since more people can play them?
We will evaluate our graphs in the same way we did in Section 3.3.
# Drop the major outlier
filtered_df = games_df.drop(games_df[games_df["est_revenue"] > 0.5*10**9].index)
# Drop those with > 16 GB ram required (only a few and some of the extreme outliers make the graph unreadable)
filtered_df = filtered_df.drop(filtered_df[filtered_df["min_req_ram"] > 16].index)
# Drop those with NaN in min_req_ram (about 2.5k entries)
filtered_df = filtered_df.dropna(subset = ['min_req_ram'])
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("RAM Requirement (in GB)", fontsize=12)
plt.ylabel("Revenue (in GBP)", fontsize=12)
plt.title("Revenue of Game vs RAM Requirement", fontsize=12)
# Scatter plot of revenue vs RAM
plt.scatter(filtered_df['min_req_ram'], filtered_df['est_revenue'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of RAM and revenue
x_train = np.array(filtered_df["min_req_ram"]).reshape(-1, 1)
y_train = np.array(filtered_df["est_revenue"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (est_revenue) from x (min_req_ram)
plt.plot(filtered_df["min_req_ram"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[190191.19552378]
R Squared: 0.00289594821321415
Spearman Rank Correlation: 0.06251399865790182
p-value: 9.087453487640155e-23
There are no obvious trends in this graph, and this is confirmed by the very-close-to-zero values
of R-Squared and Spearman Rank. This suggests that the minimum RAM requirement has little to no impact on the
revenue of the game. If we take RAM as a proxy for hardware requirements in general, higher hardware requirements
do not appear to have much impact on revenue, which is a bit surprising.
# Drop those with > 16 GB ram required (only a few and some of the extreme outliers make the graph unreadable)
filtered_df = games_df.drop(games_df[games_df["min_req_ram"] > 16].index)
# There are a few games with an average playtime of over 50,000 minutes. These are extreme outliers
# and may even be errors.
filtered_df = filtered_df.drop(filtered_df[filtered_df["average_playtime"] > 50_000].index)
# Drop those with NaN in min_req_ram (about 2.5k entries)
filtered_df = filtered_df.dropna(subset = ['min_req_ram'])
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("RAM Requirement (in GB)", fontsize=12)
plt.ylabel("Playtime (in minutes)", fontsize=12)
plt.title("Playtime of Game vs RAM Requirement", fontsize=12)
# Scatter plot of playtime vs RAM
plt.scatter(filtered_df['min_req_ram'], filtered_df['average_playtime'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of RAM and playtimes
x_train = np.array(filtered_df["min_req_ram"]).reshape(-1, 1)
y_train = np.array(filtered_df["average_playtime"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (average_playtime) from x (min_req_ram)
plt.plot(filtered_df["min_req_ram"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[5.64626105]
R Squared: 0.00017148913340880867
Spearman Rank Correlation: -0.04926200841772169
p-value: 1.0265072179264815e-14
Again, there are no obvious trends and the R-Squared value and the Spearman Rank Correlation are very close to 0
suggesting that there is no relationship between RAM requirement and playtime.
# Drop those with > 16 GB ram required (only a few and some of the extreme outliers make the graph unreadable)
filtered_df = games_df.drop(games_df[games_df["min_req_ram"] > 16].index)
# Drop those with NaN in min_req_ram (about 2.5k entries)
filtered_df = filtered_df.dropna(subset = ['min_req_ram'])
# Set up graph
plt.figure(figsize=(10,10))
plt.xlabel("RAM Requirement (in GB)", fontsize=12)
plt.ylabel("Rating", fontsize=12)
plt.title("Rating of Game vs RAM Requirement", fontsize=12)
# Scatter plot of rating vs RAM
plt.scatter(filtered_df['min_req_ram'], filtered_df['rating'])
# Now we fit a linear regression line to the graph
linreg = LinearRegression()
# The training values we use are the entire set of RAM and ratings
x_train = np.array(filtered_df["min_req_ram"]).reshape(-1, 1)
y_train = np.array(filtered_df["rating"])
lin_model = linreg.fit(x_train, y_train)
# Predict y (rating) from x (min_req_ram)
plt.plot(filtered_df["min_req_ram"], lin_model.predict(x_train), color='orange')
plt.show()
print("Slope of line:")
print(linreg.coef_)
# Evaluate how accurate our linear model is
print("R Squared: ", lin_model.score(x_train, y_train))
# Evaluate a relationship between x and y
corr, p_value = spearmanr(x_train, y_train)
print("Spearman Rank Correlation:", corr)
print("p-value:", p_value)
Slope of line:
[-0.00606724]
R Squared: 0.0033711174148746137
Spearman Rank Correlation: -0.05310791389951006
p-value: 7.322090104562239e-17
Again, there are no obvious trends and the R-Squared value and the Spearman Rank Correlation are very close to 0
suggesting that there is no relationship between RAM requirement and rating.
From our analysis of game genres and success, we found that, in terms of average revenue, RPG, Massively Multiplayer,
Action, and Strategy games make the most. If we instead look at median revenue, we find different results. The median
revenue is significantly lower for all of these genres, showing that a small percentage of games in these
genres make a lot of money. RPGs and Strategy games are still highly ranked in terms of median revenue, however.
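The gap between mean and median revenue described above is what a few very large earners produce. A minimal sketch with made-up revenues for ten games (nine modest sellers and one hit; the figures are invented for illustration and are not taken from the dataset):

```python
from statistics import mean, median

# Made-up revenues for ten games in one genre: nine modest sellers and one hit.
revenues = [1_000, 2_000, 2_500, 3_000, 4_000, 5_000, 6_000, 8_000, 10_000, 5_000_000]

print("mean:", mean(revenues))
print("median:", median(revenues))
```

The single hit pulls the mean far above what a typical game in the list earns, while the median stays near the middle of the nine modest sellers. This is why the genre rankings can change when we switch from mean to median revenue.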
In terms of the playtimes for the different game genres, we found that Massively Multiplayer, Free to Play, and RPG
games are significantly ahead of the other genres. On the other hand, Massively Multiplayer games were rated the lowest
by a significant margin.