Steam Market Gap Analysis

The Idea

Say you wanted to build a game for Steam. You could look at what's popular and try to compete, but the more interesting question is: where is there demand that nobody is serving well? I wanted to see if I could answer that with data, so I built a pipeline that pulls from three different APIs, trains a recommendation engine, and scores over 140,000 market niches by how much opportunity they represent.

It's part recommendation system, part market research tool. The recommender tells you what games a given user would like; the market gap analysis tells you where a new game could actually make money.

Data Collection

No single API gives you everything you need here. The Steam Web API has user behaviour (who plays what, and for how long) but not much about the games themselves. SteamSpy has ownership estimates, pricing, and review scores. RAWG fills in genres, platforms, Metacritic scores, and release dates. So the pipeline collects from all three:

Source          What it provides                                      Scale
Steam Web API   User game libraries and playtime                      10,000 users via BFS friend-graph crawl
SteamSpy        Ownership estimates, price, review scores, tags       50,005 games
RAWG            Genres, platforms, Metacritic scores, release dates   9,951 matched titles

The Steam collection stage is probably the most interesting part. It starts from a few seed user IDs and does a breadth-first crawl of the Steam friend graph, picking up each user's game library along the way. Playtime is the key signal here: someone who's put 200 hours into a game is telling you something very different from someone who bought it and never opened it. About 54% of the user-game pairs in the dataset have zero playtime; those carry no preference signal, so they get filtered out.
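As a rough illustration of the crawl, here is a minimal breadth-first sketch. It is simplified: a toy in-memory graph stands in for the real friend-list request, and the names `crawl_friend_graph`, `get_friends` and `max_users` are made up for this example rather than taken from the pipeline.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_friend_graph(
    seeds: Iterable[str],
    get_friends: Callable[[str], List[str]],
    max_users: int,
) -> List[str]:
    """Breadth-first crawl returning up to max_users IDs in visit order."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    visited = []
    while queue and len(visited) < max_users:
        user = queue.popleft()
        visited.append(user)
        # In a real crawl this is where the user's library would be fetched.
        for friend in get_friends(user):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return visited


# Toy graph standing in for the friend-list API (hypothetical data).
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d", "e"],
    "d": [],
    "e": ["a"],
}
order = crawl_friend_graph(["a"], lambda u: toy_graph.get(u, []), max_users=4)
print(order)
```

Marking users as seen when they are queued, rather than when they are visited, keeps the same ID from being queued twice when two friends share it.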

Since Steam and RAWG don't share game IDs, linking RAWG records to Steam titles required fuzzy string matching on game names (via the thefuzz library). The whole pipeline also supports checkpointing, which turned out to be essential given RAWG's rate limits.
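To give a feel for the title matching, here is a small sketch. To keep it dependency-free it uses `difflib.SequenceMatcher` from the standard library as a stand-in for thefuzz, so the scores will not be identical to what thefuzz produces. The function names, the 0.9 cut-off and the IDs are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def normalise_title(title: str) -> str:
    """Lowercase, replace punctuation/symbols with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def best_match(
    title: str, candidates: Dict[int, str], min_ratio: float = 0.9
) -> Optional[Tuple[int, float]]:
    """Return (candidate_id, similarity) for the closest title, or None."""
    target = normalise_title(title)
    best_id, best_ratio = None, 0.0
    for candidate_id, name in candidates.items():
        ratio = SequenceMatcher(None, target, normalise_title(name)).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = candidate_id, ratio
    if best_id is None or best_ratio < min_ratio:
        return None  # safer to leave a title unmatched than to mismatch it
    return best_id, best_ratio


# Hypothetical IDs; real Steam app IDs would go here.
steam_titles = {1: "Portal", 2: "Portal 2", 3: "Terraria"}
print(best_match("Portal 2™", steam_titles))
print(best_match("Stardew Valley", steam_titles))
```

A fairly strict threshold matters here because sequels ("Portal" vs "Portal 2") differ by only a character or two.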

Once the data is merged, you can start looking at how genres relate to each other. The co-occurrence matrix is a good place to start. Action + Indie is by far the most common pairing, which won't surprise anyone, but the clusters around Simulation, Strategy, and RPG are more interesting:

Genre co-occurrence matrix showing which genre pairs appear together most frequently
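Counting co-occurrences is straightforward once each game has a genre list. A minimal sketch with toy data follows; the function name and the sample games are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def genre_cooccurrence(games: Iterable[List[str]]) -> Counter:
    """Count how many games carry each unordered pair of genres."""
    counts = Counter()
    for genres in games:
        # sorted(set(...)) makes ("Action", "Indie") and ("Indie", "Action")
        # land on the same key and ignores duplicate genre labels.
        for pair in combinations(sorted(set(genres)), 2):
            counts[pair] += 1
    return counts


toy_games = [
    ["Action", "Indie"],
    ["Action", "Indie", "RPG"],
    ["Simulation", "Strategy"],
    ["Indie", "Action"],
]
pair_counts = genre_cooccurrence(toy_games)
print(pair_counts.most_common(1))
```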

Hybrid Recommendation Engine

The recommender has two halves. The collaborative filtering side uses ALS (Alternating Least Squares, via the implicit library) to learn 64-dimensional embeddings for users and games from the playtime matrix. I log-transform the playtime values with log1p so that someone with 5,000 hours in a game doesn't have 50x the weight of someone with 100 hours.
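The effect of the log transform is easy to check by hand. This snippet only works through the arithmetic for the two playtimes mentioned above; it does not touch the ALS model itself.

```python
import math

heavy_hours = 5000
casual_hours = 100

# Raw playtime: the heavy player counts 50 times as much.
raw_ratio = heavy_hours / casual_hours

# After log1p the gap shrinks to a bit under 2x.
damped_ratio = math.log1p(heavy_hours) / math.log1p(casual_hours)

print(raw_ratio, round(damped_ratio, 2))
```

log1p is used rather than log so that a playtime of zero maps to zero instead of negative infinity.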

The content-based side computes cosine similarity over feature vectors built from genre one-hot encoding, tag TF-IDF (up to 200 terms), price buckets, platform flags, and normalised Metacritic scores. This is mainly useful for the cold-start problem: new games with few interactions can still get recommended based on what they look like.
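Here is a stripped-down sketch of the content-based side in plain Python. It leaves out the tag TF-IDF part for brevity, and the vocabularies, price bucket edges and sample games are assumptions made up for the example.

```python
import math
from typing import Dict, List

# Hypothetical vocabularies and bucket edges, for illustration only.
GENRES = ["Action", "Indie", "RPG", "Strategy"]
PLATFORMS = ["windows", "mac", "linux"]
PRICE_EDGES = [10.0, 30.0]  # buckets: under $10, $10 to under $30, $30 and up


def feature_vector(game: Dict) -> List[float]:
    """Concatenate genre one-hot, price bucket, platform flags, Metacritic."""
    genre_part = [1.0 if g in game["genres"] else 0.0 for g in GENRES]
    bucket = sum(1 for edge in PRICE_EDGES if game["price"] >= edge)
    price_part = [1.0 if i == bucket else 0.0 for i in range(len(PRICE_EDGES) + 1)]
    platform_part = [1.0 if p in game["platforms"] else 0.0 for p in PLATFORMS]
    metacritic_part = [game["metacritic"] / 100.0]  # scale 0-100 to 0-1
    return genre_part + price_part + platform_part + metacritic_part


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


game_a = {"genres": {"Action", "Indie"}, "price": 9.99,
          "platforms": {"windows"}, "metacritic": 80}
game_b = {"genres": {"Action", "Indie"}, "price": 14.99,
          "platforms": {"windows", "linux"}, "metacritic": 75}
game_c = {"genres": {"Strategy"}, "price": 39.99,
          "platforms": {"mac"}, "metacritic": 60}

sim_ab = cosine_similarity(feature_vector(game_a), feature_vector(game_b))
sim_ac = cosine_similarity(feature_vector(game_a), feature_vector(game_c))
print(round(sim_ab, 2), round(sim_ac, 2))
```

Because none of this depends on interaction counts, a game released yesterday gets a usable vector as long as its metadata is filled in.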

The two scores are blended per-item, with a confidence weight based on how many interactions that game has. Popular games lean on collaborative filtering; obscure ones lean on content:

Hybrid blending
# Per-item confidence weight
alpha = min(interaction_count / threshold, 1.0)  # threshold = 100

# Popular games trust CF; cold-start games trust content similarity
hybrid_score = alpha * cf_score + (1 - alpha) * cb_score
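Wrapped in a function, the blend can be checked at both extremes and in between. This sketch assumes the two scores have already been put on a comparable scale (raw ALS scores and cosine similarities generally are not); the function name and sample scores are invented.

```python
def hybrid_score(cf_score: float, cb_score: float,
                 interaction_count: int, threshold: int = 100) -> float:
    """Blend collaborative and content scores by per-item confidence."""
    alpha = min(interaction_count / threshold, 1.0)
    return alpha * cf_score + (1.0 - alpha) * cb_score


cold = hybrid_score(0.9, 0.4, interaction_count=0)       # content only
warm = hybrid_score(0.9, 0.4, interaction_count=25)      # 25% CF, 75% content
popular = hybrid_score(0.9, 0.4, interaction_count=500)  # CF only
print(cold, warm, popular)
```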

The scatter below gives a sense of the data the model is working with. Each point is a game, plotted by median playtime against estimated owners, coloured by genre. The range is enormous: four orders of magnitude on both axes.

Playtime vs ownership scatter plot coloured by genre

Revenue-Weighted Evaluation

Standard recommendation metrics like Precision@K treat all correct recommendations equally. But if the goal is market intelligence, recommending a $30 game matters more than recommending a free-to-play one. So I added a revenue-weighted hit rate that weights each hit by price times estimated ownership. The hybrid model scored 14.8% at K=20 on this metric, versus 4.8% for a popularity baseline (about 3x better).
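One plausible way to write this metric for a single user is sketched below. The exact definition is an assumption on my part here: the share of held-out revenue weight (price times estimated owners) that the top-K list recovers. The function name and toy numbers are illustrative.

```python
from typing import Dict, List, Set


def revenue_weighted_hit_rate(recommended: List[str], held_out: Set[str],
                              revenue: Dict[str, float], k: int = 20) -> float:
    """Fraction of held-out revenue weight captured by the top-k list."""
    top_k = set(recommended[:k])
    total = sum(revenue.get(game, 0.0) for game in held_out)
    if total == 0.0:
        return 0.0  # nothing but free or unknown games held out
    hit = sum(revenue.get(game, 0.0) for game in held_out if game in top_k)
    return hit / total


# Revenue weight = price * estimated owners (toy numbers).
revenue = {"a": 30.0 * 1000, "b": 0.0 * 50000, "c": 10.0 * 500}
rate = revenue_weighted_hit_rate(
    recommended=["a", "x", "b"], held_out={"a", "b", "c"}, revenue=revenue, k=2
)
print(round(rate, 3))
```

Under this definition a free-to-play hit contributes nothing, which is exactly the asymmetry the metric is meant to capture. Averaging the per-user values would give the overall figure.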

Market Gap Scoring

This is really the point of the whole project. I define a "niche" as a combination of 2 or 3 Steam tags (e.g. "Multiplayer + Open World + RPG"). Since every game has multiple tags, the combinatorial explosion gives you over 140,000 distinct niches. For each one I compute four things:

  • Supply: how many games exist in this niche
  • Demand: total estimated ownership across all games in the niche
  • Engagement: median playtime
  • Satisfaction: median review score
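The enumeration and the four aggregates can be sketched in a few lines. The field names (`tags`, `owners`, `playtime`, `review`) and the toy games are assumptions for the example. A game with n tags contributes C(n,2) + C(n,3) niches, which is where the large total comes from: 20 tags alone give 190 + 1,140 = 1,330 combinations.

```python
from collections import defaultdict
from itertools import combinations
from statistics import median
from typing import Dict, Iterable, Iterator, List, Tuple


def enumerate_niches(tags: Iterable[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every 2-tag and 3-tag combination, in a canonical order."""
    unique = sorted(set(tags))
    for size in (2, 3):
        yield from combinations(unique, size)


def aggregate_niches(games: List[Dict]) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """Compute supply, demand, engagement and satisfaction per niche."""
    members = defaultdict(list)
    for game in games:
        for niche in enumerate_niches(game["tags"]):
            members[niche].append(game)

    stats = {}
    for niche, group in members.items():
        stats[niche] = {
            "supply": len(group),
            "demand": sum(g["owners"] for g in group),
            "engagement": median([g["playtime"] for g in group]),
            "satisfaction": median([g["review"] for g in group]),
        }
    return stats


toy_games = [
    {"tags": ["Multiplayer", "Open World", "RPG"],
     "owners": 1_000_000, "playtime": 40.0, "review": 85},
    {"tags": ["Multiplayer", "Open World"],
     "owners": 500_000, "playtime": 20.0, "review": 75},
]
niche_stats = aggregate_niches(toy_games)
print(niche_stats[("Multiplayer", "Open World")])
```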

These get normalised to [0, 1] and combined into an opportunity score. The idea is simple: you want niches where lots of people play, they play for a long time, they're happy with what they find, but there aren't that many games competing:

Opportunity scoring
# Normalise each component to [0, 1]
# Inverse supply: fewer competitors = higher opportunity
opportunity_score = demand_norm * engagement_norm * satisfaction_norm * supply_inv_norm
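A runnable sketch of the scoring step follows, under two assumptions: min-max normalisation for every component, and inverse supply taken as one minus normalised supply and multiplied in, so that fewer competitors raises the score as described above. Names and toy numbers are illustrative.

```python
from typing import Dict, List


def min_max(values: List[float]) -> List[float]:
    """Scale values linearly to [0, 1]; all zeros if they are all equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def opportunity_scores(niches: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine normalised demand, engagement, satisfaction, inverse supply."""
    names = list(niches)
    demand = min_max([niches[n]["demand"] for n in names])
    engagement = min_max([niches[n]["engagement"] for n in names])
    satisfaction = min_max([niches[n]["satisfaction"] for n in names])
    supply = min_max([niches[n]["supply"] for n in names])

    scores = {}
    for i, name in enumerate(names):
        supply_inv = 1.0 - supply[i]  # fewer competitors -> closer to 1
        scores[name] = demand[i] * engagement[i] * satisfaction[i] * supply_inv
    return scores


toy_niches = {
    "crowded": {"supply": 900, "demand": 5_000_000,
                "engagement": 10.0, "satisfaction": 70.0},
    "underserved": {"supply": 50, "demand": 4_000_000,
                    "engagement": 30.0, "satisfaction": 85.0},
    "tiny": {"supply": 10, "demand": 100_000,
             "engagement": 5.0, "satisfaction": 60.0},
}
scores = opportunity_scores(toy_niches)
print(max(scores, key=scores.get))
```

One consequence of a pure product is that a niche sitting at the minimum of any single component scores zero regardless of the others, so "crowded" and "tiny" both drop out here.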

The heatmap shows the top-scoring niches with their individual component scores. "Multiplayer + Open World" scores well across the board. Some of the e-sports niches have very high revenue but low competition, though that's partly because a few massive titles (such as CS2 and Dota 2) dominate those categories:

Niche quality scorecard heatmap showing normalised scores for top market niches

I also added a recency trend: comparing revenue from games released in the last 3 years against older titles in the same niche. A ratio above 1.0 means newer games are outperforming the older ones, which is a decent signal that the niche is growing rather than stagnating.
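The ratio can be sketched as follows. I am assuming here that "revenue" means the median per-game revenue in each group and that the cut-off is by release year; both are illustrative choices, as are the names and numbers.

```python
from statistics import median
from typing import Dict, List, Optional


def recency_trend(games: List[Dict], current_year: int,
                  window_years: int = 3) -> Optional[float]:
    """Median revenue of recent releases divided by that of older ones."""
    cutoff = current_year - window_years
    recent = [g["revenue"] for g in games if g["year"] > cutoff]
    older = [g["revenue"] for g in games if g["year"] <= cutoff]
    if not recent or not older:
        return None  # no basis for comparison
    older_median = median(older)
    if older_median == 0:
        return None  # avoid dividing by zero
    return median(recent) / older_median


toy_niche = [
    {"year": 2023, "revenue": 400_000.0},
    {"year": 2022, "revenue": 200_000.0},
    {"year": 2018, "revenue": 100_000.0},
    {"year": 2015, "revenue": 100_000.0},
]
trend = recency_trend(toy_niche, current_year=2024)
print(trend)
```

Using medians rather than sums keeps one breakout hit from making a whole niche look like it is growing.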

Top Opportunities

"Multiplayer + Open World" came out on top (690 games, 1.47B estimated owners summed across titles, so anyone who owns several of them is counted more than once). "Adventure + Open World" had the strongest new-entrant potential, with estimated revenue of $217K to $20M and a 2.9x recency trend. "Multiplayer + Shooter" had the highest growth signal at 3.9x, meaning recent titles were earning nearly four times what older games in the same space managed.

Revenue potential by market niche showing median and interquartile range

Price Sensitivity

I also fitted a log-linear model to see how price relates to ownership across genres:

Price model
# Observational model (not causal!)
log(owners_mid) ~ price_dollars + genre + review_score + platform_count

The global coefficient is actually positive (+0.74% per dollar), which sounds wrong until you think about it: better games cost more and sell more. It's a textbook endogeneity problem, and the model isn't trying to hide that. The R² is only 0.18. Quality dominates pricing as a predictor of sales.
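To make the coefficient concrete: in a log-linear model, a coefficient b on price means each extra dollar multiplies predicted ownership by exp(b). The snippet below assumes b is about 0.0074 in natural-log units, which is my reading of "+0.74% per dollar", and simply works through the arithmetic.

```python
import math

beta_price = 0.0074  # assumed coefficient on price_dollars (natural-log units)


def ownership_multiplier(price_delta: float) -> float:
    """Predicted ownership ratio for a price change, all else held equal."""
    return math.exp(beta_price * price_delta)


per_dollar_pct = (ownership_multiplier(1.0) - 1.0) * 100
per_ten_dollars_pct = (ownership_multiplier(10.0) - 1.0) * 100
print(round(per_dollar_pct, 2), round(per_ten_dollars_pct, 1))
```

So a game priced $10 higher is associated with roughly 7.7% more owners. That is an association in the data, not an effect of raising the price.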

The genre-level breakdown is more useful. Free-to-Play and MMO categories are the most price-sensitive, while other genres are more tolerant. The violins below show revenue distribution by genre on a log scale. The spread in some genres is huge, which suggests the market isn't very efficient: there's room at lots of price points.

Revenue distribution by genre shown as violin plots on a log scale

Dashboard and Pipeline

Everything feeds into a Streamlit dashboard with five tabs: market overview, niche explorer, recommender results, price analysis, and data quality. The niche explorer is the most useful one, letting you filter by tag combination, sort by opportunity score, and drill into individual niches.

The full pipeline runs as CLI commands:

Running the pipeline
python -m src.collect    # Steam crawl → SteamSpy → RAWG → clean & merge
python -m src.train      # Train hybrid recommender
python -m src.analyse    # Market gap scoring & price analysis
python -m src.visualise  # Generate charts
streamlit run src/visualisation/dashboard.py  # Interactive dashboard

There are 82 tests covering the data cleaning, merging, feature engineering, recommender, market gap scoring, and evaluation metrics. They all run on synthetic data, so no API keys are needed.

Summary

  • Multi-source pipeline: Steam API + SteamSpy + RAWG (50K+ games)
  • Hybrid recommender (ALS + content-based), about 3x the popularity baseline on revenue-weighted hit rate (14.8% vs 4.8% at K=20)
  • 140,000+ niches scored by opportunity, with revenue estimates ($200K–$21M for top niches)
  • Recency trend detection for emerging niches
  • Genre-level price sensitivity modelling (R² = 0.18, quality dominates)
  • Streamlit dashboard with 5 tabs
  • 82 unit tests on synthetic data