Steam Market Gap Analysis
The Idea
Say you wanted to build a game for Steam. You could look at what's popular and try to compete, but the more interesting question is: where is there demand that nobody is serving well? I wanted to see if I could answer that with data, so I built a pipeline that pulls from three different APIs, trains a recommendation engine, and scores over 140,000 market niches by how much opportunity they represent.
It's part recommendation system, part market research tool. The recommender tells you what games a given user would like; the market gap analysis tells you where a new game could actually make money.
Data Collection
No single API gives you everything you need here. The Steam Web API has user behaviour (who plays what, and for how long) but not much about the games themselves. SteamSpy has ownership estimates, pricing, and review scores. RAWG fills in genres, platforms, Metacritic scores, and release dates. So the pipeline collects from all three:
| Source | What it provides | Scale |
|---|---|---|
| Steam Web API | User game libraries and playtime | 10,000 users via BFS friend-graph crawl |
| SteamSpy | Ownership estimates, price, review scores, tags | 50,005 games |
| RAWG | Genres, platforms, Metacritic scores, release dates | 9,951 matched titles |
The Steam collection stage is probably the most interesting part. It starts from a few seed user IDs and does a breadth-first crawl of the Steam friend graph, picking up each user's game library along the way. Playtime is the key signal here: someone who's put 200 hours into a game is telling you something very different from someone who bought it and never opened it. About 54% of the user-game pairs in the dataset have zero playtime, so those get filtered out.
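The crawl itself is a few lines of textbook BFS. A minimal sketch, with `get_friends` and `get_library` as hypothetical stand-ins for the Steam Web API calls (injected so the traversal logic stays testable offline):

```python
from collections import deque

def crawl_friend_graph(seed_ids, get_friends, get_library, max_users=10_000):
    """BFS over the Steam friend graph, collecting each user's library.

    get_friends(user) -> list of friend IDs
    get_library(user) -> list of {"appid": ..., "playtime_min": ...} dicts
    """
    seen, queue, libraries = set(seed_ids), deque(seed_ids), {}
    while queue and len(libraries) < max_users:
        user = queue.popleft()
        # Keep only games the user actually played (zero-playtime pairs are dropped)
        libraries[user] = [g for g in get_library(user) if g["playtime_min"] > 0]
        for friend in get_friends(user):
            if friend not in seen:
                seen.add(friend)
                queue.append(friend)
    return libraries
```

The dependency injection also means the crawl can be resumed from a checkpoint by seeding the queue with the frontier rather than the original seeds.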
Since Steam and RAWG don't share game IDs, matching titles across the three sources required fuzzy string matching (via the thefuzz library). The whole pipeline supports checkpointing too, which turned out to be essential given RAWG's rate limits.
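The matching step looks roughly like this. The sketch below uses the stdlib's `difflib` as a stand-in (the project uses thefuzz, whose token-based scorers handle word reordering better); the 0.9 cutoff is illustrative:

```python
from difflib import SequenceMatcher

def best_match(title, candidates, cutoff=0.9):
    """Return the candidate most similar to `title`, or None if nothing
    clears the cutoff. Case-insensitive ratio in [0, 1]."""
    scored = ((SequenceMatcher(None, title.lower(), c.lower()).ratio(), c)
              for c in candidates)
    score, match = max(scored)
    return match if score >= cutoff else None
```

Returning `None` below the cutoff matters: a wrong merge silently poisons every downstream feature, so it's better to leave a title unmatched than to force a bad match.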
Once the data is merged, you can start looking at how genres relate to each other. The co-occurrence matrix is a good place to start. Action + Indie is by far the most common pairing, which won't surprise anyone, but the clusters around Simulation, Strategy, and RPG are more interesting:

Hybrid Recommendation Engine
The recommender has two halves. The collaborative filtering side uses ALS (Alternating Least Squares, via the implicit library) to learn 64-dimensional embeddings for users and games from the playtime matrix. I log-transform the playtime values with log1p so that someone with 5,000 hours in a game doesn't have 50x the weight of someone with 100 hours.
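The compression step is just `log1p` applied to the confidence values before they go into the sparse user-game matrix. A sketch, assuming playtime in minutes (the function name and units are mine):

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_confidence_matrix(rows, cols, playtime_minutes, n_users, n_games):
    """log1p-compressed playtime as ALS confidence: 5,000 hours ends up
    ~1.45x the weight of 100 hours, not 50x."""
    weights = np.log1p(np.asarray(playtime_minutes, dtype=np.float64))
    return csr_matrix((weights, (rows, cols)), shape=(n_users, n_games))
```

The resulting matrix feeds implicit's `AlternatingLeastSquares(factors=64)`; note that the expected matrix orientation depends on the implicit version (newer releases take the user-by-item layout shown here).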
The content-based side computes cosine similarity over feature vectors built from genre one-hot encoding, tag TF-IDF (up to 200 terms), price buckets, platform flags, and normalised Metacritic scores. This is mainly useful for the cold-start problem: new games with few interactions can still get recommended based on what they look like.
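To make the feature vector concrete, here is a stripped-down sketch: one-hot genres, a price bucket, and a scaled Metacritic score, compared by cosine similarity. The genre list and bucket edges are illustrative, not the project's actual ones, and the TF-IDF tag features are omitted for brevity:

```python
import numpy as np

GENRES = ["Action", "Indie", "RPG", "Simulation", "Strategy"]  # illustrative subset
PRICE_BUCKET_EDGES = [0, 5, 15, 30, 60]  # dollars; illustrative

def featurise(game):
    """One-hot genres + one-hot price bucket + Metacritic/100, concatenated."""
    genre_vec = np.array([float(g in game["genres"]) for g in GENRES])
    bucket = np.digitize(game["price"], PRICE_BUCKET_EDGES)
    bucket_vec = np.zeros(len(PRICE_BUCKET_EDGES) + 1)
    bucket_vec[bucket] = 1.0
    return np.concatenate([genre_vec, bucket_vec, [game["metacritic"] / 100]])

def content_similarity(a, b):
    va, vb = featurise(a), featurise(b)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```

Because the vectors depend only on metadata, a game released yesterday with zero interactions gets a meaningful similarity score immediately.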
The two scores are blended per-item, with a confidence weight based on how many interactions that game has. Popular games lean on collaborative filtering; obscure ones lean on content:
```python
# Per-item confidence weight (threshold = 100 interactions)
alpha = min(interaction_count / threshold, 1.0)
# Popular games trust CF; cold-start games trust content similarity
hybrid_score = alpha * cf_score + (1 - alpha) * cb_score
```
The scatter below gives a sense of the data the model is working with. Each point is a game, plotted by median playtime against estimated owners, coloured by genre. The range is enormous: four orders of magnitude on both axes.

Revenue-Weighted Evaluation
Standard recommendation metrics like Precision@K treat all correct recommendations equally. But if the goal is market intelligence, recommending a $30 game matters more than recommending a free-to-play one. So I added a revenue-weighted hit rate that weights each hit by price times estimated ownership. The hybrid model scored 14.8% at K=20 on this metric, versus 4.8% for a popularity baseline (about 3x better).
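One plausible formalisation of that metric (the exact normalisation in the project may differ): each relevant item carries weight `price * owners`, and the score is the fraction of total relevant weight captured by the top-K list.

```python
def revenue_weighted_hit_rate(recommended, relevant, price, owners):
    """Fraction of the total relevant revenue-weight captured by the
    recommendations. price/owners are dicts keyed by game ID."""
    def weight(game):
        return price[game] * owners[game]
    total = sum(weight(g) for g in relevant)
    hits = sum(weight(g) for g in recommended if g in relevant)
    return hits / total if total else 0.0
```

Note that free-to-play titles contribute zero weight under this scheme, which is exactly the intended bias: a model that only recommends free games scores 0 regardless of precision.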
Market Gap Scoring
This is really the point of the whole project. I define a "niche" as a combination of 2 or 3 Steam tags (e.g. "Multiplayer + Open World + RPG"). Since every game has multiple tags, the combinatorial explosion gives you over 140,000 distinct niches. For each one I compute four things:
- Supply: how many games exist in this niche
- Demand: total estimated ownership across all games in the niche
- Engagement: median playtime
- Satisfaction: median review score
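Enumerating the niches themselves is a small combinatorial step: every 2- and 3-tag subset of each game's tags becomes a niche key. A sketch with hypothetical field names:

```python
from itertools import combinations
from collections import defaultdict

def enumerate_niches(games, sizes=(2, 3)):
    """Map each 2- or 3-tag combination to the games carrying all those tags."""
    niches = defaultdict(list)
    for game in games:
        tags = sorted(set(game["tags"]))  # sorted so combos are canonical keys
        for k in sizes:
            for combo in combinations(tags, k):
                niches[combo].append(game["appid"])
    return niches
```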
These get normalised to [0, 1] and combined into an opportunity score. The idea is simple: you want niches where lots of people play, they play for a long time, they're happy with what they find, but there aren't that many games competing:
```python
# Normalise each component to [0, 1] first.
# supply_inv_norm is inverse supply (fewer competitors = higher value),
# so it multiplies the score rather than dividing it
opportunity_score = demand_norm * engagement_norm * satisfaction_norm * supply_inv_norm
```
The heatmap shows the top-scoring niches with their individual component scores. "Multiplayer + Open World" scores well across the board. Some of the e-sports niches have very high revenue but low competition, though that's partly because a few massive titles (CS2, Valorant) dominate those categories:

I also added a recency trend: comparing revenue from games released in the last 3 years against older titles in the same niche. A ratio above 1.0 means newer games are outperforming the older ones, which is a decent signal that the niche is growing rather than stagnating.
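A sketch of that ratio, assuming it compares mean per-game revenue between the two cohorts (the source doesn't specify totals vs. means, and the cutoff year is illustrative):

```python
def recency_trend(games, cutoff_year=2022):
    """Mean per-game revenue of recent releases divided by that of older
    titles. > 1.0 suggests the niche is growing; None if a cohort is empty."""
    recent = [g["revenue"] for g in games if g["release_year"] >= cutoff_year]
    older = [g["revenue"] for g in games if g["release_year"] < cutoff_year]
    if not recent or not older:
        return None
    return (sum(recent) / len(recent)) / (sum(older) / len(older))
```

Returning `None` for one-sided niches avoids a misleading signal: a niche with no recent releases isn't "stagnating at 0x", it's simply untested.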
Top Opportunities
"Multiplayer + Open World" came out on top (690 games, 1.47B total players). "Adventure + Open World" had the strongest new-entrant potential, with estimated revenue of $217K to $20M and a 2.9x recency trend. "Multiplayer + Shooter" had the highest growth signal at 3.9x, meaning recent titles were earning nearly four times what older games in the same space managed.

Price Sensitivity
I also fitted a log-linear model to see how price relates to ownership across genres:
```
# Observational model (not causal!)
log(owners_mid) ~ price_dollars + genre + review_score + platform_count
```
The global coefficient is actually positive (+0.74% per dollar), which sounds wrong until you think about it: better games cost more and sell more. It's a textbook endogeneity problem, and the model isn't trying to hide that. The R² is only 0.18. Quality dominates pricing as a predictor of sales.
The genre-level breakdown is more useful. Free-to-Play and MMO categories are the most price-sensitive, while other genres are more tolerant. The violins below show revenue distribution by genre on a log scale. The spread in some genres is huge, which suggests the market isn't very efficient: there's room at lots of price points.

Dashboard and Pipeline
Everything feeds into a Streamlit dashboard with five tabs: market overview, niche explorer, recommender results, price analysis, and data quality. The niche explorer is the most useful one, letting you filter by tag combination, sort by opportunity score, and drill into individual niches.
The full pipeline runs as CLI commands:
```shell
python -m src.collect      # Steam crawl → SteamSpy → RAWG → clean & merge
python -m src.train        # Train hybrid recommender
python -m src.analyse      # Market gap scoring & price analysis
python -m src.visualise    # Generate charts
streamlit run src/visualisation/dashboard.py   # Interactive dashboard
```
There are 82 tests covering the data cleaning, merging, feature engineering, recommender, market gap scoring, and evaluation metrics. They all run on synthetic data, so no API keys needed.
Summary
- Multi-source pipeline: Steam API + SteamSpy + RAWG (50K+ games)
- Hybrid recommender (ALS + content-based), 3x popularity baseline on revenue-weighted hit rate
- 140,000+ niches scored by opportunity, with revenue estimates ($200K–$21M for top niches)
- Recency trend detection for emerging niches
- Genre-level price sensitivity modelling (R² = 0.18, quality dominates)
- Streamlit dashboard with 5 tabs
- 82 unit tests on synthetic data