How We Search Through 120K+ Games in Under 3 Seconds
It all started on a lazy Saturday afternoon. My friend and I were on Discord, eager to dive into a new game that matched our mood. We sifted through countless titles, read reviews, and watched trailers. Hours passed, and the sheer number of options overwhelmed us. Frustrated that we couldn't find the right game quickly, we ended up not playing anything at all. That experience made me realize the need for a smarter, faster way to discover games: a tool that could cut through the noise and present tailored recommendations. Thus, the idea for GameSeek was born.
The Challenge: Overwhelming Choices
Steam's catalog keeps growing, and finding the right game in it can feel like searching for a needle in a haystack. With over 120,000 games available, the need for a smart, fast search engine became clear. I wanted to create a tool that could quickly sift through this massive collection and deliver the most relevant game suggestions, all while supporting the creative teams behind each title.
Building GameSeek: Our Journey
The first step was to gather the data. Using the Steam API, I collected the entire Steam catalog and stored everything in a local database on disk. I chose this over a traditional client-server database reached over a socket (like PostgreSQL) because reading directly from local storage gave us the speed we needed.
Once the data was in place, the next challenge was to understand natural language queries like “steampunk survival exploration game with co-op.” That required a recommendation system that combined traditional keyword matching with additional filtering techniques.
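The post does not name the storage engine, so the following is only a minimal sketch of what an embedded, file-based catalog could look like, using Python's built-in `sqlite3` as a stand-in. The table name, column names, and sample rows are hypothetical.

```python
import sqlite3

def open_catalog(path=":memory:"):
    """Open (or create) an embedded catalog database; no server process is involved."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS games ("
        "app_id INTEGER PRIMARY KEY, name TEXT NOT NULL, description TEXT)"
    )
    return conn

def save_games(conn, games):
    """Insert or update (app_id, name, description) rows in a single transaction."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO games (app_id, name, description) VALUES (?, ?, ?)",
            games,
        )

# Hypothetical sample rows; pass a file path such as "games.db" to persist on disk.
conn = open_catalog()
save_games(conn, [
    (1, "Example Game A", "steampunk survival with co-op"),
    (2, "Example Game B", "cozy farming sim"),
])
count = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
print(count)
```

Because the database lives in the same process as the search code, a lookup is a function call rather than a network round trip.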
Our Secret Weapon: TF-IDF and Beyond
At the core of GameSeek's search functionality lies the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, a cornerstone in information retrieval systems.
Understanding TF-IDF
TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It balances two factors:
Term Frequency (TF)
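Indicates how often a term appears in a document, normalized by the document's length. The assumption is that terms appearing more frequently are more significant. Here \(f(t,d)\) is the number of times term \(t\) occurs in document \(d\).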
\[ \text{TF}(t,d) = \frac{f(t,d)}{\sum_{w \in d} f(w,d)} \]
Inverse Document Frequency (IDF)
Measures the rarity of a term across the entire corpus. Rare terms are often more informative than common ones.
\[ \text{IDF}(t) = \log \left( \frac{N}{df(t)} \right) \]
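Here \(N\) is the total number of documents in the corpus and \(df(t)\) is the number of documents that contain the term \(t\). The TF-IDF score is the product of these two values: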
\[ \text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t) \]
This score helps prioritize terms that are frequent in a specific document but rare across the corpus, making them more relevant for distinguishing that document from others.
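To make the formulas concrete, here is a small sketch that computes them directly on a toy corpus. The function names and the three sample "documents" are illustrative assumptions, not GameSeek's actual code, and returning 0 for a term that appears in no document is a choice made here because the formula is undefined in that case.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: count of `term` divided by the document's length."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    """Inverse document frequency: log(N / df(t)); 0.0 if the term never occurs."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc_tokens, corpus):
    """Product of the two factors above."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    "steampunk survival game with co-op".split(),
    "space survival game".split(),
    "puzzle game".split(),
]

rare = tf_idf("steampunk", corpus[0], corpus)    # appears in 1 of 3 documents
shared = tf_idf("survival", corpus[0], corpus)   # appears in 2 of 3 documents
everywhere = tf_idf("game", corpus[0], corpus)   # appears in every document
print(rare, shared, everywhere)
```

A word that appears in every document, like "game" here, scores exactly zero because log(3/3) = 0, while the rarer "steampunk" scores highest.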
Beyond TF-IDF: Exploring Alternative Search Algorithms
While TF-IDF is powerful, we also experimented with other search techniques:
- BM25: A probabilistic ranking function that builds on TF-IDF by adding term-frequency saturation and document-length normalization.
- Word Embeddings (Word2Vec, FastText): Dense word vectors that capture semantic relationships between words.
- Transformer-Based Models: BERT-like models, which we are investigating for deeper natural language understanding.
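For reference, the standard Okapi BM25 scoring function is shown here with the same symbols as the TF-IDF formulas above. The post does not say which variant or parameter values were tried, so treat this as the textbook form rather than GameSeek's exact implementation:

\[ \text{BM25}(q,d) = \sum_{t \in q} \text{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left( 1 - b + b\,\frac{|d|}{\text{avgdl}} \right)} \]

Here \(|d|\) is the length of document \(d\) in terms, \(\text{avgdl}\) is the average document length in the corpus, and \(k_1\) and \(b\) are tuning parameters (commonly \(k_1\) between 1.2 and 2.0 and \(b = 0.75\)). The \(k_1\) term makes repeated occurrences of a word count less and less, and \(b\) controls how strongly long documents are penalized. Most implementations also use a smoothed variant of IDF.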
Intelligent Query Cleaning
Before we compare your search query against our database, we clean it up using a lightweight neural network (a perceptron). This step removes common words like “a,” “an,” or “the,” which often add noise rather than meaning. The cleaned query is then transformed into a vector, which lets us quickly compare it with the vectors of over 120,000 games using a similarity measure.
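The clean-then-compare flow can be sketched as follows. Two simplifications are assumed here: a fixed stop-word set stands in for the perceptron, and plain word counts with cosine similarity stand in for whatever vectorization and similarity measure GameSeek actually uses. All names and the sample strings are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (a stand-in for the perceptron).
STOP_WORDS = {"a", "an", "the", "with", "and", "of"}

def clean_query(query):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

tokens = clean_query("A steampunk survival game with co-op")
query_vec = Counter(tokens)
game_vec = Counter("steampunk survival exploration game".split())
score = cosine_similarity(query_vec, game_vec)
print(tokens, score)
```

In this toy case three of the four remaining query terms also occur in the game's text, so the similarity is 0.75.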
Optimizing for Speed: Technical Enhancements
Handling over 120,000 games and delivering results in under three seconds required meticulous optimization:
- Database Indexing: We indexed frequently queried columns (e.g., tags, keywords) to expedite data retrieval.
- Parallel Processing: We used multi-threading to compare a query against many games across the dataset at the same time.
- Caching: Implementing both RAM and disk-based caching reduced redundant computations and accelerated response times.
- Efficient Text Processing: By optimizing our text processing pipeline, we minimized unnecessary computations, ensuring swift query handling.
- Vectorized Computation: Utilizing NumPy's fast vector operations allowed us to compare search queries with game vectors in bulk, significantly reducing processing time.
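As an illustration of the vectorized step, the sketch below scores one query against every game with a single matrix-vector product and then picks the best matches. The function name, the matrix layout (one row per game), and the toy data are assumptions made for this example.

```python
import numpy as np

def top_k_matches(query_vec, game_matrix, k=5):
    """Return (indices of the k most similar rows, best first; all cosine scores)."""
    game_matrix = np.asarray(game_matrix, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # One matrix-vector product gives every dot product at once.
    dots = game_matrix @ query_vec
    denom = np.linalg.norm(game_matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows with zero norm (or a zero query) get a score of 0 instead of NaN.
    scores = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    k = min(k, len(scores))
    # argpartition finds the top k without sorting the whole array.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])], scores

# Hypothetical toy data: 4 games described by 3 features each.
games = np.array([[1, 0, 1],
                  [0, 1, 0],
                  [1, 1, 1],
                  [0, 0, 0]])
best, scores = top_k_matches(np.array([1, 0, 1]), games, k=2)
print(best, scores)
```

With a fixed catalog, the row norms could be computed once ahead of time so that each query costs only the matrix-vector product and the top-k selection.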
The Result: Lightning-Fast Game Discovery
By combining smart search algorithms with optimizations at every level, GameSeek can sift through over 120,000 games, typically in under three seconds, connecting gamers with titles that match their unique preferences.
GameSeek's search operates at:
- Query Processing Time: 1 to 4 seconds (depending on query complexity).
- Dataset Size: 120,000+ games, indexed and analyzed.
- Memory Usage: Optimized to stay under 1 GB of RAM.
Why It Matters
GameSeek isn't just about technology; it's about enhancing the way we discover games. Our goal is to support the creative teams behind each title by making it easier for gamers to find their next favorite game. By streamlining the search process, we hope to bring more attention to innovative, under-the-radar, and indie titles.
In Conclusion
What started as an afternoon of frustration turned into a project that rethinks game discovery. With GameSeek, searching through a catalog of over 120K games becomes a quick, enjoyable, and insightful experience. We're proud of what we've accomplished, and we're excited to see how this tool changes the way gamers find games.
We're just getting started. Some of the next improvements include:
- More Advanced NLP Models: Improving natural language search accuracy.
- Personalized Recommendations: Leveraging user preferences for tailored results.
- Community Features: Allowing users to tag, rate, and contribute insights on game recommendations.
Feel free to try GameSeek and let us know your thoughts. We're always looking to improve and iterate based on community feedback.