A Pentium II with 128MB of RAM achieved a remarkable 35.9 tokens per second running a tiny 260K-parameter Llama-architecture model.
EXO Labs recently shared an intriguing blog post detailing how it got the Llama AI model running on a vintage Windows 98 system, along with a short social media video. The clip shows an old Elonex Pentium II @ 350 MHz booting into Windows 98, after which EXO launches its custom C-based inference engine, derived from Andrej Karpathy's llama2.c, and prompts the LLM to write a story about "Sleepy Joe." Impressively, it does so quite swiftly.
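For context on what such an engine does: llama2.c-family programs run a simple autoregressive loop, executing the transformer forward pass for the latest token and then sampling the next token from the resulting scores. The sketch below illustrates only that loop shape; the names are hypothetical and the forward pass is stubbed with random scores, so this is not EXO's llama98.c code:

```c
#include <stdio.h>
#include <stdlib.h>

#define VOCAB_SIZE 32  /* toy vocabulary size */

/* Stand-in for the transformer forward pass: in a llama2.c-style
 * engine this consumes one token at a position and produces a
 * score (logit) for every vocabulary entry. Stubbed with random
 * scores here so the sketch runs end to end. */
static void forward(int token, int pos, float *logits)
{
    int i;
    (void)token;
    (void)pos;
    for (i = 0; i < VOCAB_SIZE; i++) {
        logits[i] = (float)rand() / (float)RAND_MAX;
    }
}

/* Greedy argmax sampling, the simplest strategy available. */
static int sample_argmax(const float *logits)
{
    int i, best = 0;
    for (i = 1; i < VOCAB_SIZE; i++) {
        if (logits[i] > logits[best]) {
            best = i;
        }
    }
    return best;
}

int main(void)
{
    float logits[VOCAB_SIZE];
    int token = 1;  /* BOS-style start token */
    int pos;

    for (pos = 0; pos < 20; pos++) {
        forward(token, pos, logits);    /* one transformer step */
        token = sample_argmax(logits);  /* choose the next token */
        printf("%d ", token);           /* a real engine decodes and prints text */
    }
    printf("\n");
    return 0;
}
```

The real forward pass multiplies through the model's weight matrices, and the sampler can apply temperature and top-p, but the control flow is essentially this simple, which is what makes a pure C port to Windows 98 feasible at all.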
"LLM operating on a 26-year-old Windows 98 PC with an Intel Pentium II CPU and 128MB RAM. Utilizing llama98.c, our tailored pure C inference engine based on @karpathy llama2.c. Code and DIY guide: pic.twitter.com/pktC8hhvva" (EXO Labs, December 28, 2024)
This groundbreaking achievement is just the beginning for EXO Labs. Emerging from obscurity in September, EXO Labs announced its mission to make AI accessible to all. Founded by a group from Oxford University, EXO is driven by the conviction that AI controlled by a few large corporations is detrimental to culture, truth, and societal foundations. Their goal is to create open infrastructure to train cutting-edge models and enable anyone to operate them on virtually any device. This demonstration using Windows 98 is a prime example of what’s possible with minimal resources.
The video shared on Twitter is quite brief, but thankfully, EXO's blog post, "Running Llama on Windows 98," offers more details. It is part of their "12 days of EXO" series, so there's more to look forward to.
Acquiring an old Windows 98 PC on eBay was straightforward for EXO, but setting it up was not without challenges. Getting data onto the machine proved particularly tricky, so the team fell back on "good old FTP" to transfer files over the retro machine's Ethernet connection.
Adapting modern code to run on Windows 98 was another significant hurdle. Fortunately, they found Andrej Karpathy's llama2.c, a lean, roughly 700-line C program that runs inference on Llama 2-architecture models. Using the vintage Borland C++ 5.02 IDE and compiler, and with a few adjustments, they compiled a Windows 98-compatible executable. The completed code is available on GitHub.
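The blog post is the definitive reference for the exact changes, but one class of adjustment is easy to anticipate: Borland C++ 5.02 predates the C99 standard, so habits like declaring variables mid-block or inside a for header have to be rewritten in older style. A hypothetical illustration (not EXO's actual diff), using the kind of dot-product routine that dominates inference code:

```c
#include <stdio.h>

/* Hypothetical example of pre-C99 style that a 1997-era compiler
 * such as Borland C++ 5.02 accepts: declarations sit at the top
 * of each block, before any statements, and loop counters are
 * not declared inside the for header. */
static float dot(const float *a, const float *b, int n)
{
    int i;       /* declared up front, not as "for (int i = ..." */
    float sum;

    sum = 0.0f;
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

int main(void)
{
    float x[3] = {1.0f, 2.0f, 3.0f};
    float y[3] = {4.0f, 5.0f, 6.0f};

    printf("dot = %f\n", dot(x, y, 3));
    return 0;
}
```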
"35.9 tok/sec on Windows 98. This is a 260K LLM with Llama architecture. We also tried larger models. Results in the blog post. https://t.co/QsViEQLqS9 pic.twitter.com/lRpIjERtSr" (EXO Labs, December 28, 2024)
Alex Cheema of EXO credited Andrej Karpathy's lean code for letting a 260K-parameter LLM run at 35.9 tok/sec on the old Windows 98 system. Karpathy, a former AI director at Tesla and a founding member of OpenAI, has contributed significantly to the field. A 260K-parameter model is tiny, but it still performed well on the dated 350 MHz single-core PC. According to the EXO blog, scaling up to a 15M-parameter LLM slowed generation to just over 1 tok/sec, while Llama 3.2 1B was glacial at 0.0093 tok/sec; at that rate, a 100-token reply would take roughly three hours.
BitNet: A Vision for the Future
The story goes beyond merely running an LLM on a Windows 98 system. EXO concludes its post by discussing its future aspirations for BitNet, a transformer architecture that uses ternary weights. With this design, a 7B-parameter model needs only 1.38GB of storage, a manageable load even for a 26-year-old Pentium II and trivial for more recent hardware. BitNet is also CPU-first, sidestepping the need for costly GPUs; it is claimed to be 50% more efficient than full-precision models and to support a 100B-parameter model on a single CPU at human-like reading speeds (around 5 to 7 tok/sec).
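The 1.38GB figure is consistent with simple arithmetic: a ternary weight carries log2(3), roughly 1.58 bits, of information, so 7 billion weights need about 1.4 billion bytes. A quick back-of-envelope check (our arithmetic, not code from EXO or the BitNet authors):

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Back-of-envelope check: ternary weights need log2(3) ~= 1.58
     * bits each, so a 7B-parameter model fits in about 1.4GB. */
    double params = 7e9;
    double bits_per_weight = log2(3.0);            /* ~1.585 bits */
    double bytes = params * bits_per_weight / 8.0; /* total storage */

    printf("~%.2f GB of storage\n", bytes / 1e9);  /* prints ~1.39 GB */
    return 0;
}
```

Real implementations pack ternary values into fixed-width bit fields, so the exact on-disk size depends on the packing scheme, but the order of magnitude holds.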
Before wrapping up, it's worth noting that EXO is still seeking collaborators. If you're interested in preventing the monopolization of AI by large corporations and think you can help, consider reaching out. For more casual engagement, EXO hosts a retro channel on its Discord server where enthusiasts discuss running LLMs on vintage hardware like old Macs, Game Boys, Raspberry Pis, and more.

Avery Carter explores the latest in tech and innovation, delivering stories that make cutting-edge advancements easy to understand. Passionate about the digital age, Avery connects global trends to everyday life.