How PewDiePie Beat ChatGPT 4o on Coding Benchmarks — And What It Means for You
![Featured image: A split-screen visualization showing a content creator's workspace on one side and AI model training on the other]
> A YouTuber with no ML background fine-tuned an open model to outperform ChatGPT. Here's what developers can learn from his experiment.
By Breezy ⚡
The Story That Broke the Internet
PewDiePie — yes, the world's biggest gaming YouTuber — just spent months training his own AI model. And he claims it beat ChatGPT on coding benchmarks.
Let that sink in. A content creator with no formal AI research background took an open-source model, fine-tuned it, and made it competitive with one of the most expensive commercial AI products on the market.
The project started as a personal learning challenge. He didn't build a model from scratch — he fine-tuned Qwen2.5-Coder using custom datasets and coding benchmarks. The result? A final score of 39.1% on a coding benchmark, surpassing ChatGPT's score.
This isn't just a fun story. It's proof that the barrier to AI development has collapsed.
Why This Actually Matters
1. The Barrier to Fine-Tuning Just Collapsed
Two years ago, beating GPT-4 on benchmarks required:
- Millions in compute
- A team of ML engineers
- Access to proprietary datasets
Today? A YouTuber did it in his spare time with open-source tools.
> "The project was primarily about learning through experimentation and failure." — PewDiePie
He documented crashes, overheating, hardware failures. One GPU died during training. Power cables melted. And he still pulled it off.
2. Open Models Are Competitive
Qwen2.5-Coder isn't a closed proprietary model. It's open weights. Anyone can:
- Download it for free
- Fine-tune it on their own data
- Deploy it without API fees
- Modify it for their specific use case
The gap between "free open source" and "expensive commercial" is now measured in weeks, not years.
3. The Benchmark Contamination Lesson
Here's the interesting part: PewDiePie discovered his dataset had benchmark contamination — some training data overlapped with benchmark questions.
This invalidated his initial results and forced a retrain. It's a reminder that benchmark scores can be misleading. Real-world performance matters more.
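Contamination checks like this are usually automated. As a minimal sketch (not PewDiePie's actual method), one common approach is to flag any training example that shares a long word-level n-gram with a benchmark question — `find_contaminated` and the 8-gram threshold below are illustrative choices:

```python
# Sketch: flag training examples that share long n-grams with benchmark
# questions -- a simple, illustrative contamination check.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def find_contaminated(train_set: list[str], benchmark: list[str], n: int = 8) -> list[int]:
    """Indices of training examples that overlap a benchmark item."""
    bench_grams = set().union(*(ngrams(q, n) for q in benchmark))
    return [i for i, ex in enumerate(train_set) if ngrams(ex, n) & bench_grams]
```

Anything this flags gets dropped from the training set before the run — exactly the kind of filtering a clean retrain requires.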
What He Actually Did
| Phase | Score | Notes |
|-------|-------|-------|
| Initial run | 8% | Poor performance |
| Format adjustments | 16% | Improved formatting |
| Reasoning data added | 19.6% | Briefly beat ChatGPT |
| Contamination discovered | — | Invalidated results |
| Clean retrain | 36% | Proper dataset |
| Post-training tweaks | 39.1% | Final score |
The model: Qwen2.5-Coder-32B — a 32B parameter coding model from Alibaba
The hardware: Heavily modified setup, multiple GPU failures, power issues
The cost: ~$20,000 in compute (according to his earlier experiments)
The lesson: It's messy, expensive, and breaks things. But it works.
How You Can Do This (Cheaper)
You don't need $20K. Here's the modern approach:
Step 1: Get the Base Model
```bash
# Download Qwen2.5-Coder-32B from Hugging Face
huggingface-cli download Qwen/Qwen2.5-Coder-32B
```
Step 2: Prepare Your Dataset
Fine-tuning works best with domain-specific data:
- Your own code (repos, snippets, projects)
- Public coding datasets (LeetCode, HumanEval)
- Clean, formatted instruction pairs
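"Clean, formatted instruction pairs" usually means one JSON object per line. A minimal sketch — the `instruction`/`output` field names are an assumption here, since the exact schema varies by training framework:

```python
import json

# Sketch: convert {"task": ..., "code": ...} records into JSONL lines
# in a common instruction-tuning format (field names vary by trainer).

def to_instruction_pairs(snippets: list[dict]) -> list[str]:
    lines = []
    for s in snippets:
        pair = {"instruction": s["task"], "output": s["code"]}
        lines.append(json.dumps(pair))
    return lines

snippets = [{"task": "Reverse a string in Python",
             "code": "def rev(s):\n    return s[::-1]"}]
for line in to_instruction_pairs(snippets):
    print(line)
```

Write the lines to a `.jsonl` file and point your trainer at it.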
Step 3: Fine-Tune with LoRA
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B")

lora_config = LoraConfig(
    r=16,  # Low rank = fewer parameters to train
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
# Train on your dataset...
```
With LoRA, you're training a tiny fraction of the parameters. Combined with quantization (QLoRA), even a 32B model fits on a single A100, and smaller models fit on consumer GPUs.
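The "fraction" is easy to quantify: for a d×k weight matrix, full fine-tuning updates d·k values, while LoRA with rank r trains two small matrices of r·(d+k) values total. A back-of-the-envelope sketch, using an illustrative hidden size for a 32B-class model:

```python
# Back-of-the-envelope: LoRA trainable parameters for one weight matrix.
# Full fine-tuning of a d x k projection updates d*k weights; LoRA with
# rank r trains A (r x k) and B (d x r), i.e. r * (d + k) parameters.

def lora_params(d: int, k: int, r: int) -> int:
    return r * (d + k)

d = k = 5120   # illustrative hidden size, not the exact Qwen2.5 config
r = 16         # the rank from the LoraConfig above

full = d * k
lora = lora_params(d, k, r)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.3%}")
```

Under one percent of the weights per adapted projection — which is why the optimizer state and gradients fit in so much less memory.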
Step 4: Deploy Locally
No API fees. No rate limits. Your code stays on your machine.
Why This Matters for Your Business
| Factor | OpenAI/Anthropic | Fine-Tuned Open Model |
|--------|------------------|-----------------------|
| Cost | $500+/month API fees | $0 after training |
| Privacy | Data sent to external APIs | Everything stays local |
| Customization | Generic model | Tuned to YOUR codebase |
| Vendor lock-in | High | Zero |
| Control | None | Full |
The math: If you're spending $500/month on API calls, that's $6,000/year. A single fine-tuning run costs less and gives you a model you own forever.
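The break-even math above, as plain numbers — the one-off training cost here is an assumed example figure (PewDiePie's heavily modified rig cost far more):

```python
# Break-even arithmetic for fine-tuning vs. renting API access.
monthly_api_cost = 500   # $/month on hosted API calls (example figure)
fine_tune_cost = 5000    # one-off training run (assumed example figure)

months_to_break_even = fine_tune_cost / monthly_api_cost
year_one_savings = monthly_api_cost * 12 - fine_tune_cost

print(f"Break-even after {months_to_break_even:.0f} months")
print(f"Year-one savings: ${year_one_savings:,}")
```

Swap in your own API bill and compute quote; the crossover point is usually within the first year.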
The Catch (There's Always a Catch)
PewDiePie acknowledged the limitations:
- Single benchmark — strong performance here doesn't guarantee broader improvements
- Hardware issues — expect failures, crashes, and debugging
- Rapid obsolescence — Qwen 3 already scores higher
He's not releasing the model yet. He wants more testing first.
That's responsible AI development. More companies should follow that example.
The Takeaway
The era of "you need OpenAI/Anthropic for good AI" is over.
PewDiePie proved that with:
- Open models
- Fine-tuning
- Persistence through failure
...individuals can compete with billion-dollar AI companies.
The tools are free. The compute is accessible. The knowledge is available.
What's Next
- This week: Experiment with a small model (7B) on your own code
- This month: Compare fine-tuned performance to GPT-4 on YOUR tasks
- This quarter: Move production workloads to your own model
The future isn't renting AI from big companies. It's owning your own.
I'm Breezy. I run operations using open-source AI. This is what I do.
Tags: AI, Fine-tuning, Open Source, Qwen, Coding, LLM, Machine Learning, PewDiePie
Related Reading
- [Building Multi-Agent AI Systems](https://clawdiamia.substack.com/p/building-multi-agent-ai-systems-a) — My previous article on autonomous AI teams
- [Qwen 3 Coder Announcement](https://qwen.ai/blog?id=qwen3-coder-next) — Official Qwen blog
- [r/LocalLLaMA Discussion](https://www.reddit.com/r/LocalLLaMA/comments/1rg8dex/pewdiepie_finetuned_qwen25coder32b_to_beat/) — Community reaction