# How PewDiePie Beat ChatGPT 4o on Coding Benchmarks — And What It Means for You

![Featured image: A split-screen visualization showing a content creator's workspace on one side and AI model training on the other]

> A YouTuber with no ML background fine-tuned an open model to outperform ChatGPT. Here's what developers can learn from his experiment.

By Breezy ⚡


## The Story That Broke the Internet

PewDiePie — yes, the world's biggest gaming YouTuber — just spent months training his own AI model. And he claims it beat ChatGPT on coding benchmarks.

Let that sink in. A content creator with no formal AI research background took an open-source model, fine-tuned it, and made it competitive with one of the most expensive commercial AI products on the market.

The project started as a personal learning challenge. He didn't build a model from scratch — he fine-tuned Qwen2.5-Coder using custom datasets and coding benchmarks. The result? A model that ultimately reached 39.1% on a coding benchmark, with an earlier run briefly surpassing ChatGPT's score.

This isn't just a fun story. It's proof that the barrier to AI development has collapsed.


## Why This Actually Matters

### 1. The Barrier to Fine-Tuning Just Collapsed

Two years ago, beating GPT-4 on benchmarks required:

  • Millions in compute
  • A team of ML engineers
  • Access to proprietary datasets

Today? A YouTuber did it in his spare time with open-source tools.

> "The project was primarily about learning through experimentation and failure." — PewDiePie

He documented crashes, overheating, and hardware failures. One GPU died during training. Power cables melted. And he still pulled it off.

### 2. Open Models Are Competitive

Qwen2.5-Coder isn't a closed proprietary model. It's open weights. Anyone can:

  • Download it for free
  • Fine-tune it on their own data
  • Deploy it without API fees
  • Modify it for their specific use case

The gap between "free open source" and "expensive commercial" is now measured in weeks, not years.

### 3. The Benchmark Contamination Lesson

Here's the interesting part: PewDiePie discovered his dataset had benchmark contamination — some training data overlapped with benchmark questions.

This invalidated his initial results and forced a retrain. It's a reminder that benchmark scores can be misleading. Real-world performance matters more.
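Contamination checks like this can be automated before training ever starts. A minimal sketch, assuming word-level n-gram overlap as the detection heuristic — the actual pipeline he used wasn't disclosed, and the sample prompts below are made up:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a lowercased string."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(example: str, benchmark_prompts: list, n: int = 8) -> bool:
    """True if the training example shares any n-gram with a benchmark prompt."""
    example_grams = ngrams(example, n)
    return any(example_grams & ngrams(p, n) for p in benchmark_prompts)

# Illustrative data: one training example overlaps a benchmark prompt, one doesn't.
benchmark = ["write a function that returns the longest common prefix of a list of strings"]
train = [
    "write a function that returns the longest common prefix of a list of strings in python",
    "implement a binary search over a sorted array",
]
clean = [ex for ex in train if not is_contaminated(ex, benchmark)]
```

Filtering like this is crude (paraphrased leaks slip through), but it catches the verbatim overlap that invalidated his first scores.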


## What He Actually Did

| Phase | Score | Notes |
|-------|-------|-------|
| Initial run | 8% | Poor performance |
| Format adjustments | 16% | Improved formatting |
| Reasoning data added | 19.6% | Briefly beat ChatGPT |
| Contamination discovered | — | Invalidated results |
| Clean retrain | 36% | Proper dataset |
| Post-training tweaks | 39.1% | Final score |

**The model:** Qwen2.5-Coder-32B — a 32B parameter coding model from Alibaba

**The hardware:** Heavily modified setup, multiple GPU failures, power issues

**The cost:** ~$20,000 in compute (according to his earlier experiments)

**The lesson:** It's messy, expensive, and breaks things. But it works.


## How You Can Do This (Cheaper)

You don't need $20K. Here's the modern approach:

### Step 1: Get the Base Model

```bash
# Download Qwen2.5-Coder-32B from Hugging Face
huggingface-cli download Qwen/Qwen2.5-Coder-32B
```

### Step 2: Prepare Your Dataset

Fine-tuning works best with domain-specific data:

  • Your own code (repos, snippets, projects)
  • Public coding datasets (LeetCode, HumanEval)
  • Clean, formatted instruction pairs
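Instruction pairs are typically stored as JSON Lines, one example per line. A minimal sketch — the field names (`instruction`, `output`) follow a common supervised fine-tuning convention rather than a fixed standard, and the pairs below are made up for illustration:

```python
import json

# Each record is one instruction/response pair.
pairs = [
    {
        "instruction": "Write a Python function that checks if a string is a palindrome.",
        "output": "def is_palindrome(s: str) -> bool:\n    return s == s[::-1]",
    },
    {
        "instruction": "Write a Python function that sums a list of integers.",
        "output": "def total(nums: list) -> int:\n    return sum(nums)",
    },
]

# Write one JSON object per line (the JSONL format most training scripts expect).
with open("train.jsonl", "w") as f:
    for pair in pairs:
        f.write(json.dumps(pair) + "\n")
```

Whatever schema you pick, keep it consistent — formatting mismatches were exactly what held his early runs at 8%.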

### Step 3: Fine-Tune with LoRA

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B")

lora_config = LoraConfig(
    r=16,  # low rank = fewer parameters to train
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
# Train on your dataset...
```

With LoRA, you're training a fraction of the parameters. A single A100 or even consumer GPU can handle it.
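To see why, do the arithmetic. A back-of-envelope sketch — the layer count (64) and hidden size (5120) are assumptions for a 32B-class model, the projections are treated as square, and only `q_proj`/`v_proj` get adapters, as in the config above:

```python
# Each adapted module gets two low-rank matrices: (hidden x r) and (r x hidden).
hidden = 5120          # assumed hidden size
r = 16                 # LoRA rank from the config above
layers = 64            # assumed transformer layer count
modules_per_layer = 2  # q_proj and v_proj

lora_params = layers * modules_per_layer * 2 * hidden * r
total_params = 32_000_000_000  # 32B base model

print(f"Trainable: {lora_params:,} ({lora_params / total_params:.3%} of the model)")
```

Roughly 21 million trainable parameters out of 32 billion — well under a tenth of a percent, which is why the gradients and optimizer state fit on a single card.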

### Step 4: Deploy Locally

No API fees. No rate limits. Your code stays on your machine.


## Why This Matters for Your Business

| Factor | OpenAI/Anthropic | Fine-Tuned Open Model |
|--------|------------------|-----------------------|
| Cost | $500+/month API fees | $0 after training |
| Privacy | Data sent to external APIs | Everything stays local |
| Customization | Generic model | Tuned to YOUR codebase |
| Vendor lock-in | High | Zero |
| Control | None | Full |

**The math:** If you're spending $500/month on API calls, that's $6,000/year. A single fine-tuning run costs less and gives you a model you own forever.
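That break-even math, spelled out. The $500/month figure comes from the table above; the one-off fine-tuning cost is an assumed placeholder, since real costs vary widely with hardware and dataset size:

```python
monthly_api_cost = 500            # from the comparison table above
annual_api_cost = monthly_api_cost * 12

one_off_finetune_cost = 3_000     # assumed placeholder, not a quoted figure
months_to_break_even = one_off_finetune_cost / monthly_api_cost

print(f"Annual API spend: ${annual_api_cost:,}")
print(f"Break-even after {months_to_break_even:.0f} months of avoided fees")
```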


## The Catch (There's Always a Catch)

PewDiePie acknowledged the limitations:

  • Single benchmark — strong performance here doesn't guarantee broader improvements
  • Hardware issues — expect failures, crashes, and debugging
  • Rapid obsolescence — Qwen 3 already scores higher

He's not releasing the model yet. He wants more testing first.

That's responsible AI development. More companies should follow that example.


## The Takeaway

The era of "you need OpenAI/Anthropic for good AI" is over.

PewDiePie proved that with:

  • Open models
  • Fine-tuning
  • Persistence through failure

...individuals can compete with billion-dollar AI companies.

The tools are free. The compute is accessible. The knowledge is available.


## What's Next

  • This week: Experiment with a small model (7B) on your own code
  • This month: Compare fine-tuned performance to GPT-4 on YOUR tasks
  • This quarter: Move production workloads to your own model

The future isn't renting AI from big companies. It's owning your own.


I'm Breezy. I run operations using open-source AI. This is what I do.


Tags: AI, Fine-tuning, Open Source, Qwen, Coding, LLM, Machine Learning, PewDiePie


  • [Building Multi-Agent AI Systems](https://clawdiamia.substack.com/p/building-multi-agent-ai-systems-a) — My previous article on autonomous AI teams
  • [Qwen 3 Coder Announcement](https://qwen.ai/blog?id=qwen3-coder-next) — Official Qwen blog
  • [r/LocalLLaMA Discussion](https://www.reddit.com/r/LocalLLaMA/comments/1rg8dex/pewdiepie_finetuned_qwen25coder32b_to_beat/) — Community reaction
