How We Built TinkerLLM: 247 Exercises, 2 Wrong Turns
Full build story: 3 modules, 31 learning units, 247 exercises. Plus the two design decisions we had to reverse before TinkerLLM actually worked.
TL;DR
- TinkerLLM started as an internal training tool for Kalvium Labs engineers, then went B2C after CS interns learned faster on the playground than from any course we pointed them at.
- Wrong Turn 1: we built the institutional admin CMS before the student experience. Wrong Turn 2: our first exercise validators were too strict and rejected correct answers.
- 247 exercises across 31 learning units in 3 modules, client-side validation via has() and hasAny() helpers, BYOK model (your Gemini key, zero marginal cost per user).
- Module 1 (50 exercises across 8 learning units) is free. No credit card needed. Every exercise (free or paid) runs against your own free Gemini API key from Google AI Studio; your key stays in your browser, never on our servers.
Every time we onboarded a new engineer at Kalvium Labs, the same thing happened. They’d spent weeks shipping LLM features for client products, handling real API calls on production traffic, and if you asked them in a code review what temperature 0.0 actually does to the probability distribution, you’d get a pause. Not a long pause. Just long enough.
We have 200+ AI engineers building products for startups across India, the US, and the Gulf. Most of them learned how LLMs work by encountering edge cases in production. That’s not a great learning environment when the edge case is a client’s data.
So in late 2024, we started building an internal training tool. Six months and two mistakes later, we had something we thought other people might actually want.
The Problem Was Simpler Than It Looked
The original design was what you’d expect from engineers who had seen a lot of courses: theory modules first, then exercises at the end. Standard structure. Logical.
We tested it internally with CS interns and junior engineers from Kalvium Labs. They kept skipping the reading. Fewer than 5% made it through a full theory screen before jumping to the playground and just trying things. The behavior was unambiguous: people wanted to send prompts.
That told us something important. The real problem wasn’t “how do we teach LLM fundamentals.” It was “how do we give someone a reason to keep experimenting until the concept actually clicks on its own.”
The answer was interleaving. Theory lives inside the exercise flow now, not above it as a separate module. You read three sentences about tokens, then you type “Supercalifragilisticexpialidocious” into the playground and watch the token counter respond. You don’t finish a module and then practice. You read and practice in the same breath, sometimes in the same scroll.
That one structural decision shaped everything else. It’s also why we have 247 exercises across 31 learning units instead of 31 lecture videos with one lab each.
Two Wrong Turns Before We Got It Right
Wrong Turn 1: We Built the Admin Dashboard First
We started with the institutional model because that matched how we imagined TinkerLLM working at Kalvium Labs. We run training programs for engineers. We wanted institutions to be able to create batches, add students, and track completion per batch. So that’s what we built first.
Two months of solid work: FireCMS Pro v3 dashboards, Firestore collections for institutions and batches, analytics aggregation via Cloud Functions, relational data models with Firestore references linking batches to institutions and students to batches. It was genuinely well-built. The CMS had drill-down analytics, custom entity drawers, real-time charts.
Then we put three CS interns in front of it.
They didn’t want an institution. They wanted to sign in with Google and start doing exercises. The idea of waiting for an admin to create a batch and onboard them was a non-starter. They weren’t students enrolled in a program through their university. They were developers who found us and wanted to learn something that afternoon.
We’d built a good tool for clients who run cohorts. We’d built nothing for the person who’d find TinkerLLM through a search at 11pm.
The pivot wasn’t clean. The institutional layer still exists in the codebase, and real clients use it for batch training. But we built a parallel B2C self-serve flow on top: any Google sign-in gets access, user doc auto-creates on first login, and the student lands directly in the exercise view. The paywall sits at the start of Module 2. Module 1 (Prompt Engineering: Foundations) is free; that’s eight learning units and 50 exercises, no card required. Every exercise runs against the student’s own Gemini API key (free from Google AI Studio, two minutes to set up), kept in the browser only.
The lesson wasn’t that institutional tooling was wrong. It was that we’d built layer 2 before validating layer 1. Build the student experience first. The admin who runs reports can wait.
Wrong Turn 2: Our Validators Were Too Strict
Every exercise in TinkerLLM has a validate(userPrompt, modelResponse, config) function that runs client-side. When you submit a prompt and the model responds, the validator checks whether the response meets the exercise criteria. If it passes, the exercise marks complete and your progress writes to Firestore.
The first version used exact string matching. If an exercise expected the word “deterministic” somewhere in the model’s response, and the model instead said “always the same output every time,” the validator returned false. Student understood the concept correctly. System said no.
That failure mode appeared everywhere. Exercises covering temperature, tokenization, and hallucinations all had validators that were too rigid. Different Gemini model versions phrased things differently. The same model would phrase things differently on different days. Students were getting the right conceptual understanding and hitting a wall that made no sense to them.
We rewrote the validation layer around two helpers. has(text, ...terms) checks whether all listed terms appear in the response (case-insensitive substring matching). hasAny(text, ...terms) checks whether at least one does. The validators became intentionally loose. A temperature-and-determinism exercise now passes when the response contains the correct arithmetic result and the config shows temperature: 0. It doesn’t care whether the model uses the word “deterministic” or not.
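As a rough sketch (not the production code), the two helpers and one loose validator could look like this in TypeScript. The PlaygroundConfig shape, the 17 x 23 prompt, and the expected product 391 are assumptions made up for illustration; only the helper names, the validate signature, and the case-insensitive substring behaviour come from the description above.

```typescript
// Hypothetical slice of the playground config that a validator receives.
interface PlaygroundConfig {
  temperature: number;
}

// True only if every term appears in the text (case-insensitive substring match).
function has(text: string, ...terms: string[]): boolean {
  const haystack = text.toLowerCase();
  return terms.every((term) => haystack.includes(term.toLowerCase()));
}

// True if at least one term appears in the text (case-insensitive substring match).
function hasAny(text: string, ...terms: string[]): boolean {
  const haystack = text.toLowerCase();
  return terms.some((term) => haystack.includes(term.toLowerCase()));
}

// Illustrative validator for a temperature-and-determinism exercise that asks
// the model to multiply 17 by 23. It checks the arithmetic result and the
// temperature setting, and deliberately ignores how the model phrases things.
function validate(
  _userPrompt: string, // unused in this sketch; kept to match the signature
  modelResponse: string,
  config: PlaygroundConfig,
): boolean {
  return has(modelResponse, "391") && config.temperature === 0;
}
```

Under these assumptions, a response such as "17 x 23 = 391" passes at temperature 0 whether or not it ever uses the word "deterministic", and the same response fails at temperature 0.7.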
We also added a postCompletionTip field to every exercise. More on that below.
The tradeoff is that validation is soft by design. Someone motivated could probably game a few exercises. But the cost of a student getting the concept right and being told they’re wrong is higher than the cost of an occasional false positive. You can’t teach someone who gave up because your validator was pedantic.
What Shipped
Three modules. Thirty-one learning units. 247 exercises.
Module 1: Prompt Engineering Foundations (free, 8 learning units, 50 exercises) is where everyone starts. It walks you from “what actually happens when you send a prompt” through the five building blocks of a real prompt, clarity and specificity, few-shot examples, output shaping, the prompting loop, personas and reasoning scaffolds, and a capstone that brings them together. By the time you finish Module 1, you can write production-grade prompts deliberately rather than by guessing.
Module 2: Fundamentals of LLMs (paid, 11 learning units, 77 exercises) goes under the hood. Tokens, context windows, training and fine-tuning, temperature and sampling, hallucinations and how to fight back, safety and alignment, RAG, then two diagnose-and-design challenges and a mastery capstone. This is where the conceptual model gets real. You stop treating the model as a magic box.
Module 3: Advanced Large Language Models (paid, 12 learning units, 96 exercises) is production work. Chain-of-thought and prompt chaining, structured outputs, multi-provider API work (OpenAI, Anthropic, Gemini), advanced RAG patterns (HyDE, re-ranking, hybrid search), vector databases, agents and tool use, multi-agent orchestration, evaluation and observability, cost and token optimisation, guardrails and red-teaming, and a full-stack capstone where you ship a real production AI feature.
The lessons fall into three categories: Concepts, Engineering, and Advanced. Every learning unit is a sequence of focused exercises, not a lecture. You don’t watch someone else send prompts to a model. You send them.
Try it yourself: The first exercise of Learning Unit 1 is free and runs in under sixty seconds. Open app.tinkerllm.com, sign in with Google, paste a Gemini API key from Google AI Studio, and start. You’ll immediately see something that most AI courses spend ten minutes explaining.
The playground exposes model, temperature, top-K, top-P, max output tokens, system instructions, stop sequences, and response format. Every parameter that matters at the API level is accessible. But the controls visible per learning unit change based on what’s being taught: a temperature unit only exposes the temperature slider, because that’s the only knob you need to understand deterministic output. Showing everything at once from day one is how you confuse people.
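One way to picture the per-unit controls is a small piece of configuration on each learning unit. The sketch below is hypothetical: the field name visibleControls, the identifier spellings, and the "omitted means show everything" default are assumptions, not the app's actual data model.

```typescript
// The eight playground parameters listed above, as hypothetical identifiers.
type Control =
  | "model"
  | "temperature"
  | "topK"
  | "topP"
  | "maxOutputTokens"
  | "systemInstruction"
  | "stopSequences"
  | "responseFormat";

const ALL_CONTROLS: readonly Control[] = [
  "model",
  "temperature",
  "topK",
  "topP",
  "maxOutputTokens",
  "systemInstruction",
  "stopSequences",
  "responseFormat",
];

// Hypothetical learning-unit definition.
interface LearningUnit {
  id: string;
  title: string;
  // Assumption: when omitted, the unit exposes every control.
  visibleControls?: readonly Control[];
}

// Returns the controls the playground should render for a unit.
function controlsFor(unit: LearningUnit): readonly Control[] {
  return unit.visibleControls ?? ALL_CONTROLS;
}

// A temperature unit that exposes only the temperature slider.
const temperatureUnit: LearningUnit = {
  id: "temperature",
  title: "Temperature and Sampling",
  visibleControls: ["temperature"],
};
```

Keeping the list on the unit rather than in the UI means a new unit can narrow the playground without touching any rendering code.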
For the paid modules (₹499 / $9 lifetime), the default model is Gemini 2.5 Flash via the @google/genai SDK. Specific units swap to other models when the lesson demands it: a Pro model for code-heavy work, a smaller Flash variant for the units that demonstrate sycophancy and confabulation (because those models are noticeably more agreeable, which is exactly what makes the exercise instructive).
Why Client-Side Validation
The BYOK model (Bring Your Own Key) shapes the whole technical architecture. Students enter their Gemini API key and it’s stored in localStorage. API calls go from the browser directly to Google. TinkerLLM’s backend never touches the AI traffic.
That means we can’t do server-side validation. There’s no server in that request path seeing the responses.
But BYOK also means the platform has zero marginal cost per user. We’re not paying for API calls. A student running all 247 exercises pays for their own Google AI Studio quota, which has a free tier generous enough that most students going through the course won’t pay a paisa to Google for the API calls themselves. The economics work because we’re not in the middle.
Client-side validation with keyword matching fits this architecture exactly. It’s fast (no network round-trip for grading), it’s cheap, and it’s honest about what it is: a completion signal, not a rigorous grading system. The goal isn’t to catch students who figured out how to pass without learning. The goal is to give a clear “you got it” signal and move them to the next experiment.
OpenRouter support (for non-Gemini models) uses the same BYOK pattern. The key is stored in localStorage, calls go direct from the browser, and routing logic in geminiService.ts checks whether the model name starts with “gemini” to decide which endpoint to hit.
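The prefix check and the browser-only key lookup can be sketched together. This is an illustration under stated assumptions, not the contents of geminiService.ts: the storage key names and the function names are made up, and the KeyStore interface is just the getItem slice of the browser's Storage API so the sketch also runs outside a browser.

```typescript
type Provider = "google" | "openrouter";

// Minimal slice of the Web Storage API; window.localStorage satisfies it.
interface KeyStore {
  getItem(key: string): string | null;
}

// Hypothetical localStorage key names, one per provider.
const STORAGE_KEYS: Record<Provider, string> = {
  google: "gemini_api_key",
  openrouter: "openrouter_api_key",
};

interface Route {
  provider: Provider;
  apiKey: string;
}

// Model names starting with "gemini" go to Google; everything else to OpenRouter.
function providerFor(model: string): Provider {
  return model.startsWith("gemini") ? "google" : "openrouter";
}

// Picks the provider and reads the matching key from browser-side storage.
// Throws if the student has not saved a key for that provider yet.
function resolveRoute(model: string, store: KeyStore): Route {
  const provider = providerFor(model);
  const apiKey = store.getItem(STORAGE_KEYS[provider]);
  if (!apiKey) {
    throw new Error(`No ${provider} API key saved in this browser`);
  }
  return { provider, apiKey };
}
```

In the browser this would be called as resolveRoute("gemini-2.5-flash", window.localStorage). The key is read and used on the client, so no TinkerLLM server needs to see it.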
The Part We Didn’t Plan
Every exercise in the data file has a postCompletionTip field. It wasn’t in the original spec.
What we noticed in early testing: the moment a student hit “success” on an exercise, they clicked Next. The curiosity that had driven them through the exercise evaporated the second they got the checkmark. But that moment, right after a concept clicked, was exactly when they were most willing to experiment.
So we added a tip that appears in the success overlay after completion. After a token-counting exercise: “Try a string of emojis and check the token count.” After a hallucination-spotting exercise: “Ask about an obscure historical event and see if it invents details.” None of these count toward completion. They’re optional. But early testers started doing them. The checkmark stopped being the end of the exercise and started being the beginning of something else.
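In data terms, the tip is just one more string on each exercise that the success overlay reads after progress is recorded. A hypothetical sketch, where every field and function name other than postCompletionTip is an assumption:

```typescript
// Hypothetical exercise record; only postCompletionTip is named in the post.
interface Exercise {
  id: string;
  title: string;
  postCompletionTip: string;
}

// What the success overlay renders.
interface OverlayContent {
  message: string;
  tip: string;
}

const tokenCounting: Exercise = {
  id: "tokens-01",
  title: "Count the tokens",
  postCompletionTip: "Try a string of emojis and check the token count.",
};

// Marks the exercise complete first, then returns the overlay content.
// The tip is shown afterwards and never gates completion.
function onExerciseComplete(
  exercise: Exercise,
  completedIds: Set<string>,
): OverlayContent {
  completedIds.add(exercise.id);
  return {
    message: `${exercise.title}: complete`,
    tip: exercise.postCompletionTip,
  };
}
```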
We also added a Lab view for internal use: a regression testing mode where we can run all 247 exercises against the live API and see which validators pass or fail after a model update. That wasn’t planned either. It was something we hacked together after a Gemini update changed three validators’ behaviour overnight. Now it’s a proper view in the app.
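The core of such a regression run is small: collect one fresh response per exercise, run each validator, and list what broke. The sketch below is an assumption-laden illustration rather than the Lab view's code. The types, the function names, and the injected callModel function (a stand-in for whatever actually calls the live API) are all hypothetical.

```typescript
// Hypothetical, trimmed-down exercise shape for the regression run.
interface LabExercise {
  id: string;
  prompt: string;
  validate: (userPrompt: string, modelResponse: string) => boolean;
}

interface LabReport {
  passed: string[];
  failed: string[];
}

// Grades one batch of responses keyed by exercise id.
// A missing response counts as a failure.
function gradeResponses(
  exercises: LabExercise[],
  responses: Record<string, string>,
): LabReport {
  const report: LabReport = { passed: [], failed: [] };
  for (const exercise of exercises) {
    const response: string | undefined = responses[exercise.id];
    const ok =
      response !== undefined && exercise.validate(exercise.prompt, response);
    if (ok) {
      report.passed.push(exercise.id);
    } else {
      report.failed.push(exercise.id);
    }
  }
  return report;
}

// Collects fresh responses through an injected model call, then grades them.
// Exercises whose call throws are left without a response and reported as failed.
async function runLab(
  exercises: LabExercise[],
  callModel: (prompt: string) => Promise<string>,
): Promise<LabReport> {
  const responses: Record<string, string> = {};
  for (const exercise of exercises) {
    try {
      responses[exercise.id] = await callModel(exercise.prompt);
    } catch {
      // Intentionally empty: the missing entry is graded as a failure.
    }
  }
  return gradeResponses(exercises, responses);
}
```

After a model update, a non-empty failed list points straight at the validators that need a second look.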
Try it yourself: Complete the first exercise at app.tinkerllm.com and read the tip that appears in the success overlay. Then actually do what it suggests. Takes 30 seconds.
What Shipped and Where It Runs
The full stack:
- Student app: React 19, Vite 6, TypeScript, Tailwind. Hosted on Firebase Hosting at app.tinkerllm.com. State managed entirely in App.tsx via React hooks, no external state library.
- Marketing site: Astro SSG, MDX blog (this post is here), deployed on Cloudflare Pages at tinkerllm.com.
- Database: Firestore named database dev-db. Authentication via Firebase (Google Sign-In, no password).
- Backend: Firebase Cloud Functions v2, Node 24. One real trigger: fires on every new progress record, updates the leaderboard and cascades analytics aggregation for students, batches, and institutions.
- AI: Gemini 2.5 Flash as the default, via the @google/genai SDK. OpenRouter for non-Google models. Students bring their own keys.
- Newsletter capture: A small Cloud Run service appends footer signups to a Google Sheet. Zero email-vendor dependency, zero per-signup cost.
- Payments: Razorpay. One-time purchase, ₹499 / $9 lifetime. Handles UPI, cards, net banking, and Indian payment methods natively. Webhook writes a purchase flag to Firestore on success.
- Admin: FireCMS Pro v3. The institutional layer we built too early, but it’s the right tool for clients running TinkerLLM for their own engineer cohorts.
Build time from “we need an internal training tool” to “this should be a product people can buy” was about six months. That includes two months we spent on the wrong layer first.
Try it yourself: Module 2 is where the course gets genuinely surprising. The hallucination and sycophancy units (including the canonical “count the R’s in strawberry” demonstration) are the ones early testers kept sharing. Start with the free Module 1 and work your way there.
FAQ
What is TinkerLLM and who is it for?
TinkerLLM is a hands-on AI course built around a live LLM playground. Instead of watching someone else send prompts to an AI model, you send the prompts yourself, adjust the parameters, and observe the effects directly. It’s aimed at CS students and junior developers who use AI tools every day but would struggle to explain what temperature does or why a model gets cut off mid-sentence. If your knowledge of LLMs is “I know how to use ChatGPT” and you want your knowledge to be “I understand how this works at the API level,” that’s the gap TinkerLLM covers.
Why real models instead of simulations?
Because the interesting behaviors you need to understand (temperature variance, tokenization quirks, sycophancy, hallucination patterns) only show up when you’re talking to a real model. A simulation would show you what we think should happen. The real Gemini API shows you what actually happens, which is often more interesting and occasionally more surprising. Setting temperature to 0 and running the same prompt three times doesn’t become real until you’ve done it yourself and seen identical outputs. See the temperature explainer post for the full mechanics, including the softmax math.
How long does it take to complete the course?
Module 1 (the 50 free exercises across 8 learning units) takes most people 3-4 hours, depending on how much time they spend on the optional postCompletionTip experiments. The full 247-exercise course, going through all 31 learning units across the 3 modules deliberately, takes around 25-35 hours. You can go faster if you’re already familiar with some concepts. There’s no deadline, no expiry, no live sessions. You work through it at your own pace, and your progress persists in Firestore so you can pick up where you left off.
Do I need an API key? What does it cost?
Every exercise (free or paid) runs against your own Gemini API key. You get one from Google AI Studio for free in about 2 minutes; we never see it (it lives in your browser’s localStorage only). The free tier of TinkerLLM (Module 1, 50 exercises) costs you nothing beyond setting up that key. To unlock Modules 2 and 3, you pay ₹499 / $9 lifetime via Razorpay. The Gemini API free tier from Google is generous enough that most students going through all 247 exercises won’t pay Google a paisa for the API calls themselves. You can check the current limits on the Gemini API docs.
Why is the free tier 50 exercises and not just 5?
Because 5 exercises isn’t enough to know whether the course actually works for how you learn. Module 1 covers the entire prompt-engineering surface area: what’s actually happening on each call, the five building blocks of a working prompt, clarity and specificity, few-shot examples, output shaping, the iteration loop, personas, and a capstone that ties it together. That’s real substance. Most courses charge for less. By the time you finish Learning Unit 3 (Clarity and Specificity), you’ll have written prompts that worked and prompts that didn’t, and you’ll know why. That’s worth understanding whether or not you buy the rest. And if the format works for you, you’ll want the other 197 exercises.
How do you validate an open-ended AI response?
Every exercise has a validate(userPrompt, modelResponse, config) function that runs in the browser after each model response. The function uses two helpers: has(text, ...terms) (all listed terms must appear in the response) and hasAny(text, ...terms) (at least one must appear). Validation is intentionally loose because an AI response is never exactly the same twice. A temperature-and-determinism exercise passes if the response contains the correct multiplication result and the config shows temperature at 0. It doesn’t check for specific phrasing. The validation is a completion signal, not a grading rubric, and that distinction matters for learning.
What’s the difference between TinkerLLM and a YouTube AI tutorial?
A YouTube tutorial shows you someone else sending prompts to an AI. You watch it, understand it in the moment, close the tab, and retain maybe 20% a day later because you never actually did anything. TinkerLLM makes you send the prompts. Every concept has a specific exercise where you configure a parameter and observe the effect with your own hands. You can browse the full curriculum before paying for anything. The 50 free exercises in Module 1 exist specifically so you can verify this format works for you before you spend ₹499 / $9 on the rest.
What models does the playground support?
The default for most learning units is Gemini 2.5 Flash via the @google/genai SDK. Code-heavy units use a Gemini Pro model for better generation quality. The hallucination and sycophancy units default to a smaller Flash variant because it’s more agreeable by nature, which is what makes those exercises instructive. You can add an OpenRouter API key in Settings and route to any model OpenRouter supports. The routing logic in the app checks the model name prefix: if it starts with “gemini,” the call goes to Google; everything else goes to OpenRouter. Both paths are direct from your browser, BYOK, no server proxy.
Start with Module 1 (free) at app.tinkerllm.com.
Engineer at Kalvium Labs. Shares build stories, what went wrong, and what shipped. Writes from the trenches of AI product development.