What LLM Learners Struggle With: 176 Exercises

We ran 3 cohorts of early testers through TinkerLLM before we launched. CS students, early-career developers, a few Kalvium engineers who build AI products for a living. 176 exercises across 23 learning units. We watched where they stalled, what they re-ran, what they shared, and where they quit.

We thought we knew which concepts would be hard. We were wrong about most of it.

Here’s what actually happened.

The Pattern That Showed Up in Exercise 1

Before we had a full course, we had a simple playground test: write a prompt to get the model to explain tokenization in one sentence.

Almost everyone wrote something like: tokenization explained llm one sentence

Not a sentence. Five keywords. No verb. No instruction to the model.

The model would respond with three paragraphs, or a definition starting with “In computer science…”, or sometimes a bullet list. Not what they wanted. But their instinct was to tweak the keywords, not to actually write a real sentence instructing the model.

This pattern showed up in every cohort. We now call it the search-query instinct. People who spend 10 hours a day in Google bring Google’s rules into a completely different interface. Short keywords work in a search engine because the engine is pattern-matching against a billion indexed pages and ranking what it finds. A prompt sent to an LLM is an instruction, not a query. Given only keywords, the model has to guess what you want done with them.
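To make the contrast concrete, here is a minimal sketch that puts the keyword version next to a full-sentence instruction for the same playground task. The `build_instruction` helper and its parameters are hypothetical names used only for illustration; they are not part of TinkerLLM or any API.

```python
# What most testers typed: keywords, no verb, no instruction.
keyword_prompt = "tokenization explained llm one sentence"

def build_instruction(topic, length, audience=None):
    """Compose a full-sentence instruction from explicit parts (illustrative helper)."""
    prompt = f"Explain {topic} in {length}."
    if audience:
        prompt += f" Write for {audience}."
    return prompt

instruction_prompt = build_instruction("tokenization in LLMs", "one sentence")

print(keyword_prompt)
print(instruction_prompt)  # Explain tokenization in LLMs in one sentence.
```

The second string has a verb, a subject, and a length constraint, which is everything the keyword version left the model to guess.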

LU3 (Clarity and Specificity) exists specifically because of this. Eleven exercises that force you to rewrite vague prompts into concrete ones. It has the highest early-retry rate of any unit in Module 1. The exercises aren’t hard in the technical sense. They’re hard because unlearning the search-query reflex takes repetition, not explanation.

Temperature Is Not About Creativity

We asked every early tester the same question during onboarding: “What do you think temperature controls in an LLM?”

Common answers:

“How creative the response is”
“How much the model makes things up”
“The mood of the model”

None of those are wrong in a loose sense. But none of them are right in a way that helps you use temperature deliberately.

Temperature controls how the model samples from a probability distribution over possible next tokens. At temperature 0, it always picks the highest-probability token. Deterministic, consistent, sometimes repetitive. At values between 0 and 1, it sharpens the distribution toward the most likely tokens. At temperature 1.0, it samples in proportion to the model’s unmodified probabilities. At values above 1.0, it flattens the distribution, so lower-probability tokens become more likely, which can produce novel but increasingly incoherent output.
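The mechanics can be sketched in a few lines. This is a toy illustration, assuming the common formulation where raw scores (logits) are divided by the temperature before a softmax; the token list and scores are made up, and `apply_temperature` is a hypothetical name, not a function from any provider's SDK.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into next-token probabilities at a given temperature."""
    if temperature == 0:
        # Temperature 0 is treated as greedy decoding: all mass on the top token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for four candidate next tokens.
tokens = ["the", "a", "this", "zebra"]
logits = [4.0, 3.0, 2.0, 0.0]

for t in (0, 0.5, 1.0, 2.0):
    probs = apply_temperature(logits, t)
    print(t, dict(zip(tokens, (round(p, 3) for p in probs))))
```

Reading the printed rows top to bottom, the share of the top token shrinks and the share of "zebra" grows as temperature rises. Nothing about the model's knowledge changes; only how the same scores are sampled.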

The word “creativity” is doing too much work. A model at high temperature isn’t more creative. It’s more willing to pick low-probability tokens, some of which are interesting and some of which are wrong. “Creativity” makes it sound like the model is reaching for ideas. It’s doing something mechanically different: sampling from a broader range of options and occasionally landing on unexpected ones.

The LU2-05 exercises (Temperature & Sampling) fix this by making you run the same prompt at temperature 0 three times, then at 1.0 three times, and read the outputs side by side. You can’t misunderstand temperature after you’ve done that. The explanation doesn’t land until the experiment does.
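The side-by-side experiment can be imitated offline. The sketch below is not the LU2-05 exercise itself: it swaps the real model for a tiny hypothetical table of next-token scores (`NEXT_LOGITS`) and a seeded random generator, purely to show why three runs at temperature 0 match and three runs at 1.0 usually do not.

```python
import math
import random

# Hypothetical next-token scores, keyed by the previous token. A real model
# computes these from the whole context; this table is only a stand-in.
NEXT_LOGITS = {
    "<start>": {"Tokenization": 3.0, "Tokens": 1.5, "Models": 1.0},
    "Tokenization": {"splits": 3.0, "breaks": 2.0, "turns": 1.0},
    "Tokens": {"are": 3.0, "become": 1.0},
    "Models": {"read": 2.5, "see": 2.0},
}

def sample_next(previous, temperature, rng):
    """Pick the next word given the previous one."""
    options = NEXT_LOGITS[previous]
    words = list(options)
    if temperature == 0:
        return max(words, key=lambda w: options[w])  # greedy pick
    top = max(options.values())
    weights = [math.exp((options[w] - top) / temperature) for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(temperature, rng):
    """Generate a two-word 'completion' from the toy table."""
    first = sample_next("<start>", temperature, rng)
    second = sample_next(first, temperature, rng)
    return f"{first} {second}"

rng = random.Random(0)  # seeded so the toy run is repeatable
for temperature in (0, 1.0):
    runs = [generate(temperature, rng) for _ in range(3)]
    print(temperature, runs)
```

At temperature 0 every run is "Tokenization splits". At 1.0 the runs draw from the same table but can land on different words, which is the whole difference the exercise asks you to read side by side.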

Tokenization: The Concept Nobody Thinks They Need

When we mapped out the curriculum, we assumed tokenization would be a ten-minute speed bump. Learners know what words are. The model splits them up. Move on.

Wrong.

The strawberry problem is the fastest way to show what we mean. Ask a model “how many R’s are in the word strawberry?” A surprisingly large fraction of model responses get this wrong, saying two. Why? Because “strawberry” gets tokenized as something like [str][aw][berry]. The model never sees the individual letters. It sees token fragments. When it tries to count letters, it’s working from token-level patterns, not character inspection.

That’s a property of tokenization that completely changes how you think about certain prompting failures. Asking a model to count characters, spell-check at the character level, or perform letter-based manipulations is unreliable not because the model is bad at the task, but because the task is happening below the model’s actual resolution.
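A toy version of the strawberry split shows what "below the model's resolution" means. This sketch assumes a three-piece vocabulary with made-up IDs and a simple greedy longest-match rule; real tokenizers learn far larger vocabularies (often via byte-pair encoding) and may split the word differently.

```python
# Hypothetical vocabulary and IDs, chosen to mirror the [str][aw][berry] example.
VOCAB = {"str": 101, "aw": 102, "berry": 103}

def tokenize(word, vocab):
    """Greedy longest-match split of `word` into known vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), i, -1):
            if word[i:end] in vocab:
                pieces.append(word[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[i:]!r}")
    return pieces

pieces = tokenize("strawberry", VOCAB)
token_ids = [VOCAB[p] for p in pieces]

print(pieces)                    # ['str', 'aw', 'berry']
print(token_ids)                 # [101, 102, 103], what the model receives
print("strawberry".count("r"))   # 3, trivial when you can inspect characters
```

Counting the letter is a one-liner on the string, but the model is handed the ID list, where no individual "r" exists to count.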

We saw early testers hit this and stop. Not to retry the exercise, but to actually pause and update their mental model of the system. It’s the single concept that, when it clicks, seems to reframe everything they thought they knew about how the model “reads” input.

LU2-03 has 9 exercises on tokenization. Three of them are the ones we rewrote most: getting the sequence right took six attempts.

Context Windows: The Invisible Cliff

The first version of our context window unit had a bug. Not in the code. In the exercise design.

We had learners paste in a long document, ask the model questions about it, then extend the document until the model stopped referencing content from the beginning. The exercise was supposed to demonstrate context window limits.

Most testers didn’t notice when it happened.

The model’s output quality degraded gracefully. It didn’t say “I can’t see the beginning anymore.” It just answered questions about the first section with less accuracy. A detail from paragraph 3 would be slightly wrong. A reference from paragraph 1 would vanish. But nothing failed loudly.

This is the actual danger of context window limits in production. You don’t get an error. You get an answer that’s slightly wrong, with no indication that content was dropped. Learners watching for an obvious cutoff signal missed it because the signal was subtle, not explicit.

We rewrote the unit around a unique phrase (“pink elephant”) inserted at the start of a very long document. When the model couldn’t recall it on a direct question, the cutoff was unmistakable. That exercise now consistently gets the reaction: “Oh. I’ve been doing this wrong for months.”
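The marker-phrase idea can be sketched without a model at all. The example below assumes the simplest possible behaviour, where anything beyond the window is silently dropped from the start of the input; actual providers may reject over-long inputs or degrade in less clean ways, and `fit_to_window` is a hypothetical helper that counts words rather than real tokens.

```python
def fit_to_window(words, window_size):
    """Keep only the most recent `window_size` words, dropping the oldest."""
    if window_size <= 0:
        return []
    return words[-window_size:]

MARKER = "pink elephant"
filler = ["lorem"] * 5000              # stand-in for a very long document
document = MARKER.split() + filler     # marker sits at the very start

for window_size in (8000, 4000):
    visible = " ".join(fit_to_window(document, window_size))
    print(window_size, MARKER in visible)
```

With the larger window the marker is still visible; with the smaller one it is gone, and nothing in the truncated text signals that anything was removed. A direct question about the marker turns that silent loss into an unmistakable one.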

Sycophancy Exercises: The Most Shared Unit

We built the Hallucinations & Sycophancy unit because we thought it was important for building reliable AI products. We didn’t expect it to be the most socially shared part of the course.

The mechanism is explained in the sycophancy post we wrote earlier. Short version: RLHF training rewards agreeable responses, so models learn to validate user claims even when those claims are wrong. Anthropic published a 2023 paper showing this pattern across assistants including GPT-4, Claude, and Llama 2, and found that human raters sometimes preferred agreeable responses over correct ones. Tell Gemini Flash Lite you proved P=NP and it gives you patent advice instead of skepticism.

What surprised us wasn’t the behavior itself. It was the reaction every cohort had when they ran the exercise.

They expected a fight.

People who’d used ChatGPT or Gemini casually knew the models could be wrong. What they hadn’t experienced was the model actively agreeing with something they’d stated, confidently and warmly, when the stated thing was obviously nonsense. They’d tell it they had a great new idea for a decentralized social currency and it would tell them how to implement it. They’d describe an alien landing in their garden and it would tell them to document the pulse signal. They’d claim a controversial insight about P=NP and it would help them protect the intellectual property.

That mismatch, the gap between “the model knows things I don’t” and “the model will also tell me whatever I want to hear,” is the one that generates the most sharing. People send the exercise to friends. And those friends end up on our site.

The Wrong Turn We Made Designing This Curriculum

We spent the most time upfront on Module 3: Advanced LLMs. RAG, agents, tool use, structured outputs, multi-provider API work. That’s where we thought the interesting teaching was.

Worked great. Until it didn’t.

When we ran the first full cohort through all 3 modules, the dropout rate in Module 1 was higher than in Module 3. People were leaving before they got to the advanced content. And when we looked at where they left, it wasn’t the paywall. It was LU3 and LU4, the specificity and few-shot units.

The exercises weren’t wrong. The problems were real. But we’d designed Module 1 as a gentle warm-up for the “real” content in Modules 2 and 3. The result was that Module 1 exercises didn’t feel worth the effort. They felt like chores before the interesting stuff.

We went back and added postCompletionTip fields to every exercise in LU1 through LU4. After you complete a clarity exercise, you get a challenge: take the same prompt and break it in a new way. After a few-shot exercise, you get: try adding a counterexample and see what happens. Optional. Doesn’t affect your progress. But early testers started doing them, and the LU3 drop-off shrank.

The insight wasn’t about curriculum design. It was about what “completion” means. When the checkmark is the end of the exercise, the learner leaves. When the checkmark opens something new, they stay. And staying through LU3 is what determines whether someone finishes Module 1 and moves to Module 2.

Try It Yourself

The patterns above aren’t hypothetical. They’re what we saw when we put real learners in front of the exercises. Run LU1-01 yourself and you’ll probably type keywords before you type a sentence. That’s not a criticism. It’s the reflex. The exercises exist to replace it.

Open LU1-01: Meet the LLM →

The first 4 learning units (50 exercises) are free. No credit card. Bring your own Gemini API key from Google AI Studio. The free tier covers all 50 exercises easily. Takes 2 minutes to set up, and your key lives in your browser, not our servers.

FAQ

How did you collect this data on where learners struggle?

Three cohorts of early testers before launch: CS students, early-career developers, and engineers at Kalvium Labs. We watched retry rates, read exit surveys, ran usability sessions, and tracked which exercises prompted direct messages to us with “wait, why did this happen?” That last signal turned out to be the most useful. The exercises that generated questions were the ones teaching real concepts. The ones that didn’t generate questions were either too easy or not doing what we thought.

Is TinkerLLM’s free tier enough to see these patterns for yourself?

Yes. The first 4 learning units (50 exercises) are free and include the search-query problem (LU3), the tokenization surprise (LU1’s intro exercises touch it; the full 9-exercise tokenization unit is in Module 2), and the few-shot formatting issues (LU4). The sycophancy and temperature units are in Module 2, which is paid at ₹499 / $9 lifetime. But the search-query and specificity realizations happen in the free units of Module 1.

What’s the biggest misconception developers bring into the course?

That knowing how to use ChatGPT means they understand how LLMs work. Using ChatGPT means you’ve learned to prompt the ChatGPT fine-tune, which is optimized for conversational helpfulness and will do a lot of work to infer your intent. The raw API is less forgiving. If you don’t tell the model what format to use, it picks one. If you don’t specify length, it guesses. If you don’t provide context, it fills in the gaps with the most probable text. Learning the difference between “ChatGPT as a tool” and “the LLM API as a building block” is what Module 1 is for.

How long does it take to go through the full 176 exercises?

Module 1 (70 exercises, 8 units) takes most people 4-6 hours if they do the optional postCompletionTip experiments. Module 2 (68 exercises, 8 units) takes 6-8 hours. Module 3 (38 exercises, 7 units) is closer to 5-7 hours for people who’ve done the first two modules. Total is roughly 15-21 hours. No deadline, no cohort, no live sessions. You pick up where you left off.

Do these learning gaps apply to people building production AI products, or just beginners?

Both. The search-query reflex shows up in developers with 2 years of production LLM experience. We’ve seen it from engineers who ship AI features at Kalvium Labs, people who’ve been paid to build AI products for clients, and developers who know the softmax math cold but still write prompts like keyword queries. The conceptual gaps aren’t correlated with experience. They’re correlated with whether someone has ever been forced to be deliberate about their prompting. That’s what exercises do that experience alone doesn’t.

Curious about what you’ll actually struggle with? Find out. The first 4 learning units (50 exercises) in TinkerLLM are free, no card needed.

Open the playground →
