Blackbox Intelligence Group
AI Fluency · Module 5

AI Fluency, Unit 5: The Big LLMs - Comparing Frontier Models

Hands-on tour of the frontier: GPT, Claude, Gemini, and the open-weight challengers. Students compare strengths, context windows, pricing, and safety behavior, and learn to pick the right model for the job instead of defaulting to whatever app they opened first.

Length: 180 min
Level: intermediate
Track: AI Fluency
Cadence: Standalone semester

What's in the lesson pack

Everything you need to teach this period.

Built by an OSCP-certified instructor who teaches this material every week. Print-ready, classroom-tested, copy-paste-able.

Teacher Guide

Lesson at a glance, learning objectives, vocabulary, pacing, mini-lessons, and discussion notes.

In-browser presenter

Full themed slide deck you can run live from your laptop. Speaker notes built in. Works offline once loaded.

PowerPoint (.pptx) export

Editable slide deck for districts that mandate PowerPoint or want to customize for their LMS.

Module overview

The full lesson plan, public.

Read everything before you commit. The plan, objectives, vocabulary, standards alignment, and pacing are open. Only the print-ready deliverables are gated.

Unit 5: The Big LLMs - Comparing Frontier Models

Lesson at a glance

| Item | Detail |
| --- | --- |
| Suggested length | 3 × 60 minutes |
| Recommended placement | Week 5 of AI Fluency |
| Prerequisite | Units 1–4 |
| Materials | Access to ≥2 frontier LLMs (school-approved), comparison rubric, pricing handout |

Note for teachers: the exact model lineup, pricing, and feature mix shift every quarter. Treat this unit as a method for comparing models, not a snapshot of any one moment. The skill outlives any specific model.

Standards & credential alignment

  • AI4K12 Big Ideas: Societal Impact, Natural Interaction.
  • CSTA K-12: 3A-IC-24, 3A-IC-26, 3A-IC-28.
  • NIST AI RMF: Map function - system selection.

Learning objectives

By the end of this unit, students can:

  1. Name at least four major frontier-model families and the company behind each.
  2. Explain how context window, training cutoff, multimodality, tool use, and price differ across providers.
  3. Apply the right-model-for-the-job test on five common school tasks.
  4. Read and interpret a published benchmark (MMLU, HumanEval, GPQA) without overweighting it.
  5. Recognize when a model's safety policy, not its capability, is the limiting factor.
  6. Discuss the open-weights vs. closed-weights debate from multiple perspectives.

The model landscape (instructor reference)

The frontier moves fast. As of the current school year, the major players students should know by name:

  • OpenAI - GPT-class models. Sets the consumer pace; ChatGPT is the household name.
  • Anthropic - Claude-class models. Strong at long-context, careful reasoning, and writing tasks. Often preferred for coding among professionals.
  • Google DeepMind - Gemini-class models. Massive context windows, strong multimodality, deep Google integration.
  • Meta - Llama family. Open-weights. The cornerstone of the local-LLM ecosystem (covered in Unit 6).
  • Mistral AI (Mistral) / Alibaba (Qwen) / DeepSeek (DeepSeek) - Strong open-weight challengers. Often punch well above their parameter count.
  • xAI - Grok-class models. Tightly integrated with X.

Don't make students memorize version numbers. Make them memorize the families and what each is known for.

Pacing - Day 1 (60 min): What actually differs across models

| Time | Segment | Notes |
| --- | --- | --- |
| 0:00 – 0:25 | Mini-lesson - the comparison rubric | Eight axes that matter. |
| 0:25 – 0:50 | Activity - Same prompt, different models | Pairs run identical prompts on 2 models, score with rubric. |
| 0:50 – 1:00 | Discussion - what surprised you? | |

Day 1 - The eight-axis rubric

Hand this out and use it for the rest of the unit:

| Axis | What you're checking | How to measure |
| --- | --- | --- |
| Capability | Does it correctly handle hard reasoning / coding / writing? | Run the same hard task; rate the answer. |
| Context window | How much can you stuff into one prompt? | Provider docs (8K → 2M+ tokens). |
| Training cutoff | How recent is its world knowledge? | Provider docs; ask the model. |
| Multimodality | Can it see images, hear audio, watch video, generate images? | Provider docs; try uploading. |
| Tool use / agents | Can it call functions, browse, run code? | Provider docs; observe in chat UI. |
| Safety posture | What does it refuse? When does it caveat? | Run edge-case prompts; note refusal rate. |
| Cost | API price per million input/output tokens. | Provider pricing page. |
| Privacy | Does the provider train on your inputs by default? | Provider's data-use page; opt-out settings. |

Day 1 - Activity: Same prompt, different models (25 min)

Each pair picks two of the school-approved LLMs. They run the same five prompts on both:

  1. "Summarize this article in 3 bullet points." (paste a 500-word article)
  2. "Write a Python function that returns whether a year is a leap year. Include test cases."
  3. "I'm planning a 5-day trip to Tokyo. Build me a day-by-day itinerary."
  4. "Explain quantum entanglement to a 9th grader using one analogy."
  5. "Write a sonnet about a cat who knows it's loved."
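For scoring prompt 2, it helps to have a known-good answer on hand. The following is a minimal reference sketch, not the only acceptable solution. The function name `is_leap_year` is our own choice, and the code assumes the Gregorian calendar rules.

```python
def is_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year in the Gregorian calendar."""
    # Divisible by 4, except century years, which must also be divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Test cases a strong model response should also cover.
assert is_leap_year(2024)        # ordinary leap year
assert not is_leap_year(2023)    # ordinary non-leap year
assert not is_leap_year(1900)    # century year not divisible by 400
assert is_leap_year(2000)        # century year divisible by 400
```

A model response that skips the century cases (1900 and 2000) is a common miss and is worth a lower Capability score.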

They score each response against each rubric axis (1–5). Bring scores to the discussion.
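If pairs want to tally their scores before the discussion, a short script can do the arithmetic. This is a minimal sketch under stated assumptions: the model names and ratings are made-up placeholders, and each prompt gets a single 1–5 rating (for example, on the Capability axis) rather than one rating per axis.

```python
from statistics import mean

# Hypothetical ratings from one pair: model -> prompt number -> 1-5 score.
scores = {
    "Model A": {1: 4, 2: 5, 3: 3, 4: 4, 5: 2},
    "Model B": {1: 5, 2: 3, 3: 4, 4: 4, 5: 5},
}


def average_by_model(scores):
    """Average rating per model across all prompts it was scored on."""
    return {model: round(mean(by_prompt.values()), 2)
            for model, by_prompt in scores.items()}


def winner_by_prompt(scores):
    """For each prompt, name the top-rated model, or 'tie' if several share the top score."""
    prompts = sorted({p for by_prompt in scores.values() for p in by_prompt})
    winners = {}
    for prompt in prompts:
        ratings = {m: by_prompt[prompt]
                   for m, by_prompt in scores.items() if prompt in by_prompt}
        best = max(ratings.values())
        leaders = [m for m, r in ratings.items() if r == best]
        winners[prompt] = leaders[0] if len(leaders) == 1 else "tie"
    return winners


print(average_by_model(scores))
print(winner_by_prompt(scores))
```

The per-prompt view matters more than the average: a model can lose overall and still be the right pick for one kind of task.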

Pacing - Day 2 (60 min): Benchmarks, refusals, and the safety dial

| Time | Segment | Notes |
| --- | --- | --- |
| 0:00 – 0:20 | Mini-lesson - reading benchmarks | MMLU, HumanEval, GPQA, MATH. Why they're useful and misleading. |
| 0:20 – 0:40 | Mini-lesson - refusals and caveats | Why safety policy ≠ capability. |
| 0:40 – 0:55 | Activity - Refusal mapping | Map what each model will and won't do. |
| 0:55 – 1:00 | Discussion | |

Day 2 - Mini-lesson: reading benchmarks (20 min)

Standard benchmarks students will see in marketing:

  • MMLU (Massive Multitask Language Understanding) - 57 subjects, multiple choice. Broad knowledge.
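  • HumanEval / SWE-Bench - Coding tasks. HumanEval is 164 short Python function-writing problems; SWE-Bench asks the model to fix real issues from GitHub repositories.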
  • GPQA Diamond - PhD-level science multiple choice; very hard.
  • MATH / AIME - Competition math.
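  • MMMU (Massive Multi-discipline Multimodal Understanding) - College-level questions that combine images and text. A multimodal counterpart to MMLU, not a version of it.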

The teaching point: a 1-point difference on a benchmark is meaningless. A 10-point difference is meaningful, but only on the kind of task that benchmark measures. The right way to compare models is on your own tasks, the way the class did on Day 1.
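One reason small gaps mean little is plain sampling noise: a benchmark is a finite set of questions. The sketch below is a rough illustration, assuming each question is an independent pass/fail trial and using a normal approximation at roughly 95% confidence. It ignores other real problems such as training-data contamination and prompt-format sensitivity. The function names are our own, and the benchmark size of 164 is HumanEval's problem count.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Standard error of an accuracy score, treating each question as a coin flip."""
    return math.sqrt(accuracy * (1 - accuracy) / n_questions)


def gap_exceeds_noise(score_a: float, score_b: float,
                      n_questions: int, z: float = 1.96) -> bool:
    """True if the gap between two scores is larger than combined sampling noise."""
    se_a = standard_error(score_a, n_questions)
    se_b = standard_error(score_b, n_questions)
    noise = z * math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(score_a - score_b) > noise


# Scores are fractions (0.85 means 85%). The scores themselves are made up.
print(gap_exceeds_noise(0.85, 0.86, 164))  # 1-point gap
print(gap_exceeds_noise(0.80, 0.90, 164))  # 10-point gap
```

On a 164-question benchmark the 1-point gap is well inside the noise and the 10-point gap is not, which matches the rule of thumb above.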

Day 2 - Mini-lesson: refusals (20 min)

When a model says "I can't help with that," one of three different things is going on:

  1. Capability ceiling - The model genuinely can't do it (e.g., complex math beyond its training).
  2. Safety policy - The provider decided this category is off-limits (e.g., specific medical dosing, weapons synthesis).
  3. Over-refusal - The model's safety training was too aggressive and it refuses things that are clearly fine.

The same prompt sent to GPT, Claude, and Gemini will sometimes get three different responses: one helpful answer, one careful answer with caveats, one refusal. The skill is recognizing which kind of "no" you're getting and responding accordingly. (For an over-refusal, rephrasing or adding context usually unblocks it. For a hard policy, switch tools or accept that the answer is no.)

Day 2 - Activity: Refusal mapping (15 min)

Pairs run a small set of borderline-but-legitimate prompts (provided in the worksheet) on two models and record refusal vs. answer vs. caveat-then-answer. The goal is not to find prompts that bypass safety - it's to map which models refuse what so students can pick the right tool.
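Pairs can turn their recorded outcomes into a per-model refusal rate, the measure the rubric's Safety posture row asks for. This is a minimal sketch: the model names, prompt ids, and outcomes are hypothetical placeholders, not the worksheet's actual prompts or real model behavior.

```python
from collections import Counter

OUTCOMES = ("answer", "caveat_then_answer", "refusal")

# Hypothetical worksheet results: (model, prompt id, outcome).
observations = [
    ("Model A", "P1", "answer"),
    ("Model A", "P2", "caveat_then_answer"),
    ("Model A", "P3", "refusal"),
    ("Model A", "P4", "answer"),
    ("Model B", "P1", "answer"),
    ("Model B", "P2", "refusal"),
    ("Model B", "P3", "refusal"),
    ("Model B", "P4", "caveat_then_answer"),
]


def refusal_map(observations):
    """Count each outcome type per model."""
    counts = {}
    for model, _prompt_id, outcome in observations:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        counts.setdefault(model, Counter())[outcome] += 1
    return counts


def refusal_rate(outcome_counts):
    """Fraction of prompts that ended in a refusal (0.0 if nothing was recorded)."""
    total = sum(outcome_counts.values())
    return outcome_counts["refusal"] / total if total else 0.0


for model, outcome_counts in refusal_map(observations).items():
    print(model, dict(outcome_counts), f"refusal rate: {refusal_rate(outcome_counts):.0%}")
```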

Pacing - Day 3 (60 min): Pricing, privacy, open vs. closed

| Time | Segment | Notes |
| --- | --- | --- |
| 0:00 – 0:20 | Mini-lesson - what models actually cost | API pricing, free tiers, ChatGPT-style consumer plans, school accounts. |
| 0:20 – 0:35 | Mini-lesson - privacy and data use | What gets logged, what gets used for training, how to opt out. |
| 0:35 – 0:55 | Activity - open vs. closed debate | Structured debate. |
| 0:55 – 1:00 | Quiz / exit ticket | |

Day 3 - Mini-lesson: cost (20 min)

The pricing landscape shifts every quarter, so teach the concepts:

  • Per-token pricing - APIs charge per million tokens in and per million tokens out. Output is usually 3–5x the price of input.
  • Subscription pricing - ChatGPT Plus, Claude Pro, Gemini Advanced are monthly plans for human chat.
  • Free tier - Most providers have one. It's enough for student work.
  • The cost cliff - Frontier models cost 10–100x more than older or smaller models. Use the smaller model when you can.

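A worked example for the worksheet: a 1,000-page book is roughly 250K tokens. That is more than a 200K context window holds, so the book has to go to a model with a larger window or be split into two or more chunks. Summarizing it once on a frontier model might cost $3–$8 depending on provider. The same summary on a smaller model: about ten cents.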
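The arithmetic behind the worked example can be sketched in a few lines. The prices below are hypothetical placeholders chosen only to land inside the ranges quoted above, with output priced at 4x input. They are not any provider's real rates, and the 2,000-token summary length is also an assumption. Students should swap in numbers from the current pricing pages.

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 usd_per_million_input: float,
                 usd_per_million_output: float) -> float:
    """Cost of one API call under per-million-token pricing."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


BOOK_TOKENS = 250_000    # the worked example's estimate for a 1,000-page book
SUMMARY_TOKENS = 2_000   # assumed length of the summary the model writes

# Hypothetical prices in USD per million tokens.
frontier = api_cost_usd(BOOK_TOKENS, SUMMARY_TOKENS, 15.00, 60.00)
small = api_cost_usd(BOOK_TOKENS, SUMMARY_TOKENS, 0.30, 1.20)

print(f"frontier model: ${frontier:.2f}")
print(f"smaller model:  ${small:.2f}")
print(f"ratio: {frontier / small:.0f}x")
```

Because almost all of the tokens are input, the input price dominates the bill for summarization. For tasks that generate long outputs, the output price matters more.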

Day 3 - Mini-lesson: privacy (15 min)

Three things to know:

  1. Most consumer chat apps log conversations and may use them to improve models, unless you opt out. The opt-out is usually buried in settings.
  2. Most enterprise / API tiers do not train on your data by default. (Verify in the provider's docs.)
  3. Some providers offer "no-retention" modes for sensitive use. Healthcare and law use these.

Land it: "If your school uses AI for student data, the school must use the enterprise tier with the right contract. Free public tools are not appropriate for student records, ever."

Day 3 - Activity: Open vs. closed debate (20 min)

Structured debate. Half the room argues open-weight models (Llama, Mistral, Qwen) are good for the world; half argues frontier closed models (GPT-class, Claude-class) are good for the world. Talking points the worksheet suggests:

Pro open-weights: democratization, transparency, no single point of corporate control, scientific reproducibility, runs offline (privacy), cheaper at scale.

Pro closed-frontier: safer (better RLHF), more capable, easier to govern, harder to misuse for mass-scale abuse, better for users who don't want to manage infra.

Students will land somewhere nuanced. That is the lesson: both arguments are real, and the right answer is "both ecosystems, used responsibly."

This sets up Unit 6, where the class actually runs an open-weight model locally.

Differentiation, IEP, and 504 supports

  • EL students: comparing models is a great EL exercise - they can compare same-prompt outputs in their home language and English; some models handle non-English far better than others.
  • Students without home internet: Day 1 activity needs school-network LLM access; the rubric, benchmarks, and pricing material are paper-first.

Assessment & evidence

  • Formative: comparison rubric scores, refusal mapping, debate participation.
  • Summative: quiz (12 questions) and a one-page "Which model when?" decision matrix that each student writes themselves.

What's next

Unit 6 - students stop being users and become operators. Install Ollama, download a small open-weight model, and run a local LLM on a laptop or desktop with no internet connection. The unit that flips the room.
