Unit 6: Local & Open-Source LLMs
Lesson at a glance
| Item | Detail | | --------------------- | ----------------------------------------------------------------------------------------------------------- | | Suggested length | 4 × 60 minutes (this unit benefits from the extra day) | | Recommended placement | Week 6 of AI Fluency | | Prerequisite | Units 1–5; signed AI Use Agreement on file; lab machines or BYOD with admin OK | | Materials | Ollama or LM Studio installed (admin-approved); ≥8 GB RAM per machine; ≥10 GB disk free; sample model files |
Safety: Students will install software and download model files. Do not run this lab on machines without IT approval. Coordinate with your IT department in advance. Sample models in this unit are uncensored only in the academic sense - they're general open-weight chat models with the same ethics the rest of the course teaches. Reinforce the AI Use Agreement.
Standards & credential alignment
- AI4K12 Big Ideas: Learning, Representation & Reasoning.
- CSTA K-12: 3A-CS-01, 3A-NI-04, 3A-AP-13.
- NIST AI RMF: Map and Measure functions.
Learning objectives
By the end of this unit, students can:
- Explain what "open-weight" means and how it differs from "open-source software."
- Read a model card and pick a model size appropriate for the available hardware.
- Install Ollama (or LM Studio) and pull at least two different open-weight models.
- Run a local model in chat mode and via the command line (or LM Studio's UI).
- Define quantization (Q4, Q5, Q8) and explain the size/quality/speed tradeoff.
- Explain three legitimate use cases for local LLMs (privacy, offline, cost, hackability).
- Recognize the limits of local models vs. frontier models and pick which to use when.
Vocabulary
- Open-weight model - A model whose trained weights are downloadable. You can run it offline, modify it, fine-tune it.
- Open-source model - Strictly, a model where weights and training code and training data are released. Rare. Most "open" models are open-weight only.
- Model card - The README that ships with a model. Lists training data sources, intended uses, limitations, license.
- Quantization - Compressing the model weights from 16-bit floats to 8-bit, 5-bit, or 4-bit integers. Smaller, faster, slightly less accurate.
- GGUF / safetensors - Common file formats for distributing model weights.
- Ollama - A free CLI/server tool that downloads and runs open-weight LLMs with a single command.
- LM Studio - A free desktop GUI app for running local LLMs.
- VRAM / RAM - Memory the model lives in while running. The first hard limit on which model you can run.
- Context window (local) - Same concept as Unit 2; smaller for most local models (8K–32K).
- Inference speed - Tokens generated per second. The second hard limit on usability.
Teacher background
This is the unit that flips the room. Many students arrive thinking "AI" is ChatGPT - a magical service in the cloud. By the end of this unit they will have downloaded a 4-gigabyte file, double-clicked it, turned off their wifi, and had a conversation with it. That moment changes their relationship with the technology permanently.
Two things to set expectations on early:
- Local models are not as smart as frontier models. A 7B-parameter model running on a laptop is roughly comparable to a frontier model from late 2022. That's still genuinely useful. But students who expect ChatGPT-level reasoning will be disappointed. Frame this honestly.
- The hardware matters. A school laptop with 8 GB RAM can run a Q4 7B model slowly. A gaming PC with a discrete GPU runs the same model 20x faster. A 70B model needs serious hardware. The worksheet has a sizing guide.
Recommended models for the unit, all roughly free under permissive licenses:
- Llama 3.x 8B (Meta) - well-rounded chat model.
- Mistral 7B or Mixtral - fast, strong general purpose.
- Qwen 2.5 7B (Alibaba) - surprisingly capable; strong at code and math.
- Phi-4 (Microsoft) - small, efficient, good for reasoning.
- Gemma 2 9B (Google) - well-aligned, good for school contexts.
Pacing - Day 1 (60 min): Why local? Install Ollama.
| Time | Segment | Notes |
| ----------- | --------------------------------- | ------------------------------------------------ |
| 0:00 – 0:20 | Mini-lesson - why local matters | Privacy, offline, cost, hackability, learning. |
| 0:20 – 0:50 | Lab - install Ollama or LM Studio | Walk-through. Verify each machine. |
| 0:50 – 1:00 | Lab - pull your first model | ollama pull llama3.1:8b or LM Studio download. |
Day 1 - Mini-lesson: why local? (20 min)
Four reasons local LLMs matter:
- Privacy. Nothing leaves your machine. For sensitive notes, journaling, ideation, drafting, this matters.
- Offline. Your AI works on a plane, in a basement, on a school network with restricted internet. The model is yours.
- Cost. After the one-time download, every prompt is free. For high-volume use, this dominates.
- Hackability. You can fine-tune it, modify the system prompt at the OS level, swap models in a script, build apps on top.
Plus one pedagogical reason:
- Learning. Owning the inference loop teaches more about LLMs than ten hours of chat. Watch the tokens stream out. Watch RAM fill up. Notice the model is a file.
Day 1 - Lab: install Ollama (30 min)
Pick one tool for the class - don't mix on the same day. Two excellent free options:
- Ollama (CLI; works great on macOS/Linux/Windows; perfect for terminal-comfortable students).
- LM Studio (desktop GUI; perfect for students who haven't used a terminal).
Install steps for Ollama:
# Windows: download installer from ollama.com and run.
# After install, open a terminal:
ollama --version # confirms install
ollama pull llama3.1:8b # downloads ~5 GB
ollama run llama3.1:8b # starts a chat
LM Studio: install the .exe / .dmg, open it, click "Discover," pick a 7B-class model with a green "fits in RAM" indicator, click download, then "Chat."
Verify each machine before moving on. A student whose install fails on Day 1 falls behind for the rest of the unit. Pre-check during planning week.
Pacing - Day 2 (60 min): Run it. Compare it.
| Time | Segment | Notes | | ----------- | ------------------------------------------- | ---------------------------------------------- | | 0:00 – 0:25 | Lab - first chat with a local model | Use the C.R.I.S.P. frame from Unit 3. | | 0:25 – 0:50 | Lab - local vs. frontier, head-to-head | Same 5 prompts, two models, score with rubric. | | 0:50 – 1:00 | Discussion - where local won, where it lost | |
Day 2 - Lab: first chat (25 min)
Students chat with the local model using prompts from their Unit 4 prompt library. They notice three things on their own:
- The model responds, but slower than ChatGPT.
- The answers are sometimes great and sometimes obviously weaker.
- The model has no knowledge of events after its training cutoff and no live tools.
Have them turn off wifi for at least one prompt. The "wait, this works with no internet?" reaction lands the unit.
Day 2 - Lab: local vs. frontier (25 min)
Same five-prompt drill from Unit 5, but now local 7B vs. frontier. Score on the six-axis rubric. The class will discover that:
- For simple tasks (rewording, summarizing short text, extracting structured data), local models hold their own.
- For complex reasoning (multi-step math, long-context analysis, hard coding), frontier wins clearly.
- For privacy-sensitive use, local wins automatically.
This is the right model for the job lesson made real.
Pacing - Day 3 (60 min): Quantization, model sizes, hardware
| Time | Segment | Notes | | ----------- | ------------------------------------ | -------------------------------------------------------- | | 0:00 – 0:25 | Mini-lesson - quantization explained | Q8 → Q5 → Q4. Why we pay this tax. | | 0:25 – 0:45 | Lab - pull two more models | Try a smaller model and a bigger one if hardware allows. | | 0:45 – 1:00 | Discussion - sizing guide | |
Day 3 - Mini-lesson: quantization (25 min)
The whiteboard graphic:
Original weights: 16-bit floats → model size = 2 bytes × parameters
e.g., 7B params × 2 = 14 GB
Quantize to:
Q8 (8-bit): ~1 byte/param → 7 GB. ~99% quality. Big win.
Q5 (5-bit): ~0.6 byte/param → 4.5 GB. ~97% quality. Sweet spot.
Q4 (4-bit): ~0.5 byte/param → 3.5 GB. ~95% quality. School-laptop friendly.
The tradeoff: quantization makes the model fit in less RAM and run faster, at a small accuracy cost. Q4 is the practical default for student machines. Q8 is the default for serious work. Q5 is the compromise.
Show this in Ollama / LM Studio: the same model has multiple quant sizes. The class picks the right one for their hardware.
Day 3 - Lab: pull two more models (20 min)
Each student tries one smaller model (e.g., Phi-4 mini, ~3 GB) and one bigger if their hardware can take it (e.g., 13B Q4, ~7 GB). They notice the speed difference. They notice the quality difference. They write three sentences about what they noticed.
Day 3 - Discussion: sizing (15 min)
Group consensus on a sizing guide:
| You have | Run this | | ---------------------- | -------------------------------- | | 8 GB RAM laptop | 3B Q4 (Phi-4 mini, Gemma 2 2B) | | 16 GB RAM laptop | 7B Q4 (Llama 3.1 8B, Mistral 7B) | | 16 GB RAM + 8 GB GPU | 7B Q5 / 13B Q4 | | 32 GB RAM + 12 GB+ GPU | 13B Q5, or 27B Q4 | | 64 GB+ workstation | 70B Q4 (slow but real) | | Rented GPU server | Anything you can pay for |
Pacing - Day 4 (60 min): Build something useful
| Time | Segment | Notes |
| ----------- | --------------------------------------------------- | ------------------------------------------------------------------------ |
| 0:00 – 0:15 | Mini-lesson - system prompts and personas in Ollama | The Modelfile / system prompt feature. |
| 0:15 – 0:55 | Lab - build "My Local Tutor" | Each student configures a custom local-model persona for their use case. |
| 0:55 – 1:00 | Show & tell | Volunteers demo their tutor. |
Day 4 - Mini-lesson: system prompts (15 min)
Both Ollama (via Modelfile) and LM Studio (via the system prompt field) let students bake a persona into the model. Example Modelfile:
FROM llama3.1:8b
SYSTEM """
You are a patient calculus tutor for a 10th-grade student. Always explain step by step. Never give the final answer; ask Socratic questions. Use one analogy per concept. Keep responses under 150 words.
"""
Then ollama create mytutor -f Modelfile and ollama run mytutor - the persona is permanent.
Day 4 - Lab: build "My Local Tutor" (40 min)
Each student designs and runs a persona for one of their other classes (calc tutor, history debate partner, French conversation buddy, lab-report editor, etc.). They iterate the system prompt. They submit the final Modelfile or LM Studio config as the artifact.
This is the unit's deliverable and one of the highest-value artifacts of the entire course.
Differentiation, IEP, and 504 supports
- Students whose machines can't run local models: provide a teacher-machine demo plus paired access. The conceptual lesson works fine without personal install.
- Students with terminal anxiety: route them to LM Studio. Same lesson, GUI flow.
- Advanced students: assign the fine-tuning preview - read about LoRA fine-tuning and write one paragraph on what data they'd fine-tune a model on if they could.
Assessment & evidence
- Formative: install verification, head-to-head rubric scores, sizing guide reflection.
- Summative: quiz (12 questions). The "My Local Tutor" Modelfile / LM Studio config is the major artifact - graded on persona clarity, constraint specificity, and demoed performance.
What's next
Unit 7 takes the model from "thing that talks" to "thing that does." Retrieval, tools, function calling, and agentic workflows. This is where AI starts to look like a product instead of a chatbot.
