AI Fluency · Module 8

AI Fluency, Unit 8: Multimodal AI - Vision, Voice, Image, Video

AI that sees, hears, draws, sings, and lies. Image generation, voice cloning, video synthesis, deepfakes, and the cryptographic fight back: C2PA, watermarking, and provenance. The unit parents most want students to take.

Length: 180 min
Level: intermediate
Track: AI Fluency
Cadence: Standalone semester

Career paths

Cloud Security

Unlock this module →Request a review copy Try the free pack

Download 1-page brochure (PDF)·Share with admins, parents, or your CTE director.

What's in the lesson pack

Everything you need to teach this period.

Built by an OSCP-certified instructor who teaches this material every week. Print-ready, classroom-tested, copy-paste-able.

Teacher Guide

Locked

Lesson at a glance, learning objectives, vocabulary, pacing, mini-lessons, and discussion notes.

In-browser presenter

Locked

Full themed slide deck you can run live from your laptop. Speaker notes built in. Works offline once loaded.

PowerPoint (.pptx) export

Locked

Editable slide deck for districts that mandate PowerPoint or want to customize for their LMS.

Module overview

The full lesson plan, public.

Read everything before you commit. The plan, objectives, vocabulary, standards alignment, and pacing are open. Only the print-ready deliverables are gated.

Unit 8: Multimodal AI

Lesson at a glance

| Item | Detail | | --------------------- | -------------------------------------------------------------------------------------------------------------------- | | Suggested length | 3 × 60 minutes | | Recommended placement | Week 8 of AI Fluency | | Prerequisite | Units 1–7 | | Materials | School-approved image generator + a vision-capable LLM; deepfake examples curated by teacher; printed C2PA reference |

Safety: This unit covers deepfakes and voice cloning. Demonstrations use public figures or fictional characters only - never students, staff, or community members. Reinforce the AI Use Agreement: cloning the voice or likeness of an identifiable person without consent is not allowed in this course and is illegal in many states.

Standards & credential alignment

AI4K12 Big Ideas: Perception, Natural Interaction, Societal Impact.
CSTA K-12: 3A-IC-24, 3A-IC-25, 3A-IC-28, 3A-IC-30.
NIST AI RMF: Govern, Manage - content provenance.

Learning objectives

By the end of this unit, students can:

Define multimodal AI and list four modalities models work with today (text, image, audio, video).
Use a vision-capable LLM to analyze an image and explain how it differs from text-only chat.
Use an image generator with a structured prompt and iterate it 3 rounds.
Define deepfake and recognize three categories (face swap, voice clone, full synthesis).
Explain C2PA content credentials and the role of provenance in fighting deepfakes.
Apply ethical and legal limits to image, voice, and video synthesis (consent, identifiable persons, copyright).

Vocabulary

Multimodal model - A model that handles more than one modality (e.g., text + image input, or text → image output).
Diffusion model - The architecture behind most image and video generators. Starts from noise, gradually denoises to an image.
Vision-language model (VLM) - An LLM that also accepts images as input.
Text-to-image - Models like DALL-E, Stable Diffusion, Imagen, Midjourney.
Voice cloning - Synthesizing speech that matches a target speaker's voice.
Deepfake - Synthetic media (video, audio, image) designed to impersonate a real person.
C2PA - Coalition for Content Provenance and Authenticity. An open standard that cryptographically signs media with edit history.
Watermarking - Embedding a signal in generated media that identifies it as AI-generated.
Right of publicity - A legal right protecting a person's name, voice, and likeness from unauthorized commercial use.

Pacing - Day 1 (60 min): Vision and image generation

| Time | Segment | Notes | | ----------- | ------------------------------------- | ------------------------------------------ | | 0:00 – 0:20 | Mini-lesson - vision-language models | What a VLM "sees" and what it doesn't. | | 0:20 – 0:40 | Activity - describe and analyze | Students upload images; observe responses. | | 0:40 – 1:00 | Lab - image generation with structure | Three rounds, iterating a single concept. |

Day 1 - Mini-lesson: vision-language models (20 min)

A VLM can:

Describe an image accurately at a high level.
Read text in images (OCR).
Identify objects, scenes, moods.
Reason about what's happening (e.g., "what's about to happen in this picture?").
Interpret diagrams, math equations, code screenshots.

A VLM cannot reliably:

Identify specific people (most refuse for safety reasons).
Read very small text or hand-drawn cursive consistently.
Count many small objects accurately.
Tell genuine vs. fake imagery (it can guess; it isn't a forensic tool).

Show, on the board, examples of each - VLMs are often shockingly good at one and shockingly bad at the next.

Day 1 - Activity: describe and analyze (20 min)

Students upload three images (school-appropriate; no people without consent) and ask the VLM:

"Describe this image."
"What's happening?"
"What might happen next?"

They note where the VLM nailed it and where it confidently described something that wasn't there. (Hallucination shows up in vision models too.)

Day 1 - Lab: image generation with structure (20 min)

Image-gen prompts follow their own version of C.R.I.S.P.:

Subject + Style + Composition + Lighting + Mood + Negative prompts (what to avoid)

Example:

"A red fox curled asleep in autumn leaves, warm afternoon sunlight, shallow depth of field, watercolor and ink style, soft focus background, no text, no people."

Students iterate one concept three times - a poster for a fictional school event, the cover for a fictional book, etc. They learn that prompt structure matters as much for image as for text.

Pacing - Day 2 (60 min): Voice, video, and the deepfake conversation

| Time | Segment | Notes | | ----------- | --------------------------------------- | ------------------------------------------------------- | | 0:00 – 0:20 | Mini-lesson - voice and video synthesis | What's possible. Live, curated demos. | | 0:20 – 0:40 | Mini-lesson - deepfakes and harm | Sextortion, election ads, scam calls. Honest and brief. | | 0:40 – 0:55 | Activity - spot the synthetic | Show 6 clips; class judges real vs. AI. | | 0:55 – 1:00 | Discussion | |

Day 2 - Mini-lesson: voice and video synthesis (20 min)

The state of the art (as of the current school year):

Voice cloning from 10–30 seconds of reference audio is reliable in commercial tools and open-source. Output is often indistinguishable in short clips.
Video synthesis (Sora-class, Veo-class) produces 5–60 seconds of realistic video from text. Quality is rapidly improving.
Real-time face swap in video calls is now possible on consumer hardware.
Lip-sync to audio is at production-quality on still images.

This is fast-moving. The teaching point is not any specific tool - it's that the capability is here and increasingly accessible.

Day 2 - Mini-lesson: deepfakes and harm (20 min)

The honest conversation, briefly:

Election interference. AI-generated audio of candidates saying things they never said. Several U.S. states now ban this within 60–90 days of an election; FCC banned AI-generated robocalls in 2024.
Romance / sextortion scams. Cloned voices of family members in fake "I'm in trouble" calls. Cloned likenesses for non-consensual imagery. The latter is a felony in a growing number of states; federal legislation has progressed.
Identity fraud. Voice clones bypassing voice authentication; face swap bypassing liveness checks.
Harassment of minors. AI-generated explicit imagery of classmates is the single most-reported AI misuse in schools as of this writing. It is a serious crime under federal law and most state laws. Treat it that way.

The line you have to land: "If the person didn't consent, you don't make it. You don't share it. You don't laugh at it. If you see one of a classmate, you tell a trusted adult."

If your school has had an incident, do not name it. If your school has not had one yet, you are likely to. Have the conversation now.

Day 2 - Activity: spot the synthetic (15 min)

Curate 6 clips in advance: 3 real, 3 AI-generated (use clearly licensed material). Class votes. Reveal answers. Most modern AI clips will fool most students. That is the lesson.

Pacing - Day 3 (60 min): Provenance, watermarking, and the law

| Time | Segment | Notes | | ----------- | ------------------------------------------ | ------------------------------------------------ | | 0:00 – 0:20 | Mini-lesson - C2PA content credentials | How provenance fights back. | | 0:20 – 0:35 | Mini-lesson - the legal landscape | Right of publicity, NCII laws, election laws. | | 0:35 – 0:55 | Activity - design a school AI media policy | Pairs draft 5 rules they'd ship to their school. | | 0:55 – 1:00 | Quiz / exit ticket | |

Day 3 - Mini-lesson: C2PA (20 min)

C2PA is an open content-credentials standard backed by Adobe, Microsoft, BBC, Sony, the New York Times, Truepic, and many AI providers. It works like this:

A camera (or AI tool) cryptographically signs an image at creation: who/what made it, when, with what tool.
Every edit appends a new signed entry.
Anyone can inspect the chain in tools like the Content Credentials viewer at contentcredentials.org.

It's not a silver bullet - adversaries can strip credentials, and absence of credentials does not prove fakery. But it's the most credible cross-industry effort to date. ChatGPT image outputs, Microsoft's tools, Adobe Firefly, and several major news outlets now ship C2PA by default.

Show one in class: download a recent AI image with credentials, paste into the inspector, walk the chain.

Day 3 - Mini-lesson: the legal landscape (15 min)

Cover four anchors at a high level:

Right of publicity - state laws (varies widely) protecting a person's name/voice/likeness. Stronger in CA, NY, TN.
Federal NCII (non-consensual intimate imagery) laws - multiple states criminalize AI-generated NCII; federal law has expanded.
Election deepfake laws - multiple states ban AI-generated political ads near elections.
Copyright - unsettled; current case law is evolving. AI output is generally not copyrightable to the prompter; using copyrighted training data is being litigated.

Teach this as the rules are being written right now, because they are.

Day 3 - Activity: design a school AI media policy (20 min)

Each pair drafts the 5 rules they would ship to their school for AI-generated media. They focus on:

Consent (whose face/voice/likeness can appear).
Disclosure (when AI use must be labeled in student work).
Categorical bans (NCII, identifiable peers, threats, harassment).
Educational exceptions (when synthetic media in class projects is fine).
Reporting (who to tell when they encounter misuse).

Best ones land on the wall. Several have been adopted, with light edits, by actual schools running this curriculum.

Differentiation, IEP, and 504 supports

Trauma-aware: deepfake content is a sensitive topic. Pre-flag the unit. Allow alternate paper-based work for any student who opts out of synthetic-media demos.
Read-aloud students: every demo can be narrated with description.

Assessment & evidence

Formative: image-gen iteration artifact, spot-the-synthetic scoring, AI media policy.
Summative: quiz (12 questions). The school AI media policy is the major artifact.

What's next

Unit 9 widens out to the systemic risks: hallucination, bias, prompt injection, jailbreaks, data leakage, copyright, and how to think clearly about AI safety without being either a doomer or a denier.

Ready to use this in class?

Unlock the full AI Fluency edition.

All teacher guides, worksheets, scenarios, quizzes, answer keys, and the in-browser presenter for every module in the track. Site-license pricing for schools and districts. Free review copies for verified educators.

View pricing Request review copy See full curriculum