Student presentation — Week 10
Multimodal AI: AI systems that process and generate multiple types of data (text, images, audio, video)
Let's discuss these in turn.
GANs (generative adversarial networks) are a type of generative model that uses two neural networks: a generator and a discriminator.
For images, the generator network takes a random noise vector and generates an image. The discriminator network takes an image and tries to determine if it's real or fake.
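As a rough illustration of the generator/discriminator split, here is a minimal sketch in plain Python. Everything in it is made up for the example: the tiny 4-pixel "images", the random weights, and the function names `generator`, `discriminator`, and `gan_losses`. Nothing is trained here; it only shows what each network takes in, what it outputs, and how the two standard GAN losses pull in opposite directions.

```python
import math
import random

def generator(noise, weights):
    """Map a noise vector to a tiny fake 'image' (a list of pixel values in (0, 1))."""
    return [1 / (1 + math.exp(-sum(w * z for w, z in zip(row, noise))))
            for row in weights]

def discriminator(image, weights, bias=0.0):
    """Return the estimated probability that `image` is real."""
    score = sum(w * p for w, p in zip(weights, image)) + bias
    return 1 / (1 + math.exp(-score))

def gan_losses(real_image, fake_image, d_weights):
    """Compute the standard GAN losses for one real and one fake example."""
    p_real = discriminator(real_image, d_weights)
    p_fake = discriminator(fake_image, d_weights)
    # The discriminator wants p_real near 1 and p_fake near 0.
    d_loss = -math.log(p_real) - math.log(1 - p_fake)
    # The generator wants the discriminator to be fooled (p_fake near 1).
    g_loss = -math.log(p_fake)
    return d_loss, g_loss

rng = random.Random(0)
noise = [rng.gauss(0, 1) for _ in range(3)]                  # random noise vector
g_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]  # 4 "pixels"
d_weights = [rng.uniform(-1, 1) for _ in range(4)]

fake = generator(noise, g_weights)
real = [0.9, 0.1, 0.8, 0.2]                                   # made-up "real" image
d_loss, g_loss = gan_losses(real, fake, d_weights)
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

In a real GAN both networks are deep neural networks, and training alternates between lowering `d_loss` (updating the discriminator) and lowering `g_loss` (updating the generator).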
CLIP is a multimodal model that understands the relationship between text and images.
It's not a generative model; instead, it lets you measure the similarity between text and images.
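To make "measuring similarity" concrete: a model like CLIP turns both a caption and an image into vectors (embeddings), and similarity is typically scored with cosine similarity. The sketch below uses made-up three-number embeddings rather than real CLIP outputs, and `cosine_similarity` is a helper written for this example, so treat it as an illustration of the idea only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (real ones have hundreds of dimensions).
text_embedding = [0.9, 0.1, 0.0]   # stands in for the caption "a photo of a dog"
dog_image = [0.8, 0.2, 0.1]        # stands in for a dog photo
car_image = [0.1, 0.1, 0.9]        # stands in for a car photo

sim_dog = cosine_similarity(text_embedding, dog_image)
sim_car = cosine_similarity(text_embedding, car_image)
print(f"caption vs dog photo: {sim_dog:.2f}")
print(f"caption vs car photo: {sim_car:.2f}")
```

The caption scores higher against the dog vector than the car vector, which is the kind of signal the later slides use for scoring generated images.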
LatentScape: Explore a Latent Image Space
Why would a model that measures distance between text and images be useful in generating images?
Think back to the GAN 'discriminator' network.
Having a measurable relationship between an image and text means you can start training a generator, and automatically score if it's generating good or bad images.
You can use that as a reward to train your generator.
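A toy version of "score as reward" might look like the following. It assumes images and text are already represented as small made-up vectors, and it swaps real generator training for a simple random search: propose a small change, keep it only if the text-image score goes up. The names `reward` and `improve` are hypothetical, chosen for this sketch.

```python
import math
import random

def reward(image_vec, text_vec):
    """Cosine similarity between image and text vectors (a stand-in for a CLIP score)."""
    dot = sum(x * y for x, y in zip(image_vec, text_vec))
    norms = (math.sqrt(sum(x * x for x in image_vec))
             * math.sqrt(sum(y * y for y in text_vec)))
    return dot / norms

def improve(image_vec, text_vec, steps=200, step_size=0.1, seed=0):
    """Random search: keep a random tweak only when it raises the reward."""
    rng = random.Random(seed)
    best, best_reward = list(image_vec), reward(image_vec, text_vec)
    for _ in range(steps):
        candidate = [x + rng.gauss(0, step_size) for x in best]
        r = reward(candidate, text_vec)
        if r > best_reward:
            best, best_reward = candidate, r
    return best, best_reward

text_vec = [1.0, 0.0, 0.5]   # made-up embedding of the prompt
start = [0.1, 0.9, 0.2]      # made-up embedding of a poor first attempt
start_reward = reward(start, text_vec)
final, final_reward = improve(start, text_vec)
print(f"reward before: {start_reward:.2f}, after: {final_reward:.2f}")
```

Real systems update a neural network's weights with gradients rather than tweaking an output directly, but the feedback loop is the same: generate, score automatically, keep what scores better.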
DALL-E 1 is a text-to-image generation model that uses a trained transformer model to generate candidate images and CLIP to rank them.
Subsequent versions of DALL-E were 'diffusion' models, which start from random 'noise' and iteratively refine it into an image.
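The "iteratively refined" part can be illustrated with a deliberately simplified loop. This is not a real diffusion sampler: `toy_denoiser` is a hypothetical stand-in that simply returns a known target, where a real model uses a trained network to estimate the clean image (or the noise) from the noisy input. The sketch only shows the shape of the process: start from noise, then take many small steps toward the model's prediction.

```python
import random

def toy_denoiser(noisy, target):
    """Stand-in for a trained network. A real model learns this prediction from data."""
    return target

def sample(target, steps=10, seed=0):
    """Start from pure noise and refine it step by step toward the prediction."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in target]       # pure random noise
    history = [list(x)]
    for t in range(steps):
        predicted = toy_denoiser(x, target)
        blend = 1 / (steps - t)                 # small moves early, full move at the end
        x = [xi + blend * (pi - xi) for xi, pi in zip(x, predicted)]
        history.append(list(x))
    return x, history

def error(x, target):
    """Total absolute difference between the current sample and the target."""
    return sum(abs(a - b) for a, b in zip(x, target))

target = [0.2, 0.8, 0.5, 0.1]                   # made-up 4-pixel "image"
final, history = sample(target)
errors = [error(h, target) for h in history]
print(f"error at start: {errors[0]:.3f}, at end: {errors[-1]:.3f}")
```

Each intermediate entry in `history` is a slightly less noisy version of the one before, which is what the step-by-step previews in image generators are showing.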
Images: an SDXL fine-tune
Launched March 2025 — a new axis for thinking about image generators:
Midjourney optimizes for aesthetic quality; GPT-4o optimizes for instruction precision.
Two different tools for different creative goals.
March 2025: GPT-4o image generation launched → millions of users generated images in the style of Studio Ghibli
Became a cultural flashpoint about:
An unusual name for a powerful model family — the codename stuck:
Video went from "impressive demo" to production tool in 2025:
Key shift: In 2024 these were party tricks. By 2025, filmmakers were shipping work made with these tools.
Sora launched publicly December 2024 (previously preview-only) — a production product by 2025... and killed by 2026.
This is an AI literacy class. How is this something that we end up using?
Unlike text, where generative models have proved extremely useful, generative capabilities for visual media tend to be more fun or specialized (e.g. art and film; stock photography).
For visual media, interpretive capabilities tend to be more broadly useful in practice.
Vision Transformers (ViT; Dosovitskiy et al. 2020) are a type of transformer model specifically designed to process images.
DALL-E was a transformer-based *autoregressive* model (i.e. a decoder-only generative model). Here, we're talking about an *encoder* model, which can be used to understand images.
The Vision Transformer approach makes clear that images can be treated the same way as text, so why not use them together?
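One way to see how an image can be handled like text: ViT cuts the image into fixed-size square patches and flattens each patch into a vector, so the image becomes a sequence of "tokens" much like words in a sentence. The sketch below does this for a tiny 4x4 grid of numbers. The function name `image_to_patches` and the toy image are assumptions for illustration; a real ViT works on full-size images and also applies a learned projection and position information to each patch.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixels into flattened, non-overlapping patches (row-major order)."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# A made-up 4x4 "image" whose pixel values are just 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, patch_size=2)
for i, token in enumerate(tokens):
    print(f"patch {i}: {token}")
```

The 4x4 image becomes a sequence of four 4-number tokens. Once images are sequences of tokens, they can sit in the same input as text tokens, which is the idea behind the multimodal models on the next slide.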
GPT-4V (OpenAI, 2023); Claude 3 Vision (Anthropic, 2024); Gemini (Google, 2023); Llama 3.2 (Meta, 2024)
What makes this composition particularly compelling is the juxtaposition of the dog's natural solemnity against the playful absurdity of being dressed as the very food item that shares its colloquial name.
Order on Jonathan Sergeant, Treasurer, to pay £17 19s. 10 1/2 d. to Richard Scott, for work done about the college, according to the within account.
... And just like that...
Short Exercise: Try GPT Voice Mode
Artbreeder Splicer
Courts are actively working out the rules — expect continued uncertainty.
Emerging key principle: competitive substitution
Consider:
Generate and critique AI art — combining image generation and vision capabilities.
Use an image generator to create and iterate on AI-generated images. Push past the "digital art" defaults — try different styles, remixes, subjects.
Some options:
Contribute your favorite image to the class gallery. Include a text box with your prompt and tool.
Use a vision-capable model (ChatGPT, Claude, Gemini) to write a short curator's statement (under 50 words) for someone else's image in the gallery.
You're curating — the AI writes the feedback, but you direct its voice and select what you like. Make it silly, serious, or surprising. Don't be mean.
Post your statement as a Post-It note beside their image.
References