Week 2 · In-class exercise

Vector Space Model: A Physical Exercise

Goal: build a document vector space using nothing but word counts, plot documents as points in 3D, and see what "similarity as distance" actually means.

The setup

Three words are our axes. Each document's coordinate on an axis is how many times that word appears.

X axis — data (floor, 0–6) — marked with tape or post-it notes
Y axis — people (floor, 0–6) — marked with tape or post-it notes
Z axis — library (height, 0–3) — marked with twine or a post-it held overhead

Each person takes one document. Read your paragraph, count the three words, then walk to your point on the floor. Raise the twine (or post-it) to the right height for library. You are the document vector.

Part 1 — Count your document

Count data, people, and library in your paragraph. Matching is case-insensitive and exact-word only, so "People" counts as people but "librarian" does not count as library. Then place yourself at your point.
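If you want to check a hand count, the counting rule can be sketched in a few lines of Python. This is only an illustration: the function name `count_vector` and the sample sentence are made up for this sketch and do not come from the handout's documents.

```python
import re
from collections import Counter

AXES = ("data", "people", "library")  # the three axis words from the setup

def count_vector(text):
    """Return (data, people, library) counts for a paragraph."""
    # Lowercase first, then pull out runs of letters so punctuation never
    # sticks to a word ("data," still counts as "data").
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Exact matches only: "librarian" is a different word from "library".
    return tuple(counts[axis] for axis in AXES)

# Hypothetical sample sentence, not one of the six documents.
sample = "People trust the library. The librarian keeps data about people."
print(count_vector(sample))  # (1, 2, 1)
```

Paste your own paragraph in place of `sample` to compare against your count.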

Doc 1 — "The Reading Room"

The library was quiet on Sunday afternoon. A few people sat by the window with books open on their laps. The library smelled of old paper and floor polish. Every library I have known has had that same smell.

Doc 2 — "Census Report"

The census data arrived in March. Analysts cleaned the data, cross-checked it against prior data, and published a summary showing how many people had moved between counties. The data suggested that younger people were leaving rural areas, while older people stayed. More data is expected next quarter.

Doc 3 — "Digital Library Project"

The library launched a new data portal this spring. Patrons can now search library catalog data and download open data sets. The library hopes the data will help people doing local history research. Early usage data looks promising.

Doc 4 — "The Protest"

Thousands of people gathered in the square. Young people chanted while older people held signs. Reporters asked people why they had come. People answered that they were tired of being ignored, and that people deserved to be heard.

Doc 5 — "Machine Learning Intro"

Machine learning models are trained on large amounts of data. The quality of the data matters more than the quantity. Clean data produces better models than messy data. Most practitioners spend the majority of their time preparing data rather than tuning algorithms. Bad data, bad model.

Doc 6 — "Community Librarian"

The library hired a community librarian last fall. Her job is to meet people where they are — at senior centers, at schools, at the farmers' market. She brings books, sign-up forms, and a little usage data back to the library each week. People seem to like her.

Your doc	data (X)	people (Y)	library (Z)
______	______	______	______

Look around. Who's close to you? Does the geometry match your sense of which documents are about the same thing?

Part 2 — Run some queries

A query is a tiny document — just a bag of words. We place it in the same space and look for the nearest person.

Query	data	people	library	Nearest doc?
Q1: `data data data`	___	___	___	___
Q2: `library people`	___	___	___	___
Q3: `data library`	___	___	___	___
Q4: `people people people people`	___	___	___	___
Q5: `data people library`	___	___	___	___
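The "nearest person" rule is ordinary Euclidean distance, and it can be checked with a short Python sketch. The document coordinates below are an assumption: they were counted by hand from the six paragraphs in Part 1, so recount your own before trusting them. The names `DOCS`, `query_vector`, and `nearest_doc` are invented for this illustration.

```python
import math

AXES = ("data", "people", "library")

# Hand-counted (data, people, library) coordinates for the Part 1 documents.
# Assumption: exact, case-insensitive matches, so "librarian" is not "library".
DOCS = {
    "Doc 1": (0, 1, 3),
    "Doc 2": (5, 3, 0),
    "Doc 3": (5, 1, 3),
    "Doc 4": (0, 6, 0),
    "Doc 5": (6, 0, 0),
    "Doc 6": (1, 2, 2),
}

def query_vector(query):
    """Count each axis word in a query such as 'data data data'."""
    words = query.lower().split()
    return tuple(words.count(axis) for axis in AXES)

def nearest_doc(query):
    """Return (doc name, Euclidean distance) for the closest document."""
    q = query_vector(query)
    name = min(DOCS, key=lambda d: math.dist(q, DOCS[d]))
    return name, math.dist(q, DOCS[name])

print(query_vector("data data data"))  # (3, 0, 0)
print(nearest_doc("data data data"))   # ('Doc 5', 3.0)
```

Try the other four queries yourself before running them through `nearest_doc`.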

Discussion: Q1 is "data data data" and lands at (3, 0, 0), but Doc 5 (ML Intro) is at (6, 0, 0). Both are about exactly the same thing, yet they sit 3 units apart by Euclidean distance. Why? What's wrong with raw counts?

Part 3 — Normalize to unit length

The problem: a document that repeats a word more often sits farther from the origin, so it looks far from a short query even when both point in the same direction. A doc that says "data" six times and a query that says "data" three times should be a perfect match: same direction, different magnitude.

The fix: every document gets the same length of twine. Nobody stretches further from the origin than anyone else — you just keep your angle and walk inward (or outward) until you're at the end of your twine.

||v|| = √(x² + y² + z²) → v̂ = v / ||v||

Now every document sits on the surface of a sphere of radius 1. What matters is the direction you point in, not how far from the origin you stood before.
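The twine rule translates directly into code. The sketch below is a minimal illustration of the formula above; the function name `normalize` is an assumption made for this example, and the two vectors are the Q1 and Doc 5 coordinates given in the Part 2 discussion.

```python
import math

def normalize(v):
    """Scale a count vector to unit length, keeping its direction."""
    length = math.sqrt(sum(c * c for c in v))
    if length == 0:
        # A document with none of the three words has no direction to keep.
        raise ValueError("a vector of all zeros has no direction")
    return tuple(c / length for c in v)

q1 = normalize((3, 0, 0))    # query: data data data
doc5 = normalize((6, 0, 0))  # Doc 5 (ML Intro)
print(q1, doc5)              # (1.0, 0.0, 0.0) (1.0, 0.0, 0.0)
```

Both vectors land on the same point of the unit sphere, which is the "same length of twine" idea in numbers.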

Re-run the queries with normalized vectors. Does Q1 now match Doc 5 exactly?
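To check the re-run, here is a hedged Python sketch. For unit vectors, the smallest Euclidean distance and the largest dot product pick the same neighbor, because ||a - b||² = 2 - 2(a · b), so the sketch ranks documents by the dot product of their unit vectors (cosine similarity). The coordinates in `DOCS` are hand-counted from the Part 1 paragraphs and should be treated as an assumption; `unit`, `cosine`, and `best_match` are names invented for this example.

```python
import math

# Hand-counted (data, people, library) coordinates; recount to confirm.
DOCS = {
    "Doc 1": (0, 1, 3),
    "Doc 2": (5, 3, 0),
    "Doc 3": (5, 1, 3),
    "Doc 4": (0, 6, 0),
    "Doc 5": (6, 0, 0),
    "Doc 6": (1, 2, 2),
}

def unit(v):
    """Return v scaled to length 1 (assumes v is not all zeros)."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def cosine(a, b):
    """Dot product of the two unit vectors: 1.0 means same direction."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def best_match(q):
    """Return the document whose direction is closest to the query's."""
    return max(DOCS, key=lambda d: cosine(q, DOCS[d]))

q1 = (3, 0, 0)  # data data data
print(best_match(q1), cosine(q1, DOCS["Doc 5"]))  # Doc 5 1.0
```

A similarity of 1.0 means Q1 and Doc 5 now occupy the same point. Compare the other queries with your Part 2 answers to see whether any nearest document changes.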