Latent Semantic Analysis: Re-plotting in Topic Space
Goal: see what changes when our axes are no longer raw words but latent topics. Synonyms collapse onto the same axis. Documents that share meaning — but not vocabulary — finally end up close to each other.
The setup
Same room, same grid, same twine. But the axes are different.
- X — Topic A: information / records (words like
data,dataset,records,numbers,statistics) - Y — Topic B: community / patrons (words like
people,folks,community,residents,patrons) - Z — Topic C: institution / library (words like
library,archive,stacks,branch,collection)
A topic isn't a single word — it's a cluster of words that tend to mean the same kind of thing. LSA discovers these clusters automatically from a large corpus by noticing which words keep showing up in the same documents. We'll skip the math and pretend it's already been done.
Each person gets one document. Your doc has been pre-scored on each topic (0–3). Walk to your point. Raise your twine to your Topic C value.
Part 1 — Place yourself
The same six documents from the previous exercise, plus three new ones. The new ones use different vocabulary for the same ideas. Notice that the raw-word exercise would have placed them at the origin or near it — but in topic space, they land somewhere meaningful.
| Document | A: information | B: community | C: institution |
|---|---|---|---|
| Doc 1 — Reading Room | 0 | 1 | 3 |
| Doc 2 — Census Report | 3 | 2 | 0 |
| Doc 3 — Digital Library Project | 3 | 1 | 3 |
| Doc 4 — The Protest | 0 | 3 | 0 |
| Doc 5 — ML Intro | 3 | 0 | 0 |
| Doc 6 — Community Librarian | 1 | 2 | 2 |
| Doc 7 (NEW) — "The Archive" | 0 | 1 | 3 |
| Doc 8 (NEW) — "Town Survey" | 3 | 2 | 0 |
| Doc 9 (NEW) — "Branch Open House" | 1 | 2 | 2 |
The new documents
Doc 7 — "The Archive"
The old archive in the basement holds bound newspapers from the 1890s. A few patrons sit at the wooden tables turning brittle pages. The stacks smell of dust and varnish.
Doc 8 — "Town Survey"
The town survey collected records from every household this fall. Numbers from the survey suggest that residents are aging in place. Statistics will be released next month after researchers finish cross-checking the survey responses.
Doc 9 — "Branch Open House"
The new branch opened its doors on Saturday. Folks from the neighborhood toured the collection, met staff, and signed up for cards. Early sign-up numbers from the open house look promising.
Part 2 — Discussion
Doc 7 ("The Archive") and Doc 1 ("Reading Room") use almost no overlapping vocabulary — Doc 7 doesn't contain the word "library" or "people" at all. In the raw-word exercise, Doc 7 would have been at the origin. Now it's standing on top of Doc 1. Why?
Doc 8 ("Town Survey") never says "data" or "people." But it lands next to Doc 2 ("Census Report"). What did the topic axes do for us that raw word counts couldn't?
Doc 9 ("Branch Open House") never says "library" or "people" either, yet it ends up right next to Doc 6. This is the synonymy problem solved.
What does it cost us? Where does this go wrong? (Hints: who decided what the topics are? what if a word means two different things?)