Week 2 · In-class exercise (LSA)

Latent Semantic Analysis: Re-plotting in Topic Space

Goal: see what changes when our axes are no longer raw words but latent topics. Synonyms collapse onto the same axis. Documents that share meaning — but not vocabulary — finally end up close to each other.

The setup

Same room, same grid, same twine. But the axes are different.

  • X — Topic A: information / records   (words like data, dataset, records, numbers, statistics)
  • Y — Topic B: community / patrons   (words like people, folks, community, residents, patrons)
  • Z — Topic C: institution / library   (words like library, archive, stacks, branch, collection)

A topic isn't a single word — it's a cluster of words that tend to mean the same kind of thing. LSA discovers these clusters automatically from a large corpus by noticing which words keep showing up in the same documents. We'll skip the math and pretend it's already been done.
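For anyone curious about the step we are skipping, here is a minimal sketch of the idea, assuming NumPy is available. The four-document toy corpus is hypothetical (it is not part of the exercise), and k = 2 is chosen only because the toy corpus has two themes. The sketch builds a word-by-document count matrix, keeps the strongest latent dimensions of its singular value decomposition, and compares two synonyms that never appear in the same document.

```python
import numpy as np

# Hypothetical toy corpus (not from the exercise): two themes, and within each
# theme a pair of synonyms that never appear in the same document.
docs = [
    "data census numbers",
    "records census numbers",
    "people town meeting",
    "residents town meeting",
]

# Term-document count matrix: one row per word, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
X = np.array(
    [[doc.split().count(word) for doc in docs] for word in vocab],
    dtype=float,
)

# Truncated SVD: keep only the k strongest latent dimensions ("topics").
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
term_vectors = U[:, :k] * s[:k]  # each row places one word in topic space


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def raw(word):
    """The word's row of raw counts across documents."""
    return X[vocab.index(word)]


def latent(word):
    """The word's position in the k-dimensional topic space."""
    return term_vectors[vocab.index(word)]


print("raw    data~records:", round(cosine(raw("data"), raw("records")), 3))
print("latent data~records:", round(cosine(latent("data"), latent("records")), 3))
print("latent data~people: ", round(cosine(latent("data"), latent("people")), 3))
```

In raw counts, "data" and "records" have a cosine similarity of 0 because they never share a document. In the two-dimensional latent space their similarity is close to 1, because both keep company with "census" and "numbers", while "data" and "people" stay unrelated.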

Each person gets one document. Your doc has been pre-scored on each topic (0–3). Walk to your (A, B) point on the floor grid, then raise your twine to the height of your Topic C value.

Part 1 — Place yourself

The same six documents from the previous exercise, plus three new ones. The new ones use different vocabulary for the same ideas. Notice that the raw-word exercise would have placed them at the origin or near it — but in topic space, they land somewhere meaningful.

Document      Title                      A: information   B: community   C: institution
Doc 1         Reading Room               0                1              3
Doc 2         Census Report              3                2              0
Doc 3         Digital Library Project    3                1              3
Doc 4         The Protest                0                3              0
Doc 5         ML Intro                   3                0              0
Doc 6         Community Librarian        1                2              2
Doc 7 (NEW)   "The Archive"              0                1              3
Doc 8 (NEW)   "Town Survey"              3                2              0
Doc 9 (NEW)   "Branch Open House"        1                2              2
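If you want to check your position against everyone else's without the twine, the short sketch below measures straight-line (Euclidean) distance in topic space. The coordinates are copied from the table above; the helper name nearest_original is made up for this illustration.

```python
from math import dist

# Topic-space coordinates (A: information, B: community, C: institution),
# copied from the table above.
scores = {
    "Doc 1": (0, 1, 3),
    "Doc 2": (3, 2, 0),
    "Doc 3": (3, 1, 3),
    "Doc 4": (0, 3, 0),
    "Doc 5": (3, 0, 0),
    "Doc 6": (1, 2, 2),
    "Doc 7": (0, 1, 3),
    "Doc 8": (3, 2, 0),
    "Doc 9": (1, 2, 2),
}

originals = ["Doc 1", "Doc 2", "Doc 3", "Doc 4", "Doc 5", "Doc 6"]


def nearest_original(name):
    """Return (closest original document, Euclidean distance to it)."""
    return min(
        ((other, dist(scores[name], scores[other])) for other in originals),
        key=lambda pair: pair[1],
    )


for new_doc in ("Doc 7", "Doc 8", "Doc 9"):
    other, d = nearest_original(new_doc)
    print(f"{new_doc} is closest to {other} (distance {d:.2f})")

# Doc 7 is closest to Doc 1 (distance 0.00)
# Doc 8 is closest to Doc 2 (distance 0.00)
# Doc 9 is closest to Doc 6 (distance 0.00)
```

A distance of 0 means two people are standing on exactly the same spot with their twine at the same height.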

The new documents

Doc 7 — "The Archive"

The old archive in the basement holds bound newspapers from the 1890s. A few patrons sit at the wooden tables turning brittle pages. The stacks smell of dust and varnish.

Doc 8 — "Town Survey"

The town survey collected records from every household this fall. Numbers from the survey suggest that residents are aging in place. Statistics will be released next month after researchers finish cross-checking the survey responses.

Doc 9 — "Branch Open House"

The new branch opened its doors on Saturday. Folks from the neighborhood toured the collection, met staff, and signed up for cards. Early sign-up numbers from the open house look promising.
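To see why raw word counts leave these three documents stranded, the sketch below counts exact matches for three axis words. It assumes the previous exercise used "data", "people", and "library" as its axes; that is inferred from the discussion questions, not stated in this handout. The function name raw_position is made up for this illustration.

```python
import re
from collections import Counter

# Assumed raw-word axes from the previous exercise (an inference, not a given).
AXES = ("data", "people", "library")

# Text of the three new documents, copied from this handout.
new_docs = {
    "Doc 7": (
        "The old archive in the basement holds bound newspapers from the "
        "1890s. A few patrons sit at the wooden tables turning brittle "
        "pages. The stacks smell of dust and varnish."
    ),
    "Doc 8": (
        "The town survey collected records from every household this fall. "
        "Numbers from the survey suggest that residents are aging in place. "
        "Statistics will be released next month after researchers finish "
        "cross-checking the survey responses."
    ),
    "Doc 9": (
        "The new branch opened its doors on Saturday. Folks from the "
        "neighborhood toured the collection, met staff, and signed up for "
        "cards. Early sign-up numbers from the open house look promising."
    ),
}


def raw_position(text):
    """Count exact occurrences of each axis word, ignoring case."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return tuple(counts[word] for word in AXES)


for name, text in new_docs.items():
    print(name, raw_position(text))

# Doc 7 (0, 0, 0)
# Doc 8 (0, 0, 0)
# Doc 9 (0, 0, 0)
```

Under that assumption all three documents sit at the origin in raw-word space, even though each is clearly about archives, surveys, or a library branch.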

Part 2 — Discussion

Doc 7 ("The Archive") and Doc 1 ("Reading Room") use almost no overlapping vocabulary: Doc 7 doesn't contain the words "library" or "people" at all. In the raw-word exercise, Doc 7 would have been at the origin. Now it's standing on top of Doc 1. Why?

Doc 8 ("Town Survey") never says "data" or "people." But it lands on exactly the same point as Doc 2 ("Census Report"): (3, 2, 0). What did the topic axes do for us that raw word counts couldn't?

Doc 9 ("Branch Open House") never says "library" or "people" either, yet it shares Doc 6's coordinates: (1, 2, 2). This is the synonymy problem, solved: different words for the same idea now count toward the same axis.

What does it cost us? Where does this go wrong? (Hints: who decided what the topics are? what if a word means two different things?)