Grooped · Roie Shalom

The design problem

The hardest part isn't the tech

Word puzzles sound simple. Four groups of four words. But the design space is deceptively deep.

The real challenge is intentional misdirection. Take IRON, PRESS, CURL, and BENCH: all things you do at the gym. But PRESS also fits a medieval weapons category. IRON fits things you can flip. CURL fits hair styling. That web of almost-right answers is what makes a puzzle feel clever instead of arbitrary. Designing it is the closest thing I've found to level design.

Every word on the board has to earn its place. Not just as a correct answer, but as a convincing wrong one.

How it's made

An AI pipeline with a human editor at the center

Every puzzle starts with a prompt. I built an internal tool called Puzzle Editor 3000 that generates a full puzzle: four categories, four words each, plus intentional decoys designed to blur the lines between groups.

The generator doesn't just make puzzles. It tracks 28 connection mechanics across four tiers, each with a cooldown period so the same trick never runs two days in a row. A fill-in-blank category can appear weekly. A first-letter acrostic is held back to about once every six weeks. The system reads the last 60 published puzzles before generating a new one, checks which mechanics are underused, and steers toward variety automatically.

When I export a puzzle, it's appended to the live file on GitHub. The next generation reads that file first. The system has memory.

The AI generates; I review every puzzle before it ships and regenerate the parts that don't work for me. The system learns from what I keep.
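The export-then-read loop can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: it assumes the history is a JSON list stored in a local file (the file name `puzzle_history.json` and both helper names are hypothetical), and it leaves out the push to GitHub.

```python
import json
from pathlib import Path

# Hypothetical file name; the real history format is not shown in this write-up.
HISTORY_PATH = Path("puzzle_history.json")


def load_history(path=HISTORY_PATH):
    """Return the list of published puzzles, oldest first (empty if no file yet)."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def export_puzzle(puzzle, path=HISTORY_PATH):
    """Append one approved puzzle to the history file and return the new history."""
    history = load_history(path)
    history.append(puzzle)
    path.write_text(json.dumps(history, indent=2), encoding="utf-8")
    return history
```

Because the generator would call something like `load_history()` before every run, each exported puzzle becomes part of the context for the next one.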

Connection mechanic tiers

Tier 1

Workhorses

cooldown: 4 puzzles

Taxonomy
Found in scene
Prefix blank
Suffix blank
Synonyms

Tier 2

Regulars

cooldown: 7 puzzles

Things that verb
Can be verbed
Shared hidden property
Metaphor substitutes
Ways to verb
Idiom completion
Ordered set member
Works by one maker
Characters in one work
Facets of a named subject

Tier 3

Specials

cooldown: 21 puzzles

Hidden word
Homophones
Compound
Add/Drop letter
Eponyms
Cross language
Abbreviation expansion

Tier 4

Treats

cooldown: 45 puzzles

Anagram of one source
Acrostic first letters
Chain through hub
Portmanteau
Onomatopoeia
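One way to read the tier table as code is shown below. The cooldown numbers come from the table; everything else is an assumption for illustration: the record shape (`{"mechanics": [...]}`), the function names, and the interpretation that a cooldown of N means the mechanic must be absent from the last N puzzles. Only a handful of mechanics are mapped, to keep the sketch short.

```python
# Cooldowns per tier, taken from the table above (measured in puzzles).
TIER_COOLDOWN = {1: 4, 2: 7, 3: 21, 4: 45}

# A few mechanics from the table mapped to their tier (not the full set).
MECHANIC_TIER = {
    "Taxonomy": 1,
    "Prefix blank": 1,
    "Idiom completion": 2,
    "Homophones": 3,
    "Acrostic first letters": 4,
}


def puzzles_since_last_use(mechanic, history):
    """How many puzzles ago the mechanic last appeared (1 = most recent), or None."""
    for age, puzzle in enumerate(reversed(history), start=1):
        if mechanic in puzzle["mechanics"]:
            return age
    return None


def is_available(mechanic, history):
    """True when the mechanic has been absent for at least its tier's cooldown."""
    age = puzzles_since_last_use(mechanic, history)
    cooldown = TIER_COOLDOWN[MECHANIC_TIER[mechanic]]
    return age is None or age > cooldown
```

Under this reading, a Tier 1 mechanic used five puzzles ago is available again, while a Tier 3 mechanic used at the same time is still resting.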

System prompt

The quality of the output lives or dies on the prompt. Writing it was a design problem: precise enough to produce consistent results, flexible enough to surprise me. This is the actual system prompt the generator uses, loaded live from the source code.

Generation pipeline

Roie

AI

Generate triggered

Fetch live puzzle history from GitHub

Scan last 21 puzzles, compute mechanic cooldowns

Identify underused Tier 2 and Tier 3 mechanics

Pick spine mechanic, prefer underused, respect cooldowns

Generate puzzle with cross-pull decoys

Verify hidden words letter-by-letter

Check all 16 words, duplicates and 60-day repeat rule

Inject mechanic + tier into each category

Strip scratchpad fields, save clean draft

Editor review, regenerate weak categories, edit words

Export triggered

Append to puzzle history, push to GitHub

Puzzle is live
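The verification steps in the pipeline above could look roughly like this. It is a sketch under stated assumptions, not the generator's real code: the function names and the `categories` shape are hypothetical, "hidden word" is taken to mean a contiguous run of letters inside a longer word, and the repeat rule is modeled as a set of words already used inside the repeat window.

```python
def contains_hidden_word(word, hidden):
    """Letter-by-letter check that `hidden` appears as a contiguous run in `word`."""
    word, hidden = word.upper(), hidden.upper()
    for start in range(len(word) - len(hidden) + 1):
        if all(word[start + i] == ch for i, ch in enumerate(hidden)):
            return True
    return False


def validate_board(categories, recent_words):
    """Return a list of problems; an empty list means the draft passes.

    categories: dict of category name -> list of words (hypothetical shape).
    recent_words: set of upper-case words used inside the repeat window.
    """
    problems = []
    words = [w.upper() for group in categories.values() for w in group]
    if len(categories) != 4 or any(len(g) != 4 for g in categories.values()):
        problems.append("expected 4 categories of 4 words")
    duplicates = sorted({w for w in words if words.count(w) > 1})
    if duplicates:
        problems.append(f"duplicate words: {duplicates}")
    repeats = sorted(set(words) & recent_words)
    if repeats:
        problems.append(f"recently used words: {repeats}")
    return problems
```

Checking the hidden word character by character in plain code, rather than trusting the model's claim, guards against a category where one word only looks like it contains the target.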

puzzle_generator.py

The tool

Puzzle Editor 3000

The editor is the interface between AI output and human judgment. Generate a full puzzle, regenerate individual categories, swap words, rename groups, or type your own category and let the AI fill in the words. Every category shows which connection mechanic the model used and which tier it belongs to, so I can see at a glance whether the puzzle is mechanically varied or just four versions of the same trick.

The dashboard at the top tracks mechanic usage across the last 21 puzzles. If Tier 2 is dominating, the next generation will steer toward Tier 3 automatically. I can see what's been overused and nudge the system before it gets repetitive.
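The tally behind a dashboard like this can be approximated as follows. The 21-puzzle window and the Tier 2 versus Tier 3 steering come from the text; the record shape (`{"tiers": [...]}`, one tier number per category) and the function names are assumptions made for the example.

```python
from collections import Counter


def tier_usage(history, window=21):
    """Count how often each tier appears in the most recent `window` puzzles."""
    counts = Counter({tier: 0 for tier in (1, 2, 3, 4)})
    for puzzle in history[-window:]:
        counts.update(puzzle["tiers"])
    return counts


def least_used_tier(history, candidates=(2, 3), window=21):
    """Pick the candidate tier with the fewest recent uses (ties go to the lower tier)."""
    counts = tier_usage(history, window)
    return min(candidates, key=lambda tier: (counts[tier], tier))
```

If Tier 2 has been dominating the window, `least_used_tier(history)` returns 3, which is the kind of signal the next generation could use to steer toward variety.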

Puzzle Editor 3000 showing four categories with refresh and ban controls

Puzzle Editor 3000: generate, curate, and track puzzle mechanics in one place.

The game

What the player sees

None of the pipeline is visible to the player. They get 16 words and four attempts. The mechanic, the tier, the cooldowns. All invisible. The only thing that matters is whether the puzzle feels satisfying to solve.

1

16 words, 4 groups

A fresh puzzle. The grid shows 16 words with no hints about what connects them.

2

Spot a pattern

Four words selected, ready to submit. Is it the right group?

3

First group found

A correct guess collapses into a color block. Twelve words remain.

4

Puzzle complete

All four groups revealed. Come back tomorrow for a new one.

Iteration with data

Session recordings over surveys

I connected Microsoft Clarity to track how people actually play. People play fast and lose patience faster. A couple of wrong guesses and some players just leave. That insight pushed me to make puzzles more solvable.

"I believe in seeing what people do, not asking what they would do."

Microsoft Clarity session recording: tracking real player behavior.

A daily word puzzle, built and curated with AI
