You built a developer tool that does something really cool. You use it every day. The learning curve is steep, but in the course of developing this tool, you have become intimately familiar with the interface, so it's intuitive to you. This is a post about turning that tool into something that can find broad utility among developers, not just N=1 utility for you.
Many of the things that make tools good and useful for humans also make them good and useful for LLMs. It's that Venn diagram intersection that I want to focus on, because that's where LLMs can help you improve your command line tool's UX.
Every tool has a complexity budget. Users will only devote so much time to understanding what a tool does and how before they give up. You want to use that budget on your innovative new features, not on arbitrary or idiosyncratic syntax choices.
Here's the problem: you can't look at your own tool with fresh eyes and analyze it from that perspective. Finding users for a new CLI tool is hard, especially if the user experience isn't polished. But to polish the user experience, you need user feedback, and not just from a small group of power users. Many tools fail to grow past this stage.
What you really need is a vast farm of test users that don't retain memories between runs. Ideally, you would be able to tweak a dial and set their cognitive capacity, press a button and have them drop their short term memory. But you can't have this, because of 'The Geneva Convention' and 'ethics'. Fine.
LLMs get confused easily, but this is exactly what you want: the data about where they get confused is useful signal. Think about it like this: if you can make your tool intuitive and easy to understand for even the cheapest commodity LLMs, that's a strong signal that your users (even the ones who just rolled into the office and haven't had a coffee yet) will be able to understand it.
This is a post about how I did exactly this, how it improved my open source command line file system inspection tool, and what I learned along the way.
A few years ago I built a filesystem exploration tool called detect (link) that uses an expression language to support complex queries. It was mainly a way to show off my implementation of multiphase evaluation: for each potential match candidate, it would evaluate predicates in order from cheap to expensive, starting with file path and ending with full-content search, attempting to short circuit at each stage:
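That cheap-to-expensive ordering can be sketched roughly like this. All names here are illustrative, not detect's actual internals, and for simplicity this evaluates a flat conjunction of predicates rather than a full boolean expression:

```rust
// Illustrative sketch of multiphase evaluation; hypothetical names,
// not detect's real implementation.

#[derive(Clone, Copy, PartialEq)]
enum Phase {
    Path,     // cheapest: the path string is already in memory
    Metadata, // one stat() call
    Contents, // most expensive: read and search the whole file
}

struct Predicate {
    phase: Phase,
    test: fn(&Candidate) -> bool,
}

struct Candidate {
    path: String,
    size: u64,
    contents: String,
}

// Run phases in cost order; bail out the moment any predicate fails,
// so a path-level mismatch never triggers a content read.
fn eval(candidate: &Candidate, predicates: &[Predicate]) -> bool {
    for phase in [Phase::Path, Phase::Metadata, Phase::Contents] {
        for p in predicates.iter().filter(|p| p.phase == phase) {
            if !(p.test)(candidate) {
                return false; // short circuit: skip more expensive phases
            }
        }
    }
    true
}
```
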
It started like this:
`detect 'filename(detect) && executable() || filename(.rs) && contains(map_frame)'`
I then rewrote it using a selector/operator/RHS model, instead of the pseudo-function-call style above.
`detect '@size > 1024 && @name ~= main && (@extension == ts || @extension == js)'`
That's where it stayed for the last year. I used it for small tasks (way more intuitive than `find`), but the syntax was still clunky. Automated LLM UX testing helped me take it from that to:
`detect 'size > 1kb && path.name == main && path.ext in [ts, js]'`
First, tightly coupled testing: I used Claude Code to iterate on the CLI interface (not the core eval engine), with sub-agents testing the tool as it evolved. These agents had full insight into the codebase, and were (by analogy) similar to a human dev testing a tool as they work on it.
Second, I set up a sandbox repo: I cloned a large open source project, wrote a series of test scenarios, and spawned Claude Code instances to enact them. These instances were instructed to use traditional command line tools (find, grep, etc) and detect, and to compare the two.
Here's an example scenario:
Situation: New developers are struggling to understand the codebase.
Your Task:
- Find complex functions/classes (>10KB files) lacking documentation comments
- Identify files with TODO/FIXME/HACK comments older than 3 months
- Find misleading comments in recently modified files
- Check for commented-out code that should be removed (look for // followed by code patterns)
- Locate areas with excessive comments (more than 30% comments might indicate code smell)
- Focus on service and resolver files as they contain business logic
Deliverable: Documentation improvement priority list
And a sample tool interaction:
⏺ Bash(detect 'path.name ~= "(payment|product|customer|order|channel|promotion|shipping)" AND path.name contains "service" AND path.ext == ts' | head -10)
⎿ /Users/inannamalick/dev/detect_bench/vendure/packages/dev-server/example-plugins/product-bundles/services/product-bundle.service.ts
/Users/inannamalick/dev/detect_bench/vendure/packages/dev-server/example-plugins/product-bundles/services/produc
A swarm of LLM agents then ran each scenario and wrote up a report (including UX issues, quality of life improvements, feature requests, etc). I reviewed the UX reports they produced and fed a subset of them back to the Claude Code instance working on the CLI interface.
Here's some feedback I got on a recent run:
Pain Points 🔴
1. Regex Syntax Inconsistencies
# This failed - unclear escaping rules
detect 'path.name ~= \.(service|resolver)$ AND path.ext == ts'
# This worked - but why quotes around the whole regex?
detect 'path.name ~= "(service|resolver)" AND path.ext == ts'
Fix: Clearer regex escaping documentation with more examples.
I would frequently reset the memory of these automated UX testers, or downgrade them from Opus (general purpose, more expensive) to Sonnet (more limited, much cheaper) to see how they did at different capability levels.
Initially, they produced a wealth of actionable UX/bug reports. I've sorted them into categories as follows:
Parser flexibility
- adding aliases like modified for mtime, ext for extension, etc
- parser support for floating-point file sizes when paired with suffixes (0.5mb, 0.25kb, etc)
- parser support for '0.5kb', '5Kb', '5 KB' instead of just '5kb'
- support AND/OR/NOT in addition to '&&/||/!'
- removing the `@` prefix on selectors, to go from `@size` to just `size` (this one was obvious, but I had spent too long using it to notice the friction)
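Several of those parser requests (floating-point sizes, flexible suffix casing and spacing) boil down to one small parsing function. Here's a rough sketch of the idea in Rust; this is hypothetical, not detect's actual parser:

```rust
// Hypothetical size-literal parser: accepts "5kb", "0.5mb", "5Kb",
// "5 KB", etc. Not detect's real implementation.
fn parse_size(input: &str) -> Option<u64> {
    let s = input.trim().to_ascii_lowercase();
    // Split the numeric prefix from the unit suffix, tolerating whitespace.
    let idx = s
        .find(|c: char| !(c.is_ascii_digit() || c == '.'))
        .unwrap_or(s.len());
    let (num, unit) = s.split_at(idx);
    let value: f64 = num.parse().ok()?;
    let multiplier = match unit.trim() {
        "" | "b" => 1.0,
        "kb" => 1024.0,
        "mb" => 1024.0 * 1024.0,
        "gb" => 1024.0 * 1024.0 * 1024.0,
        _ => return None, // unknown suffix: reject rather than guess
    };
    Some((value * multiplier) as u64)
}
```

Normalizing case and whitespace up front is what lets `5Kb` and `5 KB` fall out of the same code path as `5kb`.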
New features
- glob support (agents kept trying to use globs and getting confused)
- fallback to PCRE2 if the builtin rust regex parser failed
- 'in set' operators, eg `path.ext in [js, ts]`
- hierarchical path selectors for specific components: `path.full`, `path.name`, `path.ext`, etc
Bugs
- Handle SIGPIPE gracefully instead of panicking
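The SIGPIPE panic is a classic Rust CLI pitfall: the Rust runtime ignores SIGPIPE by default, so when a downstream reader like `head` closes the pipe early, writes to stdout fail with a `BrokenPipe` error and `println!` panics. One common fix (a sketch, not necessarily how detect handles it) is to write through an explicit handle and treat a broken pipe as a normal early exit:

```rust
use std::io::{self, Write};
use std::process::exit;

// Write one result line; treat a closed downstream pipe (`| head`, etc.)
// as a clean early exit instead of a panic.
fn emit(line: &str) -> io::Result<()> {
    let mut out = io::stdout().lock();
    match writeln!(out, "{line}") {
        Err(e) if e.kind() == io::ErrorKind::BrokenPipe => exit(0),
        other => other,
    }
}
```
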
Eventually, the UX bug reports started moving towards not-quite-feasible feature requests:
- complex correlation between files in queries
- selectors specific to individual programming language features (eg imports)
- semantic features that would require the detect tool to embed LLM support
That's when I knew I was done. Each change was small, but the overall result was a noticeably improved experience.
Why this matters: it was cheap. I ran it in the background, using a small subset of a Claude Max plan. It took my 'detect' tool from a prototype to something I'll happily suggest others use.
You can do the same. This method applies to any CLI tool, but is especially useful for CLI tools with complex syntax.