Garrett Mooney GarrettMooney

Generating Synthetic Data for LLM Evaluation

Summary

Use your application extensively to build intuition about failure modes
Define 3-4 dimensions based on observed or anticipated failures
Create structured tuples covering your priority failure scenarios
Generate natural language queries from each tuple using a separate LLM call
Scale to more examples across your most important failure hypotheses (we suggest at least ~100)
Test and iterate on the most critical failure modes first, and generate more until you reach theoretical saturation
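
The steps above can be sketched in Python as follows. This is a minimal illustration and not the author's implementation: the dimension names and values, the prompt wording, and the helpers `build_tuples`, `tuple_to_prompt`, `generate_queries`, and `fake_llm` are all hypothetical. `fake_llm` stands in for a real model call so the sketch runs offline.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional

# Hypothetical dimensions and values; replace them with ones drawn from the
# failures you actually observed while using your application.
DIMENSIONS: Dict[str, List[str]] = {
    "feature": ["order tracking", "refund request", "product search"],
    "persona": ["first-time user", "frustrated repeat customer"],
    "scenario": ["ambiguous request", "missing required detail", "out-of-scope request"],
}


def build_tuples(
    dimensions: Dict[str, List[str]], limit: Optional[int] = None, seed: int = 0
) -> List[Dict[str, str]]:
    """Return one dict per combination of dimension values, optionally sampled."""
    names = list(dimensions)
    combos = [
        dict(zip(names, values))
        for values in itertools.product(*dimensions.values())
    ]
    if limit is not None and limit < len(combos):
        # Sample reproducibly when the full cross product is too large.
        combos = random.Random(seed).sample(combos, limit)
    return combos


def tuple_to_prompt(combo: Dict[str, str]) -> str:
    """Build the prompt for the separate LLM call that writes one query."""
    details = "\n".join(f"- {name}: {value}" for name, value in combo.items())
    return (
        "Write one realistic user message for the application under test.\n"
        "It should match these characteristics:\n"
        f"{details}\n"
        "Return only the user message."
    )


def generate_queries(
    combos: List[Dict[str, str]], call_llm: Callable[[str], str]
) -> List[Dict[str, object]]:
    """Call the LLM once per tuple and keep each tuple next to its query."""
    return [
        {"tuple": combo, "query": call_llm(tuple_to_prompt(combo))}
        for combo in combos
    ]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (an assumption, not a real API)."""
    return f"[generated query for a prompt of {len(prompt)} characters]"


tuples = build_tuples(DIMENSIONS)
dataset = generate_queries(tuples, fake_llm)
print(len(dataset))
```

Keeping each tuple next to its generated query makes it possible to group failures by dimension later, which helps when deciding which failure hypotheses need more examples.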

See rune2e.sh for info on how to run the experiment.

Anti-hype LLM reading list

Goals: Add links that give reasonable, clear explanations of how things work. No hype and, where possible, no vendor content. Practical first-hand accounts of models in production are eagerly sought.

Foundational Concepts

Pre-Transformer Models

Program Analysis Resources

(draft; work in progress)

	# make sure you have `tac` [1] (if on macOS) and `atuin` [2] installed, then drop the below in your ~/.zshrc
	#
	# [1]: https://unix.stackexchange.com/questions/114041/how-can-i-get-the-tac-command-on-os-x
	# [2]: https://github.com/ellie/atuin

	atuin-setup() {
	    ! hash atuin && return
	    bindkey '^E' _atuin_search_widget

	    export ATUIN_NOBIND="true"

	gpu_info = !nvidia-smi
	gpu_info = '\n'.join(gpu_info)
	if gpu_info.find('failed') >= 0:
	    print('Not connected to a GPU')
	else:
	    print(gpu_info)

	import torch
	import torch.utils.dlpack
	import jax
	import jax.dlpack

	# A generic mechanism for turning a JAX function into a PyTorch function.

	def j2t(x_jax):
	    # Convert a JAX array to a PyTorch tensor through the DLPack protocol.
	    x_torch = torch.utils.dlpack.from_dlpack(jax.dlpack.to_dlpack(x_jax))
	    return x_torch

	library(tidyverse)
	library(tictoc)
	library(arrow)

	tic()
	con <- DBI::dbConnect(duckdb::duckdb(), "data/pbp_db.duckdb")

	nfl_pbp <- tbl(con, "nflfastR_pbp")
	toc()