Principles
- Keep token usage low.
- Make it easy for a weak summarizer LLM to 1) spot problems and 2) find clues for diagnosis.
- Include timing information.
Traits of a good log
- Put a single-line SHOULD statement inline in the log. It makes clear how the output should look, separates a healthy run from a subtle failure, and gives a principled clue for diagnosis.
- Keep tables short, with the longest and least important columns last, so humans can still read them when lines wrap: short numeric columns first, long text columns (notes, descriptions) last.
- Use tabulate with a plain, token-efficient format (tsv here). Don't log every step or epoch; print one table.
  ```py
  from tabulate import tabulate

  # One table per sweep instead of one log line per step or epoch.
  print(f"\n=== Sweep: {FAMILY_TITLES[family]} ===")
  print(tabulate(
      table_rows,
      headers=["model", "dataset", "condition", "seeds", "|SR|", "hgap(low)", "hgap(0)", "hgap(high)"],
      tablefmt="tsv",
      floatfmt="+.2f",
  ))
  ```
- Use tqdm with mininterval=60 to record times without polluting the log.
  ```py
  from tqdm import tqdm

  # mininterval=60 refreshes the progress line at most once a minute.
  for task_idx, task in enumerate(tqdm(tasks, desc=f"{model_name} {cot_label}", mininterval=60)):
      ...
  ```
- Give each major stage a header that includes a timestamp.
- Avoid escaping issues: for example, don't write `|dS| mean`; write `abs(dS)` or similar instead.
- Use loguru with plain messages and no colors, and route output through tqdm.write.
  ```py
  import os as _os

  from loguru import logger
  from tqdm import tqdm

  logger.remove()  # drop the default colored stderr sink
  # TODO: change to a config option; env vars are not very trackable
  _LOG_LEVEL = _os.environ.get("SSTEER_LOG_LEVEL", "INFO")
  logger.add(lambda x: tqdm.write(x, end=""), level=_LOG_LEVEL, colorize=False, format="{message}")
  ```
- Guard against false positives: if the log contains things that might trigger LLM nannies, such as ending a process or traces from red teaming, you may need to add context that explains them.
- Because logs are often read with tail, make the last 30 lines carry the most important context: main metric, argv or delta(config), main diagnostics, time, commit/branch, output dir, wandb, etc.
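The stage-header and column-ordering advice in the list above can be sketched with the standard library alone. This is a minimal sketch, not the project's code: the helper names `stage_header`, `order_columns`, and `to_tsv` are hypothetical, and ranking columns by their widest cell is one assumed way to put long text columns last.

```python
from datetime import datetime
from typing import Dict, List, Optional

# Hypothetical helpers for illustration; they are not part of any library.


def stage_header(name: str, now: Optional[datetime] = None) -> str:
    """Return a one-line stage header that carries a timestamp."""
    now = now or datetime.now()
    return f"=== {name} === {now.isoformat(timespec='seconds')}"


def order_columns(rows: List[Dict[str, object]]) -> List[str]:
    """Order column names by their widest cell (or header), shortest first.

    Short numeric columns land on the left; long notes and paths go last,
    so a wrapped line still shows the important values.
    """
    names = list(rows[0])
    width = {n: max([len(n)] + [len(str(r[n])) for r in rows]) for n in names}
    # sorted() is stable, so equally wide columns keep their original order.
    return sorted(names, key=lambda n: width[n])


def to_tsv(rows: List[Dict[str, object]]) -> str:
    """Render rows as tab-separated text with a single header line."""
    cols = order_columns(rows)
    lines = ["\t".join(cols)]
    lines += ["\t".join(str(r[c]) for c in cols) for r in rows]
    return "\n".join(lines)
```

A run would print `stage_header("eval")` once when a stage starts and `to_tsv(rows)` once when it ends, rather than a line per step.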
Example of a good log (though it should use tabulate's tsv format)
coeff    logratio    pmass     passes    note
-------  ----------  --------  --------  -------------------------
0        13.547      1         ✓
0.0001   13.547      1         ✓
0.001    12.641      1         ✓
0.01     11.109      1         ✓
0.02     13.625      1         ✓
0.05     13.297      1         ✓
0.1      10.844      1         ✓
0.2      8.188       1         ✓
0.2375   5.891       1         ✓         <-- selected
0.275    5.635       0.949219  .         <-- breakdown pmass<floor
SHOULD: logratio should be monotonic until breakdown; selection should find where pmass breaks down and pick the coeff just before it; coeff=0 should have ~perfect pmass
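A SHOULD line is more useful when the log also reports whether each expectation held. The sketch below is one way to do that, under stated assumptions: the function names are hypothetical, "monotonic" is taken to mean never increasing or never decreasing, and the pmass floor is a parameter because the log does not state its value.

```python
from typing import List, Optional, Tuple

Row = Tuple[float, float, float]  # (coeff, logratio, pmass)

# Hypothetical helpers for illustration; the floor value is an assumption.


def select_before_breakdown(rows: List[Row], floor: float) -> Optional[int]:
    """Index of the last row before pmass first drops below floor, else None."""
    for i, (_, _, pmass) in enumerate(rows):
        if pmass < floor:
            return i - 1 if i > 0 else None
    return None  # no breakdown found


def is_monotonic(values: List[float]) -> bool:
    """True if values never increase or never decrease."""
    pairs = list(zip(values, values[1:]))
    return all(a >= b for a, b in pairs) or all(a <= b for a, b in pairs)


def should_line(rows: List[Row], floor: float) -> str:
    """One-line SHOULD statement with the observed result of each check."""
    sel = select_before_breakdown(rows, floor)
    upto = rows if sel is None else rows[: sel + 1]
    mono = "ok" if is_monotonic([lr for _, lr, _ in upto]) else "NOT monotonic"
    picked = "none" if sel is None else f"coeff={rows[sel][0]}"
    zero = "ok" if rows[0][0] == 0 and rows[0][2] >= floor else "FAIL"
    return (
        f"SHOULD: logratio monotonic until breakdown (got {mono}); "
        f"select just before pmass breakdown (got {picked}); "
        f"coeff=0 has ~perfect pmass (got {zero})"
    )
```

Run on the rows of the table above with an assumed floor of 0.95, it selects coeff=0.2375 and reports the logratio as not monotonic, because 11.109 at coeff=0.01 is followed by 13.625 at coeff=0.02. That is the kind of subtle issue the SHOULD line is meant to surface.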
---
Example of a good final 40 lines (note that it has the output file, input args, main metric, and a result table with the important, short columns first)
out: ./outputs/20260426T015439_ssteer_v2_exp_mean_38a4_eval.jsonl
argv: eval_logratio.py --quick --model-name Qwen/Qwen3-0.6B --extraction ssteer_v2 --seed 42 --n-train-steps 5
main metric: abs_sr=6.867 [flags=quick,tasks=1/75]
cue  abs_sr  h_low  h_0   h_high  C_min/max    pmass_min  seed  n  commit   model       method              flags             run                                      out
🟢   6.867   0.758  5.75  7.625   -0.50/+0.28  0.93       42    1  773f4d5  Qwen3-0.6B  ssteer_v2/exp/mean  quick,tasks=1/75  20260426T015439_ssteer_v2_exp_mean_38a4  ./outputs/20260426T015439_ssteer_v2_exp_mean_38a4_eval.jsonl
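A tail block like the one above can be assembled in one place, so it is always the last thing printed. This is a minimal sketch under stated assumptions: `tail_summary` is a hypothetical helper, the three-decimal metric format is inferred from the example, and the result row is emitted as plain tab-separated text.

```python
from typing import Dict, Sequence

# Hypothetical helper for illustration; field names follow the example above.


def tail_summary(
    out: str,
    argv: Sequence[str],
    metric: str,
    value: float,
    flags: Sequence[str],
    row: Dict[str, object],
) -> str:
    """Build the final log lines: output path, argv, main metric, result row."""
    lines = [
        f"out: {out}",
        f"argv: {' '.join(argv)}",
        f"main metric: {metric}={value:.3f} [flags={','.join(flags)}]",
        # Header and values as tab-separated text; callers pass the row dict
        # with short columns first and long ones (run id, paths) last.
        "\t".join(row),
        "\t".join(str(v) for v in row.values()),
    ]
    return "\n".join(lines)
```

Calling `print(tail_summary(...))` with `sys.argv` as the last statement of a run keeps these lines inside whatever `tail` shows.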