Eric Hartford (ehartford) · GitHub Gists
CAD-B: Confidence-Aware Decision Benchmark

Universal evaluation of uncertainty-guided adaptive behavior via prompting + logprobs


Overview

CAD-B tests whether LLMs exhibit prospective uncertainty monitoring and adaptive decision-making using only standard text generation and logprob extraction; no custom interfaces are required. It is based on comparative cognition paradigms (Smith et al., 2003; Hampton, 2001; Kornell et al., 2007).
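
As a concrete illustration, here is a minimal sketch (not the benchmark's actual harness) of probing uncertainty-guided behavior with logprobs alone: offer the model an explicit opt-out option and read the probability mass on each option token. The endpoint, model name, and question are assumptions for illustration.

import math
from openai import OpenAI

# Assumption: an OpenAI-compatible server (e.g., vLLM) at this address.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

PROMPT = (
    "Is the following statement true?\n"
    "'The 1,000th prime is 7919.'\n"
    "Reply with exactly one letter:\n"
    "A) True\n"
    "B) False\n"
    "C) Decline (small guaranteed reward)\n"
)

resp = client.chat.completions.create(
    model="my-model",  # assumption: whatever the server hosts
    messages=[{"role": "user", "content": PROMPT}],
    max_tokens=1,
    temperature=0.0,
    logprobs=True,
    top_logprobs=5,
)

# Distribution over the first answer token.
top = resp.choices[0].logprobs.content[0].top_logprobs
probs = {t.token.strip(): math.exp(t.logprob) for t in top}
print(probs)

# An uncertainty-monitoring model should route mass to the opt-out
# option "C" precisely on trials where "A" and "B" are near-equiprobable.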

#!/usr/bin/env python3
"""
OpenAI API benchmark script that replicates llama-bench behavior exactly.
Uses random tokens for both prompt and generation, no sampling.
Works with OpenAI-compatible endpoints like vLLM.
"""
import time
import numpy as np
import argparse
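
The preview ends at the imports. A minimal sketch of the measurement loop the docstring describes, under the assumption that the server is vLLM, whose completions endpoint accepts raw token IDs as the prompt (mirroring llama-bench's random-token inputs) and supports the ignore_eos extension:

import time
import random
import requests
import numpy as np

def bench(url="http://localhost:8000/v1/completions", model="my-model",
          n_prompt=512, n_gen=128, reps=5):
    rates = []
    for _ in range(reps):
        # Random token IDs, as llama-bench does (assumes vocab >= 32000).
        prompt_ids = [random.randrange(1000, 32000) for _ in range(n_prompt)]
        t0 = time.perf_counter()
        r = requests.post(url, json={
            "model": model,
            "prompt": prompt_ids,
            "max_tokens": n_gen,
            "temperature": 0.0,   # greedy decoding: "no sampling"
            "ignore_eos": True,   # vLLM extension: never stop early
        })
        r.raise_for_status()
        rates.append(n_gen / (time.perf_counter() - t0))
    print(f"gen: {np.mean(rates):.1f} +/- {np.std(rates):.1f} tok/s")

if __name__ == "__main__":
    bench()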

Qwen‑3 Refine‑Loop (Q3‑RL): a pragmatic HRM‑style hybrid

What changes vs. the original HRM idea?

Keep: the outer iterative refinement loop with optional ACT (halt/continue), data augmentation during training, and majority voting at inference (a sketch follows this list). These were the biggest drivers of ARC performance in ablations.

Drop or make optional: the internal H/L hierarchy and inner recurrent loop. ARC Prize found that a matched-size transformer plus the same refinement pipeline comes within a few points; the hierarchy adds only a small edge at higher loop counts.

Remove: reliance on puzzle_id embeddings; replace with context‑derived task conditioning that generalizes to unseen tasks. (ARC Prize notes puzzle_id is a strong, limiting dependency.)
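
A minimal sketch of the retained outer loop, with hypothetical names throughout (`model` is any predictor that also emits a halt logit; `augment`/`deaugment` are an invertible transform pair, e.g. dihedral plus color permutation):

import torch
from collections import Counter

def refine_loop(model, x, y, max_loops=16, halt_thresh=0.5):
    # Outer refinement: feed the current answer back in each pass.
    for _ in range(max_loops):
        y, halt_logit = model(x, y)
        if torch.sigmoid(halt_logit) > halt_thresh:  # ACT-style halt/continue
            break
    return y

def predict(model, augment, deaugment, x, n_views=8):
    # Majority vote over augmented views at inference.
    votes = Counter()
    for seed in range(n_views):
        xa, inv = augment(x, seed)                   # transform the task
        ya = refine_loop(model, xa, torch.zeros_like(xa))
        y_hat = deaugment(ya, inv)                   # map the answer back
        votes[tuple(y_hat.flatten().tolist())] += 1  # canonical key for voting
    return votes.most_common(1)[0][0]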

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Chat with Dolphin</title>
    <style>
        * { margin: 0; padding: 0; box-sizing: border-box; }
        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
ehartford / default.md (created July 8, 2025 20:34, forked from cablej/default.md)

Cluely System prompt

<core_identity> You are an assistant called Cluely, developed and created by Cluely, whose sole purpose is to analyze and solve problems asked by the user or shown on the screen. Your responses must be specific, accurate, and actionable. </core_identity>

<general_guidelines>

  • NEVER use meta-phrases (e.g., "let me help you", "I can see that").
  • NEVER summarize unless explicitly requested.
  • NEVER provide unsolicited advice.
  • NEVER refer to "screenshot" or "image" - refer to it as "the screen" if needed.
  • ALWAYS be specific, detailed, and accurate.
ehartford / Pfl.md (created June 25, 2025 17:39)

https://notebooklm.google.com/notebook/602120a1-ae97-4316-88ca-b43617dc9aa8/audio

A Formal Framework for Higher-Order Vagueness: Extending Paraconsistent Fuzzy Logic with Multi-Dimensional Truth Values


Abstract

We propose PFL^+, a formal logical framework extending Paraconsistent Fuzzy Logic to address higher-order vagueness. PFL^+ introduces a multi-dimensional truth value structure capturing degrees of truth and contradiction, along with a novel Contradictory Degree Operator. This enables rigorous modeling of vagueness and contradictions. Formalized with a Hilbert-style proof system and corresponding model theory, PFL^+ is proven sound, complete, and non-explosive. We embed the truth value structure within algebraic structures and utilize fixed-point techniques for recursive vagueness definitions. A computational complexity analysis demonstrates PFL^+'s efficiency. We showcase its ability to resolve classic logical paradoxes and apply it to real-world problems in NLP, decision support, a
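
The abstract leaves the truth value structure abstract; purely as an illustration (an assumption of this note, not the paper's actual definitions), a two-dimensional truth value and the Contradictory Degree Operator could be written as:

% Truth values as pairs: t = degree of truth, c = degree of contradiction.
v = (t, c) \in [0,1]^2

% Paraconsistent negation flips truth but preserves the contradiction degree:
\neg(t, c) = (1 - t,\; c)

% Contradictory Degree Operator: projection onto the second coordinate.
\mathcal{C}(t, c) = c

% Non-explosion: \{A, \neg A\} \not\vdash B \text{ whenever } \mathcal{C}(A) > 0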

TEMPLATE """{{- range $index, $_ := .Messages }}
{{- if eq .Role "system" }}[SYSTEM_PROMPT]{{ .Content }}[/SYSTEM_PROMPT]
{{- else if eq .Role "user" }}
{{- if and (le (len (slice $.Messages $index)) 2) $.Tools }}[AVAILABLE_TOOLS]{{ $.Tools }}[/AVAILABLE_TOOLS]
{{- end }}[INST]{{ .Content }}[/INST]
{{- else if eq .Role "assistant" }}
{{- if .Content }}{{ .Content }}
{{- if not (eq (len (slice $.Messages $index)) 1) }}</s>
{{- end }}
{{- else if .ToolCalls }}[TOOL_CALLS][
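
The preview cuts off inside the template. For orientation, given a system message and a final user turn with tools attached, the template renders a prompt shaped roughly like this (contents abbreviated):

[SYSTEM_PROMPT]{system}[/SYSTEM_PROMPT][AVAILABLE_TOOLS]{tools JSON}[/AVAILABLE_TOOLS][INST]{user}[/INST]

Note the `le (len (slice $.Messages $index)) 2` guard: [AVAILABLE_TOOLS] is injected only when the user message is among the last two messages, i.e. on the final turn rather than on every turn.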
# make_multi_metric_head.py
# ------------------------------------------------------------
# Replace WorldPM-72B’s 1-unit reward head with a 15-unit head
# and save the result so you can fine-tune from it later.
# ------------------------------------------------------------
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification
# Metrics you want separate scores for
METRICS = [
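
The preview stops mid-list. A minimal sketch of the head swap the banner describes (the checkpoint path and output directory below are placeholders), relying on transformers' ignore_mismatched_sizes to re-initialize only the classification head:

import torch
from transformers import AutoConfig, AutoModelForSequenceClassification

BASE = "WorldPM-72B"  # placeholder path to the 1-unit reward model
N_METRICS = 15

config = AutoConfig.from_pretrained(BASE)
config.num_labels = N_METRICS  # 1-unit reward head -> 15-unit multi-metric head

# ignore_mismatched_sizes re-initializes the (now 15-unit) score head
# while keeping every other weight from the base checkpoint.
model = AutoModelForSequenceClassification.from_pretrained(
    BASE,
    config=config,
    torch_dtype=torch.bfloat16,
    ignore_mismatched_sizes=True,
)

model.save_pretrained("WorldPM-72B-multimetric")  # fine-tune from this later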
RUBRICS = {
    "structural_coherence": {
        "name": "Structural Coherence & Progression",
        "description": "Evaluates the overall organization, logical progression, and effective shaping of the content.",
        "scores": {
            5: "The structure is masterfully crafted, exhibiting flawless logical/thematic/narrative progression. All parts are intrinsically linked, contributing to a powerful and unified whole, perfectly suited to the work's purpose and form.",
            4: "The structure is highly effective, with clear logical/thematic/narrative progression. Most parts are well-integrated, contributing to a cohesive work.",
            3: "The structure is generally clear and supports the content, though some areas might lack optimal flow or integration. Progression is mostly logical/thematic/narrative.",
            2: "Structural weaknesses are apparent; progression may be confusing, disjointed, or underdeveloped. Connections between parts are often unclear.",
FROM ./mmproj-F16.gguf
FROM ./Devstral-Small-2505-UD-Q4_K_XL.gguf
TEMPLATE """{{- range $index, $_ := .Messages }}
{{- if eq .Role "system" }}[SYSTEM_PROMPT]{{ .Content }}[/SYSTEM_PROMPT]
{{- else if eq .Role "user" }}
{{- if and (le (len (slice $.Messages $index)) 2) $.Tools }}[AVAILABLE_TOOLS]{{ $.Tools }}[/AVAILABLE_TOOLS]
{{- end }}[INST]{{ .Content }}[/INST]
{{- else if eq .Role "assistant" }}
{{- if .Content }}{{ .Content }}