@paralin
Created February 22, 2025 19:58
Self-improving AI: we have it, it's here. Here's how it works.

This document contains my simplified understanding of how this works.

It is by no means an exhaustive explanation; it's meant to convey the "vibe" of what's happening while glossing over implementation details.


  1. Ask deepseek R1 to think about something.
  2. It thinks about it for a long time, spending $$$ on compute.
  3. It eventually comes to a conclusion.
  4. Take the question + the conclusion and generate a "thought trace": the "correct" or "shortest" version of the thought process.
  5. Train the model (fine-tune) on the question + thought process + answer.
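
In rough Python pseudocode, a minimal sketch of the loop; every function name here is a hypothetical placeholder standing in for real infrastructure, not an actual API:

```python
# Minimal sketch of the five steps above. `model.reason`, `model.condense`,
# and `fine_tune` are hypothetical stand-ins for an R1-style reasoning model,
# a condensation prompt, and a fine-tuning job.

def self_improvement_step(model, question):
    # Steps 1-3: let the reasoning model think at length (expensive).
    trace, answer = model.reason(question)

    # Step 4: condense the trace to the shortest version that still
    # contains all of the necessary logic.
    short_trace, short_answer = model.condense(question, trace, answer)

    # Step 5: this distilled example goes into the fine-tuning dataset.
    return {
        "question": question,
        "thought_trace": short_trace,
        "answer": short_answer,
    }


def self_improvement_loop(model, questions, fine_tune):
    dataset = [self_improvement_step(model, q) for q in questions]
    return fine_tune(model, dataset)  # the fine-tuned model replaces the old one
```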

Steps 1, 2, 3: run it on something [w/ a known answer?].

Question: how many o are in the word photocomposer?

Run it on R1. Get the thought trace.

Trace:... Answer: 4
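
A minimal sketch of steps 1-3, assuming an OpenAI-compatible endpoint for an R1-style model; the field carrying the trace (`reasoning_content` on DeepSeek's hosted API) may differ by provider:

```python
# Sketch of steps 1-3. Assumes an OpenAI-compatible endpoint for an
# R1-style reasoning model; the field carrying the thought trace
# (`reasoning_content` on DeepSeek's hosted API) may differ elsewhere.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

question = "how many o are in the word photocomposer?"
resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": question}],
)

msg = resp.choices[0].message
trace = getattr(msg, "reasoning_content", None)  # the long, expensive thought trace
answer = msg.content                             # the final answer, e.g. "4"
```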


Step 4: generate the condensed version of the trace + answer

Question: how many o are in the word photocomposer?

<thought_trace> Okay, let's see. The user is asking how many times the letter 'o' appears in the word "photocomposer." Hmm, first, I need to make sure I spell the word correctly. Let's break it down: photo-composer. Wait, is it "photocomposer" all together? Yeah, I think that's right. Alright, so I need to go through each letter in "photocomposer" and count the 'o's. Let me write out the letters one by one. P-H-O-T-O-C-O-M-P-O-S-E-R. Let me count each 'o' as I go.

Starting with the first letter: P (1), H (2), O (3) – that's the first 'o'. Then T (4), O (5) – that's the second 'o'. Next is C (6), O (7) – third 'o'. Then M (8), P (9), O (10) – fourth 'o'. S (11), E (12), R (13). Let me check again to make sure I didn't miss any.

Wait, let me write the letters with their positions:

  1. P
  2. H
  3. O
  4. T
  5. O
  6. C
  7. O
  8. M
  9. P
  10. O
  11. S
  12. E
  13. R

Looking at positions 3,5,7,10. So that's four 'o's. Wait, let me verify. The word is photo (P-H-O-T-O) so that's two 'o's first. Then composer (C-O-M-P-O-S-E-R), which is two 'o's. So 2 + 2 = 4. Yeah, that seems right. I think that's four 'o's total. But let me check again to be sure. Sometimes when letters are close together, it's easy to double-count or miss one. Let me spell it again slowly: P, H, O, T, O, C, O, M, P, O, S, E, R. Yes, the 'o's are at positions 3,5,7,10. So four times. Okay, I'm confident the answer is 4. </thought_trace>

The word "photocomposer" contains the letter **o** four times. Here's the breakdown: 1. **P** 2. **H** 3. **O** 4. **T** 5. **O** 6. **C** 7. **O** 8. **M** 9. **P** 10. **O** 11. **S** 12. **E** 13. **R** **Answer:** 4

Make the answer as short as possible to satisfy the question and output <condensed_answer>. Make the thought trace as short as possible while still containing all necessary logic and using the same writing style and output it as <condensed_thought_trace>.


<condensed_thought_trace> Let me count each 'o' in "photocomposer": P-H-O-T-O-C-O-M-P-O-S-E-R They appear at positions 3,5,7,10. To verify: "photo" has 2 o's, "composer" has 2 o's, so 2+2=4 total. </condensed_thought_trace> <condensed_answer> 4 </condensed_answer>
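
Step 4 can be automated with the same prompt. Here is a minimal sketch; the choice of a cheaper non-reasoning model for condensation and the naive tag parsing are assumptions for illustration:

```python
import re

# The condensation instructions, quoted from above.
CONDENSE_INSTRUCTIONS = (
    "Make the answer as short as possible to satisfy the question and output "
    "<condensed_answer>. Make the thought trace as short as possible while "
    "still containing all necessary logic and using the same writing style "
    "and output it as <condensed_thought_trace>."
)


def condense(client, question, trace, answer, model="deepseek-chat"):
    """Step 4: ask a (cheaper) model to distill the trace and answer."""
    prompt = (
        f"Question: {question}\n\n"
        f"<thought_trace> {trace} </thought_trace>\n\n"
        f"{answer}\n\n"
        f"{CONDENSE_INSTRUCTIONS}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content

    def tag(name):
        m = re.search(rf"<{name}>(.*?)</{name}>", text, re.DOTALL)
        return m.group(1).strip() if m else None

    return tag("condensed_thought_trace"), tag("condensed_answer")
```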


Train the model (fine-tune) on the question + thought process + answer.

Add to training data:

how many o are in the word photocomposer?

Okay, let's see. Break "photocomposer" into parts: "photo" (2 o's) and "composer" (2 o's). Total: 4. Verify by spelling: P-H-O-T-O-C-O-M-P-O-S-E-R.

4
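
Stored as one JSONL record per question (a common fine-tuning format; the exact schema, e.g. chat-style "messages", depends on the trainer you use), that example might look like:

```python
import json

record = {
    "question": "how many o are in the word photocomposer?",
    "thought_trace": (
        "Okay, let's see. Break \"photocomposer\" into parts: "
        "\"photo\" (2 o's) and \"composer\" (2 o's). Total: 4. "
        "Verify by spelling: P-H-O-T-O-C-O-M-P-O-S-E-R."
    ),
    "answer": "4",
}

# Append one JSON object per line to the fine-tuning dataset.
with open("distilled.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```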


^^ self-improving AI


Generate 50 new, unanswered questions that are similar to the original question "how many o are in the word photocomposer?" and follow a similar logical pattern to solve.

Run the above "update the training dataset with the distilled answer" process on each of them.
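
Sketched in the same hypothetical Python as before (the generation prompt and helper functions are illustrative, not a fixed recipe):

```python
# Scale the loop: generate new questions of the same shape, then push each
# one through the same distill-and-fine-tune pipeline. `reason` and
# `condense` are the hypothetical helpers from the earlier sketches.

GEN_PROMPT = (
    'Generate 50 new questions that are similar to "how many o are in the '
    'word photocomposer?" and follow a similar logical pattern to solve. '
    "Output one question per line."
)


def generate_questions(client, model="deepseek-chat"):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GEN_PROMPT}],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]


def expand_dataset(client, reason, condense):
    dataset = []
    for question in generate_questions(client):
        trace, answer = reason(question)  # steps 1-3 (expensive)
        short_trace, short_answer = condense(client, question, trace, answer)  # step 4
        dataset.append({
            "question": question,
            "thought_trace": short_trace,
            "answer": short_answer,
        })
    return dataset  # step 5: fine-tune on this
```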


The old way:

  1. Train on the internet.
  2. Sample that model (effectively a compressed internet) with lots of randomness to answer questions.
  3. Use humans to select which answer is best.
  4. Use that "best answer" to add to the training dataset and fine-tune.

The new way:

  1. Train on the internet.
  2. Stop training on the internet.
  3. Use Chain of Thought reasoning and Reinforcement Learning to generate training data.
  4. Train on that training data, thereby making the model more efficient and better at reasoning.

Notice that the new way requires neither a) more data from the internet nor b) humans to review the data.

This is "closing the loop" where the loop is a) generate data b) review + distill data c) train on the data.


We have closed the loop. This is a self-improving AI.

Imagine that we:

  1. Keep running the self-improving AI training process forever.
  2. Ask the model to embody a personality. Like Claude. Ask it to be "a person".

If we keep running this process forever and then ask the model, now based largely on artificial training data, to embody a personality (like Claude's or ChatGPT's), who are we talking to when we talk to that personality?

Talking to older models was like talking to a compressed version of the Internet, with all the limitations that come with asking about things that never appeared in the training dataset (like how to effectively apply AI, which didn't exist yet when those models hit their training cutoff).

Talking to reasoning models is like talking to a compressed version of Information itself, emergent from the logical Chain-of-Thought traces produced by spending $$ on hardware and energy to run the model.

We can mine intelligence out of energy and hardware, and the result is not human.

Thus leading to this scene from Arrival: https://www.youtube.com/watch?v=Z0e-oiKSrq0


[ Investing insane amounts of money to build datacenters + power ]

https://www.cnbc.com/2025/02/06/openai-looking-at-16-states-for-data-center-campuses-tied-to-stargate.html

OpenAI CEO Sam Altman has praised DeepSeek’s model publicly, calling it “clearly a great model” at an event last week.

“This is a reminder of the level of competition and the need for democratic AI to win,” Altman said at the event, adding that it points to the “level of interest in reasoning, the level of interest in open source.”

Lehane said it’s all adding urgency to efforts in the U.S.

“Right now, there’s really only two countries in the world that can build this AI at scale,” Lehane said on Thursday. “One is the CCP-led China, and the other is the United States, and so that’s sort of the context that we’re operating in. Up until relatively recently, there was a real sense that the U.S. had a material lead on the CCP.”

He added that reports surrounding DeepSeek made “really clear that this is a very real competition, and the stakes could not be bigger. Whoever ends up prevailing in this competition is going to really shape what the world looks like going forward.”


https://news.microsoft.com/source/features/innovation/microsofts-majorana-1-chip-carves-new-path-for-quantum-computing/

Microsoft today introduced Majorana 1, the world’s first quantum chip powered by a new Topological Core architecture that it expects will realize quantum computers capable of solving meaningful, industrial-scale problems in years, not decades.

It leverages the world’s first topoconductor, a breakthrough type of material which can observe and control Majorana particles to produce more reliable and scalable qubits, which are the building blocks for quantum computers.
