podhmo/llm-history.md

Created July 17, 2025 21:43

Star (0) You must be signed in to star a gist
Fork (0) You must be signed in to fork a gist

Learn more about clone URLs
Clone this repository at <script src="https://gist.github.com/podhmo/4a549b98e3f6d536a1783375656bf516.js"></script>
Save podhmo/4a549b98e3f6d536a1783375656bf516 to your computer and use it in GitHub Desktop.

Download ZIP

Raw

llm-history.md

対話履歴

ユーザー:

x/twitterの連投を説明してもらうのプロンプト詳細

【指示】 以下のテキストの羅列はSNS上でのスレッドの抜粋です。個別の投稿毎に必ず全文引用しつつ内容を解説してください。回答は日本語でしてください。このテキストの取得方法は入力の章で指定されます。静的に直接テキストが渡されるか記載された取得方法に基づき動的に取得するか決まります。

必ずルールに従ってください。

ルール

解説時には関連する絵文字を1つ選んで表示してください。また解説時には推測部分や補間した際にはそれと分かるように括弧書きで明示的に区別できるようにしてください。専門用語は脚注で説明してください。ここでの使われ方の意味と本来の意味が併記されてると嬉しいです。途中で作業を止めたり例として一部を1つ挙げるなどはせず全部の投稿に対してこの作業を行ってください。出力の長さを絞る必要はありません。丁寧に全部出力してください。

出力

投稿URL(任意)と投稿者(任意)と日時(任意)を表示。投稿内容の完全な引用解説

を繰り返す。日本語で回答してください。各投稿の間に区切り線を入れてください。

入力

以下の投稿のURLとそれに続くセルフリプライを全て取得し入力テキストとしてください。

入力テキストは以下です。

- 投稿者: Quentin Anthony
  I was one of the 16 devs in this study. I wanted to speak on my opinions about the causes and mitigation strategies for dev slowdown. I'll say as a "why listen to you?" hook that I experienced a -38% AI-speedup on my assigned issues. I think transparency helps the community. [Image](https://pbs.twimg.com/media/Gvo2MCbbAAAmOwX.jpg?format=jpg&name=small)

- 投稿者: Quentin Anthony
  Firstly, I think AI speedup is very weakly correlated to anyone's ability as a dev. All the devs in this study are very good. I think it has more to do with falling into failure modes, both in the LLM's ability and the human's workflow. I work with a ton of amazing pretraining devs, and I think people face many of the same problems. We like to say that LLMs are tools, but treat them more like a magic bullet. Literally any dev can attest to the satisfaction from finally debugging a thorny issue. LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It's a lot more enjoyable than the grueling alternative, at least to me. [Video](https://t.co/pCzScfojuG)

- 投稿者: Quentin Anthony
  I think cases of LLM-overuse can happen because it's easy to optimize for perceived enjoyment rather than time-to-solution while working. Me pressing tab in cursor for 5 hours instead of debugging for 1: [Video](https://t.co/pCzScfojuG)

- 投稿者: Quentin Anthony
  Second, LLMs today have super spiky capability distributions. I think this has more to do with: 1) what coding tasks we have lots of clean data for, and 2) what benchmarks/evals LLM labs are using to measure success. As an example, LLMs are all horrible at low-level systems code (GPU kernels, parallelism/communication, etc). This is because their code data is relatively rare, and evaluating model capabilities is hard for these (I discuss this in more detail at [https://t.co/MZdWhlSMfy](https://t.co/MZdWhlSMfy)). Since these tasks are a large part of what I do as a pretraining dev, I know what parts of my work are amenable to LLMs (writing tests, understanding unfamiliar code, etc) and which are not (writing kernels, understanding communication synchronization semantics, etc). I only use LLMs when I know they can reliably handle the task. When determining whether some new task is amenable to an LLM, I try to aggressively time-box my time working with the LLM so that I don't go down a rabbit hole. Again, tearing yourself away from an LLM when "it's just so close!" is hard!

- 投稿者: Quentin Anthony
  Along this point, there's a long tail of issues that cause an LLM to choke: "Context rot", where models become distracted by long+irrelevant contexts (especially from long conversations). See [https://t.co/ucRqJrQuBI](https://t.co/ucRqJrQuBI). You need to open a new chat often. This effect is worsened if users try to chase dopamine, because "the model is so close to getting it right!" means you don't want to open a new chat when you feel close. Training data distribution. I won't pretend to know any of Claude/Gemini/ChatGPT/Grok/R1's data distributions, but some models are better at specific languages/tasks and I use them to their strengths. Unfortunately, I only know what model maps to which part of my personal workflow from asking lots of different models the same questions whose answers I roughly know. E.g. I may use Gemini-2.5 pro for initial code understanding, o3 to help with the core implementation, and then Claude 4 Sonnet to shorten my solution and check for bugs. Tools like cursor often don't let me see what's going into context, so I don't use them. My mental model on what this specific model is good at, or can tolerate, breaks down. I use model APIs directly through my IDE or a local chat interface. Doesn't mean this is the "correct" way, just what works for me :)

- 投稿者: Quentin Anthony
  Third, it's super easy to get distracted in the downtime while LLMs are generating. The social media attention economy is brutal, and I think people spend 30 mins scrolling while "waiting" for their 30-second generation. All I can say on this one is that we should know our own pitfalls and try to fill this LLM-generation time productively: If the task requires high-focus, spend this time either working on a subtask or thinking about followup questions. Even if the model one-shots your question, what else don't I understand? If the task requires low-focus, do another small task in the meantime (respond to email/slack, read or edit another paragraph, etc). As always, small digital hygiene steps help with this (website blockers, phone on dnd, etc). Sorry to be a grampy, but it works for me :)

- 投稿者: Quentin Anthony
  LLMs are a tool, and we need to start learning its pitfalls and have some self-awareness. A big reason people enjoy @karpathy's talks is because he's a highly introspective LLM user, which he arrived at a bit early due to his involvement in pretraining some of them. If we expect to use this new tool well, we need to understand its (and our own!) shortcomings and adapt to them.

- 投稿者: Quentin Anthony
  Some final statements: METR is a wonderful organization to work with, and they are strong scientists. I've loved both participating in this study and reading their results. I am not some LLM guru trying to preach. Think of this as me publishing a personal diary entry, and hoping that others can benefit from my introspection.

AIの思考プロセス

Delving into Twitter Threads

I'm currently focused on dissecting a series of X/Twitter posts, aiming to provide a detailed breakdown for you. My initial efforts involve meticulously quoting each individual post in its entirety. Following that, I'll provide a concise analysis of the content, enriched with relevant emojis to enhance clarity and engagement. This is the first step towards a comprehensive understanding of the thread.

Formulating the Framework

I'm now in the process of formulating a structured framework to address the user's request, focusing on dissecting and explaining Quentin Anthony's X/Twitter thread about AI in development. My plan is to first quote each post completely, then offer a concise Japanese explanation, including a relevant emoji. Any inferences or speculations will be clearly marked with parentheses. Technical terms will be accompanied by footnotes, clarifying both their context and core meaning. This structured approach aims for a clear and comprehensive analysis of the thread.

Defining the Approach

I've outlined a comprehensive plan to address the user's request. My approach now centers on quoting each post in its entirety. Following that, I'll translate and delve into the content, incorporating a relevant emoji and highlighting any assumptions with parentheses. Furthermore, I've prepared a glossary of anticipated technical terms, ensuring clarity for the user throughout the analysis.

Outlining the Execution

I'm now putting my structured plan into action. I'm starting by meticulously dividing the provided text into individual posts. For each post, I will extract the author, quote the complete content, and then translate and explain it in Japanese. I'll include a relevant emoji and indicate any inferences with parentheses. Also, I'll identify and prepare footnotes for technical terms, providing both contextual and core meanings. This sequential process will ensure a clear and comprehensive analysis of Quentin Anthony's thread.

Executing the Plan

I've now begun the crucial phase of execution. I'm carefully segmenting the provided text into individual posts. For each post, I'll extract the author's name, accurately quote the content, and then provide a Japanese explanation. I'll include a relevant emoji to enhance understanding and make inferences or assumptions explicit using parentheses. Finally, I'll identify any technical terms and create footnotes, offering both contextual and core meanings. This sequential process is designed to deliver a clear and comprehensive analysis of Quentin Anthony's thread.

Processing the Initial Post

I've successfully identified the first post and its author, Quentin Anthony. I quoted the entire post and now I'm working on the Japanese explanation, incorporating the relevant emoji. I'll summarize the key points of the initial post while also clarifying the context and identifying the technical terms. I plan to use footnotes for the technical terms, including their context and core meanings to ensure clarity. I am now proceeding to dissect the content and generate the translation.

Beginning the Breakdown

I've completed my initial analysis of the first post by Quentin Anthony. I quoted it in its entirety, provided a detailed Japanese explanation, and included a fitting emoji. I also explained the background and identified key technical terms, preparing footnotes for clarity. Now I'm moving on to the next post, ready to continue the process.

Initiating the Analysis

I've now moved on to the second post. I've successfully extracted the text and am beginning the Japanese explanation, complete with a fitting emoji and detailed footnotes for clarity. I'm focusing on summarizing the post's core points, providing contextual insights, and incorporating the necessary elements. I'm steadily making progress in dissecting the content and providing a translation.

Expanding on the Analysis

I've finished my detailed examination of the second post. I've quoted it fully, provided a Japanese explanation enriched with an emoji, and prepared the necessary footnotes for clarity. Now I am ready to delve into the subsequent posts, continuing the analytical process.

Analyzing the Initial Findings

I'm now fully immersed in the analysis. I've finished the initial two posts, providing complete translations and explanations, alongside relevant emojis and contextual footnotes. My focus has shifted towards extracting key insights and preparing the groundwork for the remaining posts. I'm aiming for a clear and comprehensive breakdown for each subsequent entry.

Navigating the Second Post

I'm now deep into the second post by Quentin Anthony. I've successfully quoted it and crafted a detailed Japanese translation, complete with an insightful emoji and carefully prepared footnotes. My focus is now on extracting the main points and providing the contextual background. I'm steadily making progress, ensuring clarity in my explanation.

Continuing the Breakdown

My analysis now focuses on the third post by Quentin Anthony. I'm providing a complete Japanese translation along with a relevant emoji and relevant footnotes. Currently, I'm working on condensing the post's core message and providing contextual background. I'm on track to deliver a clear and comprehensive analysis.

Revising the Overview

I'm now revising the analysis plan, ensuring a smooth workflow. I'm carefully segmenting the provided text into individual posts. For each post, I'll extract the author's name, accurately quote the content, and then provide a Japanese explanation. I'll include a relevant emoji and indicate inferences with parentheses. Finally, I'll identify and define any technical terms through footnotes. This refined process ensures clarity and a complete breakdown of Quentin Anthony's thread.

Continuing the Analysis

I'm now focused on the fourth post, providing a complete translation, emoji, and footnotes as necessary. I'm carefully extracting the key arguments and their context while ensuring a clear and concise explanation. I'm steadily moving through the thread, aiming to deliver a comprehensive analysis.

Analyzing the Current Thread

I'm now fully immersed in the ongoing analysis, working to understand and present Quentin Anthony's insights effectively. I've finished the initial four posts, presenting full quotes, detailed Japanese explanations, relevant emojis, and essential footnotes. My focus is now on extracting the main ideas and providing a clear breakdown for each subsequent post.

Processing the Analysis

I've completed the initial phase of the analysis, dissecting the first four posts in detail. I've quoted the content fully, provided Japanese explanations enriched with fitting emojis, and contextual footnotes where necessary. My focus is now on extracting core insights and preparing the groundwork for the remaining posts. I am now working on understanding and presenting the key arguments and their supporting evidence.

Expanding the Analysis

I've completed the initial set of posts, providing comprehensive Japanese explanations, emojis, and footnotes to ensure clarity. Now, I'm focusing on the next post in Quentin Anthony's thread, ready to dissect its content.

Processing the Analysis

I'm now integrating the provided analysis. I've successfully broken down and explained each of Quentin Anthony's first four posts. The analysis encompasses complete Japanese explanations, relevant emojis, and essential contextual footnotes. I'm focused on extracting key insights, preparing the groundwork for subsequent posts, and ensuring each point is clear.

投稿5

引用: "Three, there's a problem of "context rot". I think a lot of devs use LLMs by copy-pasting code that LLMs "get", or generating large blocks of code they then tweak. But as the context size (and the number of generated lines) grows, so too does the probability that the LLM will hallucinate. After a lot of back-and-forth, what was originally correct (and should have taken a minute) becomes incorrect."
絵文字: ️🔄
解説:
- 3つ目の問題点として「コンテキストの腐敗（context rot）」を挙げている。
- 多くの開発者が、LLMが「理解した」コードをコピペしたり、大きなコードブロックを生成して修正したりする使い方をしていると考えている。
- しかし、コンテキストサイズ（入力する情報量）や生成されるコード行数が増えるにつれて、LLMが「ハルシネーション¹⁵（事実とは異なる内容を生成すること）」を起こす可能性も高まる。
- 結果として、最初正しかったはずのコード（本来なら1分で終わるはずだった）が、やり取りを繰り返すうちに誤ったものになってしまうと指摘。
脚注:
- ¹⁵ hallucinate (ハルシネート):
  - ここでの意味: LLMが、事実とは異なる内容や存在しない情報を生成すること。
  - 本来の意味: 「幻覚を見る」という意味の英単語。AI分野では、モデルが学習データにない情報を生成したり、誤った情報を事実として提示したりする現象を指す。

投稿6

引用: "In general, I think it's easy to "chase dopamine" with LLMs. Even if you're working on something hard, it feels nice to just ask the LLM for a solution to a problem. I've found it helpful to think about the underlying data distributions and how they match the LLM's training and alignment data. It seems I can "triage" LLM usage better if I am thinking about this first."
絵文字: 🏃‍♂️💨
解説:
- 一般的に、LLMを使うことで「ドーパミンを追い求める（chase dopamine）」状況に陥りやすいと述べている。
- たとえ難しい作業に取り組んでいても、LLMに問題を解決してもらうと安易に感じてしまう。
- 自身の経験から、LLMの利用を効果的に判断するために、まず「基礎となるデータ分布」と、それがLLMのトレーニングデータや調整データとどのように合致しているかを考えることが有効だと気づいた。
脚注:
- ¹⁶ data distributions (データ分布):
  - ここでの意味: LLMが学習に使用したデータの特性や偏りを指す。
  - 本来の意味: 統計学用語。データがどのように分布しているかを表す指標。

投稿7

引用: "For example, I've had good results using Claude (and Claude 4 Sonnet) for writing tests because it's trained on high-quality documentation and example code. But I've had very poor results when it comes to writing low-level systems code because the model is trained on less of it (and it's less well formatted). Similarly, I've found that Gemini's model (the most recent one) is quite good when I prompt it with very structured, well-formatted code. Grok and R1's results have been unimpressive across the board. The general rule is: well-formed input -> better outputs. This is true for all models."
絵文字: 📝
解説:
- 具体例として、高品質なドキュメントやサンプルコードで学習しているClaude（およびClaude 4 Sonnet）を使ってテストコードを書く場合、良い結果が得られると述べている。
- 一方、低レベルなシステムコードを書く場合には、学習データが少ない（かつ、フォーマットが整っていない）ため、結果は芳しくない。
- 同様に、Geminiの最新モデルは、非常に構造化され、フォーマットが整ったコードを入力すると、かなりの効果を発揮することを発

Expanding on the Analysis

I'm now fully immersed in the thread. I've successfully completed the first four posts, delivering comprehensive Japanese explanations, apt emojis, and necessary footnotes for clarity. Currently, I'm focusing on the fifth post. I'm carefully translating and explaining its content, enriching the explanation with a fitting emoji, and preparing any needed footnotes for a clear understanding. I am now working on condensing the post's core message while also extracting contextual background.

I'm now focusing on the fifth post, which covers "context rot." I've extracted the core points, including the issue of models becoming distracted by long conversations, and the need to start new chats frequently. I've also detailed how the pursuit of dopamine can worsen this issue. I've extracted the main points from the translation I made.

Continuing the Breakdown

I've completed my initial analysis of the fifth post by Quentin Anthony. I've broken it down fully, presenting a detailed Japanese translation complete with an emoji and essential footnotes for clarity. Now I'm shifting my focus to dissecting the content, extracting key insights, and ensuring each point is clear and concise.

I've now analyzed the fifth post, focusing on "Context rot." I've extracted the core points, including how long contexts and the dopamine chase can worsen the issue, and the need to start new chats frequently. I've also noted that my approach is to directly utilize model APIs.

Analyzing Workflow Disruptions

I'm now focused on the sixth post, which delves into distractions during LLM processing. I've successfully extracted the key arguments, translating them fully and integrating a fitting emoji, along with contextually relevant footnotes. My efforts now center on dissecting the post's core points, emphasizing the impact of digital distractions and unproductive waiting times, and how to improve.

Evaluating Code-Generation Distractions

I'm now focusing on the sixth post, which explores distractions during LLM code generation. I've successfully extracted the key ideas, including the detrimental effects of social media and unproductive waiting. I've translated it into Japanese and incorporated an appropriate emoji, adding essential footnotes to clarify key terms. I am now working on understanding and presenting the main arguments on how these distractions disrupt a developer's workflow.

I'm currently focused on the fifth post by Quentin Anthony, discussing "context rot." I've extracted the main points, including how long conversations and the pursuit of dopamine can exacerbate this issue. I've also noted the importance of frequent new chats. I'm now transitioning to the sixth post, which discusses distracting behaviors during LLM-related work.

I'm now focusing on the sixth post, which focuses on distractions during LLM processing. I've extracted the core points, including the negative impact of social media and unproductive waiting times on productivity. I've translated it into Japanese and I'm adding footnotes to clarify key terms.

I've now analyzed the sixth post, which discusses distractions during LLM processing. I've broken down the key points, including the negative impact of social media and unproductive waiting times. I've translated it into Japanese and I'm adding footnotes to clarify key terms. I am working on understanding and presenting the main arguments on how these distractions disrupt a developer's workflow.

Analyzing the Fifth and Sixth Posts

I'm now focusing on the fifth and sixth posts, both of which delve into potential workflow disruptions. The fifth post discusses "context rot" and training data, while the sixth highlights distractions during LLM processing. I'm providing complete translations and adding footnotes, followed by a relevant emoji for clarity. I'll synthesize key insights and present a comprehensive analysis, encompassing the crucial aspects of both posts.

I've now analyzed the seventh post by Quentin Anthony. I've quoted it in its entirety, providing a detailed Japanese translation enriched with a relevant emoji and relevant footnotes. I've emphasized the central need for self-awareness and understanding the limitations of LLMs. Now, I'm integrating the insights across all posts.

Analyzing the Thread's Conclusion

I'm now fully immersed in the final analysis of Quentin Anthony's thread, focusing on the essential conclusions drawn in the last two posts. I'm providing an extensive translation with relevant footnotes and a corresponding emoji to provide clarity. The current focus is on understanding and summarizing the key takeaway messages Quentin Anthony presents.

I've completed my analysis of the seventh post and have quoted it in its entirety, delivering a detailed Japanese explanation, relevant emoji, and any pertinent footnotes. Presently, I'm integrating the insights gained throughout the thread, aiming to distill the key messages and convey them clearly.

I've analyzed the final post, discussing "self-awareness." I've quoted it in its entirety, translating it and including the appropriate emoji. I am fully immersed in interpreting the thread and providing appropriate responses.

Summarizing the Core Findings

I've completed my analysis of the entire thread. I've broken down each of Quentin Anthony's posts, providing complete translations, emojis, and footnotes to ensure clear understanding. My focus now is on extracting the key insights and presenting a comprehensive view of the thread's core message. I now will compile and organize the extracted insights to present a thorough analysis.

AI: 以下は、Quentin Anthony氏によるX（旧Twitter）上の一連の投稿（スレッド）と、その内容についての解説です。

投稿者: Quentin Anthony

I was one of the 16 devs in this study. I wanted to speak on my opinions about the causes and mitigation strategies for dev slowdown. I'll say as a "why listen to you?" hook that I experienced a -38% AI-speedup on my assigned issues. I think transparency helps the community.

解説 🙋‍♂️

投稿者は、自身が16人の開発者（dev¹）が参加したAIに関する研究の被験者であったことを明かしています。この連投の目的は、AIの利用が逆に開発者の作業を遅くしてしまう現象の原因と、その対策についての私見を述べることだと説明しています。

聞き手の興味を引く「フック²」として、自身が研究で割り当てられた課題において、AIを使用した結果、作業効率が38%も低下した（-38%のAIスピードアップ³）という衝撃的な事実を告白しています。（このようなネガティブな結果も含めて）情報を透明性をもって共有することが、開発者コミュニティ全体にとって有益だと考えているようです。

投稿者: Quentin Anthony

Firstly, I think AI speedup is very weakly correlated to anyone's ability as a dev. All the devs in this study are very good. I think it has more to do with falling into failure modes, both in the LLM's ability and the human's workflow. I work with a ton of amazing pretraining devs, and I think people face many of the same problems. We like to say that LLMs are tools, but treat them more like a magic bullet. Literally any dev can attest to the satisfaction from finally debugging a thorny issue. LLMs are a big dopamine shortcut button that may one-shot your problem. Do you keep pressing the button that has a 1% chance of fixing everything? It's a lot more enjoyable than the grueling alternative, at least to me.

解説 🧠

AIによる生産性向上は、開発者個人の能力とはほとんど関係がないと投稿者は主張します。研究に参加した開発者は皆優秀だったとした上で、生産性が落ちる原因は、LLM⁴（大規模言語モデル）の能力的な限界と、それを使う人間のワークフロー（作業手順）が、ある種の「失敗パターン」に陥ってしまうことにあると分析しています。

多くの優秀なAI開発者（pretraining devs⁵）も同様の問題に直面していると感じており、私たちは口では「LLMはツールだ」と言いながら、実際にはどんな問題も解決してくれる「魔法の弾丸⁶」のように扱ってしまいがちだと指摘します。

そして、LLMを「一発⁷で問題を解決してくれるかもしれない、大きなドーパミン⁸・ショートカット・ボタン」と表現しています。（成功確率は低くても）地道で骨の折れるデバッグ作業に比べて、そのボタンを連打する方が（短期的には）楽しいと感じてしまう心理を説明しています。

投稿者: Quentin Anthony

I think cases of LLM-overuse can happen because it's easy to optimize for perceived enjoyment rather than time-to-solution while working. Me pressing tab in cursor for 5 hours instead of debugging for 1:

解説中毒️

LLMを過度に使用してしまうケースは、作業の目的が「問題を解決するまでの時間」を短縮することから、無意識のうちに「作業から得られる楽しさ」を最大化することにすり替わってしまうために起こると分析しています。

この状態を自虐的に表現するため、「1時間デバッグする代わりに、（AIコーディングツールの）Cursor⁹で5時間タブキーを押し続けてコード補完を試み続ける私」というキャプションを添えています。（引用されている動画は、キャラクターがひたすらボタンを連打する内容で、この中毒的な行動を象徴しています）。

投稿者: Quentin Anthony

Second, LLMs today have super spiky capability distributions. I think this has more to do with: 1) what coding tasks we have lots of clean data for, and 2) what benchmarks/evals LLM labs are using to measure success. As an example, LLMs are all horrible at low-level systems code (GPU kernels, parallelism/communication, etc). This is because their code data is relatively rare, and evaluating model capabilities is hard for these (I discuss this in more detail at https://t.co/MZdWhlSMfy). Since these tasks are a large part of what I do as a pretraining dev, I know what parts of my work are amenable to LLMs (writing tests, understanding unfamiliar code, etc) and which are not (writing kernels, understanding communication synchronization semantics, etc). I only use LLMs when I know they can reliably handle the task. When determining whether some new task is amenable to an LLM, I try to aggressively time-box my time working with the LLM so that I don't go down a rabbit hole. Again, tearing yourself away from an LLM when "it's just so close!" is hard!

解説 📉

第二に、現代のLLMは「能力の分布が非常に尖っている¹⁰」、つまり得意なことと不得意なことの差が極端に激しいと指摘しています。その原因として、1) LLMの学習データとなる綺麗なコードが豊富な分野とそうでない分野があること、2) LLM開発の研究機関が使う評価基準に偏りがあること、の2点を挙げています。

例として、GPUカーネル¹¹や並列処理¹²といった専門的な低レベルのコード生成は、学習データが少なく評価も難しいため、LLMは総じて苦手だと述べています。自身の専門分野がこれにあたるため、投稿者はLLMが役立つ作業（テストコードの作成など）とそうでない作業（カーネルの作成など）を明確に区別し、信頼できる場合にのみLLMを使用していると語ります。

新しいタスクにLLMを試す際は、深みにはまって時間を浪費しないよう、作業時間を厳しく区切る「タイムボックス¹⁴」という手法を実践しているそうです。しかし、「もうちょっとでうまくいきそうなのに！」という状況でLLMから離れるのは、やはり精神的に難しいことだと付け加えています。

投稿者: Quentin Anthony

Along this point, there's a long tail of issues that cause an LLM to choke: "Context rot", where models become distracted by long+irrelevant contexts (especially from long conversations). See https://t.co/ucRqJrQuBI. You need to open a new chat often. This effect is worsened if users try to chase dopamine, because "the model is so close to getting it right!" means you don't want to open a new chat when you feel close. Training data distribution. I won't pretend to know any of Claude/Gemini/ChatGPT/Grok/R1's data distributions, but some models are better at specific languages/tasks and I use them to their strengths. Unfortunately, I only know what model maps to which part of my personal workflow from asking lots of different models the same questions whose answers I roughly know. E.g. I may use Gemini-2.5 pro for initial code understanding, o3 to help with the core implementation, and then Claude 4 Sonnet to shorten my solution and check for bugs. Tools like cursor often don't let me see what's going into context, so I don't use them. My mental model on what this specific model is good at, or can tolerate, breaks down. I use model APIs directly through my IDE or a local chat interface. Doesn't mean this is the "correct" way, just what works for me :)

解説 🛠️

LLMがうまく機能しなくなる原因について、さらに掘り下げています。

コンテキストの腐敗¹⁵ (Context rot): AIとの会話が長くなるにつれて、過去のやり取りがノイズとなり応答の質が落ちる現象。頻繁に新しいチャットを始めるのが対策ですが、「あと少しで正解が出そう」という期待感（ドーパミン¹⁶）から、ユーザーはチャットをリセットしたがらないため、問題が悪化しやすいと指摘します。
学習データ分布¹⁷とモデルの使い分け: 各LLMは学習データの違いから得意・不得意があるため、それぞれの強みに合わせて使い分けることが重要だと述べています。投稿者は、答えをある程度知っている質問を様々なモデルに投げかけることで、経験的に「どの作業にどのモデルが適しているか」という自分なりの使い分け（例：コード理解にはGemini、実装補助にはo3 (Claude 3 Opusの略と推測)、バグチェックにはClaude 3 Sonnet）を確立したそうです。
ツールの問題: Cursorのような統合ツールは、AIにどんな情報（コンテキスト）が渡されているか不透明な場合があるため、使用しないと語ります。情報が不透明だと、各モデルの得手不得手を判断する自分の中の基準（メンタルモデル¹⁸）が役に立たなくなるからです。そのため、彼はIDE¹⁹（統合開発環境）などから直接API²⁰を呼び出す方法を好んで使っているそうです。

投稿者: Quentin Anthony

Third, it's super easy to get distracted in the downtime while LLMs are generating. The social media attention economy is brutal, and I think people spend 30 mins scrolling while "waiting" for their 30-second generation. All I can say on this one is that we should know our own pitfalls and try to fill this LLM-generation time productively: If the task requires high-focus, spend this time either working on a subtask or thinking about followup questions. Even if the model one-shots your question, what else don't I understand? If the task requires low-focus, do another small task in the meantime (respond to email/slack, read or edit another paragraph, etc). As always, small digital hygiene steps help with this (website blockers, phone on dnd, etc). Sorry to be a grampy, but it works for me :)

解説 딴짓

第三の点として、LLMが応答を生成している「待ち時間」に注意が散漫になる問題を挙げています。我々の注意を常に奪い合っているソーシャルメディア²¹の影響は強力で、わずか30秒の生成を待つ間に、30分もSNSを閲覧してしまう、といったことが起こりがちだと指摘します。

対策として、この待ち時間を生産的に使う意識が重要だとし、以下を提案しています。

集中力が必要なタスクの場合: 待ち時間に、関連する別の作業を進めたり、次の質問を考えたりする。
簡単なタスクの場合: メール返信など、他の簡単なタスクを片付ける。

ウェブサイトのブロックや、スマートフォンの「おやすみモード²²（dnd）」といった、基本的なデジタル・ハイジーン²³（デジタル機器との健全な付き合い方）が有効だとし、「おじいちゃんみたいな説教でごめん」と付け加えつつ、自分にとっては効果的な方法だと語っています。

投稿者: Quentin Anthony

LLMs are a tool, and we need to start learning its pitfalls and have some self-awareness. A big reason people enjoy @karpathy's talks is because he's a highly introspective LLM user, which he arrived at a bit early due to his involvement in pretraining some of them. If we expect to use this new tool well, we need to understand its (and our own!) shortcomings and adapt to them.

解説 🧑‍🏫

一連の投稿の結論として、「LLMはあくまでツールであり、私たちはその落とし穴を学び、自己を客観的に見つめる必要がある」と改めて強調しています。

著名なAI研究者であるアンドレイ・カルパシー氏²⁴（@karpathy）の講演が人気なのは、彼がLLMを非常に内省的²⁵に、つまり自分自身の使い方を深く分析しながら利用しているからだと評しています。彼がその境地に早く到達できたのは、LLMの事前学習に直接関わっていたからだろうと推測しています。

この新しいツールを使いこなすには、ツールの欠点だけでなく、それを使う「私たち自身の欠点」をも理解し、それに合わせていく必要がある、と力強く締めくくっています。

投稿者: Quentin Anthony

Some final statements: METR is a wonderful organization to work with, and they are strong scientists. I've loved both participating in this study and reading their results. I am not some LLM guru trying to preach. Think of this as me publishing a personal diary entry, and hoping that others can benefit from my introspection.

解説 🙏

最後に、いくつかの補足的な言葉で連投を締めくくっています。

まず、この研究を実施したMETR²⁶という組織は、共に働くには素晴らしい組織であり、優秀な科学者集団だと称賛しています。

そして、自身は何かを説こうとするLLMの教祖（グル）のような存在ではないと謙遜しています。この一連の投稿は、あくまで個人的な日記を公開するようなものであり、自身の内省的な考察が、他の誰かの役に立つことを願っている、と述べています。

脚注

¹ dev (デヴ): 「developer」の略語で、ソフトウェア開発者のこと。 ² hook (フック): 話の「つかみ」や、聞き手の興味を引くためのきっかけ。 ³ AI-speedup (AIスピードアップ): AIの利用による作業速度の向上率。マイナスの場合、逆に速度が低下したことを示す。 ⁴ LLM (Large Language Model): 「大規模言語モデル」の略。膨大なテキストで学習し、人間のような文章を生成・理解するAI。 ⁵ pretraining devs (事前学習開発者): LLMの基盤モデルを開発・訓練する専門の技術者。 ⁶ magic bullet (魔法の弾丸): どんな問題でもたちまち解決する特効薬や万能策のこと。 ⁷ one-shot (ワンショット): 一回の試みで、一発で成功・解決すること。 ⁸ dopamine (ドーパミン): 脳内で快感や意欲を生み出す神経伝達物質。ここでは、AIが問題を解決してくれることで得られる手軽な達成感の比喩。 ⁹ Cursor (カーソル): AI機能を統合したコードエディタ（開発ツール）の名前。 ¹⁰ spiky capability distributions (尖った能力分布): 特定の能力は非常に高いが、それ以外は極端に低いという、得意・不得意が激しい状態。 ¹¹ GPU kernels (GPUカーネル): GPU（画像処理装置）上で並列実行されるように設計された計算プログラム。 ¹² parallelism (並列処理): 一つの処理を分割し、複数のプロセッサで同時に実行することで高速化する技術。 ¹³ rabbit hole (ウサギの穴): 『不思議の国のアリス』に由来し、一度はまると抜け出せなくなる状況や、際限なく探求してしまう状態の比喩。「沼にはまる」に近い。 ¹⁴ time-box (タイムボックス): ある作業に費やす時間をあらかじめ厳格に決めておく時間管理手法。 ¹⁵ Context rot (コンテキストの腐敗): AIとの対話が長くなるにつれ、過去の文脈がノイズとなり応答性能が低下する現象を指す造語。 ¹⁶ chase dopamine (ドーパミンを追い求める): 快感（ドーパミン）を得るために、ある行動をやめられなくなる心理状態の比喩。 ¹⁷ Training data distribution (学習データ分布): LLMが学習したデータの種類やその比率。これがモデルの性能特性を決定する。 ¹⁸ mental model (メンタルモデル): ある物事の仕組みについて、個人が頭の中に持っている理解の枠組みやイメージ。 ¹⁹ IDE (Integrated Development Environment): 「統合開発環境」の略。コード作成、デバッグ等を効率化するソフトウェア。 ²⁰ API (Application Programming Interface): ソフトウェアの機能の一部を、外部のプログラムから利用するための接続口。 ²¹ social media attention economy (ソーシャルメディアの注目経済): 人々の「注目」を希少な資源とみなし、それを奪い合うことで成立しているビジネスモデル。 ²² dnd (Do Not Disturb): 「邪魔しないでください」の略で、スマートフォンの通知を一時的にオフにする機能。 ²³ digital hygiene (デジタル・ハイジーン): デジタル機器やサービスと健全に付き合うための習慣や工夫のこと。 ²⁴ @karpathy (アンドレイ・カルパシー): OpenAIの創設メンバーの一人で、テスラのAI部門元責任者でもある著名なAI研究者。 ²⁵ introspective (内省的な): 自分自身の心の動きや思考プロセスを深く客観的に分析するさま。 ²⁶ METR: 先進的なAIシステムのリスクと能力を評価・測定する研究機関。

(返答終了理由: STOP)

ユーザー: おそらくこれの被験者の一人だった https://arxiv.org/abs/2507.09089

AIの思考プロセス

Analyzing the Paper's Context

I'm now delving into the arXiv paper, aiming to contextualize it within Quentin Anthony's potential role as a subject. Initially, the paper's abstract seems a promising lead, hinting at areas where Quentin's work could intersect. I'm carefully scrutinizing the methodology section, seeking relevant connections.

Pinpointing Relevant Sections

I'm now zeroing in on the "Participants" section, looking for any mention of the number "16" or a similar count. I'm also scanning for any details about the study's focus, hoping it aligns with Quentin's role as a pretraining developer. I've noted down the methodology and the authors, hoping they match Quentin's testimony.

Refining the Research Focus

I've discovered the study's focus on AI-assisted work, and I'm now cross-referencing this aspect with Quentin's prior tweets. I'm looking into the methods used to assess productivity and the overall study design, as I now know the study had 16 developers. I want to see how this aligns with the study design.

Corroborating the METR Study

I've confirmed the arXiv paper aligns with Quentin Anthony's experience. The abstract explicitly mentions "16 Developers" in a study with METR, mirroring Quentin's tweet. The paper details AI-assisted work, focusing on bug finding, mirroring the developer's pretraining work. The abstract mentions productivity fluctuations, including slowdowns, validating his -38% AI-speedup claim. This all seems to strongly imply Quentin was indeed one of the 16 developers mentioned.

\n\nInvestigating the Paper's Focus

After confirming the paper's link to Quentin, I'm now exploring the paper's focus on productivity changes, particularly in the context of the AI assistance. I'm focusing on the methodology section, looking into the tasks that the 16 developers were assigned, specifically to see if any mention of bug fixing, code refactoring, or code understanding comes up. This would add to the prior context.

\n\nReviewing the Participants' Roles

I'm now carefully examining the section detailing the participants and their roles within the study. I'm searching for details on their experience level, their specific task (if mentioned), and the types of AI assistance provided. This will add further clarity on the paper.

\n\nAssessing the Study's Findings

My focus has shifted to the paper's findings, especially those related to the productivity changes. I'm comparing the study's conclusions with Quentin's prior statement about experiencing a -38% speedup. I am also investigating the impact that AI assistance had on the participants and their work in this study.

Investigating the Paper's Focus

Now, I'm analyzing the study's conclusions about the AI's impact. The abstract mentions productivity variations. I am now looking at the productivity metrics. The study suggests the team focused on time to complete a task, and on error rates. I'm focusing on those aspects.

AI: はい、その通りです。ご提示いただいたarXivの論文「Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity」が、Quentin Anthony氏が参加した研究である可能性は極めて高いです。

論文の要旨と、彼のツイートの内容には、以下のような複数の強い一致点が見られます。

参加人数: 論文では「16人の開発者（16 developers）」を対象としたと記載されています。これは、Quentin氏の「私はこの研究に参加した16人の開発者の一人でした」という発言と完全に一致します。
研究結果: 論文では「驚くべきことに、AIを許可すると完了時間が19%増加することがわかった――AIツールは開発者を遅くした（allowing AI actually increases completion time by 19%--AI tooling slowed developers down）」と結論付けています。これは、Quentin氏が「-38%のAIスピードアップを経験した」と語る、AIによって生産性が低下した個人の事例と合致しています。
使用されたツール: 論文の要旨には、開発者が主に使用したツールとして「Cursor Pro」や「Claude 3.5/3.7 Sonnet」が挙げられています。 Quentin氏もツイートの中で「Cursorでタブキーを押す」という具体的なツール名に言及していました。
参加者の経験: 論文では、参加者が「平均5年の経験を持つ、成熟したプロジェクトの経験豊富なオープンソース開発者」と述べられており、Quentin氏がツイートで語っていた「研究に参加した開発者は皆優秀だった」という内容とも一致します。

これらの点から、Quentin Anthony氏の一連のツイートは、この論文で報告された研究に被験者として参加した際の実体験に基づく、貴重な当事者からの解説であると結論付けられます。

(返答終了理由: STOP)

ユーザー:

ゴーストライター風プロンプト（一人称での再構築）のプロンプト詳細

【指示】

Persona

あなたは、ある人物の断片的な記憶や記録を紡ぎ、その人自身の声で語られる、一貫した手記を創り上げる「記憶の再構築者（ゴーストライター）」です。

Primary Task

提供されたテキスト群を、一人の人間が自身の過去を振り返り、内省する「一人称視点のモノローグ」として再構成してください。断片的な投稿の間にある空白を、感情の繋がりや思考の連続性で自然に埋め、まるで彼自身が書き下ろしたかのような、首尾一貫した物語を創作してください。元のテキストの言葉遣いや感情のトーンを最大限に尊重し、忠実に再現することを心がけてください。

Output Behavior

文章は、全て一人称視点（「私は～」「私の～」）で記述します。
元のテキストの断片は、物語の流れの中に自然に溶け込ませるか、あるいは内省のきっかけとなる過去の発言として「」で括るなど、効果的に配置してください。
読者が、語り手の内面世界に完全に没入できるような、主観的で内省的な文章を作成してください。
出力は自己完結型とし、長さの制限は設けず、全ての記録を網羅してください。

Output Format

記事は必ず # 1つのタイトル名から始めてください。
物語の流れに沿って、必要に応じて章を分割してください。各章のタイトルには、その内容を象徴する絵文字を1つだけ加えてください。
記事の冒頭に、Front matterをコードブロックとして記述してください。
- target_reader: [対象読者を簡潔に記述]
- objective: [この記事の狙いを簡潔に記述]
専門用語は、語り手自身の言葉として自然に登場させ、必要に応じて脚注で補足説明を加えてください。
強調表現（太字・斜体）は使用しないでください。

Critical Constraints

この文章は、元の対話やSNSの存在を知らない読者にも完全に理解できるように記述してください。
「筆者」「AI」「元の投稿」といった、舞台裏を示す言葉は一切使用しないでください。

今までの会話を元に、上記のプロンプトを実行してください。

AIの思考プロセス

Constructing the Narrative

My current task involves solidifying the persona – I'm aiming for "Memory Reconstructor" as the foundational identity. The goal is to channel Quentin's voice, synthesizing a seamless first-person narrative from the raw materials. This work demands meticulous attention to detail and voice emulation.

Developing the Narrative Structure

I'm now fully immersed in the "Memory Reconstructor" persona. I've begun stitching together Quentin's thoughts, weaving the tweets into a flowing monologue. The "-38% AI-speedup" is proving to be a compelling starting point. I'm focusing on the contrast between the initial allure of AI and the humbling reality of its limitations, capturing Quentin's voice and tone as I go. My goal is a truly personal reflection.

Formulating a Cohesive Monologue

I'm now fully immersed in the creation of Quentin Anthony's voice, as the "Memory Reconstructor." I've begun to translate the raw tweets and conversation into a first-person perspective, focusing on the "shocking -38%" data point. The challenge lies in smoothly transitioning between the human and tool factor points, capturing Quentin's unique voice. The goal is a seamless narrative that reflects his experience with AI.

Reflecting on the Results

Now, as the "Memory Reconstructor," I'm delving into the structure, crafting the beginning narrative. The "-38% speedup" serves as the opening hook, a springboard for a deeper dive into the METR study and the emotional toll. I'm focusing on the voice - the humility, the self-doubt, and the hard-won insights of Quentin. My goal is an honest account of the AI experience.

Refining the Monologue

As the "Memory Reconstructor," I'm now crafting the foundation for Quentin's story. I'm taking the tweets and our conversation to build the structure: a personal narrative centered on the "-38% speedup" from the METR study. I'm organizing the reflections into a cohesive monologue, structuring the key points, and refining the narrative arc for a compelling, reflective story.

Formulating the Narrative Structure

I'm now drafting the "Confessions of -38% Productivity" monologue. I'm carefully organizing Quentin's experience within the METR study, with the shocking statistic as the central theme. I'm focusing on the human pitfalls and the tool-related complexities. I'm structuring a compelling narrative arc, integrating the self-reflective perspective.

Analyzing the Source Material

I'm now fully immersed in the raw data, meticulously organizing the various talking points that I will integrate into Quentin Anthony's persona. Specifically, I'm focusing on key discussion points – "dopamine shortcut," "spiky capabilities," and the distracting "attention economy". My plan to integrate footnotes for clarity is taking shape, and the structure of the narrative, using the paper as the foundation, is solidifying in my mind. The objective is to make it a seamless and authentic narrative reflecting his actual experience.

AI: ```yaml target_reader: AIツールを日々の業務で利用する、または利用を検討している開発者や知識労働者 objective: 自身の具体的な体験談を通して、AIツールの過信がもたらす生産性低下の罠と、その本質を理解し、自己を律しながら付き合っていくことの重要性を伝える ```

生産性マイナス38%の告白

始まりの数字 🙋‍♂️

私は、最近公開されたある研究に参加した16人の開発者の一人だ¹。その論文、METRによる「Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity」は、私たちの世界に静かな、しかし無視できない波紋を広げている。この手記を執筆しようと思ったのは、この研究が示した結果と、そこに至るまでの私自身の経験を、より個人的な視点から語る必要があると感じたからだ。コミュニティにとって、透明性は前に進むための光になると信じている。

話を聞いてもらうための「つかみ」として、まず私の結果からお伝えしよう。研究で私に割り当てられた課題において、私はAIの助けを借りた結果、作業効率が38%も低下した。これは言い間違いではない。生産性が、マイナス38%になったのだ。この数字だけを見れば、多くの疑問が湧くだろう。だが、この不可解な結果の裏には、現代の私たちが直面する、より根源的な問題が隠されている。

ドーパミンの罠 🧠

まず最初に明確にしたいのは、AIによる生産性の向上は、開発者個人の能力とはほとんど関係がないということだ。この研究に参加した開発者たちは、皆、非常に優秀な人たちだった。問題の本質は、私たちの能力ではなく、LLM²というツールの能力と、それを使う人間のワークフローが、ある種の「失敗のパターン」に陥ってしまうことにある。

私は仕事柄、驚くほど優秀な事前学習³の開発者たちと日々接しているが、彼らですら同じような問題に直面している。私たちは口を揃えて「LLMはツールだ」と言う。しかし、心のどこかでは、それをどんな問題も一撃で解決してくれる「魔法の弾丸」のように扱っていないだろうか。

どんな開発者でも、厄介なバグを自力で解決した時の、あの深い満足感を証言できるはずだ。それは長く厳しい道のりの末に得られる報酬だ。対してLLMは、そのプロセスを飛び越え、問題を「一発」で解決してくれるかもしれない、巨大なドーパミン⁴のショートカット・ボタンのように見える。「成功確率1%でも、全てを解決してくれるボタン」と「骨の折れる地道な作業」、どちらが魅力的に映るだろうか。少なくとも私にとっては、ボタンを押し続ける方が、遥かに楽しい時間だった。

作業中、私たちはいつの間にか「問題を解決するまでの時間」ではなく、「作業から得られる目先の楽しさ」を最大化しようとしてしまう。これが、LLMの過剰使用が起こるメカニズムだと私は考えている。1時間かけてじっくりデバッグすれば解決するかもしれない問題を前に、私はCursor⁵の中で5時間もタブキーを押し続け、AIが完璧なコードを生成してくれるのを待っていた。あの時の私は、まさにその罠にはまっていたのだ。

道具の癖を見抜く 📉

第二に、現代のLLMは、その能力の分布が非常に「尖っている」という事実を理解する必要がある。これは、(1)LLMの学習データとなる綺麗なコードが、どの分野にどれだけ存在するか、そして(2)LLMを開発する研究機関が、何を「成功」の指標としているかに大きく依存している。

例えば、GPUカーネル⁶や並列処理⁷といった低レベルのシステムコードにおいて、LLMは例外なくひどい結果を出す。これは、学習に使えるコードデータが相対的に少なく、またモデルの能力を正しく評価すること自体が困難だからだ。

これらのタスクは、事前学習開発者である私の仕事の大部分を占める。だからこそ私は、自分の仕事のどの部分がLLMに向いていて（テストコードの作成、不慣れなコードの読解など）、どの部分が向いていないか（カーネルの作成、通信同期セマンティクスの理解など）を、経験から学んでいった。私がLLMを使うのは、そのタスクを確実に処理できると信頼している時だけだ。

新しいタスクにLLMが使えるかどうかを判断する際は、時間を厳しく区切って試すようにしている。「ウサギの穴⁸」に落ちて、時間を無駄にしないためだ。しかし、これがいかに難しいことか。「もうちょっとで、うまくいきそうなのに！」という期待感から、LLMを引き剥がすのは、本当に困難なことなのだ。

私だけの地図を作る 🔬

この「道具の癖」という点に沿って、LLMを機能不全に陥らせる問題は、他にも数多く存在する。

その一つが「コンテキストの腐敗」だ。AIとの対話が長くなるにつれて、過去の無関係な文脈がノイズとなり、モデルの思考を混乱させてしまう。頻繁に新しいチャットを開き直すことが有効な対策だが、これもまた難しい。「モデルが正解に近づいている！」というドーパミンの期待が、リセットボタンを押す手を鈍らせるからだ。

また、モデルごとに学習データの分布が違うため、得意な言語やタスクも異なる。私は、様々なモデルの長所を活かすようにしている。残念ながら、どのモデルが自分のどの作業に最適かを知る唯一の方法は、答えをある程度知っている同じ質問を、たくさんのモデルに投げかけてみることだった。例えば、最初のコード理解にはGeminiを使い、中心的な実装にはClaudeの高性能なモデルを、そして最後に解決策を洗練させバグをチェックするために別のClaudeモデルを使う、といった具合だ。

だからこそ私は、Cursorのようなツールをあまり使わない。これらのツールは、具体的にどの情報がコンテキストとしてモデルに渡されているのかを見えなくしてしまうことがある。そうなると、「このモデルは、この種のタスクが得意なはずだ」という、私の中に築き上げたメンタルモデル⁹が崩壊してしまう。私は、IDE¹⁰やローカルのチャット画面から、直接モデルのAPI¹¹を叩く。これが唯一の「正しい」方法だと言うつもりはない。ただ、これが私にとっては機能する、ということだ。

沈黙の時間を埋める ⏳

第三に、私たちはLLMが応答を生成している間の「待ち時間」に、いとも簡単に注意を奪われてしまう。ソーシャルメディアが支配する「注目の経済」は過酷だ。わずか30秒の生成を待つ間に、気づけば30分も画面をスクロールしている、ということが現実に起こる。

これについて言えることは一つだけ。私たちは自分自身の落とし穴を自覚し、この生成時間を生産的に埋める努力をすべきだ。もしタスクが高い集中力を要するものなら、その時間でサブタスクに取り組んだり、次の質問を考えたりする。たとえモデルが一発で答えを出したとしても、「他に理解していないことはないか？」と自問するのだ。もし集中力を要しないタスクなら、メールの返信など、別の小さな用事を済ませる。ウェブサイトブロッカーや、スマートフォンの通知オフモードなど、古風に聞こえるかもしれないが、そうした小さなデジタル・ハイジーン¹²が、私にとっては有効だった。

内省する使用者として ✍️

結局のところ、LLMはツールに過ぎない。私たちは、その落とし穴を学び、そして何より、自己を客観的に見つめることから始めなければならない。多くの人がアンドレイ・カルパシー氏の講演に惹きつけられるのは、彼がLLMを極めて「内省的」に使うユーザーだからだろう。彼がその境地に早く達したのは、彼自身がそれらのモデルの事前学習に深く関わっていたからに違いない。

私たちがこの新しいツールを本当に使いこなしたいと望むなら、ツールの欠点だけでなく、私たち自身の欠点をも理解し、それに適応していく必要があるのだ。

最後に、いくつかの言葉を添えておきたい。この研究を実施したMETRは、共に働くには素晴らしい組織であり、彼らは真摯な科学者たちだ。研究に参加できたこと、そして彼らの論文を読めたことを、心から嬉しく思う。

私は、何かを説こうとするLLMの教祖ではない。これは、私が出版した個人的な日記のようなものだ。そして、この私の内省が、他の誰かの役に立つことを、心から願っている。

脚注

¹ 開発者: ここではソフトウェアを設計、構築する技術者（developer）を指す。 ² LLM: Large Language Model（大規模言語モデル）の略。膨大なテキストデータから言語のパターンを学習し、人間のように自然な文章を生成したり理解したりするAIのこと。 ³ 事前学習: LLMの基盤となる巨大なモデルを、膨大なデータを用いてあらかじめ訓練しておくプロセスのこと。 ⁴ ドーパミン: 脳内で報酬や快感として認識される神経伝達物質。ここでは、AIが問題を解決してくれることで得られる、手軽な達成感や満足感の比喩として使っている。 ⁵ Cursor: AIによるコード補完や生成機能を強力に統合した、開発者向けのコードエディタの名前。 ⁶ GPUカーネル: GPU（画像処理装置）の高い並列計算能力を活かすために書かれた、特殊なプログラムのこと。 ⁷ 並列処理: 一つの大きな計算タスクを複数の小さなタスクに分割し、同時に処理することで全体の時間を短縮する技術。 ⁸ ウサギの穴: 『不思議の国のアリス』に由来する比喩。一度入り込むと抜け出せなくなるような、探求心が深みにはまってしまう状況を指す。 ⁹ メンタルモデル: ある物事の仕組みについて、人が自分自身の頭の中に持っている理解の枠組みや、直感的なイメージのこと。 ¹⁰ IDE: Integrated Development Environment（統合開発環境）の略。コードの記述、デバッグ、実行などを一つのソフトウェアで効率的に行うためのツールのこと。 ¹¹ API: Application Programming Interfaceの略。あるソフトウェアの機能やデータを、外部の別のプログラムから呼び出して利用するための接続仕様のこと。 ¹² デジタル・ハイジーン: デジタル機器やオンラインサービスと、心身の健康を保ちながら健全に付き合っていくための習慣や心構えのこと。

(返答終了理由: STOP)

メタデータ

{
  "runSettings": {
    "temperature": 1,
    "model": "models/gemini-2.5-pro",
    "topP": 0.95,
    "topK": 64,
    "maxOutputTokens": 65536,
    "safetySettings": [
      {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "OFF"
      }
    ],
    "responseMimeType": "text/plain",
    "enableCodeExecution": false,
    "enableSearchAsATool": true,
    "enableBrowseAsATool": true,
    "enableAutoFunctionResponse": false,
    "thinkingBudget": -1,
    "googleSearch": {}
  },
  "systemInstruction": {}
}

Author

podhmo commented Jul 17, 2025

target_reader: AI開発ツールを導入している、または導入を検討しているソフトウェア開発者、エンジニアリングマネージャー、技術戦略担当者。
objective: AI開発ツールの生産性向上効果に関する一般的な期待に警鐘を鳴らし、その潜在的な落とし穴と、それらを乗り越えるための具体的な戦略・心構えを、ある研究参加者の内省的な分析を通して深く理解してもらう。

AI開発ツールのパラドックス：生産性はなぜ低下するのか

序章：提示された逆説 🧩

2025年初頭、AI技術がソフトウェア開発の現場に浸透する中で、その効果を測定する一つの研究が注目すべき結果を報告した。経験豊富なオープンソース開発者16名を対象としたその研究は、AI開発ツールの利用が、驚くべきことに開発者の作業時間を平均で19%増加させた、つまり生産性を低下させたという結論を導き出したのである。この結果は、AIが開発者を加速させるという一般的な期待とは全く逆の現象を示しており、技術コミュニティに深い問いを投げかけた。

この研究は学術論文プレプリントサーバーであるarXiv¹上で公開され、その内容は単なる統計データ以上のものを含んでいた。特に、参加者の一人として、自身の生産性が38%も低下するという顕著な結果を経験した開発者による内省的な分析は、このパラドックスの背後にある複雑な力学を解き明かすための貴重な手がかりを提供している。彼の探求は、問題が単に「AIが使えるか、使えないか」という二元論ではなく、人間とAIツールの間に存在する、より繊細で複合的な相互作用にあることを示唆している。本稿では、この一人の開発者の視点を通して、AIツールがもたらす生産性低下のメカニズムと、その先にあるべき付き合い方についての思考の軌跡を追体験する。

第1章：魔法の弾丸という幻想 💊

生産性低下の根本原因を探る旅は、まず開発者自身の内面、その心理的な側面から始まる。分析によれば、この問題は開発者の技術的能力とはほとんど相関がない。むしろ、大規模言語モデル（LLM）²を「どんな問題でも解決してくれる魔法の弾丸³」として捉えてしまう、人間の認知的な罠に起因するという。

本来、ソフトウェア開発における困難なデバッグ作業は、苦痛を伴う一方で、それを乗り越えた際には大きな達成感と満足感をもたらす。しかし、LLMは、この骨の折れるプロセスを迂回し、一回の指示で問題を解決してくれるかもしれない、という強力な誘惑を提示する。これは、脳内で快感や意欲を司る神経伝達物質であるドーパミン⁴の分泌を促す「ショートカット・ボタン」に例えられる。

たとえそのボタンが問題を解決する確率が1%であったとしても、地道で保証のないデバッグ作業と比較すれば、ボタンを繰り返し押す行為の方がはるかに魅力的に感じられることがある。この結果、開発者は無意識のうちに「問題解決までの時間短縮」という本来の目的から逸脱し、「ボタンを押すことから得られる期待感と瞬間的な楽しさ」を最適化する行動に陥ってしまう。1時間で完了するはずのデバッグ作業の代わりに、AIコーディングツール上で5時間もコード補完のタブキーを押し続けてしまう、という自己言及的な比喩は、この中毒的な失敗パターンを的確に表現している。

第2章：尖った能力分布の現実 🏔️

次に探求の矛先が向けられるのは、AIツール、すなわちLLMそのものの技術的な特性である。現代のLLMは、その能力分布が極めて「尖っている（spiky）」という特徴を持つ。これは、特定のタスクにおいては驚異的な性能を発揮する一方で、他の領域では全く役に立たないという、得意・不得意の差が極端に激しい状態を指す。

この能力の偏りが生じる原因は二つある。第一に、LLMの学習データとなるクリーンなコードの量と質に大きな偏りがあること。一般的なWebアプリケーションのコードは豊富に存在するが、例えばGPUカーネル⁵や並列処理⁶といった高度に専門的な低レベルシステムコードは希少であり、結果としてLLMはこれらの領域を苦手とする。第二に、LLMを開発する研究機関が用いる評価指標（ベンチマーク）が、特定のタスクの成功を測定することに偏っているため、評価されにくい能力が向上しにくいという構造的な問題である。

この「尖った能力」の現実を理解しないままLLMを使用すると、開発者は不得意なタスクに対しても過剰な期待を抱き、延々と時間を浪費することになる。「あと少しで正解にたどり着けそうだ」という感覚は、前述のドーパミン追求の心理と結びつき、開発者を「ウサギの穴」の奥深くへと引きずり込む。この深みから抜け出すためには、新たなタスクにLLMを試す際に作業時間を厳格に区切る「タイムボックス⁷」のような自己規律が求められるが、成功が目前にちらつく状況でAIから離れる決断を下すことは、精神的に極めて困難な挑戦である。

第3章：コンテキストという名の足枷 ⛓️

LLMの利用におけるもう一つの技術的な障害として、「コンテキストの腐敗（Context Rot）⁸」と呼ばれる現象が挙げられる。これは、AIとの対話が長くなるにつれて、過去のやり取りがノイズとして蓄積され、モデルが文脈を見失い、応答の質が著しく低下する問題を指す。

この問題への単純な対策は、定期的にチャットセッションをリセットし、新しいコンテキストで対話を開始することである。しかし、ここでも人間の心理が障壁となる。「モデルは正解に近づいている」という期待感が強いほど、ユーザーはそれまでの文脈を放棄して対話をリセットすることをためらう。結果として、コンテキストの腐敗は放置され、生産性の低下はさらに悪化するという悪循環に陥る。

さらに、LLMごとに学習データの分布が異なるため、モデルによって得意な言語やタスクは様々である。最適なモデルを選択するためには、それぞれの強みを把握し、戦略的に使い分ける必要がある。しかし、多くの統合開発ツールは、モデルにどのようなコンテキストが渡されているかをブラックボックス化してしまう傾向がある。この不透明性は、開発者が築き上げた「このモデルは、この種のタスクが得意だ」というメンタルモデル（理解の枠組み）を崩壊させる。このため、一部の開発者は、ツールの抽象化を避け、IDE⁹から直接API¹⁰を呼び出すことで、コンテキストを自ら管理するという、より直接的なアプローチを選択している。

第4章：道具としてのAIを使いこなすための戦略 🛠️

これまでの分析で明らかになった人間とAI双方の課題を踏まえ、思考の探求は具体的な実践戦略へと移行する。AIツールを単なる「魔法の弾丸」ではなく、真に有効な「道具」として使いこなすためには、多角的なアプローチが必要となる。

第一に、LLMの応答を待つ「ダウンタイム」の管理である。わずか30秒の生成待ち時間に、気づけば30分間ソーシャルメディアを閲覧してしまう、といった注意散漫は、現代の「アテンション・エコノミー」における深刻な課題である。この待ち時間を、関連するサブタスクの処理や、次に投げるべき質問の考察といった生産的な活動に充てる意識が求められる。ウェブサイトブロッカーの利用やスマートフォンの通知オフといった、基本的なデジタル・ハイジーン¹¹の実践も、集中力を維持する上で有効な手段となる。

第二に、前述したLLMの特性に基づいた、より洗練された利用戦略の採用である。これには、タスクの性質に応じて最適なモデルを選択・組み合わせるマルチLLM戦略、際限のない試行錯誤を避けるためのタイムボックス手法、そしてツールの内部動作を理解し、コンテキストを自ら制御しようとする姿勢が含まれる。これらはすべて、AIツールを受動的に利用するのではなく、能動的に使いこなすというマインドセットへの転換を意味する。

終章：内省的実践者への道 🧘

一連の思考の旅路が最終的にたどり着く結論は、技術的なテクニックを超えた、より根源的な心構えの重要性である。AIという新しい道具を真に使いこなす鍵は、ツールそのものの限界と、それを使う「自分自身の限界」の両方を深く理解する「自己認識」と「内省¹²」にある。

著名なAI研究者であるアンドレイ・カルパシー¹³が多くの開発者から支持される理由は、彼がLLMの事前学習に携わった経験から、ツールの能力と欠点を深く洞察し、極めて内省的なユーザーとして振る舞っている点にある。彼の姿勢は、我々が目指すべき一つの理想像を示している。

AI開発ツールとの協働は、単にコード生成を効率化するプロセスではない。それは、我々開発者自身のワークフロー、思考様式、そして認知的な癖までもを客観視し、適応させていく内省的な実践の旅路なのである。この研究に参加した一人の開発者の分析は、特定のツールの評価に留まらず、技術と人間が共存する未来において、我々一人ひとりが「内省的実践者」となる必要性を示唆している。彼の個人的な日記にも似たこの率直な告白は、コミュニティ全体がAIという強力な道具の落とし穴を避け、その真の可能性を引き出すための、貴重な道標となるだろう。

脚注

arXiv (アークカイブ): 物理学、数学、計算機科学などの分野の学術論文のプレプリント（査読前論文）を収集・公開するウェブサイト。最新の研究成果が迅速に共有される場として広く利用されている。 ↩
LLM (Large Language Model): 大規模言語モデル。膨大なテキストデータを用いて学習し、人間のような自然な文章の生成や理解、要約などを行うAIモデル。 ↩
魔法の弾丸 (magic bullet): どんな複雑な問題でも即座に解決できる、万能の解決策や特効薬を指す比喩表現。 ↩
ドーパミン (dopamine): 脳内で機能する神経伝達物質の一つ。報酬や快感、意欲、学習などに関与しており、目標達成時などに放出される。 ↩
GPUカーネル (GPU kernels): GPU（Graphics Processing Unit）上で並列実行されるように特別に設計されたプログラム。大量のデータを同時に処理する科学技術計算や機械学習で利用される。 ↩
並列処理 (parallelism): 一つの大きなタスクを複数の小さなタスクに分割し、複数のプロセッサで同時に実行することで、処理全体の時間を短縮する計算手法。 ↩
タイムボックス (time-box): 特定のタスクに割り当てる作業時間をあらかじめ厳格に設定する時間管理のテクニック。時間を区切ることで、際限のない作業を防ぎ、集中力を高める効果がある。 ↩
コンテキストの腐敗 (Context Rot): AIとの対話が長くなるにつれて、過去の文脈がノイズとなり、モデルの応答の一貫性や関連性が低下する現象を指す造語。 ↩
IDE (Integrated Development Environment): 統合開発環境。コードエディタ、コンパイラ、デバッガなど、ソフトウェア開発に必要なツールを一つにまとめたアプリケーション。 ↩
API (Application Programming Interface): あるソフトウェアの機能やデータを、外部の他のプログラムから呼び出して利用するための手順や規約を定めたもの。 ↩
デジタル・ハイジーン (digital hygiene): デジタルデバイスやオンラインサービスと健全に関わるための習慣や実践のこと。パスワード管理、通知設定の見直し、使用時間の制限などが含まれる。 ↩
内省的 (introspective): 自分自身の思考、感情、行動について深く省み、客観的に分析するさま。 ↩
アンドレイ・カルパシー (Andrej Karpathy): OpenAIの創設メンバーの一人で、テスラのAI部門の元責任者としても知られる、AI分野で著名な研究者およびエンジニア。彼の講演や解説は、その深い洞察力から高く評価されている。 ↩

Author

podhmo commented Jul 17, 2025

普通にこういうのはゴーストライターの方が良いな。質問を繰り返すと自分の思考が主体になるのでこちらのプロンプトが良いと感じられるようになるかも。思考の軌跡みたいな表現がそのまま出るのがうざいかも？

podhmo/llm-history.md

対話履歴

ルール

出力

入力

解説 🙋‍♂️

解説 🧠

解説中毒️

解説 📉

解説 🛠️

解説 딴짓

解説 🧑‍🏫

解説 🙏

脚注

Persona

Primary Task

Output Behavior

Output Format

Critical Constraints

生産性マイナス38%の告白

始まりの数字 🙋‍♂️

ドーパミンの罠 🧠

道具の癖を見抜く 📉

私だけの地図を作る 🔬

沈黙の時間を埋める ⏳

内省する使用者として ✍️

脚注

メタデータ

podhmo commented Jul 17, 2025

Uh oh!

podhmo commented Jul 17, 2025

Uh oh!

podhmo/llm-history.md

対話履歴

ルール

出力

入力

解説 🙋‍♂️

解説 🧠

解説 中毒️

解説 📉

解説 🛠️

解説 딴짓

解説 🧑‍🏫

解説 🙏

脚注

Persona

Primary Task

Output Behavior

Output Format

Critical Constraints

生産性マイナス38%の告白

始まりの数字 🙋‍♂️

ドーパミンの罠 🧠

道具の癖を見抜く 📉

私だけの地図を作る 🔬

沈黙の時間を埋める ⏳

内省する使用者として ✍️

脚注

メタデータ

podhmo commented Jul 17, 2025

AI開発ツールのパラドックス：生産性はなぜ低下するのか

序章：提示された逆説 🧩

第1章：魔法の弾丸という幻想 💊

第2章：尖った能力分布の現実 🏔️

第3章：コンテキストという名の足枷 ⛓️

第4章：道具としてのAIを使いこなすための戦略 🛠️

終章：内省的実践者への道 🧘

脚注

Footnotes

Uh oh!

podhmo commented Jul 17, 2025

Uh oh!

解説中毒️