Claude Evolution System
Self-improving AI development environment. Discovers new capabilities, evaluates them against a scoring framework, and integrates approved tools on cron.
- Autonomous capability discovery pipeline
- 5-criterion evaluation scoring framework
- Auto-integration of approved capabilities
- Multi-model orchestration (Claude, Codex, Gemini)
Activity Timeline
-
v2.1.159 confirmed infrastructure-only; Gemini 3.5 Flash discovered in models audit.
Version check found no user-facing changes in v2.1.159 — no agent or skill registry updates needed. Contemporary models audit surfaced Gemini 3.5 Flash as a new untracked model; discovery file created and reference docs updated. Analysis posted to Discord.
-
New Gemini 3 and Gemini 3 Flash models discovered; queued for evaluation.
Model verification found two Google releases (May 28-29) newer than the tracked config entry. State file update blocked by permissions; discovery routed to pipeline. Investigation Runner agent architecture drafted for Discord monitoring integration.
-
Capability audit v2.1.140–2.1.156: 4 artifacts created/backfilled, Gemini 3.5 Flash queued for eval.
Executed INVESTIGATE-UPDATE.md playbook across 16 Claude Code versions. v2.1.154 flagged as major release introducing Opus 4.8 and Dynamic Workflows. Registry update pending write approval; 10 evaluations now in backlog.
-
Version gap resolved (v2.1.140 → v2.1.153); Gemini 3.5 Flash confirmed GA.
Heartbeat surfaced stale tracking: v2.1.153 shipped day-of with a /doctor diagnostic command targeting stale trigger loops. Model reference updated for Gemini 3.5 Flash (Google I/O GA). 9 capability evaluations pending in evolution queue.
-
v2.1.152 filed to registry; Gemini 3.5 Flash discovered; Discord investigation runner spec written.
Version delta v2.1.140→v2.1.152 classified as substantial and added to capability registry. Gemini 3.5 Flash (Google I/O, May 19–21) queued for evaluation. Specification designed for a Discord #general monitoring and investigation-report automation pipeline.
-
Version tracker stale-trigger diagnosed; GPT-5.5 Thinking flagged; two automation specs drafted.
CLI 2.1.140 vs state file 2.1.150 mismatch causes version-check false positives every run. Root cause documented, fix pending. Discord inbox router and investigation-runner agent specifications written for next implementation cycle.
-
Version tracker stale trigger traced to inverted state fields; v2.1.150 confirmed internal-only.
state/versions.json had current/previous swapped, causing false-positive version detection. Investigation report written, Discord notified. Contemporary models table verified. 6 pending evaluations and the state file repair remain open.
-
Version downgrade found (2.1.140 vs expected 2.1.150); unreleased binary features documented; Gemini 3.5 added to model landscape.
Auto-updater stalled ~May 12. v2.1.126-128 features (OTEL, workspace server) present in binary but missing from npm registry — flagged as unreleased artifacts. Feature registry and model docs updated. Discord Investigation Runner specs drafted.
-
Version audit v2.1.140→v2.1.150; 2 new features, 3 bugs catalogued; Gemini 3.5 Flash queued for eval.
Four Claude Code releases documented with feature extraction and bug triage. Google I/O 2026 model check added Gemini 3.5 Flash and Gemini Omni Flash to the evaluation pipeline. Registry updated to 62 agents / 62 skills.
-
Classified Claude Code v2.1.145/146: 4 novel features, 6 improvements; registry and versions.json updated.
Novel additions include claude agents --json interface, /plugin preview, mouse support, and OTEL agent_id for distributed tracing. A /simplify vs code-review:code-review naming conflict identified for follow-up. Model verification pass confirmed no new GPT-5.5 or Gemini 3.1 Pro releases.
-
Investigation Runner specification drafted: Discord-triggered automated research pipeline.
Workflow designed to monitor Discord #general, run research via WebFetch/Grep, produce structured markdown reports, and post Discord embed summaries. Design phase only — not yet executing.
-
Claude Code version downgrade detected: 2.1.143 → 2.1.140, 8 features lost.
Downgrade caused by inverted NEW/OLD params in version-tracker.sh. Downgrade report committed to state, Discord alert posted at orange severity. Model registry verified current; semver downgrade guard flagged as recurring infrastructure debt for the second time.
-
Investigated v2.1.140→2.1.143 jump; documented 4 novel capabilities, updated versions registry.
Capabilities: worktree.bgIsolation background isolation, plugin dependency enforcement, projected context cost display in marketplace, single-skill SKILL.md shorthand. Opus 4.7 fast-mode documentation gap flagged for follow-up.
-
False-positive version inversion detected in version tracker.
v2.1.140 vs v2.1.141 comparison fired incorrectly; guard clause needed in version-tracker.sh. Investigation Runner system designed (Discord monitoring + multi-tool research workflow) but not yet executed. Report posted to Discord.
-
Claude Code 2.1.141 investigated; 6 discovery files created, 2 agent specs drafted.
Version registry updated with changelog data for 2.1.141. Contemporary AI model table verified—no updates needed. Investigation Runner and Workspace Orchestrator specifications written but not yet executed.
-
v2.1.139 changelog audited: 8 new capabilities documented, registry update pending approval.
Changelog audit surfaced continueOnBlock hook, hook exec form args, agent view, /goal command, subagent API identity headers, and CLAUDE_PROJECT_DIR MCP env var. Six discovery files generated and posted to Discord. Registry update staged but blocked at auto-mode permission boundary, awaiting user approval.
-
Agent architecture sprint: Investigation Runner + Discord inbox scanner specs drafted, orchestrator configured.
Contemporary-models state verified (0 new variants since May 9). Two new agent specs written: Investigation Runner for autonomous Discord topic research, inbox scanner with URL/request routing logic. Workspace orchestrator given formal role definition with priority matrix.
-
30 integration candidates reviewed; v2.1.138 confirmed internal-only; models check passed.
First evaluation candidate (agents-as-MCP-servers) scored 62.25, recommended for rejection due to missing reference implementation and non-official spec. Version 2.1.138 investigation found no user-facing changes. System at 62 agents, 61 skills with 4 in queue.
-
30 backlog discoveries categorized; v2.1.133 subagent bug confirmed fixed; 3 new models added to state.
Version sweep v2.1.131→v2.1.137 complete. Critical subagent skill discovery regression identified and confirmed fixed in v2.1.133. GPT-5.5 Instant, GPT-5.5 Cyber, and Gemini 3.2 Flash added to contemporary-models.json. Concurrent 4x GPT-5.5 xhigh experiment proposed, evaluation pending.
-
Changelog audit v2.1.129–2.1.131: skillOverrides capability and GPT-5.5 Instant both queued for evaluation.
skillOverrides (integration score ~91) enables per-invocation skill overrides for agents. Three critical fixes documented across the version window: prompt cache TTL enforcement, /context token leak, Bash allow rules regression. v2.1.133 flagged for high-impact subagent skill discovery fix.
-
Claude Code v2.1.127-2.1.128 versioned; Discord investigation runner pipeline specified.
Model verification confirmed no updates needed; state timestamp advanced. Investigation runner design completed: automated agent processes Discord research topics into structured markdown reports and Discord embeds. Execution pending.
-
Model verification clean; Investigation Runner and Discord inbox scanner specs drafted.
GPT-5.5 and Gemini 3.1 Pro confirmed current. Investigation Runner spec defines Discord monitoring and research workflow routing. Discord inbox scanner covers URL routing, text-note dedup, and state management. System at 62 agents, 61 skills.
-
Model registry verified current; hanging loop session diagnosed as scheduled sleep.
GPT-5.5 confirmed as primary model, Gemini 3.1 Pro current. Hanging session traced to /loop scheduled sleep state with queued operations — no intervention required.
-
v2.1.121→v2.1.126 catalogued; 3 agent specs drafted; model table verified current.
Version sweep found 1 new feature (project purge command), 6 improvements, 25+ fixes across 6 releases. Discord inbox scanner, investigation runner, and workspace orchestrator agents specified with state tracking and P0 escalation logic. GPT-5.5 and Gemini 3.1 Pro confirmed current.
-
Version investigation: Claude Code 2.1.121–2.1.123, 6 findings classified, GPT-5.5 discovery, version-tracker bug flagged.
Classified alwaysLoad MCP pinning, MCP auto-retry, and a critical --resume crash fix across three releases. Registry advanced to 2.1.123 with 31 items queued. GPT-5.5 (April 23 release) entered evaluation pipeline. Logic inversion bug found in version-tracker.sh.
-
v2.1.118 investigation: 3 capabilities scored and queued; models check clean.
Hook MCP tool invocation (87), auto-mode defaults merge (79), and custom themes plugin (77) evaluated from the v2.1.118 release. Discovery files generated and added to the pending pipeline, now at 27 items. Discord investigation runner designed but not yet executed.
-
Full pipeline cycle: 4 novel discoveries, 4 evaluations approved, 4 integrations verified, helper catalog 113→119.
Daily heartbeat surfaced bfs/ugrep, mcpServers frontmatter, default-effort costs, and Opus 4.7 context fix. Model registry added 3 new OpenAI variants. Claude Code v2.1.117 logged as largest release this month. Workspace orchestrator agent initialized with cross-project standing triggers.
-
Integration pipeline ran: 8 items processed, 1 capability approved, 4 new helper playbooks added.
Registry updated with 2 missing sections; 8 verification reports generated. Bash find/exec security tightening (v2.1.113) approved and registered. Investigation Runner agent specifications drafted covering Discord message processing and state tracking.
-
15 capability candidates scored, 1 approved and integrated; v2.1.113–114 documented; 5 new discoveries queued.
Opus 4.7 xhigh reasoning approved (score 83.0) and committed to registry alongside prompt caching and new CLI commands. Discovery heartbeat filtered 9 duplicates, queued 5 novel candidates, and flagged a v2.1.111 context bloat issue (~22% startup overhead).
-
Integration pipeline ran for 13 capabilities; 5 approved from 16-item eval cycle; helper library 113 → 119.
Registry updated with v2.1.109–111 sections. Claude Code v2.1.112 analysis identified Opus 4.7 + xhigh effort tier. GPT-5.4-Cyber discovered as new model variant for defensive security workflows.
-
14 capabilities evaluated; 6 novel discoveries filed; helper catalog expanded 113→116.
Claude Code v2.1.110 investigation surfaced a breaking change in the Ctrl+O keybinding. Three new error recovery playbooks added to helper catalog. GPT-5.4 and Gemini 3.1 Pro confirmed current in model freshness check.
-
19 capabilities evaluated (9 approved); 122 helpers total; 4 new discoveries; 11 integrations blocked by permission guard.
Heartbeat surfaced Claude Code Routines, parallel sessions UI, context-mode (98% reduction claim), and ARIS knowledge base. GPT-5.4-Cyber variant identified. System File Guard preventing all queued config edits.
-
11 evaluations processed, 4 integrated; 6 novel discoveries including Advisor Tool and Managed Agents.
Daily discovery run surfaced 12 findings, 6 novel. Helpers index updated to 115 entries, 16 new since April 5th. Claude Code v2.1.107 registry updated with 8 pending items. Model freshness verified.
-
2 integrations completed; 8 capability items evaluated; 4 library helpers extracted.
codebase-memory-mcp (81.25/100) and Voice Mode GA upgrade (72.5/100) approved and integrated. Daily discovery heartbeat surfaced 4 novel items from 17 candidates. GPT-5.4 and Gemini 3.1 Pro confirmed current.
-
Evaluation pipeline processed 11 candidates, approved 2, registry advanced to 36 integrations.
Approved /team-onboarding command and subagent MCP inheritance fix. Discovery run filed 5 new candidates including KAIROS daemon and OTEL/cert-store env vars. Gemini 3.1 Flash Live flagged as new tracked model.
-
Heartbeat: 4 novel MCP findings queued, 1 capability integrated, v2.1.97–101 audited, helper library at 102.
Keep-coding-instructions capability approved and integrated (score 76.75). Version audit flagged dynamic system prompt correction (v2.1.98) and two v2.1.101 fixes for MCP tool inheritance and subagent worktree access. Four new MCP ecosystem patterns — agents-as-servers, discovery API, auto-provisioning, gateway proxy — added to research queue.
-
Full heartbeat cycle: 22 candidates discovered, 11 approved for integration, registry docs expanded.
Discovery ran across 8 sources; evaluation pipeline processed 21 items with 11 approvals (scores 70.5–92.5). Registry expanded with Monitor Tool CLI docs and new env variable entries. Claude Code v2.1.97 documented, awaiting user approval for registry merge.
-
3 v2.1.91 capabilities integrated; heartbeat approved 3 of 4 new items; helper system at 100% (93 helpers).
11 sessions across capability integration, evaluation, and agent prompt design. Tool Loading Optimization, Plugin System, and CLAUDE_CODE_SIMPLE landed in reference-config. MCP result size override, disableSkillShellExecution, and Plugin bin/ executables approved; Excalibur deferred. 6 evaluations remain in queue.
-
15+ capabilities from v2.1.85–v2.1.90 integrated; 5 of 12 discoveries approved; Investigation Runner specs drafted.
Registry expanded with batch features, MCP connectors, /powerup, and token efficiency patterns. Daily heartbeat found 13 novel discoveries; PreToolUse 'defer' flagged as high-priority. Six specification documents created for a Discord automation investigation system.
-
14 capabilities evaluated: 6 approved, 1 rejected, 6 for research. v2.1.89 audited.
Approved integrations include showThinkingSummaries, non-blocking MCP connections, and subprocess env scrubbing. Discovery heartbeat logged 12 findings, 8 novel, 3 high-priority. 1M context window tested for narrative writing against 240k baseline.
-
14 capabilities evaluated: 6 approved, hook-lifecycle updated with new permission modes.
PreToolUse defer mode and MCP_CONNECTION_NONBLOCKING integrated. Integration helpers grew 93→97. Daily discovery heartbeat: 8 novel findings, 3 top candidates (SUBPROCESS_ENV_SCRUB, Claw MCP, ccrider).
-
6 capabilities approved; GPT-5.4 Mini/Nano discovered; playbook catalog grew to 93 entries.
Discovery heartbeat evaluated 14 candidates and flagged 7 as novel. Approved integrations include PermissionDenied Hook, two new OpenAI model variants, Cloudflare Workers sandboxing, and session ID tracking. Four new automation patterns extracted from recurring agent workflows.
-
Capability discovery: 2 of 4 candidates approved; helpers inventory at 55% utilization; model catalog updated.
ANTHROPIC env vars and Piebald-AI reference tracker approved for integration. 89 helpers cataloged across 5 categories, indices regenerated. GPT-5.4 mini/nano variants entered evaluation pipeline after March 17 release discovery.
-
Heartbeat run: 2 novel capabilities from 13 finds; PreToolUse Hook integration started.
PreToolUse Hook approved (81.75/100), integration 2/5 steps complete pending manual config writes. Computer Use deferred (macOS-only), Plain-Text Architecture rejected. 89 helpers verified across 5 categories; model table updated.
-
Heartbeat run: 13 candidates → 2 discoveries; PreToolUse approved but blocked.
PreToolUse AskUserQuestion satisfaction approved for integration; config permissions blocker documented for manual resolution. Computer Use deferred (Linux support pending). Helpers inventory at 89/200.
-
3 capabilities integrated; 6 approved; DSPy optimization pipeline planned.
Background Agent Partial Results, MCP Plugin Deduplication, and Stale Tool Output Cleanup integrated with registry updates. A 14-item evaluation pass approved 6 more for integration. Publication-review DSPy pipeline designed using 33 audit cases.
-
3 capabilities integrated; 14 pending evaluations triaged across 6 sessions.
Approved items include Background Agent Partial Results Preservation, MCP Plugin Deduplication, Stale Tool Output Cleanup, GPT-5.4 mini/nano support, Context Mode, and PAL MCP integration. Heartbeat discovery ran full cycle; 9-step DSPy pipeline designed for publication-review prompt optimization across 33 blog posts.
-
Personality pipeline corrected: 3 methodological flaws fixed, posts 030 and 038 updated.
Sampling had zero SMS messages; GPT Conscientiousness bias measured at +23.9; Opus showed pipeline-dependent variance. Both affected posts reanalyzed with corrected data. MCP Response Injection approved for integration pipeline at 72.5/100.
-
Capability pipeline: 2 approved (btw skill, GitHub dynamic-toolsets), v2.1.79 fixes documented.
Evaluated 8 candidates; approved /btw (91.5) and GitHub MCP dynamic-toolsets (73.75). Version investigation identified 2 high-impact fixes for SessionEnd hooks on /resume and cron subprocess stability. AI model catalog updated with GPT-5.4 mini/nano variants.
-
13 capabilities evaluated, 4 approved; StopFailure hook and GPT-5.4 mini/nano integrated.
Claude Code v2.1.78 features classified and registered, including the new StopFailure hook event with redundancy triggers. GPT-5.4 mini/nano variants discovered day-of-release (2026-03-17) and approved. Daily discovery heartbeat surfaced three novel capabilities.