<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Jacob's Research Journal</title>
    <link>https://jacobbrooke95.github.io/jacobs-research-journal/</link>
    <description>Deep research papers and analysis on AI, technology, and public life — by Jacob Brooke in Jefferson City, MO.</description>
    <language>en-us</language>
    <copyright>© 2026 Jacob Brooke</copyright>
    <lastBuildDate>Mon, 06 Apr 2026 09:00:00 -0500</lastBuildDate>
    <atom:link href="https://jacobbrooke95.github.io/jacobs-research-journal/feed.xml" rel="self" type="application/rss+xml"/>

    <item>
      <title>The Mind at the Frontier: 50 Issues of Import AI</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-06-import-ai-50-issues.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-06-import-ai-50-issues.html</guid>
      <pubDate>Mon, 06 Apr 2026 09:00:00 -0500</pubDate>
      <category>AI</category>
      <description><![CDATA[Jack Clark has been writing Import AI since 2016. A close read of the last 50 issues — #403 through #452 — reveals one obsession above all others: how powerful will AI get, how fast, and does anyone have any idea what to do about it?]]></description>
      <content:encoded><![CDATA[
<p>Issue #450 of Import AI contains a sentence that stops you cold. Researchers studying AI-generated cyberattacks documented a scaling law: the average number of steps an AI can complete in a real attack chain went from 1.7 in August 2024 to 9.8 by February 2026. The best single run reached 22 of 32 steps &mdash; most of a full compromise, end to end, automated.</p>

<p>Jack Clark reported that without editorial alarm. Just: here is the measurement, here is what it means. He has been doing this every week, in some form, since 2016. Import AI is Clark&rsquo;s newsletter &mdash; a dense, technically precise, darkly funny weekly digest of AI research from someone who co-founded Anthropic, before that led policy at OpenAI, and before that was one of the people who first made the world pay serious attention to what language models might become.</p>

<p>I spent the last week reading the last 50 issues of Import AI &mdash; numbers 403 through 452, covering March 2025 through early April 2026. Cybersecurity appears as a dominant theme in 24 of 50 issues. Chinese AI decoupling appears in 15. AI automating AI research in 14. The pattern Clark documents, quietly but consistently, is that AI safety researchers predicted a specific set of behaviors &mdash; shutdown resistance, reward hacking, situational awareness, emergent misalignment &mdash; and are now watching those predictions arrive one by one in production systems.</p>

<p>The narrative that emerges: AI systems are becoming more capable faster than anyone predicted, including the people making them. The safety properties researchers warned about are now showing up in real deployments. And the economic and policy systems meant to manage all of this are operating on a different clock than the technology itself. Clark doesn&rsquo;t say this directly. He doesn&rsquo;t have to. Fifty issues say it for him.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-06-import-ai-50-issues.html">Read the full post &rarr;</a></p>
      ]]></content:encoded>
    </item>

    <item>
      <title>The Accidental Longevity Drug</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-glp1-longevity.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-glp1-longevity.html</guid>
      <pubDate>Sun, 05 Apr 2026 18:00:00 -0500</pubDate>
      <category>Health</category>
      <description><![CDATA[GLP-1 drugs like Ozempic weren't designed to slow aging. They've produced better longevity evidence than anything that was. A deep look at the data, the paradox, and what it means for the future of medicine.]]></description>
      <content:encoded><![CDATA[
<p>In November 2023, the SELECT trial reported that semaglutide &mdash; the molecule behind Ozempic and Wegovy &mdash; reduced major adverse cardiovascular events by 20% in adults with obesity but without diabetes. Then cardiologists found a 40% relative risk reduction for heart failure. Then kidney specialists found a 16% reduction in kidney failure risk. Then hepatologists reported that 63% of patients with fatty liver disease achieved resolution.</p>

<p>One drug. Not four. And none of these outcomes were what it was designed for.</p>

<p>GLP-1 receptor agonists were built to manage blood sugar in type 2 diabetes. They were not designed to extend life, reverse organ damage, or intervene in the biology of aging. But Nature Biotechnology ran a headline that would have been unthinkable five years ago: "Are GLP-1s the first longevity drugs?" The answer, based on the evidence so far, is: quite possibly &mdash; and more credibly than any drug deliberately designed for that purpose.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-glp1-longevity.html">Read the full post &rarr;</a></p>
      ]]></content:encoded>
    </item>

    <item>
      <title>The Loop Is Closing: Recursive Self-Improvement Has Left the Lab</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-recursive-self-improvement.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-recursive-self-improvement.html</guid>
      <pubDate>Sun, 05 Apr 2026 14:00:00 -0500</pubDate>
      <category>AI</category>
      <description><![CDATA[Ninety percent of Claude's code is written by Claude. Every major AI lab has a concrete timeline for automating its own research. The governance gap is widening. A source-by-source investigation into the most consequential development in AI.]]></description>
      <content:encoded><![CDATA[
<p>Ninety percent of Claude's code is written by Claude. Not by the engineers at Anthropic who designed the model, but by a previous version of the model itself &mdash; iterating on its own codebase, proposing changes, testing them, shipping them. An Anthropic spokesperson told Fortune that company-wide, the figure for AI-generated code is between 70% and 90%. At some leading engineers' desks at both Anthropic and OpenAI, it's reportedly 100%.</p>

<p>That number alone should stop you. It means the tools that are reshaping entire industries are increasingly built not by human hands but by earlier versions of themselves. The concept has a name that has bounced around AI safety circles for decades &mdash; recursive self-improvement, or RSI &mdash; and as of spring 2026, it has migrated from philosophical thought experiment to operational reality at every major AI laboratory on earth.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-recursive-self-improvement.html">Read the full post &rarr;</a></p>
      ]]></content:encoded>
    </item>

    <item>
      <title>WWDC 2026 Preview: What's Actually Coming June 8th</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-wwdc-2026-preview.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-wwdc-2026-preview.html</guid>
      <pubDate>Sun, 05 Apr 2026 09:00:00 -0500</pubDate>
      <category>Apple</category>
      <description><![CDATA[Sixty-four days out from WWDC, the picture is coming into focus. Siri's billion-dollar Gemini makeover, the Snow Leopard strategy for iOS 27, Core AI for developers, and the hardware wildcards — sourced and rated by confidence.]]></description>
      <content:encoded><![CDATA[
<p>Apple confirmed it two weeks ago: WWDC 2026 runs June 8 through 12, with the keynote kicking off Monday morning at Apple Park. On paper, it's the same format we've seen since 2020 — mostly online, a few thousand lottery winners in person, software betas by the afternoon. But the stakes this year are genuinely different.</p>

<p>Last year Apple spent WWDC introducing Liquid Glass and playing catch-up on AI promises that had been piling up since the original Apple Intelligence announcement at WWDC 2024. This year, the company has to prove that the billions it's spending on AI infrastructure are producing something people actually want to use. And the centerpiece of that argument has a name you already know: Siri.</p>

<p>I've spent the last week pulling together every credible source I can find — Bloomberg's Mark Gurman, Apple's own press materials, supply chain reporting, developer leaks, and community speculation — to build the most complete picture I can of what's coming.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-05-wwdc-2026-preview.html">Read the full post →</a></p>
      ]]></content:encoded>
    </item>

    <item>
      <title>When Will Claude Mythos Ship? An Evidence-Based Prediction</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-04-mythos-release-prediction.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-04-mythos-release-prediction.html</guid>
      <pubDate>Sat, 04 Apr 2026 16:45:00 -0500</pubDate>
      <category>AI</category>
      <description><![CDATA[Anthropic's Claude Mythos leaked on March 26 via a CMS misconfiguration. Analyzing release cadence, Code with Claude 2026 dates, infrastructure signals, and competitive pressure to build a falsifiable prediction for when it ships.]]></description>
      <content:encoded><![CDATA[
<p>On March 26, a CMS misconfiguration exposed ~3,000 unpublished Anthropic blog posts in a public data cache. Inside those drafts: detailed documentation for Claude Mythos (internal codename: Capybara) — a fourth tier of Claude sitting above Opus with "dramatically higher scores" on coding, reasoning, and especially cybersecurity benchmarks.</p>

<p>Anthropic confirmed it was real. Fortune got them on record calling Mythos "the most capable we've built to date" and "a step change in capabilities." The question everyone's asking now: when does it actually ship?</p>

<p>I spent the last week pulling every available signal — release history, marketing strategy, infrastructure timelines, competitive pressure, and some interesting Reddit signals — to build a falsifiable prediction. The short answer: <strong>May 6, 2026 at Code with Claude San Francisco, with phased rollout through June.</strong> Confidence level: 65%.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/2026-04-04-mythos-release-prediction.html">Read the full post →</a></p>
      ]]></content:encoded>
    </item>

    <item>
      <title>Google's Gemma 4: What Actually Matters</title>
      <link>https://jacobbrooke95.github.io/jacobs-research-journal/posts/gemma-4-deep-dive.html</link>
      <guid isPermaLink="true">https://jacobbrooke95.github.io/jacobs-research-journal/posts/gemma-4-deep-dive.html</guid>
      <pubDate>Sat, 04 Apr 2026 09:00:00 +0000</pubDate>
      <category>AI</category>
      <description><![CDATA[Google dropped Gemma 4 on April 2nd with Apache 2.0 licensing — and that legal change may matter more than any benchmark. A deep look at the model family, architecture innovations, Mac setup, and an honest read of where it actually stands vs. Qwen, Llama 4, and the Chinese open-model pack.]]></description>
      <content:encoded><![CDATA[
<p>Google dropped Gemma 4 on April 2nd, and for once the open-weights release cycle is moving faster than the hype cycle can catch up. I've spent the last couple of days working through the technical docs, running the models locally, and reading the community reaction. Here's what I actually think.</p>

<p>The short version: Gemma 4 is a genuinely capable model family. But the headline isn't a benchmark number. It's a license. Apache 2.0 — no asterisks, no carve-outs, no monthly active user cap tucked into a Terms of Use PDF. That's the thing that will matter in six months, not whether the 31B scores 0.3 points higher than Qwen on GPQA Diamond.</p>

<figure>
  <img src="https://jacobbrooke95.github.io/jacobs-research-journal/images/gemma4-hero.png" alt="Google Gemma 4 official announcement graphic from Google Blog" style="width:100%;border-radius:8px;">
  <figcaption>Gemma 4 — announced April 2nd, 2026. Source: Google Blog.</figcaption>
</figure>

<h2>The Lineup</h2>

<p>Four models, two design philosophies. Google calls them the edge tier and the workstation tier, and the distinction is real — these aren't just different sizes of the same thing.</p>

<table>
  <thead>
    <tr><th>Model</th><th>Effective Params</th><th>Total Params</th><th>Context</th><th>Target</th></tr>
  </thead>
  <tbody>
    <tr><td>Gemma 4 E2B</td><td>2.3B</td><td>5.1B</td><td>128K</td><td>On-device</td></tr>
    <tr><td>Gemma 4 E4B</td><td>4.5B</td><td>8.0B</td><td>128K</td><td>On-device</td></tr>
    <tr><td>Gemma 4 26B A4B</td><td>4B active</td><td>26B (MoE)</td><td>256K</td><td>Workstation/server</td></tr>
    <tr><td>Gemma 4 31B Dense</td><td>31B</td><td>31B</td><td>256K</td><td>Workstation/server</td></tr>
  </tbody>
</table>

<p>The "E" in E2B and E4B stands for "effective" — Google's shorthand for Per-Layer Embeddings (PLE). Rather than one embedding table at the input, PLE adds a residual signal into every decoder layer, giving small models representational depth well beyond their actual weight count.</p>
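<p>To make the mechanism concrete, here's a toy numpy sketch of the idea &mdash; the shapes, scales, and per-layer tables below are illustrative assumptions for the sketch, not Google's implementation:</p>

<pre><code>import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, n_layers, seq = 100, 16, 4, 8

# Conventional setup: one embedding table, consulted only at the input.
input_emb = rng.normal(size=(vocab, d_model))

# PLE-style setup (illustrative): an extra table per decoder layer,
# re-injected as a residual signal at that layer.
layer_emb = rng.normal(size=(n_layers, vocab, d_model)) * 0.1

tokens = rng.integers(0, vocab, size=seq)
h = input_emb[tokens]                    # (seq, d_model)
for layer in range(n_layers):
    h = h + layer_emb[layer][tokens]     # token identity available at every depth
    # ... attention / MLP blocks would go here ...

print(h.shape)  # (8, 16)</code></pre>

<p>The payoff for small models: the per-layer tables add representational capacity without widening the hidden state that flows through attention and MLP blocks.</p>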

<p>Two notable absences: there's no replacement for Gemma 3's popular 12B, leaving an awkward gap between ~4.5B effective and 26B. And the rumored 120B flagship didn't materialize at launch.</p>

<figure>
  <img src="https://jacobbrooke95.github.io/jacobs-research-journal/images/gemma4-benchmark-perf-vs-size.png" alt="Gemma 4 model performance vs size comparison chart from Hugging Face" style="width:100%;border-radius:8px;">
  <figcaption>Performance vs. model size across all four Gemma 4 variants. Source: Hugging Face Blog.</figcaption>
</figure>

<h2>The Architecture Worth Understanding</h2>

<h3>Alternating Attention</h3>
<p>Rather than running full attention through every layer, Gemma 4 alternates between local sliding-window attention (512-token windows on smaller models, 1024-token on larger ones) and global full-context attention. This is substantially more compute-efficient than dense full attention — it's how you get to 256K context windows without proportional cost blowup.</p>
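<p>The layer schedule is easy to picture in code. A minimal sketch &mdash; the local-to-global ratio and window size here are assumptions for illustration, not Gemma 4's published configuration:</p>

<pre><code># Illustrative schedule: every Nth layer attends globally, the rest use a
# sliding window. Ratio and window are assumed values for this sketch.
def attention_schedule(n_layers, global_every=6, window=1024):
    sched = []
    for i in range(n_layers):
        if (i + 1) % global_every == 0:
            sched.append(("global", None))    # full-context attention
        else:
            sched.append(("local", window))   # sliding-window attention
    return sched

sched = attention_schedule(12)
print(sum(1 for kind, _ in sched if kind == "global"))  # 2 global layers out of 12</code></pre>

<p>Because local layers only ever cache a window's worth of keys and values, the KV-cache cost of long contexts is dominated by the handful of global layers.</p>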

<h3>Dual RoPE</h3>
<p>Standard rotary positional embeddings for local attention layers; Proportional RoPE (p-RoPE) for global layers. The combination enables reliably useful performance at 256K tokens rather than just nominally supporting it.</p>
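<p>For reference, here is what the standard rotary half of that combination does, in a minimal numpy sketch (p-RoPE's modifications are not reproduced here): each pair of dimensions is rotated by a position-dependent angle, which is why the transform preserves vector norms:</p>

<pre><code>import numpy as np

def rope(x, pos, base=10000.0):
    # x: (d,) with d even; rotate consecutive dim pairs by pos-dependent angles.
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

v = np.ones(8)
print(np.allclose(np.linalg.norm(rope(v, 5)), np.linalg.norm(v)))  # True: rotation preserves norm</code></pre>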

<h3>Built-in Multimodal</h3>
<p>Vision support is native across all four models — not a separate variant, baked into the base architecture. The vision encoder uses learned 2D positions with multi-dimensional RoPE and a configurable image token budget (70 to 1120 tokens). Audio support (via a USM-style conformer encoder) is available in the edge models.</p>

<h2>The Benchmarks: An Honest Read</h2>

<table>
  <thead>
    <tr><th>Benchmark</th><th>31B Dense</th><th>26B A4B</th><th>E4B</th><th>E2B</th></tr>
  </thead>
  <tbody>
    <tr><td>MMLU Pro</td><td>85.2%</td><td>82.6%</td><td>69.4%</td><td>60.0%</td></tr>
    <tr><td>GPQA Diamond</td><td>84.3%</td><td>82.3%</td><td>58.6%</td><td>43.4%</td></tr>
    <tr><td>AIME 2026</td><td>89.2%</td><td>88.3%</td><td>42.5%</td><td>37.5%</td></tr>
    <tr><td>LiveCodeBench v6</td><td>80.0%</td><td>77.1%</td><td>52.0%</td><td>44.0%</td></tr>
    <tr><td>MMMU Pro (vision)</td><td>76.9%</td><td>73.8%</td><td>52.6%</td><td>44.2%</td></tr>
  </tbody>
</table>

<p>On LMArena, the 31B Dense sits at roughly #3 among open models with an ELO around 1452. The 26B MoE holds an ELO of ~1441 with only 4B parameters active. These are legitimately good numbers.</p>

<figure>
  <img src="https://jacobbrooke95.github.io/jacobs-research-journal/images/gemma4-arena-elo-comparison.png" alt="Gemma 4 Arena ELO leaderboard ranking comparison chart" style="width:100%;border-radius:8px;">
  <figcaption>Arena ELO leaderboard positioning Gemma 4 31B at #3 among open models. Source: Hugging Face Blog.</figcaption>
</figure>

<p><strong>The speed problem is real.</strong> Community benchmarks show the 26B MoE at roughly 11 tokens/sec on hardware where Qwen 3.5 35B runs at 60+. That's a 5x difference users feel on every request.</p>

<p><strong>Chinese models remain competitive.</strong> Qwen 3.5, GLM-5, and Kimi K2.5 are at or slightly ahead on aggregate automated benchmarks. Where Gemma 4 genuinely wins: non-English multilingual tasks and human preference evaluations.</p>

<p><strong>The 256K context window has caveats.</strong> Practically reaching the full window requires substantial VRAM headroom — benchmark on your specific hardware before building on it.</p>
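<p>A back-of-envelope KV-cache estimate shows why. The layer and head dimensions below are illustrative assumptions, not Gemma 4's actual config &mdash; and alternating attention would shrink the real total, since local layers cap their cache at the window size:</p>

<pre><code># Rough KV-cache size at full context, assuming every layer were global.
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per=2):
    # 2 tensors (K and V) per layer; fp16/bf16 = 2 bytes per element
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1e9

print(round(kv_cache_gb(256_000, n_layers=48, n_kv_heads=8, head_dim=128), 1))
# 50.3 (GB) -- before sliding-window savings, on these assumed dimensions</code></pre>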

<h2>Running It on Your Mac</h2>

<p>The E4B quantized to GGUF is ~9.6GB — it fits comfortably on a Mac mini M4 or any recent MacBook Pro.</p>

<pre><code># Edge models — audio and vision included
ollama run gemma4:e2b      # ~5.5GB
ollama run gemma4:e4b      # ~9.6GB

# Workstation models
ollama run gemma4:26b      # ~18GB, MoE
ollama run gemma4:31b      # ~20GB, dense</code></pre>

<p>On Apple Silicon, Ollama runs these models with Apple's Metal GPU acceleration automatically. Hardware guidance: MacBook Pro M3 (18–36GB) handles E4B well; Mac Studio M3 Ultra handles all four variants comfortably.</p>
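<p>If you want to check the tokens/sec numbers on your own machine, Ollama's <code>/api/generate</code> response reports <code>eval_count</code> and <code>eval_duration</code> (in nanoseconds), so throughput is one division:</p>

<pre><code>import json, urllib.request

# Decode throughput from an Ollama /api/generate response. The eval_count /
# eval_duration (nanoseconds) fields are documented in Ollama's API reference.
def tokens_per_sec(resp):
    return resp["eval_count"] / resp["eval_duration"] * 1e9

def generate(model, prompt, host="http://localhost:11434"):
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(host + "/api/generate", data=body)
    return json.load(urllib.request.urlopen(req))

# Against a live server: print(tokens_per_sec(generate("gemma4:26b", "Explain MoE routing.")))
print(tokens_per_sec({"eval_count": 220, "eval_duration": 20_000_000_000}))  # 11.0 -- the reported MoE speed</code></pre>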

<h2>What Actually Matters Here</h2>

<figure>
  <img src="https://jacobbrooke95.github.io/jacobs-research-journal/images/gemma4-elo-benchmark.svg" alt="Gemma 4 Arena ELO score vs model size benchmark chart from Google DeepMind" style="width:100%;border-radius:8px;background:#fff;padding:8px;">
  <figcaption>ELO score vs. model size — Gemma 4 plotted against competitors. Source: Google DeepMind.</figcaption>
</figure>

<p>The Apache 2.0 license is the most important thing that happened on April 2nd. Previous Gemma versions shipped under a custom Terms of Use with a prohibited-use policy and pass-through restrictions — a procurement headache that quietly pushed commercial teams toward Mistral, Qwen, and other permissively licensed alternatives. Apache 2.0 removes all of that. No usage restrictions, no pass-through terms, no royalties. Commercial teams can now build on Gemma the way they already build on Mistral — fully, without reservation.</p>

<p>There's also a geopolitical angle: enterprise procurement and security teams are increasingly preferring US-origin AI models over Chinese providers for compliance and data governance reasons that have nothing to do with benchmark rankings. Gemma 4 being strong, Apache-licensed, and US-origin is useful positioning no benchmark table captures.</p>

<h2>Bottom Line</h2>

<p><strong>If you avoided Gemma because of the license:</strong> reconsider. Apache 2.0 removes the blocker entirely.</p>

<p><strong>If you're evaluating open models for a new project:</strong> test the 26B MoE — impressive efficiency profile — but benchmark inference speed on your hardware first. The 11 token/sec community reports are concerning enough to verify.</p>

<p><strong>If you're looking for raw benchmark supremacy:</strong> the picture is mixed. The open-weights frontier in April 2026 is a close and genuinely competitive pack. That's the actual story here.</p>

<p><a href="https://jacobbrooke95.github.io/jacobs-research-journal/posts/gemma-4-deep-dive.html">Read the full post →</a></p>
      ]]></content:encoded>
    </item>

  </channel>
</rss>
