We Got Our Hands on Claude Mythos Preview. Here's What Actually Happened.
You know that feeling when you borrow your friend’s car and it’s so much better than yours that driving your own vehicle afterward feels like operating a horse-drawn carriage? That’s what the last seven days felt like.
We got access to Claude Mythos Preview through Amazon Bedrock’s gated research program. And before you ask — no, we’re not part of the Project Glasswing cool kids club with Apple and Microsoft. We applied through the Bedrock research preview track that opened up to select development teams in mid-April. We wrote what we thought was a mediocre application. Turns out, being an indie team that ships iOS apps and writes honestly about AI gets you further than a 40-page enterprise proposal.
We’ve already covered the news side of Mythos — the leaked codename, the zero-day drama, Project Glasswing, all of it. This post is different. This is what happens when you actually sit down with the thing and point it at real code.
First Impressions: It Thinks Before It Speaks
The first thing you notice isn’t the speed. It’s the silence.
When you give Opus 4.7 a complex task, it starts generating almost immediately — thinking tokens flowing, partial results appearing. Mythos Preview takes a beat. Sometimes two or three seconds of nothing. Then it starts outputting, and what comes out is structured, thorough, and unsettlingly precise.
It reminds me of that coworker who never talks in meetings but when they finally say something, the whole room goes quiet because they just nailed the problem everyone else has been dancing around for twenty minutes.
We started with something simple: pointed it at one of our iOS projects and asked it to do a comprehensive security audit. Not a generic “check for vulnerabilities” prompt — we gave it the full codebase of ThinkBud, our journaling app, and said “find everything that could hurt our users.”
The Bug We Missed for Two Years
Mythos found eleven issues. Nine were the kind of things any decent security scanner would flag — an HTTP endpoint that should’ve been HTTPS, a keychain access attribute that was slightly too permissive, a couple of places where we weren’t sanitizing user input before logging.
But issue number ten stopped us cold.
Deep in our local data sync module, there was a race condition in how we handled concurrent journal entry saves when the app moved between foreground and background states. Under very specific timing conditions — we’re talking milliseconds — it was possible for a partially written entry to overwrite a completed one. Not a security vulnerability in the traditional sense, but a data integrity bug that could silently eat someone’s journal entry.
We’d shipped ThinkBud with this bug for almost two years. No user had reported it, probably because the timing window was so narrow. But Mythos didn’t just find it — it explained the exact sequence of operations that would trigger it, wrote a reproduction test case, and proposed a fix using a serial dispatch queue that was cleaner than what we would’ve written ourselves.
Two years. Thousands of user sessions. And an AI found it in fourteen minutes.
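For readers who want to see the shape of that fix, here is a minimal sketch of funneling saves through a serial dispatch queue. It is not ThinkBud's actual code: `EntryStore`, `JournalEntry`, the `isComplete` flag, and the queue label are hypothetical names chosen for illustration, and the extra guard against a partial draft replacing a completed entry is an assumption about what a fix like this might include.

```swift
import Foundation

// Hypothetical model type; the app's real entry type is not shown in this post.
struct JournalEntry: Equatable {
    let id: UUID
    var text: String
    var isComplete: Bool
}

// Hypothetical store. Every read and write goes through one serial queue,
// so two saves can never interleave, even across foreground/background transitions.
final class EntryStore {
    private let queue = DispatchQueue(label: "com.example.entrystore.serial")
    private var entries: [UUID: JournalEntry] = [:]

    func save(_ entry: JournalEntry) {
        queue.sync {
            let existing = self.entries[entry.id]
            // Assumption: a partially written draft must never replace a completed entry.
            let wouldClobber = (existing?.isComplete ?? false) && !entry.isComplete
            if !wouldClobber {
                self.entries[entry.id] = entry
            }
        }
    }

    func entry(for id: UUID) -> JournalEntry? {
        queue.sync { self.entries[id] }
    }
}
```

Because the queue is serial, a late-arriving partial save runs strictly after the completed save and sees its result, which is what lets the guard reject it. A real implementation would likely use `async` writes and persist to disk; this sketch keeps everything in memory to show only the ordering guarantee.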
Rewriting Our Security Pipeline
After the ThinkBud audit, we got greedy. We pointed Mythos at our shared networking layer — the code that all four of our apps (ThinkBud, PromptKit, ApplyIQ, and Renovise) use for API communication.
This is where the 93.9% SWE-bench score stops being a benchmark number and starts being a lived experience. We asked Mythos to refactor our certificate pinning implementation. It didn’t just refactor it — it identified that our pinning strategy had a subtle fallback behavior that, under specific MITM conditions, would gracefully degrade to unpinned connections instead of failing hard.
The model then wrote a complete replacement: proper certificate pinning with no silent fallback, clear error handling, and — here’s the part that made our jaws drop — it added OCSP stapling validation that we hadn’t even asked for. When we questioned why, it explained that our app’s target demographic (job seekers using ApplyIQ, people managing renovation projects with Renovise) are likely using public WiFi at coffee shops and coworking spaces, where MITM attacks are most common.
It was making product decisions based on user context we gave it three prompts ago.
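The "no silent fallback" idea from this section can be sketched as a pure decision function. This is an illustration under stated assumptions, not the replacement Mythos wrote: `PinDecision` and `evaluatePins` are hypothetical names, the hashes are assumed to be base64 SHA-256 digests of public keys computed elsewhere, and OCSP stapling validation is left out entirely.

```swift
import Foundation

// Outcome of a pin check. There is deliberately no "fall back to unpinned" case.
enum PinDecision: Equatable {
    case accept
    case reject(reason: String)
}

// Hypothetical helper. `pinnedKeyHashes` holds the hashes the app trusts;
// `presentedKeyHashes` holds the hashes computed from the server's certificate chain.
func evaluatePins(pinnedKeyHashes: Set<String>, presentedKeyHashes: [String]) -> PinDecision {
    // An empty pin set is treated as a configuration error, not a reason to skip pinning.
    guard !pinnedKeyHashes.isEmpty else {
        return .reject(reason: "no pins configured")
    }
    // At least one key in the presented chain must match a pinned key.
    guard presentedKeyHashes.contains(where: { pinnedKeyHashes.contains($0) }) else {
        return .reject(reason: "no presented key matches a pinned key")
    }
    return .accept
}
```

In a `URLSessionDelegate` challenge handler, a `.reject` would map to the `.cancelAuthenticationChallenge` disposition rather than `.performDefaultHandling`; the second is the kind of graceful degradation this section describes as the problem.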
The Numbers Don’t Lie (But They Don’t Tell the Whole Story)
Let’s put the benchmarks in perspective with what we actually observed:
Coding quality. On our real-world tasks, Mythos produced code that needed roughly 40% less editing than what Opus 4.7 gave us. Not because 4.7 is bad; we use it daily and it’s excellent. But Mythos output felt like it came from a senior developer who’d been working in your codebase for six months, while 4.7 felt like a brilliant contractor in their first week.
Context retention. Over our week of testing, Mythos maintained coherent understanding across sessions in a way that felt almost eerie. It remembered architectural decisions from day one when we were working on day five. It referenced our certificate pinning conversation when we later asked about network error handling, without us explicitly connecting the two.
Speed tradeoff. Mythos is slower. Noticeably. On complex tasks, we saw response times 2-3x longer than Opus 4.7. For an interactive coding session where you’re firing off quick questions, this matters. For deep security analysis or architectural review, the wait is worth it every single time.
For reference, here’s how the public benchmarks stack up:
| Benchmark | Opus 4.6 | Opus 4.7 | Mythos Preview |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% | 93.9% |
| SWE-bench Pro | 53.4% | 64.3% | 77.8% |
| Terminal-Bench 2.0 | 65.4% | 69.4% | 82.0% |
| GPQA Diamond | — | 94.2% | 94.6% |
| CyberGym | 66.6% | — | 83.1% |
The SWE-bench Verified jump from Opus 4.6 to Mythos, from 80.8% to 93.9%, is 13.1 percentage points: the kind of leap that usually takes an entire model generation. Anthropic did it in one step, and then decided the rest of us aren’t ready for it.
The Uncomfortable Question
Here’s the thing nobody in the Mythos conversation wants to say out loud: this model is too good to stay locked up.
Yes, the cybersecurity capabilities are legitimately scary. Anthropic’s red team report shows Mythos finding zero-days in every major OS and browser, including a 17-year-old FreeBSD RCE that gives root to any unauthenticated attacker. The Firefox exploit stats alone — 181 successful exploits versus Opus 4.6’s two — paint a picture of a model that has crossed a meaningful capability threshold.
But the same capabilities that make it dangerous for offensive security make it transformative for defensive work. Our week with Mythos didn’t just find bugs — it fundamentally changed how we think about code quality. The race condition in ThinkBud. The certificate pinning gap. The OCSP stapling addition we never thought to request. These aren’t theoretical improvements. They’re real fixes in shipping apps that real people use.
Anthropic has committed $100 million in Mythos Preview usage credits to Project Glasswing and donated $4 million to open-source security foundations. That’s real money aimed at real problems. But it still means the best AI security tool ever built is available to the companies that need it least — the ones that already have massive security teams.
What This Means for Indie Developers
If you’re an indie developer or a small team like us, the practical takeaway is this: Opus 4.7 is still your best friend, and it’s genuinely great. We switched back to it after our Mythos access window, and it’s still the most capable publicly available model for real development work.
But knowing what Mythos can do changes the conversation. It proves that AI-assisted security auditing isn’t a future promise — it’s a present reality that’s being artificially gatekept. And it raises the question of whether Anthropic’s approach of restricting access is protecting us or just delaying the inevitable.
For now, we’re back to Opus 4.7 in Claude Code, shipping updates to our apps with the fixes Mythos helped us find. ThinkBud’s race condition is patched. PromptKit’s networking layer is hardened. ApplyIQ and Renovise got the OCSP stapling upgrade.
Seven days with the most powerful AI model in the world, and we came back with real, tangible improvements to our products. That’s not a benchmark score. That’s the whole point.
Related Reading
- Anthropic Dropped Opus 4.7. But the Real Story Is Mythos. — Our deep dive into the Mythos leak, Project Glasswing, and what Opus 4.7 actually brings to the table.
- Vibe Coding Built a Social Network in a Weekend. It Leaked Everything by Tuesday. — Why AI-generated code needs security auditing more than ever.
- The Complete Guide to Vibe Coding iOS Apps — How we use Claude Code daily for iOS development.
NativeFirst Team
The NativeFirst team: engineers and designers building native Apple apps and writing the courses we wish we had when we started.