It's wild that Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks. It will be interesting to see if that's the case in real, practical, everyday use. The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.
The most exciting part isn't necessarily the ceiling rising, though that's happening, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices and latency is what actually unlocks agentic workflows. We are effectively getting the same unit of intelligence for half the compute every 6-9 months.
My Dad used to make the same joke in the 1980s about how they'd told him in the 1950s that nuclear power would be "too cheap to meter" which I assume is probably where the trope originated.
This is what excited me about Sonnet 4.6. I've been running Opus 4.6, and switched over to Sonnet 4.6 today to see if I could notice a difference. So far I can't detect much difference, if any, but it doesn't hit my usage quota as hard.
The point of the penny-farthing is that the pedals drive the front wheel directly, but this one seems to have the pedals positioned where they would drive a chain, although there is no chain?
> Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks
Yeah, it's really not. Sonnet still struggles where Opus, even 4.5, succeeds (and some examples show Opus 4.6 is actually worse than 4.5, all while being more expensive and taking longer to finish).
Flash models are nowhere near Pro models in daily use: much higher hallucination rates, and it's easy to get into a death spiral of failed tool calls and never come out.
You should always take claims that smaller models are as capable as larger models with a grain of salt.
Flash model n is generally a slightly better Pro model n-1; in other words, you get to use the previously premium model as a cheaper, faster version. That has value.
They do have value, because they are much much cheaper.
But no, 3.0 Flash is not as good as 2.5 Pro. I use both of them extensively, especially for translation, and 3.0 Flash will confidently mistranslate certain things that 2.5 Pro will not.
Totally fair. Translation is one of those specific domains where model size correlates directly with quality, and no amount of architectural efficiency can fully replace parameter count.
Given that users preferred it to Sonnet 4.5 in "only" 70% of cases (according to their blog post), I highly doubt this is representative of real-life usage. Benchmarks are just completely meaningless.
For cases where 4.5 already met the bar, I would expect a 50/50 preference split, which makes it hard to make sense of that number without a lot more details.
I sent Opus a satellite photo of NYC at night and it described "blue skies and cliffs/shoreline"... Mistral did it better. Specific use case, but yeah. OpenAI just said "you can't submit a photo by URL". I was going to try Gemini but it kept bringing up Vertex AI. This is with LangChain.
I can see this image shows an *aerial/satellite view of a coastline*. Here are the key features I can identify:
## Geographic Features
- *Ocean/Sea*: A large body of deep blue water dominates a significant portion of the image
- *Coastline*: A clearly defined boundary between land and water with what appears to be a rugged or natural shoreline
- *Beach/Shore*: Light-colored sandy or rocky coastal areas visible along the water's edge
## Terrain
- *Varied topography*: The land area shows a mix of greens and browns, suggesting:
- Vegetated areas (green patches)
- Arid or bare terrain (brown/tan areas)
- *Possible cliffs or elevated terrain* along portions of the coast
## Atmospheric Conditions
- *Cloud cover*: There appear to be some clouds or haze in parts of the image
- Generally clear conditions allowing good visibility of surface features
## Notable Observations
- The color contrast between the *turquoise/shallow nearshore waters* and the *deeper blue offshore waters* suggests varying ocean depths (bathymetry)
- The coastline geometry suggests this could be a *peninsula, island, or prominent headland*
- The landscape appears relatively *semi-arid* based on the vegetation patterns
---
Note: Without precise geolocation metadata, I'm providing a general analysis based on visible features. The image appears to capture a scenic coastal region, possibly in a Mediterranean, subtropical, or tropical climate zone.
Would you like me to focus on any specific aspect of this image?
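For what it's worth, the "can't submit a photo by URL" failure usually comes down to how the multimodal message is shaped before it reaches the provider. A minimal sketch of the message payload that LangChain-style chat wrappers build for an image-by-URL request (the URL and prompt here are placeholders, and whether a given backend accepts URLs at all, versus only base64-encoded image data, varies by provider, which would explain the error above):

```python
import json

# Placeholder URL, not a real image.
image_url = "https://example.com/nyc-night-satellite.jpg"

# OpenAI-style multimodal message: a list of content parts mixing
# text and an image reference. Some providers reject the "image_url"
# form and require inline base64 data instead.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this satellite photo."},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
}

print(json.dumps(message, indent=2))
```

If the provider only accepts inline data, the workaround is to fetch the image yourself and swap the URL part for a base64 `data:` payload before sending.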
Strangely enough, my first test with Sonnet 4.6 via the API, for a relatively simple request, was more expensive ($0.11) than my average Opus 4.6 request (~$0.07), because it used far more tokens than I would consider necessary for the prompt.
This is an interesting trend with recent models: the smarter ones get away with far fewer thinking tokens, partially or fully negating the speed and price advantage of the smaller models.
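The arithmetic behind this is straightforward: a model with a lower per-token rate can still cost more per request if it emits enough extra thinking tokens. A toy calculation (all prices and token counts here are made-up placeholders, not actual Anthropic rates):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Cost in dollars, with prices given per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical: the smaller model is 5x cheaper per token but emits
# 10x the output (mostly thinking tokens) for the same prompt.
small = request_cost(2_000, 20_000, in_price=3.0, out_price=15.0)
large = request_cost(2_000, 2_000, in_price=15.0, out_price=75.0)

print(f"small model: ${small:.2f}")  # $0.31 -- many tokens at a low rate
print(f"large model: ${large:.2f}")  # $0.18 -- few tokens at a high rate
```

So the per-token discount only translates into a per-request saving if the smaller model stays reasonably terse.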