
> Sonnet 4.5, starting at $3/$15 per million tokens.

Are people really willing to pay these prices? The open-weight models are catching up at a rapid pace while keeping their prices low. MiniMax M2.5, Kimi 2.5, and GLM-5 are dirt cheap compared to this. They may not be SOTA, but they're more than good enough.



At work I'll buy a max subscription for anyone on my team who wants it. If it saves 1-2 hours a month it's worth it, and people get that even if they only use the LLMs to search the codebase. And the frontier models are noticeably better than others, still.

At home I have a $20/month subscription and that's covered everything I need so far. If I wanted to do more at home, I'd seriously look into the open weight models.


It depends on how much you value the gap between “pretty good” and SOTA… I’ve noticed that Opus is more “expensive,” but an error-filled rabbit hole is expensive too!


Totally unrelated, but I just came across your comment [0] from last month about indexing your search history etc., and I know of a couple of programs that fill that niche. The first is spyglass [1], but it's no longer in active development; the second is this Python program, knowledge [2], which I have yet to personally set up (but obviously have an open tab for, as I plan to eventually lol). So you might want to check these out, especially the latter, as it's currently in development.

[0]: https://news.ycombinator.com/item?id=46531526 [1]: https://github.com/spyglass-search/spyglass [2]: https://github.com/raphaelsty/knowledge


I made my own benchmarks out of very basic questions, and Claude 4.6 is actually worse than the free Stepfun 3.5 version: https://aibenchy.com

It is smart, but it fails at basic instruction following sometimes.

I remember this being a Claude thing for quite a while: I kept trying to make it output just JSON (without structured output), and it always kept adding quotes or newlines.


After looking into it more, Claude DOES give the correct answer, just not in the format it's asked for; it always adds more info at the end, even when asked to give just the answer...
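A common workaround for that failure mode is to ignore the chatter and pull the first complete JSON object out of the reply. A minimal sketch (the sample reply is made up for illustration):

```python
import json


def extract_json(reply: str):
    """Pull the first complete JSON object out of a chatty model reply.

    Finds the first '{' and walks forward tracking brace depth,
    ignoring braces that appear inside JSON strings.
    """
    start = reply.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(reply[start:], start):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(reply[start:i + 1])
    return None  # no balanced object found


# The model answered correctly but wrapped the JSON in extra text:
reply = 'Sure! Here is the result:\n{"answer": 42}\nGood bye!'
print(extract_json(reply))  # {'answer': 42}
```

This obviously doesn't fix the instruction-following problem, it just makes the pipeline tolerant of it.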


The best way to get JSON back is function calling.
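With Anthropic's Messages API, one way to do this is to define a single tool whose input schema is the JSON shape you want, and force it via `tool_choice`; the model's `tool_use` input then has to match the schema. A sketch of the request body (the tool name, schema, and model ID here are illustrative assumptions, not fixed values):

```python
# Sketch: forcing JSON-shaped output via a forced tool call.
# "record_answer" and the schema are hypothetical; swap in your own.
request_body = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 256,
    "tools": [
        {
            "name": "record_answer",
            "description": "Report the final answer as structured data.",
            "input_schema": {
                "type": "object",
                "properties": {"answer": {"type": "string"}},
                "required": ["answer"],
            },
        }
    ],
    # Forcing the tool means the reply is a tool_use block whose
    # `input` field is structured data matching the schema above,
    # with no trailing "Good bye!" chatter to strip.
    "tool_choice": {"type": "tool", "name": "record_answer"},
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
}
```

The trade-off is that every call now goes through the tool-use path, even when you only wanted one plain value back.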


What do you mean? You can force JSON with structured output.

It was just an example, though. In real-world scenarios, I sometimes have to tell the AI to respond in a specific strict format that is not JSON (e.g. asking it to end with "Good bye!"). Claude is the worst at following those kinds of instructions, and because of this it fails to return the correct answer in the correct format, even though the answer itself is good.


i agree that's annoying, but anthropic's stance seems to be that the task/agent should be given an environment to write its output to a file, or a skill.md description of how to do that specific task.

personally, i think it's a blurry line. most of the time i'm interacting with an agent, outputting to a file makes sense, but it makes things less reliable when you're treating the model call as a deterministic function call.


There are definitely many ways to improve the AI's output and give it extra hints. Also, some AIs are made for a specific use case. Maybe I should rephrase and say that those benchmarks are more about the single-reply intelligence of a model, more like an AGI test than a test for specific use cases.


1. The UX gap between a task being one-shot or not is huge.

2. If you are doing LLM-assisted coding, you should naturally prefer a SOTA model to minimise (definitely not eliminate) the tech debt you are accumulating, as it will usually generate slightly better code, by whatever metric you want to use.


You get what you pay for imo.


Some people will want models like Claude, where you don't have to be super-specific and it will infer exactly what you mean.

With the GLM models you have to spell out exactly what you want and not miss any detail.


For most tasks it's not necessary. For hairy tasks, it's often nice to switch and pay 10x the cost to complete the task with 10x less intervention.


I'm toying with a hybrid approach: GLM-5 for everything except the write-an-implementation-plan stage, plus a pass at the end with Opus/Sonnet to spot bugs.
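That kind of hybrid setup boils down to routing by pipeline stage. A toy sketch (the model IDs are illustrative assumptions, not exact API names):

```python
# Toy router for a hybrid approach: a cheap open-weight model does the
# bulk of the work, a frontier model handles planning and final review.
# Model IDs below are illustrative placeholders.
FRONTIER_MODEL = "claude-opus"
WORKHORSE_MODEL = "glm-5"


def pick_model(stage: str) -> str:
    """Route by stage: planning and review go to the frontier model."""
    if stage in ("plan", "review"):
        return FRONTIER_MODEL
    return WORKHORSE_MODEL


for stage in ("plan", "implement", "test", "review"):
    print(stage, "->", pick_model(stage))
```

The cost argument is that the expensive model only sees the two stages where its edge actually matters.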



