Input: $5/M tokens at <=272K, $10/M tokens above 272K.
Output: $30/M tokens at <=272K, $45/M tokens above 272K.
Cache read: $0.50/M tokens at <=272K, $1/M tokens above 272K.
Significantly more expensive than Opus 4.7 beyond 272K, and at least in my tasks I haven't seen the model be that much more token efficient, certainly not to such a degree that it'd compensate for this difference. GPT-5.4 had a solid context window at 400k with reliable compaction; both appear somewhat regressed, though it is still too early to truly say whether compaction is less reliable. Also, I have found frontend output to still skew towards that one very distinct, easily noticeable, card-laden template with an overindulged blue-ish hue that made me skeptical of Horizon Alpha/Beta before GPT-5's release. It ended up doing amazingly well at the time for task adherence, which made it very useful for me outside that one major deficit. The fact that GPT-5.5 is still so restricted in that area is weird considering it's supposed to be an entirely new foundation.
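To put rough numbers on the tier jump, here is a minimal sketch of the arithmetic using only the rates quoted above. Two things are assumptions on my part and not confirmed by the price list: that the tier is picked by total prompt size (uncached plus cached input), and that once a request crosses 272K the higher rates apply to the whole request, not just the tokens past the boundary. The function name `request_cost` is made up for illustration.

```python
THRESHOLD = 272_000  # tokens; tier boundary from the price list above

# USD per million tokens: (at or below 272K, above 272K)
RATES = {
    "input": (5.00, 10.00),
    "output": (30.00, 45.00),
    "cache_read": (0.50, 1.00),
}


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    ASSUMPTION: the tier is chosen by total prompt size and then applied to
    the entire request (input, output and cache reads), not marginally.
    """
    prompt_size = input_tokens + cached_tokens
    tier = 1 if prompt_size > THRESHOLD else 0
    total = (
        input_tokens * RATES["input"][tier]
        + output_tokens * RATES["output"][tier]
        + cached_tokens * RATES["cache_read"][tier]
    )
    return total / 1_000_000


if __name__ == "__main__":
    for prompt in (200_000, 300_000):
        print(f"{prompt} in / 10000 out: ${request_cost(prompt, 10_000):.2f}")
```

Under those assumptions, 200K in and 10K out comes to $1.30, while 300K in and 10K out comes to $3.45, so a 50% larger prompt costs well over twice as much.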
Heck, not giving the person Admin privileges would have sufficed to prevent this. Or better hiring that screens out people who install Roblox cheats on work devices...
There is no excuse and no fine line here. Even outside them boasting about SOC 2 Type II, this would be embarrassing for an SME not in the tech sector.
Any security team that gives unrestricted admin privileges to random employees is not a security team. So doing the most basic parts of their job, that would be my proposal.
If you mean my hiring comment specifically, that was meant a bit facetiously, though I will point out this line in their "compliance" report by "auditor" Delve:
> The organization carries out background and/or reference checks on all new employees and contractors prior to joining in accordance with relevant laws, regulations and ethics. Management utilizes a pre-hire checklist to ensure the hiring manager has assessed the qualification of candidates to confirm they can perform the necessary job requirements.
Maybe those pre-hire checklists should include a question like "Are you a massive idiot who'd install a game on their work computer, then on top of that be the type of idiot who likes to cheat, then on top of that be the type of idiot to install cheats on your work computer?". Maybe that'd prevent this in the future. Or again, just don't give everyone Admin privileges...
Just an addition to the prior comment: To be as generous as possible, I just pulled their audit report [0] and to answer your question, all I propose is that they stick to this (especially the part on minimum permissions, any extended permissions need to be reasonable and reasoned for, etc), which they did not. The fault lies threefold:
First of all, with the team members at Context.ai, who either weren't experienced enough or did not care enough to know that the "all green" they got from Delve straight away couldn't have been accurate.
Secondly, with the people at Delve who, at least in this isolated case, seem not to have fulfilled their obligations and are suspected of having failed them in a consistent, repeated and intentionally malicious manner.
Thirdly, with the people who, despite claiming to have done their due diligence and being experienced investors and professionals in the field whose own prior companies also had to undergo audits in the past, looked at Delve and were willing to overlook the misdeeds for financial gain.
Odd, they used Delve [0] and a SOC2 compliant company like Context.ai [1] should have an AUP, EDR, etc. that prevents their employees from installing a Roblox cheat on their work computer. Heck, even outside SOC2, I have never worked at a company without endpoint restrictions to prevent unauthorised installs.
It's almost like the denials were in fact false and Delve truly was just selling a sticker, not providing an actual service.
If I were a VC that had funded Delve for a considerable amount of time, I'd be embarrassed that we did not catch that. I'd probably rework my processes, publicly analyse how this alleged fraud got past me and go above and beyond in disclosing my findings to rebuild trust. I'd most certainly not think just cutting funding is sufficient given the situation. Even more so if I'd encouraged other companies funded by me to use their "services". I'd maybe even reevaluate whether a circular approach, wherein the companies we fund are incentivised to rely on other companies we also fund, leads to the best options being chosen, and whether that isn't antithetical to a forward-thinking environment and competition. At the same time, I'd also think that maybe such a setup just hides unsuccessful companies and potentially even alleged fraud which, once it gets to the broader market, may cause significant harm...
K2.6-code-preview was a minor but noticeable jump, especially in a long-running testing task, and prior Moonshot releases have been the only models that I'd consider a suitably competitive replacement for Anthropic models. The way they approach tool calls, task inference and adherence is far closer than any other provider's output, similar to how GLM models map far more closely to OpenAI's releases. Whether task adherence, task assessment, task evaluation or task inference, K2.5 got closer to Opus 4.5 than any other model (but was still behind overall).
I will have to test this full release of K2.6 but could see it serve as a very good overall drop-in replacement for Opus 4.5 and Opus 4.6 at 200k across the vast majority of tasks.
I will say, however, that Opus 4.7 Max 1M has been a very significant jump in performance for me, especially in tasks beyond 120k tokens, where I'd argue it is now the most reliable model in continued task adherence and tool calling without compaction. Ironically, my initial experience was less than pleasant, as on XHigh I found task adherence to have regressed even with less than 1/10th of the context window having been used.
Am very interested in K2.6's compaction strategy (which appears to be very simple, all things considered) and how it performs beyond 100k tokens. As it stands, only OpenAI models have made compaction for long-running tasks work well, though overall, GPT-5.4 is still inferior in my tests, regardless of context window, to other models such as Opus 4.6 1M and Opus 4.7 1M. Haven't gotten around to testing Opus 4.7 200k and will have to do this to assess K2.6 fairly, but I'd be very surprised if K2.6 truly beat Opus 4.7 200k given the jump I have experienced.
Am very much the same, took a bunch private two years ago for a multitude of reasons. I can, however, see why having no public repos could be a partial indicator and cause for concern in conjunction with sudden star growth, simply because it is hard for a person with no prior project to suddenly and publicly strike gold. Even on YouTube it is a rare treat to stumble across a well-made video by a small channel, and without algos to surface repos on GitHub in the same way, any viral success from a previously inactive account should be treated with some suspicion. Same the other way: if you never made any PRs, etc., sudden engagement is a bit odd.
I don't know what is more, for lack of a better word, pathetic, buying stars/upvotes/platform equivalent or thinking of oneself as a serious investor and using something like that as a metric guiding your decision making process.
I'd give a lot of credit to Microsoft and the GitHub team if they went on a major ban/star removal wave of affected repos, akin to how Valve occasionally does a major sweep across CS2 banning verified cheaters.
The problem is that if this is the game now, you need to play it. I'm trying to get a new open source project off the ground and now I wonder if I need to buy fake stars.
Or buy the cheapest kind of fake stars for my competitors so they get deleted.
For Microsoft this is another kind of sunk cost, so idk how much incentive they have to fix this situation.
Haha, have you tried that? I think in this day and age marketing is a much-needed activity even for open-source projects providing quality solutions to problems.
I maintain a niche-popular project that I didn't do any marketing for. My understanding is that even for popular projects, the usual dynamic is that there's just one guy doing all the work. So "getting off the ground" just means getting people to use it, and there shouldn't be any reason to artificially force that.
It depends what your objective is. Many people seem to see their open source projects as a stepping stone into some commercial activity. Putting aside whether that is a good idea or not, if that is what they want to do, then they will need to market in some way.
The issue with that is, it's a game that never ends. Now you need to inflate your npm/brew/dnf installs, then your website traffic to not make it too obvious, etc.
I am not successful at all with my current projects (admittedly am not trying to be nowadays), so feel free to dismiss this advice, which predates LLM-driven development, but in the past I have had decent success in forums interacting with people who had a specific problem my project addressed. Less in stars, more in actual exchange of helpful contributions.
Honest question, which companies handle the process better, given it is a trade-off? Yes, VAC is not as iron-clad as kernel-level solutions can be, but the latter are overly invasive for many users. I'd argue neither is the objectively right or better approach here, and Valve's approach of longer-term data collection and working on ML solutions that have the potential to catch even those cheating methods currently able to bypass kernel-level anti-cheat is a good step.
On GitHub stars, I'd argue they are the most suitable comparison, as all the funny business regarding stars should be detectable, if at all, by GitHub directly, and ideally bans would have the biggest deterrent effect if they happened in larger waves, allowing the community to see who engaged in fraudulent behaviour.
I've found the latency and pricing make Mercury 2 extremely compelling for some UX experiments focused around automated note tagging/interlinking. Far more than the Gemini Flash Lite I used before, it made some interactions nearly frictionless, very close to how old school autocomplete/T9/autocorrect works in a manner that users don't even think about the processes behind it.
Sadly, it does not perform at the level of e.g. Haiku 3.5 for tool calling, despite their own benchmarks claiming parity with Haiku 4.5, but it does compete with Flash Lite there too.
Anything with very targeted output and sufficient existing input that benefits from a seamless feel lends itself to dLLMs. Could see a place in tab-complete too, though Cursor's model seems to be sufficiently low latency already.
Thanks for the recommendation and sharing your evals, will take a closer look at them. Yes, the Mimo models are very interesting, end-to-end pricing wise especially, though in my tool call runs, GLM 4.7 Flash did slightly better at roughly equal speed and full run cost. Is of course very task dependent and both are amazing options in the price range, but latency wise, nothing feels like Mercury 2 at the moment.
Yes, nothing to write home about. It's all relative of course: which models perform best depends on the stack, the goal and the approach, but for regular day-to-day coding, I do not find it usable given the alternatives.
Kimi, MiniMax and GLM models provide far more robust coding assistance, sometimes at no cost (financed via data sharing) or for very cheap. Output quality, tool calling reliability and task adherence tend to be far more reliable across all three than with Mercury 2, so if you consider the time to get usable code end-to-end, including reviews, manual fixes, different prompting attempts, etc., you'll be faster.
The only "coding" task where I have found a place for Mercury 2 is a browser desktop with simple generated applets. Think artefacts/canvas output, but via a search field if the applet has been generated previously.
With other models, I need to hide the load behind a splash screen, but with Mercury 2 it is so fast that it can feel frictionless. The demo at this point is limited by the fact that once I venture beyond a simple calculator or todo list, the output becomes unpredictable and I struggle to get Mercury 2 to rely on pre-made components, etc. to ensure consistent appearance and a11y.
Despite the benchmarks, cost and speed figures suggesting something different, I have had the best overall results with Haiku 4.5, simply because GPT-5.4-nano is still unwilling to play nice with my approach to UI components. I am currently experimenting with some routing, using different models for different levels of complexity, then using loading spinners only for certain models, but even if that works reliably, any model that I cannot force to rely on UI components in a consistent manner isn't gonna work, so for the time being it'd just route between less expensive and more expensive Anthropic models.
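For anyone curious what that routing idea amounts to, a rough sketch follows. Everything in it is hypothetical: the model identifiers are placeholders and not real API names, the keyword list and threshold are invented, and the assumption that only the pricier model needs a spinner is mine. It only illustrates the shape of "score the request, pick a tier, decide whether to show a spinner".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str          # placeholder identifier, NOT a real API model name
    show_spinner: bool  # whether the UI should mask latency with a spinner


# ASSUMPTION: the cheaper model is fast enough to go without a spinner.
CHEAP = Route(model="cheaper-model", show_spinner=False)
EXPENSIVE = Route(model="pricier-model", show_spinner=True)

# Hypothetical hint words suggesting an applet needs more than a trivial UI.
COMPLEX_HINTS = ("chart", "table", "drag", "sync", "auth", "upload")


def complexity_score(request: str) -> int:
    """Crude heuristic: word count plus a penalty for each hinting word."""
    words = request.lower().split()
    hints = sum(1 for word in words if any(h in word for h in COMPLEX_HINTS))
    return len(words) + 5 * hints


def pick_route(request: str, threshold: int = 12) -> Route:
    """Send requests at or above the threshold to the pricier model."""
    return EXPENSIVE if complexity_score(request) >= threshold else CHEAP


if __name__ == "__main__":
    for req in ("simple calculator",
                "kanban board with drag and drop plus a burndown chart"):
        route = pick_route(req)
        print(req, "->", route.model, "| spinner:", route.show_spinner)
```

A keyword heuristic like this is obviously brittle; a real setup would more likely use a small classifier or the cheap model itself to triage, but the routing table and spinner flag would look much the same.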
Coding wise, one more exception can be in-line suggestions, though I have no way to fairly compare that because the tab models I know about (like Cursor's) are not available via API, but Mercury 2 seems to perform solidly there, at least in Zed for a TS code base.
Basically, whether code or anything else, unless your task is truly latency dependent, I believe there are better options out there. If it is, Mercury 2 can enable some amazing things.