> But I have friends who used to self publish some small esoteric fiction. This commonplace theft has basically made them stop
If you're writing for money, maybe. If you're writing for the love of writing, it won't.
More, you hear of authors who encourage their books to be made available without DRM, who know or silently encourage their books to end up on torrent / library sites. They want their books to be read.
This is interesting - I have been working on the same thing, building contextual data, LSP-style.
I saw the tools page, where if I understand right, `get-symbol-context` is actually the main useful tool for what you provide? The others seem to be metadata that's easy to get already (?) but that tool provides the extra info.
I had been working on exposing mine as more high-level, i.e. multiple APIs to query different kinds of metadata about symbols, types, etc. But I am still not sure of the best approach; my thinking was about not overloading the AI with too many different tools. They accumulate quickly.
I definitely share the same sentiment. I don't want to overload the LLM with many tools. Better to have a few opinionated and flexible ones, but yeah, keeping the balance is hard.
I would say the main two tools are get-symbol-context and get-repository-overview. The latter is actually the more complex and sophisticated one. I'm running some graph algorithms to rank the symbols in terms of relative importance based on centrality metrics, i.e. how well connected they are in the symbol graph.
The idea behind that is to allow the LLM to infer the general structure and architecture of the project with just one tool call.
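To make that concrete, here's a toy sketch of the idea (not my actual implementation; the names and the networkx dependency are just for illustration):

```python
# Hypothetical sketch: ranking symbols in a code graph by centrality.
# Assumes a networkx DiGraph whose nodes are symbols and whose edges
# carry a "kind" attribute (calls, inherits, references, ...).
import networkx as nx

g = nx.DiGraph()
g.add_edge("App.run", "Server.start", kind="calls")
g.add_edge("Server.start", "Router.dispatch", kind="calls")
g.add_edge("ApiRouter", "Router", kind="inherits")
g.add_edge("App.run", "Router.dispatch", kind="calls")

# PageRank as one possible centrality metric: well-connected symbols
# float to the top, giving a rough "architectural importance" ranking.
ranks = nx.pagerank(g.reverse())  # reverse so heavily-used symbols rank high
overview = sorted(ranks.items(), key=lambda kv: kv[1], reverse=True)
for symbol, score in overview[:10]:
    print(f"{score:.3f}  {symbol}")
```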
I guess you could get something similar if you had a good Agents.md or docs detailing that for your project, but this was more meant to achieve it on the fly.
The symbol-context tool is basically a graph query tool (without a DSL or Cypher support yet), but yeah, here the question is also whether it makes more sense to give the AI the possibility to run Cypher queries itself or to abstract that away in a thinner API.
The main underlying factor of my tool is, however, the graph that I'm building and the metadata which can be extracted from it (connections, type of connection, etc.) :)
Metadata: I feel like LSP focuses on human-style things (like locating a symbol) which are useful, but not necessarily exactly what an LLM needs. Instead I want to do things like show the inheritance chain. Is a virtual method overriding something, being overridden later? What is the class / polymorphic situation? My feeling is that this will help understand the shape, plus help catch some bugs.
So a query on a symbol would (see the sketch after this list):
* Return its type declaration, not (just) its location (and I'm considering some kind of summary version that pulls in the ancestors too, so you directly see everything it has available, not just the actual declaration, because leaf nodes in inheritance often don't add much and the key behaviour is elsewhere)
* Return info about inheritance: the shape of how this symbol modifies other code and how other code modifies it.
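As a rough sketch of the response shape I have in mind (all names hypothetical):

```python
# Toy sketch of what a symbol-context answer could carry (made-up names).
from dataclasses import dataclass, field

@dataclass
class SymbolContext:
    name: str
    declaration: str              # full type declaration, not just a location
    ancestors: list[str] = field(default_factory=list)   # inheritance chain
    overrides: str | None = None  # virtual method this one overrides, if any
    overridden_by: list[str] = field(default_factory=list)

ctx = SymbolContext(
    name="Widget.draw",
    declaration="def draw(self, surface: Surface) -> None",
    ancestors=["Widget", "View", "Object"],
    overrides="View.draw",
    overridden_by=["Button.draw", "Label.draw"],
)
```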
With variations when the symbol is a variable, a type, etc. I'm currently using treesitter for this, to bypass LSP and (for the language I'm working on) build a full symbol table and more, to get something closer to the LSP info you mention in your blog but not limited to what LSP makes available. I don't want to rely on an LSP server; I think first-class support per language is better. It might be possible to generate this with a set of LSP calls, but it would take some heuristics and guesswork and... :/
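For a flavour of the tree-sitter side, something like this (a minimal sketch against the Python grammar; py-tree-sitter API details shift a bit between versions):

```python
# Rough sketch: pull class names and superclasses out of Python source
# with py-tree-sitter, as a first step towards a symbol table.
import tree_sitter_python as tspython
from tree_sitter import Language, Parser

parser = Parser()
parser.language = Language(tspython.language())
src = b"class Button(Widget):\n    def draw(self): ...\n"
tree = parser.parse(src)

def walk(node, symbols):
    if node.type == "class_definition":
        name = node.child_by_field_name("name").text.decode()
        bases = node.child_by_field_name("superclasses")
        symbols[name] = bases.text.decode() if bases else ""
    for child in node.children:
        walk(child, symbols)

symbols = {}
walk(tree.root_node, symbols)
print(symbols)  # {'Button': '(Widget)'}
```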
I do have a graph of file-level dependencies, but not yet a graph of what calls what at the symbol or type or method level. And while I build an index of all symbols I haven't yet sorted that by count.
I get the sense we're thinking along similar lines, with slightly different approaches?
Edit: if you would like to chat on this, I'm up for it! You can find me at my username at gmail (easy to lose emails there due to volume and spam!) or my profile has my website which has my LinkedIn (horribly, more reliable :D)
That sounds great, thanks for sharing your thoughts!
It sure sounds like we have similar things in mind. I basically try to build the proper graph representation of the code at runtime, so all caller/callee relationships plus type inheritance chains etc.
This is basically what I call a semantic code graph in the blog post.
From the things I tried with tree-sitter, I think I would have a hard time achieving the same, because by nature tree-sitter can only make educated guesses about real connections and will run into problems if things are named ambiguously.
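A trivial example of the ambiguity I mean (hypothetical, but it shows the shape of the problem):

```python
# Two unrelated classes both define close(); from the call site alone,
# a purely syntactic tool can't tell which one h.close() refers to.
class File:
    def close(self): print("flushing buffers")

class Window:
    def close(self): print("destroying widget")

def shutdown(h):
    h.close()  # tree-sitter sees a call named "close", nothing more
```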
But yeah, will definitely reach out and am looking forward to chatting :) Hope I find the time during this week!
Clearly an AI-written blog (or at least AI-edited; not all of it was bad), but it seems to be trying to say something about an MS library that allows anonymous (?) access? I couldn't make myself wade through the LLM prose after reading too many triples (a, b, c) to grok it all. It doesn't actually introduce what it's about to people who don't already know, like me.
> We audited a 27.6% subset of the dataset that models often failed to solve and found that at least 59.4% of the audited problems have flawed test cases that reject functionally correct submissions, despite our best efforts in improving on this in the initial creation of SWE-bench Verified.
Is this saying a quarter* of the questions and answers were wrong, this whole time?!
If so, how was this ever, in any way, a valid measurement?
And what was the process for creating this benchmark, and how did it end up with such an extraordinarily poor set of data? (There is a description later of how it was created, which describes a high standard; I struggle to understand how that aligns with the other results they discuss.) Kudos to them for highlighting the issues, but I am left with questions.
[*] Not one in four, but one in six, thanks commenters for the correction; leaving the original since, eh, my bad, and it lets replies make sense. I feel the broad point still stands!
> Is this saying a quarter of the questions and answers were wrong, this whole time?!
No, they're saying 59.4% of the 27.6% subset had flawed test cases, I think - so roughly 16% of the full set, about one in six.
> If so, how was this ever, in any way, a valid measurement?
Benchmarks essentially aren't, for practical concerns anyway. They don't represent your use case, and they don't represent any and all use cases; they're valid for measuring exactly what's included in the benchmarks, nothing more and nothing less.
I don't understand the ecosystem's obsession with using public benchmarks; they hardly ever tell you anything of value. OK, Qwen 3.5 is 50% better on Benchmark X than Qwen 2.5. Does that mean it'll be 50% better for what you're using it for? Very unlikely.
I've been running my own private benchmarks, with test cases I never share anywhere, for the specific problems I'm using LLMs for. Some are based on real, actual cases where an LLM went wrong and I had to adjust the prompt, and over time I've built up a suite.
Most of the time when a new update comes out to a model, it moves maybe 2-3% in my own benchmarks; meanwhile they tout a 30-40% increase or something ridiculous in public benchmarks, and we're supposed to believe the models' training data isn't contaminated...
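A harness like that doesn't need to be fancy; conceptually it's just something like this (a minimal sketch, with query_model as a stand-in for whatever client you use):

```python
# Skeleton of a private eval suite; the cases and scoring are
# deliberately trivial, and query_model is a placeholder.
CASES = [
    {"prompt": "Extract the ISO date from: 'shipped 3rd of May 2021'",
     "expect": "2021-05-03"},
    {"prompt": "Return only the SQL keyword that filters rows",
     "expect": "WHERE"},
]

def query_model(prompt: str) -> str:
    # Stand-in: wire up your actual model client here.
    return ""

def run_suite() -> float:
    passed = sum(1 for c in CASES if c["expect"] in query_model(c["prompt"]))
    return passed / len(CASES)

# Track the score per model/version over time; a 2-3% move on cases
# drawn from your real failures says more than 30% on a leaderboard.
```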
I'm not sure people are really trying to interpret this kind of benchmark as being accurate in gauging the magnitude of improvement. It seems pretty obvious that doubling your score on some benchmark where 100% means "correctly answered all of these specific problems" doesn't translate directly to performing twice as well on all problems. I think what people want from these benchmarks—and what they do get to some extent—is answering the question of "is model A better than model B", especially the subset of "is this local model better than last year's frontier online model".
The marketing departments touting each model do want to claim superiority on the basis of slivers of percentage points, and that's probably always a stronger claim than the test results can reasonably support. And the benchmarks are obviously susceptible to cheating and overfitting. But when the scores aren't saturated and do show a big discrepancy, that kind of result usually seems to align with what people report from actually trying to use the models in the relevant problem space.
The ecosystem's obsession with public benchmarks comes from the fact that running benchmarks costs money, and the labs don't test on any given private benchmark.
But yeah, you're correct: anyone optimizing for public-bench rank instead of their own task-distribution eval has been pointing at the wrong thing for a while.
Still, I guess it's a useful signal for deciding which models to consider. Negative signal is still signal: assuming everyone is gaming the benchmarks in similar ways, a lack of benchmark performance does tend to show up as a real effect on workloads.
ImageNet is one of the most popular datasets on the planet. Turns out, a significant fraction of its images are mislabeled. In the limit, a model would have to fit to the wrong labels to score above a certain percentage.
The answer is “it works because ML wants to work.” It’s surprising how far you can get with something flawed. It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
> It’s also why such huge breakthroughs are possible by noting flaws others haven’t.
I make these sorts of breakthroughs at home all the time! My wife will say the computer is doing something strange, and instead of just randomly clicking around, I read the error messages slowly and out loud, then follow what they say. Anyone can do this, yet it seems like a magical ability every time you employ it to help people.
To be useful for identifying which model is better, benchmark scores only need to correlate with true performance, for which it's enough that the majority of tasks are scored correctly. You could have a terrible benchmark where 49% of the labels are wrong and a model that always answers correctly gets a score of 51%, but as long as it's higher than the always-wrong model at 49%, it's still directionally correct.
Most machine-learning benchmarks have a fairly large fraction of incorrect labels, but when you just want to distinguish between different models, the time you'd need to ensure perfect scoring would usually be better spent on collecting a larger benchmark dataset, even if it ends up having more errors.
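A quick simulation makes the point (a toy model where each wrongly-labeled item simply inverts the verdict):

```python
# Even with 30% of labels flipped, the genuinely better model still
# wins the comparison almost always once the benchmark is big enough.
import random

def observed_score(true_accuracy, n_items, label_error):
    score = 0
    for _ in range(n_items):
        correct = random.random() < true_accuracy   # model's true behaviour
        flipped = random.random() < label_error     # label is wrong
        score += correct != flipped                 # wrong label inverts verdict
    return score / n_items

random.seed(0)
wins = sum(
    observed_score(0.80, 500, 0.30) > observed_score(0.60, 500, 0.30)
    for _ in range(1000)
)
print(f"better model ranked first in {wins / 10:.1f}% of trials")
```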
You're right - I did not apply the math. (I won't edit, in order to let the parent comment still make sense, and thank you for the correction.)
So not one in four, but one in six problems have problems.
That is extraordinarily high and the point still stands: is this truly saying a [large proportion] of the questions and answers were wrong, this whole time, and if so how was it ever a valid measurement?
> Curiously Opus 4.7 claims to have a 87.6% pass rate and Mythos claims to have a 93.9% pass rate... leading to the conclusion that it's actually possible to "solve" the problems that OpenAI claims are incorrect.
Huh, that is very curious and interesting indeed. If that's true, that Anthropic claims that pass rate while OpenAI claims the test cases are flawed and broken, then clearly one of them isn't telling their whole side...
Oops, sorry, I moved this portion of the comment to a top-level comment at the same time you were replying, since the part that was replying to GP was already addressed well in a simultaneous comment.
I really hate that VGA-looking font. That IBM went with serifs by default on the MDA and all subsequent PC fonts is a disgrace. They had a much nicer one in their mainframe and mini terminals.
How awesome to see this on the front page! I've been writing a wrapper for this repo. Right now I'm running Turbo Vision -- this repo -- under .Net on macOS. It's a magical feeling.
The wrapper gives a higher-level API, solves (or wraps) some things like the rather antique palette API, adds layout, etc.
(This is Oxygene, currently compiling to .Net. It can be used from any .Net language of course, even Go or Swift with our toolchain, but as an assembly it can be consumed by anything. It uses P/Invoke for the native TV binaries.)
Heavily in progress :D The repo is still private, and today and tomorrow I'm working on things like basing palettes on the surfaces that controls are placed on. Todos are cleaning up layouts, adding a few more basic (for today) missing controls, etc.
I had experimented with libraries like Terminal.GUI, which was (still is?) in the middle of a v2 transition and really difficult to get behaving without bugs. And Claude is a great lesson about TUIs and libraries that have been built without real terminal consideration -- lots of what not to do. I found myself missing Turbo Vision and thinking, why not just have a modern version? Then I found this repo, saw it was updated for Unicode, etc... I am very grateful to the author.
For those who aren't familiar[1], it's part of RemObjects' Elements[2] suite, which allows you to use and mix several different popular languages in addition to the Pascal-based Oxygene across Windows, MacOS, Linux, Android etc.
We've thought of it. Most customers are Win & Mac based. The vast majority of the IDE is shared code with only thin native UI wrappers (genuinely native, done the Right Way, aka non-Electron :D), and I admit to having been tempted to point an AI at the source and a copy of Qt and ask it to add a third one, a Linux GUI. We don't normally use AI for most of our coding, but I think with two prior examples and a thin layer only, it'd be a pretty cool experiment.
If you do have a go at that and want someone to test the end result, you're welcome to ping me to try it out? :)
(email addr is in my HN profile)
--
As a data point, I develop with Go professionally and use Linux (exclusively) both at home and at work. Some of my developer team mates at work use macOS, and others use Windows.
Many of the other places I've worked have been fairly Linux (and/or macOS) heavy too. Not so much Windows heavy in the last few years, and I personally expect the trend of less Windows usage to continue due to Microsoft continuing to degrade its user experience.
I've done some work with this tvision port as well. Every time I use a new TUI framework, I'm disappointed. Invariably, Turbo Vision is better.
I'm actually working on my own .NET wrapper, too. I don't think I'm as far as you, though. I'm mimicking the Windows Forms API as closely as possible and I want to have a drag-and-drop TUI designer.
I did most of the hard integration work on the C++ side: https://github.com/brianluft/terminalforms/tree/main/src/tfc... -- exporting simple C functions that I can call with P/Invoke so that the C# side is mainly about organizing into classes. It took a couple tries to find a design that didn't fall apart when I got into more complicated stuff. Initially I went too hard into "everything that you can do in C++ should be possible in C#"--this was maddeningly complex. I was using placement new to stick the C++ objects into C# buffers, you could effectively subclass the C++ classes from the C# side, it was getting way too involved. I switched to a much more direct and less flexible approach. I decided the flexibility should be on the C# side.
Nice! Mimicking that API is a great idea - a very different approach to mine and I love seeing it :) We shall clearly be in great competition in the future!
(Joking aside, I actually hope great cooperation, not competition... it's what open source is for. Seeing someone else is making a .Net wrapper as well is just plain awesome and I wish you the best. I really like the different API style.)
Definitely. Reach out to me or magiblot if you ever need any help. I love this stuff, and magiblot is extremely responsive. They have helped me many times and even made changes to tvision to support my use cases.
Also, I assume you know about the Turbo Vision Pascal and C++ books. They're really helpful. I transcribed both books to Markdown for easy searching if you want it.
That editor is really impressive. I'll see if I can wrap it too! Can I ask what your editor integration is for, what project?
I know the books exist, but I haven't read them. It's possible I even had a copy as a child - we had many of them but I never wrote Turbo Vision code, just Turbo Pascal in plain text or graphics modes. Then I moved on to Delphi. I would love to find both in MD - are they publicly available?
I realised I never answered your question about P/Invoke: I wrote (as in got an AI to generate) a flat C wrapper using the handle pattern, and then reimplemented classes back in Oxygene Pascal. I'm experimenting with new controls now, e.g. a tab/page view and others. Each is just a class that has its own view, plus some management extras, such as what maximum rendering it supports (I'm adding everything from DOS character sets through to Powerline for rendering tabs, borders, etc.). It's all churning quite a bit: I do something, realise it may not be the best approach for the TV model, and rethink -- learning a lot about TV's architecture along the way -- so no more solid answer than that currently, simply because it may be outdated tomorrow :)
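For illustration, here's the same flat-C handle pattern consumed from Python's ctypes rather than P/Invoke; the library name and functions are made up, but it's the general shape:

```python
# Hypothetical flat C API consumed via ctypes, analogous to P/Invoke.
import ctypes

tv = ctypes.CDLL("libtvwrap.so")  # made-up library name

# Opaque handles: the C side owns the C++ objects, callers only see void*.
tv.tv_window_create.restype = ctypes.c_void_p
tv.tv_window_create.argtypes = [ctypes.c_char_p]
tv.tv_window_destroy.argtypes = [ctypes.c_void_p]

class Window:
    """Thin class reimplemented on the managed side over the flat C API."""
    def __init__(self, title: str):
        self._h = tv.tv_window_create(title.encode())

    def close(self):
        if self._h:
            tv.tv_window_destroy(self._h)
            self._h = None
```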
Rather than focusing heavily on inheritance, I'm leaning more towards soft interfaces (duck typing) and wrapping via composition. This is just personal preference. But the concept is, if something looks like it is able to hold controls, it is treated as though it can hold controls, for example.
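In other words, something like this (hypothetical sketch):

```python
# A soft interface: anything exposing add_child() is treated as a
# container, no base class required (names are made up).
def place(control, parent):
    if hasattr(parent, "add_child"):  # looks like it can hold controls
        parent.add_child(control)
    else:
        raise TypeError(f"{parent!r} cannot hold controls")
```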
I used it for https://tmbasic.com to build an IDE in the style of Turbo Pascal for my toy language. Originally, I was going to implement the TV wrapper framework and UI builder in this project instead, so you could write TUIs in TMBASIC, but that ultimately felt like a waste for a toy language. With .NET someone might actually use it :P
I've sent you an email with the Turbo Vision books in Markdown, if you didn't receive it, hit me up via the email in my profile.
Must be serendipity, as I've been working on a TUI drag-and-drop designer, like a terminal Visual Basic 4. I've tried Visual Basic for DOS and it was as streamlined as the Windows version.
I did not use TV directly, but I loved the design, so I used its basic ideas for my own DOS UI library for Turbo Pascal. It was completely graphical and implemented using built-in assembly. In order to save RAM, my window manager would save invisible regions to disk.
Happy to share. It's really not in a shareable state right now, so that's 'happy in a few days maybe' if that's ok :) But you can ping me and I can let you know when it is? Drop me an email, username at gmail.