Hacker News: schopra909's comments

Honestly never considered the forking use case; but it makes a ton of sense when explained

Congrats on the launch. This is cool tech


Really cool to see innovation in terms of quality of tiny models. Great work!


Thanks a lot. Small-model quality is improving fast: this 15M model is way better than the 80M model from our previous launch (V0.1).


Very cool work! We spend a lot of time thinking about "robust representations" in the video space.

Are there any alternative ideas to JEPA right now, when it comes to speech encoding that couples meaning and sound? Curious to learn more about journey from the problem space to solution space (JEPA).

For context, in our domain video-JEPA hasn't proved to be as helpful as one would have hoped. It's decent at high level semantics (e.g. action detection) but doesn't capture enough "detail" (intentionally so) to be used as a powerful enough encoder (or regularizer). Might be just because the research models are too small / haven't been trained on sufficiently large volumes of data, yet.


Honest question: why were folks posting AI-generated comments in the first place? The barrier to commenting is already high. I only comment when I have something to contribute OR find something incredibly interesting.

So I'm just baffled why anyone was using AI to generate comments. What was the incentive driving the behavior?


One trend I noticed here and, annoyingly, in my co-op, is that people will take a really dense and complex topic that's either currently engaged in deep conversation with multiple people or ripe for it, and then post a link to a ChatGPT conversation with a tag like "I didn't have time to get my thoughts together, but here's a ChatGPT overview / some suggested solutions!" For me that's the equivalent of "I googled that for you," aka extremely rude.

Thanks, but if I wanted ChatGPT's middle-of-the-bell-curve response, I would have put in the five seconds of effort myself to type the question into its input field.


In addition to the "Internet points" mentioned above: influence operations, both from nation states (e.g. the PRC's 50 Cent Party [1], and probably the dozen most powerful nations in general), and from gray/black-market marketing companies.

Influence is valuable, and HN is a place that people who are aware of it trust highly.

(AI generation of random comments helps build "trustworthy" accounts that can then be activated when a relevant issue comes up)

[1] https://en.wikipedia.org/wiki/50_Cent_Party


OK, but those are probably not deterred by guidelines.


They absolutely are. Have you ever done any work fighting spam? It's all about making it hard and expensive enough for spam to land that it's no longer economically viable - you don't and can't actually stop all spam. Same thing here.

Sure, the bad actors don't particularly care about the guidelines - until their accounts start losing karma and getting dead'd/banned. Then they do, and that still materially improves the site.


Relevant XKCD: https://xkcd.com/810/


On HN, I sometimes used AI to change the tone of my comments - e.g., to add sarcasm or extra-polished corporate-speak for comical effect. OK, now I won't.


If you can't do the sarcasm yourself (and be witty enough), it's just not fun or improved in any way. Corporate speak is sarcasm in its own right, of course - but it only works if it's something you're actually exposed to (and people can relate to), instead of being fake.

Also, if you have to mark the sarcasm, it's properly bad.


Most comments on here are really well-written. I can imagine someone for whom English is a second language (or who isn't as good at writing as they'd like to be) using an LLM to "keep up." Of course, this only works until they decide to post something without those tools.


Although I'm unsure about their purpose, I am fairly certain it is not an English as a second language matter.


Several people at my work do use LLMs for this in code, commit messages, and even on Slack. It may not be everyone, or even a majority, but it is something that some people legitimately do.

While many here are saying "who cares about your spelling and grammar," they have not been the people whose poor English gets them flagged as somehow less intelligent or credible. Half the problem with LLMs is that they speak eloquently, and we use that as a signal of someone's intelligence and trustworthiness. For someone who is otherwise intelligent but doesn't know English well, this can be a major setback.


Same as always: being right about something.


Reputation farming -> upvote rings -> black market promotion


Internet points.


Which can then translate to real-world money points.


How would karma on HN lead to this?


You need a minimum threshold of karma in order to downvote others on HN. Additionally, accounts with more well-received activity are harder to identify as shills. That's why there are black markets where social media accounts are bought and sold, with prices typically proportional to the account's karma.


Honestly, it's really hard to shorten the feedback loop in this space. For this, we really just ran one experiment at a time and visually inspected the results everywhere. When you're going 0 -> 1, you're looking for "signs of life" to make sure the basic thing is working.

When it comes to testing which (of the infinite levers) to pull, a lot of it comes from intuition (which I know isn't the most fun answer). We spent a week or so just running experiments on the amount of compression we could squeeze out of the VAE without significant degradation in the final results. In hindsight, spending a week on that seems like a waste, since we got the 8x spatial, 4x temporal compression within the first 1-2 days. But in the moment, you're often unsure WHAT will be the key unlock. So, when you're in the middle of the storm, you're running a quick Bayesian process in your head, weighing what you might learn from the outcome of an experiment against the time/money it would take to run it. And you hope that your intuitions become stronger over time, with more repetitions.

More money might help (e.g. parallel experiments, more detailed explorations). But I don't think money is a cure-all. At some point, you get lost in the sauce trying to tie the threads between all the empirical findings you have at your fingertips. Maybe one day AI models could help integrate all these results. As it stands, they still struggle to reason about this stuff in the context of other research papers and findings (likely because all the context on arXiv is so noisy; you can't trust any particular finding, and verifying findings is so hard to do that it's hard to meta-reason about your experiments correctly).


Hadn’t seen that before! Seems very in line with the broader points about regularization. In table 4 they show faster convergence at 200 epochs when used alongside REPA. I’d be curious to see if it ended up beating REPA by itself with the full 800 epochs of training — or if something about this new latent space leads to plateauing (learns faster but caps out on expressivity). We’ve seen that phenomenon before in other situations (e.g. UNet learns faster than DiT because of convolutions, but stops learning beyond a certain point).


Yep, Apache 2.0! So anyone's welcome to download and hack away.


Hi HN, I’m one of the two authors of the post and the Linum v2 text-to-video model (https://news.ycombinator.com/item?id=46721488). We're releasing our Image-Video VAE (open weights) and a deep dive on how we built it. Happy to answer questions about the work!


Great work! I have been wondering what it would take to train with higher image bit depth (10 or 12 bits) and/or using camera footage only, rather than already heavily processed images. The usefulness of video generation in most professional use cases is limited because models are too end-to-end and completely contaminated with stock footage. Maybe the quantity of training material needed simply isn't there?

Not blaming you — asking because I don't usually have access to professionals working on video training.


It’s a great question. In terms of pre-training, even if there were enough data at that quality, storing it and either demuxing it into raw frames OR compressing it with a sufficiently powerful encoder would likely cost a lot of $. But there’s a case for using a much smaller subset of that data to dial in aesthetics towards the end of training. The gotcha there comes in terms of data diversity. Often you see that models will adapt to the new distribution and forget patterns from the old data. It’s hard to disentangle a model learning clarity of detail from learning concepts, so you might forget key ideas when picking up these details. Nevertheless, maybe there is a way to use small amounts of this data in an RL finetuning setup? In our experience, RL post-training changes very little in the underlying model weights — so it might be a “light” enough touch to elicit the desired details.


No questions but I appreciate the write-up! Thank you for sharing.


This is very cool. Side note: I really dig the JavaScript animations on the causal block diffusion blog post. They made the concept immediately clear.


I think YC just released a video on the basics of diffusion, but honestly I don’t have a good end-to-end guide.

We’re going to write up going 0->1 on a video model (all the steps) over the coming months. But it likely won’t be a class or anything like that.

https://www.linum.ai/field-notes

We want to share our learnings with folks who are curious about the space but don’t have time for a full class experience.

Hopefully Karpathy does that with his courses in the future!

