Linum v2 text-to-video models

Today, we're launching two open-weight, 2B-parameter text-to-video models capable of generating 2-5 second clips at up to 720p. Download them to your local rig and hack away!

Linum v2 sample generations

A cute 3‑D animated baby goat with shaggy gray fur, a fluffy white chin tuft, and stubby curved horns perches on a round wooden stool. Warm golden studio lights bounce off its glossy cherry‑red acoustic guitar as it rhythmically strums with a confident hoof, hind legs dangling. Framed family portraits of other barnyard animals line the cream‑colored walls, a leafy potted ficus sits in the back corner, and dust motes drift through the cozy, sun‑speckled room.

What is Linum?

As of January 2026, we are a team of two brothers acting as a tiny-yet-powerful AI research lab. We train our own generative media models from scratch, starting with text-to-video.

If this is v2, what was v1?

We started working on Linum in the Fall of 2022 (~3 years ago). We had just wound down our last attempt at a startup and were in-between things. With all that free time, we were able to kick back and enjoy our favorite hobby — watching movies.

Growing up, we went to a school focused on the performing arts. I played bass, but my brother Manu spent his time making shorts in video production with his buddies. When Stable Diffusion sent the world into a frenzy that fall, Manu wondered if these models could be used to help directors storyboard more effectively. With that seed of an idea, we applied to YCombinator and were accepted a week later for the W23 batch.

Talking to filmmakers, we learned pretty quickly that storyboarding was too niche, so we took a look at AI video. Back then, there were no generative video models. Instead, folks were algorithmically warping and interpolating AI images to simulate video. We shipped tools that helped musicians generate set visuals and music videos with these techniques, but by the end of the batch it was clear this was also a dead end. We didn't see a clear path from these psychedelic shorts to fully realized stories. So we shifted gears again — this time setting out to build our own text-to-video model.

For Linum v1, we bootstrapped off of Stable Diffusion XL. We extended the model to generate video by doubling its parameters and training on open-source video datasets. In this way, we transformed the text-to-image model into a 180p, 1-second Discord GIF bot.
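
For the curious, here's a minimal sketch of what this kind of "inflation" generally looks like: wrap each pretrained spatial layer with a new, zero-initialized temporal layer so the inflated model starts out behaving exactly like the image model and gradually learns to mix information across frames. This is an illustration of the general technique, not our exact v1 architecture.

```python
# Minimal sketch of inflating a pretrained image (2D) block into a video block.
# Names and shapes are illustrative only.
import torch
import torch.nn as nn


class PseudoVideoBlock(nn.Module):
    """Wraps a pretrained spatial (per-frame) conv with a new temporal conv.

    The temporal conv is zero-initialized, so at the start of video fine-tuning
    the block behaves exactly like the original image model.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Pretend this layer comes from the pretrained image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # New temporal mixing layer (1D conv over the frame axis).
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, t, c, h, w = x.shape
        # Apply the pretrained spatial conv to every frame independently.
        y = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, c, h, w)
        # Mix information across frames at each spatial location.
        z = y.permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        z = self.temporal(z).reshape(b, h, w, c, t).permute(0, 4, 3, 1, 2)
        return y + z  # residual: starts out as the unchanged image-model output


frames = torch.randn(1, 8, 64, 32, 32)  # 8 frames of 32x32 feature maps
print(PseudoVideoBlock(64)(frames).shape)  # torch.Size([1, 8, 64, 32, 32])
```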

Extending an off-the-shelf image model into a video model is too hacky a solution to work long term. The VAE bundled with an image model doesn't know how to handle video, so generation quality is kneecapped from the start. And if you don't have the original image dataset the model was trained on, it's really hard to transition smoothly to video generation given how different the two distributions are; it costs a lot for a model to unlearn and relearn. At that point, you're better off building a model from scratch, with full control over every component of the model and dataset. So that's exactly what we did with Linum v2: we built it from soup to nuts.

Turns out, it's really hard to train a foundation model from scratch with just two people. You own every part of a process that usually takes half a dozen PhDs and several dozen people (at least). On data, you have to manage procurement, train and deploy VLMs for filtering, and run captioning pipelines over 100+ years of video footage. On compute, you have to benchmark providers, negotiate prices, and then keep your cluster operational. On research, you have to read the constant influx of new papers, figure out how to separate the semi-true from the bullsh*t, and then run experiments on a reasonable budget to draw conclusions.

It's taken us two years. But we're really excited to ship a model that's truly ours.

Where do we go from here?

We believe that access to financing is the limiting reagent for narrative filmmaking. It costs a lot of money to make a movie, and it's really hard to raise money to make your movie. If we can reduce the cost of production by an order of magnitude, we can enable a new generation of filmmakers to get off the ground.

Specifically, we're interested in improving the accessibility of animation. We view generative video models as "inverted rendering engines". Traditional animation software like Blender models physics from the ground up. As of today, that's the better approach to modeling the real world, but it produces software that is very hard to use. In contrast, generative video models learn lossy, often-inaccurate physics, but offer the possibility of creating more semantically meaningful controls through training.

We believe that by building better text-to-video models we can support high-quality animation while developing much more intuitive creative tools. This should make it easier to go from 0 to 1 and open the door for a new band of storytellers.

Linum v2 is a huge stepping stone for us, but truthfully we have a long way to go to realize this vision (just take a look at some of the flawed generations below).

Linum v2 failed generations

A vintage teal convertible with chrome accents cruises along a winding coastal highway, its white soft-top folded down. Dramatic cliffs dotted with succulents drop sharply to the right, where sapphire waves crash against jagged rocks sending up plumes of white spray. Late afternoon sun glints off the polished hood as the car hugs a sweeping curve, guardrail posts flickering past rhythmically.

⚠️ Inaccurate physics; foreground and background are moving

Over the next few months, we're going to start by addressing issues with physics, aesthetics, and deformations within this tiny 2B footprint through post-training (and a couple of other ideas). From there, we'll work on speed through popular techniques like CFG distillation and timestep distillation. And most importantly, we'll be working on audio capabilities and model scaling.
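
As a rough illustration of where that speedup comes from (not our implementation): classifier-free guidance runs the denoiser twice per sampling step, and CFG distillation trains a student to match the guided prediction in a single pass, while timestep distillation cuts down the number of steps themselves. The `eps` function below is a placeholder for a generic noise-prediction model.

```python
# Sketch of classifier-free guidance (CFG) at sampling time; names are illustrative.
from typing import Optional

import torch


def eps(x_t: torch.Tensor, t: torch.Tensor, cond: Optional[torch.Tensor]) -> torch.Tensor:
    """Stand-in for a text-to-video denoiser's noise prediction (placeholder)."""
    return torch.zeros_like(x_t)


def guided_eps(x_t: torch.Tensor, t: torch.Tensor, cond: torch.Tensor,
               scale: float = 7.0) -> torch.Tensor:
    """Classifier-free guidance: two denoiser calls per sampling step."""
    e_uncond = eps(x_t, t, None)   # unconditional pass
    e_cond = eps(x_t, t, cond)     # text-conditioned pass
    return e_uncond + scale * (e_cond - e_uncond)


# CFG distillation trains a student to reproduce guided_eps(...) in one forward
# pass, e.g. loss = mse(student(x_t, t, cond, scale), guided_eps(x_t, t, cond, scale)),
# halving the network evaluations per step. Timestep distillation then reduces
# the number of sampling steps themselves.
```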

We're going to blog about everything we've done so far and everything we're working on next. So, if you're interested in the sort of writing that sits at the intersection of applied research and engineering, subscribe to Field Notes.

Acknowledgments

Special thank you to our investors and infrastructure partners for their continued support.

YCombinator · Adverb Ventures
Crusoe · Together AI · Cloudflare · Ubicloud

Get Field Notes

Technical deep dives on building generative video models from the ground up, plus updates on new releases from Linum.