Show HN: LemonSlice – Upgrade your voice agents to real-time video | Hacker News
Chatbots are everywhere and voice AI has taken off, but we believe video avatars will be the most common form factor for conversational AI. Most people would rather watch something than read it. The problem is that generating video in real-time is hard, and overcoming the uncanny valley is even harder.
We haven’t broken the uncanny valley yet. Nobody has. But we’re getting close and our photorealistic avatars are currently best-in-class (judge for yourself: https://lemonslice.com/try/taylor). Plus, we're the only avatar model that can do animals and heavily stylized cartoons. Try it: https://lemonslice.com/try/alien. Warning! Talking to this little guy may improve your mood.
Today we're releasing our new model* - Lemon Slice 2, a 20B-parameter diffusion transformer that generates infinite-length video at 20fps on a single GPU - and opening up our API.
How did we get a video diffusion model to run in real-time? There was no single trick, just a lot of them stacked together. The first big change was making our model causal. Standard video diffusion models are bidirectional (they look at frames both before and after the current one), which means you can't stream.
From there it was about fitting everything on one GPU. We switched from full to sliding window attention, which killed our memory bottleneck. We distilled from 40 denoising steps down to just a few - quality degraded less than we feared, especially after using GAN-based distillation (though tuning that adversarial loss to avoid mode collapse was its own adventure).
And the rest was inference work: modifying RoPE from complex to real (this one was cool!), precision tuning, fusing kernels, a special rolling KV cache, lots of other caching, and more. We kept shaving off milliseconds wherever we could and eventually got to real-time.
We set up a guest playground for HN so you can create and talk to characters without logging in: https://lemonslice.com/hn. For those who want to build with our API (we have a new LiveKit integration that we’re pumped about!), grab a coupon code in the HN playground for your first Pro month free ($100 value). See the docs: https://lemonslice.com/docs. Pricing is usage-based at $0.12-0.20/min for video generation.
Looking forward to your feedback!
EDIT: Tell us what characters you want to see in the comments and we can make them for you to talk to (e.g. Max Headroom)
*We did a Show HN last year for our V1 model: https://news.ycombinator.com/item?id=43785044. It was technically impressive but so bad compared to what we have today.
Currently the conversation still feels too STT-LLM-TTS that I think a lot of the voice agents suffer from (Seems like only Sesame and NVDIA so far have nailed the natural conversation flow). Still, crazy good work train your own diffusion models, I remember taking a look at the latest literature on diffusion and was mind blown by the advances in last years or so since u-net architecture days.
EDIT: I see that the primary focus is on video generation not audio.
reply
But, to your point, there are many benefits of two-way S2S voice beyond just speed.
Using our LiveKit integration you can use LemonSlice with any voice provider you like. The current S2S providers LiveKit offers include OpenAI, Gemini, and Grok and I'm sure they'll add Personaplex soon.
reply
reply
The text processing is running Qwen / Alibaba?
reply
reply
reply
reply
reply
Video Agents
Unlimited agents
Up to 3 concurrent calls
Creative Studio
1min long videos
Up to 3 concurrent generations
Does that mean I can have a total of 1 minute of video calls? Or video calls can only be 1 minute long? Or does it mean I can have unlimited calls, 3 calls at a time all month long?
Can I have different avatars or only the same avatar x 3?
Can I record the avatar and make videos and post on social media?
reply
reply
reply
reply
reply
reply
It's a normal mp4 video that's looping initially (the "welcome message") and then as soon as you send the bot a message, we connect you to a GPU and the call becomes interactive. Connecting to the GPU takes about 10s.
reply
reply
My mind is blown! It feels like the first time I used my microphone to chat with ai
reply
reply
reply
Anyway, big thumbs up for the LemonSlice team, I'm excited to see it progress. I can definitely see products start coming alive with tools like this.
reply
reply
reply
reply
reply
You can also control background motions (like ocean waves, or a waterfall or car driving).
We are actively training a model that has better text control over hand motions.
reply
[0] https://lemonslice.com/privacy
reply
reply
reply
reply
I was thinking why the quality is so poor.
reply
reply
reply
I am double checking now to make 100% sure we return the original audio (and not the encoded/decoded audio).
We are working on high-res.
reply
reply
I think people will just copy it, and we just need to continue moving as fast as we can. I do think that a bit of a revolution is happening right now in real-time video diffusion models. There are so many great papers being published in that area in the last 6 months. My guess is that many DiT models will be real time within 1 year.
reply
reply
reply
reply
reply
reply
I have so many websites that would do well with this!
reply
reply
reply
reply
reply
Take my money!!!!!!
reply
reply
reply
reply
reply
reply
reply
For the fully hosted version, we are currently partnered with ElevenLabs.
reply
reply
reply
reply
reply
reply
reply
reply
reply
reply
reply
I appreciate your concern for the quality of the site - that fact that the community here cares so much about protecting it is the main reason why it continues to survive. Still, it's against HN's rules to post like you did here. Could you please review https://news.ycombinator.com/newsguidelines.html? Note this part:
"Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email [email protected] and we'll look at the data."
reply
even I am surprised with how many opnely positive comments we are getting. it's not been our experience in the past.
reply
