Turning visual data into patches
Video compression network
Spacetime latent patches
Scaling transformers for video generation
Variable durations, resolutions, aspect ratios
Language understanding
Prompting with images and videos
Extending generated videos
Video-to-video editing
Connecting videos
Image generation capabilities
Emerging simulation capabilities
Discussion
Yandi large model
Digital human video generation model
Author
Dr. Lin Huijie