Bibliographic Details
Main Author: Jin, Haopeng
Format: Digital resource
Language: English
Published: Zenodo 2026
Online Access: https://doi.org/10.5281/zenodo.19712490
Contents:
  • MoCha-LD technical report / preprint.

    Long video generation remains challenging because diffusion models that perform well on short clips often degrade at longer horizons, exhibiting appearance drift, discontinuous motion at segment boundaries, and unfavorable memory and compute scaling. This paper presents a framework for long video generation via temporal chunking in latent space with explicit regularization for cross-chunk continuity. Videos are generated as sequences of overlapping latent chunks rather than single full-length tensors, bounding per-step memory by chunk length while overlap provides structured temporal context across chunks. To reduce semantic drift beyond what overlap alone preserves, the denoiser is conditioned on a compact recurrent memory summarizing previously generated chunks. To improve boundary quality, motion-aware consistency regularization is introduced with two components: a flow-guided latent warping term aligning neighboring chunks under estimated motion, and a velocity-consistency term encouraging smooth temporal evolution across chunk boundaries. Experiments on UCF101, DAVIS, and WebVid-10M subsets demonstrate improved long-horizon coherence relative to chunked latent diffusion baselines. On 128-frame generation with a 16-frame chunk length and 4-frame overlap, the full model reduces FVD from 412.7 to 361.4 on UCF101 and from 498.2 to 436.9 on DAVIS, while reducing a boundary warp error metric by 18.6% and 16.9%, respectively. Ablations show that overlap conditioning primarily reduces visible seams, recurrent memory primarily reduces long-range identity drift, and motion-aware regularization primarily improves boundary dynamics. These results support the claim that scalable long-video generation benefits from explicitly separating local continuity, long-range context, and motion consistency.

    Existing OSF archival DOI: 10.17605/OSF.IO/NUXTH; Existing OSF archival page: https://osf.io/nuxth/.

    Files include the technical report PDF and the LaTeX source tarball when available.
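
The chunking setup and headline numbers in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not taken from the MoCha-LD report: the function names `chunk_spans`, `boundary_velocity_penalty`, and `relative_reduction` are hypothetical, the rule of shifting the final chunk back so it ends on the last frame is an assumption (the abstract does not say how a non-divisible length is handled), and the velocity penalty is one plausible finite-difference form of a "velocity-consistency term", not the paper's definition.

```python
def chunk_spans(num_frames, chunk_len, overlap):
    """Return (start, end) frame spans of overlapping chunks.

    Assumption: consecutive chunks advance by chunk_len - overlap frames,
    and the final chunk is shifted back so that it ends exactly at
    num_frames (its overlap with the previous chunk may then be larger).
    """
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    if num_frames <= chunk_len:
        return [(0, num_frames)]
    stride = chunk_len - overlap
    spans = []
    start = 0
    while start + chunk_len < num_frames:
        spans.append((start, start + chunk_len))
        start += stride
    spans.append((num_frames - chunk_len, num_frames))
    return spans


def boundary_velocity_penalty(frames, boundary):
    """Mean squared change in finite-difference velocity at a chunk seam.

    frames: sequence of equal-length latent vectors, one per frame.
    boundary: index of the first frame after the seam.
    Hypothetical form: compares the step crossing the seam with the
    last step before it.
    """
    if not 2 <= boundary < len(frames):
        raise ValueError("boundary must satisfy 2 <= boundary < len(frames)")
    before2, before1, after = (
        frames[boundary - 2],
        frames[boundary - 1],
        frames[boundary],
    )
    squared = [
        ((a - b1) - (b1 - b2)) ** 2
        for b2, b1, a in zip(before2, before1, after)
    ]
    return sum(squared) / len(squared)


def relative_reduction(baseline, improved):
    """Fractional reduction of a lower-is-better metric such as FVD."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Settings quoted in the abstract: 128 frames, 16-frame chunks, 4-frame overlap.
    spans = chunk_spans(128, 16, 4)
    print(len(spans), spans[:2], spans[-1])

    # FVD values quoted in the abstract.
    print(f"UCF101: {relative_reduction(412.7, 361.4):.1%}")
    print(f"DAVIS:  {relative_reduction(498.2, 436.9):.1%}")
```

Under the clamped schedule assumed here, 128 frames split into 11 chunks, each holding 16 frames of latents at a time. The FVD figures quoted in the abstract correspond to relative reductions of about 12.4% on UCF101 and 12.3% on DAVIS. The velocity penalty is zero for latents that evolve linearly across a seam and grows with any jump at the boundary.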