| Main Authors: | Wang, He; Guo, Pengcheng; Wan, Xucheng; Zhou, Huan; Xie, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.05466 |
Similar Items
MTGA: Multi-View Temporal Granularity Aligned Aggregation for Event-Based Lip-Reading
by: Zhang, Wenhao, et al.
Published: (2024)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer
by: Park, Young-Hu, et al.
Published: (2025)
Ev-Layout: A Large-scale Event-based Multi-modal Dataset for Indoor Layout Estimation and Tracking
by: Guo, Xucheng, et al.
Published: (2025)
UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
by: Huang, Jiehui, et al.
Published: (2025)
VALLR: Visual ASR Language Model for Lip Reading
by: Thomas, Marshall, et al.
Published: (2025)
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
by: Deria, Ankan, et al.
Published: (2026)
Super Encoding Network: Recursive Association of Multi-Modal Encoders for Video Understanding
by: Chen, Boyu, et al.
Published: (2025)
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
by: Wu, Linzhi, et al.
Published: (2024)
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention
by: Li, Kai, et al.
Published: (2025)
MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading
by: Rossi, Matteo
Published: (2026)
Granulon: Awakening Pixel-Level Visual Encoders with Adaptive Multi-Granularity Semantics for MLLM
by: Mao, Junyuan, et al.
Published: (2026)
VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
by: Maaz, Muhammad, et al.
Published: (2024)
Hierarchical Semantic Learning for Multi-Class Aorta Segmentation
by: Shi, Pengcheng
Published: (2025)
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
by: He, Zhihao, et al.
Published: (2025)
AnyTalker: Scaling Multi-Person Talking Video Generation with Interactivity Refinement
by: Zhong, Zhizhou, et al.
Published: (2025)
YingVideo-MV: Music-Driven Multi-Stage Video Generation
by: Chen, Jiahui, et al.
Published: (2025)
ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
by: Ma, David, et al.
Published: (2025)
Enhancing Speech-Driven 3D Facial Animation with Audio-Visual Guidance from Lip Reading Expert
by: EunGi, Han, et al.
Published: (2024)
OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers
by: Peng, Ziqiao, et al.
Published: (2025)
PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis
by: Huang, Yihuan, et al.
Published: (2025)
SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion
by: Ma, Junxian, et al.
Published: (2025)
Scaling Down Text Encoders of Text-to-Image Diffusion Models
by: Wang, Lifu, et al.
Published: (2025)
When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding
by: Fang, Pengcheng, et al.
Published: (2025)
LawDNet: Enhanced Audio-Driven Lip Synthesis via Local Affine Warping Deformation
by: Junli, Deng, et al.
Published: (2024)
Boosting Adversarial Transferability Against Defenses via Multi-Scale Transformation
by: Guo, Zihong, et al.
Published: (2025)
Enhance Multi-Scale Spatial-Temporal Coherence for Configurable Video Anomaly Detection
by: Cheng, Kai, et al.
Published: (2023)
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
by: Wang, Yueqian, et al.
Published: (2025)
Enhancing Visual Forced Alignment with Local Context-Aware Feature Extraction and Multi-Task Learning
by: He, Yi, et al.
Published: (2025)
DynamicLip: Shape-Independent Continuous Authentication via Lip Articulator Dynamics
by: Chen, Huashan, et al.
Published: (2025)
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language
by: Yeo, Jeong Hun, et al.
Published: (2024)
Guided Depth Map Super-Resolution via Multi-Scale Fusion U-shaped Mamba Network
by: Guo, Chenggang, et al.
Published: (2025)
MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation
by: Ling, Run, et al.
Published: (2025)
Efficient Masked AutoEncoder for Video Object Counting and A Large-Scale Benchmark
by: Cao, Bing, et al.
Published: (2024)
Learning Multi-view Anomaly Detection with Efficient Adaptive Selection
by: He, Haoyang, et al.
Published: (2024)
WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval
by: Ren, Junlong, et al.
Published: (2025)
Integrating Persian Lip Reading in Surena-V Humanoid Robot for Human-Robot Interaction
by: Abbasi, Ali Farshian, et al.
Published: (2025)
MSVCOD:A Large-Scale Multi-Scene Dataset for Video Camouflage Object Detection
by: Gao, Shuyong, et al.
Published: (2025)
Breaking the Encoder Barrier for Seamless Video-Language Understanding
by: Li, Handong, et al.
Published: (2025)
M$^{3}$T2IBench: A Large-Scale Multi-Category, Multi-Instance, Multi-Relation Text-to-Image Benchmark
by: Zhang, Huixuan, et al.
Published: (2025)