| Main Authors: | Jiang, Wentao, Zhang, Yige, Zheng, Shaozhong, Liu, Si, Yan, Shuicheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.08650 |
Similar Items
Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
by: Liu, Yuansen, et al.
Published: (2025)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency
by: Li, Hongyu, et al.
Published: (2025)
AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation
by: Wu, Ruipu, et al.
Published: (2025)
Revisiting Audio-Visual Segmentation with Vision-Centric Transformer
by: Huang, Shaofei, et al.
Published: (2025)
StreamingEffect: Real-Time Human-Centric Video Effect Generation
by: Song, Yiren, et al.
Published: (2026)
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
by: Lin, Honglin, et al.
Published: (2026)
ShotVL: Human-Centric Highlight Frame Retrieval via Language Queries
by: Xue, Wangyu, et al.
Published: (2024)
You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception
by: Jin, Sheng, et al.
Published: (2023)
SMPLer: Taming Transformers for Monocular 3D Human Shape and Pose Estimation
by: Xu, Xiangyu, et al.
Published: (2024)
Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation
by: Fei, Yuanchen, et al.
Published: (2026)
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
by: Fei, Hao, et al.
Published: (2024)
WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes
by: Yang, Ling, et al.
Published: (2025)
Dynamic Pattern Alignment Learning for Pretraining Lightweight Human-Centric Vision Models
by: Wang, Xuanhan, et al.
Published: (2025)
MetaFormer Baselines for Vision
by: Yu, Weihao, et al.
Published: (2022)
HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing
by: Bai, Jinbin, et al.
Published: (2024)
GeoMix: Towards Geometry-Aware Data Augmentation
by: Zhao, Wentao, et al.
Published: (2024)
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
by: Man, Yunze, et al.
Published: (2025)
EasyInv: Toward Fast and Better DDIM Inversion
by: Zhang, Ziyue, et al.
Published: (2024)
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)
RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation
by: Han, Songhao, et al.
Published: (2025)
Modeling Cross-vision Synergy for Unified Large Vision Model
by: Wu, Shengqiong, et al.
Published: (2026)
Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving
by: Li, Xiang, et al.
Published: (2025)
Raw Data Matters: Enhancing Prompt Tuning by Internal Augmentation on Vision-Language Models
by: Li, Haoyang, et al.
Published: (2025)
HumanOmni: A Large Vision-Speech Language Model for Human-Centric Video Understanding
by: Zhao, Jiaxing, et al.
Published: (2025)
Human-Centric Transformer for Domain Adaptive Action Recognition
by: Lin, Kun-Yu, et al.
Published: (2024)
EditWorld: Simulating World Dynamics for Instruction-Following Image Editing
by: Yang, Ling, et al.
Published: (2024)
Data Augmentation for Text-based Person Retrieval Using Large Language Models
by: Li, Zheng, et al.
Published: (2024)
A Vision-Centric Approach for Static Map Element Annotation
by: Zhang, Jiaxin, et al.
Published: (2023)
@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology
by: Jiang, Xin, et al.
Published: (2024)
DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On
by: Zhan, Wengyi, et al.
Published: (2024)
Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light Images Without Task-Related Data
by: Zhang, Zhao, et al.
Published: (2022)
DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
by: Zhou, Yikang, et al.
Published: (2024)
IPDreamer: Appearance-Controllable 3D Object Generation with Complex Image Prompts
by: Zeng, Bohan, et al.
Published: (2023)
Vision-Centric Activation and Coordination for Multimodal Large Language Models
by: Wang, Yunnan, et al.
Published: (2025)
SynthVLM: Towards High-Quality and Efficient Synthesis of Image-Caption Datasets for Vision-Language Models
by: Liu, Zheng, et al.
Published: (2024)
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
by: Liu, Zheng, et al.
Published: (2025)
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
by: Hu, Wenbo, et al.
Published: (2024)
VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models
by: Yu, Xinlei, et al.
Published: (2025)
HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
by: Li, Keliang, et al.
Published: (2025)
Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval
by: Xue, Zhucun, et al.
Published: (2026)