| Main Authors: | Yang, Yang; Wang, Wenhai; Chen, Zhe; Dai, Jifeng; Zheng, Liang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.13803 |
Similar Items
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
CoMemo: LVLMs Need Image Context with Image Memory
by: Liu, Shi, et al.
Published: (2025)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)
Aligning Object Detector Bounding Boxes with Human Preference
by: Strafforello, Ombretta, et al.
Published: (2024)
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
by: Wang, Weiyun, et al.
Published: (2024)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)
Distortion-Aware Adversarial Attacks on Bounding Boxes of Object Detectors
by: Phuc, Pham, et al.
Published: (2024)
Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers
by: Nokabadi, Fatemeh Nourilenjan, et al.
Published: (2024)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)
Out-of-Bounding-Box Triggers: A Stealthy Approach to Cheat Object Detectors
by: Lin, Tao, et al.
Published: (2024)
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)
GenExam: A Multidisciplinary Text-to-Image Exam
by: Wang, Zhaokai, et al.
Published: (2025)
MICDrop: Masking Image and Depth Features via Complementary Dropout for Domain-Adaptive Semantic Segmentation
by: Yang, Linyan, et al.
Published: (2024)
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
by: Cao, Yue, et al.
Published: (2024)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)
AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
by: Ren, Yiming, et al.
Published: (2025)
Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)
Adaptive Dropout: Unleashing Dropout across Layers for Generalizable Image Super-Resolution
by: Xu, Hang, et al.
Published: (2025)
Significance and Stability Analysis of Gene-Environment Interaction using RGxEStat
by: Qin, Meng'en, et al.
Published: (2026)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)
Object Detectors in the Open Environment: Challenges, Solutions, and Outlook
by: Liang, Siyuan, et al.
Published: (2024)
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
by: Xing, Zhenghao, et al.
Published: (2025)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)
FSSD: Feature Fusion Single Shot Multibox Detector
by: Li, Zuoxin, et al.
Published: (2017)
Point or Line? Using Line-based Representation for Panoptic Symbol Spotting in CAD Drawings
by: Wei, Xingguang, et al.
Published: (2025)
SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
by: Zuo, Xin, et al.
Published: (2025)
Beyond Dropout: Robust Convolutional Neural Networks Based on Local Feature Masking
by: Gong, Yunpeng, et al.
Published: (2024)
Transferable Dual-Domain Feature Importance Attack against AI-Generated Image Detector
by: Zhu, Weiheng, et al.
Published: (2025)
Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
by: Luo, Gen, et al.
Published: (2025)
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
by: Cui, Erfei, et al.
Published: (2023)
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
by: Gao, Zhangwei, et al.
Published: (2024)
Theoretically Achieving Continuous Representation of Oriented Bounding Boxes
by: Xiao, Zi-Kai, et al.
Published: (2024)
MGPC: Multimodal Network for Generalizable Point Cloud Completion With Modality Dropout and Progressive Decoding
by: Liu, Jiangyuan, et al.
Published: (2026)
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)
OpenBox: Annotate Any Bounding Boxes in 3D
by: Lee, In-Jae, et al.
Published: (2025)
BoxSplitGen: A Generative Model for 3D Part Bounding Boxes in Varying Granularity
by: Koo, Juil, et al.
Published: (2026)
Bounding-box Watermarking: Defense against Model Extraction Attacks on Object Detectors
by: Koda, Satoru, et al.
Published: (2024)
Dropout the High-rate Downsampling: A Novel Design Paradigm for UHD Image Restoration
by: Wu, Chen, et al.
Published: (2024)