| Main Authors: | He, Jun, Lin, Yi, Huang, Zilong, Yin, Jiacong, Ye, Junyan, Zhou, Yuchuan, Li, Weijia, Zhang, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.22228 |
Similar Items
Scene4U: Hierarchical Layered 3D Scene Reconstruction from Single Panoramic Image for Your Immerse Exploration
by: Huang, Zilong, et al.
Published: (2025)
GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT4o in Image Generation
by: Yan, Zhiyuan, et al.
Published: (2025)
MajutsuCity: Language-driven Aesthetic-adaptive City Generation with Controllable 3D Assets and Layouts
by: Huang, Zilong, et al.
Published: (2025)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
by: Zhou, Baichuan, et al.
Published: (2024)
BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception
by: Ye, Junyan, et al.
Published: (2025)
LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models
by: Ye, Junyan, et al.
Published: (2024)
CrossViewDiff: A Cross-View Diffusion Model for Satellite-to-Street View Synthesis
by: Li, Weijia, et al.
Published: (2024)
GenClaw: Code-Driven Agentic Image Generation
by: Ye, Junyan, et al.
Published: (2026)
The Less Meaningful the Understanding, the Faster the Feeling: Speech Comprehension Changes Perceptual Speech Tempo
by: Liangjie Chen, et al.
Published: (2025)
Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation
by: He, Jun, et al.
Published: (2026)
SatSAM2: Motion-Constrained Video Object Tracking in Satellite Imagery using Promptable SAM2 and Kalman Priors
by: Fan, Ruijie, et al.
Published: (2025)
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis
by: Ye, Junyan, et al.
Published: (2024)
Do MLLMs Exhibit Human-like Perceptual Behaviors? HVSBench: A Benchmark for MLLM Alignment with Human Perceptual Behavior
by: Lin, Jiaying, et al.
Published: (2024)
FakeVLM-R1: Internalizing Physical Laws via CoT for Synthetic Image Detection
by: Zhu, Leqi, et al.
Published: (2026)
Analog or Digital In-memory Computing? Benchmarking through Quantitative Modeling
by: Sun, Jiacong, et al.
Published: (2024)
RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension
by: Gao, Tianyi, et al.
Published: (2025)
Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature
by: Shen, Lingdong, et al.
Published: (2024)
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
by: Ye, Junyan, et al.
Published: (2025)
RAS-Eval: A Comprehensive Benchmark for Security Evaluation of LLM Agents in Real-World Environments
by: Fu, Yuchuan, et al.
Published: (2025)
RealGen: Photorealistic Text-to-Image Generation via Detector-Guided Rewards
by: Ye, Junyan, et al.
Published: (2025)
3D Question Answering for City Scene Understanding
by: Sun, Penglei, et al.
Published: (2024)
A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning
by: Jiang, Siyang, et al.
Published: (2025)
TrueCity: Real and Simulated Urban Data for Cross-Domain 3D Scene Understanding
by: Nguyen, Duc, et al.
Published: (2025)
ManiFeel: Benchmarking and Understanding Visuotactile Manipulation Policy Learning
by: Luu, Quan Khanh, et al.
Published: (2025)
LibCity: A Unified Library Towards Efficient and Comprehensive Urban Spatial-Temporal Prediction
by: Jiang, Jiawei, et al.
Published: (2023)
Reference-based Controllable Scene Stylization with Gaussian Splatting
by: Mei, Yiqun, et al.
Published: (2024)
SCTc-TE: A Comprehensive Formulation and Benchmark for Temporal Event Forecasting
by: Ma, Yunshan, et al.
Published: (2023)
Where am I? Cross-View Geo-localization with Natural Language Descriptions
by: Ye, Junyan, et al.
Published: (2024)
SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia
by: Yue, Pengfei, et al.
Published: (2026)
Perceptual-GS: Scene-adaptive Perceptual Densification for Gaussian Splatting
by: Zhou, Hongbi, et al.
Published: (2025)
Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding
by: De, Anik, et al.
Published: (2025)
IntentGrasp: A Comprehensive Benchmark for Intent Understanding
by: Yin, Yuwei, et al.
Published: (2026)
SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation
by: Ye, Junyan, et al.
Published: (2024)
Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network
by: Ye, Junyan, et al.
Published: (2024)
OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild
by: Guo, Yuncheng, et al.
Published: (2025)
Understanding Audiovisual Deepfake Detection: Techniques, Challenges, Human Factors and Perceptual Insights
by: Hashmi, Ammarah, et al.
Published: (2024)
Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind
by: Li, Qingmei, et al.
Published: (2025)
Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
by: Shi, Shuyao, et al.
Published: (2026)
OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding
by: Zhao, Youjun, et al.
Published: (2024)
EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model
by: Li, Sijing, et al.
Published: (2025)