| Main Authors: | Deng, Xueqing, Yu, Qihang, Athar, Ali, Yang, Chenglin, Yang, Linjie, Jin, Xiaojie, Shen, Xiaohui, Chen, Liang-Chieh |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.02589 |
Similar Items
COCONut: Modernizing COCO Segmentation
by: Deng, Xueqing, et al.
Published: (2024)
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation
by: Athar, Ali, et al.
Published: (2024)
Randomized Autoregressive Visual Generation
by: Yu, Qihang, et al.
Published: (2024)
A Simple Video Segmenter by Tracking Objects Along Axial Trajectories
by: He, Ju, et al.
Published: (2023)
PanDepth: Joint Panoptic Segmentation and Depth Completion
by: Lagos, Juan, et al.
Published: (2022)
1.58-bit FLUX
by: Yang, Chenglin, et al.
Published: (2024)
An Image is Worth 32 Tokens for Reconstruction and Generation
by: Yu, Qihang, et al.
Published: (2024)
MaskBit: Embedding-free Image Generation via Bit Tokens
by: Weber, Mark, et al.
Published: (2024)
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
by: Kim, Dongwon, et al.
Published: (2025)
FingerCap: Fine-grained Finger-level Hand Motion Captioning
by: Shen, Xin, et al.
Published: (2025)
PanORama: Multiview Consistent Panoptic Segmentation in Operating Rooms
by: Gürbüz, Tuna, et al.
Published: (2026)
PanSR: An Object-Centric Mask Transformer for Panoptic Segmentation
by: Žust, Lojze, et al.
Published: (2024)
PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
by: Pan, Yining, et al.
Published: (2026)
GroundCap: A Visually Grounded Image Captioning Dataset
by: Oliveira, Daniel A. P., et al.
Published: (2025)
PanSt3R: Multi-view Consistent Panoptic Segmentation
by: Zust, Lojze, et al.
Published: (2025)
CompCap: Improving Multimodal Large Language Models with Composite Captions
by: Chen, Xiaohui, et al.
Published: (2024)
SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding
by: Yang, Zhiliu, et al.
Published: (2025)
FSAR-Cap: A Fine-Grained Two-Stage Annotated Dataset for SAR Image Captioning
by: Zhang, Jinqi, et al.
Published: (2025)
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
by: Zheng, Lihao, et al.
Published: (2026)
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
by: Chen, Jieneng, et al.
Published: (2024)
PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation
by: Li, Xiangtai, et al.
Published: (2023)
Panoptic Captioning: An Equivalence Bridge for Image and Text
by: Lin, Kun-Yu, et al.
Published: (2025)
Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting
by: Shin, Inkyu, et al.
Published: (2024)
Scene Graph-guided SegCaptioning Transformer with Fine-grained Alignment for Controllable Video Segmentation and Captioning
by: Zhang, Xu, et al.
Published: (2026)
VITRIX-CLIPIN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions
by: Wang, Ziteng, et al.
Published: (2025)
VoCap: Video Object Captioning and Segmentation from Any Prompt
by: Uijlings, Jasper, et al.
Published: (2025)
ProCap: Projection-Aware Captioning for Spatial Augmented Reality
by: Cao, Zimo, et al.
Published: (2026)
MC-PanDA: Mask Confidence for Panoptic Domain Adaptation
by: Martinović, Ivan, et al.
Published: (2024)
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
by: Ren, Sucheng, et al.
Published: (2025)
Frequency-Aware Flow Matching for High-Quality Image Generation
by: Ren, Sucheng, et al.
Published: (2026)
Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization
by: Liu, Qihao, et al.
Published: (2024)
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
by: Ren, Sucheng, et al.
Published: (2024)
Deeply Supervised Flow-Based Generative Models
by: Shin, Inkyu, et al.
Published: (2025)
IG Captioner: Information Gain Captioners are Strong Zero-shot Classifiers
by: Yang, Chenglin, et al.
Published: (2023)
Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
Rational Design Strategies in DNA‐Encoded Libraries for Drug Discovery
by: Xudong Wang, et al.
Published: (2025)
COCO-OLAC: A Benchmark for Occluded Panoptic Segmentation and Image Understanding
by: Wei, Wenbo, et al.
Published: (2024)
ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
by: Cao, Shuo, et al.
Published: (2025)
CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval
by: Xu, Yifan, et al.
Published: (2024)
Open-World Panoptic Segmentation
by: Sodano, Matteo, et al.
Published: (2024)