Staff View :: Library Catalog

Bibliographic Details
Main Author:	Cesista, Franz Louis
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2406.11403

_version_	1866910814459396096
author	Cesista, Franz Louis
author_facet	Cesista, Franz Louis
contents	Multimodal Foundation Models (MMFMs) have demonstrated strong performance in both computer vision and natural language processing tasks. However, their performance diminishes in tasks that require a high degree of integration between these modalities, such as document understanding. Moreover, fine-tuning and deploying these models requires significantly more compute and engineering effort than unimodal models. In this work, we present Multimodal Structured Generation, a framework that forces (frozen) MMFMs to produce outputs in a strictly structured format by applying hard constraints directly to the output logits. This approach not only ensures that the model generates parseable outputs that downstream APIs can easily ingest but also allows us to force the model to reason before answering, which significantly boosts performance without the need for expensive fine-tuning. We demonstrate the effectiveness of our method through competitive results in the CVPR 2nd MMFM Challenge, highlighting that carefully designed lightweight engineering can outperform expensive and complicated modeling approaches. All of our scripts, deployment steps, and evaluation results can be accessed at https://github.com/leloykun/MMFM-Challenge
format	Preprint
id	arxiv_https___arxiv_org_abs_2406_11403
institution	arXiv
publishDate	2024
record_format	arxiv
title	Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
topic	Computer Vision and Pattern Recognition; Computation and Language
url	https://arxiv.org/abs/2406.11403
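
The abstract in this record describes forcing a frozen model into a structured output format by applying hard constraints to its output logits. The general idea can be sketched in a few lines of Python. This is an illustrative toy, not the report's implementation: the vocabulary, the state machine FSM, and the stand-in model toy_logits are all hypothetical names introduced here, and a real system would derive the allowed-token sets from a schema or grammar and apply the mask to the logits of an actual model.

```python
import json
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["{", '"answer"', ":", "4", "7", "hello", "}", "<eos>"]

# Hypothetical output format as a finite-state machine:
# state -> {allowed token: next state}. None marks the end of generation.
FSM = {
    0: {"{": 1},
    1: {'"answer"': 2},
    2: {":": 3},
    3: {"4": 4, "7": 4},
    4: {"}": 5},
    5: {"<eos>": None},
}


def toy_logits(prefix):
    """Stand-in for a frozen model (assumption): it always prefers free text."""
    scores = {"hello": 5.0, "7": 2.0, "4": 1.0}
    return [scores.get(token, 0.0) for token in VOCAB]


def mask_logits(logits, allowed):
    """Hard constraint: every token outside `allowed` gets a logit of -inf."""
    return [
        value if VOCAB[i] in allowed else -math.inf
        for i, value in enumerate(logits)
    ]


def constrained_greedy_decode():
    """Greedy decoding where each step can only pick a format-valid token."""
    state, output = 0, []
    while state is not None:
        logits = mask_logits(toy_logits(output), FSM[state])
        best = max(range(len(VOCAB)), key=logits.__getitem__)
        token = VOCAB[best]
        output.append(token)
        state = FSM[state][token]
    return output


tokens = constrained_greedy_decode()
result = "".join(t for t in tokens if t != "<eos>")
parsed = json.loads(result)  # the constrained output is always parseable
print(result, parsed)
```

Without the mask, the stand-in model would pick "hello" at every step. With it, the weights are untouched and the model's own preferences still decide among the valid tokens (here, "7" over "4"), which is why no fine-tuning is involved.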
