Bibliographic Details
Main Authors: Liang, Qingyuan; Zhang, Zhao; Sun, Zeyu; Lin, Zheng; Luo, Qi; Xiao, Yueyi; Chen, Yizhou; Zhang, Yuqun; Zhang, Haotian; Zhang, Lu; Chen, Bin; Xiong, Yingfei
Format: Preprint
Published: 2025
Subjects: Programming Languages; Artificial Intelligence
Online Access: https://arxiv.org/abs/2503.05507
_version_ 1866912756544831488
author Liang, Qingyuan
Zhang, Zhao
Sun, Zeyu
Lin, Zheng
Luo, Qi
Xiao, Yueyi
Chen, Yizhou
Zhang, Yuqun
Zhang, Haotian
Zhang, Lu
Chen, Bin
Xiong, Yingfei
contents Grammar serves as a cornerstone in programming languages and software engineering, providing frameworks to define the syntactic space and program structure. Existing research demonstrates the effectiveness of grammar-based code representations in small-scale models, showing their ability to reduce syntax errors and enhance performance. However, as language models scale to the billion level or beyond, syntax-level errors become rare, making it unclear whether grammar information still provides performance benefits. To explore this, we develop a series of billion-scale GrammarCoder models, incorporating grammar rules in the code generation process. Experiments on HumanEval (+) and MBPP (+) demonstrate a notable improvement in code generation accuracy. Further analysis shows that grammar-based representations enhance LLMs' ability to discern subtle code differences, reducing semantic errors caused by minor variations. These findings suggest that grammar-based code representations remain valuable even in billion-scale models, not only by maintaining syntax correctness but also by improving semantic differentiation.
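As a rough illustration of what a grammar-based code representation can look like, the sketch below linearizes a program into a preorder sequence of "Parent -> children" rules. It is only an assumption-laden example: it uses the abstract grammar exposed by Python's standard `ast` module, and the helper name `rule_sequence` is hypothetical. It is not the rule set or encoding used by the GrammarCoder models described in the abstract.

```python
import ast


def rule_sequence(source: str) -> list:
    """Return a preorder list of 'Parent -> child child ...' rules.

    Illustrative only: node names come from Python's `ast` module,
    not from the grammar used by GrammarCoder.
    """
    rules = []

    def visit(node: ast.AST) -> None:
        children = list(ast.iter_child_nodes(node))
        if children:
            # One rule per expanded node: parent type, then child types.
            rhs = " ".join(type(child).__name__ for child in children)
            rules.append(f"{type(node).__name__} -> {rhs}")
        for child in children:
            visit(child)

    visit(ast.parse(source))
    return rules


if __name__ == "__main__":
    for snippet in ("x = a + b", "x = a - b"):
        print(snippet)
        for rule in rule_sequence(snippet):
            print("   ", rule)
```

In this sketch, two snippets that differ only in one operator produce rule sequences that differ in exactly one rule (the expansion of `BinOp`), which loosely mirrors the abstract's point that a grammar-level view can make small code differences explicit.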
format Preprint
id arxiv_https___arxiv_org_abs_2503_05507
institution arXiv
publishDate 2025
record_format arxiv
title Grammar-Based Code Representation: Is It a Worthy Pursuit for LLMs?
topic Programming Languages
Artificial Intelligence
url https://arxiv.org/abs/2503.05507