Bibliographic Details
Main Authors: Xian, Zixiang, Cui, Chenhui, Huang, Rubing, Fang, Chunrong, Chen, Zhenyu
Format: Preprint
Published: 2024
Subjects: Software Engineering; Artificial Intelligence
Online Access: https://arxiv.org/abs/2409.14644
author Xian, Zixiang
Cui, Chenhui
Huang, Rubing
Fang, Chunrong
Chen, Zhenyu
contents The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code clustering. However, existing methods for source code embedding, including those based on LLMs, often rely on costly supervised training or fine-tuning for domain adaptation. This paper proposes a novel approach to embedding source code by combining large language and sentence embedding models. This approach attempts to eliminate the need for task-specific training or fine-tuning and to effectively address the issue of erroneous information commonly found in LLM-generated outputs. To evaluate the performance of our proposed approach, we conducted a series of experiments on three datasets with different programming languages by considering various LLMs and sentence embedding models. The experimental results have demonstrated the effectiveness and superiority of our approach over the state-of-the-art unsupervised approaches, such as SourcererCC, Code2vec, InferCode, TransformCode, and LLM2Vec. Our findings highlight the potential of our approach to advance the field of SE by providing robust and efficient solutions for source code embedding tasks.
format Preprint
id arxiv_https___arxiv_org_abs_2409_14644
institution arXiv
publishDate 2024
record_format arxiv
title An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models
topic Software Engineering
Artificial Intelligence
url https://arxiv.org/abs/2409.14644