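Staff View: T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning :: Library Catalog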

Bibliographic Details
Main Authors:	Ethiraj, Vignesh; David, Ashwath; Menon, Sidhanth; Vijay, Divya; Kannan, Vidhyakshaya
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence; 68T50
Online Access:	https://arxiv.org/abs/2504.16460

_version_	1866908584703426560
author	Ethiraj, Vignesh; David, Ashwath; Menon, Sidhanth; Vijay, Divya; Kannan, Vidhyakshaya
author_facet	Ethiraj, Vignesh; David, Ashwath; Menon, Sidhanth; Vijay, Divya; Kannan, Vidhyakshaya
contents	The specialized vocabulary and nuanced concepts of the telecommunications industry pose persistent challenges for standard Natural Language Processing (NLP) models. Generic embedding models often struggle to represent telecom-specific semantics, limiting their utility in retrieval and downstream tasks. We present T-VEC (Telecom Vectorization Model), a domain-adapted embedding model fine-tuned from the gte-Qwen2-1.5B-instruct backbone using a triplet loss objective. Fine-tuning was performed on T-Embed, a high-quality, large-scale dataset covering diverse telecom concepts, standards, and operational scenarios. Although T-Embed contains some proprietary material and cannot be fully released, we open source 75% of the dataset to support continued research in domain-specific representation learning. On a custom benchmark comprising 1500 query-passage pairs from IETF RFCs and vendor manuals, T-VEC surpasses MPNet, BGE, Jina and E5, demonstrating superior domain grounding and semantic precision in telecom-specific retrieval. Embedding visualizations further showcase tight clustering of telecom-relevant concepts. We release T-VEC and its tokenizer to support semantically faithful NLP applications within the telecom domain.
format	Preprint
id	arxiv_https___arxiv_org_abs_2504_16460
institution	arXiv
publishDate	2025
record_format	arxiv
title	T-VEC: A Telecom-Specific Vectorization Model with Enhanced Semantic Understanding via Deep Triplet Loss Fine-Tuning
topic	Computation and Language; Artificial Intelligence; 68T50
url	https://arxiv.org/abs/2504.16460

Similar Items