| Main Authors: | Karapiperis, Dimitrios; Akritidis, Leonidas; Bozanis, Panayiotis; Verykios, Vassilios |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Databases |
| Online Access: | https://arxiv.org/abs/2601.20664 |
| _version_ | 1866918311535575040 |
|---|---|
| author | Karapiperis, Dimitrios; Akritidis, Leonidas; Bozanis, Panayiotis; Verykios, Vassilios |
| author_facet | Karapiperis, Dimitrios; Akritidis, Leonidas; Bozanis, Panayiotis; Verykios, Vassilios |
| contents | Entity Resolution (ER) is a critical task for data integration, yet state-of-the-art supervised deep learning models remain impractical for many real-world applications due to their need for massive, expensive-to-obtain labeled datasets. While Active Learning (AL) offers a potential solution to this "label scarcity" problem, existing approaches introduce severe scalability bottlenecks. Specifically, they achieve high accuracy but incur prohibitive computational costs by re-training complex models from scratch or solving NP-hard selection problems in every iteration. In this paper, we propose ALER, a novel, semi-supervised pipeline designed to bridge the gap between semantic accuracy and computational scalability. ALER eliminates the training bottleneck by using a frozen bi-encoder architecture to generate static embeddings once and then iteratively training a lightweight classifier on top. To address the memory bottleneck associated with large-scale candidate pools, we first select a representative sample of the data and then use K-Means to partition this sample into semantically coherent chunks, enabling an efficient AL loop. We further propose a hybrid query strategy that combines "confused" and "confident" pairs to efficiently refine the decision boundary while correcting high-confidence errors. Extensive evaluation demonstrates ALER's superior efficiency, particularly on the large-scale DBLP dataset: it accelerates the training loop by 1.3x while drastically reducing resolution latency by a factor of 3.8 compared to the fastest baseline. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2601_20664 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | ALER: An Active Learning Hybrid System for Efficient Entity Resolution |
| topic | Databases |
| url | https://arxiv.org/abs/2601.20664 |
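
The abstract describes a hybrid query strategy that mixes "confused" and "confident" candidate pairs in each active-learning round. The record gives no implementation details, so the following is only a minimal sketch of one plausible reading: "confused" is taken to mean a predicted match probability close to 0.5, and "confident" one far from 0.5. The function name `select_hybrid_batch`, its parameters, the 0.5 boundary, and the sample probabilities are all assumptions made for illustration, not the paper's method.

```python
from typing import List, Sequence


def select_hybrid_batch(
    match_probs: Sequence[float],
    n_confused: int,
    n_confident: int,
) -> List[int]:
    """Pick indices of candidate pairs to send for labeling.

    match_probs[i] is a classifier's predicted match probability for pair i.
    The 0.5 boundary and the two batch sizes are illustrative assumptions.
    """
    # Distance from the assumed 0.5 decision boundary:
    # small margin = "confused" pair, large margin = "confident" pair.
    margin = [abs(p - 0.5) for p in match_probs]
    by_margin = sorted(range(len(match_probs)), key=lambda i: margin[i])

    # Most uncertain pairs first: these refine the decision boundary.
    confused = by_margin[:n_confused]

    # Most certain pairs next, skipping any already chosen: labeling these
    # can expose predictions that are confident but wrong.
    chosen = set(confused)
    confident = [i for i in reversed(by_margin) if i not in chosen][:n_confident]

    return confused + confident


# Hypothetical probabilities for six candidate record pairs.
probs = [0.02, 0.48, 0.55, 0.97, 0.30, 0.81]
batch = select_hybrid_batch(probs, n_confused=2, n_confident=2)
print(batch)  # [1, 2, 0, 3]
```

In this reading, pairs 1 and 2 sit nearest the boundary, while pairs 0 and 3 are the predictions the classifier is most sure of, so a labeling pass over them would reveal any high-confidence mistakes.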