Enregistré dans:
| Auteurs principaux: | , , |
|---|---|
| Format: | Preprint |
| Publié: |
2024
|
| Sujets: | |
| Accès en ligne: | https://arxiv.org/abs/2412.14533 |
| Tags: |
Ajouter un tag
Pas de tags, Soyez le premier à ajouter un tag!
|
| _version_ | 1866911008052740096 |
|---|---|
| author | Chouhan, Ashish Mandour, Saifeldin Gertz, Michael |
| author_facet | Chouhan, Ashish Mandour, Saifeldin Gertz, Michael |
| contents | Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to easily inspect corpus-wide trends and relationships. We present ClusterChat (The demo video and source code are available at: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2412_14533 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | ClusterChat: Multi-Feature Search for Corpus Exploration Chouhan, Ashish Mandour, Saifeldin Gertz, Michael Computation and Language Exploring large-scale text corpora presents a significant challenge in biomedical, finance, and legal domains, where vast amounts of documents are continuously published. Traditional search methods, such as keyword-based search, often retrieve documents in isolation, limiting the user's ability to easily inspect corpus-wide trends and relationships. We present ClusterChat (The demo video and source code are available at: https://github.com/achouhan93/ClusterChat), an open-source system for corpus exploration that integrates cluster-based organization of documents using textual embeddings with lexical and semantic search, timeline-driven exploration, and corpus and document-level question answering (QA) as multi-feature search capabilities. We validate the system with two case studies on a four million abstract PubMed dataset, demonstrating that ClusterChat enhances corpus exploration by delivering context-aware insights while maintaining scalability and responsiveness on large-scale document collections. |
| title | ClusterChat: Multi-Feature Search for Corpus Exploration |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2412.14533 |