| Main Author: | Xu, Binbin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.09701 |
Similar Items
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
by: Penedo, Guilherme, et al.
Published: (2024)
Text2Freq: Learning Series Patterns from Text via Frequency Domain
by: Lo, Ming-Chih, et al.
Published: (2024)
FreqMark: Frequency-Based Watermark for Sentence-Level Detection of LLM-Generated Text
by: Xu, Zhenyu, et al.
Published: (2024)
FreqKV: Key-Value Compression in Frequency Domain for Context Window Extension
by: Kai, Jushi, et al.
Published: (2025)
UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset
by: Wang, Haoyu, et al.
Published: (2024)
WebQAmGaze: A Multilingual Webcam Eye-Tracking-While-Reading Dataset
by: Ribeiro, Tiago, et al.
Published: (2023)
FineWeb-zhtw: Scalable Curation of Traditional Chinese Text Data from the Web
by: Lin, Cheng-Wei, et al.
Published: (2024)
Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages
by: Ma, Chunlan, et al.
Published: (2023)
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
by: Pandey, Prabhat, et al.
Published: (2025)
Multilingual Text Style Transfer: Datasets & Models for Indian Languages
by: Mukherjee, Sourabrata, et al.
Published: (2024)
Efficiently Identifying Low-Quality Language Subsets in Multilingual Datasets: A Case Study on a Large-Scale Multilingual Audio Dataset
by: Samir, Farhan, et al.
Published: (2024)
Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLMs
by: Chen, Haoyang, et al.
Published: (2025)
M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets
by: Zhao, Chunguang, et al.
Published: (2025)
Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models
by: Atuhurra, Jesse, et al.
Published: (2024)
Increasing the Robustness of the Fine-tuned Multilingual Machine-Generated Text Detectors
by: Macko, Dominik, et al.
Published: (2025)
WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense Retrieval
by: Dinzinger, Michael, et al.
Published: (2025)
Test-Time Scaling with Repeated Sampling Improves Multilingual Text Generation
by: Gupta, Ashim, et al.
Published: (2025)
CHATTER: A Character Attribution Dataset for Narrative Understanding
by: Baruah, Sabyasachee, et al.
Published: (2024)
Towards Global AI Inclusivity: A Large-Scale Multilingual Terminology Dataset (GIST)
by: Liu, Jiarui, et al.
Published: (2024)
HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
by: Abdaljalil, Samir, et al.
Published: (2025)
Visual Analytics for Fine-grained Text Classification Models and Datasets
by: Battogtokh, Munkhtulga, et al.
Published: (2024)
Advocating Character Error Rate for Multilingual ASR Evaluation
by: K, Thennal D, et al.
Published: (2024)
Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
by: Senaratna, Nuwan I.
Published: (2025)
MARCA: A Checklist-Based Benchmark for Multilingual Web Search
by: Almeida, Thales Sales, et al.
Published: (2026)
MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering
by: Zhang, Xuanliang, et al.
Published: (2025)
When Scale Meets Diversity: Evaluating Language Models on Fine-Grained Multilingual Claim Verification
by: Shcharbakova, Hanna, et al.
Published: (2025)
Renard: A Modular Pipeline for Extracting Character Networks from Narrative Texts
by: Amalvy, Arthur, et al.
Published: (2024)
AVerImaTeC: A Dataset for Automatic Verification of Image-Text Claims with Evidence from the Web
by: Cao, Rui, et al.
Published: (2025)
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
by: Luo, Hanjun, et al.
Published: (2024)
OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics
by: Chu, Wei, et al.
Published: (2025)
X-WebAgentBench: A Multilingual Interactive Web Benchmark for Evaluating Global Agentic System
by: Wang, Peng, et al.
Published: (2025)
Multilingual Attribute Extraction from News Web Pages
by: Bedrin, Pavel, et al.
Published: (2025)
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense Retrieval
by: Dinzinger, Michael, et al.
Published: (2026)
E-ARMOR: Edge case Assessment and Review of Multilingual Optical Character Recognition
by: Gupta, Aryan, et al.
Published: (2025)
Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection
by: Xiong, Feng, et al.
Published: (2024)
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
by: He, Haorui, et al.
Published: (2025)
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark
by: Macko, Dominik, et al.
Published: (2023)
Datasets for Multilingual Answer Sentence Selection
by: Gabburo, Matteo, et al.
Published: (2024)
ChineseWebText 2.0: Large-Scale High-quality Chinese Web Text with Multi-dimensional and fine-grained information
by: Zhang, Wanyue, et al.
Published: (2024)
Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments
by: Dobler, Konstantin, et al.
Published: (2026)