Bibliographic details
Main author: Mahamat Saleh, Samir Adam Annour
Format: Digital resource
Language: English
Published: Zenodo 2026
Online access: https://doi.org/10.5281/zenodo.20392941
Table of contents:
  • Abstract

    Venture-capital screening is an imbalanced rare-event prediction problem in which only a small fraction of founders produce extreme outcomes, while false positives waste analyst attention and capital. This study examines the calibration failure modes of local large language models on VCBench, a benchmark for predicting founder success from anonymized pre-founding profiles. We evaluate local Ollama inference with two no-thinking Qwen variants, qwen3:32b and Qwen3-30B-A3B-GGUF:Q4_K_M, on stratified 120-profile validation subsets, and compare these results with trivial baselines and a logistic-regression TF-IDF baseline evaluated on both the 120-profile subset and the full 900-profile public validation split.

    The main finding is that prompt engineering shifts the predicted-positive rate rather than reliably improving discrimination. Few-Shot prompting increases recall mainly by predicting many more positives, while Vanilla prompting is more conservative but still has wide confidence intervals. A simple LR-TF-IDF classifier achieves stronger F0.5 performance on the available validation data than the tested local LLM prompting configurations. These results motivate a reporting standard for rare-event LLM benchmarks: predicted-positive rate, trivial baselines, precision-recall analysis, confusion-matrix counts, and confidence intervals should be reported alongside any Fβ score.
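The reporting standard proposed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (`f_beta`, `wilson_interval`, `report`), the choice of a Wilson score interval for the confidence intervals, and the confusion-matrix counts in the example call are all assumptions made for illustration. Given the four confusion-matrix counts, it derives the predicted-positive rate, precision and recall with intervals, the Fβ score, and the Fβ of the two trivial baselines (always predict positive, always predict negative) on the same label distribution.

```python
from math import sqrt


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from confusion-matrix counts; returns 0.0 when undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (about 95% for z=1.96)."""
    if n == 0:
        return (0.0, 1.0)  # no observations: the proportion is unconstrained
    p = successes / n
    scale = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / scale
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / scale
    return (max(0.0, centre - half), min(1.0, centre + half))


def report(tp, fp, fn, tn, beta=0.5):
    """Collect the quantities the abstract asks to see next to an F-beta score."""
    n = tp + fp + fn + tn
    predicted_pos = tp + fp
    actual_pos = tp + fn
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "predicted_positive_rate": predicted_pos / n,
        "base_rate": actual_pos / n,
        "precision": tp / predicted_pos if predicted_pos else 0.0,
        "precision_ci": wilson_interval(tp, predicted_pos),
        "recall": tp / actual_pos if actual_pos else 0.0,
        "recall_ci": wilson_interval(tp, actual_pos),
        "f_beta": f_beta(tp, fp, fn, beta),
        # Trivial baselines evaluated on the same labels.
        "f_beta_always_positive": f_beta(actual_pos, n - actual_pos, 0, beta),
        "f_beta_always_negative": f_beta(0, 0, actual_pos, beta),
    }


# Made-up counts for a 120-profile subset; NOT results from the study.
example = report(tp=4, fp=16, fn=7, tn=93)
for key, value in example.items():
    print(key, value)
```

Reading the Fβ score next to the predicted-positive rate and the always-positive baseline makes it visible when a prompting change has only moved the decision threshold, which is the failure mode the abstract describes. With few predicted positives, the precision interval is wide, so small Fβ differences between configurations should be read with caution.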