| Main author: | |
|---|---|
| Format: | Digital resource |
| Language: | English |
| Published: | Zenodo, 2026 |
| Subjects: | |
| Online access: | https://doi.org/10.5281/zenodo.20392941 |
Table of contents:
- **Abstract**

  Venture-capital screening is an imbalanced rare-event prediction problem in which only a small fraction of founders produce extreme outcomes, while false positives waste analyst attention and capital. This study examines the calibration failure modes of local large language models on VCBench, a benchmark for predicting founder success from anonymized pre-founding profiles. We evaluate local Ollama inference with two no-thinking Qwen variants, qwen3:32b and Qwen3-30B-A3B-GGUF:Q4_K_M, on stratified 120-profile validation subsets, and compare these results with trivial baselines and a logistic-regression TF-IDF baseline evaluated on both the 120-profile subset and the full 900-profile public validation split.

  The main finding is that prompt engineering shifts the predicted-positive rate rather than reliably improving discrimination. Few-Shot prompting increases recall mainly by predicting many more positives, while Vanilla prompting is more conservative but still has wide confidence intervals. A simple LR-TF-IDF classifier achieves stronger F0.5 performance on the available validation data than the tested local LLM prompting configurations. These results motivate a reporting standard for rare-event LLM benchmarks: predicted-positive rate, trivial baselines, precision-recall analysis, confusion-matrix counts, and confidence intervals should be reported alongside any Fβ score.
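
The reporting items listed in the abstract can be computed from binary labels and predictions alone. The following is a minimal sketch, not the study's own evaluation code: the function names (`f_beta`, `wilson_interval`, `rare_event_report`) are hypothetical, the Wilson score interval is one possible choice of confidence interval (the abstract does not say which method the authors use), and the labels in the usage example are made up for illustration.

```python
import math


def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from confusion-matrix counts; returns 0.0 when undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (about 95% when z=1.96).

    Assumption: the source does not name its interval method.
    """
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    scale = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / scale
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / scale
    return (max(0.0, centre - half), min(1.0, centre + half))


def rare_event_report(y_true, y_pred, beta=0.5):
    """Collect the quantities the abstract asks to report next to F-beta."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    n = len(pairs)
    positives = tp + fn
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "predicted_positive_rate": (tp + fp) / n if n else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / positives if positives else 0.0,
        "f_beta": f_beta(tp, fp, fn, beta),
        "precision_ci": wilson_interval(tp, tp + fp),
        "recall_ci": wilson_interval(tp, positives),
        # Trivial baseline: label every profile positive (no false negatives).
        "all_positive_f_beta": f_beta(positives, n - positives, 0, beta),
    }


if __name__ == "__main__":
    # Made-up toy labels, not VCBench data.
    y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    for key, value in rare_event_report(y_true, y_pred).items():
        print(key, value)
```

Printing the predicted-positive rate and the all-positive baseline next to the Fβ score makes it visible when a prompt variant raises recall simply by predicting more positives, which is the failure mode the abstract describes.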