Bibliographic Details
Main Authors: Luo, Hongyin, Morgan, Nathaniel, Li, Tina, Zhao, Derek, Ngo, Ai Vy, Schroeder, Philip, Yang, Lijie, Ben-Kish, Assaf, O'Brien, Jack, Glass, James
Format: Preprint
Published: 2025
Subjects: Computation and Language
Online Access: https://arxiv.org/abs/2507.16784
author Luo, Hongyin
Morgan, Nathaniel
Li, Tina
Zhao, Derek
Ngo, Ai Vy
Schroeder, Philip
Yang, Lijie
Ben-Kish, Assaf
O'Brien, Jack
Glass, James
contents To break the context limits of large language models (LLMs) that bottleneck reasoning accuracy and efficiency, we propose the Thread Inference Model (TIM), a family of LLMs trained for recursive and decompositional problem solving, and TIMRUN, an inference runtime enabling long-horizon structured reasoning beyond context limits. Together, TIM hosted on TIMRUN supports virtually unlimited working memory and multi-hop tool calls within a single language model inference, overcoming output limits, positional-embedding constraints, and GPU-memory bottlenecks. This performance is achieved by modeling natural language as reasoning trees, measured by both length and depth, instead of as linear sequences. The reasoning trees consist of tasks with thoughts, recursive subtasks, and conclusions, based on the concept we proposed in Schroeder et al., 2025. During generation, we maintain a working memory that retains only the key-value states of the most relevant context tokens, selected by a rule-based subtask-pruning mechanism, which enables reuse of positional embeddings and GPU memory pages throughout reasoning. Experimental results show that our system sustains high inference throughput, even when manipulating up to 90% of the KV cache in GPU memory. It also delivers accurate reasoning on mathematical tasks and handles information retrieval challenges that require long-horizon reasoning and multi-hop tool use.
format Preprint
id arxiv_https___arxiv_org_abs_2507_16784
institution arXiv
publishDate 2025
record_format arxiv
title Beyond Context Limits: Subconscious Threads for Long-Horizon Reasoning
topic Computation and Language
url https://arxiv.org/abs/2507.16784
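
The abstract describes reasoning trees made of tasks with thoughts, recursive subtasks, and conclusions, plus a rule-based subtask-pruning step that keeps only the most relevant context in working memory. The following is a minimal illustrative sketch of that idea, not the authors' implementation. The names `Task`, `working_memory`, and `full_trace` are hypothetical, and the pruning rule used here (collapse a finished subtask to its conclusion) is an assumption, since the abstract does not state the actual rule. The sketch works on text spans rather than key-value states.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Task:
    """One node of a reasoning tree (hypothetical structure)."""
    thought: str
    subtasks: List["Task"] = field(default_factory=list)
    conclusion: Optional[str] = None  # None while the task is still open


def full_trace(task: Task) -> List[str]:
    """Return every span in generation order, with no pruning."""
    spans = [task.thought]
    for sub in task.subtasks:
        spans.extend(full_trace(sub))
    if task.conclusion is not None:
        spans.append(task.conclusion)
    return spans


def working_memory(task: Task, is_root: bool = True) -> List[str]:
    """Return the spans kept after pruning.

    Assumed rule (not taken from the paper): a finished non-root subtask
    is collapsed to its conclusion; an open task keeps its thought and
    recurses into its subtasks.
    """
    if task.conclusion is not None and not is_root:
        return [task.conclusion]
    kept = [task.thought]
    for sub in task.subtasks:
        kept.extend(working_memory(sub, is_root=False))
    if task.conclusion is not None:
        kept.append(task.conclusion)
    return kept


# Toy tree: the first subtask is finished, the second is still open.
root = Task(
    thought="Compute (2 + 3) * 4",
    subtasks=[
        Task(
            thought="First add 2 and 3",
            subtasks=[Task(thought="2 + 3", conclusion="5")],
            conclusion="sum is 5",
        ),
        Task(thought="Now multiply 5 by 4"),
    ],
)

print(len(full_trace(root)))   # spans generated so far
print(working_memory(root))    # spans retained under the assumed rule
```

In this toy tree the finished subtask contributes only its conclusion, so the retained context grows with the number of open tasks rather than with the total number of generated spans. A runtime that works on key-value states could, under the same assumption, free the memory pages and position indices of the dropped spans for reuse.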