Bibliographic Details
Main Authors: Saggese, Giacinto Paolo, Smith, Paul
Format: Preprint
Published: 2025
Subjects: Machine Learning, Artificial Intelligence
Online Access: https://arxiv.org/abs/2512.23977
_version_ 1866917175407673344
author Saggese, Giacinto Paolo
Smith, Paul
author_facet Saggese, Giacinto Paolo
Smith, Paul
contents We present DataFlow, a computational framework for building, testing, and deploying high-performance machine learning systems on unbounded time-series data. Traditional data science workflows assume finite datasets and require substantial reimplementation when moving from batch prototypes to streaming production systems. This gap introduces causality violations, batch boundary artifacts, and poor reproducibility of real-time failures. DataFlow resolves these issues through a unified execution model based on directed acyclic graphs (DAGs) with point-in-time idempotency: outputs at any time t depend only on a fixed-length context window preceding t. This guarantee ensures that models developed in batch mode execute identically in streaming production without code changes. The framework enforces strict causality by automatically tracking knowledge time across all transformations, eliminating future-peeking bugs. DataFlow supports flexible tiling across temporal and feature dimensions, allowing the same model to operate at different frequencies and memory profiles via configuration alone. It integrates natively with the Python data science stack and provides fit/predict semantics for online learning, caching and incremental computation, and automatic parallelization through DAG-based scheduling. We demonstrate its effectiveness across domains including financial trading, IoT, fraud detection, and real-time analytics.
format Preprint
id arxiv_https___arxiv_org_abs_2512_23977
institution arXiv
publishDate 2025
record_format arxiv
title Causify DataFlow: A Framework For High-performance Machine Learning Stream Computing
topic Machine Learning
Artificial Intelligence
url https://arxiv.org/abs/2512.23977
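
The abstract's central property, point-in-time idempotency, can be illustrated with a minimal sketch. This is not DataFlow's API: the names `WINDOW`, `window_mean`, `run_batch`, and `run_streaming` are hypothetical, the rolling mean stands in for an arbitrary model, and the window is assumed to end at (and include) the sample at time t. The sketch only shows why an output that depends on a fixed-length context window comes out the same whether the series is processed all at once or one sample at a time.

```python
from collections import deque
from typing import Iterable, List, Optional

WINDOW = 3  # hypothetical context-window length, in samples


def window_mean(context: List[float]) -> float:
    """Output at time t, computed only from the WINDOW samples ending at t."""
    return sum(context) / len(context)


def run_batch(series: List[float]) -> List[Optional[float]]:
    """Batch mode: the full series is in memory, but each output reads
    only the WINDOW samples ending at its own index (no future data)."""
    out: List[Optional[float]] = []
    for t in range(len(series)):
        if t + 1 < WINDOW:
            out.append(None)  # not enough history yet
        else:
            out.append(window_mean(series[t + 1 - WINDOW : t + 1]))
    return out


def run_streaming(stream: Iterable[float]) -> List[Optional[float]]:
    """Streaming mode: samples arrive one at a time and only a bounded
    buffer of the last WINDOW samples is retained."""
    buf: deque = deque(maxlen=WINDOW)
    out: List[Optional[float]] = []
    for x in stream:
        buf.append(x)
        out.append(window_mean(list(buf)) if len(buf) == WINDOW else None)
    return out


series = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
batch_out = run_batch(series)
stream_out = run_streaming(iter(series))
print(batch_out == stream_out)  # True
```

Because both functions pass the same three samples, in the same order, to `window_mean`, the two runs agree exactly rather than approximately. A transformation that read beyond the window, or ahead of t, would break this equality, which is the class of bug the abstract says the framework is designed to rule out.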