Bibliographic Details
Main Authors: Barrowclough, George, Andrecki, Marian, Shinner, James, Donghi, Daniele
Format: Preprint
Published: 2025
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2507.06021
author Barrowclough, George
Andrecki, Marian
Shinner, James
Donghi, Daniele
contents In production recommender systems, feature preprocessing must be faithfully replicated across training and inference environments. This often requires duplicating logic between offline and online environments, increasing engineering effort and introducing risks of dataset shift. We present Kamae, an open-source Python library that bridges this gap by translating PySpark preprocessing pipelines into equivalent Keras models. Kamae provides a suite of configurable Spark transformers and estimators, each mapped to a corresponding Keras layer, enabling consistent, end-to-end preprocessing across the ML lifecycle. The framework's utility is illustrated on real-world use cases, including the MovieLens dataset and Expedia's Learning-to-Rank pipelines. The code is available at https://github.com/ExpediaGroup/kamae.
format Preprint
id arxiv_https___arxiv_org_abs_2507_06021
institution arXiv
publishDate 2025
record_format arxiv
title Kamae: Bridging Spark and Keras for Seamless ML Preprocessing
topic Machine Learning
url https://arxiv.org/abs/2507.06021