Bibliographic Details
Main Authors: Mulc, Thomas; Steele, Jennifer L.
Format: Preprint
Published: 2024
Subjects: Information Retrieval; Machine Learning
Online Access: https://arxiv.org/abs/2407.00085
_version_ 1866910907807825920
author Mulc, Thomas
Steele, Jennifer L.
author_facet Mulc, Thomas
Steele, Jennifer L.
contents Millions of people turn to Google Search each day for information on things as diverse as new cars or flu symptoms. The terms that they enter contain valuable information on their daily intent and activities, but the information in these search terms has been difficult to fully leverage. User-defined categorical filters have been the most common way to shrink the dimensionality of search data to a tractable size for analysis and modeling. In this paper we present a new approach to reducing the dimensionality of search data while retaining much of the information in the individual terms without user-defined rules. Our contributions are twofold: 1) we introduce SLaM Compression, a way to quantify search terms using pre-trained language models and create a representation of search data that has low dimensionality, is memory efficient, and effectively acts as a summary of search, and 2) we present CoSMo, a Constrained Search Model for estimating real-world events using only search data. We demonstrate the efficacy of our contributions by estimating U.S. automobile sales and U.S. flu rates with high accuracy using only Google Search data.
format Preprint
id arxiv_https___arxiv_org_abs_2407_00085
institution arXiv
publishDate 2024
record_format arxiv
title Compressing Search with Language Models
topic Information Retrieval
Machine Learning
url https://arxiv.org/abs/2407.00085