| Main Authors: | Zhou, Xuhui, Kim, Hyunwoo, Brahman, Faeze, Jiang, Liwei, Zhu, Hao, Lu, Ximing, Xu, Frank, Lin, Bill Yuchen, Choi, Yejin, Mireshghallah, Niloofar, Bras, Ronan Le, Sap, Maarten |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.16427 |
Similar Items
Leftover Lunch: Advantage-based Offline Reinforcement Learning for Language Models
by: Baheti, Ashutosh, et al.
Published: (2023)
Multi-Attribute Constraint Satisfaction via Language Model Rewriting
by: Baheti, Ashutosh, et al.
Published: (2024)
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
by: Jiang, Liwei, et al.
Published: (2024)
Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory
by: Mireshghallah, Niloofar, et al.
Published: (2023)
From Dogwhistles to Bullhorns: Unveiling Coded Rhetoric with Language Models
by: Mendelsohn, Julia, et al.
Published: (2023)
WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild
by: Lin, Bill Yuchen, et al.
Published: (2024)
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
by: Jung, Jaehun, et al.
Published: (2024)
Information-Theoretic Distillation for Reference-less Summarization
by: Jung, Jaehun, et al.
Published: (2024)
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
by: Su, Zhe, et al.
Published: (2024)
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
by: Jung, Jaehun, et al.
Published: (2023)
MacGyver: Are Large Language Models Creative Problem Solvers?
by: Tian, Yufei, et al.
Published: (2023)
Agent Lumos: Unified and Modular Training for Open-Source Language Agents
by: Yin, Da, et al.
Published: (2023)
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
by: Zhou, Xuhui, et al.
Published: (2024)
Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
by: Mireshghallah, Niloofar, et al.
Published: (2024)
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
by: Lin, Bill Yuchen, et al.
Published: (2025)
Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models
by: Ravichander, Abhilasha, et al.
Published: (2025)
AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text
by: Lu, Ximing, et al.
Published: (2024)
How to Train Your Fact Verifier: Knowledge Transfer with Multimodal Open Models
by: Lee, Jaeyoung, et al.
Published: (2024)
ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data
by: Chen, Tong, et al.
Published: (2025)
SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs
by: Gu, Yuling, et al.
Published: (2024)
GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses
by: Mun, Jimin, et al.
Published: (2026)
Social World Models
by: Zhou, Xuhui, et al.
Published: (2025)
What Makes it Ok to Set a Fire? Iterative Self-distillation of Contexts and Rationales for Disambiguating Defeasible Social and Moral Situations
by: Rao, Kavel, et al.
Published: (2023)
RESTOR: Knowledge Recovery in Machine Unlearning
by: Rezaei, Keivan, et al.
Published: (2024)
Spectrum Tuning: Post-Training for Distributional Coverage and In-Context Steerability
by: Sorensen, Taylor, et al.
Published: (2025)
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
by: Han, Seungju, et al.
Published: (2024)
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
by: Kassem, Aly M., et al.
Published: (2024)
PPMI: Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases
by: Bae, Yubeen, et al.
Published: (2025)
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
by: Zheng, Mingqian, et al.
Published: (2025)
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
by: Sorensen, Taylor, et al.
Published: (2023)
In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
by: Li, Huihan, et al.
Published: (2023)
A Roadmap to Pluralistic Alignment
by: Sorensen, Taylor, et al.
Published: (2024)
Position: Privacy Is Not Just Memorization!
by: Mireshghallah, Niloofar, et al.
Published: (2025)
Synthetic Data Can Mislead Evaluations: Membership Inference as Machine Text Detection
by: Naseh, Ali, et al.
Published: (2025)
Ambig-SWE: Interactive Agents to Overcome Underspecificity in Software Engineering
by: Vijayvargiya, Sanidhya, et al.
Published: (2025)
A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
by: Xin, Rui, et al.
Published: (2025)
The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage
by: Hallinan, Skyler, et al.
Published: (2025)
SafetyAnalyst: Interpretable, Transparent, and Steerable Safety Moderation for AI Behavior
by: Li, Jing-Jing, et al.
Published: (2024)
A Call for Clarity in Beam Search: How It Works and When It Stops
by: Kasai, Jungo, et al.
Published: (2022)
Particip-AI: A Democratic Surveying Framework for Anticipating Future AI Use Cases, Harms and Benefits
by: Mun, Jimin, et al.
Published: (2024)