:: Library Catalog

Bibliographic Details
Main author:	Smirnov, Roman
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Artificial Intelligence
Online access:	https://arxiv.org/abs/2412.06846

Similar Items

Hypernym Bias: Unraveling Deep Classifier Training Dynamics through the Lens of Class Hierarchy
by: Malashin, Roman, et al.
Published: (2025)

Understanding and Preserving Safety in Fine-Tuned LLMs
by: Zhang, Jiawen, et al.
Published: (2026)

Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
by: Sturman, Olivia, et al.
Published: (2024)

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
by: Smirnov, Alexander
Published: (2026)

Diffusion Models without Classifier-free Guidance
by: Tang, Zhicong, et al.
Published: (2025)

Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
by: Cho, Dongkyu Derek, et al.
Published: (2025)

LongSafety: Enhance Safety for Long-Context LLMs
by: Huang, Mianqiu, et al.
Published: (2024)

EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
by: Tang, Hanlin, et al.
Published: (2024)

Beyond Retention: Orchestrating Structural Safety and Plasticity in Continual Learning for LLMs
by: Meng, Fei
Published: (2026)

Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets
by: Cohen, Khen, et al.
Published: (2024)

Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
by: Pan, Licheng, et al.
Published: (2025)

Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
by: Zhang, Jiawen, et al.
Published: (2026)

One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
by: Dunefsky, Jacob, et al.
Published: (2025)

Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)

SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)

On Continuity of Robust and Accurate Classifiers
by: Barati, Ramin, et al.
Published: (2023)

Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)

Confidence Calibration of Classifiers with Many Classes
by: LeCoz, Adrien, et al.
Published: (2024)

Interpretable and Fair Mechanisms for Abstaining Classifiers
by: Lenders, Daphne, et al.
Published: (2025)

Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification
by: Pecher, Branislav, et al.
Published: (2026)

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
by: Yang, Xikang, et al.
Published: (2025)

Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
by: Taraghi, Mina, et al.
Published: (2025)

Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
by: Ao, Shuang, et al.
Published: (2025)

A Practical Approach to using Supervised Machine Learning Models to Classify Aviation Safety Occurrences
by: Siow, Bryan Y.
Published: (2025)

Fixed Random Classifier Rearrangement for Continual Learning
by: Huang, Shengyang, et al.
Published: (2024)

Generating Universal Adversarial Perturbations for Quantum Classifiers
by: Anil, Gautham, et al.
Published: (2024)

Deep Classifier Mimicry without Data Access
by: Braun, Steven, et al.
Published: (2023)

Fairness of Classifiers in the Presence of Constraints between Features
by: Cooper, Martin C., et al.
Published: (2026)

Simple and Effective Specialized Representations for Fair Classifiers
by: Sinigaglia, Alberto, et al.
Published: (2025)

Understanding Prediction Discrepancies in Machine Learning Classifiers
by: Renard, Xavier, et al.
Published: (2021)

Bayesian Inference for Correlated Human Experts and Classifiers
by: Kelly, Markelle, et al.
Published: (2025)

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)

Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
by: Bassett, Bruce A., et al.
Published: (2026)

Improving Continual Learning Performance and Efficiency with Auxiliary Classifiers
by: Szatkowski, Filip, et al.
Published: (2024)

Demystifying the Optimal Fair Classifier in Multi-Class Classification
by: Zhang, Li, et al.
Published: (2026)

On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers
by: Kängsepp, Markus, et al.
Published: (2022)

Efficient Controllable Diffusion via Optimal Classifier Guidance
by: Oertell, Owen, et al.
Published: (2025)

Be Wary of Your Time Series Preprocessing
by: Ennadir, Sofiane, et al.
Published: (2026)

Towards Unified Approaches in Self-Supervised Event Stream Modeling: Progress and Prospects
by: Zólyomi, Levente, et al.
Published: (2025)

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
by: Jan, Essa, et al.
Published: (2024)