| Main Author: | Smirnov, Roman |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.06846 |
Similar Items
Hypernym Bias: Unraveling Deep Classifier Training Dynamics through the Lens of Class Hierarchy
by: Malashin, Roman, et al.
Published: (2025)
Understanding and Preserving Safety in Fine-Tuned LLMs
by: Zhang, Jiawen, et al.
Published: (2026)
Debiasing Text Safety Classifiers through a Fairness-Aware Ensemble
by: Sturman, Olivia, et al.
Published: (2024)
Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction
by: Smirnov, Alexander
Published: (2026)
Diffusion Models without Classifier-free Guidance
by: Tang, Zhicong, et al.
Published: (2025)
Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs
by: Cho, Dongkyu Derek, et al.
Published: (2025)
LongSafety: Enhance Safety for Long-Context LLMs
by: Huang, Mianqiu, et al.
Published: (2024)
EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs
by: Tang, Hanlin, et al.
Published: (2024)
Beyond Retention: Orchestrating Structural Safety and Plasticity in Continual Learning for LLMs
by: Meng, Fei
Published: (2026)
Classifying Overlapping Gaussian Mixtures in High Dimensions: From Optimal Classifiers to Neural Nets
by: Cohen, Khen, et al.
Published: (2024)
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
by: Pan, Licheng, et al.
Published: (2025)
Safety at One Shot: Patching Fine-Tuned LLMs with A Single Instance
by: Zhang, Jiawen, et al.
Published: (2026)
One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs
by: Dunefsky, Jacob, et al.
Published: (2025)
Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
by: Zhang, Jiawei, et al.
Published: (2025)
SteeringSafety: A Systematic Safety Evaluation Framework of Representation Steering in LLMs
by: Siu, Vincent, et al.
Published: (2025)
On Continuity of Robust and Accurate Classifiers
by: Barati, Ramin, et al.
Published: (2023)
Efficient Safety Retrofitting Against Jailbreaking for LLMs
by: Garcia-Gasulla, Dario, et al.
Published: (2025)
Confidence Calibration of Classifiers with Many Classes
by: LeCoz, Adrien, et al.
Published: (2024)
Interpretable and Fair Mechanisms for Abstaining Classifiers
by: Lenders, Daphne, et al.
Published: (2025)
Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification
by: Pecher, Branislav, et al.
Published: (2026)
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
by: Yang, Xikang, et al.
Published: (2025)
Efficiency vs. Alignment: Investigating Safety and Fairness Risks in Parameter-Efficient Fine-Tuning of LLMs
by: Taraghi, Mina, et al.
Published: (2025)
Safe Pruning LoRA: Robust Distance-Guided Pruning for Safety Alignment in Adaptation of LLMs
by: Ao, Shuang, et al.
Published: (2025)
A Practical Approach to using Supervised Machine Learning Models to Classify Aviation Safety Occurrences
by: Siow, Bryan Y.
Published: (2025)
Fixed Random Classifier Rearrangement for Continual Learning
by: Huang, Shengyang, et al.
Published: (2024)
Generating Universal Adversarial Perturbations for Quantum Classifiers
by: Anil, Gautham, et al.
Published: (2024)
Deep Classifier Mimicry without Data Access
by: Braun, Steven, et al.
Published: (2023)
Fairness of Classifiers in the Presence of Constraints between Features
by: Cooper, Martin C., et al.
Published: (2026)
Simple and Effective Specialized Representations for Fair Classifiers
by: Sinigaglia, Alberto, et al.
Published: (2025)
Understanding Prediction Discrepancies in Machine Learning Classifiers
by: Renard, Xavier, et al.
Published: (2021)
Bayesian Inference for Correlated Human Experts and Classifiers
by: Kelly, Markelle, et al.
Published: (2025)
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
by: Andriushchenko, Maksym, et al.
Published: (2024)
Evaluating Multimodal LLMs for Inpatient Diagnosis: Real-World Performance, Safety, and Cost Across Ten Frontier Models
by: Bassett, Bruce A., et al.
Published: (2026)
Improving Continual Learning Performance and Efficiency with Auxiliary Classifiers
by: Szatkowski, Filip, et al.
Published: (2024)
Demystifying the Optimal Fair Classifier in Multi-Class Classification
by: Zhang, Li, et al.
Published: (2026)
On the Usefulness of the Fit-on-the-Test View on Evaluating Calibration of Classifiers
by: Kängsepp, Markus, et al.
Published: (2022)
Efficient Controllable Diffusion via Optimal Classifier Guidance
by: Oertell, Owen, et al.
Published: (2025)
Be Wary of Your Time Series Preprocessing
by: Ennadir, Sofiane, et al.
Published: (2026)
Towards Unified Approaches in Self-Supervised Event Stream Modeling: Progress and Prospects
by: Zólyomi, Levente, et al.
Published: (2025)
Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
by: Jan, Essa, et al.
Published: (2024)