| Main Authors: | Zalmanovici, Marcel; Raz, Orna; Farchi, Eitan; Freund, Iftach |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.19772 |
Similar Items
PACIFIC: a framework for generating benchmarks to check Precise Automatically Checked Instruction Following In Code
by: Dreyfuss, Itay, et al.
Published: (2025)
Using Combinatorial Optimization to Design a High quality LLM Solution
by: Ackerman, Samuel, et al.
Published: (2024)
Evaluating perturbation robustness of generative systems that use COBOL code inputs
by: Ackerman, Samuel, et al.
Published: (2025)
How Safe is Your Safety Metric? Automatic Concatenation Tests for Metric Reliability
by: Fandina, Ora Nova, et al.
Published: (2024)
Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
by: Farchi, Eitan, et al.
Published: (2024)
Exploring Straightforward Conversational Red-Teaming
by: Kour, George, et al.
Published: (2024)
Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
by: Fandina, Ora Nova, et al.
Published: (2025)
Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
by: Fandina, Ora Nova, et al.
Published: (2025)
Uncovering Code Insights: Leveraging GitHub Artifacts for Deeper Code Understanding
by: Nevo, Ziv, et al.
Published: (2025)
Statistical multi-metric evaluation and visualization of LLM system predictive performance
by: Ackerman, Samuel, et al.
Published: (2025)
Enhancing Formal Software Specification with Artificial Intelligence
by: Nassar, Antonio Abu, et al.
Published: (2026)
An Agent-Based Framework for the Automatic Validation of Mathematical Optimization Models
by: Zadorojniy, Alexander, et al.
Published: (2025)
Technique to Baseline QE Artefact Generation Aligned to Quality Metrics
by: Farchi, Eitan, et al.
Published: (2025)
A Practical Approach to Combinatorial Test Design
by: Farchi, Eitan, et al.
Published: (2024)
Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
by: Fandina, Ora Nova, et al.
Published: (2025)
Alignment Studio: Aligning Large Language Models to Particular Contextual Regulations
by: Achintalwar, Swapnaja, et al.
Published: (2024)
LaajMeter: A Framework for LaaJ Evaluation
by: Ackerman, Samuel, et al.
Published: (2025)
TransformLLM: Adapting Large Language Models via LLM-Transformed Reading Comprehension Text
by: Arbel, Iftach, et al.
Published: (2024)
Unseen Horizons: Unveiling the Real Capability of LLM Code Generation Beyond the Familiar
by: Zhang, Yuanliang, et al.
Published: (2024)
Effective LLM-Driven Code Generation with Pythoness
by: Levin, Kyla H., et al.
Published: (2025)
CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models
by: Wagner, Eitan, et al.
Published: (2024)
To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning
by: Lazić, Nevena, et al.
Published: (2026)
Effective Technical Reviews
by: Ballentine, Scott, et al.
Published: (2024)
Quality Engineering for Agile and DevOps on the Cloud and Edge
by: Farchi, Eitan, et al.
Published: (2023)
Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning
by: Fu, Jia, et al.
Published: (2025)
Instruction Diversity Drives Generalization To Unseen Tasks
by: Zhang, Dylan, et al.
Published: (2024)
Test-Driven Development for Code Generation
by: Mathews, Noble Saji, et al.
Published: (2024)
S*: Test Time Scaling for Code Generation
by: Li, Dacheng, et al.
Published: (2025)
Generalized Coverage Criteria for Combinatorial Sequence Testing
by: Elyasaf, Achiya, et al.
Published: (2022)
Budget Allocation Policies for Real-Time Multi-Agent Path Finding
by: Beck, Raz, et al.
Published: (2025)
Toward Generalizing Visual Brain Decoding to Unseen Subjects
by: Kong, Xiangtao, et al.
Published: (2024)
Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
by: Shelat, Shlok, et al.
Published: (2026)
Deep DNA Storage: Scalable and Robust DNA Storage via Coding Theory and Deep Learning
by: Bar-Lev, Daniella, et al.
Published: (2021)
Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
by: Kim, Jinju, et al.
Published: (2026)
Backdoors in Conditional Diffusion: Threats to Responsible Synthetic Data Pipelines
by: Lapid, Raz, et al.
Published: (2025)
A Deep Inverse-Mapping Model for a Flapping Robotic Wing
by: Sharvit, Hadar, et al.
Published: (2025)
Revisit Self-Debugging with Self-Generated Tests for Code Generation
by: Chen, Xiancai, et al.
Published: (2025)
Express Your Doubts -- Probabilistic World Modeling Should not be Based on Token logprobs
by: Wagner, Eitan, et al.
Published: (2025)
Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
by: Cui, Yi
Published: (2025)
Bias Testing and Mitigation in LLM-based Code Generation
by: Huang, Dong, et al.
Published: (2023)