A Case Study of Web App Coding with OpenAI Reasoning Models
Abstract
This paper presents a case study of coding tasks performed by OpenAI's latest reasoning models, i.e., o1-preview and o1-mini, in comparison with other frontier models. The o1 models deliver SOTA results on WebApp1K, a single-task benchmark. We therefore introduce WebApp1K-Duo, a harder benchmark that doubles the number of tasks and test cases. On the new benchmark, the performance of the o1 models declines significantly and falls behind Claude 3.5. Moreover, they consistently fail when confronted with atypical yet correct test cases, a trap that non-reasoning models occasionally avoid. We hypothesize that the performance variability is due to instruction comprehension. Specifically, the reasoning mechanism boosts performance when all expectations are captured, but exacerbates errors when key expectations are missed, an effect potentially influenced by input length. As such, we argue that the coding success of reasoning models hinges on a top-notch base model and SFT to ensure meticulous adherence to instructions.
Community
A blog post explaining the paper:
https://huggingface.co/blog/onekq/daily-software-engineering-work-reasoning-models
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Insights from Benchmarking Frontier Language Models on Web App Code Generation (2024)
- WebApp1K: A Practical Code-Generation Benchmark for Web App Development (2024)
- Can GPT-O1 Kill All Bugs? An Evaluation of GPT-Family LLMs on QuixBugs (2024)
- CRQBench: A Benchmark of Code Reasoning Questions (2024)
- MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs (2024)