Does Daily Software Engineering Work Need Reasoning Models?
Let the Eval Begin!
When I started working on benchmarking, my main goal was to see how LLMs can help regular software engineers. My focus is not the (equally important) things that help them get their jobs (LeetCode, Codeforces, etc.), but the actual work they do on their jobs. Since web development is a common denominator and also underrepresented in the benchmark world, WebApp1K was launched.
When OpenAI launched o1 with new SOTAs on all the intelligence-oriented benchmarks, my immediate question was: can the new models, regardless of how smart they are, help an average developer make ends meet at the end of the day? Well, let's run the benchmark and find out!
The result is a mixture of excitement and disappointment. I'll use two examples to illustrate both. More details can be found in the paper.
Backend Validation or Frontend Validation?
Let's start with the excitement. o1-preview lifts the SOTA by 7 points (leaderboard), which is super impressive. Out of curiosity, I looked at the problems solved only by o1. If a problem defeated every other frontier model, there must be a reason for that. Now let's look at this test case.
test('shows error when submitting a ticket with missing fields', async () => {
  fetchMock.post('/api/tickets', { status: 400 });
  ...
  fireEvent.click(screen.getByText('Submit'));
  ...
  expect(fetchMock.calls('/api/tickets').length).toBe(1);
  expect(screen.getByText('Title is required')).toBeInTheDocument();
}, 10000);
It is just a simple validation, right? Why is it so hard that it fails all models (until o1)? Take a look at the diagrams below.
It turns out there are two ways to run validation: frontend and backend. Both make sense, but only one will pass the test, which is the backend validation. Why? Because the first expectation states clearly that the API must be called exactly once!
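To make the difference concrete, here is a minimal sketch in plain JavaScript rather than React. It assumes a hypothetical, synchronous stand-in for the mocked endpoint (`createFakeApi`) and two hypothetical submit handlers. None of these names come from the benchmark, and the real solutions are asynchronous React components. The only point is to count how many times the API is hit when the title is missing.

```javascript
// Hypothetical stand-in for the mocked endpoint: always answers 400 and records each call.
function createFakeApi() {
  const calls = [];
  return {
    calls,
    post(url, body) {
      calls.push({ url, body });
      return { status: 400 };
    },
  };
}

// Frontend validation: reject the empty title before any request is sent.
function submitWithFrontendValidation(api, ticket) {
  if (!ticket.title) {
    return 'Title is required';
  }
  const response = api.post('/api/tickets', ticket);
  return response.status === 200 ? null : 'Submission failed';
}

// Backend validation: always send the request and derive the error from the response.
function submitWithBackendValidation(api, ticket) {
  const response = api.post('/api/tickets', ticket);
  return response.status === 400 ? 'Title is required' : null;
}

const frontendApi = createFakeApi();
const frontendError = submitWithFrontendValidation(frontendApi, { title: '' });

const backendApi = createFakeApi();
const backendError = submitWithBackendValidation(backendApi, { title: '' });

console.log(frontendError, frontendApi.calls.length); // Title is required 0
console.log(backendError, backendApi.calls.length);   // Title is required 1
```

Both handlers show the same message to the user, but only the backend variant leaves exactly one recorded call, which is what the first expectation in the test checks.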
Then why did all frontier models choose frontend validation? To begin with, frontend validation is indeed the best practice here. If your goal is just to check whether a required field is filled, you can get this done on the client side, with no need to call the API.
But this is not the key reason. In my evaluation prompt, passing tests is the primary and only goal. There is no mention of best practice or anything related (elegance, readability, etc.). Yet all models (but o1) still chose the wrong path.
I suspect the culprit is the string "Title is required". There must be tons of frontend validation code in the pretraining dataset, and most of it contains strings like "ABC is required". My speculation is that it is so easy for a model to activate a piece of knowledge like this that it overshadows the instruction (pass the tests).
Then how did o1 avoid this trap? Yes, the answer is reasoning and reflection. I used ChatGPT to reenact the reasoning process and shared it in the paper. It is quite a thrill to read through o1's thinking process and watch the model correct its own course (not always successfully) to find the right way.
One Module or Two Modules?
Now the disappointment. If the SOTA of your benchmark is higher than 90%, it's time to build a new one, and make it harder. I had a simple idea: combine two problems into one problem, and WebApp1K-Duo was born. Now the model has to write longer code to pass twice as many tests.
I almost fell on the floor when the results (for both o1-preview and o1-mini) came out: zero successes!
Okay, something must be wrong with the test files. I took another look and found the following.
import TaskA from './TaskA_B';
import TaskB from './TaskA_B';
test("Success at task A", async () =>
...
render(<MemoryRouter><TaskA /></MemoryRouter>);
...
, 10000);
...
test("Failure at task B", async () =>
...
render(<MemoryRouter><TaskB /></MemoryRouter>);
...
, 10000);
When merging two test files, I forgot to modify their module names. Basically, I intended for models to write one module to pass all tests, but gave them two module names. What a silly mistake!!
But after I unified the module names, I again discovered something unexpected: other models succeeded occasionally. If the tests are wrong, why did they pass? A little research cleared things up: it turns out a JavaScript default export is name-agnostic (official documentation). So the above test is syntactically valid, despite being super confusing.
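The name-agnostic behavior can be mimicked in a single file. The sketch below is a simulation, not real module loading: a plain object stands in for the namespace of a module with one default export, and destructuring stands in for the two `import` lines. The function name `TaskA_B` and the string it returns are made up for illustration.

```javascript
// Plain object standing in for the namespace of a module that has one default export.
// The function name TaskA_B is hypothetical, echoing the file name used in the test.
function TaskA_B() {
  return 'rendered by TaskA_B';
}
const moduleNamespace = { default: TaskA_B };

// `import TaskA from './TaskA_B'` is shorthand for
// `import { default as TaskA } from './TaskA_B'`,
// so the importer, not the module, picks the local name.
const { default: TaskA } = moduleNamespace;
const { default: TaskB } = moduleNamespace;

console.log(TaskA === TaskB); // true: both names are bound to the same export
console.log(TaskB.name);      // TaskA_B: the function keeps its own name
```

Seen this way, the two imports in the flawed test file resolve to the same component, so a single module with one default export can satisfy both halves of the test suite.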
When I browsed through their implementation code, I witnessed the struggles of all the models. They really tried different ways to make things work. The solution in the red box is the sole right answer, and also the least intuitive one: just write one module, although the hint is to write two. In fact, I am simply amazed that models can find this way at all under the circumstances. Claude 3.5 even has a 35% success rate!
Now the million-dollar question: why did o1 perform so poorly? What happened to its reasoning and reflection? I don't have a clear answer here, not as clear as in the first example. The mistake happened in the planning step (i.e. what do I need to do to pass the tests), which covers the largest scope and impacts all subsequent reasoning.
But I don't think this problem is incurable. The answer probably lies in Claude 3.5. What does a non-reasoning model do right to achieve a 35% success rate? I think the formula should work for reasoning models too.
Do We Need Reasoning Models for Practical Coding Tasks?
Now to the question raised in the title. My answer is a firm yes. If you look at the test cases in the examples (more in the paper), they are anything but elegant, coherent, or even reasonable. Why insist on a clumsy validation when the alternative is lightweight for the system and fast for users? The test refactor job is subpar to say the least.
But let me say that I'm glad my study uncovered such cases, because they reveal the real life of software engineers (actually, just a little bit of it). These are the realities they need to overcome or live with to get their jobs done: peculiar product specs, owning a legacy codebase, maintaining logic with a strong code smell, and the list goes on.
This should be an easy sell, because I am quite sure we all want LLMs to assist or even replace us on dirty jobs. This paper demonstrates that reasoning models handle such jobs with a transparency non-reasoning models can't deliver, and with great potential to surpass previous models in due course.
More reasoning models, including open source ones, will emerge in no time. I can't wait to study and evaluate them.