A new paper titled "STALL+: Boosting LLM-based Repository-level Code Completion with Static Analysis" shows the benefits of integrating static analysis with LLMs. (https://arxiv.org/abs/2406.10018)
The authors evaluate four key questions:
- How does each static analysis integration strategy perform in LLM-based repository-level code completion? > They found that integrating static analysis in the prompting phase (especially with file-level dependencies) achieves substantially larger improvements than integrating it in other phases.
- How do different combinations of integration strategies affect LLM-based repository-level code completion? > Languages that are easier to analyze, like Java, show larger improvements than dynamic languages like Python.
- How do static analysis integration strategies perform when compared or combined with RAG in LLM-based repository-level code completion? > Static analysis and RAG are complementary and boost the overall accuracy.
- What are the online costs of different integration strategies in LLM-based repository-level code completion? > Combining prompting-phase static analysis and RAG is the best option for cost-effectiveness.
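To make the prompting-phase idea concrete, here is a minimal sketch of what "file-level dependencies in the prompt" could look like. It is not the paper's implementation: the helper names (`imported_modules`, `signatures`, `build_prompt`) and the toy `repo` dictionary are hypothetical, and it uses only Python's standard `ast` module to find which repository files the unfinished code imports and to prepend their top-level signatures as comments.

```python
import ast


def imported_modules(prefix: str) -> list[str]:
    """Collect module names from single-line import statements in the prefix.

    The prefix is usually incomplete code, so each import line is parsed
    on its own. Multi-line (parenthesised) imports are skipped here.
    """
    modules: list[str] = []
    for line in prefix.splitlines():
        stripped = line.strip()
        if not stripped.startswith(("import ", "from ")):
            continue
        try:
            tree = ast.parse(stripped)
        except SyntaxError:
            continue
        for node in tree.body:
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                names = [node.module]
            else:
                names = []
            for name in names:
                if name not in modules:
                    modules.append(name)
    return modules


def signatures(source: str) -> list[str]:
    """Return top-level function and class signatures of a dependency file."""
    found = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(arg.arg for arg in node.args.args)
            found.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            found.append(f"class {node.name}")
    return found


def build_prompt(prefix: str, repo: dict[str, str]) -> str:
    """Prepend signatures of imported in-repo modules to the unfinished code.

    `repo` maps a module name to that module's source text (hypothetical layout).
    """
    context = []
    for module in imported_modules(prefix):
        if module in repo:
            context.append(f"# from {module}:")
            context.extend(f"#   {sig}" for sig in signatures(repo[module]))
    return "\n".join(context + [prefix])


# Toy repository with one dependency file and an unfinished function.
repo = {"utils": "def slugify(text, sep):\n    return text\n\nclass Cache:\n    pass\n"}
prefix = "from utils import slugify\n\ndef make_url(title):\n    return "
print(build_prompt(prefix, repo))
```

The resulting string would be sent to the LLM in place of the bare prefix, so the model sees the signatures of `slugify` and `Cache` before it completes `make_url`.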
In my @owasp App Sec keynote last year, I described how one can use static analysis augmented generation (SaAG) to boost the accuracy of LLM-based patches for vulnerability remediation. (You can see the talk here - https://www.youtube.com/watch?v=Cw4-ZnUNVLs)
The new Claude Sonnet 3.5 model from Anthropic AI has been getting good reviews since last night. It is quite good at coding-related tasks. We tried it on the Static Analysis Eval benchmark (patched-codes/static-analysis-eval), which measures the ability of an LLM to fix vulnerabilities. The model scores 59.21%, which is good but not better than other frontier models (like GPT-4, Gemini-1.5 and Llama-3).
WorkerSafetyQAEval: A new benchmark to evaluate worker safety domain question and answering
Happy to share a new question-answering benchmark for the worker safety domain. The benchmark and leaderboard are available at codelion/worker-safety-qa-eval
We evaluate popular generic chatbots like ChatGPT and HuggingChat on WorkerSafetyQAEval and compare them with a domain-specific RAG bot called Securade.ai Safety Copilot - codelion/safety-copilot. The results highlight the importance of domain-specific knowledge in critical domains like worker safety that require high accuracy. Securade.ai Safety Copilot achieves ~97% on the benchmark, setting a new SOTA.
After the announcements yesterday, I got a chance to try the new gemini-1.5-flash model from @goog1e. It is almost as good as gpt-4o on the StaticAnalysisEval (patched-codes/static-analysis-eval). It is also a bit faster than gpt-4o and much cheaper.
I did run into a recitation flag with one example in the dataset, where the API refused to fix the vulnerability and flagged the input as using copyrighted content. This is something you cannot turn off even through the safety filter settings, and it seems to be an existing bug: https://issuetracker.google.com/issues/331677495
But overall you get gpt-4o level performance for 7% of the price, so we are thinking of making it the default in patchwork - https://github.com/patched-codes/patchwork. You can use the google_api_key and model options to choose gemini-1.5-flash-latest when running patchwork.
You can use it to build patchflows - workflows that use LLMs for software development tasks like bug fixing, pull request review, library migration and documentation.
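As a rough illustration of passing those two options, the sketch below only assembles a command line and does not run it. The `google_api_key` and `model` option names and the `gemini-1.5-flash-latest` value come from the text above. The `AutoFix` patchflow name and the `key=value` argument style are assumptions about the patchwork CLI, so check the repository README before relying on them. `patchwork_command` is a hypothetical helper.

```python
def patchwork_command(patchflow: str, **options: str) -> list[str]:
    """Build an argv list of the form: patchwork <patchflow> key=value ...

    Assumption: the CLI accepts options as key=value pairs after the
    patchflow name. Verify against the patchwork README.
    """
    return ["patchwork", patchflow] + [f"{key}={value}" for key, value in options.items()]


cmd = patchwork_command(
    "AutoFix",  # assumed patchflow name, used here only for illustration
    google_api_key="<YOUR_GOOGLE_API_KEY>",  # placeholder, not a real key
    model="gemini-1.5-flash-latest",
)
print(" ".join(cmd))
# prints: patchwork AutoFix google_api_key=<YOUR_GOOGLE_API_KEY> model=gemini-1.5-flash-latest

# To actually run it (requires patchwork to be installed and a real key):
# import subprocess
# subprocess.run(cmd, check=True)
```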
We just released a new MoE model (meraGPT/mera-mix-4x7B) that is half as large as Mixtral-8x7B while still being competitive with it across different benchmarks. mera-mix-4x7B achieves 76.37 on the open LLM eval.