Python Is All You Need? Introducing Dria-Agent-α
Introduction
Currently, the main way for large language models (LLMs) to interact with tools, which goes by many names (tool use, function calling, etc.), is to give the LLM a specification of the tools it can use (including each tool's arguments) and have the LLM output a JSON object that names the tool(s) to use and the argument(s) to pass to them. This approach, while straightforward and reliable, limits the expressive capabilities of LLMs, which can express more complex reasoning and solutions through programming languages like Python. To this end, we define a framework for LLMs to use tools through Python called Pythonic Function Calling, which prompts the LLM to output actions in Python code.
The motivations behind using Python to interact with tools are the following:
- Reasoning in LLMs is primarily driven by procedural knowledge in pretraining, particularly code documents. [1]
- LLMs equipped with the ability to use Python perform better in agentic scenarios compared to those using JSON-based function calling. [2]
- Python is a very popular programming language [3], is likely abundant in the pretraining data of many LLMs, and reads close to natural language thanks to its pseudocode-like syntax.
Example
Let's start with a simple example. Take the user query: "Can you check if I have tomorrow 10:00-12:00 available and make an appointment for a meeting with my thesis supervisor if so? If you made the appointment, please add it to my reminders." and the available functions:
def check_availability(day: str, start_time: str, end_time: str) -> bool:
    """
    Check if a time slot is available on a given day.

    Args:
    - day: The day to check in YYYY-MM-DD format
    - start_time: Start time in HH:MM format
    - end_time: End time in HH:MM format

    Returns:
    - True if slot is available, False otherwise
    """
    pass

def make_appointment(day: str, start_time: str, end_time: str, title: str) -> dict:
    """
    Make an appointment for a given time slot.

    Args:
    - day: The day to make appointment in YYYY-MM-DD format
    - start_time: Start time in HH:MM format
    - end_time: End time in HH:MM format
    - title: The title of the appointment

    Returns:
    - A dictionary with the appointment details and if it's made or not.
      dict keys:
      - day (str): The day the appointment is on, in YYYY-MM-DD format
      - start_time (str): Start time in HH:MM format
      - end_time (str): End time in HH:MM format
      - appointment_made (bool): Whether the appointment is successfully made or not.
    """
    pass

def add_to_reminders(reminder_text: str) -> bool:
    """
    Add a text to reminders.

    Args:
    - reminder_text: The text to add to reminders

    Returns:
    - Whether the reminder was successfully created or not.
    """
    pass
With JSON-based function calling, this would take multiple chat turns to process, as it involves two conditionals (checking if the user has the given time slot available and checking if the user has successfully made an appointment with their thesis supervisor) which would require the LLM to receive the function call results and move forward with the next step given the results. In Pythonic function calling, this becomes a trivial task that the LLM can perform in a single chat turn, like so:
from datetime import datetime, timedelta

# Calculate tomorrow's date
tomorrow = (datetime.now() + timedelta(days=1)).strftime("%Y-%m-%d")

# Check availability for 10:00-12:00
available = check_availability(tomorrow, "10:00", "12:00")

# If available, make the appointment
if available:
    appointment_details = make_appointment(
        day=tomorrow,
        start_time="10:00",
        end_time="12:00",
        title="Meeting with thesis supervisor"
    )

    # Add appointment to reminders if it was successfully made
    if appointment_details["appointment_made"]:
        reminder_text = f"Meeting with thesis supervisor on {tomorrow} from 10:00-12:00"
        add_to_reminders(reminder_text)
Code Execution
Since the model generates Python code for function calling, we use exec-python to parse the model output and execute the code along with the functions. This enables straightforward integration of Dria-Agent-α's Pythonic function calling approach, handling the execution of the generated code in a safe and controlled manner.
The execution environment provides structured output that tracks function calls, variable states, and any errors that occur during execution. For example:
x = [1, 2]
y = [2, 3]
z = pair_sum(x, y)
k = pair_sum(z, z)
produces:
{
"function_results": {
"pair_sum": ["z","k"]
},
"variables": {
"x": [1,2],
"y": [2,3],
"z": [3,5],
"k": [6,10]
},
"errors": []
}
This structured output is particularly valuable for multi-turn agentic conversations, as it allows the model to maintain awareness of previous computations and their results, enabling more complex reasoning chains and state-dependent decision making in subsequent interactions.
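To make the shape of that report concrete, here is a minimal sketch of how such a tracker could be assembled. It is not the exec-python implementation: `run_tracked` and the body of `pair_sum` are hypothetical, and the sketch assumes tool results are assigned directly to plain variable names. Plain `exec` is also not a sandbox, so this shows only the bookkeeping, not the safety controls.

```python
import ast


def pair_sum(a, b):
    """Hypothetical tool: element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


def run_tracked(code, functions):
    """Run `code` with `functions` in scope and report calls, variables, errors."""
    report = {"function_results": {}, "variables": {}, "errors": []}
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        report["errors"].append(f"SyntaxError: {exc}")
        return report

    # Statically record which variables receive the result of each tool call.
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Assign)
            and isinstance(node.value, ast.Call)
            and isinstance(node.value.func, ast.Name)
            and node.value.func.id in functions
        ):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    report["function_results"].setdefault(
                        node.value.func.id, []
                    ).append(target.id)

    # Execute the code in a namespace that only exposes the provided tools.
    namespace = dict(functions)
    try:
        exec(compile(tree, "<model_output>", "exec"), namespace)
    except Exception as exc:
        report["errors"].append(f"{type(exc).__name__}: {exc}")

    # Everything the code defined, minus the tools and interpreter internals.
    report["variables"] = {
        name: value
        for name, value in namespace.items()
        if name not in functions and not name.startswith("__")
    }
    return report


code = "x = [1, 2]\ny = [2, 3]\nz = pair_sum(x, y)\nk = pair_sum(z, z)"
report = run_tracked(code, {"pair_sum": pair_sum})
print(report)
```

Separating the static pass (which variables hold tool results) from the execution pass (what values they ended up with) keeps the report useful even when execution fails partway: the error is captured in `errors` while the variables assigned before the failure are still returned.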
Methodology
Dria-Agent-α was developed using synthetic data generated through Dria. Dria is a network of LLMs operating in a distributed system, providing high-throughput and powerful pipeline tools for data generation across diverse models.
We designed a framework that creates realistic scenarios requiring complex problem-solving skills, challenging the model to break down problems into manageable steps and utilize provided functions effectively, mimicking real-world tool use cases. The synthetic data generation pipeline, which we plan to release before February 2025 after code cleanup, consists of the following steps:
- Manually define categories and subcategories that represent different domains of tool usage (as shown in the repository structure with domain/subdomain pairs)
- Generate synthetic scenarios using a multi-stage pipeline
- Mock function generation
- User query generation
- Validation of mock functions
- Validation by code execution
- Final dataset compilation
Data Anatomy
Our training data consists of several components:
- User queries
- Python functions with docstrings and parameters (along with their JSON equivalents)
- Mock function implementations that produce expected outputs for correct parameters and different outputs for incorrect ones
- Checklists validating required function calls and their expected outputs for each query
Sample entry:
{
"difficulty": "hard",
"function_schema_python": "def check_user_permissions(username: str, folder_path: str) -> dict:\n \"\"\"Checks the permissions of a specific user for a given network folder.\n\n :param username: The username to check permissions for.\n :param folder_path: The network folder path to check.\n :return: Dictionary containing permission details:\n - read (bool): Whether user has read permissions\n - write (bool): Whether user has write permissions\n - execute (bool): Whether user has execute permissions\n - owner (str): Owner of the folder\n :raises ValueError: If username or folder path is invalid.\"\"\"\n pass\ndef modify_folder_permissions(username: str, folder_path: str, permissions: dict) -> bool:\n \"\"\"Modifies the permissions for a specific user on a network folder.\n\n :param username: The username to modify permissions for.\n :param folder_path: The network folder path to modify.\n :param permissions: Dictionary containing permission settings:\n - read (bool): Whether to grant read permissions\n - write (bool): Whether to grant write permissions\n - execute (bool): Whether to grant execute permissions\n :return: True if permissions were successfully modified, False otherwise.\n :raises ValueError: If invalid parameters are provided.\"\"\"\n pass\ndef verify_folder_access(username: str, folder_path: str) -> bool:\n \"\"\"Verifies if a user can actually access a specific folder after permission changes.\n\n :param username: The username to verify access for.\n :param folder_path: The network folder path to verify.\n :return: True if user can access the folder, False otherwise.\n :raises ValueError: If username or folder path is invalid.\"\"\"\n pass\n",
"function_schema_json": [
{
"name": "check_user_permissions",
"description": "Checks the permissions of a specific user for a given network folder.",
"parameters": {
"type": "object",
"properties": {
"username": {
"type": "string",
"description": "The username to check permissions for."
},
"folder_path": {
"type": "string",
"description": "The network folder path to check."
}
},
"required": [
"username",
"folder_path"
],
"additionalProperties": false
}
},
{
"name": "modify_folder_permissions",
"description": "Modifies the permissions for a specific user on a network folder.",
"parameters": {
"type": "object",
"properties": {
"username": {
"type": "string",
"description": "The username to modify permissions for."
},
"folder_path": {
"type": "string",
"description": "The network folder path to modify."
},
"permissions": {
"type": "object",
"description": "Dictionary containing permission settings:"
}
},
"required": [
"username",
"folder_path",
"permissions"
],
"additionalProperties": false
}
},
{
"name": "verify_folder_access",
"description": "Verifies if a user can actually access a specific folder after permission changes.",
"parameters": {
"type": "object",
"properties": {
"username": {
"type": "string",
"description": "The username to verify access for."
},
"folder_path": {
"type": "string",
"description": "The network folder path to verify."
}
},
"required": [
"username",
"folder_path"
],
"additionalProperties": false
}
}
],
"mock_functions": "def check_user_permissions(username: str, folder_path: str) -> dict:\n \"\"\"\n Checks the permissions of a specific user for a given network folder.\n \n :param username: The username to check permissions for.\n :param folder_path: The network folder path to check.\n :return: Dictionary containing permission details:\n - read (bool): Whether user has read permissions\n - write (bool): Whether user has write permissions\n - execute (bool): Whether user has execute permissions\n - owner (str): Owner of the folder\n :raises ValueError: If username or folder path is invalid.\n \"\"\"\n if not username or not folder_path:\n raise ValueError(\"Username and folder path must be provided\")\n \n if username.lower() == \"alex\" and folder_path == \"\\\\\\\\server\\\\shared\\\\documents\":\n return {\n \"read\": False,\n \"write\": False,\n \"execute\": False,\n \"owner\": \"Administrator\"\n }\n return {\n \"read\": True,\n \"write\": True,\n \"execute\": True,\n \"owner\": \"Administrator\"\n }\ndef modify_folder_permissions(username: str, folder_path: str, permissions: dict) -> bool:\n \"\"\"\n Modifies the permissions for a specific user on a network folder.\n \n :param username: The username to modify permissions for.\n :param folder_path: The network folder path to modify.\n :param permissions: Dictionary containing permission settings:\n - read (bool): Whether to grant read permissions\n - write (bool): Whether to grant write permissions\n - execute (bool): Whether to grant execute permissions\n :return: True if permissions were successfully modified, False otherwise.\n :raises ValueError: If invalid parameters are provided.\n \"\"\"\n if not username or not folder_path:\n raise ValueError(\"Username and folder path must be provided\")\n \n required_keys = [\"read\", \"write\", \"execute\"]\n if not all(key in permissions for key in required_keys):\n raise ValueError(\"Permissions dictionary must contain read, write, and execute keys\")\n\n if 
username.lower() == \"alex\" and folder_path == \"\\\\\\\\server\\\\shared\\\\documents\":\n return True\n return False\ndef verify_folder_access(username: str, folder_path: str) -> bool:\n \"\"\"\n Verifies if a user can actually access a specific folder after permission changes.\n \n :param username: The username to verify access for.\n :param folder_path: The network folder path to verify.\n :return: True if user can access the folder, False otherwise.\n :raises ValueError: If username or folder path is invalid.\n \"\"\"\n if not username or not folder_path:\n raise ValueError(\"Username and folder path must be provided\")\n \n if username.lower() == \"alex\" and folder_path == \"\\\\\\\\server\\\\shared\\\\documents\":\n return True\n return False",
"user_query": "Hi, it's Linda. Could you modify the permissions for Alex on \\\\server\\shared\\documents to allow read and write access?",
"checklist": {
"functions": [
"check_user_permissions",
"modify_folder_permissions",
"verify_folder_access"
],
"values": [
{
"read": false,
"write": false,
"execute": false,
"owner": "Administrator"
},
true,
true
]
}
}
We generated our training data synthetically with two primary objectives:
- Robust performance on out-of-distribution (OOD) queries
- Ability to solve complex, multi-tool problems in a single shot
A major challenge was creating a comprehensive curriculum for better generalization. We focused on real-world use cases, particularly emphasizing developer-centric scenarios since they constitute a significant portion of agentic requests.
Traditional approaches used curriculum elements as seeds for generating user queries. However, this method led to several issues in our context:
- Incorrect function implementations
- Infeasible mock logic
- Inaccurate checklist items
- User queries lacking sufficient information for proper function parameter selection
To address these challenges, we developed a scenario-first approach. We generate detailed scenarios based on curriculum items that incorporate user background information, context-specific details, and relevant supplementary information. This comprehensive approach enables us to create both mock functions and user queries with sufficient context, effectively avoiding infeasible logic in mock functions and information gaps in user queries.
Data Validations
The inherent challenges of synthetic data necessitate validation mechanisms. To address this, we implemented two key validation approaches: validators based on in-context learning (ICL) and execution-feedback-based validation inspired by RLEF [4].
While OpenAI o1 excelled at validating mock functions and scenario feasibility, it wasn't economically viable to validate the entire training dataset with it. Instead, we:
- Generated a validation dataset using OpenAI o1
- Experimented with different models and validation methods
- Compared their performance against OpenAI o1's validations
Beam Search with Process Reward Models
Our first approach involved scaling Test-Time Compute (TTC) using smaller models:
- Llama 3.1 8B achieved 38% agreement with OpenAI o1's validation set
- Using TTC with a beam size of 16 and Qwen2.5-Coder-32B-Instruct as the process reward model (PRM) improved agreement to ~65%
- Beam sizes up to 64 showed improvements but came with significant computational overhead
- Our experiments showed that a PRM specific to the model and task was needed for significant improvements
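The search procedure behind these numbers can be sketched generically. The snippet below is an illustration only, not our pipeline: `beam_search`, `expand`, and `score` are hypothetical names, where `expand` stands in for a small generator model proposing next reasoning steps and `score` stands in for a process reward model rating each partial output. The toy callables at the bottom exist purely to make the example runnable.

```python
from typing import Callable, List


def beam_search(
    start: str,
    expand: Callable[[str], List[str]],
    score: Callable[[str], float],
    beam_size: int,
    max_steps: int,
) -> str:
    """Keep the `beam_size` highest-scoring partial outputs at every step."""
    beam = [start]
    for _ in range(max_steps):
        # Ask the generator for continuations of every partial output in the beam.
        candidates = [nxt for partial in beam for nxt in expand(partial)]
        if not candidates:
            break
        # Rank all continuations with the reward model and prune to the beam size.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_size]
    return max(beam, key=score)


# Toy stand-ins: the "generator" appends a letter, the "reward model" counts a's.
def toy_expand(partial: str) -> List[str]:
    return [partial + "a", partial + "b"]


def toy_score(partial: str) -> float:
    return float(partial.count("a"))


best = beam_search("", toy_expand, toy_score, beam_size=2, max_steps=3)
print(best)  # aaa
```

Each step calls the scorer on every candidate, so the cost grows with the beam size and the branching factor of `expand`, which is consistent with the overhead we observed at larger beams.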
In-Context Learning
Our second approach used in-context learning to bootstrap reasoning capabilities in more cost-effective models. We created a DSPy-optimized few-shot prompt using OpenAI o1 outputs and established a model pool consisting of Qwen2.5-Coder-32B-Instruct and Claude Sonnet. The system generated validation outputs with detailed rationales, routing simpler examples to smaller models while directing complex cases to larger models. This hierarchical approach achieved approximately 80% agreement with OpenAI o1's validations while maintaining computational efficiency.
Code Execution
The final validation step involved implementing an execution feedback loop. We generated solutions with Qwen2.5-Coder-32B-Instruct and executed them, collecting both stack traces and checklist scores as feedback. This process was iterated up to three times per problem. We retained only the entries that achieved a checklist score above 0.75, ensuring high-quality solutions in our final dataset. This execution-based validation helped eliminate solutions that were syntactically correct but failed to meet the functional requirements of the tasks.
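A minimal sketch of such a loop follows. The three attempts and the 0.75 threshold come from the description above; everything else is an assumption made for illustration, including the names `checklist_score` and `refine`, the scoring rule (the fraction of checklist functions called plus expected values returned), and the `generate` callable that stands in for the model.

```python
import traceback


def checklist_score(called, returned, checklist):
    """Assumed rule: fraction of checklist functions called and values returned."""
    total = len(checklist["functions"]) + len(checklist["values"])
    if total == 0:
        return 1.0
    hits = sum(name in called for name in checklist["functions"])
    hits += sum(value in returned for value in checklist["values"])
    return hits / total


def _recording(name, fn, called, returned):
    """Wrap a mock function so each call and its return value are logged."""
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        called.append(name)
        returned.append(result)
        return result
    return wrapper


def refine(generate, tools, checklist, max_attempts=3, threshold=0.75):
    """Regenerate a solution until its checklist score exceeds `threshold`."""
    feedback, score = "", 0.0
    for _ in range(max_attempts):
        code = generate(feedback)  # hypothetical model call, given prior feedback
        called, returned = [], []
        namespace = {n: _recording(n, f, called, returned) for n, f in tools.items()}
        try:
            exec(code, namespace)
            trace = ""
        except Exception:
            trace = traceback.format_exc()
        score = checklist_score(called, returned, checklist)
        if score > threshold:
            return code, score  # entry is kept
        feedback = f"checklist score: {score:.2f}\n{trace}"
    return None, score  # entry is dropped


# Toy run: the first attempt calls a missing function, the second succeeds.
tools = {"ping": lambda: True}
checklist = {"functions": ["ping"], "values": [True]}
attempts = iter(["pong()", "ok = ping()"])
kept, final_score = refine(lambda feedback: next(attempts), tools, checklist)
print(kept, final_score)  # ok = ping() 1.0
```

Feeding the stack trace and score back into the next generation is what distinguishes this from simple rejection sampling: a failed attempt tells the generator what went wrong rather than just being discarded.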
Models
We've trained two models so far, based on Qwen2.5-Coder-3B-Instruct and Qwen2.5-Coder-7B-Instruct. Together they form Dria-Agent-α, the first generation of agentic models to be released by Dria. The models, Dria-Agent-α-3B and Dria-Agent-α-7B, are available on Hugging Face.
Future Work
This is the first iteration of our framework, and we're already working on the next iteration, which will involve methods from RLEF [4] and rStar-Math [5]. These models are released to showcase the capabilities of Pythonic function calling, and to lead the way for the future generation of Dria-Agent models.
References
- [1] Ruis, Laura, et al. Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models. arXiv:2411.12580, arXiv, 19 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2411.12580.
- [2] Nguyen, Dang, et al. DynaSaur: Large Language Agents Beyond Predefined Actions. arXiv:2411.01747, arXiv, 4 Nov. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2411.01747.
- [3] Staff, GitHub. 'Octoverse: AI Leads Python to Top Language as the Number of Global Developers Surges'. The GitHub Blog, 29 Oct. 2024, https://github.blog/news-insights/octoverse/octoverse-2024/.
- [4] Gehring, Jonas, et al. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. arXiv:2410.02089, arXiv, 2 Oct. 2024. arXiv.org, https://doi.org/10.48550/arXiv.2410.02089.
- [5] Guan, Xinyu, et al. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519, arXiv, 8 Jan. 2025. arXiv.org, https://doi.org/10.48550/arXiv.2501.04519.