## Agents (Artificial Intelligence)

> [!Abstract]-
> Humans are fantastic at messy pattern recognition tasks. However, they often rely on tools - like books, Google Search, or a calculator - to supplement their prior knowledge before arriving at a conclusion. Just like humans, Generative AI models can be trained to use tools to access real-time information or suggest a real-world action. For example, a model can leverage a database retrieval tool to access specific information, like a customer's purchase history, so it can generate tailored shopping recommendations. Alternatively, based on a user's query, a model can make various API calls to send an email response to a colleague or complete a financial transaction on your behalf. To do so, the model must not only have access to a set of external tools, it also needs the ability to plan and execute any task in a self-directed fashion. This combination of reasoning, logic, and access to external information, all connected to a Generative AI model, invokes the concept of an agent: a program that extends beyond the standalone capabilities of a Generative AI model. This whitepaper dives into these and associated aspects in more detail.

> [!Cite]-
> Huang, Evan, Emily Xue, Olcan Sercinoglu, Sebastian Riedel, Satinder Baveja, Antonio Gulli, Anant Nawalgaria, Grace Mollison, Joey Haymaker, and Michael Lanning. “Agents (Artificial Intelligence),” 2024. [https://www.kaggle.com/whitepaper-agents](https://www.kaggle.com/whitepaper-agents).
>
> [link](https://www.kaggle.com/whitepaper-agents) [online](http://zotero.org/users/local/kycSZ2wR/items/HPPNFHL8) [local](zotero://select/library/items/HPPNFHL8) [pdf](file://C:\Users\erikt\Zotero\storage\FRXHJZMM\Huang%20et%20al.%20-%202024%20-%20Reviewers%20and%20Contributors.pdf)

## Notes
%% begin notes %%
%% end notes %%

%% begin annotations %%

### Imported: 2024-11-16 8:21 am

In its most fundamental form, a Generative AI agent can be defined as an application that attempts to achieve a goal by observing the world and acting upon it using the tools that it has at its disposal.

In the scope of an agent, a model refers to the language model (LM) that will be utilized as the centralized decision maker for agent processes. The model used by an agent can be one or multiple LMs of any size (small/large) that are capable of following instruction-based reasoning and logic frameworks, like ReAct, Chain-of-Thought, or Tree-of-Thoughts.

It’s important to note that the model is typically not trained with the specific configuration settings (i.e. tool choices, orchestration/reasoning setup) of the agent. However, it’s possible to further refine the model for the agent’s tasks by providing it with examples that showcase the agent’s capabilities, including instances of the agent using specific tools or reasoning steps in various contexts.

Foundational models, despite their impressive text and image generation, remain constrained by their inability to interact with the outside world. Tools bridge this gap, empowering agents to interact with external data and services while unlocking a wider range of actions beyond that of the underlying model alone.

Tools can take a variety of forms and have varying depths of complexity, but typically align with common web API methods like GET, POST, PATCH, and DELETE. For example, a tool could update customer information in a database or fetch weather data to influence a travel recommendation that the agent is providing to the user.
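To make that concrete, below is a minimal sketch of what such a tool can look like in practice: a plain Python function that issues a GET request to a weather service and returns the parsed result for the agent to reason over. The endpoint, parameter names, and response shape are illustrative assumptions, not anything specified in the whitepaper.

```python
import requests

def get_weather(city: str) -> dict:
    """Tool sketch: fetch current weather for a city with a GET request.

    The endpoint and response fields below are placeholders; an agent
    would point this at whatever weather service it has access to.
    """
    response = requests.get(
        "https://api.example.com/v1/weather",  # hypothetical endpoint
        params={"city": city},
        timeout=10,
    )
    response.raise_for_status()
    # e.g. {"city": "Zurich", "temp_c": 7, "conditions": "rain"}
    return response.json()
```

An agent would decide when a user query warrants calling this tool, then fold the returned JSON into its next reasoning step before answering.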
The orchestration layer describes a cyclical process that governs how the agent takes in information, performs some internal reasoning, and uses that reasoning to inform its next action or decision. In general, this loop will continue until an agent has reached its goal or a stopping point.

A ‘turn’ is defined as an interaction between the interacting system and the agent (i.e. one incoming event/query and one agent response).

As of the date of this publication, there are three primary tool types that Google models are able to interact with: Extensions, Functions, and Data Stores.

The easiest way to understand Extensions is to think of them as bridging the gap between an API and an agent in a standardized way, allowing agents to seamlessly execute APIs regardless of their underlying implementation. If you’d like to see Extensions in action, you can try them out in the Gemini application by going to Settings > Extensions and enabling any you would like to test. For example, you could enable the Google Flights extension, then ask Gemini “Show me flights from Austin to Zurich leaving next Friday.”

A model can take a set of known functions and decide when to use each Function and what arguments the Function needs based on its specification. Functions are executed on the client-side, while Extensions are executed on the agent-side. With Functions, the logic and execution of calling the actual API endpoint is offloaded away from the agent and back to the client-side application. A model can be used to invoke functions in order to handle complex, client-side execution flows for the end user, where the agent developer might not want the language model to manage the API execution (as is the case with Extensions).

Google’s Vertex AI platform simplifies this process by offering a fully managed environment with all the fundamental elements covered earlier. Using a natural language interface, developers can rapidly define the crucial elements of their agents - goals, task instructions, tools, sub-agents for task delegation, and examples - to easily construct the desired system behavior. In addition, the platform comes with a set of development tools for testing, evaluating, and measuring agent performance, debugging, and improving the overall quality of developed agents.

Agents extend the capabilities of language models by leveraging tools to access real-time information, suggest real-world actions, and plan and execute complex tasks autonomously. Agents can leverage one or more language models to decide when and how to transition through states and use external tools to complete any number of complex tasks that would be difficult or impossible for the model to complete on its own.

At the heart of an agent’s operation is the orchestration layer, a cognitive architecture that structures reasoning, planning, and decision-making and guides its actions. Various reasoning techniques such as ReAct, Chain-of-Thought, and Tree-of-Thoughts provide a framework for the orchestration layer to take in information, perform internal reasoning, and generate informed decisions or responses.

Tools, such as Extensions, Functions, and Data Stores, serve as the keys to the outside world for agents, allowing them to interact with external systems and access knowledge beyond their training data. Extensions provide a bridge between agents and external APIs, enabling the execution of API calls and retrieval of real-time information.
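As a sketch of the Functions pattern described above (and in contrast to Extensions), the snippet below shows the division of labor: the developer supplies a function declaration, the model emits only the chosen function name and arguments as structured output, and the client-side application performs the actual execution. The declaration format, the `display_cities` name, and the handler are illustrative assumptions rather than the exact Gemini/Vertex AI SDK surface.

```python
import json

# Function declaration the model is given (name, purpose, parameter schema).
# The schema style loosely follows common function-calling APIs; treat it as a sketch.
DISPLAY_CITIES = {
    "name": "display_cities",
    "description": "Show the user a list of recommended cities.",
    "parameters": {
        "type": "object",
        "properties": {
            "cities": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["cities"],
    },
}

def handle_model_turn(model_output: str) -> None:
    """Client-side handler: the model never calls an API itself here.

    It only returns the chosen function name and arguments; this code
    owns the real execution (UI update, API call, etc.).
    """
    call = json.loads(model_output)
    if call.get("name") == "display_cities":
        for city in call["args"]["cities"]:
            print(f"- {city}")

# Example of what the model might return for "Suggest ski cities for my trip":
handle_model_turn('{"name": "display_cities", "args": {"cities": ["Zermatt", "Whistler", "Niseko"]}}')
```

Keeping execution on the client side is what lets the developer apply authentication, rate limits, or human confirmation before any real API call happens.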
Functions provide more nuanced control for the developer through a division of labor, allowing agents to generate Function parameters which can be executed client-side. Data Stores provide agents with access to structured or unstructured data, enabling data-driven applications.

%% end annotations %%

%% Import Date: 2024-11-16T08:21:45.801-07:00 %%
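As a self-contained sketch of the Data Store idea above, the snippet below ranks a handful of documents against a query and splices the best match into a prompt, the retrieval-augmented pattern that Data Stores enable. A production setup would use an embedding model and a vector database; plain term overlap is used here only to keep the example dependency-free, and the document text is invented for illustration.

```python
import re

DOCS = [
    "Refund policy: items can be returned within 30 days with a receipt.",
    "Shipping: standard delivery takes 3-5 business days within the EU.",
    "Support hours: 9am-5pm CET, Monday to Friday.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

question = "How long does standard shipping take?"
context = retrieve(question, DOCS)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# The agent would send `prompt` to its language model to produce a grounded answer.
```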