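Evaluate agent performance

To evaluate your agent's performance, you can use LangSmith evaluations. First, define an evaluator function that judges the agent's results, such as its final output or its trajectory. Depending on the evaluation technique, this may or may not require a reference output: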

type EvaluatorParams = {
    outputs: Record<string, any>;
    referenceOutputs: Record<string, any>;
};

function evaluator({ outputs, referenceOutputs }: EvaluatorParams) {
    // Compare agent outputs against reference outputs.
    // compareMessages is a placeholder for your own comparison logic; no library provides it.
    const outputMessages = outputs.messages;
    const referenceMessages = referenceOutputs.messages;
    const score = compareMessages(outputMessages, referenceMessages);
    return { key: "evaluator_score", score: score };
}
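
The evaluator above calls compareMessages without defining it. One possible shape for that helper is sketched below, assuming messages carry only a role and a content string. The Message type and the scoring rule (fraction of positions that match exactly) are illustrative assumptions, not part of LangSmith or AgentEvals.

```typescript
// Hypothetical message shape, assumed for illustration only.
type Message = { role: string; content?: string };

// Hypothetical helper: returns the fraction of positions where the output
// message has the same role and content as the reference message.
// 1 means the two lists are identical, 0 means nothing matched.
function compareMessages(output: Message[], reference: Message[]): number {
    if (reference.length === 0) {
        return output.length === 0 ? 1 : 0;
    }
    let matches = 0;
    reference.forEach((ref, i) => {
        const out = output[i];
        if (out !== undefined && out.role === ref.role && out.content === ref.content) {
            matches += 1;
        }
    });
    // Divide by the longer list so extra or missing messages lower the score.
    return matches / Math.max(output.length, reference.length);
}
```

Exact positional matching is strict. For free-form text outputs, a fuzzier comparison or an LLM judge is usually a better fit.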

To get started, you can use prebuilt evaluators from the AgentEvals package:

npm install agentevals

Create evaluator

A common way to evaluate agent performance is to compare its trajectory (the order in which it calls its tools) against a reference trajectory:

import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

const outputs = [
    {
        role: "assistant",
        tool_calls: [
            {
                function: {
                    name: "get_weather",
                    arguments: JSON.stringify({ city: "san francisco" }),
                },
            },
            {
                function: {
                    name: "get_directions",
                    arguments: JSON.stringify({ destination: "presidio" }),
                },
            },
        ],
    },
];

const referenceOutputs = [
    {
        role: "assistant",
        tool_calls: [
            {
                function: {
                    name: "get_weather",
                    arguments: JSON.stringify({ city: "san francisco" }),
                },
            },
        ],
    },
];

// Create the evaluator
const evaluator = createTrajectoryMatchEvaluator({
    // Specify how the trajectories will be compared. `superset` accepts the
    // output trajectory as valid if it is a superset of the reference one.
    // Other options: strict, unordered, and subset.
    trajectoryMatchMode: "superset",
});

// Run the evaluator (it returns a promise, so await the result)
const result = await evaluator({
    outputs: outputs,
    referenceOutputs: referenceOutputs,
});

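trajectoryMatchMode specifies how the trajectories are compared. superset accepts the output trajectory as valid if it is a superset of the reference trajectory. The other options are strict, unordered, and subset.

As a next step, learn more about how to customize the trajectory match evaluator.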
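
To make the four match modes concrete, here is a simplified sketch that compares trajectories reduced to lists of tool names. It is not the AgentEvals implementation. The trajectoriesMatch function and its helpers are hypothetical, and tool arguments and non-tool messages are ignored for brevity.

```typescript
type MatchMode = "strict" | "unordered" | "subset" | "superset";

// Count how many times each tool name appears.
function countNames(names: string[]): Record<string, number> {
    const counts: Record<string, number> = Object.create(null);
    for (const name of names) {
        counts[name] = (counts[name] || 0) + 1;
    }
    return counts;
}

// True if every name in `inner` appears in `outer` at least as many times.
function contains(outer: string[], inner: string[]): boolean {
    const outerCounts = countNames(outer);
    const innerCounts = countNames(inner);
    return Object.keys(innerCounts).every(
        (name) => (outerCounts[name] || 0) >= innerCounts[name]
    );
}

// Hypothetical, simplified comparison over tool names only.
function trajectoriesMatch(output: string[], reference: string[], mode: MatchMode): boolean {
    switch (mode) {
        case "strict": // same calls in the same order
            return output.length === reference.length &&
                output.every((name, i) => name === reference[i]);
        case "unordered": // same calls, any order
            return contains(output, reference) && contains(reference, output);
        case "subset": // output makes no calls beyond the reference
            return contains(reference, output);
        case "superset": // output makes at least the reference calls
            return contains(output, reference);
    }
}
```

With the example above reduced to names, trajectoriesMatch(["get_weather", "get_directions"], ["get_weather"], "superset") returns true, while the same inputs return false under strict, unordered, and subset.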

LLM-as-a-judge

You can also use an LLM-as-a-judge evaluator, which uses an LLM to compare the trajectory against the reference outputs and produce a score:

import {
    createTrajectoryLlmAsJudge,
    TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
} from "agentevals/trajectory/llm";

const evaluator = createTrajectoryLlmAsJudge({
    prompt: TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE,
    model: "openai:o3-mini",
});

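Run evaluator

To run an evaluator, you first need to create a LangSmith dataset. To use the prebuilt AgentEvals evaluators, you need a dataset with the following schema:

input: {"messages": [...]} The input messages used to call the agent.
output: {"messages": [...]} The expected message history in the agent output. For trajectory evaluation, you can choose to keep only assistant messages.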
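
As an illustration of the schema above, the sketch below builds one dataset record and keeps only assistant messages in the expected output. The ChatMessage and DatasetExample types and the toTrajectoryExample helper are hypothetical names introduced here. The field names mirror the schema description, and uploading the record to LangSmith is left out.

```typescript
// Hypothetical message shape, assumed for illustration only.
type ChatMessage = { role: string; content?: string; tool_calls?: unknown[] };

// Mirrors the schema described above: input messages and expected output messages.
type DatasetExample = {
    input: { messages: ChatMessage[] };
    output: { messages: ChatMessage[] };
};

// Hypothetical helper: builds a record from the input messages and a full
// message history, keeping only assistant messages in the expected output.
function toTrajectoryExample(input: ChatMessage[], history: ChatMessage[]): DatasetExample {
    return {
        input: { messages: input },
        output: { messages: history.filter((m) => m.role === "assistant") },
    };
}

const example = toTrajectoryExample(
    [{ role: "user", content: "What is the weather in san francisco?" }],
    [
        { role: "user", content: "What is the weather in san francisco?" },
        { role: "assistant", tool_calls: [{ function: { name: "get_weather" } }] },
        { role: "tool", content: "sunny" },
        { role: "assistant", content: "It is sunny in san francisco." },
    ]
);
```

Filtering to assistant messages drops user and tool messages, so the reference trajectory contains only the agent's own steps.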

import { Client } from "langsmith";
import { evaluate } from "langsmith/evaluation";
import { createAgent } from "langchain";
import { createTrajectoryMatchEvaluator } from "agentevals/trajectory/match";

const client = new Client();
const agent = createAgent({...});
const evaluator = createTrajectoryMatchEvaluator({...});

// Run the agent over every example in the dataset and score each run
const experimentResults = await evaluate(
    (inputs) => agent.invoke(inputs),
    {
        // replace with your dataset name
        data: "<Name of your dataset>",
        evaluators: [evaluator],
        client,
    }
);
