Basic example
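A minimal sketch of a code evaluator and how it is passed to an evaluation run. The dataset name "my-dataset", the placeholder target application, and the "answer"/"question" field names are assumptions for illustration, not requirements of the SDK:

```python
from langsmith import Client

def correct(outputs: dict, reference_outputs: dict) -> bool:
    # Exact-match comparison between the application's answer and the reference answer.
    # Assumes each example's reference outputs contain an "answer" field.
    return outputs["answer"] == reference_outputs["answer"]

client = Client()
results = client.evaluate(
    lambda inputs: {"answer": inputs["question"].strip()},  # placeholder target application
    data="my-dataset",  # placeholder dataset name
    evaluators=[correct],
)
```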
Evaluator args
Code evaluator functions must have specific argument names. They can take any subset of the following arguments:

- `run: Run`: The full Run object generated by the application on the given example.
- `example: Example`: The full dataset Example, including the example inputs, outputs (if available), and metadata (if available).
- `inputs: dict`: A dictionary of the inputs corresponding to a single example in a dataset.
- `outputs: dict`: A dictionary of the outputs generated by the application on the given `inputs`.
- `reference_outputs` / `referenceOutputs: dict`: A dictionary of the reference outputs associated with the example, if available.
For most use cases you will only need `inputs`, `outputs`, and `reference_outputs`. `run` and `example` are useful only if you need some extra trace or example metadata outside of the actual inputs and outputs of the application (both patterns are shown in the sketch below).
When using JS/TS these should all be passed in as part of a single object argument.
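For illustration, each evaluator below declares only the arguments it needs; the "answer" field name is an assumption about the dataset, not something the SDK requires:

```python
from langsmith.schemas import Example, Run

def concise(outputs: dict, reference_outputs: dict) -> bool:
    # Only needs the generated and reference outputs; arguments are matched by name.
    return len(outputs["answer"]) <= 2 * len(reference_outputs["answer"])

def ran_without_error(run: Run, example: Example) -> bool:
    # Uses the full Run and Example objects for trace/dataset details that are
    # not part of the plain inputs/outputs dictionaries.
    return run.error is None and example.outputs is not None
```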
Evaluator output
Code evaluators are expected to return one of the following types:

Python and JS/TS

- `dict`: dicts of the form `{"score" | "value": ..., "key": ...}` allow you to customize the metric type ("score" for numerical and "value" for categorical) and metric name. This is useful if, for example, you want to log an integer as a categorical metric.

Currently Python only

- `int | float | bool`: interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
- `str`: interpreted as a categorical metric. The function name is used as the name of the metric.
- `list[dict]`: return multiple metrics using a single function.
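As a sketch of these return shapes (the metric names and the "answer" field are illustrative, not fixed by the SDK):

```python
def formality(outputs: dict) -> dict:
    # Log an integer rating as a categorical metric with a custom name.
    rating = 1 if "please" in outputs.get("answer", "").lower() else 0
    return {"key": "formality_level", "value": rating}

def accuracy_metrics(outputs: dict, reference_outputs: dict) -> list[dict]:
    # Return several metrics from a single evaluator.
    answer = outputs.get("answer", "")
    return [
        {"key": "exact_match", "score": answer == reference_outputs.get("answer")},
        {"key": "answer_length", "score": len(answer)},
    ]
```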
Additional examples
Requires langsmith>=0.2.0
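One further sketch along these lines: an evaluator that compares `outputs` against `reference_outputs` and returns a float, which is logged as a continuous metric under the function's name (the "answer" field is again a placeholder):

```python
def token_overlap(outputs: dict, reference_outputs: dict) -> float:
    # Continuous metric: fraction of reference tokens that appear in the answer.
    reference = reference_outputs.get("answer", "")
    if not reference:
        return 0.0
    predicted_tokens = set(outputs.get("answer", "").lower().split())
    reference_tokens = set(reference.lower().split())
    return len(predicted_tokens & reference_tokens) / len(reference_tokens)
```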
Related
- Evaluate aggregate experiment results: Define summary evaluators, which compute metrics for an entire experiment.
- Run an evaluation comparing two experiments: Define pairwise evaluators, which compute metrics by comparing two (or more) experiments against each other.