Basic example
Here, we’ll compute the f1-score, which is a combination of precision and recall. This sort of metric can only be computed over all of the examples in our experiment, so our evaluator takes in a list of outputs and a list of reference_outputs. We then pass this summary evaluator to the evaluate method as follows:
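
The snippet below is a minimal sketch, not the exact example from the original page: it assumes a toxicity-classification dataset in which both the experiment outputs and the reference outputs carry a "class" field, a hypothetical dataset name, and a recent version of the LangSmith Python SDK (adjust the import and evaluate call to your SDK version).

```python
from langsmith import Client


def f1_score_summary_evaluator(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # Tally predictions across *all* examples in the experiment.
    true_positives = false_positives = false_negatives = 0
    for output, reference in zip(outputs, reference_outputs):
        predicted = output["class"]    # hypothetical output field
        expected = reference["class"]  # hypothetical reference field
        if predicted == "Toxic" and expected == "Toxic":
            true_positives += 1
        elif predicted == "Toxic" and expected == "Not toxic":
            false_positives += 1
        elif predicted == "Not toxic" and expected == "Toxic":
            false_negatives += 1

    if true_positives == 0:
        return {"score": 0.0, "name": "f1_score"}

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    # Return the dict form described under "Summary evaluator output" below.
    return {"score": f1, "name": "f1_score"}


client = Client()
results = client.evaluate(
    lambda inputs: {"class": "Not toxic"},  # placeholder target; substitute your real application
    data="Toxic Queries",                   # hypothetical dataset name
    summary_evaluators=[f1_score_summary_evaluator],
)
```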

Summary evaluator args
Summary evaluator functions must have specific argument names. They can take any subset of the following arguments:

inputs: list[dict]
: A list of the inputs for each example in the dataset.

outputs: list[dict]
: A list of the dict outputs produced by the experiment, one for each example.

reference_outputs/referenceOutputs: list[dict]
: A list of the reference outputs for each example, if available.

runs: list[Run]
: A list of the full Run objects generated by the experiment, one for each example. Use this if you need access to intermediate steps or metadata about each run (see the sketch after this list).

examples: list[Example]
: All of the dataset Example objects, including the example inputs, outputs (if available), and metadata (if available).
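
For instance, the runs argument is handy for experiment-level metrics that depend on run metadata rather than on the outputs themselves. The sketch below is hypothetical (the 5-second budget and the metric name are made up) and assumes the Run objects expose start_time and end_time timestamps.

```python
from langsmith.schemas import Run


def under_latency_budget(runs: list[Run]) -> dict:
    # Fraction of runs in the experiment that completed within a 5-second budget.
    # Uses run metadata (start/end timestamps) rather than the run outputs.
    within_budget = sum(
        1
        for run in runs
        if run.end_time is not None
        and (run.end_time - run.start_time).total_seconds() <= 5.0
    )
    return {"score": within_budget / len(runs), "name": "under_latency_budget"}
```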
Summary evaluator output
Summary evaluators are expected to return one of the following types (both forms are illustrated in the sketch after this list):

dict (Python and JS/TS)
: A dict of the form {"score": ..., "name": ...} lets you pass a numeric or boolean score together with a metric name.

int | float | bool
: Interpreted as a continuous metric that can be averaged, sorted, etc. The function name is used as the name of the metric.
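
For illustration, here is one summary evaluator written in each style. These are sketches: the "answer" field is a hypothetical key in the experiment and reference outputs, and both metric names are made up.

```python
def exact_match_rate(outputs: list[dict], reference_outputs: list[dict]) -> dict:
    # dict form: explicit score plus metric name.
    matches = sum(
        output["answer"] == reference["answer"]
        for output, reference in zip(outputs, reference_outputs)
    )
    return {"score": matches / len(outputs), "name": "exact_match_rate"}


def mean_output_length(outputs: list[dict]) -> float:
    # Bare numeric form: the function name ("mean_output_length") becomes the metric name.
    return sum(len(str(output["answer"])) for output in outputs) / len(outputs)
```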