Evaluating LLM-Generated Output
Brian Kotos
November 5, 2025
When building an AI application powered by large language models (LLMs), quality control is just as critical as automated testing is for conventional software, but it looks very different in practice.
Unlike traditional code, where the same input always produces the same output, LLMs are non-deterministic: they can generate a different response each time they are called. So rather than producing a simple pass/fail like a unit test, LLM evals often rely on statistical metrics that score output quality on a continuous scale from 0 to 1 (for example, 0.22 or 0.87).
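To make that concrete, here is a minimal sketch of how an eval might handle non-determinism: run the same prompt several times, score each output, and gate on the aggregate rather than any single run. The scores below are hypothetical placeholders standing in for whatever metric you choose.

```python
from statistics import mean

# Hypothetical similarity scores (0 to 1) from five runs of the same
# prompt. In a real eval these would come from a metric such as BLEU
# or embedding similarity, not hard-coded values.
scores = [0.82, 0.76, 0.88, 0.79, 0.85]

average = mean(scores)
worst = min(scores)

# Gate on the aggregate: require a healthy average AND a floor on the
# worst single run, since any one response might reach a user.
passes = average >= 0.8 and worst >= 0.7
print(f"avg={average:.2f} min={worst:.2f} passes={passes}")
```

The exact thresholds (0.8 average, 0.7 floor) are arbitrary here; in practice you would tune them against outputs you have judged by hand.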
You also need to choose between quantitative and qualitative metrics, depending on the attribute you are trying to evaluate. A quantitative metric is objective: something that can be counted or measured. A qualitative metric is subjective, based more on feel or perception.
For example, a quantitative metric might measure how similar an LLM's answer is to a “golden answer” (an answer known to be correct). You could use the BLEU algorithm, which compares the n-grams of the actual output against those of the golden answer and returns a score from 0 to 1 representing their degree of similarity.
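To show the idea, here is a drastically simplified BLEU-style scorer: it computes n-gram precision against the golden answer and applies a brevity penalty. Real BLEU implementations (such as those in NLTK or sacreBLEU) add smoothing and multi-reference support, so treat this as an illustration, not a production metric.

```python
import math
from collections import Counter

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU: geometric mean of modified n-gram precisions,
    scaled by a brevity penalty. Returns a score in [0, 1]."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each n-gram's count to how often it appears in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * geo_mean

golden = "the capital of france is paris"
print(simple_bleu("the capital of france is paris", golden))  # identical text -> 1.0
print(simple_bleu("paris is the capital", golden))            # partial overlap, between 0 and 1
```

An exact match scores 1.0, and a partially overlapping answer lands somewhere in between, which is exactly the continuous 0-to-1 signal described above.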
A qualitative metric, on the other hand, might measure how empathetic the LLM's response feels to the user. Attributes like this are subjective and in the past would need to be evaluated by a human judge. With the advent of LLMs, however, you can use an LLM-as-a-judge approach to evaluate qualitative traits systematically and at scale, something that would be impractical or prohibitively costly for humans to do.
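A minimal sketch of the LLM-as-a-judge pattern might look like the following. The `call_llm` function here is a hypothetical stand-in for a real model client (e.g. an OpenAI or Anthropic SDK call) that returns a canned reply so the sketch stays runnable; the structure to note is the rubric prompt, the parseable answer format, and the normalization to the same 0-to-1 scale as the quantitative metrics.

```python
import re

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; returns a canned
    # judge reply so this sketch runs without network access.
    return "Score: 4\nThe response acknowledges the user's frustration and offers help."

def judge_empathy(user_message: str, model_response: str) -> float:
    # Ask the judge for a constrained, parseable rating plus a reason.
    prompt = (
        "Rate how empathetic the assistant's response is on a scale of "
        "1 (cold) to 5 (deeply empathetic). Reply with 'Score: <n>' on "
        "the first line, then a one-sentence justification.\n\n"
        f"User: {user_message}\nAssistant: {model_response}"
    )
    reply = call_llm(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError(f"Could not parse judge reply: {reply!r}")
    # Normalize the 1-5 rating onto the 0-1 scale used by other metrics.
    return (int(match.group(1)) - 1) / 4

score = judge_empathy(
    "My order never arrived and nobody will help me.",
    "I'm so sorry, that sounds really frustrating. Let's get it sorted out.",
)
print(score)  # 0.75 with the canned reply above
```

Forcing a fixed answer format ("Score: <n>") is what makes the judge's output machine-readable; without it you are back to parsing free-form prose.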
I'll continue sharing what I learn as I dig deeper into this subject. In the meantime, one resource I've found particularly helpful is a book that I purchased earlier this year, Building AI Intensive Python Applications.