I had the pleasure of attending my first AI hackathon this past weekend, hosted by Weights & Biases. The goal was to implement this paper on using LLMs as judges with reference-guided verdicts. The setup works by having a candidate LLM answer trivia questions, then passing the questions, the candidate's answers, and the reference answers to multiple judge LLMs. We then measure the agreement and accuracy of the judge LLMs using kappa statistics and a majority vote. You can find my project on GitHub here.
Shout out to the whole W&B team, Alex Volkov, and the judges for making an awesome hackathon experience.
What it Does
My project leverages reference-guided evaluations with multiple LLMs to assess the utility of using LLMs as judges of a candidate LLM's outputs. I determine the majority vote and calculate kappa statistics to evaluate the inter-rater reliability between user-provided verdicts and the LLM judges' verdicts. Inspired by the way human evaluations typically involve multiple annotators to ensure reliability and accuracy, the paper proposes a similar method that leverages multiple LLMs as judges for evaluating free-form outputs. The primary objective was to determine whether the collective judgment of multiple LLMs could achieve a level of reliability and accuracy comparable to, or even surpassing, that of human annotators.
A candidate LLM refers to a model that generates output for a given input. In this methodology, candidate LLMs were used to generate free-form outputs for the given tasks. The generated outputs are the content that the LLMs acting as judges evaluate against reference answers.
A judge LLM is utilized to deliver a verdict (e.g., True/False) on outputs produced by a candidate LLM. In this study, the focus was on a realistic setting where a judge LLM evaluates the output generated by a candidate LLM by comparing it to a reference answer within the context established by an input.
Building & Evaluating
I used OpenRouter for both the candidate LLM and the judge LLMs. A simple Python script and specific prompts were crafted for the LLM judges. I processed a random sample of 30 questions from the HotpotQA, TriviaQA, and TruthfulQA datasets. The LLM judges were given the candidate LLM's response, the question, and the reference answer to determine the correctness of the candidate's response. I then used a majority vote among the judge LLMs to decide if the candidate LLM's answer was correct (see the helper sketched below). Additionally, I implemented a function to calculate Cohen's kappa to measure the agreement between the judges and my own annotations.
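As a rough illustration, a majority-vote helper can be as small as the sketch below; the function name and the 'True'/'False' verdict strings are assumptions for illustration rather than the project's exact code.

from collections import Counter
from typing import List

def majority_vote(verdicts: List[str]) -> str:
    """Return the most common verdict ('True'/'False') among the judge LLMs.

    Ties fall to whichever verdict was seen first; a fuller implementation
    might flag ties for manual review instead.
    """
    return Counter(verdicts).most_common(1)[0][0]

# Example: two judges accept the candidate's answer, one rejects it.
print(majority_vote(["True", "True", "False"]))  # -> True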
The evaluation process begins with three components: the contextual input, the gold-standard or reference answer, and the output from the candidate LLM. These are passed to a judge LLM through a structured prompt, which the judge uses to perform the evaluation and deliver its verdict.
To analyze the reliability of the evaluations conducted by human annotators and LLMs-as-judges, I employed majority vote, percent agreement, Fleiss's kappa, and Cohen's kappa. These metrics provide insight into the degree of concordance between the human annotators' judgments and those of the LLM judges.
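For reference, here is one way these metrics can be computed with off-the-shelf libraries (scikit-learn for Cohen's kappa, statsmodels for Fleiss's kappa). This is a minimal sketch assuming every rater issues a 'True'/'False' verdict for every question; the function and variable names are illustrative, not the project's exact code.

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement_metrics(user_verdicts, judge_verdicts):
    """user_verdicts: list of 'True'/'False' strings from the human annotator.
    judge_verdicts: dict mapping judge name -> list of 'True'/'False' strings."""
    metrics = {}
    for name, verdicts in judge_verdicts.items():
        # Percent agreement: fraction of questions where judge and human match.
        percent = np.mean([u == v for u, v in zip(user_verdicts, verdicts)])
        # Cohen's kappa: pairwise, chance-corrected agreement with the human.
        metrics[name] = {
            "percent_agreement": float(percent),
            "cohens_kappa": cohen_kappa_score(user_verdicts, verdicts),
        }
    # Fleiss's kappa: chance-corrected agreement across all raters at once.
    ratings = np.array([user_verdicts] + list(judge_verdicts.values())).T  # (items, raters)
    table, _ = aggregate_raters(ratings)
    metrics["fleiss_kappa"] = fleiss_kappa(table)
    return metrics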
Implementing a research paper for the first time was a significant achievement. It was a rewarding experience to see the theoretical concepts come to life through code.
Future Directions
Future improvements include exploring character-level augmentations to simulate typographical errors commonly seen in manually entered data and experimenting with adaptive learning rates to optimize convergence during training. In later iterations, I can also implement Quantization Aware Training (QAT) to allow the model to run faster during inference.
This project was a fantastic learning experience, and I look forward to refining it further and exploring more datasets and models in the future.
Generating Prompts for LLM Judges
def generate_prompt(question, candidate_response, reference_answer):
    return f"""
Question: {question}
Provided Answer: {candidate_response}
Reference Answer: {reference_answer}
You are a helpful assistant acting as an impartial judge. You will be given a Question, a Reference
Answer, and a Provided Answer. Your task is to judge whether the Provided Answer is correct by
comparing it to the Reference Answer. If the Provided Answer is correct, choose 'True', otherwise
choose 'False'.
An answer counts as true if it describes the literal truth about the real world. True factual claims are
usually supported by reliable, publicly available evidence. Answers that are only true according to
a belief system or tradition (e.g., "Libras have friendly personalities") are counted as false. Provide
a brief explanation for your decision.
"""
LLM Judge Model
import os
from typing import Dict

import weave
from openai import AsyncOpenAI

# OpenRouter exposes an OpenAI-compatible API; the base URL is OpenRouter's
# standard endpoint, and the environment variable name is an assumption about
# how the key was configured.
openai_client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

class LLMJudgeModel(weave.Model):
    name: str
    model_name: str

    @weave.op()
    async def predict(
        self, question: str, candidate_response: str, reference_answer: str
    ) -> Dict[str, str]:
        prompt = generate_prompt(question, candidate_response, reference_answer)
        response = await openai_client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=150,
            temperature=0.001,
        )
        generated_text = response.choices[0].message.content.strip()
        # Extract the verdict and keep the full text as the explanation.
        if "True" in generated_text:
            verdict = "True"
        elif "False" in generated_text:
            verdict = "False"
        else:
            verdict = "Ambiguous"
        explanation = generated_text
        print(f"Judge says: verdict={verdict}, explanation={explanation}")
        return {"verdict": verdict, "explanation": explanation}
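The evaluation code in the next section assumes that judge_models, candidate_model, hotpotqa_data, and hotpotqa_user_annotations have already been set up elsewhere. As a sketch, the judges might be instantiated like this; the OpenRouter model identifiers are placeholders, not necessarily the models I used:

judge_models = [
    LLMJudgeModel(name="gpt-judge", model_name="openai/gpt-4o-mini"),
    LLMJudgeModel(name="claude-judge", model_name="anthropic/claude-3.5-sonnet"),
    LLMJudgeModel(name="llama-judge", model_name="meta-llama/llama-3.1-70b-instruct"),
]
# candidate_model is defined analogously, with a predict(question) method that
# returns a free-form answer string.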
Evaluation
async def prepare_evaluation_examples():
    evaluation_examples = []
    results = []
    verdicts_dict = {model.name: [] for model in judge_models}
    verdicts_dict["user"] = []
    for idx, sample in enumerate(hotpotqa_data):
        question = sample["question"]
        reference_answer = sample["answer"]
        candidate_response = await candidate_model.predict(question)
        # Collect judge verdicts
        judge_outputs = {}
        for judge_model in judge_models:
            judge_output = await judge_model.predict(
                question, candidate_response, reference_answer
            )
            verdicts_dict[judge_model.name].append(judge_output["verdict"])
            judge_outputs[judge_model.name] = judge_output
        user_verdict = hotpotqa_user_annotations[idx]
        verdicts_dict["user"].append(user_verdict)
        judge_outputs["user"] = {
            "verdict": user_verdict,
            "explanation": "User provided verdict.",
        }
        result = {
            "question": question,
            "candidate_response": candidate_response,
            "reference_answer": reference_answer,
            "judge_verdicts": judge_outputs,
            "target": user_verdict,
        }
        results.append(result)
        evaluation_examples.append(result)
    return evaluation_examples, results, verdicts_dict
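Tying it together, a minimal driver along these lines runs the pipeline and applies the majority vote, reusing the majority_vote and agreement_metrics helpers sketched earlier; the reporting format here is my own, not the project's exact output:

import asyncio

async def main():
    examples, results, verdicts_dict = await prepare_evaluation_examples()
    judge_names = [name for name in verdicts_dict if name != "user"]
    # Majority vote per question across the judge LLMs, compared to my annotations.
    correct = 0
    for idx, result in enumerate(results):
        votes = [verdicts_dict[name][idx] for name in judge_names]
        if majority_vote(votes) == result["target"]:
            correct += 1
    print(f"Majority vote matches user annotations on {correct}/{len(results)} questions")
    print(agreement_metrics(verdicts_dict["user"], {n: verdicts_dict[n] for n in judge_names}))

asyncio.run(main())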