
LLM Evals for Non-Engineers: A Step-by-Step Guide

Large Language Models (LLMs) are amazing. They can write text, answer questions, and even write code. But how do you know if one is doing a good job? That’s where LLM evals come in!

If you’re not an engineer, don’t worry. This guide is for you. We’ll keep it simple, fun, and easy to follow. By the end, you’ll know how to evaluate an LLM step-by-step. No technical background needed. Let’s dive in!

🏁 Why Evaluate LLMs?

Imagine you’re using an AI assistant. Sometimes it gives helpful answers. Other times, it makes stuff up. You want to know when it’s trustworthy, right?

Evaluating LLMs helps us answer questions like:

- Is the answer accurate, or is the model making things up?
- Is the response clear and easy to understand?
- Is it actually helpful for the task at hand?

This process is called an “eval” — short for evaluation.

🧰 What You’ll Need

You don’t need to be a coder. All you need is:

- A spreadsheet (Google Sheets or Excel both work)
- Access to an LLM, such as ChatGPT
- A clear idea of what a “good” answer looks like for your task

You’re now ready to get started!

📋 Step-by-Step Guide to Running LLM Evals

1. Pick a Task

What do you want to test? Choose a specific task the LLM should perform. For example:

- Summarizing a news article
- Answering customer support questions
- Writing a short product description

Be clear on the goal. This will help you judge responses better.

2. Write Some Prompts

Now create 5 to 10 example questions or instructions (called “prompts”). These should reflect what users would really ask.

Examples:

- “Summarize this article in two sentences.”
- “What is your return policy?”
- “Write a short product description for a stainless steel water bottle.”

The idea is to capture real-world use cases.

3. Get Model Responses

Use a tool like ChatGPT or any LLM platform. Copy and paste each prompt, and record the model’s reply.

Put the responses in a table. One column for the prompt, one for the output.
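If you (or a teammate) are comfortable with a tiny bit of Python, you can automate this step. Below is a minimal sketch, assuming you have the openai package installed and an OPENAI_API_KEY set in your environment; the prompts, model name, and the responses.csv filename are placeholders to replace with your own.

```python
# Minimal sketch: send each prompt to an LLM and save the replies to a CSV.
# Assumes the `openai` package is installed and OPENAI_API_KEY is set.
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Summarize this article in two sentences: ...",
    "What is your return policy?",  # placeholder prompts -- use your own
]

rows = []
for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    rows.append({"prompt": prompt, "response": response.choices[0].message.content})

# One column for the prompt, one for the output -- same as the spreadsheet.
with open("responses.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "response"])
    writer.writeheader()
    writer.writerows(rows)
```

If you’d rather stick to copy-paste, that works just as well; the script only saves you some clicking.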

4. Create an Evaluation Rubric

A rubric is a checklist or rating system. It helps you stay consistent. Here are some categories non-engineers can use:

- Accuracy: Is the information correct?
- Clarity: Is the answer easy to understand?
- Helpfulness: Does it actually solve the user’s problem?
- Tone: Is the style right for your audience?

Rate each category from 1 to 5. For example:

- 1 = very poor
- 3 = acceptable
- 5 = excellent

You can even color code them in your spreadsheet! 🌈
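For the curious, here is the same idea written down as a small Python snippet. This is purely illustrative: the category names and the 1-to-5 scale are just the examples from above, not a standard you have to follow.

```python
# Illustrative only: the rubric above, written as small Python dictionaries.
# Category names and descriptions are examples -- pick what matters for your task.
RUBRIC = {
    "accuracy": "Is the information correct?",
    "clarity": "Is the answer easy to understand?",
    "helpfulness": "Does it actually solve the user's problem?",
    "tone": "Is the style right for your audience?",
}

SCALE = {1: "very poor", 2: "poor", 3: "acceptable", 4: "good", 5: "excellent"}
```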

5. Score Each Response

Now it’s your time to shine! Read each model’s response. Use your rubric to score it.

An example row might look like this:

Prompt: “What is your return policy?”
Response: “You can return items within 30 days of purchase…”
Scores: Accuracy 5, Clarity 4, Helpfulness 4, Tone 5
Notes: “Correct, but could mention the refund timeline.”
Done! Only 9 more to go.
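If you collected responses with the earlier script, here is a sketch of a tiny scoring loop: it shows each response in the terminal, asks you for a 1-to-5 score per category, and saves everything to a scores.csv file. The filenames and category names are the same placeholder choices as before.

```python
# Sketch: read responses.csv, ask for a 1-5 score per category, save scores.csv.
import csv

CATEGORIES = ["accuracy", "clarity", "helpfulness", "tone"]  # placeholder rubric

with open("responses.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

for row in rows:
    print("\nPROMPT:", row["prompt"])
    print("RESPONSE:", row["response"])
    for category in CATEGORIES:
        row[category] = input(f"{category} (1-5): ")  # type a score, press Enter

with open("scores.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["prompt", "response"] + CATEGORIES)
    writer.writeheader()
    writer.writerows(rows)
```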

6. Look for Patterns

Once you rate all responses, look across your scores. Are there areas where the model shines? Any weak spots?

For example:

- The model is accurate, but its answers are long and hard to follow.
- It handles simple questions well but stumbles on multi-step ones.
- The tone is great, but it occasionally makes facts up.

This helps you decide if the model is ready for your task — or needs improvement.
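Spreadsheet averages work fine for this, but if you saved your scores to scores.csv with the earlier sketch, a few lines of Python can compute the average per category. Again, the filename and category names are assumptions carried over from the previous examples.

```python
# Sketch: average each rubric category across all scored rows.
# Assumes a scores.csv with numeric 1-5 columns named after the categories.
import csv
from collections import defaultdict

totals = defaultdict(float)
counts = defaultdict(int)

with open("scores.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        for category in ("accuracy", "clarity", "helpfulness", "tone"):
            if row.get(category):
                totals[category] += float(row[category])
                counts[category] += 1

for category in totals:
    print(f"{category}: {totals[category] / counts[category]:.1f} / 5")
```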

7. Share and Improve

You did it! 🎉 Now share your findings. You can show your team:

- Your spreadsheet of prompts, responses, and scores
- The patterns you noticed (strengths and weak spots)
- A few standout examples, both good and bad
This kind of feedback is gold for engineers. It helps them fine-tune the model or pick a better one.

🧠 Tips and Tricks

- Use realistic prompts, not trick questions.
- Score a few responses, take a break, then continue. Fresh eyes are more consistent.
- If you can, have a teammate score the same responses and compare notes.

🤔 What If You’re Not Sure?

Sometimes the response is sorta okay… but not perfect. That’s fine! Give it a “3” and write a short note. Comments are helpful.

Example:

“The answer is correct, but the explanation is a bit confusing.”

🎯 Why Your Input Matters

LLM evals aren’t just for engineers. Experts from all fields — marketing, customer support, education — can help train and improve AI.

By giving clear, honest feedback, you’re making AI better for everyone.

Here’s what non-engineers bring to the table:

- Domain expertise: you know what a good answer looks like in your field
- A real user’s perspective
- Clear, jargon-free feedback

🛠️ Free Tools to Help You

You can use basic tools to do your evals:

- Google Sheets or Excel: for your prompt, response, and score table
- ChatGPT (free tier): to generate responses
- A shared doc: for notes and summaries

🚀 Ready to Try?

Great! Pick a small task, write 5 prompts, and give it a go. You might be surprised at how easy it is.

Remember, you don’t need to be a coder to help improve AI. With a keen eye and clear judgment, you’re already contributing.

Thanks for making AI better, one prompt at a time!

Happy Evaluating! 💡
