How to Use PromptForge Lab as a Real Testing Workspace
PromptForge Lab is where you stop asking "did it work once?" and start asking "will it keep working under real conditions?" Use it after Prompt Architect has produced a stable prompt draft and you need comparison, stress testing, reasoning audits, or guided coaching.
Socratic mode
Best when the request is still fuzzy and you need the system to ask better questions before generating a final answer.
Logic Analysis
Use this when the input is a policy, memo, argument, or decision draft that needs assumptions, contradictions, and weak evidence mapped clearly.
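As a concrete illustration, the report below sketches what such a mapping can look like; the field names and the example claim are invented stand-ins, not the Lab's actual schema.

    # Hypothetical report shape; field names are assumptions, not the Lab's schema.
    analysis_report = {
        "claim": "Ship feature X this quarter",
        "assumptions": ["users want X", "engineering capacity is free in Q3"],
        "contradictions": ["the same memo freezes Q3 for infrastructure work"],
        "weak_evidence": ["'users want X' rests on one survey of 12 people"],
    }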
Benchmark Studio
Use this when you already have several candidate prompts or models and need a disciplined way to rank them.
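Discipline here means every candidate sees the same inputs and the same rubric. The sketch below assumes exactly that; score and test_inputs are placeholder stand-ins you would replace with your own rubric and data.

    from statistics import mean

    def score(candidate: str, case: dict) -> float:
        # Placeholder rubric; replace with real checks (format, facts, latency).
        return 0.0 if case["kind"] == "failure" else 1.0

    test_inputs = [{"kind": "normal"}, {"kind": "edge"}, {"kind": "failure"}]
    candidates = ["prompt-v1", "prompt-v2", "prompt-v3"]

    # Same inputs, same rubric, for every candidate; only then is the ranking fair.
    ranked = sorted(
        candidates,
        key=lambda c: mean(score(c, case) for case in test_inputs),
        reverse=True,
    )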
Workspace trace
Keep results linked to a workspace so variants, decisions, and reports do not get lost in a one-off experiment.
Step 1: Pick the correct mode
If the problem is unclear, start in Socratic mode. If the reasoning needs inspection, use Logic Analysis. If you are choosing between options, use Benchmark Studio.
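The rule is mechanical enough to write down. This triage function is only a sketch of the decision rule, not a Lab API; the boolean inputs are your own judgment calls.

    def pick_mode(problem_is_clear: bool, auditing_reasoning: bool) -> str:
        # Sketch of the Step 1 rule; the booleans are your own judgment calls.
        if not problem_is_clear:
            return "Socratic mode"   # sharpen the question first
        if auditing_reasoning:
            return "Logic Analysis"  # inspect assumptions and contradictions
        return "Benchmark Studio"    # otherwise you are choosing between options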
Step 2: Prepare one honest input set
Use a real prompt, a realistic dataset, or a real source document. The Lab only produces trustworthy conclusions when the inputs look like production.
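Here is a minimal sketch of an honest input set, assuming a support-ticket use case; the sample texts are invented stand-ins that, in practice, should come from real traffic.

    # Invented stand-ins; in practice, pull these from real production traffic.
    test_inputs = [
        {"id": "normal-01", "kind": "normal",
         "text": "Customer asks where order 4812 is; it shipped yesterday."},
        {"id": "edge-01", "kind": "edge",
         "text": "REFUND?? ordered twice, charged 3x, no order number given"},
        {"id": "failure-01", "kind": "failure",
         "text": ""},  # empty input: a known failure-prone case
    ]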
Step 3: Lock the variable you are testing
Change one thing at a time. Compare models with the same prompt, or compare prompts on the same model, but do not mix both if you want a clean result.
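For illustration, the sketch below assumes a hypothetical run(model, prompt, text) helper, stubbed here so the example is self-contained, plus invented prompts and one input from Step 2. Designs A and B are clean; the anti-pattern at the end is not.

    # Stub so the example runs; a real call would hit the Lab or an API.
    def run(model: str, prompt: str, text: str) -> str:
        return f"[{model}] {prompt[:20]}... -> {text[:20]}..."

    PROMPT_V1 = "Summarize the ticket."                       # invented examples
    PROMPT_V2 = "Summarize the ticket and flag refund asks."
    test_inputs = [{"text": "Where is order 4812?"}]          # from Step 2

    # Design A: same prompt, different models -> you are testing the model.
    by_model = {m: [run(m, PROMPT_V2, t["text"]) for t in test_inputs]
                for m in ["model-small", "model-large"]}

    # Design B: same model, different prompts -> you are testing the prompt.
    by_prompt = {name: [run("model-large", p, t["text"]) for t in test_inputs]
                 for name, p in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]}

    # Anti-pattern: changing model AND prompt in the same run. If the outputs
    # differ, you cannot tell which change caused it.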
Step 4: Run and read the outputs side by side
Compare not only style but also instruction compliance, factual completeness, formatting stability, and latency.
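One way to make those axes checkable, assuming the prompt demands JSON with known fields; the function name, fields, and rubric here are illustrative, not a Lab feature.

    import json

    def read_output(output: str, required_fields: list[str], latency_s: float) -> dict:
        # Illustrative rubric covering the Step 4 axes, not just style.
        try:
            parsed = json.loads(output)  # does the output stay valid JSON?
            format_ok = all(f in parsed for f in required_fields)
        except json.JSONDecodeError:
            format_ok = False
        return {
            "format_ok": format_ok,
            "latency_s": latency_s,
            # Instruction compliance and factual completeness usually need a
            # human or eval-model pass; keep explicit slots for them anyway.
            "follows_instructions": None,
            "missing_facts": None,
        }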
Step 5: Read the result tabs like an operator
Use the main output to judge usefulness, the graph or grounding views to inspect logic, and the analytics area to understand why one option performed better.
Step 6: Save the winner and keep a fallback
Document the winning version, but retain the runner-up. Good release practice always preserves a challenger or rollback option.
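One way to document the decision, as a sketch; the file name, scores, and criteria below are invented examples.

    import json

    # Invented example values; the point is that the fallback is recorded,
    # not deleted, and the failure criteria were written down before the run.
    release = {
        "workspace": "support-triage",
        "winner":   {"prompt": "v2", "model": "model-large", "score": 0.87},
        "fallback": {"prompt": "v1", "model": "model-large", "score": 0.81},
        "failure_criteria": ["malformed JSON", "hallucinated order ID", "latency > 4s"],
    }
    with open("release_decision.json", "w") as f:
        json.dump(release, f, indent=2)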
Benchmark discipline
Use at least one normal input, one edge case, and one failure-prone input.
Decide the winner based on quality plus cost, not quality alone; a scoring sketch follows this list.
Write down what counts as failure before you run the comparison.
Re-run once when a result is surprising so you do not promote a lucky sample.
Run experiments against data that resembles real traffic, not only ideal examples.
Keep benchmark notes so someone else can understand why the winning variant won.
Review latency, format stability, and hallucination risk together.
Share reports only after removing variants that should not become default behavior.
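As promised above, here is a minimal sketch of quality-plus-cost scoring with failure criteria declared before the run; all weights, labels, and numbers are illustrative assumptions.

    # Failure criteria written down BEFORE the comparison runs.
    FAILURES = {"malformed_json", "hallucinated_fact"}

    def decision_score(quality: float, cost_usd: float, observed: set[str]) -> float:
        # Illustrative weights: quality dominates, but spend still counts.
        if observed & FAILURES:
            return 0.0  # a pre-declared failure disqualifies the variant
        return 0.8 * quality - 0.2 * cost_usd

    # A cheaper variant can beat a slightly "nicer" but pricier one:
    print(decision_score(0.90, 0.40, set()))  # ~0.64
    print(decision_score(0.93, 1.20, set()))  # ~0.50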
Common mistakes
Do not crown a winner from a single impressive response.
Do not compare prompts if each variant secretly uses different input data.
Do not confuse a nicer writing style with a better production result.
Do not delete the fallback version after a release decision.