How to Use PromptForge Lab as a Real Testing Workspace
PromptForge Lab is where you stop asking "did it work once?" and start asking "will it keep working under real conditions?" Use it after Prompt Architect has produced a stable prompt draft and you need comparison, stress testing, reasoning audits, or guided coaching.
Socratic mode
Best when the request is still fuzzy and you need the system to ask better questions before generating a final answer.
Logic Analysis
Use this when the input is a policy, memo, argument, or decision draft that needs assumptions, contradictions, and weak evidence mapped clearly.
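As a concrete illustration, the report below sketches what such a mapping can look like; the field names and the example claim are invented stand-ins, not the Lab's actual schema.

    # Hypothetical report shape; field names are assumptions, not the Lab's schema.
    analysis_report = {
        "claim": "Ship feature X this quarter",
        "assumptions": ["users want X", "engineering capacity is free in Q3"],
        "contradictions": ["the same memo freezes Q3 for infrastructure work"],
        "weak_evidence": ["'users want X' rests on one survey of 12 people"],
    }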
Benchmark Studio
Use this when you already have several candidate prompts or models and need a disciplined way to rank them.
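Discipline here means every candidate sees the same inputs and the same rubric. The sketch below assumes exactly that; score and test_inputs are placeholder stand-ins you would replace with your own rubric and data.

    from statistics import mean

    def score(candidate: str, case: dict) -> float:
        # Placeholder rubric; replace with real checks (format, facts, latency).
        return 0.0 if case["kind"] == "failure" else 1.0

    test_inputs = [{"kind": "normal"}, {"kind": "edge"}, {"kind": "failure"}]
    candidates = ["prompt-v1", "prompt-v2", "prompt-v3"]

    # Same inputs, same rubric, for every candidate; only then is the ranking fair.
    ranked = sorted(
        candidates,
        key=lambda c: mean(score(c, case) for case in test_inputs),
        reverse=True,
    )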
Workspace trace
Keep results linked to a workspace so variants, decisions, and reports do not get lost in a one-off experiment.
Step 1: Pick the correct mode
If the problem is unclear, start in Socratic mode. If the reasoning needs inspection, use Logic Analysis. If you are choosing between options, use Benchmark Studio.
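The rule is mechanical enough to write down. This triage function is only a sketch of the decision rule, not a Lab API; the boolean inputs are your own judgment calls.

    def pick_mode(problem_is_clear: bool, auditing_reasoning: bool) -> str:
        # Sketch of the Step 1 rule; the booleans are your own judgment calls.
        if not problem_is_clear:
            return "Socratic mode"   # sharpen the question first
        if auditing_reasoning:
            return "Logic Analysis"  # inspect assumptions and contradictions
        return "Benchmark Studio"    # otherwise you are choosing between options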
Step 2: Prepare one honest input set
Use a real prompt, a realistic dataset, or a real source document. The Lab only produces trustworthy conclusions when the inputs look like production.
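Here is a minimal sketch of an honest input set, assuming a support-ticket use case; the sample texts are invented stand-ins that, in practice, should come from real traffic.

    # Invented stand-ins; in practice, pull these from real production traffic.
    test_inputs = [
        {"id": "normal-01", "kind": "normal",
         "text": "Customer asks where order 4812 is; it shipped yesterday."},
        {"id": "edge-01", "kind": "edge",
         "text": "REFUND?? ordered twice, charged 3x, no order number given"},
        {"id": "failure-01", "kind": "failure",
         "text": ""},  # empty input: a known failure-prone case
    ]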
Step 3: Lock the variable you are testing
Change one thing at a time. Compare models with the same prompt, or compare prompts on the same model, but do not mix both if you want a clean result.
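For illustration, the sketch below assumes a hypothetical run(model, prompt, text) helper, stubbed here so the example is self-contained, plus invented prompts and one input from Step 2. Designs A and B are clean; the anti-pattern at the end is not.

    # Stub so the example runs; a real call would hit the Lab or an API.
    def run(model: str, prompt: str, text: str) -> str:
        return f"[{model}] {prompt[:20]}... -> {text[:20]}..."

    PROMPT_V1 = "Summarize the ticket."                       # invented examples
    PROMPT_V2 = "Summarize the ticket and flag refund asks."
    test_inputs = [{"text": "Where is order 4812?"}]          # from Step 2

    # Design A: same prompt, different models -> you are testing the model.
    by_model = {m: [run(m, PROMPT_V2, t["text"]) for t in test_inputs]
                for m in ["model-small", "model-large"]}

    # Design B: same model, different prompts -> you are testing the prompt.
    by_prompt = {name: [run("model-large", p, t["text"]) for t in test_inputs]
                 for name, p in [("v1", PROMPT_V1), ("v2", PROMPT_V2)]}

    # Anti-pattern: changing model AND prompt in the same run. If the outputs
    # differ, you cannot tell which change caused it.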
Step 4: Run and read the outputs side by side
Compare not only style but also instruction compliance, factual completeness, formatting stability, and latency.
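One way to make those axes checkable, assuming the prompt demands JSON with known fields; the function name, fields, and rubric here are illustrative, not a Lab feature.

    import json

    def read_output(output: str, required_fields: list[str], latency_s: float) -> dict:
        # Illustrative rubric covering the Step 4 axes, not just style.
        try:
            parsed = json.loads(output)  # does the output stay valid JSON?
            format_ok = all(f in parsed for f in required_fields)
        except json.JSONDecodeError:
            format_ok = False
        return {
            "format_ok": format_ok,
            "latency_s": latency_s,
            # Instruction compliance and factual completeness usually need a
            # human or eval-model pass; keep explicit slots for them anyway.
            "follows_instructions": None,
            "missing_facts": None,
        }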
Step 5: Read the result tabs like an operator
Use the main output to judge usefulness, the graph or grounding views to inspect logic, and the analytics area to understand why one option performed better.
Step 6: Save the winner and keep a fallback
Document the winning version, but retain the runner-up. Good release practice always preserves a challenger or rollback option.
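One way to document the decision, as a sketch; the file name, scores, and criteria below are invented examples.

    import json

    # Invented example values; the point is that the fallback is recorded,
    # not deleted, and the failure criteria were written down before the run.
    release = {
        "workspace": "support-triage",
        "winner":   {"prompt": "v2", "model": "model-large", "score": 0.87},
        "fallback": {"prompt": "v1", "model": "model-large", "score": 0.81},
        "failure_criteria": ["malformed JSON", "hallucinated order ID", "latency > 4s"],
    }
    with open("release_decision.json", "w") as f:
        json.dump(release, f, indent=2)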
Benchmark discipline
Use at least one normal input, one edge case, and one failure-prone input.
Decide the winner based on quality plus cost, not quality alone; a scoring sketch follows this list.
Write down what counts as failure before you run the comparison.
Re-run once when a result is surprising so you do not promote a lucky sample.
Run experiments against data that resembles real traffic, not only ideal examples.
Keep benchmark notes so someone else can understand why the winning variant won.
Review latency, format stability, and hallucination risk together.
Share reports only after removing variants that should not become default behavior.
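As promised above, here is a minimal sketch of quality-plus-cost scoring with failure criteria declared before the run; all weights, labels, and numbers are illustrative assumptions.

    # Failure criteria written down BEFORE the comparison runs.
    FAILURES = {"malformed_json", "hallucinated_fact"}

    def decision_score(quality: float, cost_usd: float, observed: set[str]) -> float:
        # Illustrative weights: quality dominates, but spend still counts.
        if observed & FAILURES:
            return 0.0  # a pre-declared failure disqualifies the variant
        return 0.8 * quality - 0.2 * cost_usd

    # A cheaper variant can beat a slightly "nicer" but pricier one:
    print(decision_score(0.90, 0.40, set()))  # ~0.64
    print(decision_score(0.93, 1.20, set()))  # ~0.50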
Common mistakes
Do not crown a winner from a single impressive response.
Do not compare prompts if each variant secretly uses different input data.
Do not confuse a nicer writing style with a better production result.
Do not delete the fallback version after a release decision.