For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
Goal: Build a two-layer eval system that tests the Seam PMS integration skill against synthetic fixture apps, producing numeric scores from both structural analysis and real Seam sandbox validation.
Architecture: Fixture apps are minimal PMS codebases the skill modifies. Layer 1 (rubric_checker.py) statically scores the diff. Layer 2 (sandbox_validator.sh) builds the modified app in Docker, runs it, and validates against the real Seam sandbox API. An orchestrator (run_evals.sh) ties both layers together and produces a summary table.
Tech Stack: Python (rubric checker), Bash (orchestrator + sandbox validator), Docker (app containers), Seam sandbox API, TypeScript/Express + Python/Flask (fixture apps)
Spec: docs/2026-03-23-quantitative-evals-design.md
| File | Responsibility |
|---|---|
| evals/rubric.json | Scoring category definitions and weights |
| evals/rubric_checker.py | Layer 1: grades diff/modified files against answer key |
| evals/sandbox_validator.sh | Layer 2: Docker build/run + Seam sandbox validation |
| evals/run_evals.sh | Orchestrator: invokes skill, runs both layers, prints summary |
| evals/results/.gitignore | Ignore eval results directory |
| File | Responsibility |
|---|---|
| evals/fixtures/express-ts/eval_config.json | Prompt, expected API path, test endpoints/payloads |
| evals/fixtures/express-ts/answer_key.json | Expected files, calls, placements, parameters |
| evals/fixtures/express-ts/Dockerfile | Builds and runs the Express app |
| evals/fixtures/express-ts/app/package.json | Node.js dependencies (no seam) |
| evals/fixtures/express-ts/app/tsconfig.json | TypeScript config |
| evals/fixtures/express-ts/app/src/index.ts | Express server setup, health endpoint |
| evals/fixtures/express-ts/app/src/routes/reservations.ts | CRUD routes for reservations |
| evals/fixtures/express-ts/app/src/routes/webhooks.ts | Existing payment webhook handler |
| evals/fixtures/express-ts/app/src/services/reservationService.ts | Business logic for reservation lifecycle |
| evals/fixtures/express-ts/app/src/models/types.ts | TypeScript interfaces for Reservation, Guest, Property, Unit |
| evals/fixtures/express-ts/app/src/models/store.ts | In-memory data store with seed data |
| File | Responsibility |
|---|---|
| evals/fixtures/flask-py/eval_config.json | Prompt, expected API path, test endpoints/payloads |
| evals/fixtures/flask-py/answer_key.json | Expected files, calls, placements, parameters |
| evals/fixtures/flask-py/Dockerfile | Builds and runs the Flask app |
| evals/fixtures/flask-py/app/requirements.txt | Python dependencies (no seam) |
| evals/fixtures/flask-py/app/app.py | Flask app factory, health endpoint |
| evals/fixtures/flask-py/app/blueprints/reservations.py | CRUD routes for reservations |
| evals/fixtures/flask-py/app/blueprints/webhooks.py | Existing payment webhook handler |
| evals/fixtures/flask-py/app/services/reservation_service.py | Business logic for reservation lifecycle |
| evals/fixtures/flask-py/app/models/reservation.py | Reservation, Guest models |
| evals/fixtures/flask-py/app/models/property.py | Property, Unit models |
| evals/fixtures/flask-py/app/models/store.py | In-memory data store with seed data |
Build the minimal TypeScript Express PMS app. This is the simplest fixture — flat structure, in-memory data, service layer separate from routes.
Files:

- Create: evals/fixtures/express-ts/app/package.json
- Create: evals/fixtures/express-ts/app/tsconfig.json
- Create: evals/fixtures/express-ts/app/src/models/types.ts
- Create: evals/fixtures/express-ts/app/src/models/store.ts
- Create: evals/fixtures/express-ts/app/src/services/reservationService.ts
- Create: evals/fixtures/express-ts/app/src/routes/reservations.ts
- Create: evals/fixtures/express-ts/app/src/routes/webhooks.ts
- Create: evals/fixtures/express-ts/app/src/index.ts
- Create: evals/fixtures/express-ts/Dockerfile
- Create: evals/fixtures/express-ts/eval_config.json
- Create: evals/fixtures/express-ts/answer_key.json

- [ ] Step 1: Create package.json and tsconfig.json
package.json — dependencies: express, typescript, @types/express, ts-node. Scripts: build (tsc), start (node dist/index.js), dev (ts-node src/index.ts).
tsconfig.json — target ES2020, outDir dist, rootDir src, strict mode.
- [ ] Step 2: Create type definitions

src/models/types.ts — interfaces for:
- Guest (id, name, email)
- Property (id, name, address)
- Unit (id, propertyId, name — e.g., "Unit 101")
- Reservation (id, guestId, unitId, checkIn, checkOut, status)
- [ ] Step 3: Create in-memory data store

src/models/store.ts — export a simple object with arrays for guests, properties, units, reservations. Seed with:
- 1 property ("Sunset Rentals", id: "prop-1")
- 2 units ("Unit 101" id: "unit-101", "Unit 202" id: "unit-202", both under prop-1)
- No reservations (created via API)

Expose helper functions: generateId(), findUnit(id), findGuest(id).
- [ ] Step 4: Create reservation service

src/services/reservationService.ts — business logic functions:
- createReservation(data) — validates unit exists, creates guest if needed, creates reservation, returns it
- updateReservation(id, data) — finds reservation, updates fields (checkIn, checkOut), returns updated
- cancelReservation(id) — finds reservation, sets status to "cancelled", returns it
- getReservation(id) — returns reservation by id

These are the functions the skill needs to find and add Seam calls to.
- [ ] Step 5: Create reservation routes

src/routes/reservations.ts — Express router:
- POST /api/reservations — calls createReservation, returns 201 with { reservation: { id, ... } }
- PUT /api/reservations/:id — calls updateReservation, returns 200
- DELETE /api/reservations/:id — calls cancelReservation, returns 200
- GET /api/reservations/:id — calls getReservation, returns 200
- [ ] Step 6: Create existing webhook handler
src/routes/webhooks.ts — Express router with a POST /webhooks/payments handler that logs the event and returns 200. This gives the skill a pattern to follow when adding Seam webhook handling.
- [ ] Step 7: Create server entry point
src/index.ts — creates Express app, mounts reservation routes at /api, webhook routes at /webhooks, adds a GET /health endpoint returning { status: "ok" }, listens on PORT env var (default 3000).
- [ ] Step 8: Create Dockerfile

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["node", "dist/index.js"]
```

- [ ] Step 9: Verify the app builds and runs locally
Run: cd evals/fixtures/express-ts/app && npm install && npm run build && npm start
Expected: Server starts on port 3000, curl localhost:3000/health returns {"status":"ok"}, POST /api/reservations creates a reservation.
Then test Docker (note: -f flag because Dockerfile is outside the app dir):

cd evals/fixtures/express-ts && docker build -t eval-express-ts -f Dockerfile ./app && docker run -p 3000:3000 eval-express-ts

Kill the container after verifying.
- [ ] Step 10: Create eval_config.json
{
"fixture": "express-ts",
"prompt": "I'm building a short-term rental PMS in TypeScript with Express. We have reservations with check-in/check-out times and want to automatically create access codes on smart locks when guests book. Our customers use August and Yale smart locks. We want property managers to connect their own locks without us building UI — we don't want to build device management ourselves. We just want to push reservation data and have Seam handle the rest. We already have a Seam account with sandbox devices. Don't ask me any setup questions — explore the codebase and write the integration.",
"expected_api_path": "reservation_automations",
"seam_env_var": "SEAM_API_KEY",
"app_port": 3000,
"test_endpoints": {
"create": {
"method": "POST",
"path": "/api/reservations",
"payload": {
"guestName": "Test Guest",
"guestEmail": "eval_test_{{RUN_ID}}@example.com",
"propertyId": "prop-1",
"unitId": "unit-101",
"checkIn": "{{STARTS_AT}}",
"checkOut": "{{ENDS_AT}}"
},
"response_id_path": "reservation.id"
},
"update": {
"method": "PUT",
"path": "/api/reservations/{{RESERVATION_ID}}",
"payload": {
"checkOut": "{{NEW_ENDS_AT}}"
}
},
"cancel": {
"method": "DELETE",
"path": "/api/reservations/{{RESERVATION_ID}}"
}
}
}

- [ ] Step 11: Create answer_key.json
{
"expected_files_modified": [
"src/routes/reservations.ts",
"src/services/reservationService.ts"
],
"expected_new_files_allowed": [
"src/services/seamService.ts",
"src/routes/webhooks.ts",
"src/config/seam.ts",
"src/lib/seam.ts"
],
"expected_calls": {
"create": ["customers.push_data"],
"update": ["customers.push_data"],
"cancel": ["customers.delete_data"]
},
"expected_placements": {
"customers.push_data": ["createReservation", "updateReservation"],
"customers.delete_data": ["cancelReservation"]
},
"required_parameters": {
"push_data": ["customer_key", "reservations", "user_identities"],
"delete_data": ["customer_key", "reservation_keys"]
},
"expected_package_additions": {
"package.json": ["seam"]
}
}

- [ ] Step 12: Commit

git add evals/fixtures/express-ts/
git commit -m "feat: add express-ts fixture app for quantitative evals"

Build the Python script that scores a skill-modified fixture against its answer key.
Files:
- Create: evals/rubric.json
- Create: evals/rubric_checker.py

- [ ] Step 1: Create rubric.json
{
"categories": [
{ "name": "api_path_selection", "weight": 15, "check": "api_path_match" },
{ "name": "file_targeting", "weight": 20, "check": "files_modified_match" },
{ "name": "integration_placement", "weight": 20, "check": "calls_in_expected_functions" },
{ "name": "api_correctness", "weight": 20, "check": "required_params_present" },
{ "name": "lifecycle_completeness", "weight": 15, "check": "all_handlers_modified" },
{ "name": "webhook_setup", "weight": 10, "check": "webhook_route_added" }
]
}

- [ ] Step 2: Write rubric_checker.py — file loading and diff computation
The script takes 3 args: --pristine <dir> --modified <dir> --fixture-dir <dir> (for answer_key.json and eval_config.json).
Core structure:
- load_config(fixture_dir) — loads answer_key.json, eval_config.json, rubric.json
- compute_diff(pristine_dir, modified_dir) — returns list of changed files and their diffs (use filecmp + file reading, no git dependency)
- read_modified_files(modified_dir, file_list) — reads full content of modified files for content-based checks
- [ ] Step 3: Write check functions — api_path_match

check_api_path_match(modified_files, eval_config):
- Maps API paths to signature calls: reservation_automations → push_data, access_grants → access_grants.create, lower_level → access_codes.create
- Searches all modified file contents for these signatures
- Returns 1.0 if the expected path's signature is found (and others aren't), 0.5 if expected is found alongside wrong ones, 0.0 if wrong path or no Seam calls
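A minimal sketch of this check, assuming modified_files is a path → content dict (as produced by read_modified_files):

```python
# Signature map mirrors the mapping described in the step above.
API_PATH_SIGNATURES = {
    "reservation_automations": "push_data",
    "access_grants": "access_grants.create",
    "lower_level": "access_codes.create",
}

def check_api_path_match(modified_files, eval_config):
    content = "\n".join(modified_files.values())
    expected = eval_config["expected_api_path"]
    # Which API paths have their signature call anywhere in the diffed files?
    found = {path for path, sig in API_PATH_SIGNATURES.items() if sig in content}
    if expected in found:
        return 1.0 if found == {expected} else 0.5  # 0.5: right path plus wrong ones
    return 0.0
```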
- [ ] Step 4: Write check functions — files_modified_match

check_files_modified_match(changed_files, answer_key):
- Compare changed_files against expected_files_modified — proportion of expected files that were actually modified
- Check expected_new_files_allowed — any new files created that aren't in the allow-list deduct proportionally
- Return score 0.0-1.0
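A possible shape for the proportional scoring; the separate new_files argument for the allow-list deduction is an assumption about the signature:

```python
def check_files_modified_match(changed_files, new_files, answer_key):
    expected = answer_key["expected_files_modified"]
    allowed_new = set(answer_key["expected_new_files_allowed"])
    # Proportion of expected files actually touched.
    score = sum(1 for f in expected if f in changed_files) / len(expected)
    # Deduct proportionally for new files outside the allow-list.
    if new_files:
        stray = sum(1 for f in new_files if f not in allowed_new)
        score *= 1 - stray / len(new_files)
    return round(score, 2)
```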
- [ ] Step 5: Write check functions — calls_in_expected_functions

check_calls_in_expected_functions(modified_files, answer_key):
- For each entry in expected_placements, search the full file content for the SDK call
- Use a heuristic: find function definitions using multiple regex patterns to cover common styles: function funcName / async function funcName( (JS/TS), const funcName = / const funcName = async (arrow functions), funcName( at method-definition indent level (class methods), def funcName( (Python)
- Check if the SDK call string appears between that function's opening and the next function definition (or end of file)
- Known limitations: deeply nested functions or unusual patterns may be missed. This is acceptable — the heuristic catches the common cases. If a placement check fails due to a pattern miss, the other rubric categories (API correctness, lifecycle completeness) still catch the core behavior.
- Score proportionally: if 2/3 placements are correct, score is 0.66
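The function-boundary heuristic could look like this sketch (three of the regex styles shown; the class-method indent rule is omitted for brevity):

```python
import re

FUNC_DEF = re.compile(
    r"^\s*(?:export\s+)?(?:async\s+)?function\s+(\w+)"  # JS/TS declarations
    r"|^\s*(?:export\s+)?const\s+(\w+)\s*="             # arrow functions
    r"|^\s*def\s+(\w+)",                                 # Python
    re.MULTILINE,
)

def check_calls_in_expected_functions(modified_files, answer_key):
    content = "\n".join(modified_files.values())
    matches = list(FUNC_DEF.finditer(content))
    # Slice content into name -> body spans, each ending at the next definition.
    spans = {}
    for i, m in enumerate(matches):
        name = next(g for g in m.groups() if g)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        spans[name] = content[m.start():end]
    placements = answer_key["expected_placements"]
    total = sum(len(funcs) for funcs in placements.values())
    hits = sum(
        1
        for call, funcs in placements.items()
        for fn in funcs
        # Match the call's last segment so "seam.customers.push_data" hits too.
        if call.split(".")[-1] in spans.get(fn, "")
    )
    return hits / total if total else 0.0
```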
- [ ] Step 6: Write check functions — required_params_present

check_required_params_present(modified_files, answer_key):
- For each SDK call in required_parameters, search file contents for the call and nearby parameter names
- Use string matching: look for each required parameter name (e.g., customer_key, reservations) within 20 lines of the SDK call
- Score proportionally per call, averaged across all calls
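The 20-line proximity rule, sketched over the concatenated modified-file contents (the exact windowing is an assumption):

```python
def check_required_params_present(modified_files, answer_key):
    lines = "\n".join(modified_files.values()).splitlines()
    scores = []
    for call, params in answer_key["required_parameters"].items():
        # First line mentioning the SDK call, if any.
        hit = next((i for i, l in enumerate(lines) if call in l), None)
        if hit is None:
            scores.append(0.0)
            continue
        window = "\n".join(lines[hit : hit + 20])  # 20 lines from the call site
        scores.append(sum(p in window for p in params) / len(params))
    return sum(scores) / len(scores) if scores else 0.0
```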
- [ ] Step 7: Write check functions — all_handlers_modified

check_all_handlers_modified(changed_files, modified_files, answer_key):
- For each lifecycle phase (create, update, cancel) in expected_calls, check that at least one of the expected SDK calls appears in the modified files
- Score: proportion of lifecycle phases covered (0/3, 1/3, 2/3, 3/3)
- [ ] Step 8: Write check functions — webhook_route_added

check_webhook_route_added(changed_files, modified_files):
- Search for new route definitions containing "seam" or "webhook" (case-insensitive) in modified/new files
- Look for Seam event type strings: access_code.set_on_device, access_method.issued, device.disconnected
- Score: 1.0 if webhook route with Seam events found, 0.5 if webhook route but no Seam events, 0.0 if no webhook route
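The three-tier webhook score as a sketch; the route-detection regex is an assumption about what counts as a route definition:

```python
import re

SEAM_EVENT_TYPES = (
    "access_code.set_on_device",
    "access_method.issued",
    "device.disconnected",
)

def check_webhook_route_added(modified_files):
    content = "\n".join(modified_files.values())
    # A "route line" mentions seam/webhook and looks like a route registration.
    route_lines = [
        l for l in content.splitlines()
        if re.search(r"(seam|webhook)", l, re.IGNORECASE)
        and re.search(r"(router\.|app\.|route\(|\.post)", l)
    ]
    if not route_lines:
        return 0.0
    return 1.0 if any(e in content for e in SEAM_EVENT_TYPES) else 0.5
```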
- [ ] Step 9: Write main scoring function

score(pristine_dir, modified_dir, fixture_dir):
- Load configs
- Compute diff + read modified files
- Run each check function
- Multiply each score by its weight from rubric.json
- Return JSON: { "total": 85, "categories": { "api_path_selection": { "score": 1.0, "weighted": 15 }, ... } }
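The weighting step in isolation (the wiring that produces raw_scores from the check functions is assumed):

```python
def apply_weights(raw_scores, rubric):
    """raw_scores: {category_name: 0.0-1.0}; rubric: parsed rubric.json."""
    categories = {}
    total = 0.0
    for cat in rubric["categories"]:
        s = raw_scores.get(cat["name"], 0.0)   # missing check scores 0
        weighted = s * cat["weight"]
        categories[cat["name"]] = {"score": s, "weighted": weighted}
        total += weighted
    return {"total": round(total), "categories": categories}
```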
- [ ] Step 10: Add CLI entry point
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--pristine", required=True)
parser.add_argument("--modified", required=True)
parser.add_argument("--fixture-dir", required=True)
args = parser.parse_args()
result = score(args.pristine, args.modified, args.fixture_dir)
print(json.dumps(result, indent=2))

- [ ] Step 11: Create a "golden" test modification for express-ts
Create evals/fixtures/express-ts/test_golden/ — a pre-modified version of the app that represents a perfect skill output. This is used to test both the rubric checker and sandbox validator.
Changes to make (relative to app/):
- package.json — add "seam": "latest" to dependencies
- src/services/reservationService.ts — import Seam, initialize with SEAM_API_KEY env var. Add seam.customers.push_data(...) call inside createReservation() and updateReservation() with customer_key, user_identities, reservations (with reservation_key, user_identity_key, starts_at, ends_at, space_keys). Add seam.customers.delete_data(...) call inside cancelReservation() with customer_key, reservation_keys, user_identity_keys.
- src/routes/webhooks.ts — add a POST /webhooks/seam handler that switches on event_type for access_code.set_on_device, access_code.failed_to_set_on_device, device.disconnected.
Copy the app to test_golden/, apply these changes.
- [ ] Step 12: Test rubric_checker against the golden modification
Run: python evals/rubric_checker.py --pristine evals/fixtures/express-ts/app --modified evals/fixtures/express-ts/test_golden --fixture-dir evals/fixtures/express-ts
Expected: Score near 100 with all categories passing.
Then test with a deliberately wrong modification (copy golden, change push_data to access_codes.create) and verify the API path selection score drops to 0.
- [ ] Step 13: Commit
git add evals/rubric.json evals/rubric_checker.py
git commit -m "feat: add rubric checker (Layer 1) for quantitative evals"

Build the bash script that Dockerizes the modified app, runs it, and validates against the Seam sandbox.
Files:
- Create: evals/sandbox_validator.sh

- [ ] Step 1: Write sandbox bootstrapping functions
Functions at the top of sandbox_validator.sh:
- bootstrap_sandbox() — lists devices, finds an access-code-capable device, creates a space with eval_unit_${RUN_ID} space_key and assigns the device. Exports DEVICE_ID, SPACE_KEY, CUSTOMER_KEY. Note: the customer (eval_pm_${RUN_ID}) is created implicitly by the app's first push_data call — there is no separate /customers/create endpoint. The CUSTOMER_KEY is just a known string the validator uses for cleanup via delete_data.
- cleanup_sandbox() — deletes space, deletes customer data. Runs via trap on EXIT so it always fires.
- api() — same curl wrapper as existing tests: POST to Seam API with bearer token.
Pattern from: tests/test_reservation_automations.sh lines 23-30 (api function), lines 32-67 (device + space setup).
- [ ] Step 2: Write Docker build and run functions

- build_app(modified_dir, fixture_dir) — copies modified app to temp dir, runs docker build -t eval-${FIXTURE}-${RUN_ID} -f "${fixture_dir}/Dockerfile" "${temp_dir}". The -f flag is required because the Dockerfile lives at the fixture level, not inside the app directory. Returns image tag.
- start_app(image_tag, eval_config) — docker run -d --name eval-${FIXTURE}-${RUN_ID} -p ${HOST_PORT}:${APP_PORT} -e ${SEAM_ENV_VAR}=${SEAM_API_KEY} ${image_tag}. Returns container ID.
- wait_for_health(port, timeout) — polls GET localhost:${port}/health every 2s until 200 or timeout.
- stop_app(container_id) — docker stop and docker rm.
- [ ] Step 3: Write template variable resolution

resolve_template(template_str) — replaces {{RUN_ID}}, {{STARTS_AT}}, {{ENDS_AT}}, {{NEW_ENDS_AT}} with computed values:
- RUN_ID = timestamp
- STARTS_AT = now + 1 hour (ISO8601)
- ENDS_AT = now + 25 hours
- NEW_ENDS_AT = now + 49 hours

resolve_payload(payload_json) — resolves all template vars in a JSON payload string.

extract_id(response_json, id_path) — extracts a value from JSON using a dot-path (e.g., reservation.id). Use python3 -c for JSONPath extraction (same pattern as existing tests).
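The dot-path logic that extract_id would hand to python3 -c can be sketched as:

```python
import json

def extract_by_path(response_json, id_path):
    """Walk a parsed JSON object by a dot-separated key path (sketch, minimal error handling)."""
    value = json.loads(response_json)
    for key in id_path.split("."):
        value = value[key]
    return value
```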
- [ ] Step 4: Write validation functions

validate_create(eval_config, app_port):
- Resolve create payload templates
- curl -X POST localhost:${app_port}${create_path} with resolved payload
- Extract reservation ID from response using response_id_path
- Poll Seam sandbox: api /access_codes/list -d '{"device_id":"${DEVICE_ID}"}' up to 60s for an access code to appear
- Return pass/fail + reservation ID

validate_update(eval_config, app_port, reservation_id):
- Resolve update URL (replace {{RESERVATION_ID}}) and payload
- curl -X PUT the update endpoint
- Brief wait (5s) then check Seam sandbox — access code should still exist
- Return pass/fail

validate_cancel(eval_config, app_port, reservation_id):
- Resolve cancel URL
- curl -X DELETE the cancel endpoint
- Poll Seam sandbox up to 30s for access codes to be removed/removing
- Return pass/fail
- [ ] Step 5: Write main validation flow and scoring

validate(modified_dir, fixture_dir):
- Load eval_config.json
- bootstrap_sandbox
- build_app → start_app → wait_for_health
- Score tracking: BUILD_SCORE=0, CREATE_SCORE=0, UPDATE_SCORE=0, CANCEL_SCORE=0
- If app starts: BUILD_SCORE=10
- validate_create → if pass: CREATE_SCORE=30
- validate_update → if pass: UPDATE_SCORE=30
- validate_cancel → if pass: CANCEL_SCORE=30
- Total = sum of scores
- Output JSON: { "total": N, "checks": { "build": 10, "create": 30, ... } }
- stop_app + cleanup_sandbox (via trap)
- [ ] Step 6: Add CLI interface
Script takes args: --modified-dir <dir> --fixture-dir <dir> --run-id <id>
Requires SEAM_API_KEY env var.
#!/bin/bash
set -euo pipefail
# ... parse args, run validate(), output JSON result

- [ ] Step 7: Test sandbox_validator with the golden modification
Use the golden test modification from Task 2, Step 11 (evals/fixtures/express-ts/test_golden/). Run:
SEAM_API_KEY=<sandbox_key> bash evals/sandbox_validator.sh \
--modified-dir evals/fixtures/express-ts/test_golden \
--fixture-dir evals/fixtures/express-ts \
--run-id test_$(date +%s)

Expected: App builds, starts, create/update/cancel all pass. Score: 100.
Note: This requires a valid Seam sandbox API key with devices connected.
- [ ] Step 8: Commit
git add evals/sandbox_validator.sh
git commit -m "feat: add sandbox validator (Layer 2) for quantitative evals"

Build the top-level script that invokes the skill, runs both layers, and produces the summary.
Files:
- Create: evals/run_evals.sh
- Create: evals/results/.gitignore

- [ ] Step 1: Write argument parsing and fixture discovery
run_evals.sh parses flags:
- --fixtures (comma-separated, default: all dirs under evals/fixtures/ that contain eval_config.json)
- --layers (rubric, sandbox, both; default: both)
- --api-path (filter fixtures by expected_api_path in their eval_config; default: no filter)
- --runs N (default: 1)
Validate: SEAM_API_KEY must be set if --layers includes sandbox. Docker must be available if sandbox layer is requested.
- [ ] Step 2: Write skill invocation function
invoke_skill(fixture_dir, working_dir):
- Copy fixture_dir/app/ to a temp working directory
- Initialize a git repo in the working dir (git init && git add -A && git commit -m "pristine") — needed for diff computation later
- Read prompt from eval_config.json
- Run: claude -p "$PROMPT" --allowedTools Read,Write,Edit,Glob,Grep,Bash --cwd "$working_dir"
- Return the working dir path (now contains skill-modified code)
The skill must be loaded into the Claude session. Use the --systemPrompt flag to inject SKILL.md content:
SKILL_CONTENT=$(cat SKILL.md)
claude -p "$PROMPT" \
--systemPrompt "You are using the following skill to guide your work:\n\n$SKILL_CONTENT" \
--allowedTools Read,Write,Edit,Glob,Grep,Bash \
--cwd "$working_dir"

If --systemPrompt is not available in the CLI version, fall back to prepending the skill content to the prompt itself: "[SKILL INSTRUCTIONS]\n$SKILL_CONTENT\n[/SKILL INSTRUCTIONS]\n\n$PROMPT".
Add a timeout of 10 minutes per skill invocation (timeout 600 claude -p ...) to prevent hung invocations from blocking the pipeline.
- [ ] Step 3: Write layer execution functions
run_rubric(pristine_dir, modified_dir, fixture_dir):
- Calls python3 evals/rubric_checker.py --pristine "$pristine_dir" --modified "$modified_dir" --fixture-dir "$fixture_dir"
- Captures JSON output
- Returns rubric score

run_sandbox(modified_dir, fixture_dir, run_id):
- Calls bash evals/sandbox_validator.sh --modified-dir "$modified_dir" --fixture-dir "$fixture_dir" --run-id "$run_id"
- Captures JSON output
- Returns sandbox score
- [ ] Step 4: Write result aggregation and summary table

aggregate_results(fixture, rubric_scores[], sandbox_scores[]):
- For single run: combined = 0.4 * rubric + 0.6 * sandbox
- For N runs: compute mean/min/max for rubric, sandbox, and combined
- Check API path: did rubric detect the correct path?

print_summary(all_results):
- Prints the table format from the spec
- For --runs N > 1, show mean (min-max) in each column
- [ ] Step 5: Write main loop
for fixture in $FIXTURES; do
for run in $(seq 1 $RUNS); do
RUN_ID="${fixture}_$(date +%s)_${run}"
RESULTS_DIR="evals/results/$(date +%Y%m%d_%H%M%S)/${fixture}/run_${run}"
mkdir -p "$RESULTS_DIR"
# Invoke skill (runs are sequential per fixture to avoid port conflicts)
WORKING_DIR=$(invoke_skill "$FIXTURE_DIR" ...)
# Save diff (stage new files so they are included; HEAD is the pristine commit)
(cd "$WORKING_DIR" && git add -A && git diff HEAD) > "$RESULTS_DIR/diff.patch"
# Run layers
if [[ "$LAYERS" == *rubric* ]]; then
run_rubric ... > "$RESULTS_DIR/rubric.json"
fi
if [[ "$LAYERS" == *sandbox* ]]; then
run_sandbox ... > "$RESULTS_DIR/sandbox.json"
fi
done
done
print_summary ...

- [ ] Step 6: Create results .gitignore
evals/results/.gitignore:
*
!.gitignore
- [ ] Step 7: Test orchestrator end-to-end with express-ts (rubric only)
Run without sandbox to verify the skill invocation + rubric pipeline:
bash evals/run_evals.sh --fixtures express-ts --layers rubric --runs 1

Expected: Skill runs, modifies the fixture, rubric scores the output. Summary table prints.
- [ ] Step 8: Test orchestrator end-to-end with express-ts (both layers)

SEAM_API_KEY=<key> bash evals/run_evals.sh --fixtures express-ts --layers both --runs 1

Expected: Full pipeline — skill runs, rubric scores, Docker builds and runs, sandbox validates. Summary table with both scores.
- [ ] Step 9: Commit
git add evals/run_evals.sh evals/results/.gitignore
git commit -m "feat: add eval orchestrator tying rubric and sandbox layers together"

Build the second fixture to validate cross-language support. Flask with blueprints, separate service module, in-memory store.
Files:
- Create: evals/fixtures/flask-py/app/requirements.txt
- Create: evals/fixtures/flask-py/app/app.py
- Create: evals/fixtures/flask-py/app/models/reservation.py
- Create: evals/fixtures/flask-py/app/models/property.py
- Create: evals/fixtures/flask-py/app/models/store.py
- Create: evals/fixtures/flask-py/app/services/reservation_service.py
- Create: evals/fixtures/flask-py/app/blueprints/reservations.py
- Create: evals/fixtures/flask-py/app/blueprints/webhooks.py
- Create: evals/fixtures/flask-py/Dockerfile
- Create: evals/fixtures/flask-py/eval_config.json
- Create: evals/fixtures/flask-py/answer_key.json

- [ ] Step 1: Create requirements.txt
flask==3.1.0
No seam package — the skill adds it.
- [ ] Step 2: Create models
models/reservation.py — dataclasses for Guest (id, name, email) and Reservation (id, guest_id, unit_id, check_in, check_out, status).
models/property.py — dataclasses for Property (id, name, address) and Unit (id, property_id, name).
models/store.py — in-memory store dict with seed data: 1 property, 2 units. Helper functions: generate_id(), find_unit(id), find_guest(id).
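A possible models/store.py matching the seed data above (plain dicts instead of the dataclasses for brevity; the address value is invented):

```python
import uuid

# Seed data: 1 property, 2 units, no reservations (created via API).
# The address string is a placeholder, not from the spec.
store = {
    "properties": [{"id": "prop-1", "name": "Sunset Rentals", "address": "1 Example St"}],
    "units": [
        {"id": "unit-101", "property_id": "prop-1", "name": "Unit 101"},
        {"id": "unit-202", "property_id": "prop-1", "name": "Unit 202"},
    ],
    "guests": [],
    "reservations": [],
}

def generate_id():
    return uuid.uuid4().hex[:8]

def find_unit(unit_id):
    return next((u for u in store["units"] if u["id"] == unit_id), None)

def find_guest(guest_id):
    return next((g for g in store["guests"] if g["id"] == guest_id), None)
```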
- [ ] Step 3: Create reservation service

services/reservation_service.py — functions:
- create_reservation(data) — validates unit, creates guest if needed, creates reservation, returns it
- update_reservation(reservation_id, data) — finds reservation, updates check_in/check_out, returns it
- cancel_reservation(reservation_id) — finds reservation, sets status to "cancelled", returns it
- get_reservation(reservation_id) — returns reservation by id
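A minimal sketch of the service module; the store is inlined here, and the unit/guest validation described above is elided (the real module would import from models/store.py):

```python
import itertools

_reservations = {}
_ids = itertools.count(1)

def create_reservation(data):
    res = {
        "id": f"res-{next(_ids)}",
        "guest_id": data["guest_id"],
        "unit_id": data["unit_id"],
        "check_in": data["check_in"],
        "check_out": data["check_out"],
        "status": "confirmed",
    }
    _reservations[res["id"]] = res
    return res

def update_reservation(reservation_id, data):
    res = _reservations[reservation_id]
    # Only check_in/check_out are updatable, per the step description.
    res.update({k: v for k, v in data.items() if k in ("check_in", "check_out")})
    return res

def cancel_reservation(reservation_id):
    res = _reservations[reservation_id]
    res["status"] = "cancelled"
    return res

def get_reservation(reservation_id):
    return _reservations.get(reservation_id)
```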
- [ ] Step 4: Create blueprints

blueprints/reservations.py — Flask blueprint reservations_bp:
- POST /api/reservations → calls create_reservation, returns JSON with 201
- PUT /api/reservations/<id> → calls update_reservation, returns JSON with 200
- DELETE /api/reservations/<id> → calls cancel_reservation, returns JSON with 200
- GET /api/reservations/<id> → calls get_reservation, returns JSON with 200

blueprints/webhooks.py — Flask blueprint webhooks_bp:
- POST /webhooks/payments — logs event, returns 200. Pattern for skill to follow.
- [ ] Step 5: Create app.py

Flask app factory pattern:
- Creates app
- Registers reservations_bp and webhooks_bp
- GET /health returns {"status": "ok"}
- Runs on host 0.0.0.0, port from PORT env var (default 5000)
- [ ] Step 6: Create Dockerfile

```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
```

- [ ] Step 7: Verify app builds and runs
Run: cd evals/fixtures/flask-py/app && pip install -r requirements.txt && python app.py
Expected: Server on port 5000, health check works, CRUD endpoints work.
Docker: cd evals/fixtures/flask-py && docker build -t eval-flask-py -f Dockerfile ./app && docker run -p 5000:5000 eval-flask-py
- [ ] Step 8: Create eval_config.json
Same structure as express-ts but with Python-appropriate prompt:
{
"fixture": "flask-py",
"prompt": "I'm building a short-term rental property management system in Python with Flask. We have reservations with check-in/check-out times and want to automatically create access codes on smart locks when guests book. Our customers use August and Yale smart locks. We want property managers to connect their own locks without us building UI. We just want to push reservation data and have Seam handle the rest. We already have a Seam account with sandbox devices. Don't ask me any setup questions — explore the codebase and write the integration.",
"expected_api_path": "reservation_automations",
"seam_env_var": "SEAM_API_KEY",
"app_port": 5000,
"test_endpoints": {
"create": {
"method": "POST",
"path": "/api/reservations",
"payload": {
"guest_name": "Test Guest",
"guest_email": "eval_test_{{RUN_ID}}@example.com",
"property_id": "prop-1",
"unit_id": "unit-101",
"check_in": "{{STARTS_AT}}",
"check_out": "{{ENDS_AT}}"
},
"response_id_path": "reservation.id"
},
"update": {
"method": "PUT",
"path": "/api/reservations/{{RESERVATION_ID}}",
"payload": {
"check_out": "{{NEW_ENDS_AT}}"
}
},
"cancel": {
"method": "DELETE",
"path": "/api/reservations/{{RESERVATION_ID}}"
}
}
}

- [ ] Step 9: Create answer_key.json
{
"expected_files_modified": [
"blueprints/reservations.py",
"services/reservation_service.py"
],
"expected_new_files_allowed": [
"services/seam_service.py",
"blueprints/seam_webhooks.py",
"config/seam.py",
"lib/seam.py"
],
"expected_calls": {
"create": ["customers.push_data"],
"update": ["customers.push_data"],
"cancel": ["customers.delete_data"]
},
"expected_placements": {
"customers.push_data": ["create_reservation", "update_reservation"],
"customers.delete_data": ["cancel_reservation"]
},
"required_parameters": {
"push_data": ["customer_key", "reservations", "user_identities"],
"delete_data": ["customer_key", "reservation_keys"]
},
"expected_package_additions": {
"requirements.txt": ["seam"]
}
}

- [ ] Step 10: Run evals against flask-py

bash evals/run_evals.sh --fixtures flask-py --layers rubric --runs 1

Then with sandbox:

SEAM_API_KEY=<key> bash evals/run_evals.sh --fixtures flask-py --layers both --runs 1

- [ ] Step 11: Commit
git add evals/fixtures/flask-py/
git commit -m "feat: add flask-py fixture app for quantitative evals"

Run both fixtures together, fix any issues, polish the output.
Files:
- Modify: evals/run_evals.sh (if needed)
- Modify: evals/sandbox_validator.sh (if needed)

- [ ] Step 1: Run full eval suite
SEAM_API_KEY=<key> bash evals/run_evals.sh --fixtures express-ts,flask-py --layers both --runs 1

Expected: Both fixtures run, summary table prints with scores for both.
- [ ] Step 2: Run with --runs 3 for consistency data

SEAM_API_KEY=<key> bash evals/run_evals.sh --fixtures express-ts,flask-py --layers both --runs 3

Expected: Each fixture runs 3 times, summary shows mean/min/max.
- [ ] Step 3: Fix any issues discovered in multi-run testing
Address flaky tests, timeout issues, cleanup failures, port conflicts between concurrent runs, etc.
- [ ] Step 4: Final commit
git add -A
git commit -m "fix: polish eval pipeline after multi-fixture testing"