
CLAUDE.md — Working in the AMEND Repository

Python environment

All Python scripts run in the amend_python conda environment:

conda activate amend_python

Scripts that are part of the CI pipeline use requirements-ci.txt. The only excluded dep is PySTAN (stan). Everything else needed by dashboard_charts.py and its transitive imports must be listed there. The full conda env is needed to run the analysis/visualization scripts locally.

When adding a new import to any script in the dashboard_charts.py call chain, check whether the package is in requirements-ci.txt and add it with a pinned version if not. The full import chain is: dashboard_charts.py → MADEP_staff, MADEP_enforcements_viz, ECOS_budgets_viz, EPA_303d_viz, EEA_DP_CSO_map → NECIR_CSO_map → cso_maps. To find the pinned version: conda run -n amend_python python -c "import pkg_resources; print(pkg_resources.get_distribution('PKGNAME').version)".
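Note that pkg_resources is deprecated in current setuptools releases; on Python 3.8+ the stdlib importlib.metadata returns the same installed version without the deprecation warning. A drop-in sketch:

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_version(pkg):
    """Return the installed version of pkg, or None if it is not installed."""
    try:
        return version(pkg)
    except PackageNotFoundError:
        return None
```

One-liner equivalent: conda run -n amend_python python -c "from importlib.metadata import version; print(version('PKGNAME'))".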

Repository layout

get_data/        Data-fetch and database-assembly scripts (run these first)
docs/data/       CSV output files and Jekyll data-source pages
docs/_includes/  Generated chart HTML (Chart.js for charts, Plotly/Folium for maps)
docs/assets/     Maps, figures, PDFs
analysis/        Visualization and statistical analysis scripts

Data pipeline (in order)

All fetch scripts are run from get_data/:

  1. get_EPARegion1_NPDES_permits.py — EPA NPDES permit listings + PDF sync to GCS
  2. get_budget_CTHRU.py — MA Comptroller CTHRU Socrata API (FY2005–present) + cached MassBudget (FY2001–2004); no auth required
  3. get_DEP_staff_SODA.py — MA Comptroller payroll via SODA API (requires SECRET_SODA_token)
  4. get_EEA_data_portal.py — EEA portal tables (permit, facility, inspection, enforcement, drinkingWater)
  5. get_eea_dp_cso.py — EEA CSO discharge incidents (separate API endpoint)
  5.5. get_ATTAINS_303d.py — EPA 303(d) impaired waters from MassGIS shapefiles (biennial data; exits early if all cycles already cached)
  6. validate_data.py — schema + row-count checks; writes docs/data/data_stats.yml
  7. assemble_db.py — builds AMEND.db SQLite, uploads to gs://openamend-data/amend.db, and regenerates docs/assets/db_semantic_context.txt

Scripts that are not part of automated runs:

  • get_SSAWages.py — Attempts to fetch SSA Average Wage Index (currently blocked by ssa.gov 403). Cached CSV is used as fallback.
  • get_DEP_staff.py, get_Census_*.py, transform_*.py — run manually as needed

SODA credentials

The get_DEP_staff_SODA.py script reads credentials from get_data/SECRET_SODA_token (two lines: app token, secret token). In CI this file is written from GitHub secrets SODA_APP_TOKEN and SODA_SECRET_TOKEN. Do not commit this file.

GCS infrastructure

  • Project: openamend (GCP)
  • Bucket: gs://openamend-data — public read, stores the SQLite DB and large CSVs
  • Service account key stored in GitHub secret GCP_SA_KEY

To update CORS on the bucket:

bash set_cors_gsutil.sh

CI / GitHub Actions

.github/workflows/update-data.yml — runs every Monday at 06:00 UTC, or on manual dispatch. Steps: fetch → validate → assemble DB → commit CSVs → push. Opens a GitHub issue labeled data-update-failure if any step fails.

.github/workflows/update-charts.yml — triggers after a successful data update run. Runs analysis/dashboard_charts.py, which generates all dashboard charts with dash_ prefix. Uses end_date=date.today() for rolling CSO data window. See Live Dashboard section below for details.

Known issues and workarounds

  • Budget data (CTHRU): MA Comptroller CTHRU Socrata API replaces MassBudget (which returned 403 Cloudflare blocks as of early 2026). CTHRU provides DEP, DCR, EEA, and Fish&Game admin budgets FY2005–present. FY2001–2004 are from cached MassBudget CSVs and do not auto-update.
  • SSAWages (403): The SSA website (ssa.gov) blocks automated access despite User-Agent headers. The get_SSAWages.py script exists but is not yet run in CI; it falls back to the cached 2023-02-03 version, and assemble_db.py auto-extends with zero-growth placeholder rows for any year gap.
  • EPA NPDES page changes: EPA changed JSON format and column names around 2025; both handled with isinstance checks and fallback column detection.
  • EEA CSO API: Requires Referer and Origin headers matching the portal URL; plain requests return HTTP 500. Pagination is 1-indexed.
  • 303(d) data (biennial): get_ATTAINS_303d.py fetches from MassGIS S3-hosted shapefiles (not the ATTAINS REST API, which times out on /assessments). Data updates only biennially (even years); the script exits early if all known cycles are already in the cached CSV. The 2020 cycle was never published by MassGIS. The 2024/2026 cycle is in draft as of April 2026 — the script will auto-detect it when MassGIS publishes the approved shapefile. CSO_303d_Mapping in assemble_db.py is a manually curated dict (35 verified matches of 56 CSO waterbodies); update it when a new cycle is added by reviewing new assessment unit names against CSO waterBody values.
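The CSO API workaround can be sketched as a small helper; only the header requirement and 1-indexed pagination come from the notes above, while the portal URL and the page parameter name are assumptions for illustration:

```python
def cso_request_kwargs(page, portal_url="https://eeaonline.eea.state.ma.us"):
    """Build request kwargs for the EEA CSO API. Plain requests get HTTP 500;
    Referer and Origin must match the portal URL, and pagination is 1-indexed."""
    if page < 1:
        raise ValueError("EEA CSO API pagination is 1-indexed")
    headers = {"Referer": portal_url, "Origin": portal_url}
    params = {"page": page}  # hypothetical parameter name
    return {"headers": headers, "params": params}


# Usage with requests (not executed here):
# resp = requests.get(API_URL, **cso_request_kwargs(1))
```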

Running scripts

Critical: All scripts use relative paths and MUST be run from their parent directory. The session starts in the repo root (/home/nes/Documents/MAenvironmentaldata).

Data fetch and assembly (from repo root, cd into get_data/):

cd get_data
conda run -n amend_python python get_EPARegion1_NPDES_permits.py
conda run -n amend_python python get_budget_CTHRU.py
conda run -n amend_python python get_DEP_staff_SODA.py
conda run -n amend_python python get_EEA_data_portal.py
conda run -n amend_python python get_eea_dp_cso.py
conda run -n amend_python python validate_data.py
conda run -n amend_python python assemble_db.py

Analysis and visualization (from repo root, cd into analysis/):

cd ../analysis
conda run -n amend_python python MADEP_staff.py
conda run -n amend_python python MADEP_enforcements_viz.py
conda run -n amend_python python dashboard_charts.py

Why this matters:

  • get_data/*.py scripts read from ../docs/data/ and write outputs there
  • analysis/*.py scripts read from ../docs/data/ and ../get_data/ and write to ../_includes/charts/
  • Using wrong working directory causes "file not found" or "cannot save to non-existent directory" errors
  • Always use conda run -n amend_python (not conda activate); this ensures clean environment isolation between runs

Local Jekyll preview

Run from the docs/ directory in the amend_jekyll conda env. Use --host localhost (not 0.0.0.0) so that site.url resolves to http://localhost:4000 and sidebar links work correctly in the browser:

cd docs
conda run -n amend_jekyll bundle exec jekyll serve --host localhost --port 4000 --baseurl ""

Analysis scripts

PySTAN models are excluded from CI. Run locally with the full conda env.

Live Dashboard

The live dashboard at /dashboard.html auto-updates weekly via update-charts.yml. All dashboard chart files use a dash_ filename prefix to prevent overwriting historical blog post charts.

How it works

analysis/dashboard_charts.py — Master script that generates all dashboard charts:

  • Wraps calls to MADEP_staff.generate_charts(), MADEP_enforcements_viz.generate_charts(), and ECOS_budgets_viz.generate_charts() with prefix='dash_'
  • Instantiates CSOAnalysisEEADP with end_date=date.today() for rolling CSO data window, make_regression=False (Stan excluded from CI), make_maps=False (too heavy for weekly CI)
  • Calls dashboard-specific plot methods: plot_monthly_volume_and_rainfall(), plot_monthly_modeled_vs_metered_fraction(), plot_monthly_volume_by_watershed(), plot_annual_volume_by_operator()

Four refactored analysis scripts — Each now has a generate_charts(engine, prefix='') function:

  • MADEP_staff.py — Generates 6 staffing charts
  • MADEP_enforcements_viz.py — Generates 6 enforcement charts (overall, vs budget, by type, penalties overall, penalties stacked, topic breakdown); uses MAEEADP_Enforcement for counts/fines 1996–2026, MADEP_enforcement for topic breakdown through 2017; enforces routine notice filtering to restore enforcement-staffing correlation
  • ECOS_budgets_viz.py — Generates 3 budget comparison charts
  • EPA_303d_viz.py — Generates 4 dashboard charts (impaired trend, causes, CSO-to-impaired, TMDL progress) + writes docs/data/facts_EPA303d.yml; generate_post_charts() also generates 2 analysis-post charts (watershed bar chart, folium TMDL map) — requires folium, excluded from CI

docs/dashboard.md — Jekyll post at /dashboard.html that includes the dash_*.html chart files. See file for data sources and methodology notes.

Key implementation details

Numpy serialization: chartjs.py automatically converts numpy scalars and NaN to JSON-safe values via _NumpyEncoder. You do not need to manually convert arrays before passing to add_dataset().
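The mechanism is roughly the following (a self-contained sketch, not chartjs.py's actual code; it duck-types numpy scalars via their .item() method so the example runs without numpy installed):

```python
import json
import math


def _sanitize(obj):
    """Recursively convert numpy scalars (anything exposing .item()) to native
    Python types and replace NaN with None so the result is strict JSON."""
    if hasattr(obj, "item"):  # numpy scalar duck-typing
        obj = obj.item()
    if isinstance(obj, float) and math.isnan(obj):
        return None
    if isinstance(obj, dict):
        return {k: _sanitize(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [_sanitize(v) for v in obj]
    return obj


def dumps_json_safe(data):
    """Serialize with NaN mapped to null; allow_nan=False catches any misses."""
    return json.dumps(_sanitize(data), allow_nan=False)
```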

Chart.js version: The current version is defined in analysis/chartjs.py as JS_URL (currently chart.js@4.4.4). To upgrade:

  1. Update the version string in JS_URL in chartjs.py
  2. Check the Chart.js migration guide for breaking API changes
  3. Update make_chart_canvas() in chartjs.py if the scale/plugin config API changed
  4. Search for any hardcoded CDN URLs in analysis/ scripts (e.g. _CHARTJS_CDN in EPA_303d_viz.py) and update those too
  5. Re-run dashboard_charts.py and visually verify charts render correctly

Date handling: Use pd.to_datetime() to normalize Python date objects before comparing them with pandas datetime64 values; this avoids "TypeError: ufunc 'isnan' not supported" errors.
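In sketch form (illustrative data, assuming pandas is available):

```python
import datetime as dt

import pandas as pd

ts = pd.Series(pd.to_datetime(["2024-01-01", "2024-06-01"]))
cutoff = dt.date(2024, 3, 1)

# Comparing datetime64 values directly against a plain date can fail;
# normalize the date with pd.to_datetime() first:
mask = ts >= pd.to_datetime(cutoff)
```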

File protection mechanism: Blog post charts (e.g., MADEP_staffing_overall.html) are never regenerated by CI — only dashboard charts with dash_ prefix are. The if __name__ == '__main__': block in each refactored script calls generate_charts(engine, prefix='') (empty prefix), preserving original filenames for local testing. CI always uses prefix='dash_'.
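The prefix pattern, in sketch form (the function body and slug are illustrative, not the scripts' actual code):

```python
def generate_charts(engine, prefix=''):
    """Write each chart as <prefix><slug>.html; CI passes prefix='dash_' so
    blog-post charts (written with an empty prefix) are never overwritten."""
    slug = "MADEP_staffing_overall"  # illustrative chart slug
    filename = f"{prefix}{slug}.html"
    return [filename]


if __name__ == '__main__':
    # Local testing keeps original filenames (empty prefix).
    generate_charts(engine=None, prefix='')
```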

Data currency

| Chart | Data source | Auto-updates? | Notes |
| --- | --- | --- | --- |
| Staffing levels | MA Comptroller SODA API | Yes | Updated Monday via data pipeline |
| Staffing vs funding | SODA + CTHRU Socrata | Yes | Both update Monday; FY2001–2004 from cached MassBudget (static) |
| Seniority | SODA payroll | Yes | Data only available through 2016 from VisibleGovernment; Comptroller data doesn't include seniority calculations |
| Enforcement overall | EEA Data Portal | Yes | 1996–2026 (substantive actions only, routine notices filtered) |
| Enforcement vs funding | EEA DP + CTHRU Socrata | Yes | Both update Monday |
| Enforcement by type | EEA Data Portal (MAEEADP_Enforcement) | Yes | Shows substantive enforcement action types through 2026; filtered from 54k+ routine notices |
| ECOS per-capita spending | ECOS budget survey | Static | Fetched manually; data covers FY2009–2023 (last updated from FY2016–2019 and FY2020–2023 reports, April 2026). Next report expected ~2028. Run get_ECOS_data.py when new report is published. |
| CSO annual volume + rainfall | EEA DP CSO + NOAA ACIS | Yes | Uses end_date=date.today() for rolling window |
| CSO monthly counts + rainfall | EEA DP CSO + NOAA ACIS | Yes | Same as above; rainfall overlaid as line |
| CSO by operator (annual trends) | EEA DP CSO | Yes | Top 10 operators shown; updated Monday |
| CSO modeled vs metered | EEA DP CSO | Yes | Monthly CSO-Untreated (detailed), SSO aggregated annually |
| CSO by watershed | EEA DP CSO + geography lookup | Yes | Top 8 waterbodies; uses Waterbody fallback if Watershed not available |
| 303(d) impaired trend | MassGIS shapefiles (biennial) | Auto-detects new cycle | Script exits early if cycles unchanged; new data expected ~2027 for 2024/2026 cycle |
| 303(d) causes breakdown | MassGIS shapefiles (biennial) | Auto-detects new cycle | Top 15 causes in most recent approved cycle |
| 303(d) CSO to impaired | MassGIS + EEA DP CSO + CSO_303d_Mapping | Auto (CSO); biennial (303d) | CSO_303d_Mapping must be manually reviewed when a new 303d cycle is added |
| 303(d) TMDL progress | MassGIS shapefiles (biennial) | Auto-detects new cycle | hasTmdl derived from category column in each cycle's DBF |

Dashboard includes italicized notes where data is static (budget, ECOS, seniority cutoff).

Maintenance tasks

When data sources change:

  • If SSA website unblocks automated access, uncomment the get_SSAWages.py call in CI and it will auto-fetch real AWI data (currently using cached 2023-02-03 version)
  • If ECOS releases new data: download the PDF to get_data/ECOS/, run get_ECOS_data.py (from get_data/), then assemble_db.py, then dashboard_charts.py. ECOS publishes roughly every 3–5 years; current data covers FY2009–2023.
  • If EEA DP schema changes, test locally first, then update CI script
  • CTHRU budget data (DEP, DCR, EEA, Fish&Game) auto-updates Monday; FY2001–2004 from cached MassBudget do not update

When adding new dashboard charts:

  1. Add chart generation method to the appropriate analysis script (or EEA_DP_CSO_map.py)
  2. Update dashboard_charts.py to call the new method with output slug f'{PREFIX}MAEEADP_dashboard_...'
  3. Add chart include line to docs/dashboard.md with methodology notes
  4. Test locally: python dashboard_charts.py && cd docs && bundle exec jekyll serve --host localhost --port 4000 --baseurl ""

AI Analysis

The AI Analysis page (docs/ai_analysis.html) lets users ask natural-language questions. The LLM generates SQL, executes it client-side via the existing sql.js Web Worker, and renders results with Plotly.

Supported providers: Groq (default, free), OpenAI, Google Gemini. Anthropic Claude is excluded — no browser CORS support.

Semantic context

The LLM receives docs/assets/db_semantic_context.txt instead of bare CREATE TABLE statements. This file includes table descriptions, 5 sample rows per table (showing actual value formats like ALL-CAPS town names), per-column notes (typos, date formats, join keys), and cross-table join relationships.

The semantic context must be regenerated whenever data sources change (new tables, renamed columns, new data loaded). It is auto-regenerated by assemble_db.py after each DB build.

To regenerate manually:

cd get_data
conda run -n amend_python python generate_semantic_context.py

When adding or changing a data source:

  1. Update TABLE_DESCRIPTIONS and COLUMN_NOTES in get_data/generate_semantic_context.py
  2. Run generate_semantic_context.py to update docs/assets/db_semantic_context.txt
  3. Commit both files along with any CSV/DB changes
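The two dicts take roughly this shape (illustrative entries only; check get_data/generate_semantic_context.py for the real structure and keys):

```python
# Illustrative sketch of the dicts in get_data/generate_semantic_context.py.
TABLE_DESCRIPTIONS = {
    "MAEEADP_CSO": "CSO discharge incidents from the EEA Data Portal.",
    "CSO_303d_Mapping": (
        "Maps CSO waterbodies to 303(d) assessment units. "
        "IMPORTANT: this table has NO other columns."
    ),
}

COLUMN_NOTES = {
    "MAEEADP_CSO": {
        "volumnOfEvent": "Discharge volume; note the misspelling (not 'volume').",
        "Year": "Stored as FLOAT (e.g. 2023.0); CAST(Year AS INTEGER) if needed.",
    },
}
```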

Key implementation files:

  • get_data/generate_semantic_context.py — generates semantic context from AMEND.db
  • docs/assets/db_semantic_context.txt — the generated context (committed, auto-updated)
  • docs/ai_analysis.html — page layout (disclaimer, settings, two-panel chat + artifact tray); also contains starter question buttons
  • docs/assets/ai_analysis.js — all client-side logic (~950 lines vanilla JS)
  • docs/assets/ai_analysis.css — styles

Starter questions

Starter question buttons live in docs/ai_analysis.html as .ai-starter-btn elements inside #ai-starters. When updating them, follow these principles:

  • Prefer trend/historical questions over point-in-time status questions; current-status questions are only appropriate for datasets with high recency (CSO, enforcement, permits — not 303d which only goes to 2022)
  • Prioritise cross-dataset joins — the best questions exercise joins between 2–3 tables (e.g. CSO + precipitation, enforcement + budget, CSO + 303d mapping + impairments)
  • Cover a range of plot types — aim for scatter, dual-axis line, bar, choropleth, and stacked bar across the full set
  • Be broadly representative — one or two questions per major data asset (CSO, staffing/budget, enforcement, 303d, drinking water, ECOS, NPDES/permits, EJ)

Semantic context pitfalls

Common LLM SQL errors and how the semantic context addresses them:

  • ALL-CAPS fields: waterBody, municipality, Town are stored in ALL CAPS — document this with UPPER() examples
  • Misspelled column: volumnOfEvent (not "volume") — document explicitly in COLUMN_NOTES
  • Year as FLOAT: MAEEADP_CSO.Year is stored as FLOAT (e.g. 2023.0) — note CAST(Year AS INTEGER) if needed
  • Table alias confusion: lookup/mapping tables (e.g. CSO_303d_Mapping, CSO_WatershedMapping) have very few columns — add explicit IMPORTANT: this table has NO [other columns] warnings in TABLE_DESCRIPTIONS and COLUMN_NOTES to prevent the LLM from applying filters (e.g. reportingCycle) to the wrong alias
  • Join pattern examples: add concrete SQL examples to JOIN_RELATIONSHIPS for non-obvious multi-table joins; the LLM reliably follows patterns it has seen
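The pitfalls above can be exercised against a toy SQLite database (table and column names mirror the notes above; the data is fabricated for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MAEEADP_CSO (waterBody TEXT, Year FLOAT, volumnOfEvent REAL);
    INSERT INTO MAEEADP_CSO VALUES ('MERRIMACK RIVER', 2023.0, 1.5),
                                   ('MERRIMACK RIVER', 2023.0, 2.5);
""")
# ALL-CAPS values: apply UPPER() to the user-supplied side; Year is FLOAT,
# so CAST it; the volume column really is spelled volumnOfEvent.
row = conn.execute("""
    SELECT CAST(Year AS INTEGER), SUM(volumnOfEvent)
    FROM MAEEADP_CSO
    WHERE waterBody = UPPER('Merrimack River')
    GROUP BY CAST(Year AS INTEGER)
""").fetchone()
print(row)  # (2023, 4.0)
```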

API key security

Browser storage: The AI Analysis page stores API keys in browser localStorage (plain text, client-side). This is not cryptographically secure, but it keeps keys out of any server-side storage. On shared machines, any JavaScript on the page or any browser extension could read the key.

Recommendations for users:

  • Use restricted/temporary API keys if your provider supports them (e.g., API keys with rate limits, expiration dates, or scope limitations)
  • On shared machines, clear browser storage after use (localStorage.clear() in browser console), or use private/incognito browsing
  • Providers like Groq, OpenAI, and Gemini all support creating limited-scope keys

For the demo recording script: The record_ai_demo.py script accepts API keys via environment variables only (GROQ_API_KEY, OPENAI_API_KEY, GOOGLE_API_KEY). Keys are held in memory during recording and never persisted to disk.