Skip to content

Optimize SqlTableWriter #269

Open
kaapstorm wants to merge 22 commits intomasterfrom
nh/write_table
Open

Optimize SqlTableWriter #269
kaapstorm wants to merge 22 commits intomasterfrom
nh/write_table

Conversation

@kaapstorm
Copy link
Copy Markdown
Contributor

Optimizes SqlTableWriter:

  • Instead of running make_table_compatible() on every row, only runs on the first 10 (SCHEMA_CHECK_ROWS const) rows.
  • Avoids increasing SQLAlchemy MetaData object growth.
  • Upserts in batches of 1000 rows.
  • Uses native bulk upserts when possible (PostgreSQL, MySQL) instead of one row at a time.
  • Commits 1000 rows at a time instead of maintaining an open transaction for an entire export.

Transactions are committed explicitly for now. When we upgrade SQLAlchemy from 1.4 to 2.0 then we will use SQLAlchemy's connection.begin() context managers for managing transactions. That is out of scope for this PR.

@kaapstorm kaapstorm marked this pull request as ready for review April 10, 2026 20:45
@kaapstorm kaapstorm marked this pull request as draft April 10, 2026 21:27
@kaapstorm kaapstorm marked this pull request as ready for review April 11, 2026 12:43
@kaapstorm
Copy link
Copy Markdown
Contributor Author

Note

It seems, based on testing, that this change is still unstable when exporting large amounts of data. I am going to upgrade SQLAlchemy from 1.4 to 2.0 to get the benefit of better transaction management. New commits are expected on this branch.

@millerdev
Copy link
Copy Markdown
Contributor

Suggestion: convert to DRAFT until you're done pushing to the branch

@kaapstorm kaapstorm marked this pull request as draft April 15, 2026 17:30
@kaapstorm
Copy link
Copy Markdown
Contributor Author

Based on testing, upgrading SQLAlchemy to 2.0 has resolved memory issues. 😅

@kaapstorm kaapstorm marked this pull request as ready for review April 15, 2026 23:48
kaapstorm and others added 5 commits April 16, 2026 14:24
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kaapstorm kaapstorm changed the base branch from master to nh/write_table_pref April 16, 2026 18:42
kaapstorm and others added 7 commits April 16, 2026 14:57
Remove MetaData(bind=...), autoload=True, declarative_base() import,
and branched-connection pattern in env.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
IntegrityError handling no longer aborts the outer transaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kaapstorm kaapstorm marked this pull request as draft April 16, 2026 18:57
Raw SQL strings must be wrapped in text() in SQLAlchemy 2.0. Also
convert positional %s parameters to named :name bindings, and add
conn.commit() calls after DDL operations now that autocommit has
been removed.
engine.execute() was removed in SQLAlchemy 2.0. Callers must obtain
a connection via engine.connect() and execute statements on it.
kaapstorm and others added 8 commits April 16, 2026 16:49
In SQLAlchemy 2.0 rows are tuple-like by default. Call .mappings() on
the Result to iterate dict-like RowMapping objects.
Accessing the raw DB-API connection via Connection.connection.connection
is deprecated in SQLAlchemy 2.0. Use Connection.connection.dbapi_connection
instead.
Replace raw SQL COMMIT with SQLAlchemy 2.0 transaction API:
- __enter__/__exit__ manage begin/commit/rollback
- _flush() commits current transaction and starts a new one
- _flush_batch() rolls back failed transaction before retry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SQLAlchemy 2.0's .mappings() returns RowMapping dicts, not tuples.
Update assertions to compare against dicts with column names as keys.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Annotate compatibility dict in writers.py to accept dialect-specific
  types like JSON and BIT
- Use separate variable names for postgres and mysql insert statements
  to avoid type conflicts
- Remove unused type: ignore comment from Checkpoint class
- Cast pagination_mode to str for enum indexing
- Assert dbapi_connection is not None before accessing autocommit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Base automatically changed from nh/write_table_pref to master April 16, 2026 21:43
@kaapstorm kaapstorm marked this pull request as ready for review April 16, 2026 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants