Fix use-after-free crash when close races with ping or query#1433
Fix use-after-free crash when close races with ping or query#1433bquorning wants to merge 4 commits intobrianmario:masterfrom
Conversation
Calling `client.close` from one thread while another thread is
executing `client.ping` (or any other MySQL operation) can crash
the process with a segfault. Both operations release Ruby's GVL
via `rb_thread_call_without_gvl` and then call libmysql functions
(`mysql_close` and `mysql_ping`) concurrently on the same `MYSQL *`
connection without synchronization. `mysql_close` frees internal
state while the other operation is still accessing it.
The existing `active_fiber` mechanism does not prevent this because:
- `ping` never sets `active_fiber`
- `close` never checks `active_fiber`
- `active_fiber` is a Ruby-level (GVL-held) check, not a native
thread synchronization primitive
Add an `rb_nativethread_lock_t` (Ruby's cross-platform mutex) to
`mysql_client_wrapper` that serializes all `nogvl_*` functions which
access the underlying MYSQL connection: `nogvl_close`, `nogvl_ping`,
`nogvl_send_query`, `nogvl_read_query_result`, `nogvl_do_result`,
and `nogvl_select_db`.
Each protected function acquires the mutex, checks `wrapper->closed`
to bail out early if `close` already ran, performs the MySQL
operation, and releases the mutex. This ensures `mysql_close` never
runs concurrently with any other MySQL operation on the same
connection.
Behavioral change: `close` now blocks until any in-flight MySQL
operation completes, rather than calling `mysql_close` immediately.
This is necessary to prevent the use-after-free, and only affects
the rare case where `close` races with another operation.
Performance impact is negligible. An uncontended mutex lock+unlock
pair costs ~11 ns, while the MySQL operations it wraps cost
~224 µs (ping) to ~249 µs (SELECT 1) — an overhead of ~0.005%.
Reproducer:
client = Mysql2::Client.new(host: "127.0.0.1", port: 33061, username: "root")
Thread.new { client.ping }
client.close
Co-authored-by: Nony Dutton <nony.dutton@zendesk.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a comment at the top of client.c explaining the three concurrency mechanisms (active_fiber, mutex, refcount), what each one guards against, and why they are not interchangeable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| it "should not crash when ping and close are called concurrently" do | ||
| 100.times do | ||
| client = new_client | ||
| thread = Thread.new { client.ping } |
There was a problem hiding this comment.
I think if we do client.query("SELECT SLEEP(1)") then we can get it to consistently fail without worrying about the flakiness of ping executing before client.close does. That way we can probably remove the 100.times do.
There was a problem hiding this comment.
That said, the block runs 100 times slightly faster than 1 second so SELECT SLEEP(1) would be moderately slower.
|
This is great, thank you! Kicking off the CI run |
|
Currently at a conference so can't properly review the PR before a few days, but just based on the description I'd advise against using a native mutex, as it if the thread owning it dies (typically on fork) it could result in a deadlock of the VM. I suspect there is a simpler fix for this race. |
|
So I got a few minutes to look, and yes, I believe the proper fix is to extend the use of |
|
I’ll try that @byroot. Thanks for the review. |
|
FYI @sodabrew CI seems to be failing because of expired signatures. I am not sure how to update them, but I suspect CI will continue to fail until that has been fixed. |
When one thread calls close while another is blocked in a query or ping (without the GVL), mysql_close frees the MYSQL struct that the in-flight operation is still reading, causing a use-after-free crash. Fix this by splitting close into two paths based on active_fiber: - Idle (no in-flight operation): call mysql_close directly while holding the GVL for a clean COM_QUIT disconnect. - Busy (operation in-flight): set a closed flag and call shutdown(SHUT_RDWR) on the socket to interrupt the blocked I/O, then defer mysql_close to decr_mysql2_client (refcount == 0) where no concurrent access is possible. A native mutex is intentionally avoided: if a thread holds such a mutex at fork() time, the child inherits a permanently-locked mutex and any subsequent close or GC deadlocks. Also: - Set active_fiber around ping's nogvl call so close can detect in-flight pings - Check wrapper->closed in REQUIRE_CONNECTED, closed?, and ping since mysql_close is now deferred and CONNECTED() remains true after close - Add a doc comment to client.c explaining the three thread safety mechanisms (active_fiber, closed flag + shutdown, refcount) - Add tests for concurrent close+ping, close+query, and fork safety Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| end | ||
| end | ||
|
|
||
| it "should not deadlock when forking while a query is in-flight" do |
There was a problem hiding this comment.
This spec may be too heavy-handed to keep around. Let me know if I should delete it before merge.
I am not very comfortable in C and relying a lot on Claude Code (and you) to make the right decisions here 😅 What I’m getting back is that
Instead we are adding a |
I believe it's fine, it's already the design of
Yes, but we also don't want anyone starting using it while we're closing it, which is achieved by setting the active fiber. The After glancing at your PR, one problem remaining is that you must either ensure that the APIs you are calling between setting and unsetting the current fiber will never raise, or use |
| * raising. close intentionally skips this check — it must work | ||
| * regardless of connection state. | ||
| * | ||
| * 2. closed flag + socket shutdown (C level) |
There was a problem hiding this comment.
I don't think this is necessary. We can rely on initialized = 0, Mysql2::Client can reconnect.
|
Simpler fix in #1436 |
That looks so much better than my approach. I’ll close this PR. |
Summary
Fix a use-after-free crash when
closeis called from one thread while another thread is executingping,query, or another operation without the GVL. Instead of callingmysql_closeimmediately (which frees theMYSQLstruct while another thread is still using it),closenow detects whether an operation is in-flight and takes one of two paths:mysql_closedirectly while holding the GVL for a cleanCOM_QUITdisconnect.closedflag and callshutdown(SHUT_RDWR)on the socket to interrupt the blocked I/O, then defermysql_closetodecr_mysql2_client(refcount == 0) where no concurrent access is possible.Reproducer:
This crashes with a segfault because both
nogvl_closeandnogvl_pingrelease Ruby's GVL viarb_thread_call_without_gvland then call libmysql functions concurrently on the sameMYSQL *connection.mysql_closefrees internal state whilemysql_pingis still accessing it.Why not a native mutex
A native mutex was considered but intentionally avoided: if a thread holds such a mutex at
fork()time, the child inherits a permanently-locked mutex and any subsequentcloseor GC deadlocks. The flag +shutdown()approach avoids this problem entirely.A Ruby-level
Mutexwas also discussed and dismissed in #1284 by @casperisfine, with valid arguments about method dispatch overhead,Thread.handle_interrupthandling, and behavioral changes.Why the connection pool cannot prevent this
The pool serializes starting operations at the Ruby level, but once two threads have each passed the Ruby-level checks (one doing
ping, the other doingclose), they are insiderb_thread_call_without_gvlconcurrently. The pool has no visibility into that window.Changes
closeinto idle/busy paths as described aboveactive_fiberaroundping'snogvlcall soclosecan detect in-flight pingswrapper->closedinREQUIRE_CONNECTED,closed?, andpingsincemysql_closeis now deferred andCONNECTED()remains true aftercloseclient.cexplaining the three thread safety mechanisms (active_fiber, closed flag + shutdown, refcount)Related issues
mysql_pingcrash with thread contention in Passenger.Test plan
ping+close,query+close, and fork safetyrake spec:valgrindpasses🤖 Generated with Claude Code