fix: recover latin-1 encoded Location headers on redirects #12325
MAXDVVV wants to merge 4 commits into aio-libs:master
Codecov Report
All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##           master   #12325    +/-  ##
========================================
  Coverage   99.11%   99.11%
========================================
  Files         130      130
  Lines       45558    45658   +100
  Branches     2404     2406     +2
========================================
+ Hits        45156    45256   +100
  Misses        272      272
  Partials      130      130
Merging this PR will not alter performance
aiohttp/client.py
Outdated
except (UnicodeEncodeError, UnicodeDecodeError):
    try:
        raw = r_url.encode("utf-8", "surrogateescape")
        r_url = raw.decode("latin-1")
What if it's not latin-1? This seems unreasonable for us to just start guessing charsets randomly.
If fallback_charset_resolver is set, we could use that instead maybe?
…very
Address reviewer feedback: instead of hardcoding latin-1, consult the session's fallback_charset_resolver to determine the charset for recovering non-ASCII Location headers. Latin-1 remains the ultimate fallback per RFC 7230 (historical HTTP/1.1 header encoding).
Refs: aio-libs#10047
_raw = r_url.encode("utf-8", "surrogateescape")
_charset = self._resolve_charset(resp, _raw)
r_url = _recover_redirect_location(r_url, _charset)
Surely we just decode it with the charset and lose the new function..?
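For illustration, here is a minimal standalone sketch of what "just decode it with the charset" could look like, and of why the charset choice matters. It uses only the standard library; `decode_location` is a hypothetical name used here so the logic can be exercised outside aiohttp, and the `charset` argument stands in for whatever the session's resolver would return. It is not the code in this PR.

```python
def decode_location(value: str, charset: str) -> str:
    """Hypothetical helper: re-decode a surrogate-escaped header value."""
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        pass  # lone surrogates present: the raw bytes were not valid UTF-8
    else:
        return value  # already clean text, leave it untouched

    # Undo the parser's surrogateescape decoding to get the original bytes.
    raw = value.encode("utf-8", "surrogateescape")
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or bytes invalid in that codec: latin-1 can
        # decode any byte sequence, so it serves as the last resort.
        return raw.decode("latin-1")


# Simulate how the parser would hand over a Location value holding byte 0xF8.
header = b"/r\xf8d".decode("utf-8", "surrogateescape")

as_latin1 = decode_location(header, "latin-1")
as_cp1251 = decode_location(header, "cp1251")
as_unknown = decode_location(header, "no-such-charset")
```

The same byte decodes to a different character under latin-1 and under cp1251, which is the concern raised above about guessing: the recovery is only as good as the charset it is given.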
Problem
When a server sends a Location header containing raw latin-1 encoded bytes (e.g. \xf8 for ø), the redirect URL gets corrupted.
Redirect chain example (from #10047):
Root cause
The HTTP parser decodes header values with utf-8/surrogateescape (http_parser.py L208). When a server sends raw latin-1 bytes in the Location header (which some servers do, in violation of the RFCs), bytes like \xf8 are not valid UTF-8 and get decoded as lone surrogates like \udcf8. These surrogates then cause URL() to produce a broken URL.
Fix
In the redirect handling code (client.py), after reading the Location header value, detect whether it contains surrogates (i.e. it cannot be encoded to UTF-8). If so, round-trip through surrogateescape back to bytes and decode as latin-1, recovering the original characters.
This is a targeted fix that only affects redirect URL processing, not general header decoding.
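The root cause and the recovery step described above can be reproduced with the standard library alone. The following is a sketch, not the PR's code: the path `/bj\xf8rn` is made-up sample data, `has_surrogate_escapes` is a hypothetical helper name, and `urllib.parse.quote` is used only as a stdlib stand-in to show that the recovered text can be percent-encoded (it is not what `URL()` does internally).

```python
from urllib.parse import quote

# Made-up Location value as a server might send it: "ø" as the latin-1 byte 0xF8.
raw_header = b"/bj\xf8rn"

# The parser's decoding step: invalid UTF-8 bytes become lone surrogates.
parsed = raw_header.decode("utf-8", "surrogateescape")


def has_surrogate_escapes(text: str) -> bool:
    """Hypothetical check: True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


# Recovery: surrogateescape is reversible, so the original bytes come back
# exactly, and latin-1 maps every byte to a character.
round_trip = parsed.encode("utf-8", "surrogateescape")
recovered = round_trip.decode("latin-1")

# The recovered text is ordinary Unicode and percent-encodes cleanly.
encoded = quote(recovered)
```

Because the detection step only fires when strict UTF-8 encoding fails, header values that were valid UTF-8 to begin with pass through unchanged.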
Verification
Fixes #10047