fix: recover latin-1 encoded Location headers on redirects #12325

Open
MAXDVVV wants to merge 4 commits into aio-libs:master from MAXDVVV:fix/redirect-non-ascii-location-10047

Conversation

@MAXDVVV MAXDVVV commented Apr 6, 2026

Problem

When a server sends a Location header containing raw latin-1 encoded bytes (e.g. \xf8 for ø), the redirect URL gets corrupted.

Redirect chain example (from #10047):

https://cornelius-k.dk/synsproeve/
  → Location: https://cornelius-k.dk/synsprøve  (URL-encoded %C3%B8, OK)
  → Location: https://cornelius-k.dk/synspr\xf8ve  (raw latin-1 byte!)
    → aiohttp sees: https://cornelius-k.dk/synspr\udcf8ve  (broken surrogate)
    → 404!

Root cause

The HTTP parser decodes header values as UTF-8 with the surrogateescape error handler (http_parser.py L208). Some servers send raw latin-1 bytes in the Location header, in violation of the RFC. A byte such as \xf8 is not valid UTF-8, so it is decoded as the lone surrogate \udcf8. These surrogates then cause URL() to produce a broken URL.
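The decoding step described above can be reproduced with the standard library alone. This is a minimal sketch, not the parser's actual code; the variable names are chosen for illustration.

```python
# A raw latin-1 byte for "ø" inside an otherwise ASCII header value.
header_bytes = b"https://cornelius-k.dk/synspr\xf8ve"

# utf-8 with surrogateescape never raises: each undecodable byte 0xNN
# is mapped to the lone surrogate U+DCNN.
decoded = header_bytes.decode("utf-8", "surrogateescape")
assert decoded == "https://cornelius-k.dk/synspr\udcf8ve"

# A strict UTF-8 encode rejects the lone surrogate, which is what makes
# the corruption detectable later on.
try:
    decoded.encode("utf-8")
    has_surrogates = False
except UnicodeEncodeError:
    has_surrogates = True
```

Because the mapping is lossless, the original bytes can always be recovered by encoding with the same error handler.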

Fix

In the redirect handling code (client.py), after reading the Location header value, detect whether it contains surrogates, i.e. whether a strict UTF-8 encode fails. If so, round-trip through surrogateescape back to bytes and decode as latin-1, recovering the original characters:

'\udcf8'.encode('utf-8', 'surrogateescape') → b'\xf8'
b'\xf8'.decode('latin-1') → 'ø'

This is a targeted fix that only affects redirect URL processing, not general header decoding.
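For reference, the detect-and-recover step might look roughly like the helper below. The name `recover_location` is hypothetical and does not come from the diff; the sketch also assumes the whole value was latin-1 on the wire.

```python
def recover_location(value: str) -> str:
    """Return value unchanged unless it carries surrogate escapes.

    Hypothetical helper: assumes the header bytes were latin-1.
    """
    try:
        # Strict encode succeeds for any well-formed string.
        value.encode("utf-8")
    except UnicodeEncodeError:
        # Lone surrogates present: get the original bytes back,
        # then reinterpret them as latin-1.
        raw = value.encode("utf-8", "surrogateescape")
        return raw.decode("latin-1")
    return value


fixed = recover_location("https://cornelius-k.dk/synspr\udcf8ve")
untouched = recover_location("https://example.com/plain")
```

One caveat of this approach: if a value mixes valid UTF-8 sequences with stray latin-1 bytes, the valid sequences are also re-decoded as latin-1 and come out garbled.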

Verification

>>> r_url = 'https://cornelius-k.dk/synspr\udcf8ve'
>>> raw = r_url.encode('utf-8', 'surrogateescape')
>>> r_url = raw.decode('latin-1')
>>> r_url
'https://cornelius-k.dk/synsprøve'  # correct!

Fixes #10047

psf-chronographer bot added the bot:chronographer:provided (There is a change note present in this PR) label Apr 6, 2026
codecov bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.11%. Comparing base (e412ccb) to head (f11f79d).
⚠️ Report is 7 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #12325    +/-   ##
========================================
  Coverage   99.11%   99.11%            
========================================
  Files         130      130            
  Lines       45558    45658   +100     
  Branches     2404     2406     +2     
========================================
+ Hits        45156    45256   +100     
  Misses        272      272            
  Partials      130      130            
Flag Coverage Δ
CI-GHA 98.97% <100.00%> (+<0.01%) ⬆️
OS-Linux 98.72% <100.00%> (-0.01%) ⬇️
OS-Windows 96.97% <100.00%> (-0.03%) ⬇️
OS-macOS 97.87% <100.00%> (-0.01%) ⬇️
Py-3.10.11 97.43% <100.00%> (+<0.01%) ⬆️
Py-3.10.20 97.90% <100.00%> (+<0.01%) ⬆️
Py-3.11.15 98.11% <100.00%> (+0.01%) ⬆️
Py-3.11.9 97.64% <100.00%> (+<0.01%) ⬆️
Py-3.12.10 97.72% <100.00%> (+<0.01%) ⬆️
Py-3.12.13 98.20% <100.00%> (+<0.01%) ⬆️
Py-3.13.12 98.45% <100.00%> (+<0.01%) ⬆️
Py-3.14.3 98.50% <100.00%> (+<0.01%) ⬆️
Py-3.14.3t ?
Py-3.14.4t 97.51% <100.00%> (?)
Py-pypy3.11.15-7.3.21 97.39% <100.00%> (-0.01%) ⬇️
VM-macos 97.87% <100.00%> (-0.01%) ⬇️
VM-ubuntu 98.72% <100.00%> (-0.01%) ⬇️
VM-windows 96.97% <100.00%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


codspeed-hq bot commented Apr 6, 2026

Merging this PR will not alter performance

✅ 61 untouched benchmarks
⏩ 4 skipped benchmarks [1]


Comparing MAXDVVV:fix/redirect-non-ascii-location-10047 (f11f79d) with master (f55503d)


Footnotes

  1. 4 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

except (UnicodeEncodeError, UnicodeDecodeError):
    try:
        raw = r_url.encode("utf-8", "surrogateescape")
        r_url = raw.decode("latin-1")
Member

What if it's not latin-1? This seems unreasonable for us to just start guessing charsets randomly.

If fallback_charset_resolver is set, we could use that instead maybe?
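To make the concern concrete: the same raw byte decodes to different characters under different single-byte charsets, and none of these decodes raises an error. A small sketch using only standard codecs (cp1251 is picked here purely as an example of another charset):

```python
raw = b"synspr\xf8ve"

# Every byte value is valid in latin-1, so this decode cannot fail,
# whether or not latin-1 was what the server intended.
as_latin1 = raw.decode("latin-1")

# Under Windows-1251 the same byte 0xF8 is a Cyrillic letter instead.
as_cp1251 = raw.decode("cp1251")

print(as_latin1, as_cp1251)
```

Since both results are well-formed strings, nothing in the bytes themselves indicates which interpretation is right.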

Dreamsorcerer added the pr-unfinished (The PR is unfinished and may need a volunteer to complete it) label Apr 7, 2026
…very

Address reviewer feedback: instead of hardcoding latin-1, consult the
session's fallback_charset_resolver to determine the charset for
recovering non-ASCII Location headers. Latin-1 remains the ultimate
fallback per RFC 7230 (historical HTTP/1.1 header encoding).

Refs: aio-libs#10047
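The behaviour described in this commit message could be sketched as follows. `decode_location` and its `resolver` parameter are hypothetical names for illustration; the resolver here takes only the raw bytes, a simplification that keeps the example self-contained and is not the session's real resolver signature.

```python
from typing import Callable, Optional


def decode_location(
    value: str, resolver: Optional[Callable[[bytes], str]] = None
) -> str:
    """Hypothetical sketch: resolver first, latin-1 as the last resort."""
    try:
        value.encode("utf-8")
        return value  # no surrogate escapes, nothing to recover
    except UnicodeEncodeError:
        pass

    # Recover the bytes exactly as they arrived on the wire.
    raw = value.encode("utf-8", "surrogateescape")
    if resolver is not None:
        try:
            return raw.decode(resolver(raw))
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes invalid in that codec.
            pass
    # latin-1 maps every byte value, so this decode always succeeds.
    return raw.decode("latin-1")


default = decode_location("synspr\udcf8ve")
resolved = decode_location("synspr\udcf8ve", lambda raw: "cp1251")
```

Catching `LookupError` covers a resolver that returns a charset name Python does not know, so a misbehaving resolver degrades to the latin-1 path rather than failing the redirect.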
Comment on lines +875 to +877
_raw = r_url.encode("utf-8", "surrogateescape")
_charset = self._resolve_charset(resp, _raw)
r_url = _recover_redirect_location(r_url, _charset)
Member

Surely we just decode it with the charset and lose the new function..?

Labels

bot:chronographer:provided (There is a change note present in this PR)
pr-unfinished (The PR is unfinished and may need a volunteer to complete it)

Development

Successfully merging this pull request may close these issues.

On redirects, middle URL with ø char gets parsed wrongly - leading to a 404

3 participants