Fix URL import encoding for pages without charset in HTTP header #660
Merged
jzohrab merged 1 commit into LuteOrg:develop on May 3, 2026
Conversation
When importing a book from a URL, `requests.get().text` decodes the response using the encoding from the HTTP Content-Type header. Per RFC 2616, if no charset is specified for text/html, requests defaults to ISO-8859-1. This causes mojibake for pages that are UTF-8 encoded but don't declare charset in the HTTP header (only in HTML meta tags). Fix: use `response.content` (raw bytes) instead of `response.text`, so BeautifulSoup can detect the correct encoding from the HTML `<meta charset>` tag, BOM, or other in-document declarations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
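The RFC 2616 fallback described above can be seen directly in `requests`, which exposes its header-to-encoding logic as a utility function (a minimal sketch using `requests.utils.get_encoding_from_headers`):

```python
import requests.utils

# With no charset in the Content-Type header, requests falls back
# to ISO-8859-1 for text/* responses, per RFC 2616.
encoding = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html"}
)
print(encoding)  # ISO-8859-1

# When the header does declare a charset, it is honored.
encoding_utf8 = requests.utils.get_encoding_from_headers(
    {"content-type": "text/html; charset=utf-8"}
)
print(encoding_utf8)  # utf-8
```

This is the encoding that `response.text` uses when decoding, which is why pages declaring their charset only in the HTML document get garbled.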
axel584
approved these changes
Apr 30, 2026
axel584
left a comment
I love 1-line PRs ;-)
Your fix works great on my Linux computer.
jzohrab
approved these changes
May 3, 2026
Collaborator
jzohrab
left a comment
Well, that's interesting, thank you! Running it through CI now ... though CI only does a fake import with a dummy web server, so it's not really exercising this fix properly.
Collaborator
Tested locally with the URL given, thank you!
Summary
Pages imported from URLs that don't specify a `charset` in the HTTP `Content-Type` header (e.g. `Content-Type: text/html` with no charset) get their text garbled with mojibake. This is because `requests` defaults to ISO-8859-1 for `text/html` per RFC 2616, even when the page is actually UTF-8. Fix: use `response.content` (raw bytes) instead of `response.text`, letting BeautifulSoup detect the correct encoding from in-document declarations like `<meta charset="utf-8">`.
Reproduction
- Import https://ancient-buddhist-texts.net/Texts-and-Translations/Dipavamsa/01-Yakkhas.htm
- The server sends `Content-Type: text/html` (no charset), but the HTML contains `<meta charset="utf-8">`
- Words like `pītipāmojjajananaṁ` render as `pÄ«tipÄmojjajananaá¹`
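The garbled text in the last step is simply UTF-8 bytes decoded as ISO-8859-1, which can be reproduced offline with the standard library alone (a sketch, not the project's code):

```python
# UTF-8 bytes misread as ISO-8859-1 produce the mojibake seen above.
original = "pītipāmojjajananaṁ"
mojibake = original.encode("utf-8").decode("iso-8859-1")
print(mojibake)  # starts with pÄ«tipÄ... (plus invisible control bytes)

# The transformation is lossless, so re-encoding as ISO-8859-1 and
# decoding as UTF-8 recovers the original text.
assert mojibake.encode("iso-8859-1").decode("utf-8") == original
```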
How the fix works
`response.text` uses the encoding from the HTTP headers (ISO-8859-1 default), while `response.content` returns the raw bytes. BeautifulSoup already handles encoding detection from the BOM, `<meta charset>`, and XML declarations when given bytes, so this is the correct approach.
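A minimal sketch of why passing bytes works (assumes `beautifulsoup4` is installed; the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

# UTF-8 page whose charset is declared only in the document,
# not in any HTTP header -- the case this PR fixes.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>pītipāmojjajananaṁ</p></body></html>"
).encode("utf-8")

# Given raw bytes (like response.content), BeautifulSoup sniffs the
# encoding from <meta charset> and decodes the text correctly.
soup = BeautifulSoup(html_bytes, "html.parser")
print(soup.p.get_text())  # pītipāmojjajananaṁ
```

Passing `response.text` instead would hand BeautifulSoup an already-mis-decoded string, which no amount of downstream detection can repair.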
Test plan
- `response.content` + BeautifulSoup correctly decodes UTF-8 from the HTML meta tag

🤖 Generated with Claude Code