
Fix URL import encoding for pages without charset in HTTP header#660

Merged

jzohrab merged 1 commit into LuteOrg:develop from vikarag:fix/url-import-encoding on May 3, 2026

Conversation

@vikarag

@vikarag vikarag commented Mar 23, 2026

Summary

  • When importing a book from a URL, pages that don't declare charset in the HTTP Content-Type header (e.g. Content-Type: text/html with no charset) get their text garbled with mojibake. This is because requests defaults to ISO-8859-1 for text/html per RFC 2616, even when the page is actually UTF-8.
  • Fix: use response.content (raw bytes) instead of response.text, letting BeautifulSoup detect the correct encoding from in-document declarations like <meta charset="utf-8">.
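The change described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Lute's actual import code; the function name `fetch_page` and the timeout value are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Fetch a page, leaving encoding detection to BeautifulSoup."""
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    # resp.text would decode using the HTTP header charset (ISO-8859-1 by
    # default for text/html with no charset); resp.content hands the raw
    # bytes to BeautifulSoup, which sniffs the BOM and <meta charset> itself.
    return BeautifulSoup(resp.content, "html.parser")
```

The only behavioral difference from the buggy version is the single argument to `BeautifulSoup`: bytes (`resp.content`) instead of a possibly mis-decoded string (`resp.text`).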

Reproduction

  1. Import a book from URL: https://ancient-buddhist-texts.net/Texts-and-Translations/Dipavamsa/01-Yakkhas.htm
  2. The server returns Content-Type: text/html (no charset), but the HTML contains <meta charset="utf-8">
  3. Pali diacritics like pītipāmojjajananaṁ render as pÄ«tipÄmojjajananaá¹
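The garbling in step 3 is the classic "UTF-8 bytes read as Latin-1" pattern, and it can be reproduced without any network access using only the standard library:

```python
# UTF-8 bytes for the Pali word, as the server actually sends them.
raw = "pītipāmojjajananaṁ".encode("utf-8")

# What requests does when the Content-Type header omits charset:
# decode every byte as a single ISO-8859-1 character.
garbled = raw.decode("iso-8859-1")   # mojibake, e.g. "pÄ«tip..."

# Decoding with the page's real encoding recovers the text.
restored = raw.decode("utf-8")       # "pītipāmojjajananaṁ"
```

Each multi-byte UTF-8 sequence (ī is the two bytes C4 AB) becomes two or three separate Latin-1 characters, which is exactly the `pÄ«ti...` output shown above.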

How the fix works

response.text uses the encoding from HTTP headers (ISO-8859-1 default), while response.content returns raw bytes. BeautifulSoup already handles encoding detection from BOM, <meta charset>, and XML declarations when given bytes, so this is the correct approach.
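This detection is easy to verify in isolation. In the hypothetical snippet below, the charset is declared only inside the document, never in any HTTP header, yet BeautifulSoup decodes the bytes correctly:

```python
from bs4 import BeautifulSoup

# Minimal page: charset declared only in the document itself.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>pītipāmojjajananaṁ</p></body></html>'
).encode("utf-8")

soup = BeautifulSoup(html_bytes, "html.parser")
detected = soup.original_encoding   # encoding bs4 detected from the bytes
text = soup.p.get_text()            # "pītipāmojjajananaṁ", decoded correctly
```

`soup.original_encoding` is only populated when BeautifulSoup is given bytes; when given an already-decoded string (as with `response.text`), no detection happens and any header-level mis-decoding is baked in.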

Test plan

  • Verified the fix with the reproduction URL above — Pali diacritics now render correctly
  • Verified via interactive Python session inside the Docker container that response.content + BeautifulSoup correctly decodes UTF-8 from the HTML meta tag
  • Existing URL import functionality should be unaffected for pages that do declare charset in HTTP headers (BeautifulSoup handles both cases)

🤖 Generated with Claude Code

When importing a book from a URL, `requests.get().text` decodes the
response using the encoding from the HTTP Content-Type header. Per
RFC 2616, if no charset is specified for text/html, requests defaults
to ISO-8859-1. This causes mojibake for pages that are UTF-8 encoded
but don't declare charset in the HTTP header (only in HTML meta tags).

Fix: use `response.content` (raw bytes) instead of `response.text`,
so BeautifulSoup can detect the correct encoding from the HTML
`<meta charset>` tag, BOM, or other in-document declarations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@axel584 axel584 left a comment


I love 1-line PRs ;-)

Your fix works great on my Linux computer.

Collaborator

@jzohrab jzohrab left a comment


Well, that's interesting, thank you! Running it through CI now ... though CI only does a fake import with a dummy web server, so it's not really exercising this fix properly.

@jzohrab jzohrab changed the base branch from master to develop May 3, 2026 00:21
@jzohrab jzohrab added this to Lute-v3 May 3, 2026
@jzohrab jzohrab moved this to In Progress in Lute-v3 May 3, 2026
@jzohrab
Collaborator

jzohrab commented May 3, 2026

Tested locally with the URL given, thank you!

@jzohrab jzohrab merged commit f9393b4 into LuteOrg:develop May 3, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Lute-v3 May 3, 2026
3 participants