
Fix URL import encoding for pages without charset in HTTP header#660

Merged

jzohrab merged 1 commit into LuteOrg:develop from vikarag:fix/url-import-encoding on May 3, 2026

Conversation

@vikarag

@vikarag vikarag commented Mar 23, 2026

Summary

  • When importing a book from a URL, pages that don't declare charset in the HTTP Content-Type header (e.g. Content-Type: text/html with no charset) get their text garbled with mojibake. This is because requests defaults to ISO-8859-1 for text/html per RFC 2616, even when the page is actually UTF-8.
  • Fix: use response.content (raw bytes) instead of response.text, letting BeautifulSoup detect the correct encoding from in-document declarations like <meta charset="utf-8">.
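The change described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Lute's actual import code; the function name `fetch_page` and the timeout value are hypothetical:

```python
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Fetch a page, leaving encoding detection to BeautifulSoup."""
    resp = requests.get(url, timeout=20)
    resp.raise_for_status()
    # resp.text would decode using the HTTP header charset (ISO-8859-1 by
    # default for text/html with no charset); resp.content hands the raw
    # bytes to BeautifulSoup, which sniffs the BOM and <meta charset> itself.
    return BeautifulSoup(resp.content, "html.parser")
```

The only behavioral difference from the buggy version is the single argument to `BeautifulSoup`: bytes (`resp.content`) instead of a possibly mis-decoded string (`resp.text`).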

Reproduction

  1. Import a book from URL: https://ancient-buddhist-texts.net/Texts-and-Translations/Dipavamsa/01-Yakkhas.htm
  2. The server returns Content-Type: text/html (no charset), but the HTML contains <meta charset="utf-8">
  3. Pali diacritics like pītipāmojjajananaṁ render as pÄ«tipÄmojjajananaá¹
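The garbling in step 3 is the classic "UTF-8 bytes read as Latin-1" pattern, and it can be reproduced without any network access using only the standard library:

```python
# UTF-8 bytes for the Pali word, as the server actually sends them.
raw = "pītipāmojjajananaṁ".encode("utf-8")

# What requests does when the Content-Type header omits charset:
# decode every byte as a single ISO-8859-1 character.
garbled = raw.decode("iso-8859-1")   # mojibake, e.g. "pÄ«tip..."

# Decoding with the page's real encoding recovers the text.
restored = raw.decode("utf-8")       # "pītipāmojjajananaṁ"
```

Each multi-byte UTF-8 sequence (ī is the two bytes C4 AB) becomes two or three separate Latin-1 characters, which is exactly the `pÄ«ti...` output shown above.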

How the fix works

response.text uses the encoding from HTTP headers (ISO-8859-1 default), while response.content returns raw bytes. BeautifulSoup already handles encoding detection from BOM, <meta charset>, and XML declarations when given bytes, so this is the correct approach.
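This detection is easy to verify in isolation. In the hypothetical snippet below, the charset is declared only inside the document, never in any HTTP header, yet BeautifulSoup decodes the bytes correctly:

```python
from bs4 import BeautifulSoup

# Minimal page: charset declared only in the document itself.
html_bytes = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>pītipāmojjajananaṁ</p></body></html>'
).encode("utf-8")

soup = BeautifulSoup(html_bytes, "html.parser")
detected = soup.original_encoding   # encoding bs4 detected from the bytes
text = soup.p.get_text()            # "pītipāmojjajananaṁ", decoded correctly
```

`soup.original_encoding` is only populated when BeautifulSoup is given bytes; when given an already-decoded string (as with `response.text`), no detection happens and any header-level mis-decoding is baked in.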

Test plan

  • Verified the fix with the reproduction URL above — Pali diacritics now render correctly
  • Verified via interactive Python session inside the Docker container that response.content + BeautifulSoup correctly decodes UTF-8 from the HTML meta tag
  • Existing URL import functionality should be unaffected for pages that do declare charset in HTTP headers (BeautifulSoup handles both cases)

🤖 Generated with Claude Code

When importing a book from a URL, `requests.get().text` decodes the
response using the encoding from the HTTP Content-Type header. Per
RFC 2616, if no charset is specified for text/html, requests defaults
to ISO-8859-1. This causes mojibake for pages that are UTF-8 encoded
but don't declare charset in the HTTP header (only in HTML meta tags).

Fix: use `response.content` (raw bytes) instead of `response.text`,
so BeautifulSoup can detect the correct encoding from the HTML
`<meta charset>` tag, BOM, or other in-document declarations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@axel584 axel584 left a comment


I love 1-line PRs ;-)

Your fix works great on my Linux computer.

Collaborator

@jzohrab jzohrab left a comment


Well, that's interesting, thank you! Running it through CI now ... though CI only does a fake import with a dummy web server, so it's not really exercising this fix properly.

@jzohrab jzohrab changed the base branch from master to develop May 3, 2026 00:21
@jzohrab jzohrab added this to Lute-v3 May 3, 2026
@jzohrab jzohrab moved this to In Progress in Lute-v3 May 3, 2026
@jzohrab
Collaborator

jzohrab commented May 3, 2026

Tested locally with the URL given, thank you!

@jzohrab jzohrab merged commit f9393b4 into LuteOrg:develop May 3, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Lute-v3 May 3, 2026
3 participants