This repo contains Python code for tagging words based on their etymology. You can get more information from the Words Words Words web site or by browsing this repository.
Finished product: http://charlesreid1.github.io/wordswordswords
Repository name is a reference to Hamlet.
- Python 3.10+ (recommended)
- External API Keys:
- To use the LLM-based etymology lookup, you must set the `ANTHROPIC_API_KEY` environment variable.
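Before running the LLM-based lookup, you can check that the key is visible to Python. A minimal sketch (not part of the repo's own code; only the `ANTHROPIC_API_KEY` variable name comes from the requirement above):

```python
import os

# Sketch: report whether the LLM-based lookup can run.
has_key = "ANTHROPIC_API_KEY" in os.environ
print("LLM lookup available:", has_key)
```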
Clone the repository:
git clone https://github.com/charlesreid1/wordswordswords.git
cd wordswordswords
Install the required dependencies:
pip install -r requirements.txt
(Optional) Initialize the SQLite cache from existing CSV files:
python -m etymology.run --migrate
The primary way to use the library is via the etymology.run module.
To process a book from a Gutenberg HTML file through all stages (extraction, lookup, and generation):
python -m etymology.run --book <book_name> --step all

The <book_name> corresponds to the filename in the gutenberg/ directory (e.g., frankenstein).
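The extraction stage amounts to stripping the Gutenberg HTML down to text and counting word frequencies. A rough sketch of the idea using only the standard library (the repo's actual implementation may differ):

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def word_frequencies(html):
    """Return a Counter of lowercase words found in the HTML's text."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z']+", text)
    return Counter(words)

freqs = word_frequencies("<p>The monster saw the monster.</p>")
print(freqs.most_common(2))  # [('the', 2), ('monster', 2)]
```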
You can also run individual steps of the pipeline:
- Extract: Process the raw Gutenberg HTML and generate a word frequency list.
python -m etymology.run --book <book_name> --step extract
- Lookup: Fetch etymological data for the word list using configured sources (Wiktionary, Etymonline, LLM). Results are cached in data/etymologies.db.
python -m etymology.run --book <book_name> --step lookup
- Generate: Create the final tagged HTML file where recognized words are wrapped in <span class='language'> tags.
python -m etymology.run --book <book_name> --step generate
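The generate step's span-wrapping can be illustrated with a regex substitution. A hypothetical sketch, not the repo's actual implementation; the etymology dictionary here is a stand-in for the real lookup:

```python
import re

# Hypothetical lookup table: word -> language of origin (stand-in data).
ETYMOLOGY = {"monster": "latin", "saw": "old-english"}

def tag_words(text):
    """Wrap each recognized word in a <span class='language'> tag."""
    def repl(match):
        word = match.group(0)
        lang = ETYMOLOGY.get(word.lower())
        if lang is None:
            return word  # unrecognized words pass through untouched
        return f"<span class='{lang}'>{word}</span>"
    return re.sub(r"[A-Za-z']+", repl, text)

print(tag_words("The monster saw"))
# The <span class='latin'>monster</span> <span class='old-english'>saw</span>
```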
The library also supports its original execution mode:
python -m etymology.run --mode legacy --book <book_name> --step all

The pipeline's behavior can be customized in etymology/config.yaml:
- Sources: Choose the order of etymology lookup sources (wiktionary, etymonline, llm).
- WordNet Lemmatization: Enable/disable lemmatization before looking up a word's etymology.
- Rate Limits: Configure delay between requests for external sources.
- LLM Settings: Specify which model to use for LLM-based lookup.
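A hypothetical etymology/config.yaml illustrating the options above. The key names are guesses for illustration only; consult the config file shipped with the repo for the real ones:

```yaml
# Hypothetical sketch -- actual key names may differ.
sources:            # lookup order
  - wiktionary
  - etymonline
  - llm
lemmatize: true     # WordNet lemmatization before lookup
rate_limit_seconds: 1   # delay between external requests
llm:
  model: claude-3-5-sonnet-latest
```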
The project uses Pelican to generate the static web site. The process involves three main steps:
First, generate the etymology-tagged HTML files for the books you want to update. These files are the source for the Pelican build:
python -m etymology.run --book <book_name> --step html --mode legacy

The generated files will be placed in the html/ directory.
The Pelican site templates include these HTML files from book-specific _includes directories. You must copy the updated files from the root html/ directory to the corresponding Pelican directory:
# Example for Dubliners
cp html/dubliners*.html pelican/dubliners/_includes/
# Example for Frankenstein
cp html/frankenstein*.html pelican/frankenstein/_includes/

Note: Most books use an _includes directory, but roughingit uses _include.
Navigate to the pelican/ directory and run the build script to generate the final site:
cd pelican
./make_stuff.sh

The final site will be generated in pelican/output/.
To preview the site locally, run a web server from the output directory:
cd pelican/output
python -m http.server 8000

Then visit http://localhost:8000/wordswordswords/ in your browser.
- etymology/: Main Python package containing the core logic.
- gutenberg/: Input HTML files from Project Gutenberg.
- csv/: Intermediate word lists and legacy CSV etymology data.
- data/: SQLite cache and progress tracking files.
- html/: Generated tagged HTML files.
- pelican/: Configuration for generating the static web site.