charlesreid1/wordswordswords

Words Words Words

This repo contains Python code for tagging words based on their etymology. You can get more information from the Words Words Words web site or by browsing this repository.

Finished product: http://charlesreid1.github.io/wordswordswords

The repository name is a reference to Hamlet.

Prerequisites

  • Python 3.10+ (recommended)
  • External API Keys:
    • To use the LLM-based etymology lookup, you must set the ANTHROPIC_API_KEY environment variable.
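
Before starting a long lookup run, it can help to confirm that the key is actually visible to Python. The following is a minimal sketch, not part of the repository; the helper name `llm_lookup_available` is hypothetical, and only the `ANTHROPIC_API_KEY` variable name comes from this README.

```python
import os


def llm_lookup_available(env=None):
    """Return True if ANTHROPIC_API_KEY is set to a non-empty value.

    `env` defaults to the real process environment; passing a plain dict
    makes the check easy to exercise without touching os.environ.
    """
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if not llm_lookup_available():
    print("ANTHROPIC_API_KEY is not set; LLM-based lookup will likely fail.")
```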

Installation

  1. Clone the repository:

    git clone https://github.com/charlesreid1/wordswordswords.git
    cd wordswordswords
  2. Install the required dependencies:

    pip install -r requirements.txt
  3. (Optional) Initialize the SQLite cache from existing CSV files:

    python -m etymology.run --migrate
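
To give a feel for what a CSV-to-SQLite migration involves, here is a small self-contained sketch. It is an illustration only: the table name `etymologies` and the `word` and `language` columns are assumptions, and the repository's actual schema and CSV layout may differ.

```python
import csv
import io
import sqlite3


def load_csv_into_cache(conn, csv_text):
    """Copy rows from CSV text into a SQLite table and return the row count.

    The schema here (word, language) is hypothetical, chosen for illustration.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etymologies "
        "(word TEXT PRIMARY KEY, language TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(row["word"], row["language"]) for row in reader]
    # INSERT OR REPLACE keeps the migration safe to run more than once.
    conn.executemany(
        "INSERT OR REPLACE INTO etymologies (word, language) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)


# Usage with an in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
count = load_csv_into_cache(conn, "word,language\nmonster,french\nfled,english\n")
```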

Usage

The primary way to use the library is via the etymology.run module.

Running the Full Pipeline

To process a book from a Gutenberg HTML file through all stages (extraction, lookup, and generation):

python -m etymology.run --book <book_name> --step all

The <book_name> corresponds to the filename in the gutenberg/ directory (e.g., frankenstein).
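
If you process several books regularly, the command above can be wrapped in a short driver script. This is a sketch under the assumption that the CLI accepts exactly the flags shown in this README; the function names `build_command` and `run_books` are hypothetical helpers, not part of the package.

```python
import subprocess
import sys


def build_command(book, step="all"):
    """Build the argument list for one pipeline invocation."""
    return [sys.executable, "-m", "etymology.run", "--book", book, "--step", step]


def run_books(books, step="all", dry_run=True):
    """Run the pipeline for each book in turn and return the commands used.

    With dry_run=True (the default) nothing is executed, which makes it
    easy to inspect the commands first.
    """
    commands = [build_command(book, step) for book in books]
    if not dry_run:
        for command in commands:
            # check=True stops the batch at the first failing book.
            subprocess.run(command, check=True)
    return commands


for command in run_books(["dubliners", "frankenstein"]):
    print(" ".join(command))
```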

Pipeline Stages

You can also run individual steps of the pipeline:

  • Extract: Process the raw Gutenberg HTML and generate a word frequency list.
    python -m etymology.run --book <book_name> --step extract
  • Lookup: Fetch etymological data for the word list using configured sources (Wiktionary, Etymonline, LLM). Results are cached in data/etymologies.db.
    python -m etymology.run --book <book_name> --step lookup
  • Generate: Create the final tagged HTML file where recognized words are wrapped in <span class='language'> tags.
    python -m etymology.run --book <book_name> --step generate
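
Conceptually, the extract and generate stages amount to counting words and then wrapping recognized words in a span whose class names the source language. The sketch below illustrates that idea on plain text. It is not the repository's implementation: the function names are hypothetical, the word pattern is deliberately simplified (it splits on apostrophes and ignores HTML markup), and using the language name as the class value is an assumption.

```python
import re
from collections import Counter
from html import escape

# Simplified word pattern: runs of ASCII letters only.
WORD_RE = re.compile(r"[A-Za-z]+")


def word_frequencies(text):
    """Count case-insensitive word occurrences in plain text."""
    return Counter(word.lower() for word in WORD_RE.findall(text))


def tag_words(text, etymologies):
    """Wrap each word with a known etymology in a <span class='...'> tag.

    `etymologies` maps lowercase words to a language label; words that
    are not in the mapping are left untouched.
    """
    def wrap(match):
        word = match.group(0)
        language = etymologies.get(word.lower())
        if language is None:
            return word
        return f"<span class='{escape(language, quote=True)}'>{word}</span>"

    return WORD_RE.sub(wrap, text)


frequencies = word_frequencies("The monster fled. The creature wept.")
tagged = tag_words("The monster fled", {"monster": "french"})
```

Here `tagged` keeps the original capitalization and spacing, and only `monster` is wrapped.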

Legacy Mode

The library also supports its original execution mode:

python -m etymology.run --mode legacy --book <book_name> --step all

Configuration

The pipeline's behavior can be customized in etymology/config.yaml:

  • Sources: Choose the order of etymology lookup sources (wiktionary, etymonline, llm).
  • WordNet Lemmatization: Enable/disable lemmatization before looking up a word's etymology.
  • Rate Limits: Configure delay between requests for external sources.
  • LLM Settings: Specify which model to use for LLM-based lookup.
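
The source order and rate limit settings interact in a simple way: each source is tried in the configured order until one returns a result, with a pause between requests. The following sketch shows that fallback pattern with stand-in lookup functions. The function `lookup_with_fallback`, its parameters, and the stub sources are all hypothetical; the real configuration keys and lookup code live in the repository.

```python
import time


def lookup_with_fallback(word, sources, order, delay=0.0):
    """Try each named source in order and return (source_name, result).

    `sources` maps a source name to a callable that returns an etymology
    or None. `delay` is the pause, in seconds, after a miss before the
    next source is tried. Returns (None, None) if every source misses.
    """
    for name in order:
        result = sources[name](word)
        if result is not None:
            return name, result
        time.sleep(delay)
    return None, None


# Stand-in sources for illustration; real sources would make network calls.
stub_sources = {
    "wiktionary": lambda word: None,
    "etymonline": lambda word: "latin" if word == "animal" else None,
    "llm": lambda word: "unknown",
}

hit = lookup_with_fallback("animal", stub_sources, ["wiktionary", "etymonline", "llm"])
```

Placing the cheaper or more reliable sources first in the order means the later ones are consulted only for words the earlier ones cannot resolve.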

Building the Web Site

The project uses Pelican to generate the static web site. The process involves three main steps, followed by an optional local preview:

1. Generate Tagged HTML

First, generate the etymology-tagged HTML files for the books you want to update. These files are the source for the Pelican build:

python -m etymology.run --book <book_name> --step html --mode legacy

The generated files will be placed in the html/ directory.

2. Synchronize with Pelican

The Pelican site templates include these HTML files from book-specific _includes directories. You must copy the updated files from the root html/ directory to the corresponding Pelican directory:

# Example for Dubliners
cp html/dubliners*.html pelican/dubliners/_includes/

# Example for Frankenstein
cp html/frankenstein*.html pelican/frankenstein/_includes/

Note: Most books use an _includes directory, but roughingit uses _include.
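
Because the include directory name differs for one book, a small script can make the copy step less error-prone. This is a sketch that assumes the directory layout described above; `sync_book` and `INCLUDE_DIR_OVERRIDES` are hypothetical names, not part of the repository.

```python
import shutil
from pathlib import Path

# Books whose include directory differs from the default "_includes".
INCLUDE_DIR_OVERRIDES = {"roughingit": "_include"}


def sync_book(book, root="."):
    """Copy html/<book>*.html into the book's Pelican include directory.

    Returns the list of destination paths that were written.
    """
    root = Path(root)
    include_dir = INCLUDE_DIR_OVERRIDES.get(book, "_includes")
    destination = root / "pelican" / book / include_dir
    # Unlike a bare `cp`, this creates the destination if it is missing.
    destination.mkdir(parents=True, exist_ok=True)

    copied = []
    for source in sorted((root / "html").glob(f"{book}*.html")):
        target = destination / source.name
        shutil.copy2(source, target)
        copied.append(target)
    return copied
```

Calling `sync_book("frankenstein")` from the repository root would mirror the second `cp` example above. Note that creating a missing destination directory can hide a mistyped book name, so you may prefer to drop the `mkdir` call and let the copy fail instead.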

3. Run the Pelican Build

Navigate to the pelican/ directory and run the build script to generate the final site:

cd pelican
./make_stuff.sh

The final site will be generated in pelican/output/.

4. Local Preview

To preview the site locally, run a web server from the output directory:

cd pelican/output
python -m http.server 8000

Then visit http://localhost:8000/wordswordswords/ in your browser.

Project Structure

  • etymology/: Main Python package containing the core logic.
  • gutenberg/: Input HTML files from Project Gutenberg.
  • csv/: Intermediate word lists and legacy CSV etymology data.
  • data/: SQLite cache and progress tracking files.
  • html/: Generated tagged HTML files.
  • pelican/: Configuration for generating the static web site.
