This repo contains Python code for tagging words based on their etymology. You can get more information from the Words Words Words web site or by browsing this repository.
Finished product: http://charlesreid1.github.io/wordswordswords
Repository name is a reference to Hamlet.
- Python 3.10+ (recommended)
- External API Keys:
- To use the LLM-based etymology lookup, you must set the `ANTHROPIC_API_KEY` environment variable.
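Before running the LLM-based lookup, you can check that the key is visible to Python. A minimal sketch (not part of the repo's own code; only the `ANTHROPIC_API_KEY` variable name comes from the requirement above):

```python
import os

# Sketch: report whether the LLM-based lookup can run.
has_key = "ANTHROPIC_API_KEY" in os.environ
print("LLM lookup available:", has_key)
```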
Clone the repository:
git clone https://github.com/charlesreid1/wordswordswords.git
cd wordswordswords
Install the required dependencies:
pip install -r requirements.txt
(Optional) Initialize the SQLite cache from existing CSV files:
python -m etymology.run --migrate
The primary way to use the library is via the etymology.run module.
To process a book from a Gutenberg HTML file through all stages (extraction, lookup, and generation):
python -m etymology.run --book <book_name> --step all

The <book_name> corresponds to the filename in the gutenberg/ directory (e.g., frankenstein).
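The extraction stage amounts to stripping the Gutenberg HTML down to text and counting word frequencies. A rough sketch of the idea using only the standard library (the repo's actual implementation may differ):

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def word_frequencies(html):
    """Return a Counter of lowercase words found in the HTML's text."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z']+", text)
    return Counter(words)

freqs = word_frequencies("<p>The monster saw the monster.</p>")
print(freqs.most_common(2))  # [('the', 2), ('monster', 2)]
```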
You can also run individual steps of the pipeline:
- Extract: Process the raw Gutenberg HTML and generate a word frequency list.
python -m etymology.run --book <book_name> --step extract
- Lookup: Fetch etymological data for the word list using configured sources (Wiktionary, Etymonline, LLM). Results are cached in data/etymologies.db.
python -m etymology.run --book <book_name> --step lookup
- Generate: Create the final tagged HTML file where recognized words are wrapped in <span class='language'> tags.
python -m etymology.run --book <book_name> --step generate
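The generate step's span-wrapping can be illustrated with a regex substitution. A hypothetical sketch, not the repo's actual implementation; the etymology dictionary here is a stand-in for the real lookup:

```python
import re

# Hypothetical lookup table: word -> language of origin (stand-in data).
ETYMOLOGY = {"monster": "latin", "saw": "old-english"}

def tag_words(text):
    """Wrap each recognized word in a <span class='language'> tag."""
    def repl(match):
        word = match.group(0)
        lang = ETYMOLOGY.get(word.lower())
        if lang is None:
            return word  # unrecognized words pass through untouched
        return f"<span class='{lang}'>{word}</span>"
    return re.sub(r"[A-Za-z']+", repl, text)

print(tag_words("The monster saw"))
# The <span class='latin'>monster</span> <span class='old-english'>saw</span>
```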
The library also supports its original execution mode:
python -m etymology.run --mode legacy --book <book_name> --step all

The pipeline's behavior can be customized in etymology/config.yaml:
- Sources: Choose the order of etymology lookup sources (wiktionary, etymonline, llm).
- WordNet Lemmatization: Enable/disable lemmatization before looking up a word's etymology.
- Rate Limits: Configure delay between requests for external sources.
- LLM Settings: Specify which model to use for LLM-based lookup.
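A hypothetical etymology/config.yaml illustrating the options above. The key names are guesses for illustration only; consult the config file shipped with the repo for the real ones:

```yaml
# Hypothetical sketch -- actual key names may differ.
sources:            # lookup order
  - wiktionary
  - etymonline
  - llm
lemmatize: true     # WordNet lemmatization before lookup
rate_limit_seconds: 1   # delay between external requests
llm:
  model: claude-3-5-sonnet-latest
```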
The project uses Pelican to generate the static web site. The process involves three main steps:
First, generate the etymology-tagged HTML files for the books you want to update. These files are the source for the Pelican build:
python -m etymology.run --book <book_name> --step html --mode legacy

The generated files will be placed in the html/ directory.
The Pelican site templates include these HTML files from book-specific _includes directories. You must copy the updated files from the root html/ directory to the corresponding Pelican directory:
# Example for Dubliners
cp html/dubliners*.html pelican/dubliners/_includes/
# Example for Frankenstein
cp html/frankenstein*.html pelican/frankenstein/_includes/

Note: Most books use an _includes directory, but roughingit uses _include.
Navigate to the pelican/ directory and run the build script to generate the final site:
cd pelican
./make_stuff.sh

The final site will be generated in pelican/output/.
To preview the site locally, run a web server from the output directory:
cd pelican/output
python -m http.server 8000

Then visit http://localhost:8000/wordswordswords/ in your browser.
- etymology/: Main Python package containing the core logic.
- gutenberg/: Input HTML files from Project Gutenberg.
- csv/: Intermediate word lists and legacy CSV etymology data.
- data/: SQLite cache and progress tracking files.
- html/: Generated tagged HTML files.
- pelican/: Configuration for generating the static web site.