
Commit 6aa0178

Version: 0.10.5-alpha.2
Unify BPE/Char tokenizers, add tokenize CLI, modularize

- Refactor: Unified BPE and Char tokenizer/vocabulary modules with extensible config
- Feature: Add tokenize CLI tool for training, encoding, and decoding text
- Infra: Modularize codebase, update CMake and tests for new structure
- Test: Add comprehensive unit tests for BPE/Char tokenizers and trainers

BREAKING CHANGE: Legacy Gpt2/Llama3 tokenizer modules removed; all code and tools now use unified BPE/Char modules and new file locations. Update any scripts or integrations to use the new tokenize CLI and module APIs.
1 parent 55a64b5 commit 6aa0178

40 files changed

Lines changed: 2869 additions & 499 deletions

.github/copilot-instructions.md

Lines changed: 35 additions & 38 deletions
@@ -1,67 +1,64 @@
 # Mila — Copilot Instructions
 
-## Code generation policy
+## General Guidelines
 - Generate code only when explicitly requested (e.g., "implement", "update", "write code", "generate", "create code"). Otherwise provide analysis, design guidance, and minimal examples.
-- Mila is at the Alpha stage of development, please do not consider backward compatibility with previous versions when generating code.
+- Mila is at the Alpha stage of development; please do not consider backward compatibility with previous versions when generating code.
 
-## Doxygen / file header policy
-- File-level Doxygen comments must be concise summaries (one to three short sentences).
-  - Purpose: provide a quick summary of the file intent for readers and tools.
-  - Must NOT repeat detailed API, implementation notes, or usage examples.
-- Detailed documentation belongs in the module/class/function-level Doxygen comments (module API).
-  - Put behavior, parameters, return semantics, ownership/lifetime, threading assumptions, and examples on the relevant symbol's Doxygen block.
-- Module-level comments (module partitions) should describe the public API surface and usage patterns.
-- Example file-level header (preferred):
-  - Brief one-line summary: "Configuration for the Residual module."
-  - Optional short second sentence for scope: "Provides fluent setters used by Residual and backend factories."
-- Rationale: keeps files scannable and avoids duplicated, stale documentation across many files.
-
-## Coding Style
-
-### Blank Lines Around Blocks
-- Add blank line before control flow blocks (if, for, while, switch)
-- Add blank line after closing brace of blocks
-- Exception: No blank line between `} else {` or `} catch {`
-
-### Blank Lines Around Return Statements
-- Add blank line before `return` statement (final return in function)
-- Exception: Early returns (guard clauses) don't need blank line
-- Exception: Single-statement functions don't need blank line
+## Code Style
+- Do not columnize/align code with extra spaces. Identifiers and types should use standard single-space formatting. Column alignment breaks when names change.
+- Add blank line before control flow blocks (if, for, while, switch).
+- Add blank line after closing brace of blocks.
+- Exception: No blank line between `} else {` or `} catch {`.
+- Add blank line before `return` statement (final return in function).
+- Exception: Early returns (guard clauses) don't need blank line.
+- Exception: Single-statement functions don't need blank line.
 
 ## High-level constraints
 - Project is alpha: breaking changes and simplifications are acceptable.
 - Backward compatibility is NOT required. Do not use Deprecated APIs.
-- Do not use Mila deprecated API
+- Do not use Mila deprecated API.
 - Host code: C++23 using modules and module partitions. Tests: GTest. Build: CMake + Ninja.
 
-## Comment policy
+## Comment Policy
 - NEVER generate trivial comments that simply restate what the code does. For example, do not generate comments like:
   - `// increment i` for the line `i++;`
   Such trivial, repetitive comments must not be produced by Copilot.
-- Use only ASCII characters (no Unicode checkmarks, emojis, or special symbols)
-- Don't add simple validation comments (e.g., "Good", "Correct", "OK", "Bad")
-- Comments should explain WHAT the code's intent or contract is, or WHY a non-obvious approach is required, not restate HOW the code performs obvious operations.
-  - Good: `// accumulate running mean across batch to avoid a second pass`
-  - Good: `// Use integer index to preserve pointer stability required by the SIMD kernel`
+- Use only ASCII characters (no Unicode checkmarks, emojis, or special symbols).
+- Don't add simple validation comments (e.g., "Good", "Correct", "OK", "Bad").
+- Comments should explain WHAT the code's intent or contract is, or WHY a non-obvious approach is required, not restate HOW the code performs obvious operations.
+  - Good: `// accumulate running mean across batch to avoid a second pass`.
+  - Good: `// Use integer index to preserve pointer stability required by the SIMD kernel`.
 - Prefer documenting:
   - Function/module contract: inputs, outputs, side-effects, threading assumptions, and performance/precision trade-offs.
   - Non-obvious algorithms, invariants, and corner cases that callers or maintainers must preserve.
   - API expectations: ownership, lifetime, and accumulation semantics (overwrite vs accumulate).
 - Keep comments technical and informative, not evaluative or apologetic.
 - Do not include reasoning or justification for design decisions in code comments (keep rationale in design documents or commit messages).
 - Avoid commenting trivial lines of code that are self-explanatory; prefer a brief block comment describing the overall purpose of the surrounding code instead.
-- Documentation comments (Doxygen) should describe behavior, usage, public contracts and examples — not explain why changes were made.
+- Documentation comments (Doxygen) should describe behavior, usage, public contracts, and examples — not explain why changes were made.
+
+## Doxygen / File Header Policy
+- File-level Doxygen comments must be concise summaries (one to three short sentences).
+  - Purpose: provide a quick summary of the file intent for readers and tools.
+  - Must NOT repeat detailed API, implementation notes, or usage examples.
+- Detailed documentation belongs in the module/class/function-level Doxygen comments (module API).
+  - Put behavior, parameters, return semantics, ownership/lifetime, threading assumptions, and examples on the relevant symbol's Doxygen block.
+- Module-level comments (module partitions) should describe the public API surface and usage patterns.
+- Example file-level header (preferred):
+  - Brief one-line summary: "Configuration for the Residual module."
+  - Optional short second sentence for scope: "Provides fluent setters used by Residual and backend factories."
+- Rationale: keeps files scannable and avoids duplicated, stale documentation across many files.
 
-## Doxygen guidance for generated code
+## Doxygen Guidance for Generated Code
 - When emitting Doxygen for symbols:
   - Use the full signature and describe preconditions, postconditions, and side-effects.
   - Prefer param/return tags for public methods.
   - Use short examples only in the symbol comment (not in file headers).
 - Avoid emitting long prose in file headers; put detail in the API-level documentation.
 
-## Notes for AI assistant
+## Notes for AI Assistant
 - When recommending code, prefer modern C++ idioms (RAII, smart pointers, STL algorithms).
 - Always include testing suggestions and consider CPU/CUDA parity.
-- In explanatory text (not code), you may use formatting symbols for clarity, but generated code comments must follow the comment policy above
-- Keep commit messages and explanatory responses separate from code documentation
-- Unit tests are structured by project, namespace and class; place tests under the Tests tree following the repository project layout and mirror the production namespace/class organization.
+- In explanatory text (not code), you may use formatting symbols for clarity, but generated code comments must follow the comment policy above.
+- Keep commit messages and explanatory responses separate from code documentation.
+- Unit tests are structured by project, namespace, and class; place tests under the Tests tree following the repository project layout and mirror the production namespace/class organization.

Mila/CMakeLists.txt

Lines changed: 24 additions & 20 deletions
@@ -289,9 +289,9 @@ PUBLIC
 #----------------------------------------------------------------------
 # Dnn / Data
 #----------------------------------------------------------------------
-"Src/Dnn/Data/DataLoader.ixx"
-"Src/Dnn/Data/TokenSequenceLoader.ixx"
-"Src/Dnn/Data/TokenSequenceLoader.Config.ixx"
+"Src/Data/Loaders/DataLoader.ixx"
+"Src/Data/Loaders/TokenSequenceLoader.ixx"
+"Src/Data/Loaders/TokenSequenceLoader.Config.ixx"
 
 #---------------------------------------------------------------
 # Dnn / Serialization
@@ -381,31 +381,36 @@ PUBLIC
 
 "Src/Dnn/Components/Transformers/LlaMa/Llama.Presets.ixx"
 
-"Src/Dnn/Data/Tokenizer.ixx"
-"Src/Dnn/Data/Gpt2Tokenizer.ixx"
-"Src/Dnn/Data/Llama3Tokenizer.ixx"
-"Src/Dnn/Data/TokenizerVocabulary.ixx"
-"Src/Dnn/Data/TokenizerType.ixx"
+"Src/Data/Tokenizers/Tokenizer.ixx"
+# DEPRECATED: "Src/Dnn/Data/Gpt2Tokenizer.ixx"
+# DEPRECATED: "Src/Dnn/Data/Llama3Tokenizer.ixx"
+"Src/Data/Tokenizers/TokenizerVocabulary.ixx"
+"Src/Data/Tokenizers/TokenizerType.ixx"
 
 "Src/Data/Core/FileHeader.ixx"
 "Src/Data/Core/TokenizerTrainer.ixx"
 "Src/Data/Core/TrainerFactory.ixx"
 
-"Src/Data/Tokenizers/Bpe/Gpt2/BPETokenizer.ixx"
-"Src/Data/Tokenizers/Bpe/Gpt2/BpeTrainer.ixx"
-"Src/Data/Tokenizers/Bpe/Gpt2/BPEVocabulary.ixx"
-"Src/Data/Tokenizers/Bpe/Gpt2/BpeVocabularyConfig.ixx"
-
-"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Tokenizer.ixx"
-"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Vocabulary.ixx"
-"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Vocabulary.Config.ixx"
+# REVIEW: Unified Bpe tokenizer
+#"Src/Data/Tokenizers/Bpe/Gpt2/BPETokenizer.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt2/BpeTrainer.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt2/BPEVocabulary.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt2/BpeVocabularyConfig.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Tokenizer.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Vocabulary.ixx"
+#"Src/Data/Tokenizers/Bpe/Gpt4/Gpt4Vocabulary.Config.ixx"
 
 "Src/Data/Tokenizers/Char/CharTokenizer.ixx"
 "Src/Data/Tokenizers/Char/CharVocabularyConfig.ixx"
 "Src/Data/Tokenizers/Char/CharTrainer.ixx"
 "Src/Data/Tokenizers/Char/CharVocabulary.ixx"
 
-"Src/Data/Tokenizers/Bpe/PreTokenizationMode.ixx"
+"Src/Data/Tokenizers/Bpe/BpeVocabularyConfig.ixx"
+"Src/Data/Tokenizers/Bpe/BpeVocabulary.ixx"
+"Src/Data/Tokenizers/Bpe/BpeTokenizer.ixx"
+"Src/Data/Tokenizers/Bpe/BpeTrainer.ixx"
+
+"Src/Data/Tokenizers/Bpe/BpePreTokenizationMode.ixx"
 
 "Src/Dnn/Components/Transformers/GenerateParams.ixx"
 "Src/Data/Tokenizers/SpecialTokens.ixx"
@@ -430,7 +435,6 @@ PUBLIC
 "Src/Dnn/Components/Attention/GQA/GroupedQueryAttention.Config.ixx"
 
 "Src/Dnn/Compute/Operations/PairedOperation.ixx"
-
 )
 
 set(MILA_INSTALL_FILE_SET_ARGS FILE_SET module_files DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}/mila/modules)
@@ -780,8 +784,8 @@ endif()
 # Generate documentation with Doxygen
 add_subdirectory( Docs )
 
-# Add data tokenization tools
-add_subdirectory( Src/Data/Tools )
+# Add tools: tokenizer trainer, model exporter, etc.
+add_subdirectory( Tools )
 
 # Configure code coverage for MSVC
 if( MILA_ENABLE_COVERAGE AND MSVC)
Lines changed: 6 additions & 6 deletions
@@ -1,5 +1,5 @@
 /**
- * @file Gpt2Tokenizer.ixx
+ * @file Gpt2Tokenizer_old.ixx
  * @brief GPT-style BPE tokenizer and binary loader used by Mila.
  *
  * Loads a compact binary tokenizer format and provides encode/decode functionality.
@@ -20,7 +20,7 @@ module;
 #include <functional>
 #include <limits>
 
-export module Data.Gpt2Tokenizer;
+export module Data.Gpt2Tokenizer_old_old;
 
 import Data.Tokenizer;
 
@@ -50,7 +50,7 @@ namespace Mila::Dnn::Data
      * mutations. Concurrent read-only encode/decode usage is acceptable when no
      * writer modifies state.
      */
-    export class Gpt2Tokenizer : public Tokenizer {
+    export class Gpt2Tokenizer_old : public Tokenizer {
     public:
        /**
         * @brief Create a tokenizer by loading the binary file at `path`.
@@ -60,8 +60,8 @@ namespace Mila::Dnn::Data
         * Preconditions: `path` points to a file produced by the repository
         * conversion utility or another producer that follows the same layout.
         */
-        static std::unique_ptr<Gpt2Tokenizer> fromFile( const std::string& path ) {
-            auto tokenizer = std::unique_ptr<Gpt2Tokenizer>( new Gpt2Tokenizer() );
+        static std::unique_ptr<Gpt2Tokenizer_old> fromFile( const std::string& path ) {
+            auto tokenizer = std::unique_ptr<Gpt2Tokenizer_old>( new Gpt2Tokenizer_old() );
             if ( !tokenizer->loadFromBinary( path ) ) {
                 return nullptr;
             }
@@ -162,7 +162,7 @@ namespace Mila::Dnn::Data
         }
 
     private:
-        Gpt2Tokenizer() = default;
+        Gpt2Tokenizer_old() = default;
 
        /**
        * @brief Load the tokenizer from the repository binary layout.
Lines changed: 16 additions & 13 deletions
@@ -8,15 +8,18 @@ module;
 #include <memory>
 #include <algorithm>
 
-export module Data.LlamaTokenizer;
+export module Data.LlamaTokenizer_old;
 
 import Data.Tokenizer;
 
 namespace Mila::Dnn::Data
 {
-    export class LlamaTokenizer : public Tokenizer {
+    // DEPRECATED: This is the original LLaMA tokenizer implementation, retained for reference
+    // TODO: Remove this class after the new LLaMA tokenizer is fully implemented and tested.
+
+    export class LlamaTokenizer_old : public Tokenizer {
     public:
-        static std::unique_ptr<LlamaTokenizer> fromFile( const std::string& path );
+        static std::unique_ptr<LlamaTokenizer_old> fromFile( const std::string& path );
 
         std::vector<TokenId> encode( const std::string& text ) override;
         std::string decode( std::span<const TokenId> tokens ) override;
@@ -41,7 +44,7 @@ namespace Mila::Dnn::Data
         bool isValidToken( TokenId tokenId ) const override;
 
     private:
-        LlamaTokenizer() = default;
+        LlamaTokenizer_old() = default;
 
         bool loadFromBinary( const std::string& path );
 
@@ -67,8 +70,8 @@ namespace Mila::Dnn::Data
         bool useByteFallback_{ true };
     };
 
-    std::unique_ptr<LlamaTokenizer> LlamaTokenizer::fromFile( const std::string& path ) {
-        auto tokenizer = std::unique_ptr<LlamaTokenizer>( new LlamaTokenizer() );
+    std::unique_ptr<LlamaTokenizer_old> LlamaTokenizer_old::fromFile( const std::string& path ) {
+        auto tokenizer = std::unique_ptr<LlamaTokenizer_old>( new LlamaTokenizer_old() );
 
         if ( !tokenizer->loadFromBinary( path ) ) {
             return nullptr;
@@ -77,7 +80,7 @@ namespace Mila::Dnn::Data
         return tokenizer;
     }
 
-    bool LlamaTokenizer::loadFromBinary( const std::string& path ) {
+    bool LlamaTokenizer_old::loadFromBinary( const std::string& path ) {
         std::ifstream file( path, std::ios::binary );
 
         if ( !file ) {
@@ -150,16 +153,16 @@ namespace Mila::Dnn::Data
         return true;
     }
 
-    std::string LlamaTokenizer::normalizeText( const std::string& text ) const {
+    std::string LlamaTokenizer_old::normalizeText( const std::string& text ) const {
         return " " + text;
     }
 
-    std::vector<TokenId> LlamaTokenizer::encode( const std::string& text ) {
+    std::vector<TokenId> LlamaTokenizer_old::encode( const std::string& text ) {
         std::string normalized = normalizeText( text );
         return sentencePieceEncode( normalized );
     }
 
-    std::vector<TokenId> LlamaTokenizer::sentencePieceEncode( const std::string& text ) const {
+    std::vector<TokenId> LlamaTokenizer_old::sentencePieceEncode( const std::string& text ) const {
        std::vector<TokenId> result;
        size_t pos = 0;
 
@@ -205,7 +208,7 @@ namespace Mila::Dnn::Data
         return result;
     }
 
-    std::string LlamaTokenizer::decode( std::span<const TokenId> tokens ) {
+    std::string LlamaTokenizer_old::decode( std::span<const TokenId> tokens ) {
         std::string result;
 
         for ( auto tokenId : tokens ) {
@@ -237,12 +240,12 @@ namespace Mila::Dnn::Data
         return result;
     }
 
-    std::string LlamaTokenizer::tokenToString( TokenId tokenId ) const {
+    std::string LlamaTokenizer_old::tokenToString( TokenId tokenId ) const {
         auto it = idToPiece_.find( tokenId );
         return it != idToPiece_.end() ? it->second : "<UNK>";
     }
 
-    bool LlamaTokenizer::isValidToken( TokenId tokenId ) const {
+    bool LlamaTokenizer_old::isValidToken( TokenId tokenId ) const {
         return idToPiece_.contains( tokenId );
     }
 }
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
