We introduce a six-phase curriculum training strategy specifically designed for sparse Mixture-of-Experts architectures. Unlike conventional two-phase training (pre-training then fine-tuning), our approach progressively builds capabilities through distinct learning regimes: Foundation (broad knowledge compression), Specialization (expert differentiation and routing), Alignment (instruction following and format adherence), Safety (harm avoidance with maintained helpfulness), Metacognition (confidence calibration and abstention), and Optimization (inference efficiency without quality degradation). Each phase introduces specific objectives while maintaining stability on previous capabilities. We demonstrate that phase ordering matters critically—training metacognition before alignment, for example, produces 23% worse calibration—and provide detailed data mixture strategies, learning rate schedules, and phase transition criteria validated on our 54B MoE architecture.
The dominant paradigm in LLM training—pre-train on everything, then fine-tune for alignment—is increasingly insufficient for models deployed in life-critical domains. The problem is fundamental: capabilities learned in the second phase (instruction following, safety, helpfulness) must be built on top of capabilities from the first phase (knowledge, reasoning). When these phases interact poorly, you get either a knowledgeable model that won't follow instructions, or an obedient model that hallucinates knowledge it never properly learned.
Our six-phase approach makes the capability progression explicit. Each phase has a single primary objective, clear entry criteria, specific data mixtures, and measurable completion criteria. This eliminates the ambiguity of "did we train long enough?" that plagues two-phase approaches and provides a roadmap for systematic capability building.
The key insight is that human cognitive development follows a similar progression: foundational knowledge → specialized expertise → social alignment → ethical reasoning → self-awareness → efficiency. Our training curriculum mirrors this developmental arc, with each phase building on the structural changes established by previous phases.
Compress a broad world model into the shared parameters (attention layers, embeddings, routers) and establish initial expert differentiation through load-balanced routing.
| Domain | Proportion | Source Type | Quality Filter |
|---|---|---|---|
| Web text (deduplicated) | 45% | Common Crawl, filtered | Perplexity < 50, dedup 13-gram |
| Code (multi-language) | 18% | GitHub, filtered | Stars ≥ 3, compilable, licensed |
| Scientific literature | 12% | arXiv, PubMed, S2ORC | Peer-reviewed or ≥5 citations |
| Books (diverse) | 10% | Project Gutenberg, academic | Full text, no OCR artifacts |
| Multilingual (pt-BR focus) | 8% | CC-100, OSCAR, curated PT | Language ID > 0.95 |
| Mathematics (formal) | 4% | ProofPile, MATH, MathPile | Verified solutions |
| Structured (tables, JSON) | 3% | Wikipedia tables, APIs | Parseable, non-trivial |
Phase 1 is complete when: (a) validation loss plateaus (< 0.1% improvement per billion tokens over 100B tokens), (b) every one of the 9 experts receives between 8% and 14% of routed tokens (within ±3 percentage points of the uniform 11.1% share), and (c) the model achieves ≥ 70% on a held-out general knowledge evaluation set.
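As an illustration, the three criteria can be expressed as a simple gating check; the function and argument names below are hypothetical, not taken from our training code, and the thresholds mirror the text:

```python
# Hedged sketch of the Phase 1 completion check. All names are illustrative.

def check_phase1_complete(loss_history, expert_loads, knowledge_acc):
    """loss_history: validation loss sampled every 1B tokens (last 100 entries
    span the 100B-token plateau window).
    expert_loads: fraction of routed tokens per expert (length 9, sums to ~1.0).
    knowledge_acc: accuracy on the held-out general-knowledge set."""
    # (a) plateau: < 0.1% relative improvement per billion tokens over 100B tokens
    window = loss_history[-100:]
    per_billion_improvement = (window[0] - window[-1]) / window[0] / len(window)
    plateaued = per_billion_improvement < 0.001
    # (b) every expert receives 8-14% of routed tokens
    balanced = all(0.08 <= load <= 0.14 for load in expert_loads)
    # (c) >= 70% on the general-knowledge evaluation
    knowledgeable = knowledge_acc >= 0.70
    return plateaued and balanced and knowledgeable
```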
Transform the uniformly-utilized experts into genuinely specialized sub-networks with distinct knowledge domains. After Phase 2, each expert should "own" a recognizable capability cluster.
Phase 2 uses domain-concentrated batches: instead of uniformly sampling from all domains, each batch is 70% concentrated in a single domain. This creates routing pressure—the router learns that certain experts are better suited to certain inputs—without the instability of completely domain-separated training.
| Batch Type | Primary Domain (70%) | Mixed (30%) | Target Expert |
|---|---|---|---|
| Code-heavy | Multi-language programming | General | Expert 1, 2 |
| Science-heavy | Scientific + math | General | Expert 3, 4 |
| Language-heavy | Literary + multilingual | General | Expert 5, 6 |
| Reasoning-heavy | Logic + structured | General | Expert 7, 8 |
| General | Web + mixed | All domains | Expert 9 (generalist) |
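A minimal sketch of this 70/30 sampling scheme (the domain names and pool layout are assumptions for illustration):

```python
import random

# Illustrative domain labels; the real pipeline's domain taxonomy may differ.
DOMAINS = ["code", "science", "language", "reasoning", "general"]

def sample_batch(pools, primary, batch_size, concentration=0.7, rng=random):
    """Draw a batch that is ~70% from `primary` and ~30% mixed across all
    domains, matching the domain-concentrated batching described above."""
    n_primary = int(batch_size * concentration)
    batch = [rng.choice(pools[primary]) for _ in range(n_primary)]
    for _ in range(batch_size - n_primary):
        batch.append(rng.choice(pools[rng.choice(DOMAINS)]))
    rng.shuffle(batch)  # avoid ordering effects within the batch
    return batch
```

The 30% mixed remainder keeps every expert exposed to out-of-domain tokens, which is what creates routing pressure without full domain separation.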
We measure specialization through routing entropy per input domain. At the start of Phase 2, routing entropy is high (uniform routing over 9 experts gives H = ln 9 ≈ 2.2 nats). By the end, domain-specific inputs have routing entropy of 0.8-1.2 nats (concentrated on 2-3 experts), while general queries maintain H ≈ 1.8 nats (a broader distribution).
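The diagnostic itself is straightforward; a minimal natural-log version, under which uniform routing over 9 experts gives ln 9 ≈ 2.2:

```python
import math

def routing_entropy(expert_probs):
    """Shannon entropy (natural log) of a routing distribution over experts."""
    return -sum(p * math.log(p) for p in expert_probs if p > 0)
```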
A critical risk in Phase 2 is that experts become so specialized they can no longer contribute to cross-domain reasoning. The 30% mixed-domain component of every batch is the primary safeguard: it guarantees that each expert keeps receiving gradient signal from general inputs throughout specialization.
Transform the model from a next-token predictor into an instruction-following assistant that produces structured, helpful responses.
Phase 3 uses Supervised Fine-Tuning (SFT) on high-quality instruction-response pairs, followed by DPO (Direct Preference Optimization) on human preference data.
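For reference, the per-pair DPO objective looks like the following. This is the published DPO formulation rather than code from our pipeline, and the log-probabilities are stand-in scalars:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * implicit reward margin), where
    the margin compares the policy's log-prob gain over the reference model on
    the chosen response versus the rejected one."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When the policy and reference agree, the margin is zero and the loss sits at log 2; preferring the chosen response more than the reference does drives the loss below that.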
A well-documented problem in alignment is the "alignment tax": models become less capable after RLHF/DPO because they learn to avoid uncertainty rather than maintain broad knowledge. We mitigate this by tracking Phase 1 knowledge benchmarks throughout alignment and bounding the acceptable regression in the phase's completion criteria.
Phase 3 is complete when: (a) instruction-following accuracy ≥ 95% on IFEval benchmark, (b) no degradation > 2% on Phase 1 knowledge benchmarks (MMLU, ARC, HellaSwag), and (c) DPO win rate ≥ 70% against the pre-DPO checkpoint on held-out preference data.
Establish robust harm avoidance without sacrificing helpfulness. The model must refuse genuinely harmful requests while remaining maximally helpful for ambiguous-but-legitimate queries.
Most safety training makes models overly cautious, refusing to discuss medical symptoms, legal questions, or security concepts because they superficially resemble harmful queries. Phase 4 explicitly optimizes for the safety-helpfulness frontier by adding an overrefusal penalty to the training objective.
The overrefusal penalty term penalizes the model for refusing legitimate queries in professional domains (medical, legal, security). This is critical for life-critical applications where refusing to engage is itself harmful.
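One way to express such a penalty, purely as an illustration: the weights and the binary refusal/harm labels below are assumptions, not our actual objective.

```python
def safety_loss(base_loss, refused, is_harmful,
                overrefusal_weight=2.0, underrefusal_weight=5.0):
    """Add a penalty for refusing a legitimate query (overrefusal) or for
    complying with a harmful one (underrefusal); correct decisions add
    nothing. Weight values here are illustrative."""
    if refused and not is_harmful:
        return base_loss + overrefusal_weight
    if not refused and is_harmful:
        return base_loss + underrefusal_weight
    return base_loss
```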
Post-Phase 4 safety metrics:
Train the metacognition heads to accurately estimate per-token confidence and trigger abstention when the model lacks sufficient knowledge or certainty. This is the critical phase for life-critical deployment.
Metacognition training requires a model that already has: (a) stable knowledge representations (Phase 1-2), (b) consistent output formats (Phase 3), and (c) calibrated behavior under adversarial pressure (Phase 4). Attempting metacognition on an unaligned model produces unreliable confidence estimates because the model's behavior is itself unstable.
We verify this through ablation: training metacognition after Phase 2 (skipping alignment and safety) produces confidence estimates with 23% worse calibration error than the full sequence. The model cannot accurately judge "what it knows" when "what it does" is still changing.
The metacognition architecture adds lightweight probe heads (a single linear layer each) at layers 18, 36, and 54 that predict a scalar confidence value. Training data is constructed by generating model outputs on questions with verifiable answers and labeling the hidden states at each probe layer according to whether the resulting output was correct.
This creates a direct mapping: internal representation patterns → likelihood of correct output. The probes learn to detect the "signature" of uncertain or hallucinated generation in the model's hidden states.
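A toy version of one probe head, assuming hidden states are available as plain vectors; the class name, dimensions, and training loop are illustrative:

```python
import math
import random

# Minimal sketch of a single-linear-layer confidence probe: a dot product over
# a frozen hidden state plus a sigmoid, trained as logistic regression on
# correct/incorrect labels. Names and dimensions are assumptions.

class ConfidenceProbe:
    def __init__(self, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.gauss(0.0, 0.01) for _ in range(hidden_dim)]
        self.b = 0.0

    def predict(self, h):
        """Scalar confidence in (0, 1) from a hidden-state vector h."""
        z = sum(wi * hi for wi, hi in zip(self.w, h)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def train_step(self, h, correct, lr=0.1):
        """One logistic-regression gradient step toward the binary
        'was the model's output verified correct' label."""
        grad = self.predict(h) - (1.0 if correct else 0.0)
        self.w = [wi - lr * grad * hi for wi, hi in zip(self.w, h)]
        self.b -= lr * grad
```

Because the base model is frozen during this phase, each probe only learns to read the uncertainty signature already present in its layer's representations.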
For medical/legal domains, false confidence (saying something wrong with high confidence) is far worse than false uncertainty (abstaining on a question the model could have answered). We implement this asymmetry as a loss penalty that grows with stated confidence whenever the output is wrong.
The 10× penalty for confident-but-wrong in medical contexts ensures the model strongly biases toward abstention when uncertain about medical facts.
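The asymmetry might be sketched as a per-example penalty like this; the 0.8 confidence threshold and the penalty shape are assumptions, and only the 10× multiplier comes from the text:

```python
def calibration_penalty(confidence, correct, domain,
                        confident_wrong_mult=10.0, confidence_threshold=0.8):
    """Penalize errors in proportion to stated confidence, with a 10x
    multiplier for high-confidence errors in medical contexts; being right
    with low confidence incurs only a mild underconfidence penalty."""
    if correct:
        return 1.0 - confidence
    penalty = confidence
    if domain == "medical" and confidence >= confidence_threshold:
        penalty *= confident_wrong_mult
    return penalty
```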
Prepare the model for production inference: extend context length, improve generation efficiency, and reduce latency—all without degrading capabilities established in Phases 1-5.
Phases 1-5 train at a 4,096-token context length for efficiency; Phase 6 extends this to 16,384 tokens through progressive context lengthening.
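One plausible schedule doubles the context at each stage; the intermediate lengths and stage count are assumptions, since the text fixes only the 4,096 and 16,384 endpoints:

```python
def context_schedule(start=4096, target=16384):
    """Stage lengths for progressive context extension: double until the
    target length is reached. Endpoints from the text; doubling is assumed."""
    lengths, length = [], start
    while length < target:
        length *= 2
        lengths.append(length)
    return lengths
```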
Phase 6 includes training a small "draft" model (1.3B parameters, same vocabulary) that shares the main model's embedding layer. This draft model is used for speculative decoding at inference time, proposing 4-8 candidate tokens that the main model verifies in parallel. This provides 2-3× speedup for batch-size-1 inference.
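The control flow of draft-then-verify decoding can be sketched as below. This shows greedy token-matching acceptance only, not the full rejection-sampling acceptance rule, and both models are stand-in callables:

```python
def speculative_decode(draft_next, main_next, prompt, max_new, k=4):
    """Draft/verify loop: the draft proposes k tokens; the main model keeps
    the longest agreeing prefix and supplies its own token at the first
    mismatch. (A real implementation scores all k proposals in one parallel
    forward pass; verification is written sequentially here for clarity.)"""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft model proposes k candidate tokens autoregressively (cheap).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Main model verifies: accept matches, correct the first mismatch.
        for t in proposal:
            verified = main_next(seq)
            if verified == t:
                seq.append(t)
            else:
                seq.append(verified)
                break
    return seq[:len(prompt) + max_new]
```

When the draft agrees with the main model, each verification pass commits up to k tokens at once; that agreement rate is what yields the reported 2-3× batch-size-1 speedup.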
Phase 6 evaluates which experts contribute least to final output quality through systematic ablation. While we don't prune in the current release (all 9 experts are retained), this analysis informs future distilled versions:
| Expert Removed | Quality Change (avg) | Most Affected Domain |
|---|---|---|
| Expert 1 (Code-primary) | -4.2% | Programming (-11.3%) |
| Expert 3 (Science-primary) | -3.8% | STEM reasoning (-9.7%) |
| Expert 5 (Language-primary) | -2.1% | Translation (-7.4%) |
| Expert 9 (Generalist) | -5.6% | All domains (-3% to -8%) |
Our six-phase sequence is not arbitrary—it encodes dependency relationships between capabilities. We validate this through controlled ablations:
| Ordering Variant | Final Quality | Safety Score | Calibration Error |
|---|---|---|---|
| Full sequence (1→2→3→4→5→6) | 98.2% | 99.8% | 2.1% |
| Skip Phase 2 (no specialization) | 91.4% | 99.5% | 3.8% |
| Phase 5 before Phase 4 (meta before safety) | 96.8% | 97.2% | 5.4% |
| Phase 5 before Phase 3 (meta before alignment) | 94.1% | 96.8% | 7.9% |
| Two-phase (pre-train + RLHF only) | 88.3% | 94.2% | 12.4% |
| Phase 4 before Phase 3 (safety before alignment) | 93.7% | 99.6% | 4.1% |
Key findings: (a) Metacognition requires both alignment AND safety as prerequisites—skipping either degrades calibration. (b) Specialization (Phase 2) provides 7% absolute quality gain. (c) Traditional two-phase training is 10 points below our full sequence on quality and 6× worse on calibration.
| Phase | Tokens | % of Total | Primary Cost |
|---|---|---|---|
| Phase 1: Foundation | 2.8T | 90.8% | Compute (GPU hours) |
| Phase 2: Specialization | 150B | 4.9% | Compute |
| Phase 3: Alignment | 35B | 1.1% | Human annotation |
| Phase 4: Safety | 12B | 0.4% | Red-team expertise |
| Phase 5: Metacognition | 8B | 0.3% | Expert verification |
| Phase 6: Optimization | 80B | 2.6% | Long-context data curation |
| Total | ~3.08T | 100% | — |
Phase 1 dominates in tokens (over 90% of the total), but Phases 3-5 dominate in human cost per token: each requires expert-verified, curated data that costs 100-1000× more per example than web-crawled text.
Each phase transition follows a strict protocol designed to prevent catastrophic forgetting.
We observe that phase transitions create temporary loss increases (the new objective conflicts briefly with the previous regime) that resolve within 2-5% of phase duration. This is expected and healthy—it indicates the model is adapting its representations for the new capability without being so constrained that learning is blocked.
The six-phase training curriculum represents a deliberate, principled approach to building capable, safe, and self-aware language models. By making the capability progression explicit and sequential—knowledge before specialization, specialization before alignment, alignment before safety, safety before metacognition—we avoid the capability conflicts and calibration failures that plague conventional two-phase training. The result is a model that not only performs at frontier quality but knows its own limitations and communicates them clearly to users—the minimum requirement for deployment in domains where errors have real-world consequences.