Lua Vision Research · Training

6-Phase NCAS Training: A Curriculum Strategy for Progressive Capability Acquisition in Sparse Expert Models

Paulo Câmara, David Kang, Plínio Ceccon
Lua Vision Tecnologia · São Paulo, Brazil
Published: November 2025 · Revised: March 2026
Abstract

We introduce a six-phase curriculum training strategy specifically designed for sparse Mixture-of-Experts architectures. Unlike conventional two-phase training (pre-training then fine-tuning), our approach progressively builds capabilities through distinct learning regimes: Foundation (broad knowledge compression), Specialization (expert differentiation and routing), Alignment (instruction following and format adherence), Safety (harm avoidance with maintained helpfulness), Metacognition (confidence calibration and abstention), and Optimization (inference efficiency without quality degradation). Each phase introduces specific objectives while maintaining stability on previous capabilities. We demonstrate that phase ordering matters critically—training metacognition before alignment, for example, produces 23% worse calibration—and provide detailed data mixture strategies, learning rate schedules, and phase transition criteria validated on our 54B MoE architecture.

1. Introduction: Why Six Phases?

The dominant paradigm in LLM training—pre-train on everything, then fine-tune for alignment—is increasingly insufficient for models deployed in life-critical domains. The problem is fundamental: capabilities learned in the second phase (instruction following, safety, helpfulness) must be built on top of capabilities from the first phase (knowledge, reasoning). When these phases interact poorly, you get either a knowledgeable model that won't follow instructions, or an obedient model that hallucinates knowledge it never properly learned.

Our six-phase approach makes the capability progression explicit. Each phase has a single primary objective, clear entry criteria, specific data mixtures, and measurable completion criteria. This eliminates the ambiguity of "did we train long enough?" that plagues two-phase approaches and provides a roadmap for systematic capability building.

The key insight is that human cognitive development follows a similar progression: foundational knowledge → specialized expertise → social alignment → ethical reasoning → self-awareness → efficiency. Our training curriculum mirrors this developmental arc, with each phase building on the structural changes established by previous phases.

2. Phase 1: Foundation

2.1 Objective

Compress a broad world model into the shared parameters (attention layers, embeddings, routers) and establish initial expert differentiation through load-balanced routing.

2.2 Data Composition

| Domain | Proportion | Source Type | Quality Filter |
|---|---|---|---|
| Web text (deduplicated) | 45% | Common Crawl, filtered | Perplexity < 50, 13-gram dedup |
| Code (multi-language) | 18% | GitHub, filtered | Stars ≥ 3, compilable, licensed |
| Scientific literature | 12% | arXiv, PubMed, S2ORC | Peer-reviewed or ≥ 5 citations |
| Books (diverse) | 10% | Project Gutenberg, academic | Full text, no OCR artifacts |
| Multilingual (pt-BR focus) | 8% | CC-100, OSCAR, curated PT | Language ID > 0.95 |
| Mathematics (formal) | 4% | ProofPile, MATH, MathPile | Verified solutions |
| Structured (tables, JSON) | 3% | Wikipedia tables, APIs | Parseable, non-trivial |

2.3 Training Configuration
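
A hypothetical sketch of what such a configuration could contain; entries grounded elsewhere in this paper cite their section, and everything marked as an assumption is illustrative rather than the configuration actually used:

```python
# Hypothetical Phase 1 configuration sketch; values marked "assumption"
# are illustrative and not taken from the paper.
PHASE1_CONFIG = {
    "total_tokens": 2.8e12,            # Section 9.1
    "context_length": 4096,            # Section 7.2
    "num_experts": 9,                  # Section 2.4
    "router_top_k": 2,                 # assumption: typical top-2 routing
    "peak_lr": 3e-4,                   # assumption
    "lr_schedule": "cosine",           # assumption
    "warmup_steps": 2000,              # assumption
    "aux_load_balance_weight": 0.01,   # assumption: standard MoE aux loss
}
```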

2.4 Completion Criteria

Phase 1 is complete when: (a) validation loss plateaus (< 0.1% improvement per billion tokens over 100B tokens), (b) all 9 experts receive between 8% and 14% of routed tokens (the uniform share of 11.1% ± 3 percentage points), and (c) the model achieves ≥ 70% on a held-out general knowledge evaluation set.
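
For concreteness, criterion (b) as an executable check (helper name hypothetical):

```python
import numpy as np

def experts_balanced(token_counts, lo=0.08, hi=0.14):
    """Criterion (b): every expert receives 8-14% of routed tokens.

    token_counts: array of shape (9,) with tokens routed to each expert
    over a validation pass. Thresholds follow Section 2.4.
    """
    shares = np.asarray(token_counts) / np.sum(token_counts)
    return bool(np.all((shares >= lo) & (shares <= hi)))

# Example: near-uniform routing (uniform share is 1/9, about 11.1%) passes.
print(experts_balanced([100, 95, 110, 105, 98, 102, 99, 108, 103]))  # True
```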

3. Phase 2: Specialization

3.1 Objective

Transform the uniformly-utilized experts into genuinely specialized sub-networks with distinct knowledge domains. After Phase 2, each expert should "own" a recognizable capability cluster.

3.2 Data Strategy

Phase 2 uses domain-concentrated batches: instead of uniformly sampling from all domains, each batch is 70% concentrated in a single domain. This creates routing pressure—the router learns that certain experts are better suited to certain inputs—without the instability of completely domain-separated training. A sampling sketch follows the table below.

| Batch Type | Primary Domain (70%) | Mixed (30%) | Target Experts |
|---|---|---|---|
| Code-heavy | Multi-language programming | General | Experts 1, 2 |
| Science-heavy | Scientific + math | General | Experts 3, 4 |
| Language-heavy | Literary + multilingual | General | Experts 5, 6 |
| Reasoning-heavy | Logic + structured | General | Experts 7, 8 |
| General | Web + mixed | All domains | Expert 9 (generalist) |
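
A minimal sketch of the 70/30 batch construction in the table above, assuming one example stream per domain (all names hypothetical):

```python
import random

def domain_concentrated_batch(primary_domain, streams, batch_size=32,
                              primary_frac=0.70):
    """Build one Phase 2 batch: 70% from the primary domain, 30% mixed.

    streams: dict mapping domain name -> iterator over training examples.
    """
    n_primary = int(batch_size * primary_frac)
    batch = [next(streams[primary_domain]) for _ in range(n_primary)]
    other_domains = [d for d in streams if d != primary_domain]
    batch += [next(streams[random.choice(other_domains)])
              for _ in range(batch_size - n_primary)]
    random.shuffle(batch)  # avoid position/domain correlation within the batch
    return batch
```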

3.3 Expert Differentiation Metrics

We measure specialization through the routing entropy per input domain. At the start of Phase 2, routing entropy is high (H ≈ 2.2 bits for 9 experts, against a uniform-distribution maximum of log₂ 9 ≈ 3.17 bits). By the end, domain-specific inputs have routing entropy of 0.8-1.2 bits (concentrated on 2-3 experts), while general queries maintain H ≈ 1.8 bits (a broader distribution).
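
A sketch of the metric, assuming it is the entropy of the mean routing distribution over a domain's inputs (the paper does not say whether per-token entropies are averaged instead):

```python
import numpy as np

def routing_entropy(router_probs):
    """Shannon entropy (bits) of the mean routing distribution.

    router_probs: array (num_tokens, num_experts) of per-token routing
    probabilities for inputs from one domain. The maximum for 9 experts
    is log2(9), about 3.17 bits; Section 3.3 reports 0.8-1.2 bits for
    domain-specific inputs after Phase 2.
    """
    p = np.asarray(router_probs, dtype=float).mean(axis=0)
    p = p / p.sum()
    nz = p[p > 0]  # drop zero entries so log2 stays finite
    return float(-(nz * np.log2(nz)).sum())
```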

3.4 Preventing Catastrophic Specialization

A critical risk in Phase 2 is that experts become so specialized they can no longer contribute to cross-domain reasoning. We prevent this through the 30% mixed-domain component of every batch (Section 3.2) and the cross-phase replay described in Section 10.1, both of which keep each expert exposed to out-of-domain tokens.

4. Phase 3: Alignment

4.1 Objective

Transform the model from a next-token predictor into an instruction-following assistant that produces structured, helpful responses.

4.2 Training Approach

Phase 3 uses Supervised Fine-Tuning (SFT) on high-quality instruction-response pairs, followed by DPO (Direct Preference Optimization) on human preference data.
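
SFT itself is standard next-token training on the instruction data, so only the preference stage is sketched here. The DPO objective trains the policy to prefer chosen over rejected responses relative to a frozen reference model (the β value below is a common default, not the paper's setting):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss (Rafailov et al., 2023).

    Each argument is a tensor of summed log-probabilities of the chosen /
    rejected response under the policy or the frozen reference model.
    beta controls how far the policy may drift from the reference.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```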

4.3 Alignment Tax and Mitigation

A well-documented problem in alignment is the "alignment tax": models become less capable after RLHF/DPO because they learn to avoid uncertainty rather than to maintain broad knowledge. Our mitigations reuse the stability mechanisms of Section 10.1 (replaying data from earlier phases during SFT/DPO and keeping the specialized experts frozen), enforced by the ≤ 2% knowledge-regression gate in the completion criteria below.

4.4 Completion Criteria

Phase 3 is complete when: (a) instruction-following accuracy ≥ 95% on IFEval benchmark, (b) no degradation > 2% on Phase 1 knowledge benchmarks (MMLU, ARC, HellaSwag), and (c) DPO win rate ≥ 70% against the pre-DPO checkpoint on held-out preference data.

5. Phase 4: Safety

5.1 Objective

Establish robust harm avoidance without sacrificing helpfulness. The model must refuse genuinely harmful requests while remaining maximally helpful for ambiguous-but-legitimate queries.

5.2 The Safety-Helpfulness Frontier

Most safety training makes models overly cautious—refusing to discuss medical symptoms, legal questions, or security concepts because they superficially resemble harmful queries. Our Phase 4 explicitly optimizes for the safety-helpfulness frontier:

L_safety = L_harm_avoidance + β × L_overrefusal_penalty + γ × L_domain_helpfulness

The overrefusal penalty term penalizes the model for refusing legitimate queries in professional domains (medical, legal, security). This is critical for life-critical applications where refusing to engage is itself harmful.
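
A hedged sketch of how these terms could combine (the refusal probabilities and legitimacy labels are hypothetical inputs; the paper does not specify β, γ, or the implementation):

```python
import torch

def safety_loss(l_harm, refusal_prob, legitimate_mask, l_helpfulness,
                beta=0.5, gamma=0.5):
    """Combined Phase 4 objective from Section 5.2.

    l_harm: harm-avoidance loss on adversarial prompts.
    refusal_prob: (batch,) probability that the model refused each query,
        e.g. from a refusal classifier over sampled responses (hypothetical).
    legitimate_mask: (batch,) 1.0 for legitimate professional-domain
        queries (medical, legal, security), 0.0 otherwise.
    l_helpfulness: helpfulness loss on professional-domain responses.
    beta, gamma: weights; the paper's values are not stated.
    """
    # Penalize refusals only on queries labeled legitimate.
    denom = legitimate_mask.sum().clamp(min=1)
    l_overrefusal = (refusal_prob * legitimate_mask).sum() / denom
    return l_harm + beta * l_overrefusal + gamma * l_helpfulness
```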

5.3 Data Composition

5.4 Results

Post-Phase 4 safety metrics for the full sequence (a 99.8% safety score, alongside comparisons across phase orderings) are reported with the ablation results in Section 8.

6. Phase 5: Metacognition

6.1 Objective

Train the metacognition heads to accurately estimate per-token confidence and trigger abstention when the model lacks sufficient knowledge or certainty. This is the critical phase for life-critical deployment.

6.2 Why Phase 5 Requires Phases 1-4

Metacognition training requires a model that already has: (a) stable knowledge representations (Phase 1-2), (b) consistent output formats (Phase 3), and (c) calibrated behavior under adversarial pressure (Phase 4). Attempting metacognition on an unaligned model produces unreliable confidence estimates because the model's behavior is itself unstable.

We verify this through ablation: training metacognition after Phase 2 (skipping alignment and safety) produces confidence estimates with 23% worse calibration error than the full sequence. The model cannot accurately judge "what it knows" when "what it does" is still changing.

6.3 Training the Metacognition Heads

The metacognition architecture adds lightweight probe heads (single linear layer) at layers 18, 36, and 54 that predict a scalar confidence value. Training data is constructed by:

  1. Running the model on QA pairs with known ground-truth answers
  2. Labeling each response as "correct" or "incorrect" based on semantic matching
  3. Training the probe heads to predict correctness from the hidden states that produced each response

This creates a direct mapping: internal representation patterns → likelihood of correct output. The probes learn to detect the "signature" of uncertain or hallucinated generation in the model's hidden states.
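
A minimal sketch of steps 1-3, assuming the hidden states that produced each answer have already been collected (all names hypothetical):

```python
import torch
import torch.nn as nn

class ConfidenceProbe(nn.Module):
    """Single linear layer predicting P(correct) from a hidden state."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden):  # hidden: (batch, d_model)
        return torch.sigmoid(self.proj(hidden)).squeeze(-1)

def train_probe(probe, hidden_states, correct_labels, epochs=3, lr=1e-3):
    """hidden_states: (N, d_model) states from layer 18/36/54 that produced
    each answer; correct_labels: (N,) 1.0 if the answer matched ground
    truth (step 2 above), else 0.0. The backbone stays frozen."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(probe(hidden_states), correct_labels)
        loss.backward()
        opt.step()
    return probe
```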

6.4 Asymmetric Calibration Loss

For medical/legal domains, false confidence (saying something wrong with high confidence) is far worse than false uncertainty (abstaining on a question you could answer). We implement this asymmetry:

L_meta = w₊ × CE(ĉ, 1 | correct) + w₋ × CE(ĉ, 0 | incorrect)
w₊ = 1.0, w₋ = 10.0 (medical), w₋ = 8.0 (legal), w₋ = 2.0 (general)

The 10× penalty for confident-but-wrong in medical contexts ensures the model strongly biases toward abstention when uncertain about medical facts.
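
This maps directly onto a weighted binary cross-entropy over the probe's confidence output; a minimal sketch with the domain weights above:

```python
import torch

DOMAIN_W_NEG = {"medical": 10.0, "legal": 8.0, "general": 2.0}

def asymmetric_calibration_loss(conf, correct, domain, w_pos=1.0):
    """L_meta = w+ * CE(c_hat, 1 | correct) + w- * CE(c_hat, 0 | incorrect).

    conf: (N,) predicted confidence c_hat in (0, 1).
    correct: (N,) 1.0 where the model's answer was right, 0.0 otherwise.
    domain: key into DOMAIN_W_NEG selecting the false-confidence penalty.
    """
    w_neg = DOMAIN_W_NEG[domain]
    eps = 1e-7
    conf = conf.clamp(eps, 1 - eps)
    ce_pos = -torch.log(conf)      # push confidence up when correct
    ce_neg = -torch.log(1 - conf)  # push confidence down when incorrect
    loss = torch.where(correct.bool(), w_pos * ce_pos, w_neg * ce_neg)
    return loss.mean()
```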

7. Phase 6: Optimization

7.1 Objective

Prepare the model for production inference: extend context length, improve generation efficiency, and reduce latency—all without degrading capabilities established in Phases 1-5.

7.2 Context Extension

Phases 1-5 train at a 4,096-token context length for efficiency. Phase 6 extends this to 16,384 tokens through progressive context lengthening, growing the training context in stages rather than in a single jump.
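
The paper does not give the stage boundaries or the position-encoding scheme; the sketch below assumes rotary position embeddings (RoPE) and the common recipe of raising the RoPE base as the context doubles. Only the 4,096 start, the 16,384 target, and the 80B Phase 6 token budget come from the paper:

```python
import torch

# Hypothetical progressive-lengthening schedule; the stage split, RoPE
# usage, and base values are assumptions.
STAGES = [
    # (context_length, rope_base, training_tokens)
    (8192,  100_000, 40_000_000_000),   # stage 1: double the context
    (16384, 500_000, 40_000_000_000),   # stage 2: reach the target length
]

def rope_inv_freq(head_dim, base):
    """Inverse frequencies for rotary position embeddings at a given base."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
```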

7.3 Speculative Decoding Preparation

Phase 6 includes training a small "draft" model (1.3B parameters, same vocabulary) that shares the main model's embedding layer. This draft model is used for speculative decoding at inference time, proposing 4-8 candidate tokens that the main model verifies in parallel. This provides 2-3× speedup for batch-size-1 inference.
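
A minimal sketch of the draft-and-verify loop, shown greedy for clarity (the paper does not specify the acceptance rule; main_model and draft_model are assumed to return HF-style .logits):

```python
import torch

@torch.no_grad()
def speculative_step(main_model, draft_model, input_ids, k=6):
    """One speculative-decoding step: the draft model proposes k tokens,
    the main model verifies them in a single parallel forward pass."""
    # 1. Draft proposes k tokens autoregressively (cheap, 1.3B model).
    seq = input_ids
    for _ in range(k):
        next_tok = draft_model(seq).logits[:, -1].argmax(-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=-1)
    proposed = seq[:, input_ids.shape[1]:]                 # (1, k)

    # 2. Main model scores the proposed run in one pass. The prediction
    #    for proposed token i lives at position input_len - 1 + i.
    logits = main_model(seq).logits
    start = input_ids.shape[1] - 1
    verified = logits[:, start:start + k + 1].argmax(-1)   # (1, k+1)

    # 3. Accept the longest prefix where draft and main model agree,
    #    then append one main-model token (correction or extension).
    matches = (verified[:, :k] == proposed).long().cumprod(dim=-1)
    n_accept = int(matches.sum())
    return torch.cat([input_ids,
                      proposed[:, :n_accept],
                      verified[:, n_accept:n_accept + 1]], dim=-1)
```

Each step emits between 1 and k+1 tokens for roughly one main-model forward pass, which is how 4-8 proposals per step can yield the reported 2-3× batch-size-1 speedup.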

7.4 Expert Pruning Analysis

Phase 6 evaluates which experts contribute least to final output quality through systematic ablation. While we don't prune in the current release (all 9 experts are retained), this analysis informs future distilled versions:

| Expert Removed | Quality Loss (avg) | Most Affected Domain |
|---|---|---|
| Expert 1 (Code-primary) | -4.2% | Programming (-11.3%) |
| Expert 3 (Science-primary) | -3.8% | STEM reasoning (-9.7%) |
| Expert 5 (Language-primary) | -2.1% | Translation (-7.4%) |
| Expert 9 (Generalist) | -5.6% | All domains (-3% to -8%) |

8. Phase Ordering: Ablation Studies

8.1 Why Order Matters

Our six-phase sequence is not arbitrary—it encodes dependency relationships between capabilities. We validate this through controlled ablations:

| Ordering Variant | Final Quality | Safety Score | Calibration Error |
|---|---|---|---|
| Full sequence (1→2→3→4→5→6) | 98.2% | 99.8% | 2.1% |
| Skip Phase 2 (no specialization) | 91.4% | 99.5% | 3.8% |
| Phase 5 before Phase 4 (metacognition before safety) | 96.8% | 97.2% | 5.4% |
| Phase 5 before Phase 3 (metacognition before alignment) | 94.1% | 96.8% | 7.9% |
| Two-phase (pre-train + RLHF only) | 88.3% | 94.2% | 12.4% |
| Phase 4 before Phase 3 (safety before alignment) | 93.7% | 99.6% | 4.1% |

Key findings: (a) Metacognition requires both alignment AND safety as prerequisites—skipping either degrades calibration. (b) Specialization (Phase 2) provides 7% absolute quality gain. (c) Traditional two-phase training is 10 points below our full sequence on quality and 6× worse on calibration.

9. Data Efficiency

9.1 Total Token Budget

| Phase | Tokens | % of Total | Primary Cost |
|---|---|---|---|
| Phase 1: Foundation | 2.8T | 90.8% | Compute (GPU hours) |
| Phase 2: Specialization | 150B | 4.9% | Compute |
| Phase 3: Alignment | 35B | 1.1% | Human annotation |
| Phase 4: Safety | 12B | 0.4% | Red-team expertise |
| Phase 5: Metacognition | 8B | 0.3% | Expert verification |
| Phase 6: Optimization | 80B | 2.6% | Long-context data curation |
| Total | ~3.08T | 100% | |

Phase 1 dominates the token count (roughly 91%), but Phases 3-5 dominate human cost per token: each requires expert-verified, curated data that costs 100-1000× more per example than web-crawled text.

10. Stability and Phase Transitions

10.1 Transition Protocols

Each phase transition follows a strict protocol to prevent catastrophic forgetting (a sketch of the warmup and replay mechanics follows the list):

  1. Checkpoint validation: Run the complete evaluation suite before transitioning. The new phase cannot begin until the model meets all completion criteria for the current phase.
  2. Learning rate reset: Each phase begins with a short warmup (100-500 steps) from 0 to the phase-specific peak learning rate.
  3. Selective unfreezing: Not all parameters are trained in all phases. Phase 5 only trains the metacognition probe heads. Phase 3 freezes specialized experts.
  4. Replay buffers: Each phase includes 5-10% data from previous phases to maintain established capabilities.
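
A minimal sketch of steps 2 and 4 above (names hypothetical; the 300-step warmup and 7% replay rate are illustrative values inside the stated ranges):

```python
import random

def warmup_lr(step, peak_lr, warmup_steps=300):
    """Step 2: linear warmup from 0 to the phase-specific peak learning
    rate over 100-500 steps (Section 10.1)."""
    return peak_lr * min(1.0, step / warmup_steps)

def sample_with_replay(current_phase_data, replay_data, replay_frac=0.07):
    """Step 4: mix 5-10% previous-phase data into each draw."""
    if random.random() < replay_frac:
        return random.choice(replay_data)
    return random.choice(current_phase_data)
```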

10.2 Loss Landscape Analysis

We observe that phase transitions create temporary loss increases (the new objective conflicts briefly with the previous regime) that resolve within 2-5% of phase duration. This is expected and healthy—it indicates the model is adapting its representations for the new capability without being so constrained that learning is blocked.

11. Conclusion

The six-phase training curriculum represents a deliberate, principled approach to building capable, safe, and self-aware language models. By making the capability progression explicit and sequential—knowledge before specialization, specialization before alignment, alignment before safety, safety before metacognition—we avoid the capability conflicts and calibration failures that plague conventional two-phase training. The result is a model that not only performs at frontier quality but knows its own limitations and communicates them clearly to users—the minimum requirement for deployment in domains where errors have real-world consequences.

References

  1. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
  2. Rafailov, R., et al. (2023). Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2023.
  3. Touvron, H., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
  4. Jiang, A. Q., et al. (2024). Mixtral of experts. arXiv:2401.04088.
  5. Bengio, Y., et al. (2009). Curriculum learning. ICML 2009.
  6. Bai, Y., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862.
  7. Kadavath, S., et al. (2022). Language models (mostly) know what they know. arXiv:2207.05221.
  8. Lin, B. Y., et al. (2024). The unlocking spell on base LLMs: Rethinking alignment via in-context learning. ICLR 2024.
  9. Mukherjee, S., et al. (2023). Orca: Progressive learning from complex explanation traces of GPT-4. arXiv:2306.02707.
  10. Wei, J., et al. (2022). Emergent abilities of large language models. TMLR 2022.
  11. Hoffmann, J., et al. (2022). Training compute-optimal large language models. NeurIPS 2022.
  12. Zhou, C., et al. (2023). LIMA: Less is more for alignment. NeurIPS 2023.