Method: Implicit Chain of Thought via Knowledge Distillation (ICoT-KD)
🎯 Goal:
Train a model to answer complex questions without generating reasoning steps, by learning only the final answer from a teacher model's CoT output.
🧠 Core Approach:
Teacher Model (e.g., GPT-3.5):
Generates full reasoning (CoT) + final answer
5 × 8 = 40 → 40 − 12 = 28 → Answer: 28
Student Model (e.g., T5, GPT-J):
Sees only the question → learns to predict “28”
✦ CoT is never shown to the student during training or inference
🛠️ Training Steps:
Teacher generates (Question → CoT + Answer)
Extract (Question → Answer)
Train student on final answers only
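A minimal Python sketch of this pipeline. Hedged: `query_teacher`, the CoT prompt, and the `Answer:` extraction pattern are illustrative assumptions, not the paper's exact implementation.

```python
import re

def extract_final_answer(teacher_output: str):
    """Pull the final answer out of a teacher CoT generation.

    Assumes the teacher ends its rationale with 'Answer: <value>',
    as in '5 x 8 = 40 -> 40 - 12 = 28 -> Answer: 28'.
    """
    match = re.search(r"Answer:\s*(.+)", teacher_output)
    return match.group(1).strip() if match else None

def build_distillation_pairs(questions, query_teacher):
    """Turn teacher (question -> CoT + answer) generations into (question -> answer) pairs.

    `query_teacher` is a hypothetical callable that prompts the teacher model
    (e.g. GPT-3.5) CoT-style and returns its full generation. The rationale
    itself is discarded; the student never sees it.
    """
    pairs = []
    for question in questions:
        teacher_output = query_teacher(question)      # full rationale + final answer
        answer = extract_final_answer(teacher_output)
        if answer is not None:
            pairs.append((question, answer))          # student supervision: answer only
    return pairs
```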
✨ Enhancements (Optional):
Self-Consistency Voting (across multiple CoT outputs)
Filtering incorrect teacher answers
✅ Key Advantages:
Fast, CoT-free inference
No model changes required
Effective on math/symbolic tasks
Works with medium-sized models
Methodology: Implicit Chain of Thought via Knowledge Distillation (ICoT-KD)
Goal: Train a model to answer complex reasoning questions correctly without generating explicit reasoning steps — by using CoT-labeled answers from a teacher model.
🧠 Core Framework
1. Teacher Model (e.g., GPT-3.5):
Prompted with CoT-style questions to produce step-by-step rationales followed by final answers.
Example output:
“There are 7 days in a week. 7 squared is 49. Answer: 49”
2. Student Model (e.g., T5, GPT-J):
Trained to map the original input question → only the final answer, using the teacher’s output.
CoT steps are not shown to the student at any point.
Training supervised via standard cross-entropy loss on the final answer.
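A sketch of what that student update might look like with a Hugging Face seq2seq model; the checkpoint, optimizer, and learning rate are placeholder choices, not the paper's settings.

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Placeholder setup: checkpoint and learning rate are illustrative choices.
tokenizer = T5TokenizerFast.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(question: str, answer: str) -> float:
    """One gradient step: the student sees only the question as input and is
    supervised with cross-entropy on the final answer tokens, never the CoT."""
    inputs = tokenizer(question, return_tensors="pt")
    labels = tokenizer(answer, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss   # cross-entropy over answer tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```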
🧪 Optional Enhancements
Self-Consistency Decoding (SCD):
Use majority voting across multiple CoT generations to select the most consistent answer.
Answer Filtering:
The student distills only from teacher generations whose final answer matches the gold label.
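Both enhancements reduce label noise from the teacher. One plausible sketch: the sample count, `query_teacher`, and the gold-label lookup are assumptions, and `extract_final_answer` comes from the pipeline sketch above.

```python
from collections import Counter

def self_consistent_answer(question, query_teacher, n_samples=8):
    """Self-Consistency Decoding: sample several CoT generations from the teacher
    and keep the majority-vote final answer. `query_teacher` and n_samples are
    illustrative; `extract_final_answer` is defined in the earlier sketch."""
    answers = []
    for _ in range(n_samples):
        generation = query_teacher(question)      # one sampled CoT + answer
        answer = extract_final_answer(generation)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None

def filter_by_gold(pairs, gold_labels):
    """Keep only (question, answer) pairs whose answer matches the gold label,
    so the student never distills from an incorrect teacher generation."""
    return [(q, a) for (q, a) in pairs if gold_labels.get(q) == a]
```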
📌 Training Pipeline
Generate (Q, CoT + A) pairs via teacher
Extract (Q, A) pairs
Train student on (Q → A)
No CoT reasoning at inference
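Inference then reduces to a single short generation, as in this sketch (the 8-token budget is an arbitrary choice; `model` and `tokenizer` are any trained student and its tokenizer):

```python
def answer_without_cot(model, tokenizer, question: str, max_new_tokens: int = 8) -> str:
    """Student inference: decode a short answer string directly from the question.
    No rationale tokens are generated, which is where the speedup comes from."""
    inputs = tokenizer(question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# e.g. answer_without_cot(model, tokenizer,
#   "Alex had 5 packs of 8 markers and gave 12 to a friend. How many are left?")
# -> "28" for a well-trained student
```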
✅ Advantages
General-purpose, model-agnostic
Works with medium models (T5-Base, GPT-J)
Requires no architectural changes
Effective on math and symbolic reasoning tasks
Methodology: Stepwise Internalization for Implicit CoT Reasoning
🎯 Goal:
Train language models to internalize reasoning steps — achieving accurate answers without outputting intermediate steps.
⚙️ Key Approach: Stepwise Internalization
Start with Explicit CoT Training:
Train the model on questions with full step-by-step reasoning and final answer.
Gradual Token Removal (Curriculum Learning):
Iteratively remove CoT tokens from inputs.
Fine-tune the model at each stage.
Forces the model to internalize reasoning within hidden states.
Final Stage – Fully Implicit CoT:
The model predicts the answer directly from the question with no visible reasoning steps.
🔁 Training Optimization Techniques:
Removal Smoothing: Add a small random offset to the number of CoT tokens removed at each stage, so stage transitions are not abrupt.
Optimizer Reset: Re-initialize the optimizer at the start of each stage to stabilize learning.
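A schematic sketch of the full curriculum, including removal smoothing and the per-stage optimizer reset. The removal schedule, the exponential smoothing distribution, and the `make_optimizer` / `train_one_epoch` helpers are assumptions for illustration, not the paper's code.

```python
import random

def truncate_cot(cot_tokens, num_removed):
    """Drop the first `num_removed` CoT tokens; the remainder stays in the training input."""
    return cot_tokens[max(0, num_removed):]

def stepwise_internalization(model, dataset, make_optimizer,
                             tokens_per_stage=8, num_stages=10, smoothing_mean=4.0):
    """Curriculum over removal stages: each stage drops more CoT tokens, with a
    small random extra offset (removal smoothing) and a fresh optimizer per stage.
    `make_optimizer` and `train_one_epoch` are hypothetical helpers standing in
    for ordinary fine-tuning on (question + remaining CoT + answer)."""
    for stage in range(num_stages + 1):
        optimizer = make_optimizer(model)                      # optimizer reset
        staged_data = []
        for question, cot_tokens, answer in dataset:
            offset = random.expovariate(1.0 / smoothing_mean)  # removal smoothing
            num_removed = int(stage * tokens_per_stage + offset)
            staged_data.append((question, truncate_cot(cot_tokens, num_removed), answer))
        train_one_epoch(model, staged_data, optimizer)         # hypothetical helper
    return model  # final stage: all CoT removed -> fully implicit reasoning
```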
📈 Benefits:
Simpler than knowledge distillation-based methods.
No teacher model required.
Model-agnostic and scalable (effective from GPT-2 to Mistral-7B).
Significant speed gains with minimal loss in accuracy.
Methodology: Reasoning in a Continuous Latent Space (Latent CoT)
🎯 Goal:
Train models to reason internally — without generating reasoning steps — by using a latent vector to carry the thought process.
⚙️ Core Architecture
Reasoning Encoder
Takes a question and maps it to a latent vector (a hidden representation of the reasoning process).
Learns to encode “how to think” into a compact form.
Answer Decoder
Uses the latent vector to generate the final answer only.
No reasoning steps are ever output.
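A minimal PyTorch illustration of these two components. The layer sizes, the single latent vector, and the single-token answer head are simplifying assumptions; the actual method may use richer latent states and a full autoregressive decoder.

```python
import torch
import torch.nn as nn

class ReasoningEncoder(nn.Module):
    """Maps a question representation to a compact latent 'thought' vector."""
    def __init__(self, hidden_dim=768, latent_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, question_repr):     # (batch, hidden_dim), e.g. from an LM encoder
        return self.proj(question_repr)   # (batch, latent_dim) latent reasoning vector

class AnswerDecoder(nn.Module):
    """Produces answer logits from the latent vector; no reasoning tokens are emitted."""
    def __init__(self, latent_dim=256, hidden_dim=768, vocab_size=32000):
        super().__init__()
        self.expand = nn.Linear(latent_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, latent):            # (batch, latent_dim)
        return self.head(torch.tanh(self.expand(latent)))   # (batch, vocab_size)
```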
🧪 How it’s Trained
Use existing Chain-of-Thought (CoT) traces to guide the encoder.
CoT helps shape the latent space, even though the model never generates the steps.
The training is fully differentiable (end-to-end), allowing the entire system to be optimized smoothly.
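One plausible way to write that objective. Hedged: the auxiliary latent-shaping term and its weight `alpha` are assumptions about how CoT traces could guide the encoder, not the paper's stated loss; the answer term is standard cross-entropy, and both terms are differentiable, so the encoder and decoder train end to end.

```python
import torch.nn.functional as F

def latent_cot_loss(latent, answer_logits, answer_target, cot_embedding, alpha=0.5):
    """Illustrative end-to-end objective.
        latent:        (batch, latent_dim) from ReasoningEncoder
        answer_logits: (batch, vocab_size) from AnswerDecoder
        answer_target: (batch,) token ids of the final answer
        cot_embedding: (batch, latent_dim) embedding of the teacher CoT trace
    The shaping term pulls the latent toward the CoT embedding; this is one
    assumed realization of 'CoT shapes the latent space'."""
    answer_loss = F.cross_entropy(answer_logits, answer_target)
    shaping_loss = F.mse_loss(latent, cot_embedding)
    return answer_loss + alpha * shaping_loss
```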
✅ Why It’s Powerful
No CoT at inference: reasoning is done silently inside the vector space.
Faster and more compact than explicit CoT methods.
Generalizes well across reasoning tasks.
What is this paper trying to do?
Normally, when a language model solves a hard question (like a math problem), we make it write out the steps, like:
"7 × 4 = 28. 28 + 12 = 40. Answer: 40."
This is called Chain of Thought (CoT) — it helps the model think clearly and get better answers.
But writing out all those steps:
Takes extra tokens and compute
Makes inference slower
Isn’t always needed if the model can “think” silently
🎯 So what’s the new idea?
Instead of making the model write its thinking, this paper teaches it to do the reasoning silently — inside its “mind”.
Like how humans often do math in their head without saying each step out loud.
They call this Latent CoT — the thinking happens in a hidden, internal form.
🧱 How does the model do it?
It’s like building a machine with two parts:
1. 🧠 Reasoning Encoder
It reads the question
It creates a special vector (a bunch of numbers) that secretly represents how to solve the problem
Think of this like your brain quietly planning an answer
2. 🗣️ Answer Decoder
It takes that hidden “thought vector” and turns it into the final answer
It doesn’t show any reasoning steps — just the answer
🧪 How do they train it?
At first, they let the model see examples with full CoT steps (like 7×4 = 28 → 28 + 12 = 40). But the model is trained to:
Not repeat those steps
Just use them to shape its internal thinking space
In simple terms:
The model learns from the explanations, but doesn’t copy them — it learns to reason silently.
And because the whole system is trained all together, it learns smoothly and efficiently.
✅ Why is this helpful?
🔕 Faster: No reasoning steps to write out
🧠 Smarter: Reasoning is hidden, but still accurate
📦 Compact: Takes less space and time
🔁 Trainable end-to-end: Easy to improve all parts together
🔬 Good at reasoning tasks like math and logic
🎓 Final Analogy:
Imagine teaching a student to solve problems in their head after showing them many worked-out examples — and they still get the right answers, just silently. That’s exactly what this model is doing.
🔁 Updated Example for Slide
Teacher Model (e.g., GPT-3.5) — Prompted with CoT:
“Alex had 5 packs of markers. Each pack had 8 markers. He gave 12 markers to a friend. How many does he have left?
→ 5 × 8 = 40
→ 40 − 12 = 28
→ Answer: 28”
Student Model (e.g., T5, GPT-J) — Trained to see only:
“Alex had 5 packs of markers. Each pack had 8 markers. He gave 12 markers to a friend. How many does he have left?”
→ "28"
✅ This example comes from GSM8K, one of the key datasets used in the paper’s experiments.