How to navigate today’s model landscape, classify model types, choose and evaluate LLMs, and decide between prompt engineering, RAG, fine‑tuning, or training.
Exploring and Comparing Different LLMs
In the previous lesson we introduced Generative AI and how Large Language Models (LLMs) enable new educational startup scenarios. Here we dive into selecting and comparing models: taxonomy, trade‑offs, evaluation, and improvement pathways (prompting, retrieval augmentation, fine‑tuning, and—rarely—training from scratch).
Goal: Enable practical, evidence-based model decisions instead of brand-driven or “bigger-is-better” assumptions.
Lesson Overview
You will learn:
- A modern taxonomy of foundation and large language models (2024–2025 landscape)
- How licensing (open‑weight vs. proprietary) and modality affect choice
- Architecture patterns (decoder-only, encoder-decoder, MoE, multimodal unification)
- A structured selection & evaluation framework (capability, cost, latency, safety, control)
- When to apply Prompt Engineering, RAG, Parameter-Efficient Fine-Tuning, or Full Fine-Tuning
- How Azure AI Studio (Model Catalog, benchmarks, evaluation & deployment) supports the lifecycle
Learning Goals
After completing this lesson you can:
- Classify an LLM by modality, architecture, and license.
- Apply a decision path to select a model and customization approach.
- Design an evaluation loop (offline + human + safety + regression).
- Explain when not to fine-tune and why RAG often comes first.
- Deploy and iterate using Azure AI Studio tools.
1. Model Landscape Taxonomy
Modern models can be compared along multiple axes:
Axis | Categories | Why It Matters |
---|---|---|
Modality | Text-only, Multimodal (text+image), Audio, Video, Code-specialized | Task coverage; a single multimodal model vs. composing a pipeline of single-modality models |
License / Access | Proprietary API (hosted), Open-weight (downloadable), Restricted license (research/commercial clauses) | Control, cost optimization, compliance, customization |
Scale | Frontier (hundreds of billions / MoE), Mid (tens of billions), Small / Efficient (<10B) | Latency, cost, edge/embedded feasibility |
Architecture | Decoder-only, Encoder-only, Encoder-Decoder, Mixture-of-Experts (MoE), Retrieval-augmented stacks | Generation quality vs. efficiency vs. bidirectional features |
Specialization | General chat, Code, Reasoning, Safety-aligned, Domain-tuned (medical, finance) | Match to task shape; avoid overpaying |
Adaptation Support | Prompt-only, Function calling, Tool use, Structured output, Fine-tuning / LoRA endpoints | Integration pathways |
Representative Examples (Early 2025)
Type | Open-Weight / Downloadable | Proprietary / Hosted |
---|---|---|
General Chat / Multimodal | Llama 3 family, Mistral & Mixtral MoE, Gemma 2, Phi-3 (mini / small / medium), DeepSeek | GPT‑4o, Claude 3 family, Gemini 1.5, Cohere Command R/R+, Azure OpenAI variants |
Code | StarCoder2, Code Llama, Phi-3 medium (instruct) | GPT‑4o, Claude 3.5 Sonnet, Gemini Code Assist |
Small / Edge | Phi-3 mini, Mistral 7B, Gemma 2 2B/9B, TinyLlama | Hosted distilled endpoints / dynamic scaling |
Embeddings | Instructor models, E5 variants, bge-large, nomic-embed | text-embedding-3-large, Cohere Embed, Azure OpenAI embeddings |
Speech / Audio | Whisper (open), Distil-Whisper | GPT‑4o audio, Azure Neural voices |
Image Generation | Stable Diffusion SDXL / Turbo, Flux derivatives | DALL·E 3, Midjourney (service), Ideogram |
Terminology: a foundation model is broadly pre-trained on large unlabeled corpora (often across modalities); an LLM is a language-focused foundation model. The line between the two is increasingly blurred as unified multimodal models emerge.
2. Foundation Models vs. Chat-Tuned LLMs
- Base / Foundation Model: Broadly pre-trained, not instruction-aligned, may output unstructured continuations.
- Instruction / Chat-Tuned Model: Base model further aligned via supervised fine-tuning + reinforcement signals (RLHF / DPO / preference optimization).
- Further Domain Adaptation: Lightweight fine-tuning (LoRA), RAG layering, tool orchestration, or safety guardrails.
Practical implication: use an instruction/chat-tuned variant for most application prototypes; drop to a base model only if you need raw continuation behavior (e.g., research on token statistics) or plan to run your own alignment pipeline.
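To make the distinction concrete, here is a minimal sketch assuming the Hugging Face `transformers` library and the TinyLlama chat checkpoint mentioned later in this lesson (any instruction-tuned model with a chat template works): the chat-tuned variant expects role-tagged messages rendered through its chat template, whereas a base model simply continues whatever raw text you hand it.

```python
# Minimal sketch: render a chat prompt the way an instruction-tuned model expects.
# The model name is illustrative; swap in any chat-tuned checkpoint you have access to.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a concise teaching assistant."},
    {"role": "user", "content": "Summarize photosynthesis in two sentences."},
]

# apply_chat_template wraps the messages in the special tokens the model was aligned on.
# Sending the same plain text to a base model would only produce free-form continuation.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```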
3. Open-Weight vs. Proprietary
Dimension | Open-Weight Advantage | Proprietary Advantage | Considerations |
---|---|---|---|
Control | Full infra & fine-tune freedom | Managed scaling & reliability | Ops burden vs. agility |
Data Privacy | On-prem / air-gapped possible | Enterprise assurances / regional hosting | Evaluate data retention policies |
Cost Curve | Lower marginal cost at scale | Lower startup cost, pay-as-you-go | Volume break-even point shifts |
Performance Pace | You apply updates selectively | Immediate access to frontier releases | Risk of stagnation vs. forced upgrades |
Safety | Customizable policies | Built-in layered safety filters | With open-weight models you must implement guardrails yourself |
Hybrid patterns (proprietary for critical reasoning + open-weight for cost-sensitive batch tasks) are increasingly common.
4. Architectural Patterns
Pattern | Use Case Strength | Examples |
---|---|---|
Decoder-only (causal) | Autoregressive generation (chat, code) | GPT family, Llama, Mistral, Phi |
Encoder-only | Understanding / classification / embeddings | BERT, RoBERTa |
Encoder-Decoder | Translation, summarization, seq2seq | T5, BART |
Mixture-of-Experts (Sparse) | Higher quality per FLOP | Mixtral, DeepSeek MoE variants |
Retrieval-Augmented Stack | Grounded QA, freshness | Any LLM + Azure AI Search / vector DB |
Multimodal Unified | Text + image (and emerging audio/video) | GPT‑4o, Gemini 1.5, Llama 3 multimodal variants |
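To see these patterns side by side, the sketch below runs small public checkpoints (chosen purely for illustration and assumed to be available through Hugging Face `transformers`) against the three core task shapes from the table.

```python
# Decoder-only, encoder-only, and encoder-decoder patterns via transformers pipelines.
from transformers import pipeline

# Decoder-only (causal): autoregressive text generation.
generate = pipeline("text-generation", model="distilgpt2")
print(generate("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Encoder-only: bidirectional understanding, e.g. masked-token prediction or embeddings.
fill = pipeline("fill-mask", model="distilroberta-base")
print(fill("The student answered the question <mask>.")[0]["token_str"])

# Encoder-decoder: sequence-to-sequence tasks such as summarization.
summarize = pipeline("summarization", model="t5-small")
text = ("Photosynthesis converts light energy into chemical energy stored as glucose, "
        "releasing oxygen as a by-product; it takes place in the chloroplasts of plant cells.")
print(summarize(text, max_length=40, min_length=5)[0]["summary_text"])
```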
5. Model Selection Framework
Score candidate models across weighted criteria:
Criterion | Questions | Metrics / Signals |
---|---|---|
Capability | Does it meet task accuracy & reasoning depth? | Benchmark scores, internal eval set |
Latency | Meets UX threshold (e.g., p95 < 2s)? | Tokens/sec, first-token latency |
Cost | Sustainable at projected volume? | $/1K input + output tokens, GPU hours |
Control | Need on-prem / custom fine-tune? | License, adaptation APIs |
Safety & Compliance | Built-in filters? Region/data residency? | Policy docs, SOC2/ISO reports |
Scalability | Spikes, concurrency, batching efficiency? | Autoscaling tests |
Maintenance | Update cadence manageable? | Release notes frequency |
Tip: Start with a narrow internal eval dataset (20–50 carefully curated prompts representing core tasks + edge cases) before scaling to larger synthetic sets.
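One lightweight way to operationalize the criteria table is a weighted scorecard. The weights, candidate names, and per-criterion scores below are placeholders; replace them with numbers from your own evaluation runs.

```python
# Weighted scorecard sketch: scores are normalized to 0-1 per criterion (higher is better).
CRITERIA_WEIGHTS = {
    "capability": 0.30,
    "latency": 0.15,
    "cost": 0.20,
    "control": 0.10,
    "safety_compliance": 0.15,
    "scalability": 0.05,
    "maintenance": 0.05,
}

def weighted_score(scores: dict) -> float:
    return sum(CRITERIA_WEIGHTS[c] * scores.get(c, 0.0) for c in CRITERIA_WEIGHTS)

candidates = {
    "hosted-frontier-model": {"capability": 0.95, "latency": 0.70, "cost": 0.50, "control": 0.40,
                              "safety_compliance": 0.90, "scalability": 0.90, "maintenance": 0.90},
    "small-open-weight":     {"capability": 0.75, "latency": 0.90, "cost": 0.90, "control": 0.90,
                              "safety_compliance": 0.60, "scalability": 0.70, "maintenance": 0.60},
}

# Rank candidates by weighted score.
for name, scores in sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{name}: {weighted_score(scores):.2f}")
```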
6. Evaluation Methodology
- Define Tasks: summarization, rubric feedback, misconception extraction, JSON structuring.
- Create Ground Truth / Reference Sets:
  - Human-annotated examples (gold set).
  - Edge/failure cases (long inputs, ambiguous instructions, adversarial prompts).
- Automated Scoring (where possible; see the sketch after this list):
  - Exact match or semantic similarity (cosine over embeddings) for extraction tasks.
  - Schema validation for structured outputs.
  - Latency and cost instrumentation.
- Model-Assisted Judging:
  - Use a separate, stronger model to draft comparative judgments, with human spot-checks to control judge bias while keeping review scalable.
- Safety & Red-Teaming:
  - Prompt-injection attempts, PII leakage tests, harmful-content triggers.
- Regression Gates:
  - Freeze baseline metrics; any change (prompt template, RAG chunking, quantization) must not degrade a gated metric beyond the tolerance band.
Track: helpfulness, grounded accuracy, hallucination rate, harmful content rate, latency p50/p95, cost ($/req), schema adherence %, user satisfaction.
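Below is a minimal sketch of the automated-scoring and regression-gate steps in plain Python. Metric names, thresholds, and the commented-out `call_model` client are illustrative assumptions, not a prescribed implementation.

```python
import json

BASELINE = {"schema_adherence": 0.98, "hallucination_rate": 0.05}  # frozen baseline metrics
TOLERANCE = 0.02  # allowed absolute degradation per gated metric
HIGHER_IS_BETTER = {"schema_adherence": True, "hallucination_rate": False}

def validate_schema(raw: str, required_keys: set) -> bool:
    """Check that the model returned parseable JSON containing the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj)

def regression_gate(current: dict) -> bool:
    """Fail if any gated metric degrades beyond the tolerance band vs. baseline."""
    passed = True
    for metric, base in BASELINE.items():
        delta = current[metric] - base
        degraded = delta < -TOLERANCE if HIGHER_IS_BETTER[metric] else delta > TOLERANCE
        if degraded:
            print(f"GATE FAIL: {metric} = {current[metric]:.3f} (baseline {base:.3f})")
            passed = False
    return passed

# Outputs would normally come from your model client, e.g. call_model(prompt).
# Only one of these two sample outputs validates, so the gate intentionally fails.
outputs = ['{"questions": ["Q1", "Q2"], "difficulty": "easy"}', "not json at all"]
adherence = sum(validate_schema(o, {"questions", "difficulty"}) for o in outputs) / len(outputs)
print(regression_gate({"schema_adherence": adherence, "hallucination_rate": 0.04}))
```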
7. Customization Decision Path
1. Can prompt engineering, better instructions, and an output schema achieve the target?
   - Yes → Ship and monitor.
   - No → continue.
2. Do we just need proprietary or current data?
   - Yes → Add RAG (vector index + retrieval ranking).
   - No → continue.
3. Are failure modes stylistic or narrow domain phrasing?
   - Yes → Parameter-Efficient Fine-Tuning (LoRA/QLoRA).
   - No → continue.
4. Do we have large, high-quality labeled pairs and budget?
   - Yes → Full fine-tune / alignment.
   - No → Revisit requirements or take a hybrid approach.
Training a new LLM from scratch is rarely justified: it requires massive curated data, sustained infrastructure investment, a safety-alignment pipeline, and continuous evaluation.
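The decision path above can be encoded as a tiny helper to keep team discussions honest; the boolean inputs are judgments that come out of your evaluation, not something code can determine for you.

```python
# A sketch of the decision path as a function; inputs come from your eval findings.
def choose_adaptation(prompting_sufficient: bool,
                      needs_fresh_or_private_data: bool,
                      failures_are_stylistic: bool,
                      has_large_labeled_set_and_budget: bool) -> str:
    if prompting_sufficient:
        return "Ship with prompt engineering; monitor."
    if needs_fresh_or_private_data:
        return "Add RAG (vector index + retrieval ranking)."
    if failures_are_stylistic:
        return "Parameter-efficient fine-tuning (LoRA/QLoRA)."
    if has_large_labeled_set_and_budget:
        return "Full fine-tune / alignment run."
    return "Revisit requirements or combine approaches (hybrid)."

print(choose_adaptation(False, True, False, False))  # -> Add RAG (vector index + retrieval ranking).
```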
8. Improving LLM Results (Updated Approaches)
Approach | When | Benefits | Risks / Costs |
---|---|---|---|
Prompt Engineering (role, style, schema) | First line | Fast iteration | Prompt drift, verbosity costs |
Retrieval-Augmented Generation (RAG) | Need current or proprietary truth | Reduces hallucinations; no weight changes | Context window pressure, retrieval quality critical |
Parameter-Efficient FT (LoRA/QLoRA, adapters) | Stylistic/domain adjustments | Low cost, reversible | Overfit if low-quality data |
Full Fine-Tune | High-volume domain tasks unmet by above | Strong domain internalization | Expensive; catastrophic forgetting risk |
Tool / Function Calling | Deterministic operations (calc, lookup) | Reliability, reduces “made up” answers | Requires tool design & sandboxing |
Guardrails & Policy Layer | Safety, compliance, injection defense | Risk mitigation | Added latency; tuning thresholds |
Distillation / Quantization | Cost & latency reduction | Cheaper inference | Possible quality drop; re-eval needed |
Structured Output Control (JSON schema guidance) | Workflow integration | Parsing reliability | Must enforce validation loop |
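To show how retrieval augmentation slots in front of generation, here is a toy sketch. Retrieval is plain TF-IDF via scikit-learn purely for brevity; a production system would use hybrid lexical + vector search (e.g. Azure AI Search), and `call_model` is a hypothetical placeholder for whichever LLM client you choose.

```python
# Toy RAG sketch: retrieve the most relevant chunk and prepend it to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The curriculum requires two reading comprehension exercises per week.",
    "Essays are graded on structure, vocabulary, and argument clarity.",
    "Students may submit work in Spanish or English.",
]

def retrieve(query: str, k: int = 1) -> list:
    """Return the top-k chunks by TF-IDF cosine similarity to the query."""
    vec = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

question = "How many comprehension exercises does the curriculum require?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# response = call_model(prompt)  # placeholder for your chosen LLM client
print(prompt)
```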
9. Azure AI Studio (Lifecycle Enablement)
Azure AI Studio provides a unified hub to:
- Discover models in the Model Catalog (proprietary & open-weight collections; filter by modality, license, task).
- Inspect model cards (intended use, eval benchmarks, licensing, safety notes).
- Benchmark via integrated eval datasets or custom sets (latency, accuracy, safety metrics).
- Ground responses using integrated RAG pipelines (Azure AI Search / Vector Search for hybrid lexical + vector retrieval).
- Fine-Tune (where supported) with experiment tracking and lineage.
- Add Safety Layers (content filters, prompt shields, PII detection).
- Deploy:
  - Managed real-time endpoints (scalable GPU fleet)
  - Serverless pay-as-you-go APIs (for bursty, low-maintenance usage)
  - Batch or offline pipelines
- Monitor (centralized logs, cost dashboards, drift signals, violation alerts).
Always check the model card for: availability of fine-tuning, max context length, safety disclaimers, and responsible use constraints.
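Once a model is deployed, calling it follows the usual SDK pattern. The sketch below assumes an Azure OpenAI chat deployment and the `openai` Python SDK (v1.x); the endpoint, deployment name, and API version are placeholders to take from your own deployment page.

```python
import os
from openai import AzureOpenAI

# Placeholders: use the endpoint, key, API version, and deployment name from your resource.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # the deployment name, not the base model name
    messages=[
        {"role": "system", "content": "You generate grade-aligned comprehension questions."},
        {"role": "user", "content": "Create three questions about the attached essay."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```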
10. Practical Example: Narrowing Candidates
Scenario: “Generate personalized reading comprehension questions (grade-level aligned) from student-submitted essays in Spanish or English, return JSON with difficulty tags.”
Steps:
- Baseline: Instruction-tuned GPT‑4o (proprietary) + prompt template → Evaluate.
- Cost Pressure: Trial Phi-3 medium + RAG for curriculum alignment → Compare cost per 100 tasks.
- Style Consistency Gap: Apply LoRA fine-tune on ~2K curated (essay → questions JSON) pairs.
- Production Hardening: Add structured JSON schema enforcement (see the schema sketch below) + injection-resistant retrieval (sanitize essay input).
- Monitoring: Collect user rating metadata to refine prompt schema (“too hard”, “off-topic”).
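For the production-hardening step, the JSON contract can be made explicit and machine-checked. The sketch below uses the `jsonschema` library; the field names and ranges are assumptions for illustration, to be aligned with your actual rubric.

```python
# Hypothetical JSON contract for generated comprehension questions, enforced with jsonschema.
from jsonschema import ValidationError, validate

QUESTION_SCHEMA = {
    "type": "object",
    "properties": {
        "language": {"enum": ["es", "en"]},
        "grade_level": {"type": "integer", "minimum": 1, "maximum": 12},
        "questions": {
            "type": "array",
            "minItems": 3,
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "difficulty": {"enum": ["easy", "medium", "hard"]},
                },
                "required": ["text", "difficulty"],
            },
        },
    },
    "required": ["language", "grade_level", "questions"],
}

def is_valid(candidate: dict) -> bool:
    """Return True if the model output satisfies the contract; log the first violation otherwise."""
    try:
        validate(candidate, QUESTION_SCHEMA)
        return True
    except ValidationError as err:
        print(f"Schema violation: {err.message}")
        return False

print(is_valid({"language": "en", "grade_level": 5, "questions": [
    {"text": "What is the main idea?", "difficulty": "easy"},
    {"text": "Why does the author mention rainfall?", "difficulty": "medium"},
    {"text": "How would you restructure the argument?", "difficulty": "hard"},
]}))
```

Pairing this check with a retry loop (re-prompting the model with the validation error appended) is a common way to close the enforcement loop.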
11. Common Misconceptions (Revisited)
Claim | Reality |
---|---|
“Fine-tuning always beats RAG.” | RAG often solves freshness + factual gaps more efficiently. |
“Bigger model = better for every task.” | Small models + retrieval can outperform for focused domains. |
“Evaluation once is enough.” | Continuous regression & safety evaluation required. |
“Quantization destroys quality.” | Properly calibrated 4–8 bit often preserves task performance. |
“Open-weight = unsafe.” | Safety depends on your layered controls & monitoring. |
12. Knowledge Check
Select all correct statements:
1. Retrieval-Augmented Generation can reduce hallucinations by injecting authoritative context at inference time.
2. Full fine-tuning should be your first adaptation step once prompts feel “messy.”
3. Parameter-efficient fine-tuning (LoRA) changes only a small subset of weights, reducing cost.
4. A structured eval set with edge cases helps prevent silent regressions after optimization.
5. Larger models always yield lower total cost of ownership (TCO) at scale.
Answer: 1, 3, and 4 are correct.
2 is false (start with prompt + RAG).
5 is false (smaller or hybrid architectures can reduce TCO).
13. Challenge
Create an evaluation spreadsheet or JSON spec for your target use case:
- 8–12 representative prompts
- 3 edge cases (ambiguous, adversarial, long input)
- Expected output format (schema)
- Pass criteria (accuracy notes, latency target, safety checks)
- Two candidate models + initial metric gaps
Then prototype a RAG variant and log deltas.
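If you prefer to start from a template, here is a starter skeleton for the spec; every value is a placeholder to replace with your own use case, and dumping it to JSON makes it easy to share.

```python
import json

eval_spec = {
    "use_case": "Grade-aligned reading comprehension questions",
    "prompts": [
        {"id": "p01", "input": "<representative essay #1>", "expected": "<reference questions>"},
        # ... add 8-12 representative prompts
    ],
    "edge_cases": [
        {"id": "e01", "type": "ambiguous", "input": "<essay with unclear topic>"},
        {"id": "e02", "type": "adversarial", "input": "<essay containing prompt-injection text>"},
        {"id": "e03", "type": "long_input", "input": "<essay near the context limit>"},
    ],
    "output_schema": {"required": ["language", "grade_level", "questions"]},
    "pass_criteria": {"min_accuracy": 0.85, "p95_latency_s": 2.0, "safety_violations": 0},
    "candidate_models": ["<hosted frontier model>", "<small open-weight model>"],
}

print(json.dumps(eval_spec, indent=2))
```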
Continue Learning
Next lesson: build with Generative AI responsibly—governance, safety evaluations, prompt injection defenses, and user experience patterns.
Updated for early 2025 landscape. Revisit periodically—model capabilities and best practices evolve rapidly.