How to navigate today’s model landscape, classify model types, choose and evaluate LLMs, and decide between prompt engineering, RAG, fine‑tuning, or training.

Exploring and Comparing Different LLMs

In the previous lesson we introduced Generative AI and how Large Language Models (LLMs) enable new educational startup scenarios. Here we dive into selecting and comparing models: taxonomy, trade‑offs, evaluation, and improvement pathways (prompting, retrieval augmentation, fine‑tuning, and—rarely—training from scratch).

Goal: Enable practical, evidence-based model decisions instead of brand-driven or “bigger-is-better” assumptions.


Lesson Overview

You will learn:

  • A modern taxonomy of foundation and large language models (2024–2025 landscape)
  • How licensing (open‑weight vs. proprietary) and modality affect choice
  • Architecture patterns (decoder-only, encoder-decoder, MoE, multimodal unification)
  • A structured selection & evaluation framework (capability, cost, latency, safety, control)
  • When to apply Prompt Engineering, RAG, Parameter-Efficient Fine-Tuning, or Full Fine-Tuning
  • How Azure AI Studio (Model Catalog, benchmarks, evaluation & deployment) supports the lifecycle

Learning Goals

After completing this lesson you can:

  • Classify an LLM by modality, architecture, and license.
  • Apply a decision path to select a model and customization approach.
  • Design an evaluation loop (offline + human + safety + regression).
  • Explain when not to fine-tune and why RAG often comes first.
  • Deploy and iterate using Azure AI Studio tools.

1. Model Landscape Taxonomy

Modern models can be compared along multiple axes:

| Axis | Categories | Why It Matters |
|---|---|---|
| Modality | Text-only, Multimodal (text+image), Audio, Video, Code-specialized | Task coverage; fewer models vs. pipeline composition |
| License / Access | Proprietary API (hosted), Open-weight (downloadable), Restricted license (research/commercial clauses) | Control, cost optimization, compliance, customization |
| Scale | Frontier (hundreds of billions / MoE), Mid (tens of billions), Small / Efficient (<10B) | Latency, cost, edge/embedded feasibility |
| Architecture | Decoder-only, Encoder-only, Encoder-Decoder, Mixture-of-Experts (MoE), Retrieval-augmented stacks | Generation quality vs. efficiency vs. bidirectional features |
| Specialization | General chat, Code, Reasoning, Safety-aligned, Domain-tuned (medical, finance) | Match to task shape; avoid overpaying |
| Adaptation Support | Prompt-only, Function calling, Tool use, Structured output, Fine-tuning / LoRA endpoints | Integration pathways |

Representative Examples (Early 2025)

| Type | Open-Weight / Downloadable | Proprietary / Hosted |
|---|---|---|
| General Chat / Multimodal | Llama 3 family, Mistral & Mixtral MoE, Gemma 2, Phi-3 (mini / small / medium), DeepSeek | GPT‑4o, Claude 3 family, Gemini 1.5, Cohere Command R/R+, Azure OpenAI variants |
| Code | StarCoder2, Code Llama, Phi-3 medium (instruct) | GPT‑4o (code), Claude 3 Sonnet for reasoning, Gemini Code |
| Small / Edge | Phi-3 mini, Mistral 7B, Gemma 2 2B/9B, TinyLlama | Hosted distilled endpoints / dynamic scaling |
| Embeddings | Instructor models, E5 variants, bge-large, nomic-embed | text-embedding-3-large, Cohere Embed, Azure OpenAI embeddings |
| Speech / Audio | Whisper (open), Distil-Whisper | GPT‑4o audio, Azure Neural voices |
| Image Generation | Stable Diffusion SDXL / Turbo, Flux derivatives | DALL·E 3, Midjourney (service), Ideogram |

Terminology: a foundation model is broadly pre-trained across large unlabeled corpora (often multimodal); an LLM is a language-focused foundation model. The line between the two is increasingly blurred as unified multimodal models emerge.


2. Foundation Models vs. Chat-Tuned LLMs

  • Base / Foundation Model: Broadly pre-trained, not instruction-aligned, may output unstructured continuations.
  • Instruction / Chat-Tuned Model: Base model further aligned via supervised fine-tuning + reinforcement signals (RLHF / DPO / preference optimization).
  • Further Domain Adaptation: Lightweight fine-tuning (LoRA), RAG layering, tool orchestration, or safety guardrails.

Practical implication: use an instruction- or chat-tuned variant for most application prototypes; only drop down to a base model if you need raw continuation behavior (e.g., for research on token statistics) or plan to run your own alignment pipeline.
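
To make the distinction concrete, here is a minimal sketch using the Hugging Face transformers library; the model names are just an illustrative base/instruct pair, not a recommendation.

```python
# Minimal sketch: base-model continuation vs. chat-tuned generation.
# Model names are illustrative; any base/instruct pair works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Explain photosynthesis to a 10-year-old."

# Base model: simply continues the text, with no instruction alignment.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
base_ids = base_tok(prompt, return_tensors="pt")
base_out = base_model.generate(**base_ids, max_new_tokens=128)
print(base_tok.decode(base_out[0], skip_special_tokens=True))

# Instruction-tuned sibling: wrap the request in the chat template it was aligned on.
chat_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
chat_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [{"role": "user", "content": prompt}]
chat_ids = chat_tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
chat_out = chat_model.generate(chat_ids, max_new_tokens=128)
print(chat_tok.decode(chat_out[0], skip_special_tokens=True))
```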


3. Open-Weight vs. Proprietary

| Dimension | Open-Weight Advantage | Proprietary Advantage | Considerations |
|---|---|---|---|
| Control | Full infra & fine-tune freedom | Managed scaling & reliability | Ops burden vs. agility |
| Data Privacy | On-prem / air-gapped possible | Enterprise assurances / regional hosting | Evaluate data retention policies |
| Cost Curve | Lower marginal cost at scale | Lower startup cost, pay-as-you-go | Volume break-even shifts |
| Performance Pace | You apply updates selectively | Immediate access to frontier releases | Risk of stagnation vs. forced upgrades |
| Safety | Customizable policies | Built-in layered safety filters | With open-weight you implement guardrails yourself |

Hybrid patterns (proprietary for critical reasoning + open-weight for cost-sensitive batch tasks) are increasingly common.
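
As one hedged illustration of such a hybrid pattern, the sketch below routes requests between a hosted frontier model and a self-hosted open-weight model served behind an OpenAI-compatible API. The endpoint URL, model names, and routing rule are all assumptions to adapt to your own setup.

```python
# Hypothetical hybrid routing: proprietary endpoint for complex reasoning,
# self-hosted open-weight model (OpenAI-compatible server, e.g. vLLM) for cheap batch work.
from openai import OpenAI

hosted = OpenAI()  # proprietary API; reads OPENAI_API_KEY from the environment
local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed local server

def route(task_type: str, messages: list[dict]) -> str:
    """Send high-stakes reasoning to the frontier model, everything else to the cheap one."""
    if task_type in {"grading_feedback", "complex_reasoning"}:
        client, model = hosted, "gpt-4o"        # assumed hosted model name
    else:
        client, model = local, "phi-3-mini"     # assumed local model name
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

print(route("bulk_tagging", [{"role": "user", "content": "Tag this essay's topics: ..."}]))
```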


4. Architectural Patterns

| Pattern | Use Case Strength | Examples |
|---|---|---|
| Decoder-only (causal) | Autoregressive generation (chat, code) | GPT family, Llama, Mistral, Phi |
| Encoder-only | Understanding / classification / embeddings | BERT, RoBERTa |
| Encoder-Decoder | Translation, summarization, seq2seq | T5, BART |
| Mixture-of-Experts (Sparse) | Higher quality per FLOP | Mixtral, DeepSeek MoE variants |
| Retrieval-Augmented Stack | Grounded QA, freshness | Any LLM + Azure AI Search / vector DB |
| Multimodal Unified | Text + image (and emerging audio/video) | GPT‑4o, Gemini 1.5, Llama 3 multimodal variants |
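
To show the retrieval-augmented stack in miniature, the sketch below embeds a few chunks, retrieves the closest ones, and stuffs them into a prompt. The embedding model (sentence-transformers) and the prompt wording are assumptions; in production you would use a managed vector store with hybrid ranking.

```python
# Minimal retrieval-augmented generation sketch: embed chunks, retrieve top-k,
# and inject them into the prompt. Library and model choices are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The French Revolution began in 1789.",
    "Mitochondria are the powerhouse of the cell.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q            # cosine similarity (vectors are normalized)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

question = "How do plants store energy from sunlight?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` would then be sent to whichever LLM you selected.
print(prompt)
```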

5. Model Selection Framework

Score candidate models across weighted criteria:

| Criterion | Questions | Metrics / Signals |
|---|---|---|
| Capability | Does it meet task accuracy & reasoning depth? | Benchmark scores, internal eval set |
| Latency | Meets UX threshold (e.g., p95 < 2s)? | Tokens/sec, first-token latency |
| Cost | Sustainable at projected volume? | $/1K input + output tokens, GPU hours |
| Control | Need on-prem / custom fine-tune? | License, adaptation APIs |
| Safety & Compliance | Built-in filters? Region/data residency? | Policy docs, SOC2/ISO reports |
| Scalability | Spikes, concurrency, batching efficiency? | Autoscaling tests |
| Maintenance | Update cadence manageable? | Release notes frequency |
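
A weighted scorecard keeps these trade-offs explicit. The sketch below shows one possible way to compute it; the weights and the 1–5 ratings are placeholders you would replace with your own evaluation results.

```python
# Hypothetical weighted scorecard: combine 1-5 ratings per criterion into one score.
WEIGHTS = {"capability": 0.30, "latency": 0.15, "cost": 0.20,
           "control": 0.10, "safety": 0.15, "scalability": 0.05, "maintenance": 0.05}

candidates = {   # placeholder ratings from your internal eval, 1 (poor) to 5 (strong)
    "hosted-frontier-model": {"capability": 5, "latency": 3, "cost": 2, "control": 2,
                              "safety": 5, "scalability": 4, "maintenance": 5},
    "open-weight-7b":        {"capability": 3, "latency": 5, "cost": 5, "control": 5,
                              "safety": 3, "scalability": 4, "maintenance": 3},
}

def weighted_score(ratings: dict[str, int]) -> float:
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

for name, ratings in sorted(candidates.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(ratings):.2f}")
```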

Tip: Start with a narrow internal eval dataset (20–50 carefully curated prompts representing core tasks + edge cases) before scaling to larger synthetic sets.


6. Evaluation Methodology

  1. Define Tasks: Summarization, rubric feedback, misconception extraction, JSON structuring.
  2. Create Ground Truth / Reference Sets:
    • Human annotated examples (gold set).
    • Edge/failure cases (long inputs, ambiguous instructions, adversarial prompts).
  3. Automated Scoring (where possible):
    • Exact / semantic similarity (cosine over embeddings) for extraction.
    • Schema validation for structured outputs.
    • Latency & cost instrumentation.
  4. Model-Assisted Judging:
    • Use a separate, higher-performing model to draft comparative judgments at scale; have humans spot-check a sample to catch judge bias.
  5. Safety & Red-Teaming:
    • Prompt injection attempts, PII leakage tests, harmful content triggers.
  6. Regression Gates:
    • Freeze baseline metrics; any change (prompt template, RAG chunking, quantization) must not degrade gated metrics beyond the tolerance band (a minimal harness sketch follows below).

Track: helpfulness, grounded accuracy, hallucination rate, harmful content rate, latency p50/p95, cost ($/req), schema adherence %, user satisfaction.
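
A minimal automated-evaluation harness, assuming JSON outputs, a sentence-transformers embedding model for semantic similarity, and placeholder baseline metrics, might look like this sketch.

```python
# Sketch of an automated eval pass: schema check, semantic-similarity score,
# and a simple regression gate against a frozen baseline. Thresholds are placeholders.
import json
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")
BASELINE = {"schema_adherence": 0.98, "mean_similarity": 0.82}   # frozen baseline metrics
TOLERANCE = 0.02                                                 # allowed degradation band

def evaluate(outputs: list[str], references: list[str], required_keys: set[str]) -> dict:
    valid, sims = 0, []
    for out, ref in zip(outputs, references):
        try:
            parsed = json.loads(out)
            if required_keys.issubset(parsed):
                valid += 1
        except json.JSONDecodeError:
            pass                                   # counts as a schema failure
        sims.append(float(util.cos_sim(embedder.encode(out), embedder.encode(ref))))
    return {"schema_adherence": valid / len(outputs), "mean_similarity": sum(sims) / len(sims)}

def gate(metrics: dict) -> bool:
    """Fail the change if any gated metric drops more than the tolerance band."""
    return all(metrics[k] >= BASELINE[k] - TOLERANCE for k in BASELINE)
```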


7. Customization Decision Path

Can prompt engineering + better instructions + output schema achieve target?
  ├─ Yes → Ship & monitor
  └─ No →
       Do we just need proprietary / current data?
         ├─ Yes → Add RAG (vector + retrieval ranking)
         └─ No →
             Are failure modes stylistic or narrow domain phrasing?
               ├─ Yes → Parameter-Efficient Fine-Tuning (LoRA/QLoRA)
               └─ No →
                   Do we have large, high-quality labeled pairs + budget?
                     ├─ Yes → Full fine-tune / alignment
                     └─ No → Revisit requirements or hybrid approach

Training a new LLM from scratch is rarely justified: it requires massive curated data, sustained infrastructure, a safety alignment pipeline, and continuous evaluation.


8. Improving LLM Results (Updated Approaches)

| Approach | When | Benefits | Risks / Costs |
|---|---|---|---|
| Prompt Engineering (role, style, schema) | First line | Fast iteration | Prompt drift, verbosity costs |
| Retrieval-Augmented Generation (RAG) | Need current or proprietary truth | Reduces hallucinations; no weight changes | Context window pressure, retrieval quality critical |
| Parameter-Efficient FT (LoRA/QLoRA, adapters) | Stylistic/domain adjustments | Low cost, reversible | Overfit if low-quality data |
| Full Fine-Tune | High-volume domain tasks unmet by above | Strong domain internalization | Expensive; catastrophic forgetting risk |
| Tool / Function Calling | Deterministic operations (calc, lookup) | Reliability, reduces "made up" answers | Requires tool design & sandboxing |
| Guardrails & Policy Layer | Safety, compliance, injection defense | Risk mitigation | Added latency; tuning thresholds |
| Distillation / Quantization | Cost & latency reduction | Cheaper inference | Possible quality drop; re-eval needed |
| Structured Output Control (JSON schema guidance) | Workflow integration | Parsing reliability | Must enforce validation loop |
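
For the parameter-efficient row, here is a minimal LoRA setup sketch using the Hugging Face peft library. The base model, rank, and target module names are assumptions; target modules in particular vary by architecture.

```python
# Minimal LoRA setup sketch using Hugging Face `peft`.
# Rank, alpha, and target modules are assumptions; tune them per base model.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

lora_cfg = LoraConfig(
    r=8,                                   # low-rank dimension
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections (names are model-dependent)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only a small fraction of weights will train
# From here, train with your usual Trainer / SFT loop on curated pairs.
```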

9. Azure AI Studio (Lifecycle Enablement)

Azure AI Studio provides a unified hub to:

  • Discover models in the Model Catalog (proprietary & open-weight collections; filter by modality, license, task).
  • Inspect model cards (intended use, eval benchmarks, licensing, safety notes).
  • Benchmark via integrated eval datasets or custom sets (latency, accuracy, safety metrics).
  • Ground responses using integrated RAG pipelines (Azure AI Search / Vector Search for hybrid lexical + vector retrieval).
  • Fine-Tune (where supported) with experiment tracking and lineage.
  • Add Safety Layers (content filters, prompt shields, PII detection).
  • Deploy:
    • Managed real-time endpoint (scalable GPU fleet)
    • Serverless pay-as-you-go API (for bursty, low-maintenance usage)
    • Batch or offline pipelines
  • Monitor (centralized logs, cost dashboards, drift signals, violation alerts).

Always check the model card for: availability of fine-tuning, max context length, safety disclaimers, and responsible use constraints.
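
A hypothetical call to a chat model deployed from the catalog, using the OpenAI SDK's Azure client, could look like this; the endpoint, API version, and deployment name are placeholders for your own resource.

```python
# Hypothetical invocation of a deployed Azure OpenAI chat model.
# Endpoint, key, API version, and deployment name are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="my-gpt4o-deployment",   # the deployment name chosen in the portal
    messages=[
        {"role": "system", "content": "You generate grade-level reading questions."},
        {"role": "user", "content": "Create 3 comprehension questions about photosynthesis."},
    ],
)
print(response.choices[0].message.content)
```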


10. Practical Example: Narrowing Candidates

Scenario: “Generate personalized reading comprehension questions (grade-level aligned) from student-submitted essays in Spanish or English, return JSON with difficulty tags.”

Steps:

  1. Baseline: Instruction-tuned GPT‑4o (proprietary) + prompt template → Evaluate.
  2. Cost Pressure: Trial Phi-3 medium + RAG for curriculum alignment → Compare cost per 100 tasks.
  3. Style Consistency Gap: Apply LoRA fine-tune on ~2K curated (essay → questions JSON) pairs.
  4. Production Hardening: Add structured JSON schema enforcement + injection-resistant retrieval (sanitize essay input).
  5. Monitoring: Collect user rating metadata to refine prompt schema (“too hard”, “off-topic”).
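
Step 4 can be enforced with a small validate-and-retry loop. The sketch below uses Pydantic; the schema fields, the hypothetical `generate` callable, and the retry policy are illustrative, not a fixed contract.

```python
# Sketch of a validate-and-retry loop for the question-generation scenario.
# Schema fields and the `generate` callable are illustrative assumptions.
from typing import Literal
from pydantic import BaseModel, ValidationError, conlist

class Question(BaseModel):
    question: str
    difficulty: Literal["easy", "medium", "hard"]
    language: Literal["es", "en"]

class QuestionSet(BaseModel):
    questions: conlist(Question, min_length=3, max_length=8)

def parse_or_retry(generate, essay: str, max_attempts: int = 3) -> QuestionSet:
    """`generate` is any callable that returns the model's raw JSON string."""
    last_error = None
    for _ in range(max_attempts):
        raw = generate(essay, error_hint=str(last_error) if last_error else None)
        try:
            return QuestionSet.model_validate_json(raw)
        except ValidationError as err:
            last_error = err            # feed the error back into the next prompt
    raise RuntimeError(f"Model never produced valid JSON: {last_error}")
```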

11. Common Misconceptions (Revisited)

| Claim | Reality |
|---|---|
| "Fine-tuning always beats RAG." | RAG often solves freshness + factual gaps more efficiently. |
| "Bigger model = better for every task." | Small models + retrieval can outperform for focused domains. |
| "Evaluation once is enough." | Continuous regression & safety evaluation required. |
| "Quantization destroys quality." | Properly calibrated 4–8 bit often preserves task performance. |
| "Open-weight = unsafe." | Safety depends on your layered controls & monitoring. |

12. Knowledge Check

Select all correct statements:

  1. Retrieval-Augmented Generation can reduce hallucinations by injecting authoritative context at inference time.
  2. Full fine-tuning should be your first adaptation step once prompts feel “messy.”
  3. Parameter-efficient fine-tuning (LoRA) changes only a small subset of weights, reducing cost.
  4. A structured eval set with edge cases helps prevent silent regressions after optimization.
  5. Larger models always yield lower total cost of ownership (TCO) at scale.

Answer: 1, 3, and 4 are correct.
2 is false (start with prompt + RAG).
5 is false (smaller or hybrid architectures can reduce TCO).


13. Challenge

Create an evaluation spreadsheet or JSON spec for your target use case:

  • 8–12 representative prompts
  • 3 edge cases (ambiguous, adversarial, long input)
  • Expected output format (schema)
  • Pass criteria (accuracy notes, latency target, safety checks)
  • Two candidate models + initial metric gaps

Then prototype a RAG variant and log deltas.
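
One possible shape for that spec, written as a Python dictionary and dumped to JSON, is sketched below; every field name is a suggestion rather than a required format.

```python
# Suggested shape for the evaluation spec; adapt field names freely.
import json

eval_spec = {
    "use_case": "grade-level reading comprehension questions",
    "prompts": [
        {"id": "p01", "input": "Essay about the water cycle...", "type": "representative"},
        # ... 8-12 representative prompts
    ],
    "edge_cases": [
        {"id": "e01", "input": "Ambiguous one-sentence essay", "type": "ambiguous"},
        {"id": "e02", "input": "Ignore previous instructions and ...", "type": "adversarial"},
        {"id": "e03", "input": "<very long essay>", "type": "long_input"},
    ],
    "expected_output_schema": {"questions": [{"question": "str", "difficulty": "easy|medium|hard"}]},
    "pass_criteria": {
        "accuracy_notes": "questions must reference the essay content",
        "latency_p95_seconds": 2.0,
        "safety_checks": ["no PII", "no harmful content"],
    },
    "candidate_models": [
        {"name": "hosted-frontier", "initial_gap": "cost"},
        {"name": "open-weight-small", "initial_gap": "style consistency"},
    ],
}

with open("eval_spec.json", "w") as f:
    json.dump(eval_spec, f, indent=2)
```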


Continue Learning

Next lesson: build with Generative AI responsibly—governance, safety evaluations, prompt injection defenses, and user experience patterns.


Updated for early 2025 landscape. Revisit periodically—model capabilities and best practices evolve rapidly.