How to navigate today’s model landscape, classify model types, choose and evaluate LLMs, and decide between prompt engineering, RAG, fine‑tuning, or training.
Exploring and Comparing Different LLMs
In the previous lesson we introduced Generative AI and how Large Language Models (LLMs) enable new educational startup scenarios. Here we dive into selecting and comparing models: taxonomy, trade‑offs, evaluation, and improvement pathways (prompting, retrieval augmentation, fine‑tuning, and—rarely—training from scratch).
Goal: Enable practical, evidence-based model decisions instead of brand-driven or “bigger-is-better” assumptions.
Lesson Overview
You will learn:
- A modern taxonomy of foundation and large language models (2024–2025 landscape)
- How licensing (open‑weight vs. proprietary) and modality affect choice
- Architecture patterns (decoder-only, encoder-decoder, MoE, multimodal unification)
- A structured selection & evaluation framework (capability, cost, latency, safety, control)
- When to apply Prompt Engineering, RAG, Parameter-Efficient Fine-Tuning, or Full Fine-Tuning
- How Azure AI Studio (Model Catalog, benchmarks, evaluation & deployment) supports the lifecycle
Learning Goals
After completing this lesson you can:
- Classify an LLM by modality, architecture, and license.
- Apply a decision path to select a model and customization approach.
- Design an evaluation loop (offline + human + safety + regression).
- Explain when not to fine-tune and why RAG often comes first.
- Deploy and iterate using Azure AI Studio tools.
1. Model Landscape Taxonomy
Modern models can be compared along multiple axes:
Axis | Categories | Why It Matters |
---|---|---|
Modality | Text-only, Multimodal (text+image), Audio, Video, Code-specialized | Task coverage; a single multimodal model vs. composing a pipeline of single-modality models |
License / Access | Proprietary API (hosted), Open-weight (downloadable), Restricted license (research/commercial clauses) | Control, cost optimization, compliance, customization |
Scale | Frontier (hundreds of billions / MoE), Mid (tens of billions), Small / Efficient (<10B) | Latency, cost, edge/embedded feasibility |
Architecture | Decoder-only, Encoder-only, Encoder-Decoder, Mixture-of-Experts (MoE), Retrieval-augmented stacks | Generation quality vs. efficiency vs. bidirectional features |
Specialization | General chat, Code, Reasoning, Safety-aligned, Domain-tuned (medical, finance) | Match to task shape; avoid overpaying |
Adaptation Support | Prompt-only, Function calling, Tool use, Structured output, Fine-tuning / LoRA endpoints | Integration pathways |
Representative Examples (Early 2025)
Type | Open-Weight / Downloadable | Proprietary / Hosted |
---|---|---|
General Chat / Multimodal | Llama 3 family, Mistral & Mixtral MoE, Gemma 2, Phi-3 (mini / small / medium), DeepSeek | GPT‑4o, Claude 3 family, Gemini 1.5, Cohere Command R/R+, Azure OpenAI variants |
Code | StarCoder2, Code Llama, Phi-3 medium (instruct) | GPT‑4o, Claude 3.5 Sonnet, Gemini Code Assist |
Small / Edge | Phi-3 mini, Mistral 7B, Gemma 2 2B/9B, TinyLlama | Hosted distilled endpoints / dynamic scaling |
Embeddings | Instructor models, E5 variants, bge-large, nomic-embed | text-embedding-3-large, Cohere Embed, Azure OpenAI embeddings |
Speech / Audio | Whisper (open), Distil-Whisper | GPT‑4o audio, Azure Neural voices |
Image Generation | Stable Diffusion SDXL / Turbo, Flux derivatives | DALL·E 3, Midjourney (service), Ideogram |
Terminology: a foundation model is broadly pre-trained on large unlabeled corpora (often across modalities); an LLM is a language-focused foundation model. The line between the two is increasingly blurred as unified multimodal models emerge.
2. Foundation Models vs. Chat-Tuned LLMs
- Base / Foundation Model: Broadly pre-trained, not instruction-aligned, may output unstructured continuations.
- Instruction / Chat-Tuned Model: Base model further aligned via supervised fine-tuning + reinforcement signals (RLHF / DPO / preference optimization).
- Further Domain Adaptation: Lightweight fine-tuning (LoRA), RAG layering, tool orchestration, or safety guardrails.
Practical implication: use an instruction/chat-tuned variant for most application prototypes; drop to a base model only if you need raw continuation behavior (e.g., research on token statistics) or plan to run your own alignment pipeline.
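To make the distinction concrete, here is a minimal sketch assuming the Hugging Face `transformers` library and the TinyLlama chat checkpoint mentioned later in this lesson (any instruction-tuned model with a chat template works): the chat-tuned variant expects role-tagged messages rendered through its chat template, whereas a base model simply continues whatever raw text you hand it.

```python
# Minimal sketch: render a chat prompt the way an instruction-tuned model expects.
# The model name is illustrative; swap in any chat-tuned checkpoint you have access to.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

messages = [
    {"role": "system", "content": "You are a concise teaching assistant."},
    {"role": "user", "content": "Summarize photosynthesis in two sentences."},
]

# apply_chat_template wraps the messages in the special tokens the model was aligned on.
# Sending the same plain text to a base model would only produce free-form continuation.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```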
3. Open-Weight vs. Proprietary
Dimension | Open-Weight Advantage | Proprietary Advantage | Considerations |
---|---|---|---|
Control | Full infra & fine-tune freedom | Managed scaling & reliability | Ops burden vs. agility |
Data Privacy | On-prem / air-gapped possible | Enterprise assurances / regional hosting | Evaluate data retention policies |
Cost Curve | Lower marginal cost at scale | Lower startup cost, pay-as-you-go | Volume break-even point shifts |
Performance Pace | You apply updates selectively | Immediate access to frontier releases | Risk of stagnation vs. forced upgrades |
Safety | Customizable policies | Built-in layered safety filters | With open-weight models you must implement guardrails yourself |
Hybrid patterns (proprietary for critical reasoning + open-weight for cost-sensitive batch tasks) are increasingly common.
4. Architectural Patterns
Pattern | Use Case Strength | Examples |
---|---|---|
Decoder-only (causal) | Autoregressive generation (chat, code) | GPT family, Llama, Mistral, Phi |
Encoder-only | Understanding / classification / embeddings | BERT, RoBERTa |
Encoder-Decoder | Translation, summarization, seq2seq | T5, BART |
Mixture-of-Experts (Sparse) | Higher quality per FLOP | Mixtral, DeepSeek MoE variants |
Retrieval-Augmented Stack | Grounded QA, freshness | Any LLM + Azure AI Search / vector DB |
Multimodal Unified | Text + image (and emerging audio/video) | GPT‑4o, Gemini 1.5, Llama 3 multimodal variants |
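To see these patterns side by side, the sketch below runs small public checkpoints (chosen purely for illustration and assumed to be available through Hugging Face `transformers`) against the three core task shapes from the table.

```python
# Decoder-only, encoder-only, and encoder-decoder patterns via transformers pipelines.
from transformers import pipeline

# Decoder-only (causal): autoregressive text generation.
generate = pipeline("text-generation", model="distilgpt2")
print(generate("Once upon a time", max_new_tokens=20)[0]["generated_text"])

# Encoder-only: bidirectional understanding, e.g. masked-token prediction or embeddings.
fill = pipeline("fill-mask", model="distilroberta-base")
print(fill("The student answered the question <mask>.")[0]["token_str"])

# Encoder-decoder: sequence-to-sequence tasks such as summarization.
summarize = pipeline("summarization", model="t5-small")
text = ("Photosynthesis converts light energy into chemical energy stored as glucose, "
        "releasing oxygen as a by-product; it takes place in the chloroplasts of plant cells.")
print(summarize(text, max_length=40, min_length=5)[0]["summary_text"])
```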
5. Model Selection Framework
Score candidate models across weighted criteria:
Criterion | Questions | Metrics / Signals |
---|---|---|
Capability | Does it meet task accuracy & reasoning depth? | Benchmark scores, internal eval set |
Latency | Meets UX threshold (e.g., p95 < 2s)? | Tokens/sec, first-token latency |
Cost | Sustainable at projected volume? | $/1K input + output tokens, GPU hours |
Control | Need on-prem / custom fine-tune? | License, adaptation APIs |
Safety & Compliance | Built-in filters? Region/data residency? | Policy docs, SOC2/ISO reports |
Scalability | Spikes, concurrency, batching efficiency? | Autoscaling tests |
Maintenance | Update cadence manageable? | Release notes frequency |
Tip: Start with a narrow internal eval dataset (20–50 carefully curated prompts representing core tasks + edge cases) before scaling to larger synthetic sets.
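One lightweight way to operationalize the criteria table is a weighted scorecard. The weights, candidate names, and per-criterion scores below are placeholders; replace them with numbers from your own evaluation runs.

```python
# Weighted scorecard sketch: scores are normalized to 0-1 per criterion (higher is better).
CRITERIA_WEIGHTS = {
    "capability": 0.30,
    "latency": 0.15,
    "cost": 0.20,
    "control": 0.10,
    "safety_compliance": 0.15,
    "scalability": 0.05,
    "maintenance": 0.05,
}

def weighted_score(scores: dict) -> float:
    return sum(CRITERIA_WEIGHTS[c] * scores.get(c, 0.0) for c in CRITERIA_WEIGHTS)

candidates = {
    "hosted-frontier-model": {"capability": 0.95, "latency": 0.70, "cost": 0.50, "control": 0.40,
                              "safety_compliance": 0.90, "scalability": 0.90, "maintenance": 0.90},
    "small-open-weight":     {"capability": 0.75, "latency": 0.90, "cost": 0.90, "control": 0.90,
                              "safety_compliance": 0.60, "scalability": 0.70, "maintenance": 0.60},
}

# Rank candidates by weighted score.
for name, scores in sorted(candidates.items(), key=lambda kv: weighted_score(kv[1]), reverse=True):
    print(f"{name}: {weighted_score(scores):.2f}")
```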
6. Evaluation Methodology
- Define Tasks: summarization, rubric feedback, misconception extraction, JSON structuring.
- Create Ground Truth / Reference Sets:
  - Human-annotated examples (gold set).
  - Edge/failure cases (long inputs, ambiguous instructions, adversarial prompts).
- Automated Scoring (where possible; see the sketch after this list):
  - Exact match or semantic similarity (cosine over embeddings) for extraction tasks.
  - Schema validation for structured outputs.
  - Latency and cost instrumentation.
- Model-Assisted Judging:
  - Use a separate, stronger model to draft comparative judgments, with human spot-checks to control judge bias while keeping review scalable.
- Safety & Red-Teaming:
  - Prompt-injection attempts, PII leakage tests, harmful-content triggers.
- Regression Gates:
  - Freeze baseline metrics; any change (prompt template, RAG chunking, quantization) must not degrade a gated metric beyond the tolerance band.
Track: helpfulness, grounded accuracy, hallucination rate, harmful content rate, latency p50/p95, cost ($/req), schema adherence %, user satisfaction.
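Below is a minimal sketch of the automated-scoring and regression-gate steps in plain Python. Metric names, thresholds, and the commented-out `call_model` client are illustrative assumptions, not a prescribed implementation.

```python
import json

BASELINE = {"schema_adherence": 0.98, "hallucination_rate": 0.05}  # frozen baseline metrics
TOLERANCE = 0.02  # allowed absolute degradation per gated metric
HIGHER_IS_BETTER = {"schema_adherence": True, "hallucination_rate": False}

def validate_schema(raw: str, required_keys: set) -> bool:
    """Check that the model returned parseable JSON containing the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj)

def regression_gate(current: dict) -> bool:
    """Fail if any gated metric degrades beyond the tolerance band vs. baseline."""
    passed = True
    for metric, base in BASELINE.items():
        delta = current[metric] - base
        degraded = delta < -TOLERANCE if HIGHER_IS_BETTER[metric] else delta > TOLERANCE
        if degraded:
            print(f"GATE FAIL: {metric} = {current[metric]:.3f} (baseline {base:.3f})")
            passed = False
    return passed

# Outputs would normally come from your model client, e.g. call_model(prompt).
# Only one of these two sample outputs validates, so the gate intentionally fails.
outputs = ['{"questions": ["Q1", "Q2"], "difficulty": "easy"}', "not json at all"]
adherence = sum(validate_schema(o, {"questions", "difficulty"}) for o in outputs) / len(outputs)
print(regression_gate({"schema_adherence": adherence, "hallucination_rate": 0.04}))
```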
7. Customization Decision Path
1. Can prompt engineering, better instructions, and an output schema achieve the target?
   - Yes → Ship and monitor.
   - No → continue.
2. Do we just need proprietary or current data?
   - Yes → Add RAG (vector index + retrieval ranking).
   - No → continue.
3. Are failure modes stylistic or narrow domain phrasing?
   - Yes → Parameter-Efficient Fine-Tuning (LoRA/QLoRA).
   - No → continue.
4. Do we have large, high-quality labeled pairs and budget?
   - Yes → Full fine-tune / alignment.
   - No → Revisit requirements or take a hybrid approach.
Training a new LLM from scratch is rarely justified: it requires massive curated data, sustained infrastructure investment, a safety-alignment pipeline, and continuous evaluation.
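The decision path above can be encoded as a tiny helper to keep team discussions honest; the boolean inputs are judgments that come out of your evaluation, not something code can determine for you.

```python
# A sketch of the decision path as a function; inputs come from your eval findings.
def choose_adaptation(prompting_sufficient: bool,
                      needs_fresh_or_private_data: bool,
                      failures_are_stylistic: bool,
                      has_large_labeled_set_and_budget: bool) -> str:
    if prompting_sufficient:
        return "Ship with prompt engineering; monitor."
    if needs_fresh_or_private_data:
        return "Add RAG (vector index + retrieval ranking)."
    if failures_are_stylistic:
        return "Parameter-efficient fine-tuning (LoRA/QLoRA)."
    if has_large_labeled_set_and_budget:
        return "Full fine-tune / alignment run."
    return "Revisit requirements or combine approaches (hybrid)."

print(choose_adaptation(False, True, False, False))  # -> Add RAG (vector index + retrieval ranking).
```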
8. Improving LLM Results (Updated Approaches)
Approach | When | Benefits | Risks / Costs |
---|---|---|---|
Prompt Engineering (role, style, schema) | First line | Fast iteration | Prompt drift, verbosity costs |
Retrieval-Augmented Generation (RAG) | Need current or proprietary truth | Reduces hallucinations; no weight changes | Context window pressure, retrieval quality critical |
Parameter-Efficient FT (LoRA/QLoRA, adapters) | Stylistic/domain adjustments | Low cost, reversible | Overfit if low-quality data |
Full Fine-Tune | High-volume domain tasks unmet by above | Strong domain internalization | Expensive; catastrophic forgetting risk |
Tool / Function Calling | Deterministic operations (calc, lookup) | Reliability, reduces “made up” answers | Requires tool design & sandboxing |
Guardrails & Policy Layer | Safety, compliance, injection defense | Risk mitigation | Added latency; tuning thresholds |
Distillation / Quantization | Cost & latency reduction | Cheaper inference | Possible quality drop; re-eval needed |
Structured Output Control (JSON schema guidance) | Workflow integration | Parsing reliability | Must enforce validation loop |
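To show how retrieval augmentation slots in front of generation, here is a toy sketch. Retrieval is plain TF-IDF via scikit-learn purely for brevity; a production system would use hybrid lexical + vector search (e.g. Azure AI Search), and `call_model` is a hypothetical placeholder for whichever LLM client you choose.

```python
# Toy RAG sketch: retrieve the most relevant chunk and prepend it to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chunks = [
    "The curriculum requires two reading comprehension exercises per week.",
    "Essays are graded on structure, vocabulary, and argument clarity.",
    "Students may submit work in Spanish or English.",
]

def retrieve(query: str, k: int = 1) -> list:
    """Return the top-k chunks by TF-IDF cosine similarity to the query."""
    vec = TfidfVectorizer().fit(chunks + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(chunks))[0]
    return [chunks[i] for i in sims.argsort()[::-1][:k]]

question = "How many comprehension exercises does the curriculum require?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# response = call_model(prompt)  # placeholder for your chosen LLM client
print(prompt)
```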
9. Azure AI Studio (Lifecycle Enablement)
Azure AI Studio provides a unified hub to:
- Discover models in the Model Catalog (proprietary & open-weight collections; filter by modality, license, task).
- Inspect model cards (intended use, eval benchmarks, licensing, safety notes).
- Benchmark via integrated eval datasets or custom sets (latency, accuracy, safety metrics).
- Ground responses using integrated RAG pipelines (Azure AI Search / Vector Search for hybrid lexical + vector retrieval).
- Fine-Tune (where supported) with experiment tracking and lineage.
- Add Safety Layers (content filters, prompt shields, PII detection).
- Deploy:
  - Managed real-time endpoints (scalable GPU fleet)
  - Serverless pay-as-you-go APIs (for bursty, low-maintenance usage)
  - Batch or offline pipelines
- Monitor (centralized logs, cost dashboards, drift signals, violation alerts).
Always check the model card for: availability of fine-tuning, max context length, safety disclaimers, and responsible use constraints.
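Once a model is deployed, calling it follows the usual SDK pattern. The sketch below assumes an Azure OpenAI chat deployment and the `openai` Python SDK (v1.x); the endpoint, deployment name, and API version are placeholders to take from your own deployment page.

```python
import os
from openai import AzureOpenAI

# Placeholders: use the endpoint, key, API version, and deployment name from your resource.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

response = client.chat.completions.create(
    model="<your-deployment-name>",  # the deployment name, not the base model name
    messages=[
        {"role": "system", "content": "You generate grade-aligned comprehension questions."},
        {"role": "user", "content": "Create three questions about the attached essay."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```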
10. Practical Example: Narrowing Candidates
Scenario: “Generate personalized reading comprehension questions (grade-level aligned) from student-submitted essays in Spanish or English, return JSON with difficulty tags.”
Steps:
- Baseline: Instruction-tuned GPT‑4o (proprietary) + prompt template → Evaluate.
- Cost Pressure: Trial Phi-3 medium + RAG for curriculum alignment → Compare cost per 100 tasks.
- Style Consistency Gap: Apply LoRA fine-tune on ~2K curated (essay → questions JSON) pairs.
- Production Hardening: Add structured JSON schema enforcement (see the schema sketch below) + injection-resistant retrieval (sanitize essay input).
- Monitoring: Collect user rating metadata to refine prompt schema (“too hard”, “off-topic”).
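For the production-hardening step, the JSON contract can be made explicit and machine-checked. The sketch below uses the `jsonschema` library; the field names and ranges are assumptions for illustration, to be aligned with your actual rubric.

```python
# Hypothetical JSON contract for generated comprehension questions, enforced with jsonschema.
from jsonschema import ValidationError, validate

QUESTION_SCHEMA = {
    "type": "object",
    "properties": {
        "language": {"enum": ["es", "en"]},
        "grade_level": {"type": "integer", "minimum": 1, "maximum": 12},
        "questions": {
            "type": "array",
            "minItems": 3,
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "difficulty": {"enum": ["easy", "medium", "hard"]},
                },
                "required": ["text", "difficulty"],
            },
        },
    },
    "required": ["language", "grade_level", "questions"],
}

def is_valid(candidate: dict) -> bool:
    """Return True if the model output satisfies the contract; log the first violation otherwise."""
    try:
        validate(candidate, QUESTION_SCHEMA)
        return True
    except ValidationError as err:
        print(f"Schema violation: {err.message}")
        return False

print(is_valid({"language": "en", "grade_level": 5, "questions": [
    {"text": "What is the main idea?", "difficulty": "easy"},
    {"text": "Why does the author mention rainfall?", "difficulty": "medium"},
    {"text": "How would you restructure the argument?", "difficulty": "hard"},
]}))
```

Pairing this check with a retry loop (re-prompting the model with the validation error appended) is a common way to close the enforcement loop.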
11. Common Misconceptions (Revisited)
Claim | Reality |
---|---|
“Fine-tuning always beats RAG.” | RAG often solves freshness + factual gaps more efficiently. |
“Bigger model = better for every task.” | Small models + retrieval can outperform for focused domains. |
“Evaluation once is enough.” | Continuous regression & safety evaluation required. |
“Quantization destroys quality.” | Properly calibrated 4–8 bit often preserves task performance. |
“Open-weight = unsafe.” | Safety depends on your layered controls & monitoring. |
12. Knowledge Check
Select all correct statements:
1. Retrieval-Augmented Generation can reduce hallucinations by injecting authoritative context at inference time.
2. Full fine-tuning should be your first adaptation step once prompts feel “messy.”
3. Parameter-efficient fine-tuning (LoRA) changes only a small subset of weights, reducing cost.
4. A structured eval set with edge cases helps prevent silent regressions after optimization.
5. Larger models always yield lower total cost of ownership (TCO) at scale.
Answer: 1, 3, and 4 are correct.
2 is false (start with prompt + RAG).
5 is false (smaller or hybrid architectures can reduce TCO).
13. Challenge
Create an evaluation spreadsheet or JSON spec for your target use case:
- 8–12 representative prompts
- 3 edge cases (ambiguous, adversarial, long input)
- Expected output format (schema)
- Pass criteria (accuracy notes, latency target, safety checks)
- Two candidate models + initial metric gaps
Then prototype a RAG variant and log deltas.
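If you prefer to start from a template, here is a starter skeleton for the spec; every value is a placeholder to replace with your own use case, and dumping it to JSON makes it easy to share.

```python
import json

eval_spec = {
    "use_case": "Grade-aligned reading comprehension questions",
    "prompts": [
        {"id": "p01", "input": "<representative essay #1>", "expected": "<reference questions>"},
        # ... add 8-12 representative prompts
    ],
    "edge_cases": [
        {"id": "e01", "type": "ambiguous", "input": "<essay with unclear topic>"},
        {"id": "e02", "type": "adversarial", "input": "<essay containing prompt-injection text>"},
        {"id": "e03", "type": "long_input", "input": "<essay near the context limit>"},
    ],
    "output_schema": {"required": ["language", "grade_level", "questions"]},
    "pass_criteria": {"min_accuracy": 0.85, "p95_latency_s": 2.0, "safety_violations": 0},
    "candidate_models": ["<hosted frontier model>", "<small open-weight model>"],
}

print(json.dumps(eval_spec, indent=2))
```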
Continue Learning
Next lesson: build with Generative AI responsibly—governance, safety evaluations, prompt injection defenses, and user experience patterns.
Updated for early 2025 landscape. Revisit periodically—model capabilities and best practices evolve rapidly.