Cross Column

Sunday, April 12, 2026

Beyond the model: Enhancing LLM applications (Stanford CS230)

TL;DR
A practical breakdown of how CS230 approaches modern LLM engineering—focusing on prompting, chaining, RAG, agents, and evals—while emphasizing modular design, debuggability, and fundamentals over hype. Fine‑tuning is used sparingly; strong engineering habits matter most as the field evolves rapidly.
The Three Stages Shaping Modern RAG: Pre‑Train, Fine‑Tune, Infer (YouTube link)


Lecture Goal & Agenda

The Stanford CS230 lecture moves beyond basic neural networks and shifts the focus to the engineering practices that make modern AI systems actually work in production. It opens with the core pillars of contemporary LLM development—strong prompting, multi‑step chains, Retrieval‑Augmented Generation (RAG), agent workflows, and rigorous evaluation—then walks through each theme in a structured progression:

  1. Augmenting LLMs: Challenges and Opportunities
  2. Prompt Engineering: The First Line of Optimization
  3. Fine-Tuning: Proceed with Caution
  4. Retrieval-Augmented Generation (RAG): Enhancing Model Utility
  5. Agentic AI Workflows: Toward Autonomous and Specialized Systems
  6. Case Study: Evals
  7. Multi-Agent Workflows: Parallelism
  8. What’s Next in AI? Personal Thoughts

As the lecture moves through these topics, a consistent message emerges: fine‑tuning should be used sparingly, not reflexively. The emphasis is on building modular, debuggable systems and grounding decisions in measurable performance rather than hype. In a field evolving at breakneck speed, broad fundamentals and adaptable engineering habits remain the most durable advantage.


Challenges & Opportunities of Augmenting Base LLMs

  • Prompting methods
  • Fine-tuning (why the lecturer avoids it)
  • Retrieval-Augmented Generation (RAG)
  • Agentic AI workflows (definition + examples)
  • Case study on agentic workflows + evals
  • Multi-agent workflows
  • Open discussion on what's next in AI

1. Limitations of Vanilla Pre-trained LLMs (e.g., GPT-3.5 Turbo, GPT-4)

Students and the lecturer discussed the key limitations:

  • Lack of domain-specific knowledge (e.g., specialized crop disease detection)
  • Distribution shift (real-world data differs from training data, e.g., low-quality/dark images)
  • Outdated knowledge (cutoff dates; struggles with new trends, slang like "rizz", or events like "Covfefe")
  • Breadth vs. depth: Good at general knowledge but poor on narrow, high-precision enterprise tasks
  • Inefficiency: Uses a massive model when only ~2% of capabilities are needed (pruning/quantization possible)
  • Hard to control: Can produce racist/offensive outputs (e.g., Microsoft's Tay bot, political bias debates between Grok & OpenAI)
  • Underperformance on specialized tasks: Medical diagnosis, legal contracts (style/precision matters), task-specific classification (e.g., NPS thresholds vary by industry)
  • Limited context handling: Context windows max out around 200k tokens (roughly two books); attention struggles with "needle in a haystack" problems in large corpora
  • No reliable sourcing: Hallucinates references; critical for legal/medical/education use cases

Two dimensions for improvement:

  • Horizontal: Better foundation models (GPT-3.5 → GPT-4 → GPT-4o → GPT-5)
  • Vertical (focus of lecture): Engineering techniques around a fixed model (prompting, RAG, agents, etc.)
    • In theory, with infinite compute/context, RAG might become unnecessary (just feed everything). In practice, latency, sourcing, and efficiency make RAG valuable long-term (analogous to search engines narrowing the web).

2. Prompt Engineering (First Line of Optimization)


Prompting significantly boosts performance without changing model weights.

Key study (HBS/UPenn/Wharton on BCG consultants):

  • AI helped on some tasks ("within the jagged frontier") but hurt others ("falling asleep at the wheel").
  • Training on prompting made the biggest difference.
  • Two interaction styles: Centaurs (delegate big tasks to AI) vs. Cyborgs (rapid back-and-forth collaboration).[3] Students tend toward cyborgs; enterprises toward centaurs.

Basic principles & techniques:

  • Be specific (length, focus, audience)
  • Role prompting: "Act as a renewable energy expert presenting at Davos"
  • Few-shot prompting: Provide examples to align the model to subjective tasks (e.g., tone classification of reviews)
  • Chain-of-Thought (CoT): "Think step by step" + explicit steps (improves reasoning)
  • Reflection: Generate → critique → improve
  • Prompt templates: Reusable, scalable (insert user metadata); many open-source on GitHub ("awesome prompt templates")
  • Chaining: Break complex tasks into sequential prompts (easier debugging, better control, modular optimization) vs. one monolithic prompt
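The chaining and reflection ideas above can be sketched in a few lines. This is a minimal illustration, not the lecture's code: `call_llm` is a placeholder stub standing in for whatever model client you use, and the three step functions are hypothetical names.

```python
# Sketch of prompt chaining with a reflection step: each stage is a small,
# separately debuggable prompt. `call_llm` is a stub -- swap in a real client.

def call_llm(prompt: str) -> str:
    """Stub standing in for a real model API call."""
    return f"<response to: {prompt[:40]}>"

def summarize(text: str) -> str:
    return call_llm(f"Summarize in 3 bullet points:\n{text}")

def critique(summary: str) -> str:
    return call_llm(f"Critique this summary for accuracy and tone:\n{summary}")

def revise(summary: str, feedback: str) -> str:
    return call_llm(f"Revise the summary.\nSummary:\n{summary}\nFeedback:\n{feedback}")

def chain(text: str) -> str:
    # Each intermediate output can be logged and evaluated on its own,
    # which is the debugging advantage over one monolithic prompt.
    draft = summarize(text)
    feedback = critique(draft)
    return revise(draft, feedback)
```

Because every stage is a plain function, you can swap prompts, insert evals, or cache intermediate results without touching the rest of the chain.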

Testing & Evals for prompts:

  • Manual human rating
  • Automated: Platforms like PromptFoo
  • LLM-as-judge: Pairwise comparison, single-answer grading (1-5), or rubric-based scoring (can combine with few-shot)

Zero-shot vs. Few-shot: Few-shot aligns model to your specific criteria quickly without fine-tuning.
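The rubric-based LLM-as-judge pattern might look like the sketch below. `judge_llm` is a stub for a real judge model, and the rubric text is an invented example, not the lecture's.

```python
# Sketch of LLM-as-judge with a 1-5 rubric. A real judge model replaces the
# stub; the parsing and range check stay the same either way.

def judge_llm(prompt: str) -> str:
    return "4"  # stub: a real judge model would return a score string

RUBRIC = """Score the answer from 1 to 5:
5 = accurate, complete, well-sourced
3 = partially correct
1 = wrong or off-topic"""

def grade(question: str, answer: str) -> int:
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}\nScore (1-5):"
    score = int(judge_llm(prompt).strip())
    if not 1 <= score <= 5:
        raise ValueError(f"judge returned out-of-range score: {score}")
    return score
```

Few-shot examples of already-graded answers can be appended to the rubric to align the judge with your own scoring standards, as the lecture notes.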


3. Fine-Tuning (Why the Lecturer Avoids It)


Disadvantages:

  • Requires substantial labeled data
  • Risk of overfitting → loses general-purpose utility
  • Time- and cost-intensive
  • By the time you're done, newer base models often outperform your fine-tuned version

When it might still make sense: High-precision, repeated domain-specific tasks (legal, scientific) with specialized language.

Funny cautionary example: Fine-tuning on internal Slack messages made the model respond like lazy colleagues ("I shall work on that in the morning...") instead of following instructions.

Trend: Boundaries between few-shot prompting and lightweight fine-tuning are blurring.


4. Retrieval-Augmented Generation (RAG)


Why RAG? Addresses knowledge gaps, cutoff dates, hallucinations, sourcing, and large-context issues without retraining the model.
How vanilla RAG works:
  1. Embed documents → store in vector database
  2. Embed user query
  3. Retrieve most similar documents (via distance metrics)
  4. Add retrieved docs to prompt + instructions ("Answer based only on these documents; say 'I don't know' otherwise; cite sources")
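The four steps above can be sketched end to end. To keep the example runnable offline, the "embedding" is a toy bag-of-words vector and retrieval is brute-force cosine similarity; a real system would use a learned embedding model and a vector database.

```python
# Minimal sketch of the vanilla RAG loop: embed docs, embed query, retrieve
# by similarity, assemble a grounded prompt. Toy embeddings for illustration.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())   # toy bag-of-words "embedding"

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n---\n".join(retrieve(query, docs))
    return (f"Answer based only on these documents; say 'I don't know' "
            f"otherwise, and cite sources.\n\n{context}\n\nQuestion: {query}")
```

The structure is the same at scale; only the embedder and the index change.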

Advanced RAG techniques:

  • Chunking: Store embeddings at document, chapter, or passage level for better sourcing/precision
  • HyDE (Hypothetical Document Embeddings): Generate a fake document from the query, then embed it (better matches real documents)
  • Many other research branches (survey papers available)
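HyDE fits in one extra step before retrieval. In this sketch, `generate` and `embed` are placeholder stubs for a model call and an embedding model; the point is only where the hypothetical document enters the pipeline.

```python
# Sketch of HyDE: embed a model-generated hypothetical answer instead of the
# raw query, since a fake document tends to land closer in embedding space to
# real documents than a short question does. Both helpers are stubs.

def generate(prompt: str) -> str:
    # Stub for an LLM call that drafts a plausible answer passage.
    return "Refunds are accepted within 30 days of purchase with a receipt."

def embed(text: str) -> list[float]:
    # Stub embedder: a real system would call an embedding model here.
    return [float(len(word)) for word in text.split()[:8]]

def hyde_embedding(query: str) -> list[float]:
    hypothetical = generate(f"Write a short passage answering: {query}")
    return embed(hypothetical)  # use this vector for nearest-neighbor search
```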

Limitations & debates: Vanilla RAG struggles with very long documents; attention issues persist.


5. Agentic AI Workflows


Coined/popularized by Andrew Ng. Refers to multi-step, autonomous workflows using prompts + tools + memory + resources, rather than single prompts.

Paradigm shift (especially for software engineers):

  • From structured/deterministic data & code → fuzzy/free-form text, images, dynamic interpretation
  • Think like a manager: Decompose tasks into roles (e.g., researcher → drafter → editor → analyst)
  • Experimentation is cheap → more comfortable discarding code
  • Need human-in-the-loop for fuzzy parts + guardrails

Core components of an agent:

  • Prompts (optimized as above)
  • Memory: Working (fast) vs. archival/long-term (slower)
  • Tools: APIs, code execution, web search, etc.
  • Resources: Databases, CRMs, documents
  • MCP (Model Context Protocol) by Anthropic: More scalable agent-to-system communication than raw APIs (agent discovers requirements via conversation)

Degrees of autonomy:

  • Hard-coded steps (least autonomous)
  • Hard-coded tools only
  • Fully autonomous (decides steps, creates tools, writes code)

Example: Simple refund policy response (RAG) vs. full agentic workflow (retrieve policy → ask for order # → check API → confirm & process).
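The refund workflow can be sketched as a chain of tool calls with a guardrail. Every function here is a stub standing in for a real component (RAG lookup, chat turn, order API); the names and the 30-day policy are illustrative assumptions, not the lecture's code.

```python
# Sketch of the agentic refund workflow: retrieve policy -> ask for order
# number -> check the order API -> apply a guardrail before confirming.

def retrieve_policy(topic: str) -> str:
    return "Refunds allowed within 30 days."           # stub RAG lookup

def ask_user(question: str) -> str:
    return "ORDER-1234"                                 # stub chat turn

def check_order(order_id: str) -> dict:
    return {"id": order_id, "age_days": 12}             # stub order API

def handle_refund_request() -> str:
    policy = retrieve_policy("refunds")                 # ground in policy
    order_id = ask_user("What is your order number?")
    order = check_order(order_id)
    if order["age_days"] <= 30:                         # guardrail: policy check
        return f"Refund approved for {order['id']}."
    return f"Sorry, {order['id']} is outside the refund window."
```

A more autonomous agent would decide these steps itself; hard-coding them, as here, is the least autonomous (and most predictable) end of the spectrum.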


6. Case Study: Customer Support Agent + Evals


Task decomposition (key starting point):

  • Extract key info from user message (LLM)
  • Lookup/update customer record (tool)
  • Check policy (RAG/tool)
  • Draft & send response (LLM + tool)

How to evaluate & improve:

  • LLM traces (critical for debugging)
  • End-to-end metrics: User satisfaction ratings
  • Component-based: Debug individual prompts/tools
  • Objective (e.g., correct order ID extracted) vs. Subjective (politeness, helpfulness)
  • Quantitative (success rate, latency) vs. Qualitative (error analysis, hallucinations)
  • Use LLM judges with rubrics for scalable subjective evals
  • Mix of human review + automated proxies
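Component-level traces and metrics can be wired in with a thin wrapper. This is a sketch of the idea, not a real tracing library: each step records whether it succeeded and how long it took, so failures can be localized per component.

```python
# Sketch of LLM tracing plus component-based quantitative evals: wrap each
# step, log an entry per call, then aggregate success rate per component.
import time
from collections import defaultdict

TRACES: list[dict] = []

def traced(component: str, fn, *args):
    start = time.perf_counter()
    try:
        out, ok = fn(*args), True
    except Exception:
        out, ok = None, False
    TRACES.append({"component": component, "ok": ok,
                   "latency_s": time.perf_counter() - start})
    return out

def success_rates() -> dict[str, float]:
    totals, wins = defaultdict(int), defaultdict(int)
    for t in TRACES:
        totals[t["component"]] += 1
        wins[t["component"]] += t["ok"]
    return {c: wins[c] / totals[c] for c in totals}
```

End-to-end satisfaction scores tell you *that* something is wrong; per-component rates like these tell you *where*.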


7. Multi-Agent Workflows


Why multi-agent? 

  • Parallelization (run independent subtasks simultaneously)
  • Reusability (one specialized agent shared across teams)
  • Better debugging (specialized agents easier to isolate)

Example: Smart home automation

  • Biometric/location tracking
  • Climate control
  • Energy management
  • Security & permissions
  • Fridge/grocery agent
  • Weather integration
  • Entertainment
  • Orchestrator (user-facing, coordinates others)

Organization patterns: Flat (all-to-all) vs. Hierarchical (orchestrator on top) — hierarchical often preferred for UX.

Interaction: Agents communicate via MCP-like protocols (treat other agents as tools).
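The hierarchical pattern with parallel fan-out can be sketched with a thread pool. Each "agent" here is a stub function standing in for a full agentic workflow; the agent names follow the smart-home example above.

```python
# Sketch of a hierarchical multi-agent setup: the user-facing orchestrator
# fans out independent subtasks to specialist agents in parallel and merges
# the results. Stub agents stand in for full workflows.
from concurrent.futures import ThreadPoolExecutor

def climate_agent(query: str) -> str:
    return "Set thermostat to 21C."

def energy_agent(query: str) -> str:
    return "Shift charging to off-peak hours."

def security_agent(query: str) -> str:
    return "All doors locked."

def orchestrator(query: str) -> list[str]:
    agents = [climate_agent, energy_agent, security_agent]
    # Independent subtasks run simultaneously; only the orchestrator
    # talks to the user, matching the hierarchical pattern.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = [pool.submit(agent, query) for agent in agents]
        return [f.result() for f in futures]
```

Because each specialist is isolated behind a function boundary, it can be debugged, evaluated, or reused by another team independently of the rest.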


8. What's Next in AI (Closing Thoughts)

  • Scaling laws & potential plateau: More compute helps, but architecture search (beyond transformers) will be key. Human brain is more efficient (no backprop? forward-only?).
  • Multi-modality: Text → image → audio/video → robotics; cross-modal gains improve overall performance.
  • Harmonizing methods: Combine supervised/unsupervised/self-supervised/RL/prompting/RAG/etc. (like how babies learn).
  • Human-centric vs. non-human-centric research: Learn from brain but optimize beyond biological limits.
  • High velocity of change: Half-life of specific skills is short → focus on breadth + ability to learn fast.

Overall message: Master these engineering techniques (prompting, chaining, RAG, agents, evals) to get the most out of any base LLM. Use fine-tuning sparingly. Build modular, debuggable, evaluable systems. The field moves extremely fast, so breadth plus strong fundamentals will serve you best.


Further Inspiration & Resources

  1. Stanford’s Artificial Intelligence professional and graduate programs
  2. Stanford CS230 | Autumn 2025
  3. Randazzo, S., et al. (2025). Cyborgs, centaurs and self-automators: The three modes of human-GenAI knowledge work and their implications for skilling and the future of expertise (Harvard Business School Working Paper No. 26-036). 
  4. Gao, L., Ma, X., Lin, J., & Callan, J. (2023). Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1762–1777). Association for Computational Linguistics.  
  5. Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG


© Travel for Life Guide. All Rights Reserved.

Analytical Insights on Health, Culture, and Security.