Cross Column

Sunday, April 12, 2026

Beyond the model: Enhancing LLM applications (Stanford CS230)

TL;DR
A practical breakdown of how CS230 approaches modern LLM engineering—focusing on prompting, chaining, RAG, agents, and evals—while emphasizing modular design, debuggability, and fundamentals over hype. Fine‑tuning is used sparingly; strong engineering habits matter most as the field evolves rapidly.
The Three Stages Shaping Modern RAG: Pre‑Train, Fine‑Tune, Infer (YouTube link)


Lecture Goal & Agenda

The Stanford CS230 lecture moves beyond basic neural networks and shifts the focus to the engineering practices that make modern AI systems actually work in production. It opens with the core pillars of contemporary LLM development—strong prompting, multi‑step chains, Retrieval‑Augmented Generation (RAG), agent workflows, and rigorous evaluation—then walks through each theme in a structured progression:

  1. Augmenting LLMs: Challenges and Opportunities
  2. Prompt Engineering: The First Line of Optimization
  3. Fine-Tuning: Proceed with Caution
  4. Retrieval-Augmented Generation (RAG): Enhancing Model Utility
  5. Agentic AI Workflows: Toward Autonomous and Specialized Systems
  6. Case Study: Evals
  7. Multi-Agent Workflows: Parallelism
  8. What’s Next in AI? Personal Thoughts

As the lecture moves through these topics, a consistent message emerges: fine‑tuning should be used sparingly, not reflexively. The emphasis is on building modular, debuggable systems and grounding decisions in measurable performance rather than hype. In a field evolving at breakneck speed, broad fundamentals and adaptable engineering habits remain the most durable advantage.


Challenges & Opportunities of Augmenting Base LLMs


1. Limitations of Vanilla Pre-trained LLMs (e.g., GPT-3.5 Turbo, GPT-4)

Students and the lecturer discussed the key issues:

  • Lack of domain-specific knowledge (e.g., specialized crop disease detection)
  • Distribution shift (real-world data differs from training data, e.g., low-quality/dark images)
  • Outdated knowledge (cutoff dates; struggles with new trends, slang like "rizz", or events like "Covfefe")
  • Breadth vs. depth: Good at general knowledge but poor on narrow, high-precision enterprise tasks
  • Inefficiency: Uses a massive model when only ~2% of capabilities are needed (pruning/quantization possible)
  • Hard to control: Can produce racist/offensive outputs (e.g., Microsoft's Tay bot, political bias debates between Grok & OpenAI)
  • Underperformance on specialized tasks: Medical diagnosis, legal contracts (style/precision matters), task-specific classification (e.g., NPS thresholds vary by industry)
  • Limited context handling: Context windows top out around 200k tokens (roughly two books' worth); attention struggles with "needle in a haystack" retrieval in large corpora
  • No reliable sourcing: Hallucinates references; critical for legal/medical/education use cases

Two dimensions for improvement:

  • Horizontal: Better foundation models (GPT-3.5 → GPT-4 → GPT-4o → GPT-5)
  • Vertical (focus of lecture): Engineering techniques around a fixed model (prompting, RAG, agents, etc.)
    • In theory, with infinite compute/context, RAG might become unnecessary (just feed everything). In practice, latency, sourcing, and efficiency make RAG valuable long-term (analogous to search engines narrowing the web).

2. Prompt Engineering (First Line of Optimization)


Prompting significantly boosts performance without changing model weights.

Key study (HBS/UPenn/Wharton on BCG consultants):

  • AI helped on some tasks ("within the jagged frontier") but hurt others ("falling asleep at the wheel").
  • Training on prompting made the biggest difference.
  • Two interaction styles: Centaurs (delegate big tasks to AI) vs. Cyborgs (rapid back-and-forth collaboration).[3] Students tend toward cyborgs; enterprises toward centaurs.

Basic principles & techniques:

  • Be specific (length, focus, audience)
  • Role prompting: "Act as a renewable energy expert presenting at Davos"
  • Few-shot prompting: Provide examples to align the model to subjective tasks (e.g., tone classification of reviews)
  • Chain-of-Thought (CoT): "Think step by step" + explicit steps (improves reasoning)
  • Reflection: Generate → critique → improve
  • Prompt templates: Reusable, scalable (insert user metadata); many open-source on GitHub ("awesome prompt templates")
  • Chaining: Break complex tasks into sequential prompts (easier debugging, better control, modular optimization) vs. one monolithic prompt
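Chaining is easiest to see in code. The sketch below is illustrative only, not a production pattern; `call_llm` is a hypothetical stand-in for any chat-completion API call:

```python
# Minimal sketch of prompt chaining: each step is a small, testable unit.
# `call_llm` is a stub standing in for a real chat-completion API call.

def call_llm(prompt: str) -> str:
    # Stub: in practice this would hit an LLM API endpoint.
    return f"[LLM output for: {prompt[:40]}...]"

def summarize(text: str) -> str:
    return call_llm(f"Summarize in 3 bullet points:\n{text}")

def critique(summary: str) -> str:
    return call_llm(f"List factual gaps or vague claims in this summary:\n{summary}")

def revise(summary: str, feedback: str) -> str:
    return call_llm(f"Revise the summary.\nSummary:\n{summary}\nFeedback:\n{feedback}")

def chained_pipeline(text: str) -> str:
    # Each intermediate result can be logged and evaluated independently,
    # which is the debugging advantage over one monolithic prompt.
    summary = summarize(text)
    feedback = critique(summary)
    return revise(summary, feedback)
```

Note that this also folds in the reflection pattern (generate → critique → improve) as one of the chain's steps.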

Testing & Evals for prompts:

  • Manual human rating
  • Automated: Platforms like PromptFoo
  • LLM-as-judge: Pairwise comparison, single-answer grading (1-5), or rubric-based scoring (can combine with few-shot)

Zero-shot vs. Few-shot: Few-shot aligns model to your specific criteria quickly without fine-tuning.
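As a rough sketch of that alignment step, the following builds a few-shot message list for the tone-classification example; the role/content dict format follows the common chat-API convention, and the example reviews and labels are purely illustrative:

```python
# Sketch of a few-shot prompt for subjective tone classification.
# The in-context examples align the model to *our* labeling criteria
# without any fine-tuning.

FEW_SHOT_EXAMPLES = [
    ("The delivery was late again. Unacceptable.", "negative"),
    ("Honestly exceeded my expectations, great value.", "positive"),
    ("It works. Nothing special.", "neutral"),
]

def build_few_shot_messages(review: str) -> list[dict]:
    messages = [{
        "role": "system",
        "content": ("Classify the tone of a product review as positive, "
                    "neutral, or negative. Answer with one word."),
    }]
    # Each example becomes a user/assistant pair the model can imitate.
    for text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": review})
    return messages
```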


3. Fine-Tuning (Why the Lecturer Avoids It)


Disadvantages:

  • Requires substantial labeled data
  • Risk of overfitting → loses general-purpose utility
  • Time- and cost-intensive
  • By the time you're done, newer base models often outperform your fine-tuned version

When it might still make sense: High-precision, repeated domain-specific tasks (legal, scientific) with specialized language.

Funny cautionary example: Fine-tuning on internal Slack messages made the model respond like lazy colleagues ("I shall work on that in the morning...") instead of following instructions.

Trend: Boundaries between few-shot prompting and lightweight fine-tuning are blurring.


4. Retrieval-Augmented Generation (RAG)


Why RAG? Addresses knowledge gaps, cutoff dates, hallucinations, sourcing, and large-context issues without retraining the model.

How vanilla RAG works:

  1. Embed documents → store in a vector database
  2. Embed the user query
  3. Retrieve the most similar documents (via distance metrics)
  4. Add the retrieved docs to the prompt with instructions ("Answer based only on these documents; say 'I don't know' otherwise; cite sources")
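The four steps above can be sketched end-to-end. A real system would use a learned embedding model and a vector database; here a toy bag-of-words embedding with cosine similarity stands in so the retrieval step is actually runnable:

```python
import math
from collections import Counter

# Toy sketch of vanilla RAG retrieval. The bag-of-words "embedding" is a
# stand-in for a real embedding model; the flow is the same.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Step 2-3: embed the query, rank stored documents by similarity.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Step 4: splice retrieved context into the prompt with grounding rules.
    context = "\n".join(f"[{i+1}] {d}" for i, d in enumerate(retrieve(query, docs)))
    return ("Answer based only on these documents; say 'I don't know' "
            f"otherwise; cite sources.\n{context}\nQuestion: {query}")
```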

Advanced RAG techniques:

  • Chunking: Store embeddings at document, chapter, or passage level for better sourcing/precision
  • HyDE (Hypothetical Document Embeddings): Generate a fake document from the query, then embed it (better matches real documents)
  • Many other research branches (survey papers available)
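HyDE in particular is simple to sketch. Both `generate` and `embed` below are hypothetical stubs for an LLM call and an embedding model; the point is only that retrieval uses the embedding of a generated document rather than of the raw query:

```python
# Sketch of HyDE: a short query often embeds far from the long documents
# that answer it, so we first generate a hypothetical answer document and
# embed that instead.

def generate(prompt: str) -> str:
    # Stub for an LLM call; a real implementation would ask the model to
    # write a passage that answers the question.
    return f"A passage answering: {prompt}"

def embed(text: str) -> list[float]:
    # Stub embedding: a character-frequency vector (illustrative only).
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def hyde_query_vector(query: str) -> list[float]:
    hypothetical_doc = generate(f"Write a short passage that answers: {query}")
    # Retrieve with this vector instead of embed(query).
    return embed(hypothetical_doc)
```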

Limitations & debates: Vanilla RAG struggles with very long documents; attention issues persist.


5. Agentic AI Workflows


Coined/popularized by Andrew Ng. Refers to multi-step, autonomous workflows using prompts + tools + memory + resources, rather than single prompts.

Paradigm shift (especially for software engineers):

  • From structured/deterministic data & code → fuzzy/free-form text, images, dynamic interpretation
  • Think like a manager: Decompose tasks into roles (e.g., researcher → drafter → editor → analyst)
  • Experimentation is cheap → more comfortable discarding code
  • Need human-in-the-loop for fuzzy parts + guardrails

Core components of an agent:

  • Prompts (optimized as above)
  • Memory: Working (fast) vs. archival/long-term (slower)
  • Tools: APIs, code execution, web search, etc.
  • Resources: Databases, CRMs, documents
  • MCP (Model Context Protocol) by Anthropic: More scalable agent-to-system communication than raw APIs (agent discovers requirements via conversation)

Degrees of autonomy:

  • Hard-coded steps (least autonomous)
  • Hard-coded tools only
  • Fully autonomous (decides steps, creates tools, writes code)

Example: Simple refund policy response (RAG) vs. full agentic workflow (retrieve policy → ask for order # → check API → confirm & process).
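That refund workflow can be sketched as a small pipeline. Every component below is a stub (the policy text, `order_api`, and the order IDs are all invented for illustration), but the shape of it, retrieve the policy, look up the order, apply the rule, matches the agentic decomposition described above:

```python
# Minimal sketch of the refund example as an agentic workflow:
# RAG step -> tool call -> decision. All components are stubs.

POLICY = "Refunds are allowed within 30 days of purchase."

def retrieve_policy(_query: str) -> str:
    return POLICY  # stand-in for a RAG retrieval step

def order_api(order_id: str) -> dict:
    # Stub order database; a real agent would call an orders API.
    orders = {"A123": {"days_since_purchase": 10},
              "B456": {"days_since_purchase": 45}}
    return orders.get(order_id, {})

def handle_refund(order_id: str) -> str:
    policy = retrieve_policy("refund policy")
    order = order_api(order_id)
    if not order:
        return "Order not found; please check the order number."
    if order["days_since_purchase"] <= 30:
        return f"Refund approved per policy: {policy}"
    return "Refund denied: outside the 30-day window."
```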


6. Case Study: Customer Support Agent + Evals


Task decomposition (key starting point):

  • Extract key info from user message (LLM)
  • Lookup/update customer record (tool)
  • Check policy (RAG/tool)
  • Draft & send response (LLM + tool)

How to evaluate & improve:

  • LLM traces (critical for debugging)
  • End-to-end metrics: User satisfaction ratings
  • Component-based: Debug individual prompts/tools
  • Objective (e.g., correct order ID extracted) vs. Subjective (politeness, helpfulness)
  • Quantitative (success rate, latency) vs. Qualitative (error analysis, hallucinations)
  • Use LLM judges with rubrics for scalable subjective evals
  • Mix of human review + automated proxies
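As a minimal sketch of mixing these eval types, the following pairs an objective check (did we extract the right order ID?) with a stubbed LLM judge for a subjective quality; the extraction component and rubric score are hypothetical:

```python
import re

# Component-level eval sketch: objective success rate on order-ID extraction,
# plus a stub LLM-as-judge for a subjective quality like politeness.

def extract_order_id(message: str):
    # Component under test: pull an order id like "A123" from a user message.
    m = re.search(r"\b([A-Z]\d{3})\b", message)
    return m.group(1) if m else None

def judge_politeness(response: str) -> int:
    # Stub judge; a real one would send the response plus a rubric to an LLM
    # and parse back a 1-5 score.
    return 5 if "please" in response.lower() or "thank" in response.lower() else 3

def run_evals(cases: list[tuple]) -> dict:
    # Objective metric: extraction accuracy over labeled (message, expected) cases.
    hits = sum(extract_order_id(msg) == expected for msg, expected in cases)
    return {"extraction_accuracy": hits / len(cases)}
```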


7. Multi-Agent Workflows


Why multi-agent? 

  • Parallelization (run independent subtasks simultaneously)
  • Reusability (one specialized agent shared across teams)
  • Better debugging (specialized agents easier to isolate)

Example: Smart home automation

  • Biometric/location tracking
  • Climate control
  • Energy management
  • Security & permissions
  • Fridge/grocery agent
  • Weather integration
  • Entertainment
  • Orchestrator (user-facing, coordinates others)
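The smart-home setup above illustrates the parallelization benefit directly: the specialist agents are independent, so an orchestrator can fan them out concurrently. A minimal sketch with stub agents (all names hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# Independent specialist agents (stubs) run concurrently; the orchestrator
# fans the shared state out and gathers results in order.

def climate_agent(state: dict) -> str:
    return f"set thermostat for {state['room']}"

def energy_agent(state: dict) -> str:
    return f"shift load off-peak in {state['room']}"

def security_agent(state: dict) -> str:
    return f"arm sensors in {state['room']}"

def orchestrate(state: dict) -> list[str]:
    agents = [climate_agent, energy_agent, security_agent]
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # map preserves agent order while running the calls in parallel.
        return list(pool.map(lambda agent: agent(state), agents))
```

In a real deployment each stub would be an agent invoked over an MCP-like protocol, but the fan-out/gather shape is the same.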

Organization patterns: Flat (all-to-all) vs. Hierarchical (orchestrator on top) — hierarchical often preferred for UX.

Interaction: Agents communicate via MCP-like protocols (treat other agents as tools).


8. What's Next in AI (Closing Thoughts)

  • Scaling laws & potential plateau: More compute helps, but architecture search (beyond transformers) will be key. The human brain is far more efficient (no backprop? forward-only learning?).
  • Multi-modality: Text → image → audio/video → robotics; cross-modal gains improve overall performance.
  • Harmonizing methods: Combine supervised/unsupervised/self-supervised/RL/prompting/RAG/etc. (like how babies learn).
  • Human-centric vs. non-human-centric research: Learn from brain but optimize beyond biological limits.
  • High velocity of change: Half-life of specific skills is short → focus on breadth + ability to learn fast.

Overall message: Master these engineering techniques (prompting, chaining, RAG, agents, evals) to get the most out of any base LLM. Fine-tune sparingly. Build modular, debuggable, and easy-to-evaluate systems. The field moves extremely fast, so breadth and strong fundamentals will serve you best.


Further Inspiration & Resources

  1. Stanford’s Artificial Intelligence professional and graduate programs
  2. Stanford CS230 | Autumn 2025
  3. Randazzo, S., et al. (2025). Cyborgs, centaurs and self-automators: The three modes of human-GenAI knowledge work and their implications for skilling and the future of expertise (Harvard Business School Working Paper No. 26-036). 
  4. Gao, L., Ma, X., Lin, J., & Callan, J. (2023). Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1762–1777). Association for Computational Linguistics.  
  5. Stanford CS230 | Autumn 2025 | Lecture 8: Agents, Prompts, and RAG

Tuesday, April 7, 2026

The Company That Makes Modern Computing Possible

 📦 TL;DR — At a Glance

Shin‑Etsu began as a 1920s fertilizer maker but evolved—slowly and deliberately—into the world’s leading supplier of semiconductor‑grade silicon wafers.  

Its early expertise in purification and high‑temperature chemistry paved the way for mastering 11‑nines purity silicon, now essential for chips made by TSMC, Intel, and Samsung.

Today, through its subsidiary Shin‑Etsu Handotai (SEH), the company controls about one‑third of the global wafer market, making it an “invisible emperor” quietly powering the modern semiconductor industry.



🏭 Origins in Fertilizer and Hydropower

Shin‑Etsu’s story begins far from the cleanrooms of modern chipmaking. Founded in 1926 as Shin‑Etsu Nitrogen Fertilizer Co., the firm drew on the Shin’etsu region’s limestone deposits and hydroelectric power to produce chemical fertilizers. By 1927, operations centered on its Naoetsu plant; in 1940, the company rebranded as Shin-Etsu Chemical Co., Ltd., signaling broader industrial ambitions.

Those early decades in carbides and hydroelectric fertilizer production demanded tight impurity control and high‑temperature electrochemistry—skills that would later become essential in the world of ultrapure materials.


🔧 A Slow, Strategic Shift Into Advanced Materials

Shin‑Etsu’s transformation into a semiconductor powerhouse was gradual and deliberate. As fertilizers declined in strategic importance after World War II, the company diversified into silicones (1953), PVC, and a growing portfolio of electronics materials. By the 1960s, it began investing in silicon wafer research—long before the global chip boom made such materials indispensable.

This steady, long‑horizon approach reflects the company’s hallmark: quiet, methodical mastery rather than dramatic pivots.


🔬 Mastering Eleven‑Nines Purity

Producing semiconductor‑grade silicon requires extraordinary precision. Device‑class wafers demand 11‑nines purity—99.999999999%. Shin‑Etsu refines silicon metal into polycrystalline silicon at this level before growing single‑crystal ingots, the standard pathway for wafers used in advanced processors.

Here, the company’s chemical‑engineering heritage becomes a competitive advantage. Decades of expertise in purification, temperature control, and materials processing—rooted in its “fertilizer company” origins—now underpin some of the world’s most advanced computing hardware.


🌐 The World’s Leading Silicon Wafer Supplier

Through its wafer subsidiary SEH, Shin‑Etsu has become the largest producer of semiconductor silicon wafers globally, with an estimated 30–33% market share. It leads in 300mm wafers and other high‑spec substrates essential for cutting‑edge logic and memory chips, outpacing rivals such as SUMCO and GlobalWafers.

Every advanced chip from TSMC, Intel, Samsung, and others begins on a wafer that companies like Shin‑Etsu quietly perfect.


👑 An “Invisible Emperor” of the Semiconductor Age

Shin‑Etsu’s rise illustrates a broader truth: modern technology rests on deep, often overlooked chemical‑engineering expertise. What began as a fertilizer maker in rural Japan has become a foundational pillar of the global semiconductor supply chain—an “invisible emperor” whose materials quietly enable the world’s computing power.

© Travel for Life Guide. All Rights Reserved.

Analytical Insights on Health, Culture, and Security.