
AI Guide for Senior Software Engineers

Large Language Models (LLMs)

Understanding the engineering and science behind models like GPT, Claude, and Gemini that power modern AI applications.

What Makes an LLM "Large"?

Large Language Models are transformer-based neural networks with billions (or trillions) of parameters, trained on vast amounts of text data. Their size and training scale enable emergent capabilities not seen in smaller models.

Scale Milestones

Notable releases and capabilities (2019–2025). Model details change rapidly — refer to provider docs for exact specs and context-window limits.

  • GPT-2 (2019): 1.5B parameters — an early large-LM milestone
  • GPT-3 (2020): 175B parameters — in-context learning emerged
  • PaLM (2022): 540B parameters — strong reasoning benchmarks
  • GPT-4 (2023): multimodal (text + vision) capabilities deployed in production
  • GPT-4o (2024): omni-modal text/vision/audio with a 128K context window (per OpenAI docs)
  • Claude Sonnet family (2024–2025): extended contexts (100K–200K+ depending on variant) and strong coding/vision performance
  • GPT-5 (OpenAI, 2025): flagship release focused on deeper reasoning, agentic tool use, and workplace integrations (see OpenAI release notes)
  • Sora 2 (OpenAI, 2025): video/audio generation models producing photorealistic, synchronized output
  • Gemini (Google DeepMind): natively multimodal foundation models with steadily growing context windows through 2024–2025
  • Claude 4.x (Anthropic, 2025): Sonnet/Opus variants with larger context options (1M+ tokens announced for some releases)

Training LLMs

Pre-training

Models learn language by predicting the next token on massive text corpora (Common Crawl, books, code, etc.). This requires enormous compute: thousands of GPUs/TPUs running for weeks or months.

  • Data scale: Trillions of tokens (TB to PB of text)
  • Compute: Thousands of A100/H100 GPUs
  • Cost: Millions to tens of millions of dollars
  • Time: Weeks to months of continuous training
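
The pre-training objective behind all of this is next-token prediction: a cross-entropy loss over the vocabulary. A minimal NumPy sketch of that loss (the toy vocabulary and logit values are illustrative, not from any real model):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for predicting one next token.

    logits: unnormalized scores over the vocabulary, shape (vocab_size,)
    target_id: index of the true next token
    """
    # Softmax with max-subtraction for numerical stability
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[target_id])

# Toy example: 5-token vocabulary, model strongly prefers token 2
logits = np.array([0.1, 0.2, 3.0, -1.0, 0.5])
loss = next_token_loss(logits, target_id=2)  # low loss: prediction matches
print(f"{loss:.3f}")
```

At pre-training scale this loss is averaged over trillions of token positions and minimized by gradient descent; everything else in this section is infrastructure for doing exactly that, fast.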

Fine-tuning

After pre-training, models are adapted for specific tasks or behaviors:

  • Instruction tuning: Teach the model to follow natural-language instructions
  • RLHF: Reinforcement Learning from Human Feedback, used to align outputs with human preferences
  • Task-specific fine-tuning: Adapt the model to domain-specific applications (e.g., legal, medical, code)
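
Instruction-tuning data is typically prompt/response pairs in a chat-style format. A sketch of one training example, assuming a common JSONL chat schema (field names follow a widely used convention; exact schemas vary by provider):

```python
import json

# One instruction-tuning example in a chat-style JSONL format.
example = {
    "messages": [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a function that reverses a string."},
        {"role": "assistant", "content": "def reverse(s):\n    return s[::-1]"},
    ]
}

# During instruction tuning, loss is usually computed only on the
# assistant turns, so the model learns to respond, not to imitate prompts.
line = json.dumps(example)
print(line[:60])
```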

Emergent Capabilities

As models scale, they develop abilities not explicitly programmed or trained for. These emerge from the combination of scale, architecture, and training data.

In-Context Learning

Learn new tasks from examples in the prompt, without parameter updates
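
A few-shot prompt makes this concrete: the task (here, sentiment labeling — an illustrative example) is specified entirely by examples placed in the context, with no weight updates:

```python
# Few-shot prompt: the model infers the task purely from the
# examples in its context window -- no parameter updates.
examples = [
    ("The movie was fantastic", "positive"),
    ("I wasted two hours of my life", "negative"),
    ("A masterpiece of modern cinema", "positive"),
]

prompt = "Label the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += "Review: The plot made no sense\nSentiment:"

print(prompt)
```

The model completes the final "Sentiment:" line by pattern-matching against the in-context examples.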

Chain-of-Thought Reasoning

Break down complex problems into steps (enhanced in o1/o3 models)
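
Chain-of-thought behavior can be elicited with a simple reasoning cue appended to the question — the classic zero-shot trigger phrase (the question below is a made-up example):

```python
# Zero-shot chain-of-thought prompting: append a reasoning cue so the
# model emits intermediate steps before its final answer.
question = (
    "A train leaves at 2pm traveling 60 mph. Another leaves the same "
    "station at 3pm traveling 90 mph in the same direction. "
    "When does the second train catch up?"
)
cot_prompt = f"{question}\n\nLet's think step by step."
print(cot_prompt)
```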

Multimodal Understanding

Process and generate text, images, audio, and video within a single model

Extended Context Windows

Handle up to 2M tokens (Gemini 1.5 Pro, per Google's docs) — enough for entire codebases or books
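
Long contexts are expensive largely because the KV cache grows linearly with sequence length. A back-of-envelope calculator — the model dimensions below are illustrative (roughly 7B-class, FP16), not any specific model's published spec:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size: two tensors (K and V) per layer,
    each of shape (seq_len, n_kv_heads, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dimensions: 32 layers, 32 KV heads, head_dim 128, FP16
size = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"{size / 1e9:.1f} GB")  # tens of GB for a 128K context
```

This is why techniques such as grouped-query attention (fewer KV heads) and cache quantization matter for serving long-context models.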

Engineering LLM Systems

Inference Optimization

  • Quantization: Reduce precision (FP16, INT8) to save memory and speed up inference
  • KV caching: Cache key-value pairs to avoid recomputation
  • Flash Attention: Optimized attention implementation
  • Model sharding: Split model across multiple GPUs (tensor/pipeline parallelism)
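
The quantization idea in the list above can be sketched in a few lines: symmetric per-tensor INT8 quantization stores each weight matrix as int8 values plus one FP32 scale. A minimal NumPy sketch (real systems use per-channel or per-group scales and calibration):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: int8 weights + one scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# 4x memory saving vs FP32, at the cost of small reconstruction error
err = np.abs(dequantize(q, scale) - w).max()
print(q.nbytes, w.nbytes, f"{err:.4f}")
```

The worst-case rounding error is half a quantization step (scale / 2), which is why quantized models usually lose little accuracy despite the 4x compression.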

Prompt Engineering

The art and science of crafting prompts to elicit desired behaviors:

  • Zero-shot, few-shot, and chain-of-thought prompting
  • System messages and role-playing
  • Temperature and sampling strategies
  • Context window management
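
The temperature knob from the list above scales logits before sampling. A NumPy sketch with made-up logits showing how low temperature sharpens the output distribution and high temperature flattens it:

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, softmax, then sample one token id.

    temperature < 1 sharpens the distribution (more deterministic);
    temperature > 1 flattens it (more diverse).
    """
    scaled = logits / temperature
    scaled -= scaled.max()  # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(42)
logits = np.array([2.0, 1.0, 0.1])  # toy 3-token vocabulary

cold = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
hot = [sample_with_temperature(logits, 5.0, rng) for _ in range(100)]
# Low temperature almost always picks the top token;
# high temperature spreads samples across the vocabulary.
print(cold.count(0), hot.count(0))
```

Production samplers layer top-k or top-p (nucleus) filtering on top of the same temperature-scaled distribution.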

LLM Architectures

GPT-4o / GPT-4 Turbo

Native multimodal (text, vision, audio). 128K context window. Roughly 2x faster than GPT-4, per OpenAI.

Claude 3.5 Sonnet

200K context window. Strong coding performance; Anthropic reports 64% on its internal agentic coding evaluation.

Gemini 2.0

Context windows up to 2M tokens (per Google docs). Natively multimodal from the ground up, with strong multilingual support.

OpenAI o1/o3 Series

Reasoning-focused models with extended thinking time for complex problems.

Challenges & Limitations

  • Hallucinations: Models can confidently generate false information
  • Context limits (improving): Now 128K-2M tokens, but still finite
  • Computational cost: Expensive to train and run, especially long-context inference
  • Biases: Reflect and amplify biases in training data
  • Lack of grounding: No access to real-time information or external sources without retrieval (RAG)
  • Reasoning consistency: Can still struggle with logical consistency across long chains

Key Takeaways

  • LLMs are transformer models scaled to billions of parameters
  • Pre-training on massive data enables emergent capabilities
  • Fine-tuning and RLHF align models with human preferences
  • Engineering systems around LLMs requires optimization and careful prompting
  • LLMs have significant limitations and biases to be aware of