SA
Skip to main content

How to Build Software like an SRE

Coding

  • no in-code fallbacks for configs.
    • just crash it.
  • stringent Remote Procedure Call settings.
    • no retry or long timeouts
  • local testing
  • avoid state managing

Merging

  • use Git
  • don't die for code coverage
  • real-world validation matters. unit test < integration test < prod test
  • make infra changes obvious
  • keep error logs, CPU usage, and request error rates

Deploying

  • use Docker
  • deploy everything all-the-time
    • prevents silent breakage
  • validate at the same time
  • instant deploy
    • it may sound counterintuitive, but it buys us early time to disable a problematic Feature Flag

Operating

  • use Kubernetes
  • use helm
    • never touch k8s directly.
  • avoid operators
    • keep it simple
  • run 3 of everything
    • Like with backups, two is one, and one is none. Additionally, ensure (like, actually verify, in production) that 2 of the 3 "things" can handle the full load by themselves – otherwise, you don't really have the failure tolerance you think you do.
    • do you see Kakao?
  • structured logs