Site reliability

Site Reliability keeps the PollEverywhere.com production environment stable for customers. The team also makes sure Poll Everywhere engineers have the tools they need to provision arbitrary environments for load, staging, and usability testing.

Responsibilities

  • Stability and security for all environments
  • Developer CLI for deploying code & projects to production or staging endpoints
  • Production, staging, load testing, and arbitrary environments
  • Real-time server & service monitoring and instrumentation that the development team can use
  • Incident response plan, on-call rotations, and documentation

Metrics

Metric                                      Requirement
Uptime                                      99.99%
Commit → CI → Staging → Production          < 60 min
Provision new environment                   < 30 min
Disaster recovery from a DB snapshot alone  < 6 hours
Incident response time                      < 15 min
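
To make the 99.99% uptime requirement concrete, it can be converted into a downtime budget. The sketch below is illustrative only: the metrics above do not state the window over which uptime is measured, so the 30-day and 365-day windows are assumptions, and `downtime_budget` is a hypothetical helper, not part of any existing tooling.

```python
from datetime import timedelta


def downtime_budget(uptime_target: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed within `period`.

    `uptime_target` is a fraction between 0 and 1 (0.9999 for 99.99%).
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be between 0 and 1")
    allowed_seconds = period.total_seconds() * (1.0 - uptime_target)
    return timedelta(seconds=allowed_seconds)


# Assumed measurement windows; the source does not specify one.
budget_30d = downtime_budget(0.9999, timedelta(days=30))
budget_365d = downtime_budget(0.9999, timedelta(days=365))

print(f"30-day budget:  {budget_30d.total_seconds() / 60:.2f} min")   # 4.32 min
print(f"365-day budget: {budget_365d.total_seconds() / 60:.2f} min")  # 52.56 min
```

Under these assumed windows, 99.99% uptime allows roughly 4.32 minutes of downtime per 30 days, or about 52.56 minutes per year.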

Conventions