
phase-08-operations-maintenance
by Harery
Enterprise 8-Phase Software Development Lifecycle Framework with Quality Gates, Multi-Agent Orchestration, and AI-Assisted Development using Claude Code. Perfect for regulated industries (HIPAA, SOC 2, PCI DSS, SOX, GDPR).
SKILL.md
name: "phase_08_operations_maintenance" description: "Operational support and maintenance for released software. Monitoring and observability, incident management, maintenance and patch management, continuous improvement, security operations, compliance and audit, support and user satisfaction, backup and disaster recovery." phase: 8 phase_name: "Operations & Maintenance" owner: "SRE Lead" secondary_owner: "DevOps Lead" participants: ["SRE", "DevOps", "Tech Lead", "Security Lead", "Compliance Officer", "Support Team", "Product Owner"] entry_criteria: [ "Production deployment successful", "Post-deployment validation passing", "Monitoring and alerting active", "Release documentation published" ] exit_criteria: [ "Ongoing operations established", "Incident management process active", "Maintenance schedule defined", "Continuous improvement process active", "Security operations running", "Compliance monitoring active", "Support process established", "Backup and DR validated" ] estimated_duration: "Ongoing (product lifetime)" dependencies: ["phase_07_deployment_release"] outputs: [ "Monitoring Dashboards", "Alerting Rules", "Incident Response Procedures", "Maintenance Schedule", "Security Operations Plan", "Compliance Reports", "Support Metrics", "Backup and DR Procedures" ] next_phase: "None (ongoing operations)"
Phase 8: Operations & Maintenance
The Long-Term Care for Your Software
Welcome to Phase 8 - the marathon, not the sprint. Unlike the previous phases which have clear beginnings and endings, Phase 8 continues for the lifetime of your software.
This phase is about stability, reliability, and continuous improvement. You're not building new features (usually); you're keeping what you've built running smoothly and making it better over time.
The mindset here: Sleep soundly knowing your systems are monitored, your team is prepared, and your users are supported.
What You'll Achieve
In this ongoing phase, you'll maintain:
- High availability (users can access your software when they need it)
- Fast response to incidents (when things break, you fix them quickly)
- Regular maintenance (patches, updates, upgrades)
- Continuous improvement (always getting better)
- Strong security (vigilance against threats)
- Compliance readiness (always audit-ready)
- Happy users (support that actually helps)
What makes this different: This phase never really ends. As long as your software is in production, Phase 8 continues.
Who's Driving This Phase?
| Role | What They're Responsible For |
|---|---|
| SRE Lead | Your reliability champion - keeping systems running and users happy |
| DevOps Lead | Infrastructure and automation - making operations efficient |
| Tech Lead | Technical support - fixing bugs and implementing improvements |
| Security Lead | Security operations - protecting against threats |
| Compliance Officer | Ongoing compliance - staying audit-ready |
| Support Team | User support - helping users when they need help |
The Operations Lifecycle
Operations isn't static - it's a continuous cycle of monitoring, responding, improving, and repeating.
Monitor → Detect → Respond → Learn → Improve → Monitor
Every incident teaches you something. Every improvement makes you more resilient.
Step 1: Monitoring and Observability (See Everything)
Time needed: Ongoing Led by: SRE Lead
Why Monitoring Matters
You can't fix what you can't see. Monitoring is your eyes and ears in production, telling you what's happening, what's broken, and what needs attention.
What to Monitor
1. Latency (How fast is it?)
- Response time (p50, p95, p99 percentiles)
- Database query time
- API call duration
- Page load time
2. Traffic (How much is it?)
- Requests per second
- Concurrent users
- Data transfer volume
- API call rate
3. Errors (What's broken?)
- HTTP error codes (4xx, 5xx)
- Application error rate
- Failed database queries
- Timeout rate
4. Saturation (How full is it?)
- CPU utilization
- Memory usage
- Disk I/O
- Network bandwidth
Pro tip: Monitor these four signals, and you'll catch 90% of issues before they become incidents.
Setting Up Monitoring
- Application Performance Monitoring (APM) - Track application-level metrics
- Infrastructure Monitoring - Track server, database, network metrics
- Log Aggregation - Collect and search logs from all services
- Distributed Tracing - Follow requests across microservices
- Synthetic Monitoring - Simulate user interactions proactively
- Real User Monitoring (RUM) - Measure actual user experience
- Business Metrics - Track business-critical metrics (signups, purchases, etc.)
Tools for 2026
| Need | Recommended Tools | Why? |
|---|---|---|
| Metrics + Dashboards | Prometheus + Grafana | Open source, industry standard, highly flexible |
| APM | Datadog, New Relic, Dynatrace | Comprehensive application insights |
| Logging | ELK Stack, Loki | Centralized, searchable logs |
| Distributed Tracing | Jaeger, Tempo | Follow requests across services |
| Uptime Monitoring | Pingdom, UptimeRobot | Simple, reliable uptime checks |
What You'll Have
Expected Output:
- Monitoring dashboards (visual representation of metrics)
- Alerting rules (when to notify someone)
- SLO/SLI definitions (what "good" looks like)
- Runbooks (what to do when alerts fire)
- On-call procedures (who responds when)
Who Approves:
- SRE owns monitoring setup
- Tech Lead validates technical monitoring
- Product Owner validates business metrics
Step 2: Incident Management (When Things Break)
Time needed: Ongoing Led by: SRE Lead
Incidents Will Happen
No matter how well you build, test, and deploy, things will break. It's not a question of if, but when. Incident management is how you respond without panicking.
Incident Severity Levels
| Severity | Description | Examples | Response Target |
|---|---|---|---|
| SEV-1 | Critical - System down, data loss, security breach | Website unavailable, database corruption, data breach | 15 minutes |
| SEV-2 | High - Major feature broken, significant impact | Checkout not working, users can't login | 1 hour |
| SEV-3 | Medium - Partial degradation, workaround available | Slow performance, some features affected | 4 hours |
| SEV-4 | Low - Minor issues, cosmetic | Typos, non-critical bugs, documentation errors | 1 business day |
1. Detect (Alert Fires)
- Automated monitoring detects an issue
- Alert is sent to on-call engineer
- Severity is assessed
2. Acknowledge (Someone Takes Ownership)
- On-call engineer acknowledges the alert
- Incident is declared (if severity warrants)
- Incident commander is identified
3. Respond (Fix the Issue)
- Gather the incident response team
- Diagnose the root cause
- Implement a fix or workaround
- Validate the fix
4. Resolve (Back to Normal)
- System is restored to normal operation
- Monitoring confirms no ongoing issues
- Incident is marked as resolved
5. Learn (Post-Incident Review)
- Conduct a blameless post-mortem
- Document what happened and why
- Identify action items to prevent recurrence **Share learnings with the team
Pro tip: The goal isn't to never have incidents (that's impossible). The goal is to learn from each incident so you don't have the same one twice.
On-Call Best Practices
Do This:
- Rotate on-call responsibilities (avoid burnout)
- Provide on-call compensation (it's hard work)
- Document runbooks (don't rely on heroics)
- Use escalation policies (if on-call doesn't respond, escalate)
- Track on-call metrics (are we getting better?)
Avoid This:
- Single person on-call (what if they're sick?)
- No backup (what if primary is unreachable?)
- Hero culture (relying on individuals rather than processes)
- Ignoring on-call feedback (they know what's broken)
What You'll Have
Expected Output:
- Incident management procedures
- On-call schedule (who's covering when)
- Incident response runbooks
- Incident metrics dashboard (MTTA, MTTR, MTBF)
- Post-incident review reports
Who Approves:
- SRE owns incident management
- Tech Lead validates technical procedures
- Security Lead validates security incident procedures
Step 3: Maintenance and Patch Management (Stay Current)
Time needed: Ongoing (scheduled) Led by: DevOps Lead
Why Maintenance Matters
Software doesn't stay perfect. Dependencies get vulnerabilities, operating systems need updates, databases need optimization. Maintenance is the ongoing care that keeps your software healthy.
Maintenance Schedule
| Activity | Frequency | Duration | What Gets Done |
|---|---|---|---|
| Security Patches | As needed (within 48 hours of critical CVE) | Varies | Apply critical security fixes immediately |
| Dependency Updates | Monthly | 2-4 hours | Update libraries, frameworks, packages |
| OS Patches | Monthly | 2-4 hours | Apply operating system security patches |
| Database Maintenance | Weekly | 1-2 hours | Index rebuilds, statistics updates, vacuuming |
| Infrastructure Updates | Quarterly | 4-8 hours | Upgrade infrastructure components |
| Major Upgrades | Annually | Varies | Upgrade to new major versions (e.g., Python 3.12 → 3.13) |
Do This:
- Schedule maintenance during low-traffic windows
- Communicate maintenance windows to users in advance
- Test maintenance procedures in staging first
- Have rollback plans for each maintenance activity
- Document what was changed and why
- Monitor after maintenance (catch issues early)
Avoid This:
- Skipping maintenance (technical debt accumulates)
- Maintenance without testing (surprises in production are bad)
- Maintenance without communication (users hate unexpected downtime)
- "While we're in there" changes (scope creep causes issues)
Pro tip: Automate what you can. Automated maintenance is more reliable and less error-prone than manual maintenance.
What You'll Have
Expected Output:
- Maintenance schedule (what happens when)
- Patch management procedures (how to apply updates)
- Maintenance logs (what was done)
- Upgrade plans (for major version changes)
- Change records (audit trail of changes)
Who Approves:
- DevOps executes maintenance
- Tech Lead validates technical changes
- Security Lead validates security patches
- Product Owner approves maintenance windows
Step 4: Continuous Improvement (Always Getting Better)
Time needed: Ongoing Led by: SRE Lead + Tech Lead
The Improvement Engine
Operations isn't just about keeping the lights on - it's about making things better. Every incident, every user complaint, every metric is an opportunity to improve.
Sources of Improvement
- Incidents - What broke? Why? How do we prevent it?
- User feedback - What are users complaining about?
- Metrics - What's slow? What's error-prone?
- Team feedback - What's painful for the team? What's repetitive?
- Industry trends - What are others doing better?
The Improvement Process
Identify → Prioritize → Implement → Validate → Standardize → Repeat
Improvements compound. Small gains add up to big results over time.
What is a blameless post-mortem? An analysis of an incident that focuses on what happened and why, not who to blame.
Why blameless?
- Blame makes people hide mistakes
- Mistakes are opportunities to learn
- Systems are usually at fault, not people
- Psychological safety encourages honesty
Post-mortem template:
## Incident Summary
- What happened?
- When did it happen?
- How long did it last?
- Who was affected?
## Root Cause Analysis
- What was the triggering event?
- What factors contributed?
- What was the underlying cause? (5 whys technique)
## Timeline
- [Time] - Event occurred
- [Time] - Alert fired
- [Time] - Response initiated
- [Time] - Issue resolved
## Action Items
- [ ] Prevent recurrence: (what changes to prevent this?)
- [ ] Improve detection: (how to catch this faster?)
- [ ] Improve documentation: (what docs need updating?)
- [ ] Improve training: (what training would help?)
## Lessons Learned
- What did we learn?
- What should we share with other teams?
Pro tip: The best post-mortems are the ones you learn from. If you're not learning, you're not improving.
What You'll Have
Expected Output:
- Improvement backlog (ideas for improvements)
- Performance optimization results
- Automation enhancements
- Post-incident review reports
- Improvement metrics (are we getting better?)
Who Approves:
- SRE drives continuous improvement
- Tech Lead validates technical improvements
- Product Owner prioritizes improvements
Step 5: Security Operations (Stay Vigilant)
Time needed: Ongoing Led by: Security Lead
Security Is a Process, Not a Project
Security in Phase 8 is about vigilance. Threats are constantly evolving, and your security operations need to evolve with them.
Ongoing Security Activities
- Monitor security alerts - Watch for suspicious activity
- Scan for vulnerabilities - Regular SAST/DAST/SCA scans
- Manage access - Review and revoke access as needed
- Update security tools - Keep security software current
- Conduct security assessments - Regular penetration tests
- Security awareness training - Keep the team security-conscious
- Incident response - Respond to security incidents
- Threat intelligence - Stay informed about new threats
Automated Security:
- Continuous vulnerability scanning (not just at deployment time)
- Automated security policy enforcement (e.g., prevent secrets in code)
- Automated incident response (contain threats automatically)
Cloud Security:
- Cloud Security Posture Management (CSPM) - Monitor cloud configuration
- Cloud Workload Protection (CWP) - Protect cloud workloads
- Zero Trust Architecture - Verify every request, never trust implicitly
DevSecOps:
- Security as code (define security policies in code)
- Shift left (find vulnerabilities early)
- Security testing in CI/CD (automated security gates)
Compliance:
- Continuous compliance monitoring (always audit-ready)
- Automated evidence collection (for audits)
- Policy as code (enforce compliance automatically)
Pro tip: Security operations is a team sport. Everyone has a role to play, not just the security team.
What You'll Have
Expected Output:
- Security monitoring dashboard
- Vulnerability assessments (regular scans)
- Security patch reports
- Access review reports
- Security metrics (are we secure?)
- Incident response reports (when security incidents happen)
Who Approves:
- Security Lead owns security operations
- CISO approves overall security posture
- Compliance Officer validates compliance
Step 6: Compliance and Audit (Always Ready)
Time needed: Ongoing, with periodic audits Led by: Compliance Officer
Compliance Is Ongoing, Not Occasional
Compliance isn't something you do once a year for an audit. It's an ongoing state of being audit-ready, all the time.
Ongoing Compliance Activities
- Maintain compliance documentation - Keep it current
- Conduct internal audits - Self-assess regularly
- Prepare for external audits - Don't wait for the audit notice
- Manage regulatory changes - Stay aware of new requirements
- Track compliance exceptions - Document and address gaps
- Conduct compliance training - Keep the team aware
- Generate compliance reports - Evidence of compliance
- Manage audit findings - Address audit recommendations
The Audit Trap: Many organizations scramble when an audit is announced. They're not prepared because they treat compliance as an event, not a state.
The Audit-Ready Approach:
- Maintain continuous compliance (every day, not just at audit time)
- Automate evidence collection (generate reports on demand)
- Document everything (if it's not documented, it didn't happen)
- Conduct regular self-audits (find issues before auditors do)
- Treat audits as validations, not investigations (you're already ready)
Continuous Compliance Monitoring:
- Automated policy checks (verify controls are in place)
- Automated evidence collection (gather proof of compliance)
- Compliance dashboards (visualize compliance status)
- Alert on compliance drift (know when you're falling out of compliance)
Pro tip: Auditors love good documentation. If you can show them what they need quickly and easily, audits go smoothly.
What You'll Have
Expected Output:
- Compliance documentation (current and complete)
- Internal audit reports (regular self-assessments)
- External audit coordination (smooth audit experience)
- Compliance status reports (are we compliant?)
- Remediation plans (address gaps)
- Audit findings closure (show you've addressed issues)
Who Approves:
- Compliance Officer owns compliance
- Legal reviews compliance position
- CISO approves security compliance
- Executive Sponsor approves compliance approach
Step 7: Support and User Satisfaction (Happy Users)
Time needed: Ongoing Led by: Support Lead + Product Owner
Support Is Your Front Line
Your support team is often the only human contact users have with your organization. Great support turns frustrated users into loyal advocates.
Support Tiers and SLAs
| Tier | Description | Examples | Response Time | Resolution Time |
|---|---|---|---|---|
| P1 | Critical - System down, data loss | Website unavailable, can't access data | 15 minutes | 4 hours |
| P2 | High - Major feature broken | Can't checkout, can't save work | 1 hour | 24 hours |
| P3 | Medium - Workaround available | Feature slow, minor bug | 4 hours | 1 week |
| P4 | Low - Question, enhancement | How do I?, feature request | 1 business day | 2 weeks |
Why a Knowledge Base Matters:
- Reduces support ticket volume (users find answers themselves)
- Improves consistency (everyone gives the same answers)
- Speeds onboarding (new support team members get up to speed faster)
- Captures tribal knowledge (doesn't leave when people leave)
What to Include:
- Frequently asked questions (FAQs)
- How-to guides (step-by-step instructions)
- Troubleshooting guides (what to do when things go wrong)
- Video tutorials (visual learners prefer video)
- Glossary (explain jargon)
- Release notes (what's new in each version)
Keep It Current:
- Update with every release (new features, new issues)
- Review quarterly (remove outdated content)
- Track usage (what's popular? what's ignored?)
- Get feedback (is it helpful?)
Pro tip: Every support ticket should end with two outcomes: the user's problem is solved, and the knowledge base is updated (if a new issue was discovered).
What You'll Have
Expected Output:
- Support tickets and resolutions (track every issue)
- Support metrics dashboard (are we meeting SLAs?)
- User satisfaction reports (are users happy?)
- Knowledge base articles (self-service support)
- Training materials (help users help themselves)
- User feedback analysis (what are users telling us?)
Who Approves:
- Support Lead owns support operations
- Product Owner validates user satisfaction
Step 8: Backup and Disaster Recovery (Your Last Resort)
Time needed: Ongoing, with periodic testing Led by: SRE Lead + DevOps
Hope for the Best, Plan for the Worst
Backups and disaster recovery (DR) are your insurance policies. You hope you never need them, but if you do, you're really glad you have them.
Backup Strategy
| Data Type | Frequency | Retention | Location |
|---|---|---|---|
| Database | Hourly | 90 days | Off-site (different region) |
| Application | Daily | 30 days | Off-site |
| Logs | Daily | 1 year | Off-site |
| Configs | Per change | 1 year | Version control (Git) |
| User Data | Real-time | 7 years | Off-site (compliance requirement) |
What is Disaster Recovery? DR is about recovering from catastrophic events: data center failure, natural disasters, region-wide outages.
DR Metrics:
- RPO (Recovery Point Objective): How much data can you lose? (Measured in time)
- RTO (Recovery Time Objective): How long can you be down? (Measured in time)
DR Approaches:
- Hot Standby: Full duplicate system running, ready to take over (expensive, fast recovery)
- Warm Standby: System ready but not running (moderate cost, moderate recovery)
- Cold Standby: System configured but not provisioned (cheap, slow recovery)
DR Testing:
- Quarterly DR drills (simulate a disaster, practice recovery)
- Document lessons learned (what went wrong in the drill?)
- Update DR plans (continuous improvement)
- Rotate DR team members (everyone should know what to do)
Pro tip: A DR plan you've never tested is not a DR plan - it's a wish. Test regularly, or prepare to be surprised when it really matters.
What You'll Have
Expected Output:
- Backup schedules and procedures
- Backup validation reports (are backups working?)
- Restore test results (can we actually restore?)
- DR drill reports (did DR work in the drill?)
- DR plan updates (continuous improvement)
- Backup compliance reports (are we meeting requirements?)
Who Approves:
- SRE owns backup and DR
- Security Lead validates data protection
- Compliance Officer validates compliance requirements
Quality Gates: Ongoing Operational Excellence
Operational excellence isn't a destination - it's a continuous state. Regularly confirm:
- Monitoring and alerting active - We can see what's happening
- Incident management process active - We know how to respond
- Maintenance schedule defined and followed - We're keeping things current
- Continuous improvement process active - We're always getting better
- Security operations running - We're staying secure
- Compliance monitoring active - We're always audit-ready
- Support process established - Users can get help
- Backup and DR validated - We can recover from disasters
If any gate fails: Prioritize fixing it. Operational excellence prevents small issues from becoming big incidents.
Operational Metrics: Are We Getting Better?
| Metric | What It Tells You | Target | Current |
|---|---|---|---|
| Availability | Is the system up? | >99.9% | TBD |
| MTTA (Mean Time to Acknowledge) | How fast do we respond? | <15 minutes | TBD |
| MTTR (Mean Time to Resolve) | How fast do we fix? | <4 hours | TBD |
| MTBF (Mean Time Between Failures) | How long between incidents? | >720 hours | TBD |
| Support Satisfaction | Are users happy with support? | >4.5/5 | TBD |
| Backup Success Rate | Are backups working? | 100% | TBD |
| Patch Compliance | Are we up to date? | >95% | TBD |
Remember: Metrics are for learning, not blame. Use trends to identify areas for improvement.
Words of Encouragement
Phase 8 can feel like a lot. You're maintaining vigilance 24/7/365, responding to incidents, keeping systems current, and always improving. It's exhausting, but it's also essential.
Remember:
- You're not alone (you have a team, a process, and tools)
- You're getting better (every incident teaches you something)
- You're providing value (users rely on you)
- You're building resilience (each challenge makes you stronger)
Operational excellence isn't about being perfect. It's about being prepared, responding well, learning constantly, and never stopping.
You're doing important work. The world runs on software, and you're keeping that software running. That's worth celebrating.
The Journey Continues
Phase 8 has no end date - it continues as long as your software is in production. But that doesn't mean nothing changes. You'll:
- Release new versions (go back to Phase 5 for new features)
- Respond to incidents (always learning)
- Improve continuously (always getting better)
- Adapt to change (new requirements, new threats, new opportunities)
The lifecycle is a circle, not a line. Every release brings you back through the phases, each time with more experience, more wisdom, and more resilience.
Thank you for being part of this journey. Your users thank you (even if they don't say it often enough), your team thanks you, and future you will thank present you for doing the hard work of operations and maintenance.
Previous Phase: Phase 7: Deployment & Release Next Phase: None (ongoing operations, but new releases will return to Phase 5)
Version 1.0.0 | OCTALUME Enterprise Lifecycle Framework
Score
Total Score
Based on repository quality metrics
SKILL.mdファイルが含まれている
ライセンスが設定されている
100文字以上の説明がある
GitHub Stars 100以上
1ヶ月以内に更新
10回以上フォークされている
オープンIssueが50未満
プログラミング言語が設定されている
1つ以上のタグが設定されている
Reviews
Reviews coming soon
