Back to list
Harery

phase-08-operations-maintenance

by Harery

Enterprise 8-Phase Software Development Lifecycle Framework with Quality Gates, Multi-Agent Orchestration, and AI-Assisted Development using Claude Code. Perfect for regulated industries (HIPAA, SOC 2, PCI DSS, SOX, GDPR).

1🍴 0📅 Jan 22, 2026

SKILL.md


name: "phase_08_operations_maintenance" description: "Operational support and maintenance for released software. Monitoring and observability, incident management, maintenance and patch management, continuous improvement, security operations, compliance and audit, support and user satisfaction, backup and disaster recovery." phase: 8 phase_name: "Operations & Maintenance" owner: "SRE Lead" secondary_owner: "DevOps Lead" participants: ["SRE", "DevOps", "Tech Lead", "Security Lead", "Compliance Officer", "Support Team", "Product Owner"] entry_criteria: [ "Production deployment successful", "Post-deployment validation passing", "Monitoring and alerting active", "Release documentation published" ] exit_criteria: [ "Ongoing operations established", "Incident management process active", "Maintenance schedule defined", "Continuous improvement process active", "Security operations running", "Compliance monitoring active", "Support process established", "Backup and DR validated" ] estimated_duration: "Ongoing (product lifetime)" dependencies: ["phase_07_deployment_release"] outputs: [ "Monitoring Dashboards", "Alerting Rules", "Incident Response Procedures", "Maintenance Schedule", "Security Operations Plan", "Compliance Reports", "Support Metrics", "Backup and DR Procedures" ] next_phase: "None (ongoing operations)"

Phase 8: Operations & Maintenance

The Long-Term Care for Your Software

Welcome to Phase 8 - the marathon, not the sprint. Unlike the previous phases which have clear beginnings and endings, Phase 8 continues for the lifetime of your software.

This phase is about stability, reliability, and continuous improvement. You're not building new features (usually); you're keeping what you've built running smoothly and making it better over time.

The mindset here: Sleep soundly knowing your systems are monitored, your team is prepared, and your users are supported.


What You'll Achieve

In this ongoing phase, you'll maintain:

  • High availability (users can access your software when they need it)
  • Fast response to incidents (when things break, you fix them quickly)
  • Regular maintenance (patches, updates, upgrades)
  • Continuous improvement (always getting better)
  • Strong security (vigilance against threats)
  • Compliance readiness (always audit-ready)
  • Happy users (support that actually helps)

What makes this different: This phase never really ends. As long as your software is in production, Phase 8 continues.


Who's Driving This Phase?

RoleWhat They're Responsible For
SRE LeadYour reliability champion - keeping systems running and users happy
DevOps LeadInfrastructure and automation - making operations efficient
Tech LeadTechnical support - fixing bugs and implementing improvements
Security LeadSecurity operations - protecting against threats
Compliance OfficerOngoing compliance - staying audit-ready
Support TeamUser support - helping users when they need help

The Operations Lifecycle

Operations isn't static - it's a continuous cycle of monitoring, responding, improving, and repeating.

Monitor → Detect → Respond → Learn → Improve → Monitor

Every incident teaches you something. Every improvement makes you more resilient.

Step 1: Monitoring and Observability (See Everything)

Time needed: Ongoing Led by: SRE Lead

Why Monitoring Matters

You can't fix what you can't see. Monitoring is your eyes and ears in production, telling you what's happening, what's broken, and what needs attention.

What to Monitor

1. Latency (How fast is it?)

  • Response time (p50, p95, p99 percentiles)
  • Database query time
  • API call duration
  • Page load time

2. Traffic (How much is it?)

  • Requests per second
  • Concurrent users
  • Data transfer volume
  • API call rate

3. Errors (What's broken?)

  • HTTP error codes (4xx, 5xx)
  • Application error rate
  • Failed database queries
  • Timeout rate

4. Saturation (How full is it?)

  • CPU utilization
  • Memory usage
  • Disk I/O
  • Network bandwidth

Pro tip: Monitor these four signals, and you'll catch 90% of issues before they become incidents.

Setting Up Monitoring

  1. Application Performance Monitoring (APM) - Track application-level metrics
  2. Infrastructure Monitoring - Track server, database, network metrics
  3. Log Aggregation - Collect and search logs from all services
  4. Distributed Tracing - Follow requests across microservices
  5. Synthetic Monitoring - Simulate user interactions proactively
  6. Real User Monitoring (RUM) - Measure actual user experience
  7. Business Metrics - Track business-critical metrics (signups, purchases, etc.)

Tools for 2026

NeedRecommended ToolsWhy?
Metrics + DashboardsPrometheus + GrafanaOpen source, industry standard, highly flexible
APMDatadog, New Relic, DynatraceComprehensive application insights
LoggingELK Stack, LokiCentralized, searchable logs
Distributed TracingJaeger, TempoFollow requests across services
Uptime MonitoringPingdom, UptimeRobotSimple, reliable uptime checks

What You'll Have

Expected Output:

  • Monitoring dashboards (visual representation of metrics)
  • Alerting rules (when to notify someone)
  • SLO/SLI definitions (what "good" looks like)
  • Runbooks (what to do when alerts fire)
  • On-call procedures (who responds when)

Who Approves:

  • SRE owns monitoring setup
  • Tech Lead validates technical monitoring
  • Product Owner validates business metrics

Step 2: Incident Management (When Things Break)

Time needed: Ongoing Led by: SRE Lead

Incidents Will Happen

No matter how well you build, test, and deploy, things will break. It's not a question of if, but when. Incident management is how you respond without panicking.

Incident Severity Levels

SeverityDescriptionExamplesResponse Target
SEV-1Critical - System down, data loss, security breachWebsite unavailable, database corruption, data breach15 minutes
SEV-2High - Major feature broken, significant impactCheckout not working, users can't login1 hour
SEV-3Medium - Partial degradation, workaround availableSlow performance, some features affected4 hours
SEV-4Low - Minor issues, cosmeticTypos, non-critical bugs, documentation errors1 business day

1. Detect (Alert Fires)

  • Automated monitoring detects an issue
  • Alert is sent to on-call engineer
  • Severity is assessed

2. Acknowledge (Someone Takes Ownership)

  • On-call engineer acknowledges the alert
  • Incident is declared (if severity warrants)
  • Incident commander is identified

3. Respond (Fix the Issue)

  • Gather the incident response team
  • Diagnose the root cause
  • Implement a fix or workaround
  • Validate the fix

4. Resolve (Back to Normal)

  • System is restored to normal operation
  • Monitoring confirms no ongoing issues
  • Incident is marked as resolved

5. Learn (Post-Incident Review)

  • Conduct a blameless post-mortem
  • Document what happened and why
  • Identify action items to prevent recurrence **Share learnings with the team

Pro tip: The goal isn't to never have incidents (that's impossible). The goal is to learn from each incident so you don't have the same one twice.

On-Call Best Practices

Do This:

  • Rotate on-call responsibilities (avoid burnout)
  • Provide on-call compensation (it's hard work)
  • Document runbooks (don't rely on heroics)
  • Use escalation policies (if on-call doesn't respond, escalate)
  • Track on-call metrics (are we getting better?)

Avoid This:

  • Single person on-call (what if they're sick?)
  • No backup (what if primary is unreachable?)
  • Hero culture (relying on individuals rather than processes)
  • Ignoring on-call feedback (they know what's broken)

What You'll Have

Expected Output:

  • Incident management procedures
  • On-call schedule (who's covering when)
  • Incident response runbooks
  • Incident metrics dashboard (MTTA, MTTR, MTBF)
  • Post-incident review reports

Who Approves:

  • SRE owns incident management
  • Tech Lead validates technical procedures
  • Security Lead validates security incident procedures

Step 3: Maintenance and Patch Management (Stay Current)

Time needed: Ongoing (scheduled) Led by: DevOps Lead

Why Maintenance Matters

Software doesn't stay perfect. Dependencies get vulnerabilities, operating systems need updates, databases need optimization. Maintenance is the ongoing care that keeps your software healthy.

Maintenance Schedule

ActivityFrequencyDurationWhat Gets Done
Security PatchesAs needed (within 48 hours of critical CVE)VariesApply critical security fixes immediately
Dependency UpdatesMonthly2-4 hoursUpdate libraries, frameworks, packages
OS PatchesMonthly2-4 hoursApply operating system security patches
Database MaintenanceWeekly1-2 hoursIndex rebuilds, statistics updates, vacuuming
Infrastructure UpdatesQuarterly4-8 hoursUpgrade infrastructure components
Major UpgradesAnnuallyVariesUpgrade to new major versions (e.g., Python 3.12 → 3.13)

Do This:

  • Schedule maintenance during low-traffic windows
  • Communicate maintenance windows to users in advance
  • Test maintenance procedures in staging first
  • Have rollback plans for each maintenance activity
  • Document what was changed and why
  • Monitor after maintenance (catch issues early)

Avoid This:

  • Skipping maintenance (technical debt accumulates)
  • Maintenance without testing (surprises in production are bad)
  • Maintenance without communication (users hate unexpected downtime)
  • "While we're in there" changes (scope creep causes issues)

Pro tip: Automate what you can. Automated maintenance is more reliable and less error-prone than manual maintenance.

What You'll Have

Expected Output:

  • Maintenance schedule (what happens when)
  • Patch management procedures (how to apply updates)
  • Maintenance logs (what was done)
  • Upgrade plans (for major version changes)
  • Change records (audit trail of changes)

Who Approves:

  • DevOps executes maintenance
  • Tech Lead validates technical changes
  • Security Lead validates security patches
  • Product Owner approves maintenance windows

Step 4: Continuous Improvement (Always Getting Better)

Time needed: Ongoing Led by: SRE Lead + Tech Lead

The Improvement Engine

Operations isn't just about keeping the lights on - it's about making things better. Every incident, every user complaint, every metric is an opportunity to improve.

Sources of Improvement

  1. Incidents - What broke? Why? How do we prevent it?
  2. User feedback - What are users complaining about?
  3. Metrics - What's slow? What's error-prone?
  4. Team feedback - What's painful for the team? What's repetitive?
  5. Industry trends - What are others doing better?

The Improvement Process

Identify → Prioritize → Implement → Validate → Standardize → Repeat

Improvements compound. Small gains add up to big results over time.

What is a blameless post-mortem? An analysis of an incident that focuses on what happened and why, not who to blame.

Why blameless?

  • Blame makes people hide mistakes
  • Mistakes are opportunities to learn
  • Systems are usually at fault, not people
  • Psychological safety encourages honesty

Post-mortem template:

## Incident Summary
- What happened?
- When did it happen?
- How long did it last?
- Who was affected?

## Root Cause Analysis
- What was the triggering event?
- What factors contributed?
- What was the underlying cause? (5 whys technique)

## Timeline
- [Time] - Event occurred
- [Time] - Alert fired
- [Time] - Response initiated
- [Time] - Issue resolved

## Action Items
- [ ] Prevent recurrence: (what changes to prevent this?)
- [ ] Improve detection: (how to catch this faster?)
- [ ] Improve documentation: (what docs need updating?)
- [ ] Improve training: (what training would help?)

## Lessons Learned
- What did we learn?
- What should we share with other teams?

Pro tip: The best post-mortems are the ones you learn from. If you're not learning, you're not improving.

What You'll Have

Expected Output:

  • Improvement backlog (ideas for improvements)
  • Performance optimization results
  • Automation enhancements
  • Post-incident review reports
  • Improvement metrics (are we getting better?)

Who Approves:

  • SRE drives continuous improvement
  • Tech Lead validates technical improvements
  • Product Owner prioritizes improvements

Step 5: Security Operations (Stay Vigilant)

Time needed: Ongoing Led by: Security Lead

Security Is a Process, Not a Project

Security in Phase 8 is about vigilance. Threats are constantly evolving, and your security operations need to evolve with them.

Ongoing Security Activities

  1. Monitor security alerts - Watch for suspicious activity
  2. Scan for vulnerabilities - Regular SAST/DAST/SCA scans
  3. Manage access - Review and revoke access as needed
  4. Update security tools - Keep security software current
  5. Conduct security assessments - Regular penetration tests
  6. Security awareness training - Keep the team security-conscious
  7. Incident response - Respond to security incidents
  8. Threat intelligence - Stay informed about new threats

Automated Security:

  • Continuous vulnerability scanning (not just at deployment time)
  • Automated security policy enforcement (e.g., prevent secrets in code)
  • Automated incident response (contain threats automatically)

Cloud Security:

  • Cloud Security Posture Management (CSPM) - Monitor cloud configuration
  • Cloud Workload Protection (CWP) - Protect cloud workloads
  • Zero Trust Architecture - Verify every request, never trust implicitly

DevSecOps:

  • Security as code (define security policies in code)
  • Shift left (find vulnerabilities early)
  • Security testing in CI/CD (automated security gates)

Compliance:

  • Continuous compliance monitoring (always audit-ready)
  • Automated evidence collection (for audits)
  • Policy as code (enforce compliance automatically)

Pro tip: Security operations is a team sport. Everyone has a role to play, not just the security team.

What You'll Have

Expected Output:

  • Security monitoring dashboard
  • Vulnerability assessments (regular scans)
  • Security patch reports
  • Access review reports
  • Security metrics (are we secure?)
  • Incident response reports (when security incidents happen)

Who Approves:

  • Security Lead owns security operations
  • CISO approves overall security posture
  • Compliance Officer validates compliance

Step 6: Compliance and Audit (Always Ready)

Time needed: Ongoing, with periodic audits Led by: Compliance Officer

Compliance Is Ongoing, Not Occasional

Compliance isn't something you do once a year for an audit. It's an ongoing state of being audit-ready, all the time.

Ongoing Compliance Activities

  1. Maintain compliance documentation - Keep it current
  2. Conduct internal audits - Self-assess regularly
  3. Prepare for external audits - Don't wait for the audit notice
  4. Manage regulatory changes - Stay aware of new requirements
  5. Track compliance exceptions - Document and address gaps
  6. Conduct compliance training - Keep the team aware
  7. Generate compliance reports - Evidence of compliance
  8. Manage audit findings - Address audit recommendations

The Audit Trap: Many organizations scramble when an audit is announced. They're not prepared because they treat compliance as an event, not a state.

The Audit-Ready Approach:

  • Maintain continuous compliance (every day, not just at audit time)
  • Automate evidence collection (generate reports on demand)
  • Document everything (if it's not documented, it didn't happen)
  • Conduct regular self-audits (find issues before auditors do)
  • Treat audits as validations, not investigations (you're already ready)

Continuous Compliance Monitoring:

  • Automated policy checks (verify controls are in place)
  • Automated evidence collection (gather proof of compliance)
  • Compliance dashboards (visualize compliance status)
  • Alert on compliance drift (know when you're falling out of compliance)

Pro tip: Auditors love good documentation. If you can show them what they need quickly and easily, audits go smoothly.

What You'll Have

Expected Output:

  • Compliance documentation (current and complete)
  • Internal audit reports (regular self-assessments)
  • External audit coordination (smooth audit experience)
  • Compliance status reports (are we compliant?)
  • Remediation plans (address gaps)
  • Audit findings closure (show you've addressed issues)

Who Approves:

  • Compliance Officer owns compliance
  • Legal reviews compliance position
  • CISO approves security compliance
  • Executive Sponsor approves compliance approach

Step 7: Support and User Satisfaction (Happy Users)

Time needed: Ongoing Led by: Support Lead + Product Owner

Support Is Your Front Line

Your support team is often the only human contact users have with your organization. Great support turns frustrated users into loyal advocates.

Support Tiers and SLAs

TierDescriptionExamplesResponse TimeResolution Time
P1Critical - System down, data lossWebsite unavailable, can't access data15 minutes4 hours
P2High - Major feature brokenCan't checkout, can't save work1 hour24 hours
P3Medium - Workaround availableFeature slow, minor bug4 hours1 week
P4Low - Question, enhancementHow do I?, feature request1 business day2 weeks

Why a Knowledge Base Matters:

  • Reduces support ticket volume (users find answers themselves)
  • Improves consistency (everyone gives the same answers)
  • Speeds onboarding (new support team members get up to speed faster)
  • Captures tribal knowledge (doesn't leave when people leave)

What to Include:

  • Frequently asked questions (FAQs)
  • How-to guides (step-by-step instructions)
  • Troubleshooting guides (what to do when things go wrong)
  • Video tutorials (visual learners prefer video)
  • Glossary (explain jargon)
  • Release notes (what's new in each version)

Keep It Current:

  • Update with every release (new features, new issues)
  • Review quarterly (remove outdated content)
  • Track usage (what's popular? what's ignored?)
  • Get feedback (is it helpful?)

Pro tip: Every support ticket should end with two outcomes: the user's problem is solved, and the knowledge base is updated (if a new issue was discovered).

What You'll Have

Expected Output:

  • Support tickets and resolutions (track every issue)
  • Support metrics dashboard (are we meeting SLAs?)
  • User satisfaction reports (are users happy?)
  • Knowledge base articles (self-service support)
  • Training materials (help users help themselves)
  • User feedback analysis (what are users telling us?)

Who Approves:

  • Support Lead owns support operations
  • Product Owner validates user satisfaction

Step 8: Backup and Disaster Recovery (Your Last Resort)

Time needed: Ongoing, with periodic testing Led by: SRE Lead + DevOps

Hope for the Best, Plan for the Worst

Backups and disaster recovery (DR) are your insurance policies. You hope you never need them, but if you do, you're really glad you have them.

Backup Strategy

Data TypeFrequencyRetentionLocation
DatabaseHourly90 daysOff-site (different region)
ApplicationDaily30 daysOff-site
LogsDaily1 yearOff-site
ConfigsPer change1 yearVersion control (Git)
User DataReal-time7 yearsOff-site (compliance requirement)

What is Disaster Recovery? DR is about recovering from catastrophic events: data center failure, natural disasters, region-wide outages.

DR Metrics:

  • RPO (Recovery Point Objective): How much data can you lose? (Measured in time)
  • RTO (Recovery Time Objective): How long can you be down? (Measured in time)

DR Approaches:

  • Hot Standby: Full duplicate system running, ready to take over (expensive, fast recovery)
  • Warm Standby: System ready but not running (moderate cost, moderate recovery)
  • Cold Standby: System configured but not provisioned (cheap, slow recovery)

DR Testing:

  • Quarterly DR drills (simulate a disaster, practice recovery)
  • Document lessons learned (what went wrong in the drill?)
  • Update DR plans (continuous improvement)
  • Rotate DR team members (everyone should know what to do)

Pro tip: A DR plan you've never tested is not a DR plan - it's a wish. Test regularly, or prepare to be surprised when it really matters.

What You'll Have

Expected Output:

  • Backup schedules and procedures
  • Backup validation reports (are backups working?)
  • Restore test results (can we actually restore?)
  • DR drill reports (did DR work in the drill?)
  • DR plan updates (continuous improvement)
  • Backup compliance reports (are we meeting requirements?)

Who Approves:

  • SRE owns backup and DR
  • Security Lead validates data protection
  • Compliance Officer validates compliance requirements

Quality Gates: Ongoing Operational Excellence

Operational excellence isn't a destination - it's a continuous state. Regularly confirm:

  • Monitoring and alerting active - We can see what's happening
  • Incident management process active - We know how to respond
  • Maintenance schedule defined and followed - We're keeping things current
  • Continuous improvement process active - We're always getting better
  • Security operations running - We're staying secure
  • Compliance monitoring active - We're always audit-ready
  • Support process established - Users can get help
  • Backup and DR validated - We can recover from disasters

If any gate fails: Prioritize fixing it. Operational excellence prevents small issues from becoming big incidents.


Operational Metrics: Are We Getting Better?

MetricWhat It Tells YouTargetCurrent
AvailabilityIs the system up?>99.9%TBD
MTTA (Mean Time to Acknowledge)How fast do we respond?<15 minutesTBD
MTTR (Mean Time to Resolve)How fast do we fix?<4 hoursTBD
MTBF (Mean Time Between Failures)How long between incidents?>720 hoursTBD
Support SatisfactionAre users happy with support?>4.5/5TBD
Backup Success RateAre backups working?100%TBD
Patch ComplianceAre we up to date?>95%TBD

Remember: Metrics are for learning, not blame. Use trends to identify areas for improvement.


Words of Encouragement

Phase 8 can feel like a lot. You're maintaining vigilance 24/7/365, responding to incidents, keeping systems current, and always improving. It's exhausting, but it's also essential.

Remember:

  • You're not alone (you have a team, a process, and tools)
  • You're getting better (every incident teaches you something)
  • You're providing value (users rely on you)
  • You're building resilience (each challenge makes you stronger)

Operational excellence isn't about being perfect. It's about being prepared, responding well, learning constantly, and never stopping.

You're doing important work. The world runs on software, and you're keeping that software running. That's worth celebrating.


The Journey Continues

Phase 8 has no end date - it continues as long as your software is in production. But that doesn't mean nothing changes. You'll:

  • Release new versions (go back to Phase 5 for new features)
  • Respond to incidents (always learning)
  • Improve continuously (always getting better)
  • Adapt to change (new requirements, new threats, new opportunities)

The lifecycle is a circle, not a line. Every release brings you back through the phases, each time with more experience, more wisdom, and more resilience.

Thank you for being part of this journey. Your users thank you (even if they don't say it often enough), your team thanks you, and future you will thank present you for doing the hard work of operations and maintenance.


Previous Phase: Phase 7: Deployment & Release Next Phase: None (ongoing operations, but new releases will return to Phase 5)



Version 1.0.0 | OCTALUME Enterprise Lifecycle Framework

Score

Total Score

75/100

Based on repository quality metrics

SKILL.md

SKILL.mdファイルが含まれている

+20
LICENSE

ライセンスが設定されている

+10
説明文

100文字以上の説明がある

+10
人気

GitHub Stars 100以上

0/15
最近の活動

1ヶ月以内に更新

+10
フォーク

10回以上フォークされている

0/5
Issue管理

オープンIssueが50未満

+5
言語

プログラミング言語が設定されている

+5
タグ

1つ以上のタグが設定されている

+5

Reviews

💬

Reviews coming soon