phase-08-operations-maintenance

Name: phase-08-operations-maintenance
Rating: 75
Author: Harery


Enterprise 8-Phase Software Development Lifecycle Framework with Quality Gates, Multi-Agent Orchestration, and AI-Assisted Development using Claude Code. Perfect for regulated industries (HIPAA, SOC 2, PCI DSS, SOX, GDPR).

⭐ 1 | 🍴 0 | 📅 Jan 22, 2026

ai-governance autonomous-sdlc claude-code compliance-ready enterprise-framework hipaa linux-framework modelcontextprotocol


SKILL.md

name: "phase_08_operations_maintenance"
description: "Operational support and maintenance for released software. Monitoring and observability, incident management, maintenance and patch management, continuous improvement, security operations, compliance and audit, support and user satisfaction, backup and disaster recovery."
phase: 8
phase_name: "Operations & Maintenance"
owner: "SRE Lead"
secondary_owner: "DevOps Lead"
participants: ["SRE", "DevOps", "Tech Lead", "Security Lead", "Compliance Officer", "Support Team", "Product Owner"]
entry_criteria: [
  "Production deployment successful",
  "Post-deployment validation passing",
  "Monitoring and alerting active",
  "Release documentation published"
]
exit_criteria: [
  "Ongoing operations established",
  "Incident management process active",
  "Maintenance schedule defined",
  "Continuous improvement process active",
  "Security operations running",
  "Compliance monitoring active",
  "Support process established",
  "Backup and DR validated"
]
estimated_duration: "Ongoing (product lifetime)"
dependencies: ["phase_07_deployment_release"]
outputs: [
  "Monitoring Dashboards",
  "Alerting Rules",
  "Incident Response Procedures",
  "Maintenance Schedule",
  "Security Operations Plan",
  "Compliance Reports",
  "Support Metrics",
  "Backup and DR Procedures"
]
next_phase: "None (ongoing operations)"

Phase 8: Operations & Maintenance

The Long-Term Care for Your Software

Welcome to Phase 8 - the marathon, not the sprint. Unlike the previous phases, which have clear beginnings and endings, Phase 8 continues for the lifetime of your software.

This phase is about stability, reliability, and continuous improvement. You're not building new features (usually); you're keeping what you've built running smoothly and making it better over time.

The mindset here: Sleep soundly knowing your systems are monitored, your team is prepared, and your users are supported.

What You'll Achieve

In this ongoing phase, you'll maintain:

High availability (users can access your software when they need it)
Fast response to incidents (when things break, you fix them quickly)
Regular maintenance (patches, updates, upgrades)
Continuous improvement (always getting better)
Strong security (vigilance against threats)
Compliance readiness (always audit-ready)
Happy users (support that actually helps)

What makes this different: This phase never really ends. As long as your software is in production, Phase 8 continues.

Who's Driving This Phase?

Role	What They're Responsible For
SRE Lead	Your reliability champion - keeping systems running and users happy
DevOps Lead	Infrastructure and automation - making operations efficient
Tech Lead	Technical support - fixing bugs and implementing improvements
Security Lead	Security operations - protecting against threats
Compliance Officer	Ongoing compliance - staying audit-ready
Support Team	User support - helping users when they need help

The Operations Lifecycle

Operations isn't static - it's a continuous cycle of monitoring, responding, improving, and repeating.

Monitor → Detect → Respond → Learn → Improve → Monitor

Every incident teaches you something. Every improvement makes you more resilient.

Step 1: Monitoring and Observability (See Everything)

Time needed: Ongoing
Led by: SRE Lead

Why Monitoring Matters

You can't fix what you can't see. Monitoring is your eyes and ears in production, telling you what's happening, what's broken, and what needs attention.

What to Monitor

1. Latency (How fast is it?)

Response time (p50, p95, p99 percentiles)
Database query time
API call duration
Page load time

2. Traffic (How much is it?)

Requests per second
Concurrent users
Data transfer volume
API call rate

3. Errors (What's broken?)

HTTP error codes (4xx, 5xx)
Application error rate
Failed database queries
Timeout rate

4. Saturation (How full is it?)

CPU utilization
Memory usage
Disk I/O
Network bandwidth

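Pro tip: These four signals (often called the "four golden signals" in SRE practice) cover most of what goes wrong in production. Monitor them, and you'll catch many issues before they become incidents.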
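To make the latency and error signals concrete, here is a minimal sketch of how p50/p95/p99 and a server error rate could be computed from raw samples. The function names and sample data are hypothetical, and the nearest-rank percentile method is just one common choice; your monitoring tool will normally do this for you.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def error_rate(status_codes):
    """Fraction of responses that are server errors (5xx)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


# Hypothetical response times in milliseconds; one slow outlier.
latencies_ms = [120, 95, 110, 105, 2300, 98, 101, 130, 115, 99]

for pct in (50, 95, 99):
    print(f"p{pct}: {percentile(latencies_ms, pct)} ms")

print("5xx rate:", error_rate([200, 200, 503, 404, 200]))
```

Notice how the single outlier leaves the median untouched but dominates the tail percentiles. That is why the list above asks for p95 and p99 rather than an average.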

Setting Up Monitoring

Application Performance Monitoring (APM) - Track application-level metrics
Infrastructure Monitoring - Track server, database, network metrics
Log Aggregation - Collect and search logs from all services
Distributed Tracing - Follow requests across microservices
Synthetic Monitoring - Simulate user interactions proactively
Real User Monitoring (RUM) - Measure actual user experience
Business Metrics - Track business-critical metrics (signups, purchases, etc.)

Tools for 2026

Need	Recommended Tools	Why?
Metrics + Dashboards	Prometheus + Grafana	Open source, industry standard, highly flexible
APM	Datadog, New Relic, Dynatrace	Comprehensive application insights
Logging	ELK Stack, Loki	Centralized, searchable logs
Distributed Tracing	Jaeger, Tempo	Follow requests across services
Uptime Monitoring	Pingdom, UptimeRobot	Simple, reliable uptime checks

What You'll Have

Expected Output:

Monitoring dashboards (visual representation of metrics)
Alerting rules (when to notify someone)
SLO/SLI definitions (what "good" looks like)
Runbooks (what to do when alerts fire)
On-call procedures (who responds when)

Who Approves:

SRE owns monitoring setup
Tech Lead validates technical monitoring
Product Owner validates business metrics

Step 2: Incident Management (When Things Break)

Time needed: Ongoing
Led by: SRE Lead

Incidents Will Happen

No matter how well you build, test, and deploy, things will break. It's not a question of if, but when. Incident management is how you respond without panicking.

Incident Severity Levels

Severity	Description	Examples	Response Target
SEV-1	Critical - System down, data loss, security breach	Website unavailable, database corruption, data breach	15 minutes
SEV-2	High - Major feature broken, significant impact	Checkout not working, users can't login	1 hour
SEV-3	Medium - Partial degradation, workaround available	Slow performance, some features affected	4 hours
SEV-4	Low - Minor issues, cosmetic	Typos, non-critical bugs, documentation errors	1 business day
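The response targets in the table can be encoded so that tooling checks them instead of people remembering them. This is a rough sketch: the function names are hypothetical, and treating "1 business day" as 24 hours is a simplifying assumption.

```python
from datetime import datetime, timedelta

# Response targets copied from the severity table.
# Assumption: "1 business day" for SEV-4 is approximated as 24 hours.
RESPONSE_TARGETS = {
    "SEV-1": timedelta(minutes=15),
    "SEV-2": timedelta(hours=1),
    "SEV-3": timedelta(hours=4),
    "SEV-4": timedelta(hours=24),
}


def response_deadline(severity, detected_at):
    """Latest time by which someone should have responded."""
    return detected_at + RESPONSE_TARGETS[severity]


def met_response_target(severity, detected_at, acknowledged_at):
    """True if the acknowledgement landed on or before the deadline."""
    return acknowledged_at <= response_deadline(severity, detected_at)


detected = datetime(2026, 1, 22, 3, 0)
print(response_deadline("SEV-1", detected))
print(met_response_target("SEV-2", detected, detected + timedelta(minutes=50)))
```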

1. Detect (Alert Fires)

Automated monitoring detects an issue
Alert is sent to on-call engineer
Severity is assessed

2. Acknowledge (Someone Takes Ownership)

On-call engineer acknowledges the alert
Incident is declared (if severity warrants)
Incident commander is identified

3. Respond (Fix the Issue)

Gather the incident response team
Diagnose the root cause
Implement a fix or workaround
Validate the fix

4. Resolve (Back to Normal)

System is restored to normal operation
Monitoring confirms no ongoing issues
Incident is marked as resolved

5. Learn (Post-Incident Review)

Conduct a blameless post-mortem
Document what happened and why
Identify action items to prevent recurrence
Share learnings with the team

Pro tip: The goal isn't to never have incidents (that's impossible). The goal is to learn from each incident so you don't have the same one twice.

On-Call Best Practices

Do This:

Rotate on-call responsibilities (avoid burnout)
Provide on-call compensation (it's hard work)
Document runbooks (don't rely on heroics)
Use escalation policies (if on-call doesn't respond, escalate)
Track on-call metrics (are we getting better?)

Avoid This:

Single person on-call (what if they're sick?)
No backup (what if primary is unreachable?)
Hero culture (relying on individuals rather than processes)
Ignoring on-call feedback (they know what's broken)
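An escalation policy is easiest to reason about when it is written down as data. The sketch below is one possible shape, not a prescribed one: the contact names, wait times, and the `who_to_page` helper are all hypothetical, and real paging tools have their own configuration formats.

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    contact: str
    wait_minutes: int  # how long to wait for an acknowledgement before escalating further


def who_to_page(policy, minutes_since_alert):
    """Return every contact who should have been paged by now for an unacknowledged alert."""
    paged = []
    elapsed = 0
    for step in policy:
        if minutes_since_alert >= elapsed:
            paged.append(step.contact)
        elapsed += step.wait_minutes
    return paged


# Hypothetical policy: primary first, then a backup, then the SRE Lead.
policy = [
    EscalationStep("primary on-call", 10),
    EscalationStep("secondary on-call", 10),
    EscalationStep("SRE Lead", 15),
]

for minutes in (0, 12, 25):
    print(minutes, who_to_page(policy, minutes))
```

Having at least two steps in the policy is what addresses the "no backup" problem from the list above.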

What You'll Have

Expected Output:

Incident management procedures
On-call schedule (who's covering when)
Incident response runbooks
Incident metrics dashboard (MTTA, MTTR, MTBF)
Post-incident review reports
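The three numbers on the incident metrics dashboard can be derived from incident timestamps. The sketch below assumes each incident record has hypothetical `detected`, `acknowledged`, and `resolved` fields, and it uses one common definition of MTBF (the gap between one incident ending and the next one starting); teams define these metrics slightly differently, so agree on a definition first.

```python
from datetime import datetime, timedelta


def _mean(deltas):
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean time from detection to acknowledgement."""
    return _mean([i["acknowledged"] - i["detected"] for i in incidents])


def mttr(incidents):
    """Mean time from detection to resolution."""
    return _mean([i["resolved"] - i["detected"] for i in incidents])


def mtbf(incidents):
    """Mean gap between one incident's resolution and the next incident's detection."""
    if len(incidents) < 2:
        raise ValueError("need at least two incidents to measure a gap")
    ordered = sorted(incidents, key=lambda i: i["detected"])
    gaps = [nxt["detected"] - prev["resolved"] for prev, nxt in zip(ordered, ordered[1:])]
    return _mean(gaps)


# Hypothetical incident log.
incidents = [
    {"detected": datetime(2026, 1, 1, 10, 0),
     "acknowledged": datetime(2026, 1, 1, 10, 5),
     "resolved": datetime(2026, 1, 1, 11, 0)},
    {"detected": datetime(2026, 1, 11, 10, 0),
     "acknowledged": datetime(2026, 1, 11, 10, 15),
     "resolved": datetime(2026, 1, 11, 13, 0)},
]

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents))
```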

Who Approves:

SRE owns incident management
Tech Lead validates technical procedures
Security Lead validates security incident procedures

Step 3: Maintenance and Patch Management (Stay Current)

Time needed: Ongoing (scheduled)
Led by: DevOps Lead

Why Maintenance Matters

Software doesn't stay perfect. Dependencies get vulnerabilities, operating systems need updates, databases need optimization. Maintenance is the ongoing care that keeps your software healthy.

Maintenance Schedule

Activity	Frequency	Duration	What Gets Done
Security Patches	As needed (within 48 hours of critical CVE)	Varies	Apply critical security fixes immediately
Dependency Updates	Monthly	2-4 hours	Update libraries, frameworks, packages
OS Patches	Monthly	2-4 hours	Apply operating system security patches
Database Maintenance	Weekly	1-2 hours	Index rebuilds, statistics updates, vacuuming
Infrastructure Updates	Quarterly	4-8 hours	Upgrade infrastructure components
Major Upgrades	Annually	Varies	Upgrade to new major or feature versions (e.g., Python 3.12 → 3.13)
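The 48-hour window for critical security patches is the one row in this schedule with a hard deadline, so it is worth tracking explicitly. A minimal sketch, assuming hypothetical timestamps for when a CVE was published and when the patch was applied:

```python
from datetime import datetime, timedelta

CRITICAL_PATCH_WINDOW = timedelta(hours=48)  # from the maintenance schedule


def patch_status(published_at, patched_at, now):
    """Classify a critical CVE as on time, late, pending, or overdue."""
    deadline = published_at + CRITICAL_PATCH_WINDOW
    if patched_at is not None:
        return "on time" if patched_at <= deadline else "late"
    return "pending" if now <= deadline else "overdue"


published = datetime(2026, 1, 20, 9, 0)
print(patch_status(published, datetime(2026, 1, 21, 9, 0), datetime(2026, 1, 25)))
print(patch_status(published, None, datetime(2026, 1, 23, 9, 0)))
```

Counting the share of CVEs that come back "on time" is one way to feed a patch compliance metric.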

Do This:

Schedule maintenance during low-traffic windows
Communicate maintenance windows to users in advance
Test maintenance procedures in staging first
Have rollback plans for each maintenance activity
Document what was changed and why
Monitor after maintenance (catch issues early)

Avoid This:

Skipping maintenance (technical debt accumulates)
Maintenance without testing (surprises in production are bad)
Maintenance without communication (users hate unexpected downtime)
"While we're in there" changes (scope creep causes issues)

Pro tip: Automate what you can. Automated maintenance is more reliable and less error-prone than manual maintenance.

What You'll Have

Expected Output:

Maintenance schedule (what happens when)
Patch management procedures (how to apply updates)
Maintenance logs (what was done)
Upgrade plans (for major version changes)
Change records (audit trail of changes)

Who Approves:

DevOps executes maintenance
Tech Lead validates technical changes
Security Lead validates security patches
Product Owner approves maintenance windows

Step 4: Continuous Improvement (Always Getting Better)

Time needed: Ongoing
Led by: SRE Lead + Tech Lead

The Improvement Engine

Operations isn't just about keeping the lights on - it's about making things better. Every incident, every user complaint, every metric is an opportunity to improve.

Sources of Improvement

Incidents - What broke? Why? How do we prevent it?
User feedback - What are users complaining about?
Metrics - What's slow? What's error-prone?
Team feedback - What's painful for the team? What's repetitive?
Industry trends - What are others doing better?

The Improvement Process

Identify → Prioritize → Implement → Validate → Standardize → Repeat

Improvements compound. Small gains add up to big results over time.

What is a blameless post-mortem? An analysis of an incident that focuses on what happened and why, not who to blame.

Why blameless?

Blame makes people hide mistakes
Mistakes are opportunities to learn
Systems are usually at fault, not people
Psychological safety encourages honesty

Post-mortem template:

## Incident Summary
- What happened?
- When did it happen?
- How long did it last?
- Who was affected?

## Root Cause Analysis
- What was the triggering event?
- What factors contributed?
- What was the underlying cause? (5 whys technique)

## Timeline
- [Time] - Event occurred
- [Time] - Alert fired
- [Time] - Response initiated
- [Time] - Issue resolved

## Action Items
- [ ] Prevent recurrence: (what changes to prevent this?)
- [ ] Improve detection: (how to catch this faster?)
- [ ] Improve documentation: (what docs need updating?)
- [ ] Improve training: (what training would help?)

## Lessons Learned
- What did we learn?
- What should we share with other teams?

Pro tip: The best post-mortems are the ones you learn from. If you're not learning, you're not improving.

What You'll Have

Expected Output:

Improvement backlog (ideas for improvements)
Performance optimization results
Automation enhancements
Post-incident review reports
Improvement metrics (are we getting better?)

Who Approves:

SRE drives continuous improvement
Tech Lead validates technical improvements
Product Owner prioritizes improvements

Step 5: Security Operations (Stay Vigilant)

Time needed: Ongoing
Led by: Security Lead

Security Is a Process, Not a Project

Security in Phase 8 is about vigilance. Threats are constantly evolving, and your security operations need to evolve with them.

Ongoing Security Activities

Monitor security alerts - Watch for suspicious activity
Scan for vulnerabilities - Regular SAST/DAST/SCA scans
Manage access - Review and revoke access as needed
Update security tools - Keep security software current
Conduct security assessments - Regular penetration tests
Security awareness training - Keep the team security-conscious
Incident response - Respond to security incidents
Threat intelligence - Stay informed about new threats

Automated Security:

Continuous vulnerability scanning (not just at deployment time)
Automated security policy enforcement (e.g., prevent secrets in code)
Automated incident response (contain threats automatically)
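As an illustration of automated policy enforcement, here is a small sketch of a secrets check that could run before code is committed. The three patterns are illustrative only and will miss many real secrets; the rule names and the `find_secrets` helper are hypothetical, and a dedicated scanner is the better choice in practice.

```python
import re

# Illustrative patterns only; real scanners ship far larger rule sets.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "hard-coded password": re.compile(r"(?i)\bpassword\s*=\s*['\"][^'\"]+['\"]"),
    "private key header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return (line_number, rule_name) for every line that matches a pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for rule_name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, rule_name))
    return findings


# The key below is the placeholder value used in AWS documentation, not a real credential.
sample = 'db_host = "localhost"\npassword = "hunter2"\nkey = "AKIAIOSFODNN7EXAMPLE"\n'

for line_number, rule_name in find_secrets(sample):
    print(f"line {line_number}: possible {rule_name}")
```

Wired into a pre-commit hook or CI job, a non-empty result would fail the check and block the change.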

Cloud Security:

Cloud Security Posture Management (CSPM) - Monitor cloud configuration
Cloud Workload Protection (CWP) - Protect cloud workloads
Zero Trust Architecture - Verify every request, never trust implicitly

DevSecOps:

Security as code (define security policies in code)
Shift left (find vulnerabilities early)
Security testing in CI/CD (automated security gates)

Compliance:

Continuous compliance monitoring (always audit-ready)
Automated evidence collection (for audits)
Policy as code (enforce compliance automatically)

Pro tip: Security operations is a team sport. Everyone has a role to play, not just the security team.

What You'll Have

Expected Output:

Security monitoring dashboard
Vulnerability assessments (regular scans)
Security patch reports
Access review reports
Security metrics (are we secure?)
Incident response reports (when security incidents happen)

Who Approves:

Security Lead owns security operations
CISO approves overall security posture
Compliance Officer validates compliance

Step 6: Compliance and Audit (Always Ready)

Time needed: Ongoing, with periodic audits
Led by: Compliance Officer

Compliance Is Ongoing, Not Occasional

Compliance isn't something you do once a year for an audit. It's an ongoing state of being audit-ready, all the time.

Ongoing Compliance Activities

Maintain compliance documentation - Keep it current
Conduct internal audits - Self-assess regularly
Prepare for external audits - Don't wait for the audit notice
Manage regulatory changes - Stay aware of new requirements
Track compliance exceptions - Document and address gaps
Conduct compliance training - Keep the team aware
Generate compliance reports - Evidence of compliance
Manage audit findings - Address audit recommendations

The Audit Trap: Many organizations scramble when an audit is announced. They're not prepared because they treat compliance as an event, not a state.

The Audit-Ready Approach:

Maintain continuous compliance (every day, not just at audit time)
Automate evidence collection (generate reports on demand)
Document everything (if it's not documented, it didn't happen)
Conduct regular self-audits (find issues before auditors do)
Treat audits as validations, not investigations (you're already ready)

Continuous Compliance Monitoring:

Automated policy checks (verify controls are in place)
Automated evidence collection (gather proof of compliance)
Compliance dashboards (visualize compliance status)
Alert on compliance drift (know when you're falling out of compliance)
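Drift detection boils down to comparing what your controls should be against what they currently are. The sketch below shows the idea with a hypothetical set of control names; in a real setup the observed state would come from your cloud provider or configuration management tooling rather than a hand-written dictionary.

```python
# Hypothetical required controls; real ones come from your compliance framework.
REQUIRED_CONTROLS = {
    "encryption_at_rest": True,
    "mfa_enforced": True,
    "audit_logging": True,
}


def compliance_drift(observed):
    """Return {control: (expected, actual)} for every control that deviates or is missing."""
    drift = {}
    for control, expected in REQUIRED_CONTROLS.items():
        actual = observed.get(control)  # None means the control was not reported at all
        if actual != expected:
            drift[control] = (expected, actual)
    return drift


observed_state = {"encryption_at_rest": True, "mfa_enforced": False}

for control, (expected, actual) in compliance_drift(observed_state).items():
    print(f"DRIFT {control}: expected {expected}, found {actual}")
```

An empty result is also useful: logged with a timestamp, it becomes a piece of audit evidence that the controls were in place on that day.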

Pro tip: Auditors love good documentation. If you can show them what they need quickly and easily, audits go smoothly.

What You'll Have

Expected Output:

Compliance documentation (current and complete)
Internal audit reports (regular self-assessments)
External audit coordination (smooth audit experience)
Compliance status reports (are we compliant?)
Remediation plans (address gaps)
Audit findings closure (show you've addressed issues)

Who Approves:

Compliance Officer owns compliance
Legal reviews compliance position
CISO approves security compliance
Executive Sponsor approves compliance approach

Step 7: Support and User Satisfaction (Happy Users)

Time needed: Ongoing
Led by: Support Lead + Product Owner

Support Is Your Front Line

Your support team is often the only human contact users have with your organization. Great support turns frustrated users into loyal advocates.

Support Tiers and SLAs

Tier	Description	Examples	Response Time	Resolution Time
P1	Critical - System down, data loss	Website unavailable, can't access data	15 minutes	4 hours
P2	High - Major feature broken	Can't checkout, can't save work	1 hour	24 hours
P3	Medium - Workaround available	Feature slow, minor bug	4 hours	1 week
P4	Low - Question, enhancement	How do I?, feature request	1 business day	2 weeks

Why a Knowledge Base Matters:

Reduces support ticket volume (users find answers themselves)
Improves consistency (everyone gives the same answers)
Speeds onboarding (new support team members get up to speed faster)
Captures tribal knowledge (doesn't leave when people leave)

What to Include:

Frequently asked questions (FAQs)
How-to guides (step-by-step instructions)
Troubleshooting guides (what to do when things go wrong)
Video tutorials (visual learners prefer video)
Glossary (explain jargon)
Release notes (what's new in each version)

Keep It Current:

Update with every release (new features, new issues)
Review quarterly (remove outdated content)
Track usage (what's popular? what's ignored?)
Get feedback (is it helpful?)

Pro tip: Every support ticket should end with two outcomes: the user's problem is solved, and the knowledge base is updated (if a new issue was discovered).

What You'll Have

Expected Output:

Support tickets and resolutions (track every issue)
Support metrics dashboard (are we meeting SLAs?)
User satisfaction reports (are users happy?)
Knowledge base articles (self-service support)
Training materials (help users help themselves)
User feedback analysis (what are users telling us?)

Who Approves:

Support Lead owns support operations
Product Owner validates user satisfaction

Step 8: Backup and Disaster Recovery (Your Last Resort)

Time needed: Ongoing, with periodic testing
Led by: SRE Lead + DevOps

Hope for the Best, Plan for the Worst

Backups and disaster recovery (DR) are your insurance policies. You hope you never need them, but if you do, you'll be very glad you have them.

Backup Strategy

Data Type	Frequency	Retention	Location
Database	Hourly	90 days	Off-site (different region)
Application	Daily	30 days	Off-site
Logs	Daily	1 year	Off-site
Configs	Per change	1 year	Version control (Git)
User Data	Real-time	7 years	Off-site (compliance requirement)
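Retention periods like the ones in this table are usually enforced by a cleanup job. Here is a minimal sketch of the selection logic only, assuming hypothetical backup records with `name`, `type`, and `created` fields and treating "1 year" as 365 days. It deliberately does not delete anything.

```python
from datetime import datetime, timedelta

# Retention periods from the backup strategy table (three rows shown).
# Assumption: "1 year" is treated as 365 days.
RETENTION = {
    "database": timedelta(days=90),
    "application": timedelta(days=30),
    "logs": timedelta(days=365),
}


def expired_backups(backups, now):
    """Return the backups that are older than the retention period for their type."""
    return [b for b in backups if now - b["created"] > RETENTION[b["type"]]]


# Hypothetical backup inventory.
backups = [
    {"name": "db-2025-10-01", "type": "database", "created": datetime(2025, 10, 1)},
    {"name": "app-2026-01-10", "type": "application", "created": datetime(2026, 1, 10)},
    {"name": "logs-2025-06-01", "type": "logs", "created": datetime(2025, 6, 1)},
]

now = datetime(2026, 1, 22)
print([b["name"] for b in expired_backups(backups, now)])
```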

What is Disaster Recovery? DR is about recovering from catastrophic events: data center failure, natural disasters, region-wide outages.

DR Metrics:

RPO (Recovery Point Objective): How much data can you lose? (Measured in time)
RTO (Recovery Time Objective): How long can you be down? (Measured in time)
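Both objectives are time spans, so a DR drill can be scored with simple date arithmetic. The sketch below is illustrative: the function name, the timestamps, and the 1-hour RPO and 4-hour RTO are assumptions, not targets taken from this framework.

```python
from datetime import datetime, timedelta


def evaluate_recovery(last_backup, failure, restored, rpo, rto):
    """Compare the actual data-loss window and downtime against the objectives."""
    data_loss = failure - last_backup  # work done since the last good backup is lost
    downtime = restored - failure      # how long the service was unavailable
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical drill: backup at 02:00, failure at 02:40, service restored at 05:10.
result = evaluate_recovery(
    last_backup=datetime(2026, 1, 22, 2, 0),
    failure=datetime(2026, 1, 22, 2, 40),
    restored=datetime(2026, 1, 22, 5, 10),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=4),
)
print(result)
```

The worst-case data loss is roughly the backup interval, which is why the backup frequency you choose has to be at least as tight as your RPO.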

DR Approaches:

Hot Standby: Full duplicate system running, ready to take over (expensive, fast recovery)
Warm Standby: System ready but not running (moderate cost, moderate recovery)
Cold Standby: System configured but not provisioned (cheap, slow recovery)

DR Testing:

Quarterly DR drills (simulate a disaster, practice recovery)
Document lessons learned (what went wrong in the drill?)
Update DR plans (continuous improvement)
Rotate DR team members (everyone should know what to do)

Pro tip: A DR plan you've never tested is not a DR plan - it's a wish. Test regularly, or prepare to be surprised when it really matters.

What You'll Have

Expected Output:

Backup schedules and procedures
Backup validation reports (are backups working?)
Restore test results (can we actually restore?)
DR drill reports (did DR work in the drill?)
DR plan updates (continuous improvement)
Backup compliance reports (are we meeting requirements?)

Who Approves:

SRE owns backup and DR
Security Lead validates data protection
Compliance Officer validates compliance requirements

Quality Gates: Ongoing Operational Excellence

Operational excellence isn't a destination - it's a continuous state. Regularly confirm:

Monitoring and alerting active - We can see what's happening
Incident management process active - We know how to respond
Maintenance schedule defined and followed - We're keeping things current
Continuous improvement process active - We're always getting better
Security operations running - We're staying secure
Compliance monitoring active - We're always audit-ready
Support process established - Users can get help
Backup and DR validated - We can recover from disasters

If any gate fails: Prioritize fixing it. Operational excellence prevents small issues from becoming big incidents.

Operational Metrics: Are We Getting Better?

Metric	What It Tells You	Target	Current
Availability	Is the system up?	>99.9%	TBD
MTTA (Mean Time to Acknowledge)	How fast do we respond?	<15 minutes	TBD
MTTR (Mean Time to Resolve)	How fast do we fix?	<4 hours	TBD
MTBF (Mean Time Between Failures)	How long between incidents?	>720 hours	TBD
Support Satisfaction	Are users happy with support?	>4.5/5	TBD
Backup Success Rate	Are backups working?	100%	TBD
Patch Compliance	Are we up to date?	>95%	TBD

Remember: Metrics are for learning, not blame. Use trends to identify areas for improvement.
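An availability target is easier to act on once it is translated into minutes of allowed downtime. The helper names below are hypothetical, and the 30-day period is an assumption; pick whatever reporting window your team actually uses.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Downtime budget, in minutes, for a given availability target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def measured_availability(downtime_minutes, period_days=30):
    """Availability percentage achieved given the observed downtime."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - downtime_minutes / total_minutes)


budget = allowed_downtime_minutes(99.9)
print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
print(f"90 minutes of downtime means {measured_availability(90):.3f}% availability")
```

With the 99.9% target from the table, a single SEV-1 that takes four hours to resolve would use up several months of budget, which is one reason MTTR and availability should be read together.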

Words of Encouragement

Phase 8 can feel like a lot. You're maintaining vigilance 24/7/365, responding to incidents, keeping systems current, and always improving. It's exhausting, but it's also essential.

Remember:

You're not alone (you have a team, a process, and tools)
You're getting better (every incident teaches you something)
You're providing value (users rely on you)
You're building resilience (each challenge makes you stronger)

Operational excellence isn't about being perfect. It's about being prepared, responding well, learning constantly, and never stopping.

You're doing important work. The world runs on software, and you're keeping that software running. That's worth celebrating.

The Journey Continues

Phase 8 has no end date - it continues as long as your software is in production. But that doesn't mean nothing changes. You'll:

Release new versions (go back to Phase 5 for new features)
Respond to incidents (always learning)
Improve continuously (always getting better)
Adapt to change (new requirements, new threats, new opportunities)

The lifecycle is a circle, not a line. Every release brings you back through the phases, each time with more experience, more wisdom, and more resilience.

Thank you for being part of this journey. Your users thank you (even if they don't say it often enough), your team thanks you, and future you will thank present you for doing the hard work of operations and maintenance.

Previous Phase: Phase 7: Deployment & Release
Next Phase: None (ongoing operations, but new releases will return to Phase 5)

Version 1.0.0 | OCTALUME Enterprise Lifecycle Framework

