AI can take over much of the most tedious DevOps work: boilerplate IaC, repetitive script writing, log triage, and incident documentation. This guide covers practical applications across the DevOps lifecycle, each presented as a prompt you can adapt.
1. Infrastructure as Code Generation
Terraform
Prompt: Generate Terraform code for the following AWS infrastructure:
- VPC with 3 public subnets and 3 private subnets across us-east-1a/b/c
- Internet gateway and NAT gateways (one per AZ for HA)
- Application Load Balancer in public subnets
- ECS Fargate cluster in private subnets
- RDS PostgreSQL (Multi-AZ) in private subnets
- Security groups following principle of least privilege
- All resources tagged with: Environment=production, Team=platform
Use the AWS provider v5.x. Include outputs for the ALB DNS name and RDS endpoint.
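Before sending a prompt like this, it can help to decide the subnet layout yourself so the generated Terraform is easy to check. The sketch below is one way to do that with Python's standard `ipaddress` module. The VPC range `10.0.0.0/16` and the /20 subnet size are assumptions for illustration; the prompt does not specify either.

```python
import ipaddress

# Hypothetical VPC range; the prompt above does not specify one.
VPC_CIDR = "10.0.0.0/16"
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]


def plan_subnets(vpc_cidr: str, azs: list[str]) -> dict[str, dict[str, str]]:
    """Carve one public and one private /20 per AZ out of the VPC range."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=20)
    plan: dict[str, dict[str, str]] = {"public": {}, "private": {}}
    for tier in ("public", "private"):
        for az in azs:
            # Take the next unused /20 block in address order.
            plan[tier][az] = str(next(blocks))
    return plan


if __name__ == "__main__":
    for tier, subnets in plan_subnets(VPC_CIDR, AZS).items():
        for az, cidr in subnets.items():
            print(f"{tier:8} {az}  {cidr}")
```

Pasting the resulting CIDR list into the prompt removes one degree of freedom from the model's output and gives you something concrete to compare the generated subnets against.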
Kubernetes YAML
Prompt: Generate Kubernetes manifests for a Node.js API service:
- Deployment: 3 replicas, resource limits (500m CPU, 512Mi memory), readiness and liveness probes on /health endpoint
- Service: ClusterIP type
- HorizontalPodAutoscaler: scale 3-20 replicas based on 70% CPU
- ConfigMap for non-sensitive configuration
- Secret reference for DATABASE_URL
- PodDisruptionBudget: minimum 2 replicas available
- NetworkPolicy: allow ingress from ingress controller namespace only
The app runs on port 3000.
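To review what the model returns, it helps to know roughly what the core objects should look like. The following is a minimal sketch that builds the Deployment, HorizontalPodAutoscaler, and PodDisruptionBudget from the prompt as plain Python dictionaries and prints them as a JSON `List`, which `kubectl apply -f` accepts alongside YAML. The service name `node-api` and the image reference are hypothetical, and the Service, ConfigMap, Secret reference, and NetworkPolicy are left out to keep it short.

```python
import json

APP = "node-api"  # hypothetical service name
IMAGE = "registry.example.com/node-api:1.0.0"  # hypothetical image reference
PORT = 3000
LABELS = {"app": APP}

# Same HTTP check for readiness and liveness, as the prompt requests.
probe = {"httpGet": {"path": "/health", "port": PORT}}

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": APP, "labels": LABELS},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": LABELS},
        "template": {
            "metadata": {"labels": LABELS},
            "spec": {
                "containers": [{
                    "name": APP,
                    "image": IMAGE,
                    "ports": [{"containerPort": PORT}],
                    "resources": {"limits": {"cpu": "500m", "memory": "512Mi"}},
                    "readinessProbe": probe,
                    "livenessProbe": probe,
                }]
            },
        },
    },
}

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": APP},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": APP},
        "minReplicas": 3,
        "maxReplicas": 20,
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 70},
            },
        }],
    },
}

pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": APP},
    "spec": {"minAvailable": 2, "selector": {"matchLabels": LABELS}},
}

if __name__ == "__main__":
    manifest = {"apiVersion": "v1", "kind": "List", "items": [deployment, hpa, pdb]}
    print(json.dumps(manifest, indent=2))
```

Two things worth checking in any generated version: the HPA's `minReplicas` should not be lower than the PodDisruptionBudget's `minAvailable`, and the label selectors on all three objects should match the pod template labels.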
Ansible Playbook
Prompt: Write an Ansible playbook to configure a new Ubuntu 22.04 server:
1. Update all packages
2. Install: nginx, docker, docker-compose, fail2ban
3. Configure UFW firewall (allow 22, 80, 443; deny all else)
4. Set up unattended upgrades for security patches
5. Configure fail2ban for SSH protection
6. Add a deploy user with sudo access (no password)
7. Copy SSH authorized_keys from /tmp/authorized_keys on the control node
8. Start and enable nginx and docker services
Use best practices for idempotency.
2. CI/CD Pipeline Creation
GitHub Actions
Prompt: Create a complete GitHub Actions CI/CD workflow for a Dockerized Node.js application:
Triggers:
- PR: run lint, test, and Docker build (no push)
- Push to main: full pipeline including push to ECR and deploy to ECS
Jobs:
1. Lint and test (Node.js 20, npm ci, npm test)
2. Build and push Docker image to AWS ECR (tag with git SHA and 'latest')
3. Deploy to ECS Fargate (update service with new image)
4. Smoke test: curl the ALB health endpoint and fail if not 200
Use OIDC for AWS authentication (no long-lived credentials).
Secrets needed: AWS_ROLE_ARN, ECR_REGISTRY, ECS_CLUSTER, ECS_SERVICE
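The smoke test in job 4 is the part most worth understanding before you trust a generated workflow. As a rough sketch of its logic, the function below returns success only on an HTTP 200 and treats every error as a failure. The function name `health_ok` and the injectable `opener` parameter are illustrative choices, not part of any workflow the prompt would produce.

```python
import sys
import urllib.error
import urllib.request


def health_ok(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True only when the endpoint answers with HTTP 200."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx responses, DNS failures, refused connections, timeouts.
        return False


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python smoke_test.py https://<alb-dns-name>/health
    # A non-zero exit code makes the CI step fail.
    sys.exit(0 if health_ok(sys.argv[1]) else 1)
```

A freshly deployed ECS service can take a while to pass health checks, so a real pipeline would likely wait for the service to stabilize or retry a few times before declaring failure.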
Dockerfile Optimization
Prompt: Review and optimize this Dockerfile for a Python FastAPI application:
[PASTE YOUR DOCKERFILE]
Goals:
1. Minimize final image size
2. Maximize build cache utilization
3. Run as non-root user
4. Use multi-stage build
5. Follow security best practices (no secrets in layers, specific base image tags)
Show the optimized Dockerfile with comments explaining each decision.
3. Log Analysis and Debugging
Application Error Triage
Prompt: Analyze these application error logs and identify the root cause:
[PASTE LOG EXCERPT]
Tell me:
1. What is the root cause of these errors?
2. What triggered the error cascade?
3. What's the blast radius (which services/users are affected)?
4. What's the most likely fix?
5. What should I check first in the codebase?
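Raw logs are often too long to paste whole, so it can be useful to condense them first. The sketch below groups error lines by message after masking numbers and long hex IDs, which gives a short list of the most frequent error signatures to include in the prompt. It assumes a log shape of `<timestamp> <LEVEL> <message>`; both regular expressions are assumptions to adjust for your own format.

```python
import re
from collections import Counter

# Assumed log shape: "<timestamp> <LEVEL> <message>"; adjust to your format.
ERROR_LINE = re.compile(r"\b(ERROR|FATAL)\b\s+(?P<msg>.*)")
# Mask values that vary between otherwise identical messages.
VARIABLE_PARTS = re.compile(r"\b\d+\b|\b[0-9a-f]{8,}\b")


def top_error_signatures(lines, limit=5):
    """Count error messages after masking numbers and hex IDs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.search(line)
        if match:
            counts[VARIABLE_PARTS.sub("<n>", match.group("msg"))] += 1
    return counts.most_common(limit)


if __name__ == "__main__":
    sample = [
        "2024-01-01T10:00:00Z ERROR db timeout after 5000 ms",
        "2024-01-01T10:00:01Z ERROR db timeout after 5012 ms",
        "2024-01-01T10:00:02Z INFO request ok",
        "2024-01-01T10:00:03Z ERROR user 42 not found",
    ]
    for signature, count in top_error_signatures(sample):
        print(f"{count:4}  {signature}")
```

Pasting the top signatures together with a few unmodified lines around the first occurrence of each usually gives the model both the pattern and the context.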
Kubernetes Pod Debugging
Prompt: My Kubernetes pod is in CrashLoopBackOff. Here's the diagnostic output:
kubectl describe pod:
[PASTE OUTPUT]
kubectl logs:
[PASTE LOGS]
Diagnose the issue and give me step-by-step remediation. Also suggest what monitoring I should add to detect this earlier next time.
Database Query Performance
Prompt: This PostgreSQL query is taking 45 seconds on a table with 50M rows:
[PASTE QUERY]
EXPLAIN ANALYZE output:
[PASTE OUTPUT]
Suggest:
1. Index strategy
2. Query rewrite if applicable
3. Any schema changes worth considering
4. Whether this should be moved to a read replica
4. Monitoring and Alerting
Prometheus Alerting Rules
Prompt: Generate Prometheus alerting rules for a production web service:
- API error rate > 5% for 5 minutes → critical alert
- API p99 latency > 2s for 5 minutes → warning
- Pod CPU > 80% for 10 minutes → warning
- Pod memory > 90% of limit for 5 minutes → critical
- No pods ready for 1 minute → critical (PagerDuty escalation)
- Disk usage > 85% → warning; > 95% → critical
Use standard labels: severity, team, service. Include annotations with runbook URLs.
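For reference when reviewing the output, here is a sketch of what the first rule in that list might look like, built as a Python dictionary. The metric name `http_requests_total` and its `status` and `service` labels are assumptions that follow a common naming convention; substitute whatever your service actually exports. The alert name and runbook URL are also hypothetical.

```python
import json


def error_rate_rule(service: str, team: str, runbook_url: str) -> dict:
    """Build one alerting rule: 5xx ratio above 5% sustained for 5 minutes."""
    selector = f'service="{service}"'
    # Ratio of 5xx requests to all requests over a 5-minute window.
    expr = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))'
        f" / sum(rate(http_requests_total{{{selector}}}[5m])) > 0.05"
    )
    return {
        "alert": "HighApiErrorRate",
        "expr": expr,
        "for": "5m",
        "labels": {"severity": "critical", "team": team, "service": service},
        "annotations": {
            "summary": f"API error rate above 5% for {service}",
            "runbook_url": runbook_url,
        },
    }


if __name__ == "__main__":
    rule = error_rate_rule("checkout", "platform", "https://runbooks.example.com/high-error-rate")
    print(json.dumps({"groups": [{"name": "api-alerts", "rules": [rule]}]}, indent=2))
```

The output is JSON only to keep the sketch within the standard library. Prometheus rule files are written in YAML, so convert the structure and validate it with `promtool check rules` before deploying.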
Grafana Dashboard JSON
Prompt: Create a Grafana dashboard JSON for monitoring a Redis cache:
Panels:
- Operations per second (commands/sec)
- Memory usage (used vs maxmemory)
- Cache hit rate (%)
- Connected clients
- Blocked clients
- Evicted keys rate
- Network I/O (bytes in/out)
Use variables for: instance, time range. Include appropriate thresholds and color coding.
Use Prometheus as datasource.
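Of the panels listed, cache hit rate is the one derived from two counters rather than read directly, so it is the panel most likely to be generated incorrectly. As a quick reference, the calculation looks like this, using the `keyspace_hits` and `keyspace_misses` counters that Redis reports in its `INFO stats` output. How those counters are named in Prometheus depends on your exporter.

```python
def cache_hit_rate(keyspace_hits: int, keyspace_misses: int) -> float:
    """Hit rate as a percentage; 0.0 when there have been no lookups yet."""
    lookups = keyspace_hits + keyspace_misses
    if lookups == 0:
        # Avoid dividing by zero on a freshly started instance.
        return 0.0
    return 100.0 * keyspace_hits / lookups


if __name__ == "__main__":
    print(cache_hit_rate(900, 100))
```

In a dashboard query, the same ratio should be computed over rates of the two counters for the selected time window rather than over their lifetime totals, otherwise recent changes in hit rate are hidden.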
5. Incident Response
Runbook Generation
Prompt: Create a runbook for: "High API Error Rate Alert"
Include:
1. Alert details (what threshold, what metric)
2. Immediate triage steps (first 5 minutes)
3. Investigation steps with specific commands
4. Common causes and resolutions
5. Escalation criteria
6. Communication template for status page
7. Resolution verification steps
8. Post-incident actions
Target audience: on-call engineers with intermediate experience.
Postmortem Writing
Prompt: Help me write a blameless postmortem for this incident:
Timeline:
- 14:32: Alert fires for high error rate
- 14:35: On-call engineer paged
- 14:41: Engineer identifies CPU spike
- 14:50: Identified cause: unindexed query from new feature deploy at 14:15
- 15:05: Hotfix deployed, error rate returns to normal
- Total duration: 33 minutes from alert to resolution, ~12% of users affected
Write a complete postmortem with: summary, impact, timeline, root cause analysis (5 whys), contributing factors, and action items. Keep it blameless and constructive.
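Postmortems usually report a few response-time figures, and it is worth computing them yourself rather than trusting the model's arithmetic. The sketch below derives them from the example timeline above. The metric names are illustrative, not a standard.

```python
from datetime import datetime

# Timestamps taken from the example timeline above (same day, 24-hour clock).
EVENTS = {
    "deploy": "14:15",
    "alert": "14:32",
    "paged": "14:35",
    "cause_identified": "14:50",
    "resolved": "15:05",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


metrics = {
    "deploy_to_alert": minutes_between(EVENTS["deploy"], EVENTS["alert"]),
    "alert_to_page": minutes_between(EVENTS["alert"], EVENTS["paged"]),
    "alert_to_cause": minutes_between(EVENTS["alert"], EVENTS["cause_identified"]),
    "alert_to_resolution": minutes_between(EVENTS["alert"], EVENTS["resolved"]),
    "deploy_to_resolution": minutes_between(EVENTS["deploy"], EVENTS["resolved"]),
}

if __name__ == "__main__":
    for name, minutes in metrics.items():
        print(f"{name:22} {minutes} min")
```

For this timeline the alert-to-resolution time is 33 minutes, while 50 minutes passed between the deploy and the fix. Which figure better reflects user impact depends on when errors actually began, which the timeline does not state.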
6. Security and Compliance
IAM Policy Generation
Prompt: Generate an AWS IAM policy for a Lambda function that needs to:
- Read from S3 bucket "my-data-bucket" (specific prefix: "input/")
- Write to S3 bucket "my-data-bucket" (specific prefix: "output/")
- Read from SQS queue ARN: arn:aws:sqs:us-east-1:123456789012:my-queue
- Write to DynamoDB table "processing-state" (specific operations: PutItem, GetItem, UpdateItem)
- Write CloudWatch logs (create log group, create log stream, put log events)
Follow least privilege. Include conditions where appropriate.
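As a baseline to compare the model's answer against, the sketch below builds one plausible policy for these requirements as a Python dictionary. Several details are assumptions: the account ID is a 12-digit placeholder, the DynamoDB table and log groups are assumed to live in us-east-1 in the same account, and the SQS actions assume the queue is consumed through a Lambda event source mapping, which needs to delete messages and read queue attributes as well as receive.

```python
import json

# Placeholders: the region and 12-digit account ID are assumptions for illustration.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
BUCKET = "my-data-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadInputObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/input/*",
        },
        {
            "Sid": "WriteOutputObjects",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/output/*",
        },
        {
            # Assumes the queue is a Lambda event source.
            "Sid": "ConsumeQueue",
            "Effect": "Allow",
            "Action": ["sqs:ReceiveMessage", "sqs:DeleteMessage", "sqs:GetQueueAttributes"],
            "Resource": f"arn:aws:sqs:{REGION}:{ACCOUNT_ID}:my-queue",
        },
        {
            "Sid": "TrackProcessingState",
            "Effect": "Allow",
            "Action": ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"],
            "Resource": f"arn:aws:dynamodb:{REGION}:{ACCOUNT_ID}:table/processing-state",
        },
        {
            "Sid": "WriteLogs",
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:/aws/lambda/*",
        },
    ],
}

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

When reviewing a generated policy, the quickest checks are that no statement uses `"Resource": "*"`, that the S3 statements are scoped to the prefixes rather than the whole bucket, and that the log group resource is narrowed to the specific function once its name is known.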
7. Documentation
Architecture Diagram Descriptions
Prompt: Based on this Terraform code, generate:
1. A text description of the architecture suitable for a README
2. A Mermaid diagram showing the components and their relationships
3. A list of security controls in place
4. Estimated monthly AWS cost breakdown (state the pricing assumptions used, since the model may not have current prices)
[PASTE TERRAFORM CODE]
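If you have not used Mermaid before, it helps to see how little syntax item 2 involves. The sketch below renders a list of component relationships as a Mermaid flowchart. The edge list is hypothetical and hand-written here; in the prompt's workflow the model would derive the relationships from your Terraform code.

```python
# Hypothetical component edges; in practice these would come from the Terraform code.
EDGES = [
    ("Internet", "ALB"),
    ("ALB", "ECS Fargate service"),
    ("ECS Fargate service", "RDS PostgreSQL"),
]


def to_mermaid(edges):
    """Render (source, target) pairs as a top-down Mermaid flowchart."""
    ids = {}
    lines = ["graph TD"]
    for source, target in edges:
        for name in (source, target):
            if name not in ids:
                # Give each component a short node ID and a quoted display label.
                ids[name] = f"n{len(ids)}"
                lines.append(f'    {ids[name]}["{name}"]')
        lines.append(f"    {ids[source]} --> {ids[target]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(EDGES))
```

Comparing a generated diagram against a hand-built edge list like this is a quick way to spot components the model invented or relationships it missed.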
AI Tools for DevOps by Task
| Task | Best AI Tool |
|---|---|
| IaC generation | GitHub Copilot or Claude |
| Pipeline writing | Claude or ChatGPT |
| Log analysis | Claude (paste logs) or Datadog Bits AI |
| Incident triage | PagerDuty AI or Claude |
| Monitoring setup | Claude + Prometheus/Grafana docs |
| Cost optimization | Infracost + Claude analysis |
| Security review | Checkov + Claude explanation |
| Documentation | Claude or ChatGPT |
The pattern: use purpose-built DevOps AI tools (Datadog, PagerDuty, Harness) for real-time operational tasks, and general LLMs (Claude, ChatGPT) for generation and analysis tasks where you provide the context.