How to Use AI for DevOps: From IaC to Incident Response (2026)

AI can take over much of the most tedious DevOps work: boilerplate IaC, repetitive script writing, log triage, and incident documentation. This guide covers practical prompts for each stage of the DevOps lifecycle.

1. Infrastructure as Code Generation

Terraform

Prompt: Generate Terraform code for the following AWS infrastructure:
- VPC with 3 public subnets and 3 private subnets across us-east-1a/b/c
- Internet gateway and NAT gateways (one per AZ for HA)
- Application Load Balancer in public subnets
- ECS Fargate cluster in private subnets
- RDS PostgreSQL (Multi-AZ) in private subnets
- Security groups following principle of least privilege
- All resources tagged with: Environment=production, Team=platform

Use the AWS provider v5.x. Include outputs for the ALB DNS name and RDS endpoint.
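
The prompt above leaves the subnet address ranges up to the model. One way to keep control of them is to work out the address plan first and paste it into the prompt. The following is a minimal Python sketch, assuming a hypothetical VPC CIDR of 10.0.0.0/16 split into /20 blocks (neither value comes from the prompt); plan_subnets is an illustrative helper, not part of any AWS tooling.

```python
import ipaddress

# Availability zones named in the prompt.
AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]


def plan_subnets(vpc_cidr: str = "10.0.0.0/16", new_prefix: int = 20) -> dict:
    """Carve the VPC CIDR into equal blocks: first three public, next three private.

    The default CIDR and prefix length are assumptions chosen for illustration.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = list(vpc.subnets(new_prefix=new_prefix))
    needed = 2 * len(AZS)
    if len(blocks) < needed:
        raise ValueError(f"{vpc_cidr} cannot hold {needed} /{new_prefix} subnets")

    plan = {"public": {}, "private": {}}
    for i, az in enumerate(AZS):
        plan["public"][az] = str(blocks[i])
        plan["private"][az] = str(blocks[i + len(AZS)])
    return plan
```

With the defaults, plan_subnets()["public"]["us-east-1a"] is "10.0.0.0/20" and plan_subnets()["private"]["us-east-1c"] is "10.0.80.0/20". Adding the resulting ranges to the prompt as explicit requirements makes the generated Terraform easier to review.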

Kubernetes YAML

Prompt: Generate Kubernetes manifests for a Node.js API service:
- Deployment: 3 replicas, resource limits (500m CPU, 512Mi memory), 
  readiness and liveness probes on /health endpoint
- Service: ClusterIP type
- HorizontalPodAutoscaler: scale between 3 and 20 replicas, targeting 70% average CPU utilization
- ConfigMap for non-sensitive configuration
- Secret reference for DATABASE_URL
- PodDisruptionBudget: minimum 2 replicas available
- NetworkPolicy: allow ingress from ingress controller namespace only

The app runs on port 3000.
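
When reviewing the generated HorizontalPodAutoscaler, it helps to know roughly what replica count to expect for a given load. The sketch below follows the scaling formula from the Kubernetes documentation, desired = ceil(current * currentMetric / targetMetric), clamped to the 3-20 range from the prompt. It is a simplification: the 10% tolerance is the documented default, and stabilization windows and pods that are not yet ready are ignored.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_cpu_pct: float,
    target_cpu_pct: float = 70.0,
    min_replicas: int = 3,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Approximate the replica count an HPA would request for a CPU reading."""
    ratio = current_cpu_pct / target_cpu_pct
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the min/max bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 3 replicas averaging 140% of their CPU request give desired_replicas(3, 140) == 6, while 15 replicas at the same load would ask for 30 and be capped at 20.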

Ansible Playbook

Prompt: Write an Ansible playbook to configure a new Ubuntu 22.04 server:
1. Update all packages
2. Install: nginx, docker, docker-compose, fail2ban
3. Configure UFW firewall (allow incoming 22, 80, 443; deny all other incoming traffic)
4. Set up unattended upgrades for security patches
5. Configure fail2ban for SSH protection
6. Add a deploy user with sudo access (no password)
7. Copy SSH authorized_keys from /tmp/authorized_keys on the control node
8. Start and enable nginx and docker services

Use best practices for idempotency.
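
"Idempotency" here means that running the playbook a second time changes nothing and reports nothing as changed. As a rough illustration of the idea (not of Ansible's internals), the Python sketch below behaves the way a task built on a module such as ansible.builtin.lineinfile is expected to: it checks the current state first and only writes when needed. ensure_line is a hypothetical helper.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file. Return True only if the file changed."""
    lines = path.read_text().splitlines() if path.exists() else []
    if line in lines:
        # Desired state already holds: do nothing and report "ok", not "changed".
        return False
    lines.append(line)
    path.write_text("\n".join(lines) + "\n")
    return True
```

The first call adds the line and returns True; every later call with the same arguments returns False and leaves the file untouched. When reviewing a generated playbook, tasks that use raw shell or command calls without a guard are the ones most likely to break this property.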

2. CI/CD Pipeline Creation

GitHub Actions

Prompt: Create a complete GitHub Actions CI/CD workflow for a Dockerized Node.js application:

Triggers:
- PR: run lint, test, and Docker build (no push)
- Push to main: full pipeline including push to ECR and deploy to ECS

Jobs:
1. Lint and test (Node.js 20, npm ci, npm test)
2. Build and push Docker image to AWS ECR (tag with git SHA and 'latest')
3. Deploy to ECS Fargate (update service with new image)
4. Smoke test: curl the ALB health endpoint and fail if not 200

Use OIDC for AWS authentication (no long-lived credentials).
Secrets needed: AWS_ROLE_ARN, ECR_REGISTRY, ECS_CLUSTER, ECS_SERVICE
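
The smoke test in job 4 is the step most worth checking by hand, because a single curl right after a deploy can fail while new tasks are still starting. The sketch below shows one possible shape in Python, using only the standard library. The retry count and delay are assumptions rather than part of the prompt, and the fetch and sleep parameters exist only so the logic can be exercised without a network.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; the status code is still available.
        return err.code
    except OSError:
        # Covers URLError, connection errors, and timeouts.
        return 0


def smoke_test(
    url: str,
    attempts: int = 5,
    delay_s: float = 10.0,
    fetch: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as the endpoint answers 200, False after all attempts fail."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay_s)
    return False
```

In a workflow step, a small wrapper script could exit with a non-zero code when smoke_test returns False, which fails the job.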

Dockerfile Optimization

Prompt: Review and optimize this Dockerfile for a Python FastAPI application:

[PASTE YOUR DOCKERFILE]

Goals:
1. Minimize final image size
2. Maximize build cache utilization
3. Run as non-root user
4. Use multi-stage build
5. Follow security best practices (no secrets in layers, specific base image tags)

Show the optimized Dockerfile with comments explaining each decision.
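
Before accepting the optimized Dockerfile, a quick mechanical check against goals 3 to 5 can catch obvious misses. The following is a deliberately small Python sketch, not a replacement for a real linter: dockerfile_findings and the keyword list are hypothetical, and the pinning heuristic is naive (for example, a registry host with a port would count as pinned).

```python
# Substrings that suggest a secret is being baked into an image layer (illustrative list).
SECRET_HINTS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def dockerfile_findings(text: str) -> list[str]:
    """Return a list of simple findings for a Dockerfile given as a string."""
    findings: list[str] = []
    stage_names: set[str] = set()
    stage_count = 0
    non_root_user = False

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        instruction = parts[0].upper()

        if instruction == "FROM":
            stage_count += 1
            non_root_user = False  # USER does not carry over from an earlier stage
            args = [p for p in parts[1:] if not p.startswith("--")]
            if not args:
                continue
            image = args[0]
            is_stage_ref = image in stage_names or image == "scratch"
            unpinned = image.endswith(":latest") or (":" not in image and "@" not in image)
            if unpinned and not is_stage_ref:
                findings.append(f"unpinned base image: {image}")
            if len(args) >= 3 and args[1].upper() == "AS":
                stage_names.add(args[2])
        elif instruction == "USER" and len(parts) > 1:
            non_root_user = parts[1].split(":")[0] not in ("root", "0")
        elif instruction in ("ENV", "ARG") and any(h in line.upper() for h in SECRET_HINTS):
            findings.append(f"possible secret in {instruction} instruction")

    if stage_count < 2:
        findings.append("single-stage build")
    if not non_root_user:
        findings.append("no non-root USER in final stage")
    return findings
```

An empty list means none of these four checks fired; it does not mean the image is small or secure.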

3. Log Analysis and Debugging

Application Error Triage

Prompt: Analyze these application error logs and identify the root cause:

[PASTE LOG EXCERPT]

Tell me:
1. What is the root cause of these errors?
2. What triggered the error cascade?
3. What's the blast radius (which services/users are affected)?
4. What's the most likely fix?
5. What should I check first in the codebase?
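
Raw logs are often too long to paste in full, and repeated lines crowd out the rare ones that matter. One option is to collapse the errors into signatures first and paste the summary alongside a short excerpt. The sketch below assumes a hypothetical log format in which each line contains the level surrounded by spaces, such as "2026-01-10T14:32:01Z ERROR message"; adjust the parsing to your own format.

```python
import re
from collections import Counter

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
NUMBER_RE = re.compile(r"\d+")


def error_signatures(log_text: str, top: int = 5) -> list[tuple[str, int]]:
    """Group ERROR lines by message with IDs and numbers masked, most frequent first."""
    counts: Counter = Counter()
    for line in log_text.splitlines():
        _, sep, message = line.partition(" ERROR ")
        if not sep:
            continue  # not an ERROR line in the assumed format
        # Mask UUIDs before plain numbers so the UUID digits are not split up.
        signature = UUID_RE.sub("<uuid>", message)
        signature = NUMBER_RE.sub("<n>", signature)
        counts[signature.strip()] += 1
    return counts.most_common(top)
```

Two timeouts that differ only in duration and connection number collapse into one signature with a count of 2, which gives the model a clearer picture of what dominates the error volume.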

Kubernetes Pod Debugging

Prompt: My Kubernetes pod is in CrashLoopBackOff. Here's the diagnostic output:

kubectl describe pod:
[PASTE OUTPUT]

kubectl logs:
[PASTE LOGS]

Diagnose the issue and give me step-by-step remediation. Also suggest what monitoring 
I should add to detect this earlier next time.

Database Query Performance

Prompt: This PostgreSQL query is taking 45 seconds on a table with 50M rows:

[PASTE QUERY]

EXPLAIN ANALYZE output:
[PASTE OUTPUT]

Suggest:
1. Index strategy
2. Query rewrite if applicable
3. Any schema changes worth considering
4. Whether this should be moved to a read replica

4. Monitoring and Alerting

Prometheus Alerting Rules

Prompt: Generate Prometheus alerting rules for a production web service:
- API error rate > 5% for 5 minutes → critical alert
- API p99 latency > 2s for 5 minutes → warning
- Pod CPU > 80% for 10 minutes → warning
- Pod memory > 90% of limit for 5 minutes → critical
- No pods ready for 1 minute → critical (PagerDuty escalation)
- Disk usage > 85% → warning; > 95% → critical

Use standard labels: severity, team, service. Include annotations with runbook URLs.
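
Each rule above pairs a threshold with a duration, which in Prometheus maps to the for clause: the condition has to hold continuously for that long before the alert fires. The Python sketch below approximates that behavior so generated rules can be sanity-checked against sample data. It is a simplification under stated assumptions: samples are (timestamp in seconds, value) pairs sorted by time, and evaluation intervals and missing data are ignored.

```python
def alert_fires(
    samples: list[tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> bool:
    """Return True if the value has stayed above the threshold for at least
    `for_seconds`, measured up to the most recent sample."""
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: start the clock
        else:
            pending_since = None  # any dip below the threshold resets the clock
    if pending_since is None:
        return False
    return samples[-1][0] - pending_since >= for_seconds
```

For the first rule, the call would look like alert_fires(error_rate_samples, threshold=0.05, for_seconds=300), where error_rate_samples is your own data. A single sample at or below 5% inside the window resets the timer, which is why brief spikes do not page anyone.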

Grafana Dashboard JSON

Prompt: Create a Grafana dashboard JSON for monitoring a Redis cache:
Panels:
- Operations per second (commands/sec)
- Memory usage (used vs maxmemory)
- Cache hit rate (%)
- Connected clients
- Blocked clients
- Evicted keys rate
- Network I/O (bytes in/out)

Use variables for: instance, time range. Include appropriate thresholds and color coding.
Use Prometheus as datasource.
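
Of the panels listed, cache hit rate is the one generated dashboards most often get subtly wrong, because it has to be computed from two counters rather than read directly. Redis reports cumulative keyspace_hits and keyspace_misses, so the rate over an interval comes from the difference between two snapshots. A minimal Python sketch of that arithmetic follows; the function name and the (hits, misses) tuple layout are illustrative.

```python
from typing import Optional


def hit_rate_pct(before: tuple[int, int], after: tuple[int, int]) -> Optional[float]:
    """Cache hit rate in percent between two (hits, misses) counter snapshots.

    Returns None when there were no lookups in the interval or when a counter
    went backwards (for example after a Redis restart).
    """
    hits = after[0] - before[0]
    misses = after[1] - before[1]
    if hits < 0 or misses < 0:
        return None
    total = hits + misses
    if total == 0:
        return None
    return 100.0 * hits / total
```

For example, hit_rate_pct((1000, 200), (1900, 300)) gives 90.0: 900 hits out of 1,000 lookups in the interval. A panel query that divides the raw cumulative counters instead shows the lifetime average, which hides recent drops.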

5. Incident Response

Runbook Generation

Prompt: Create a runbook for: "High API Error Rate Alert"

Include:
1. Alert details (what threshold, what metric)
2. Immediate triage steps (first 5 minutes)
3. Investigation steps with specific commands
4. Common causes and resolutions
5. Escalation criteria
6. Communication template for status page
7. Resolution verification steps
8. Post-incident actions

Target audience: on-call engineers with intermediate experience.

Postmortem Writing

Prompt: Help me write a blameless postmortem for this incident:

Timeline:
- 14:32: Alert fires for high error rate
- 14:35: On-call engineer paged
- 14:41: Engineer identifies CPU spike
- 14:50: Identified cause: unindexed query from new feature deploy at 14:15
- 15:05: Hotfix deployed, error rate returns to normal
- Total duration: 33 minutes from alert to recovery, ~12% of users affected

Write a complete postmortem with: summary, impact, timeline, root cause analysis (5 whys), contributing factors, and action items. Keep it blameless and constructive.
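
It is worth computing the intervals from the timeline yourself and including them in the prompt, so the postmortem does not depend on the model's arithmetic. The Python sketch below derives them from the timestamps in the example. The metric names are illustrative, and measuring detection time from the 14:15 deploy assumes the errors began with that deploy, which the timeline does not state.

```python
from datetime import datetime

# Timestamps taken from the example timeline (same day, 24-hour clock).
TIMELINE = {
    "deploy": "14:15",
    "alert": "14:32",
    "paged": "14:35",
    "cause_identified": "14:50",
    "resolved": "15:05",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(t: dict) -> dict:
    """Derive the intervals a postmortem usually reports."""
    return {
        "deploy_to_alert_min": minutes_between(t["deploy"], t["alert"]),
        "alert_to_page_min": minutes_between(t["alert"], t["paged"]),
        "alert_to_cause_min": minutes_between(t["alert"], t["cause_identified"]),
        "alert_to_resolution_min": minutes_between(t["alert"], t["resolved"]),
        "deploy_to_resolution_min": minutes_between(t["deploy"], t["resolved"]),
    }
```

For this timeline the result is 17, 3, 18, 33, and 50 minutes respectively. The 33-minute figure is measured from the alert; counted from the deploy, the window is 50 minutes, a distinction the impact section should make explicit.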

6. Security and Compliance

IAM Policy Generation

Prompt: Generate an AWS IAM policy for a Lambda function that needs to:
- Read from S3 bucket "my-data-bucket" (specific prefix: "input/")
- Write to S3 bucket "my-data-bucket" (specific prefix: "output/")
- Read from SQS queue ARN: arn:aws:sqs:us-east-1:123456789012:my-queue
- Write to DynamoDB table "processing-state" (specific operations: PutItem, GetItem, UpdateItem)
- Write CloudWatch logs (create log group, create log stream, put log events)

Follow least privilege. Include conditions where appropriate.
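
For comparison with whatever the model returns, here is roughly what a least-privilege policy for these requirements could look like, built as a Python dictionary and serialized to JSON. Several details are assumptions not given in the prompt: the 12-digit account ID is a placeholder, the DynamoDB table and log groups are assumed to live in us-east-1, and "read from SQS" is interpreted as the receive, delete, and get-attributes actions a queue consumer typically needs.

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder account ID (assumption)
REGION = "us-east-1"         # region assumed for DynamoDB and CloudWatch Logs
BUCKET = "my-data-bucket"


def build_policy() -> dict:
    """Build an IAM policy document with one narrowly scoped statement per need."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInputObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/input/*",
            },
            {
                "Sid": "WriteOutputObjects",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{BUCKET}/output/*",
            },
            {
                "Sid": "ConsumeQueue",
                "Effect": "Allow",
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": f"arn:aws:sqs:{REGION}:{ACCOUNT_ID}:my-queue",
            },
            {
                "Sid": "TrackProcessingState",
                "Effect": "Allow",
                "Action": ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"],
                "Resource": f"arn:aws:dynamodb:{REGION}:{ACCOUNT_ID}:table/processing-state",
            },
            {
                "Sid": "WriteLogs",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                # Could be narrowed to the function's own log group once its name is known.
                "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:*",
            },
        ],
    }
```

Calling json.dumps(build_policy(), indent=2) produces the JSON document. Things to look for in a generated policy: no wildcard actions, object-level ARNs that include the prefix, and no s3:ListBucket unless the function really lists keys.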

7. Documentation

Architecture Diagram Descriptions

Prompt: Based on this Terraform code, generate:
1. A text description of the architecture suitable for a README
2. A Mermaid diagram showing the components and their relationships
3. A list of security controls in place
4. Estimated monthly AWS cost breakdown (state the pricing assumptions used, so the figures can be checked against current AWS pricing)

[PASTE TERRAFORM CODE]

AI Tools for DevOps by Task

Task                Best AI Tool
------------------  ------------------------------------
IaC generation      GitHub Copilot or Claude
Pipeline writing    Claude or ChatGPT
Log analysis        Claude (paste logs) or Datadog Bits AI
Incident triage     PagerDuty AI or Claude
Monitoring setup    Claude + Prometheus/Grafana docs
Cost optimization   Infracost + Claude analysis
Security review     Checkov + Claude explanation
Documentation       Claude or ChatGPT

The pattern: use purpose-built DevOps AI tools (Datadog, PagerDuty, Harness) for real-time operational tasks, and general LLMs (Claude, ChatGPT) for generation and analysis tasks where you provide the context.