Domain 5: Monitor and Maintain Azure Resources
Exam weight: 10–15%
This domain is about knowing that things are working, responding when they're not, backing up data, and planning for disaster recovery. It's more conceptual than the networking domain, but you'll see scenario questions comparing the tools.
5.1 Azure Monitor
Azure Monitor is the central hub for all monitoring in Azure. It collects platform metrics and the Activity Log automatically for almost every Azure resource; resource logs are collected only after you create a diagnostic setting.
Two Core Data Types
| Type | What it is | Default retention | Storage |
|---|---|---|---|
| Metrics | Numerical time-series data (CPU %, bytes/sec) | 93 days | Azure Monitor Metrics Store |
| Logs | Text/structured records of events | 30 days (configurable up to 730 days) | Log Analytics workspace (required) |
Exam trap: Platform metrics are stored for 93 days in Azure Monitor automatically, with no workspace needed. Resource logs are not collected at all until a diagnostic setting routes them somewhere, and querying them with KQL requires a Log Analytics workspace as the destination.
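As a rough illustration of the retention defaults above, the following Python sketch checks whether a record of a given age would still fall inside its retention window. The function name and the constants dictionary are hypothetical; only the 93-day and 30-day defaults come from the table.

```python
from datetime import datetime, timedelta, timezone

# Default retention periods from the table above, in days.
DEFAULT_RETENTION_DAYS = {"metrics": 93, "logs": 30}

def is_retained(data_type, recorded_at, now, retention_days=None):
    """Return True if a record is still inside its retention window."""
    if retention_days is None:
        retention_days = DEFAULT_RETENTION_DAYS[data_type]
    return now - recorded_at <= timedelta(days=retention_days)

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
forty_days_ago = now - timedelta(days=40)

print(is_retained("metrics", forty_days_ago, now))   # True: within 93 days
print(is_retained("logs", forty_days_ago, now))      # False: past the 30-day default
print(is_retained("logs", forty_days_ago, now, 90))  # True: retention raised to 90 days
```

The third call mirrors a workspace whose retention has been raised above the default, which is the usual answer when a scenario asks how to keep logs queryable for longer.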
Diagnostic Settings
Diagnostic settings control where resource telemetry flows. Every resource that supports monitoring lets you configure:
- Platform Metrics → Log Analytics workspace (for querying metrics with KQL)
- Resource Logs (resource-specific categories; the Activity Log is subscription-level and has its own diagnostic setting) → Log Analytics workspace, Storage Account, Event Hub, or Partner solution
# Enable a diagnostic setting: send a VM's platform metrics to a workspace.
# Azure VMs expose no resource-log categories here, and "Administrative" is an
# Activity Log category that is configured at subscription scope instead.
az monitor diagnostic-settings create \
--name "diag-vm-metrics" \
--resource "/subscriptions/.../resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/myvm" \
--workspace "/subscriptions/.../resourceGroups/rg/providers/Microsoft.OperationalInsights/workspaces/my-workspace" \
--metrics '[{"category": "AllMetrics", "enabled": true}]'
5.2 Log Analytics
Log Analytics is the query engine for Azure Monitor logs. Queries use KQL (Kusto Query Language).
Creating a Workspace
az monitor log-analytics workspace create \
--resource-group rg-monitor \
--workspace-name "law-prod" \
--location eastus \
--retention-time 90
Essential KQL Patterns
// Count failed operations in the last hour, grouped by operation
AzureActivity
| where TimeGenerated > ago(1h)
| where ActivityStatusValue == "Failure"
| summarize count() by OperationNameValue
// Top 10 VMs by CPU
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where TimeGenerated > ago(30m)
| summarize avg(CounterValue) by Computer
| top 10 by avg_CounterValue desc
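// Blob storage requests that failed (HTTP status 400 or higher)
// StatusCode is stored as a string, so convert it before comparing
StorageBlobLogs
| where toint(StatusCode) >= 400
| project TimeGenerated, OperationName, StatusCode, Uri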
Key KQL Operators (exam-relevant)
| Operator | What it does |
|---|---|
| where | Filter rows |
| summarize | Aggregate (count, avg, sum, max) |
| project | Select specific columns |
| order by / sort by | Sort results |
| top N by | Return top N rows |
| extend | Add computed column |
| join | Join two tables |
| ago() | Time relative to now (ago(1h), ago(7d)) |
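To make the pipeline idea concrete, here is a rough Python analogue of the "Top 10 VMs by CPU" query, showing what where, summarize, and top each do to the rows. The sample rows and the function name are hypothetical, and this is only a mental model, not how the Kusto engine executes queries.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the Perf table.
rows = [
    {"Computer": "vm-a", "CounterName": "% Processor Time", "CounterValue": 80.0},
    {"Computer": "vm-a", "CounterName": "% Processor Time", "CounterValue": 60.0},
    {"Computer": "vm-b", "CounterName": "% Processor Time", "CounterValue": 95.0},
    {"Computer": "vm-b", "CounterName": "Available MBytes", "CounterValue": 2048.0},
    {"Computer": "vm-c", "CounterName": "% Processor Time", "CounterValue": 10.0},
]

def top_cpu(rows, n):
    # where: keep only the CPU samples
    cpu = [r for r in rows if r["CounterName"] == "% Processor Time"]
    # summarize avg(CounterValue) by Computer
    groups = defaultdict(list)
    for r in cpu:
        groups[r["Computer"]].append(r["CounterValue"])
    averages = [
        {"Computer": name, "avg_CounterValue": sum(vals) / len(vals)}
        for name, vals in groups.items()
    ]
    # top n by avg_CounterValue desc
    return sorted(averages, key=lambda r: r["avg_CounterValue"], reverse=True)[:n]

result = top_cpu(rows, 2)
print(result)
```

With this sample data, vm-b (average 95.0) ranks ahead of vm-a (average 70.0), and vm-c is cut off by the top-2 limit.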
5.3 Alerts
Alerts notify you (or trigger automated actions) when specific conditions are met.
Alert Rule Components
| Component | Description |
|---|---|
| Scope | Which resource(s) to monitor |
| Condition | Signal + threshold (e.g., CPU > 90%) |
| Action group | Who/what gets notified |
| Alert rule | Brings scope + condition + action group together |
Alert Signal Types
| Signal type | Source | Example |
|---|---|---|
| Metric | Real-time numeric value | CPU > 80% for 5 min |
| Log query | KQL query result | Count of errors > 10 in 15 min |
| Activity log | Azure control-plane operations | VM deallocated, resource deleted |
| Resource health | Azure-side platform issues | VM unavailable |
| Service health | Azure-wide incidents/maintenance | Region outage |
Exam tip: "Notify me when someone deletes a resource group" = Activity log alert. "Notify me when CPU exceeds 90%" = Metric alert. "Notify me when more than 5 failed logins appear in logs" = Log query alert.
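The mapping in the tip above can be written down as a simple lookup. This Python sketch is only a memory aid: the source labels are hypothetical names chosen for illustration, while the signal types are the ones from the table.

```python
# Hypothetical labels describing where the triggering data comes from.
SIGNAL_BY_SOURCE = {
    "numeric_metric": "Metric",
    "kql_result": "Log query",
    "control_plane_operation": "Activity log",
    "platform_issue_on_resource": "Resource health",
    "azure_wide_incident": "Service health",
}

def signal_type(source):
    """Return the alert signal type for a given kind of data source."""
    return SIGNAL_BY_SOURCE[source]

scenarios = {
    "Someone deletes a resource group": "control_plane_operation",
    "CPU exceeds 90%": "numeric_metric",
    "More than 5 failed logins appear in logs": "kql_result",
}

for text, source in scenarios.items():
    print(f"{text} -> {signal_type(source)} alert")
```

The useful habit for scenario questions is the first step: identify where the data lives (metric store, log table, or control-plane event) before picking the alert type.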
Action Groups
Action groups define what happens when an alert fires:
| Action type | Use case |
|---|---|
| Email/SMS | Notify an on-call engineer |
| Azure Function | Run custom logic/automation |
| Logic App | Complex automated workflows |
| Webhook | Integrate with third-party systems (PagerDuty, Slack) |
| ITSM | Create an incident in ServiceNow |
| Automation Runbook | Execute an Azure Automation runbook |
# Create an action group
az monitor action-group create \
--resource-group rg-monitor \
--name "ag-ops-team" \
--short-name "ops" \
--action email oncall oncall@contoso.com
# Create a metric alert
az monitor metrics alert create \
--name "alert-high-cpu" \
--resource-group rg-monitor \
--scopes "/subscriptions/.../resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/myvm" \
--condition "avg Percentage CPU > 90" \
--window-size 5m \
--evaluation-frequency 1m \
--action "/subscriptions/.../resourceGroups/rg-monitor/providers/microsoft.insights/actionGroups/ag-ops-team" \
--severity 2
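The interplay of window size and evaluation frequency in the rule above can be sketched in Python. This is a simplified model under stated assumptions: one CPU sample per minute, an average aggregation, and no handling of missing data or alert state, all of which the real service deals with differently. The function name and sample values are hypothetical.

```python
from statistics import mean

def evaluate_metric_alert(samples, window_size, threshold):
    """Evaluate 'avg > threshold' over a sliding window.

    samples: one value per minute, oldest first.
    Returns one boolean per evaluation, starting once a full window exists.
    """
    results = []
    for end in range(window_size, len(samples) + 1):
        window = samples[end - window_size:end]
        results.append(mean(window) > threshold)
    return results

# Seven minutes of hypothetical CPU readings.
cpu = [50, 92, 95, 97, 99, 98, 40]
fired = evaluate_metric_alert(cpu, window_size=5, threshold=90)
print(fired)  # [False, True, False]
```

Note that the first evaluation does not fire even though four of its five samples exceed 90: the condition is on the window average (86.6), not on any single reading.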
5.4 Azure Backup
Azure Backup is Microsoft's managed backup service. It protects VMs, SQL databases, Azure Files, blobs, and more.
Recovery Services Vault
The Recovery Services Vault is the central container for backup data and backup policies. It's required for VM backup and Azure Site Recovery.
Exam trap: There is also a Backup vault (newer) — used specifically for Azure Disk backup, Azure Database for PostgreSQL, and Azure Blob backup. Recovery Services Vault is used for VM backup and ASR.
VM Backup
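- Consistency depends on the VM: running Windows VMs get application-consistent backups by default (via VSS); Linux VMs get file-system-consistent backups by default, or application-consistent ones with pre/post scripts; a VM that is shut down at backup time gets a crash-consistent recovery point.
- Backup policy defines frequency (daily or weekly under the Standard policy; the Enhanced policy also allows multiple backups per day) and retention (daily, weekly, monthly, yearly).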
- Soft delete: Deleted backup data is retained for 14 additional days (default; configurable). This prevents accidental data loss.
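The soft-delete window is simple date arithmetic. The sketch below assumes the 14-day default stated above and treats data as recoverable up to, but not including, the purge date; that boundary and the function names are assumptions made for illustration.

```python
from datetime import date, timedelta

DEFAULT_SOFT_DELETE_DAYS = 14  # default stated above; configurable

def purge_date(deleted_on, soft_delete_days=DEFAULT_SOFT_DELETE_DAYS):
    """Date on which soft-deleted backup data stops being recoverable."""
    return deleted_on + timedelta(days=soft_delete_days)

def is_recoverable(deleted_on, today, soft_delete_days=DEFAULT_SOFT_DELETE_DAYS):
    # Assumption: recoverable up to, but not including, the purge date.
    return today < purge_date(deleted_on, soft_delete_days)

deleted_on = date(2025, 3, 1)
print(purge_date(deleted_on))                         # 2025-03-15
print(is_recoverable(deleted_on, date(2025, 3, 10)))  # True
print(is_recoverable(deleted_on, date(2025, 3, 20)))  # False
```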
# Enable backup on a VM (vault must exist)
az backup protection enable-for-vm \
--vault-name "rsv-prod" \
--resource-group rg-backup \
--vm "/subscriptions/.../resourceGroups/rg/providers/Microsoft.Compute/virtualMachines/myvm" \
--policy-name "DefaultPolicy"
# List backup items
az backup item list \
--vault-name "rsv-prod" \
--resource-group rg-backup \
--backup-management-type AzureIaasVM \
-o table
# Trigger an on-demand backup
# Container and item are given by friendly name (the VM name), which the CLI
# accepts when --backup-management-type is specified. Date format is dd-mm-yyyy.
az backup protection backup-now \
--vault-name "rsv-prod" \
--resource-group rg-backup \
--item-name "myvm" \
--container-name "myvm" \
--backup-management-type AzureIaasVM \
--retain-until "31-12-2025"
Restore Options
| Option | Description |
|---|---|
| Create new VM | Restore full VM from a recovery point |
| Replace existing disk | Replace OS or data disk on an existing VM |
| Restore files | Mount the recovery point as a drive and copy individual files |
5.5 Azure Site Recovery (ASR)
Azure Site Recovery provides disaster recovery (DR) — it replicates VMs continuously to a secondary region. In the event of a regional outage, you can fail over to the replicated VMs.
Key Concepts
| Term | Definition |
|---|---|
| RPO (Recovery Point Objective) | How much data you can afford to lose (e.g., 15 minutes of data) |
| RTO (Recovery Time Objective) | Maximum acceptable downtime: how long the workload can stay offline before it must be running again after failover |
| Replication | Continuous block-level replication of VM disks to the target region |
| Test failover | Validates DR without affecting production (spins up in isolated network) |
| Failover | Activates the DR environment as production |
| Failback | Replicates back to primary region and shifts production back |
ASR RPO
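ASR replicates continuously and maintains recovery points. For Azure VM replication, crash-consistent recovery points are created every 5 minutes, so the achievable RPO for crash-consistent recovery is measured in minutes. App-consistent snapshots are taken at a frequency you set in the replication policy (for example, every 4 hours).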
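To see how recovery points translate into data loss, the following Python sketch computes the gap between a failure and the latest recovery point taken before it, then compares that gap with an RPO target. The recovery-point interval, timestamps, and function name are hypothetical; the 15-minute target echoes the RPO example in the Key Concepts table.

```python
from datetime import datetime, timedelta

def data_loss(recovery_points, failure_time):
    """Time between the failure and the latest recovery point before it."""
    usable = [p for p in recovery_points if p <= failure_time]
    if not usable:
        raise ValueError("no recovery point exists before the failure")
    return failure_time - max(usable)

# Hypothetical recovery points at 12:00, 12:05, 12:10 and 12:15.
start = datetime(2025, 1, 1, 12, 0)
points = [start + timedelta(minutes=5 * i) for i in range(4)]

failure = datetime(2025, 1, 1, 12, 13)
loss = data_loss(points, failure)
rpo_target = timedelta(minutes=15)

print(loss)                # 0:03:00
print(loss <= rpo_target)  # True: within the RPO target
```

The worst case is a failure just before the next recovery point, so the recovery-point interval sets an upper bound on data loss; RTO is a separate question about how long the failover itself takes.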
Exam trap: Backup and ASR serve different purposes:
- Azure Backup = protect against accidental deletion, data corruption, ransomware
- ASR = protect against regional outage, datacenter failure (DR scenario)
5.6 Azure Update Manager
Azure Update Manager (formerly Update Management Center) manages OS updates across Azure VMs and Arc-connected on-premises servers.
- Provides a unified view of update compliance across all machines
- Supports scheduled assessment and scheduled patching
- Maintenance windows control when updates are applied
- Works without a Log Analytics agent (uses Azure VM extension)
5.7 Azure Advisor
Azure Advisor analyzes your Azure usage and provides personalized recommendations across five categories:
| Category | Examples |
|---|---|
| Cost | Right-size underutilized VMs, delete unused resources |
| Security | Enable MFA, apply security patches |
| Reliability | Add availability zones, configure backups |
| Operational Excellence | Enable diagnostics, follow best practices |
| Performance | Upgrade VM disks, increase throughput |
Exam tip: Advisor is read-only and advisory — it doesn't make changes. It surfaces recommendations; you act on them.
Section Takeaways
| Topic | Key Point |
|---|---|
| Metrics | 93-day default retention; no workspace needed |
| Logs | Require Log Analytics workspace; 30-day default |
| Diagnostic settings | Route logs and metrics to workspace, storage, Event Hub |
| Activity log alert | Fires on Azure control-plane (management) operations, e.g., a resource being deleted |
| Metric alert | When numeric threshold is crossed |
| Log query alert | When KQL query result meets a condition |
| Recovery Services Vault | Required for VM backup and ASR |
| Backup vault | Used for Disk backup, PostgreSQL, Blob backup |
| ASR RPO | Crash-consistent recovery points every 5 minutes (Azure VMs); app-consistent frequency is configurable |
| Soft delete | Deleted backup data retained 14 days |
| Backup vs ASR | Backup = data protection; ASR = regional DR |
Confusing Points — Clarified
Q: What's the difference between Backup vault and Recovery Services vault?
A: Recovery Services vault is the original (VM backup, SQL backup, ASR). Backup vault is newer and used for a specific set of newer workloads (Azure Disk backup, Blobs, PostgreSQL). For the AZ-104 exam, Recovery Services vault is what you configure for VM backup and ASR.
Q: Can I query metrics data with KQL?
A: Yes, but you first need to route platform metrics to a Log Analytics workspace via Diagnostic Settings. Once there, they land in the AzureMetrics table. (The Perf table holds guest OS performance counters collected by an agent, not platform metrics.) By default, metrics live only in the Metrics Store, which you explore with Metrics Explorer rather than Log Analytics.
Q: What's the difference between Test Failover and Failover in ASR?
A: Test Failover spins up the replicated VM in an isolated virtual network; production is not affected and replication continues. Failover is the real event, in which production shifts to the DR site. Always test before a real event.
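Q: Does Azure Backup require internet connectivity from the VM?
A: Not for standard Azure VM backup. The backup extension inside the VM coordinates the snapshot, and the Azure Backup service transfers the data to the Recovery Services Vault over the Azure backbone, so the VM needs no explicit outbound internet access. Workloads backed up from inside the VM (for example, SQL Server in an Azure VM, or files via the MARS agent) do need connectivity to Azure Backup endpoints; private endpoints on the vault keep that traffic off the public internet.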