Infrastructure Operations

Monitoring, alert routing, remediation history and service health across Hozzt production environments.

Live · All core services online

Service availability · Last 24 hours · 99.99% · Based on active service checks

MTTD · Last 24 hours · 7 min · Median detection time

MTTR · Last 24 hours · 21 min · Median recovery time

Open incidents · Last 24 hours · 62 waiting operator review

Production nodes · Last 24 hours · 476 · Grouped by service role

Domains scanned · Last 24 hours · 6,758 · HTTP, DNS, SSL and mail checks

Anomaly Detection

Scoring active

Mean risk score: 0.21

Hosts above baseline: 17

Recurring patterns: 4

Telemetry coverage: 98.4%

Host	Metric	Baseline	Current	Score	Disposition
TR-CPANEL-042	Disk I/O wait	7.2%	18.6%	0.74	Cleanup workflow queued
EU-WEB-118	HTTP p95 latency	210ms	520ms	0.69	Cache pool check
US-MAIL-021	SMTP queue depth	1,200	4,850	0.88	Queue drain completed
TR-DB-014	DB connections	62%	78%	0.46	Watch only
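
The page does not show how the Score column is computed. As a rough illustration of comparing Current against Baseline, the sketch below computes the relative deviation for the four rows above. The function name `relative_deviation` is hypothetical, and the result is not the dashboard's score; it only shows how far each metric has moved from its baseline.

```python
def relative_deviation(baseline: float, current: float) -> float:
    """Return (current - baseline) / baseline. Not the dashboard's Score formula."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline

# Host, metric, baseline and current values copied from the table above
# (units dropped: %, ms, message count, %).
rows = [
    ("TR-CPANEL-042", "Disk I/O wait", 7.2, 18.6),
    ("EU-WEB-118", "HTTP p95 latency", 210.0, 520.0),
    ("US-MAIL-021", "SMTP queue depth", 1200.0, 4850.0),
    ("TR-DB-014", "DB connections", 62.0, 78.0),
]

deviations = {host: round(relative_deviation(b, c), 2) for host, _, b, c in rows}

for host, dev in deviations.items():
    print(f"{host}: {dev:+.0%} vs baseline")
```

On these four rows the deviation ranks the hosts in the same order as the Score column (US-MAIL-021 highest, TR-DB-014 lowest), but that agreement should not be read as confirmation of the underlying formula.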

Response Metrics

24-hour window

Detection SLA compliance: 98.7%

Recovery SLA compliance: 97.9%

Automated resolution rate: 84%

check: service-health --scope production
ok: dns, http, smtp, mysql, backup-agent
warn: 14 hosts outside normal baseline
ok: no global outage condition detected

Automation Workflow

Policy engine online

Collect: Metrics, logs, checks and host events

Correlate: Group events by service and impact

Classify: Risk level, blast radius and approval

Execute: Safe action, webhook or operator task

Verify: Health check, recovery note and closure
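
The five stages above can be sketched as a small pipeline. This is a minimal illustration, not the policy engine itself: the `Event` and `Incident` types, the function names, the sample events and the "database incidents need approval" rule (taken from the policy note on this page) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    host: str
    service: str
    metric: str
    severity: str  # "warning" or "critical" in this sketch

@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)
    risk: str = "low"
    needs_approval: bool = False
    action: str = ""
    verified: bool = False

def correlate(events):
    """Correlate: group events by service so related alerts share one incident."""
    incidents = {}
    for ev in events:
        incidents.setdefault(ev.service, Incident(service=ev.service)).events.append(ev)
    return list(incidents.values())

def classify(incident):
    """Classify: raise risk on any critical event; database work needs approval."""
    if any(ev.severity == "critical" for ev in incident.events):
        incident.risk = "high"
    incident.needs_approval = incident.service == "database"
    return incident

def execute(incident):
    """Execute: hand off to an operator when approval is required."""
    incident.action = "operator_task" if incident.needs_approval else "safe_action"
    return incident

def verify(incident, health_check):
    """Verify: record the result of a caller-supplied health check."""
    incident.verified = health_check(incident.service)
    return incident

# Collect: hypothetical sample events standing in for metrics, logs and checks.
events = [
    Event("US-MAIL-021", "mail", "smtp_queue_depth", "critical"),
    Event("TR-DB-014", "database", "db_connections", "warning"),
]

results = [verify(execute(classify(i)), lambda svc: True) for i in correlate(events)]
for r in results:
    print(r.service, r.risk, r.action, r.verified)
```

Keeping each stage as a separate function makes it easy to swap the classification rule or the health check without touching the other stages.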

Monitoring Alerts

Active queue

Time	Alert	Severity	Source	Owner	Status
03:42	Mail queue growth above baseline	Critical	Zabbix	Automation	Resolved
03:35	HTTP latency drift on EU-WEB pool	Warning	Grafana	NOC	Investigating
03:19	Disk pressure forecast on shared hosting node	Warning	Prometheus	Automation	Queued
02:58	Database connection saturation normalized	Recovered	Zabbix	System	Closed

Remediation History

Last actions

SMTP queue drained: US-MAIL-021 · completed in 6m 14s · post-check passed

Disk cleanup awaiting verification: TR-CPANEL-042 · safe cleanup threshold matched

PHP-FPM pool recycled: EU-WEB-118 · latency returned to normal range

Operator approval required: TR-DB-014 · database action blocked by policy

Grafana Rules · Zabbix Actions · Prometheus Alerts

Configuration view

Grafana: Latency Drift

Flags web pools when p95 response time remains outside normal operating range.

WHEN avg(http_latency_p95)
IS ABOVE service_baseline + 2.5σ
FOR 8 minutes
ROUTE TO web-ops
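
One way to read the rule above is sketched below: alert only when every sample in the last 8 minutes exceeds the baseline mean plus 2.5 standard deviations. The function name `drift_alert`, the one-sample-per-minute assumption, the use of the population standard deviation and the sample latencies are all illustrative choices, not details taken from the Grafana configuration.

```python
from statistics import mean, pstdev

ROUTE = "web-ops"  # routing target named in the rule

def drift_alert(baseline_samples, recent_samples, sigmas=2.5, hold=8):
    """Return True when each of the last `hold` samples exceeds
    mean(baseline) + sigmas * stddev(baseline).

    Assumes one sample per minute, so hold=8 approximates "FOR 8 minutes".
    """
    threshold = mean(baseline_samples) + sigmas * pstdev(baseline_samples)
    if len(recent_samples) < hold:
        return False  # not enough history to confirm a sustained breach
    return all(value > threshold for value in recent_samples[-hold:])

# Hypothetical p95 latencies in milliseconds.
baseline = [200, 210, 220, 205, 215]
sustained = [520] * 8
interrupted = [520] * 7 + [220]

print(drift_alert(baseline, sustained))    # sustained breach
print(drift_alert(baseline, interrupted))  # last sample back under threshold
if drift_alert(baseline, sustained):
    print(f"route to {ROUTE}")
```

Requiring every sample in the window to breach the threshold is what keeps a single latency spike from paging the team.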

Zabbix: Service Recovery

Handles confirmed service failures with escalation, notification and recovery confirmation.

IF trigger = service.unavailable
AND group = production
THEN run safe_restart
ELSE escalate to NOC
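
The branch logic of this rule maps directly onto a small decision function. The sketch below is a literal reading of the four lines above; the function name `handle_trigger` and the returned strings are hypothetical, and it does not call any Zabbix API.

```python
def handle_trigger(trigger: str, group: str) -> str:
    """Return the action for a trigger, following the IF/AND/THEN/ELSE rule."""
    if trigger == "service.unavailable" and group == "production":
        return "run safe_restart"
    # Anything that does not match both conditions is escalated.
    return "escalate to NOC"

print(handle_trigger("service.unavailable", "production"))
print(handle_trigger("service.unavailable", "staging"))
```

Note that under this reading a non-production host with an unavailable service is escalated rather than restarted.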

Prometheus: Capacity Pressure

Tracks disk, inode, memory and queue behavior before the node reaches hard limits.

alert: CapacityPressure
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 12
for: 10m
labels:
  severity: warning
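
Reading the 12% threshold as "less than 12% of filesystem capacity still available", the same check can be sketched locally in Python. The helper names `available_percent` and `under_pressure` are hypothetical, and the sketch covers only the threshold comparison, not the 10-minute hold or the inode, memory and queue signals mentioned above.

```python
import shutil

def available_percent(total_bytes: int, avail_bytes: int) -> float:
    """Return available space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return avail_bytes / total_bytes * 100

def under_pressure(total_bytes: int, avail_bytes: int, threshold: float = 12.0) -> bool:
    """True when available space is below the threshold percentage."""
    return available_percent(total_bytes, avail_bytes) < threshold

# Check the filesystem that holds the root path on the local machine.
usage = shutil.disk_usage("/")
pct = available_percent(usage.total, usage.free)
print(f"available: {pct:.1f}%  pressure: {under_pressure(usage.total, usage.free)}")
```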

Service Impact Watchlist

Prioritized nodes

Server / Service Group	Role	Checks	Risk	Next Check
TR-WEB-034 / shared-web	cPanel web pool	842	Medium	4 min
US-MAIL-021 / mail-relay	SMTP queue node	318	Low	2 min
EU-VPS-076 / managed-vps	Virtualization host	146	Medium	6 min
TR-DNS-008 / dns-edge	Authoritative DNS	529	Low	1 min
TR-BKP-012 / backup-agent	Backup verification	204	Info	9 min

Operations Notes

Internal

note: no customer-wide incident open
policy: db actions require approval
runbook: mail queue remediation v3.4
backup: daily verification completed
status: core infrastructure stable