Infrastructure Operations

Monitoring, alert routing, remediation history and service health across Hozzt production environments.

Live · All core services online
Service availability · Last 24 hours · 99.99% · Based on active service checks
MTTD · Last 24 hours · 7 min · Median detection time
MTTR · Last 24 hours · 21 min · Median recovery time
Open incidents · Last 24 hours · 62 waiting operator review
Production nodes · Last 24 hours · 476 · Grouped by service role
Domains scanned · Last 24 hours · 6,758 · HTTP, DNS, SSL and mail checks
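
As an illustration of how median detection and recovery times like the MTTD and MTTR tiles could be derived, here is a minimal sketch. It assumes each incident carries a start, detection and recovery timestamp. The record layout, the function names and the sample timestamps are all hypothetical and are not taken from the Hozzt platform.

```python
from datetime import datetime, timedelta
from statistics import median


def median_minutes(deltas):
    """Median of a list of timedeltas, expressed in minutes."""
    return median(d.total_seconds() / 60 for d in deltas)


def mttd_mttr(incidents):
    """Return (median detection, median recovery) in minutes.

    Assumption: detection is measured from start to detection and
    recovery from detection to recovery. Other definitions exist.
    """
    detect = [detected - started for started, detected, _ in incidents]
    recover = [recovered - detected for _, detected, recovered in incidents]
    return median_minutes(detect), median_minutes(recover)


# Invented sample data: (started, detected, recovered)
t0 = datetime(2024, 1, 1, 3, 0)
sample_incidents = [
    (t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=19)),
    (t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26)),
    (t0, t0 + timedelta(minutes=9), t0 + timedelta(minutes=39)),
]
mttd, mttr = mttd_mttr(sample_incidents)
print(f"MTTD {mttd:.0f} min, MTTR {mttr:.0f} min")
```

A median is less sensitive to a single long-running incident than a mean, which may be why the tiles report medians.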

Anomaly Detection

Scoring active
Mean risk score: 0.21
Hosts above baseline: 17
Recurring patterns: 4
Telemetry coverage: 98.4%
[Chart: time axis 00:00, 06:00, 12:00, 18:00, Now]
Host | Metric | Baseline | Current | Score | Disposition
TR-CPANEL-042 | Disk I/O wait | 7.2% | 18.6% | 0.74 | Cleanup workflow queued
EU-WEB-118 | HTTP p95 latency | 210 ms | 520 ms | 0.69 | Cache pool check
US-MAIL-021 | SMTP queue depth | 1,200 | 4,850 | 0.88 | Queue drain completed
TR-DB-014 | DB connections | 62% | 78% | 0.46 | Watch only

Response Metrics

24-hour window
Detection SLA compliance: 98.7%
Recovery SLA compliance: 97.9%
Automated resolution rate: 84%
check: service-health --scope production
ok: dns, http, smtp, mysql, backup-agent
warn: 14 hosts outside normal baseline
ok: no global outage condition detected
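
Check output in this "level: message" shape can be reduced to a single status. The sketch below is one possible way to do that. The `fail` level and both function names are assumptions, since the output above only shows `ok` and `warn`.

```python
def parse_check_output(text):
    """Split 'level: message' lines into (level, message) pairs.

    Lines with an unrecognized prefix (such as 'check:', which echoes
    the command) are skipped.
    """
    results = []
    for line in text.strip().splitlines():
        level, _, message = line.partition(":")
        level = level.strip().lower()
        if level in ("ok", "warn", "fail"):  # 'fail' is a hypothetical level
            results.append((level, message.strip()))
    return results


def overall_status(results):
    """Worst level wins: fail > warn > ok."""
    order = {"ok": 0, "warn": 1, "fail": 2}
    return max((level for level, _ in results), key=order.__getitem__, default="ok")


sample = """check: service-health --scope production
ok: dns, http, smtp, mysql, backup-agent
warn: 14 hosts outside normal baseline
ok: no global outage condition detected"""

parsed = parse_check_output(sample)
status = overall_status(parsed)
print(status)
```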

Automation Workflow

Policy engine online
Collect: Metrics, logs, checks and host events
Correlate: Group events by service and impact
Classify: Risk level, blast radius and approval
Execute: Safe action, webhook or operator task
Verify: Health check, recovery note and closure
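
The five stages can be sketched as a small pipeline. This is only an illustrative model, not the policy engine itself. The `Event` and `Incident` classes, the 0.8 and 0.5 risk thresholds, the service names and the rule that database incidents need approval are assumptions chosen to mirror the dashboard entries.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    host: str
    service: str
    metric: str
    score: float  # anomaly score, assumed to lie in [0, 1]


@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)
    risk: str = "low"
    needs_approval: bool = False
    action: str = ""
    closed: bool = False


def correlate(events):
    """Correlate: group collected events by service."""
    incidents = {}
    for ev in events:
        incidents.setdefault(ev.service, Incident(ev.service)).events.append(ev)
    return list(incidents.values())


def classify(incident, approval_services=("database",)):
    """Classify: derive a risk level and whether approval is required."""
    top = max(ev.score for ev in incident.events)
    incident.risk = "high" if top >= 0.8 else "medium" if top >= 0.5 else "low"
    incident.needs_approval = incident.service in approval_services
    return incident


def execute(incident):
    """Execute: choose a safe action, a watch, or an operator task."""
    if incident.needs_approval:
        incident.action = "operator_task"
    elif incident.risk == "low":
        incident.action = "watch"
    else:
        incident.action = "safe_action"
    return incident


def verify(incident, health_check):
    """Verify: close only if an automated action ran and the post-check passes."""
    if incident.action == "safe_action" and health_check(incident.service):
        incident.closed = True
    return incident


def run_pipeline(events, health_check):
    return [verify(execute(classify(i)), health_check) for i in correlate(events)]


# Collect: hosts and scores echo the anomaly table; service names are invented.
events = [
    Event("US-MAIL-021", "mail", "smtp_queue_depth", 0.88),
    Event("TR-DB-014", "database", "db_connections", 0.46),
]
results = run_pipeline(events, health_check=lambda service: True)
for inc in results:
    print(inc.service, inc.risk, inc.action, inc.closed)
```

In this sketch the database incident is never closed automatically, because it is routed to an operator task instead of a safe action.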

Monitoring Alerts

Active queue
Time | Alert | Severity | Source | Owner | Status
03:42 | Mail queue growth above baseline | Critical | Zabbix | Automation | Resolved
03:35 | HTTP latency drift on EU-WEB pool | Warning | Grafana | NOC | Investigating
03:19 | Disk pressure forecast on shared hosting node | Warning | Prometheus | Automation | Queued
02:58 | Database connection saturation normalized | Recovered | Zabbix | System | Closed

Remediation History

Last actions
SMTP queue drained · US-MAIL-021 · completed in 6m 14s · post-check passed
Disk cleanup waiting verification · TR-CPANEL-042 · safe cleanup threshold matched
PHP-FPM pool recycled · EU-WEB-118 · latency returned to normal range
Operator approval required · TR-DB-014 · database action blocked by policy

Grafana Rules · Zabbix Actions · Prometheus Alerts

Configuration view

Grafana: Latency Drift

Flags web pools when p95 response time remains outside normal operating range.

WHEN avg(http_latency_p95) IS ABOVE service_baseline + 2.5σ FOR 8 minutes ROUTE TO web-ops
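
To make the "baseline + 2.5σ for 8 minutes" condition concrete, the sketch below models it over a list of one-minute samples. It is a simplification and not Grafana's evaluation logic: it requires every sample in the window to exceed the threshold, and the function names and sample values are hypothetical.

```python
from statistics import mean, stdev


def drift_threshold(baseline_samples, k=2.5):
    """Baseline mean plus k sample standard deviations."""
    return mean(baseline_samples) + k * stdev(baseline_samples)


def sustained_breach(samples, threshold, minutes=8, interval_minutes=1):
    """True if the most recent `minutes` of samples all exceed the threshold."""
    needed = minutes // interval_minutes
    if len(samples) < needed:
        return False
    return all(value > threshold for value in samples[-needed:])


# Invented p95 latency samples in milliseconds, one per minute.
baseline = [200, 205, 210, 215, 220]
threshold = drift_threshold(baseline)

fires = sustained_breach([212] * 4 + [520] * 8, threshold)
short_spike = sustained_breach([212] * 9 + [520] * 3, threshold)
print(round(threshold, 1), fires, short_spike)
```

Requiring the breach to persist for the full window keeps a three-minute spike from routing to web-ops, which is the behavior the 8-minute clause suggests.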

Zabbix: Service Recovery

Handles confirmed service failures with escalation, notification and recovery confirmation.

IF trigger = service.unavailable AND group = production THEN run safe_restart ELSE escalate to NOC
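
The branch in this rule can be expressed as a small dispatch function. This is a sketch of the decision only and does not use the Zabbix API. `handle_trigger` and the two stand-in callbacks are hypothetical names.

```python
def handle_trigger(trigger, group, safe_restart, escalate):
    """Run safe_restart for production outages; escalate everything else to NOC."""
    if trigger == "service.unavailable" and group == "production":
        return safe_restart()
    return escalate("NOC")


# Stand-in callbacks that record what would have happened.
log = []


def fake_restart():
    log.append("safe_restart")
    return "restarted"


def fake_escalate(team):
    log.append(f"escalate:{team}")
    return "escalated"


first = handle_trigger("service.unavailable", "production", fake_restart, fake_escalate)
second = handle_trigger("service.unavailable", "staging", fake_restart, fake_escalate)
print(log)
```

Passing the actions in as callables keeps the decision testable without touching a real host.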

Prometheus: Capacity Pressure

Tracks disk, inode, memory and queue behavior before the node reaches hard limits.

alert: CapacityPressure
expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 12
for: 10m
labels:
  severity: warning
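
A rule with a `for` clause stays pending while the condition holds and fires only once it has held for the full duration. The sketch below models that behavior for a single filesystem, reading the 12% figure as free space relative to filesystem size. The class name, its parameters and the sample readings are assumptions for illustration, and it is not Prometheus code.

```python
class CapacityPressure:
    """Simplified pending/firing state for one filesystem."""

    def __init__(self, min_free_pct=12.0, hold_seconds=600):
        self.min_free_pct = min_free_pct
        self.hold_seconds = hold_seconds
        self.breach_since = None  # time the condition first became true

    def observe(self, now, avail_bytes, size_bytes):
        """Record one sample at time `now` (seconds) and return the state."""
        free_pct = 100.0 * avail_bytes / size_bytes
        if free_pct >= self.min_free_pct:
            self.breach_since = None  # condition cleared, reset the timer
            return "ok"
        if self.breach_since is None:
            self.breach_since = now
        if now - self.breach_since >= self.hold_seconds:
            return "firing"
        return "pending"


GIB = 1024 ** 3
rule = CapacityPressure()
states = [
    rule.observe(0, 20 * GIB, 100 * GIB),    # 20% free
    rule.observe(60, 10 * GIB, 100 * GIB),   # 10% free, timer starts
    rule.observe(360, 9 * GIB, 100 * GIB),   # still low, 5 minutes in
    rule.observe(660, 8 * GIB, 100 * GIB),   # low for 10 minutes
    rule.observe(720, 30 * GIB, 100 * GIB),  # space recovered
]
print(states)
```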

Service Impact Watchlist

Prioritized nodes
Server / Service Group | Role | Checks | Risk | Next Check
TR-WEB-034 / shared-web | cPanel web pool | 842 | Medium | 4 min
US-MAIL-021 / mail-relay | SMTP queue node | 318 | Low | 2 min
EU-VPS-076 / managed-vps | Virtualization host | 146 | Medium | 6 min
TR-DNS-008 / dns-edge | Authoritative DNS | 529 | Low | 1 min
TR-BKP-012 / backup-agent | Backup verification | 204 | Info | 9 min

Operations Notes

Internal
note: no customer-wide incident open
policy: db actions require approval
runbook: mail queue remediation v3.4
backup: daily verification completed
status: core infrastructure stable