System Monitoring Practices

This document outlines the Systems Administration team's practices for the default monitoring parameters and expectations of servers we administer.

  • Overview
    • Systems Administration Responsibilities
    • Customer Responsibilities
      • Oncall for alerts
      • Thresholds
      • Filesystems / partitions
      • Application tests
  • Default monitoring Parameters
  • Changing the defaults
  • Email with Critical alerts and warnings

Overview

When a new server is added to the server farm, Systems Administration adds it to our monitoring services so that it is monitored by both the NOC and the systems administrators. By default, Systems monitors the basic functionality of the base operating system. Our standard practice is to delegate responsibility for application level filesystems and application tests to the service owner.

Systems Administration Responsibilities

Systems will set up default monitoring for the following:

  • Disk space availability and utilization (for OS related filesystems/partitions)
  • CPU utilization
  • Network connectivity and utilization
  • Memory utilization

Our monitoring thresholds for these are set to the defaults described below. If any of these tests generates an alert, the NOC will contact the oncall Systems Administrator for assistance.

Customer Responsibilities

Oncall for alerts

We expect our customers to establish "oncall" information for their application if they require application level monitoring. The NOC will use this oncall information (an oncall rotation, if desired) to contact application support personnel in the event of an alert.

Thresholds

Any tests added for an application owner will have alert thresholds that can be tuned either by the customer or with the assistance of Systems Administration.

Filesystems / partitions

When a new system is created, Systems will set up monitoring for application level filesystems/partitions. We will work with the customer to establish reasonable monitoring thresholds.

Application tests

Application level tests (such as Web server tests) can be set up with the assistance of Systems Administration. Alerts for these tests will be directed to the application owner's oncall list, as established above.

Default monitoring Parameters

Thresholds vary by test. Because occasional glitches in monitoring can cause false positives, most tests are configured so that they do not alert on the very first failure. By default, most tests poll every 10 minutes and alert after two failures. This means that up to 20 minutes can pass between a failure and the first alert sent to the NOC.
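
The 20-minute figure follows from the polling arithmetic. The short Python sketch below is illustrative only and is not part of our monitoring software; the function name alert_delay_bounds and its parameters are hypothetical. It assumes the failure persists and that an alert is sent as soon as the required number of consecutive polls has failed.

```python
def alert_delay_bounds(poll_interval_min, failures_to_alert):
    """Return (shortest, longest) delay in minutes between a failure
    and the first alert, assuming the failure persists across polls."""
    if poll_interval_min <= 0 or failures_to_alert < 1:
        raise ValueError("interval must be positive and failure count at least 1")
    # Shortest case: the failure starts just before a poll, so the first
    # failed poll is almost immediate; only the remaining polls add delay.
    shortest = (failures_to_alert - 1) * poll_interval_min
    # Longest case: the failure starts just after a poll, so a full
    # interval passes before the first failed poll is even recorded.
    longest = failures_to_alert * poll_interval_min
    return shortest, longest


# Defaults described above: poll every 10 minutes, alert after two failures.
shortest, longest = alert_delay_bounds(10, 2)
print(f"First alert arrives between {shortest} and {longest} minutes after the failure")
```

With the defaults, the first alert reaches the NOC somewhere between just over 10 minutes and 20 minutes after the failure begins. Shortening the poll interval or alerting on the first failure reduces this delay at the cost of more false positives.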

Changing the defaults

Thresholds for both system level and application level tests can be adjusted based on the needs of the service.

Email with Critical alerts and warnings

Our monitoring software maintains different thresholds for tests to generate warnings and critical alerts. Emails can be generated for either or both. We recommend that service owners receive warnings as well as critical alerts, so that warnings may be caught during routine business hours.
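
As a rough illustration of how separate warning and critical thresholds interact with email subscriptions, consider the Python sketch below. It does not reflect the configuration format of our monitoring software; the names Severity, classify, and should_email are hypothetical, and the 80% and 90% disk utilization thresholds are example values, not our defaults.

```python
from enum import Enum


class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


def classify(value, warning_at, critical_at):
    """Map a measured value to a severity using two thresholds."""
    if warning_at > critical_at:
        raise ValueError("warning threshold must not exceed critical threshold")
    if value >= critical_at:
        return Severity.CRITICAL
    if value >= warning_at:
        return Severity.WARNING
    return Severity.OK


def should_email(severity, subscribed):
    """Email only for the severities the service owner subscribed to."""
    return severity in subscribed


# Hypothetical example: disk utilization with warning at 80% and critical at 90%.
recommended = {Severity.WARNING, Severity.CRITICAL}  # receive both, as recommended above
critical_only = {Severity.CRITICAL}

level = classify(85, warning_at=80, critical_at=90)
print(level.value, should_email(level, recommended), should_email(level, critical_only))
```

In this example, a filesystem at 85% produces a warning. An owner subscribed to both levels is emailed and can act during business hours, while an owner subscribed only to critical alerts hears nothing until the 90% threshold is crossed.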