= Status - Here be dragons =

Most users never need to understand the contents of the status section
and can be happy with the output from `crm_mon`.

However for those with a curious inclination, this section attempts to
provide an overview of its contents.
    
== Node Status ==

indexterm:[Node,Status]
indexterm:[Status of a Node]

In addition to the cluster's configuration, the CIB holds an
up-to-date representation of each cluster node in the status section.
      
.A bare-bones status entry for a healthy node called +cl-virt-1+
======
[source,XML]
-----
  <node_state id="cl-virt-1" uname="cl-virt-1" ha="active" in_ccm="true" crmd="online" join="member" expected="member" crm-debug-origin="do_update_resource">
   <transient_attributes id="cl-virt-1"/>
   <lrm id="cl-virt-1"/>
  </node_state>
-----
======      
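The live status section can be inspected at any time with `cibadmin`.
A minimal sketch, assuming a running cluster:

[source,shell]
-----
# Print only the status section of the live CIB
cibadmin --query --scope status
-----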

Users are highly recommended _not_ to modify any part of a node's
state _directly_.  The cluster will periodically regenerate the
entire section from authoritative sources, so any changes should be
made with the tools for those subsystems.
      
.Authoritative Sources for State Information
[width="95%",cols="5m,5<",options="header",align="center"]
|=========================================================

|Dataset  |Authoritative Source

|node_state fields |crmd

|transient_attributes tag |attrd

|lrm tag |lrmd

|=========================================================
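For example, rather than editing the +transient_attributes+ block by
hand, an attribute can be set through +attrd+.  A minimal sketch (the
attribute name +my-attribute+ is illustrative, and option names can
vary between Pacemaker versions):

[source,shell]
-----
# Ask attrd to record a transient attribute for the local node
attrd_updater --name my-attribute --update 1

# Equivalent via crm_attribute; --lifetime reboot places the value in
# the status section, so it is forgotten when the node goes offline
crm_attribute --node cl-virt-1 --name my-attribute --update 1 --lifetime reboot
-----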

The fields used in the +node_state+ objects are named as they are
largely for historical reasons, rooted in Pacemaker's origins as the
Heartbeat resource manager.

They have remained unchanged to preserve compatibility with older
versions.
      
.Node Status Fields
[width="95%",cols="2m,5<",options="header",align="center"]
|=========================================================

|Field |Description


| id |
indexterm:[id,Node Status]
indexterm:[Node,Status,id]
Unique identifier for the node. Corosync-based clusters use the uname
of the machine, while Heartbeat clusters use a (barely) human-readable
UUID.

| uname |
indexterm:[uname,Node Status]
indexterm:[Node,Status,uname]
The node's machine name (output from `uname -n`).

| ha | 
indexterm:[ha,Node Status]
indexterm:[Node,Status,ha]
Flag specifying whether the cluster software is active on the
node. Allowed values: +active+, +dead+.

| in_ccm |            
indexterm:[in_ccm,Node Status]
indexterm:[Node,Status,in_ccm]
Flag specifying whether the node is part of the cluster's membership.
Allowed values: +true+, +false+.

| crmd |
indexterm:[crmd,Node Status]
indexterm:[Node,Status,crmd]
Flag specifying whether the crmd process is active on the node.
Allowed values: +online+, +offline+.

| join |
indexterm:[join,Node Status]
indexterm:[Node,Status,join]
Flag specifying whether the node participates in hosting
resources. Allowed values: +down+, +pending+, +member+, +banned+.
            
| expected |
indexterm:[expected,Node Status]
indexterm:[Node,Status,expected]
Expected value for +join+.

| crm-debug-origin |
indexterm:[crm-debug-origin,Node Status]
indexterm:[Node,Status,crm-debug-origin]
Diagnostic indicator: the origin of the most recent change(s).

|=========================================================

The cluster uses these fields to determine, at the node level, whether
the node is healthy or in a failed state and needs to be fenced.

== Transient Node Attributes ==

Like regular <<s-node-attributes,node attributes>>, the name/value
pairs listed here also help to describe the node.  However, they are
forgotten by the cluster when the node goes offline.  This can be
useful, for instance, when you want a node to be in standby mode (not
able to run resources) until the next reboot.
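A hedged sketch of doing exactly that (option names may differ
between Pacemaker versions):

[source,shell]
-----
# Put cl-virt-1 into standby until its next reboot; --lifetime reboot
# stores the attribute in the status section, not the configuration
crm_standby --node cl-virt-1 --update on --lifetime reboot
-----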
      
In addition to any values the administrator sets, the cluster will
also store information about failed resources here.
      
.Example set of transient node attributes for node "cl-virt-1"
======
[source,XML]
-----
  <transient_attributes id="cl-virt-1">
    <instance_attributes id="status-cl-virt-1">
       <nvpair id="status-cl-virt-1-pingd" name="pingd" value="3"/>
       <nvpair id="status-cl-virt-1-probe_complete" name="probe_complete" value="true"/>
       <nvpair id="status-cl-virt-1-fail-count-pingd:0" name="fail-count-pingd:0" value="1"/>
       <nvpair id="status-cl-virt-1-last-failure-pingd:0" name="last-failure-pingd:0" value="1239009742"/>
    </instance_attributes>
  </transient_attributes>
-----
======

In the above example, we can see that the +pingd:0+ resource has
failed once, at +Mon Apr 6 11:22:22 2009+.
footnote:[
You can use the standard +date+ command to print a human-readable version
of any seconds-since-epoch value, for example: `date -d @1239009742`
]
We also see that the node is connected to three "pingd" peers and that
all known resources have been checked for on this machine (+probe_complete+).
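Failure information recorded here is normally cleared with the
cluster tools rather than by deleting the attributes.  A hedged sketch
(exact option names vary between Pacemaker versions):

[source,shell]
-----
# Clear pingd:0's resource history, including the fail-count and
# last-failure attributes, on cl-virt-1
crm_resource --cleanup --resource pingd:0 --node cl-virt-1
-----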
      
== Operation History ==
indexterm:[Operation History] 

A node's resource history is held in the +lrm_resources+ tag (a child
of the +lrm+ tag). The information stored here is sufficient for the
cluster to stop the resource safely if it is removed from the
+configuration+ section. Specifically, the resource's +id+, +class+,
+type+ and +provider+ are stored.

.A record of the apcstonith resource
======
[source,XML]
<lrm_resource id="apcstonith" type="apcmastersnmp" class="stonith"/>
======

Additionally, we store the last job for every combination of
+resource+, +action+ and +interval+.  The concatenation of the values
in this tuple is used to create the id of the +lrm_rsc_op+ object
(for example, +pingd:0_monitor_30000+).

.Contents of an +lrm_rsc_op+ job
[width="95%",cols="2m,5<",options="header",align="center"]
|=========================================================

|Field
|Description

| id |
indexterm:[id,Action Status]
indexterm:[Action,Status,id]

Identifier for the job constructed from the resource's +id+,
+operation+ and +interval+.
            
| call-id |
indexterm:[call-id,Action Status]
indexterm:[Action,Status,call-id]

The job's ticket number. Used as a sort key to determine the order in
which the jobs were executed.
            
| operation |
indexterm:[operation,Action Status]
indexterm:[Action,Status,operation]

The action the resource agent was invoked with.

| interval |
indexterm:[interval,Action Status]
indexterm:[Action,Status,interval]

The frequency, in milliseconds, at which the operation will be
repeated. A one-off job is indicated by 0.
            
| op-status |
indexterm:[op-status,Action Status]
indexterm:[Action,Status,op-status]

The job's status. Generally this will be either 0 (done) or -1
(pending). This field is rarely consulted; +rc-code+ is normally used
instead.
            
| rc-code |
indexterm:[rc-code,Action Status]
indexterm:[Action,Status,rc-code]

The job's result. Refer to <<s-ocf-return-codes>> for
details on what the values here mean and how they are interpreted.

| last-run |
indexterm:[last-run,Action Status]
indexterm:[Action,Status,last-run]

Diagnostic indicator. Machine local date/time, in seconds since epoch,
at which the job was executed.

| last-rc-change |
indexterm:[last-rc-change,Action Status]
indexterm:[Action,Status,last-rc-change]

Diagnostic indicator. Machine local date/time, in seconds since epoch,
at which the job first returned the current value of +rc-code+.

| exec-time |
indexterm:[exec-time,Action Status]
indexterm:[Action,Status,exec-time]

Diagnostic indicator. Time, in milliseconds, that the job was running for.

| queue-time |
indexterm:[queue-time,Action Status]
indexterm:[Action,Status,queue-time]

Diagnostic indicator. Time, in milliseconds, that the job was queued for in the LRMd.

| crm_feature_set |
indexterm:[crm_feature_set,Action Status]
indexterm:[Action,Status,crm_feature_set]

The version which this job description conforms to. Used when
processing +op-digest+.

| transition-key |
indexterm:[transition-key,Action Status]
indexterm:[Action,Status,transition-key]

A concatenation of the job's graph action number, the graph number,
the expected result and the UUID of the crmd instance that scheduled
it. This is used to construct +transition-magic+ (below).

| transition-magic |
indexterm:[transition-magic,Action Status]
indexterm:[Action,Status,transition-magic]

A concatenation of the job's +op-status+, +rc-code+ and
+transition-key+. Guaranteed to be unique for the life of the cluster
(which ensures it is part of CIB update notifications) and contains
all the information needed for the crmd to correctly analyze and
process the completed job. Most importantly, the decomposed elements
tell the crmd if the job entry was expected and whether it failed.
            
| op-digest |
indexterm:[op-digest,Action Status]
indexterm:[Action,Status,op-digest]

An MD5 sum representing the parameters passed to the job. Used to
detect changes to the configuration and restart resources if
necessary.

| crm-debug-origin |
indexterm:[crm-debug-origin,Action Status]
indexterm:[Action,Status,crm-debug-origin]

Diagnostic indicator. The origin of the current values.

|=========================================================
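To illustrate how +transition-magic+ decomposes into those elements,
a minimal bash sketch (illustrative only; the cluster performs this
parsing internally):

[source,shell]
-----
# Format: op-status:rc-code;action:transition:expected-rc:crmd-uuid
magic="0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"

status_rc="${magic%%;*}"   # "0:7" - the op-status and rc-code
key="${magic#*;}"          # the embedded transition-key

echo "op-status=${status_rc%%:*} rc-code=${status_rc#*:}"
IFS=: read -r action transition expected uuid <<<"$key"
echo "action=$action transition=$transition expected-rc=$expected crmd-uuid=$uuid"
-----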

=== Simple Example ===
        
.A monitor operation (determines current state of the apcstonith resource)
======
[source,XML]
-----
<lrm_resource id="apcstonith" type="apcmastersnmp" class="stonith">
  <lrm_rsc_op id="apcstonith_monitor_0" operation="monitor" call-id="2"
    rc-code="7" op-status="0" interval="0"
    crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
    op-digest="2e3da9274d3550dc6526fb24bfcbcba0"
    transition-key="22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    transition-magic="0:7;22:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    last-run="1239008085" last-rc-change="1239008085" exec-time="10" queue-time="0"/>
</lrm_resource>
-----
======
        
In the above example, the job is a non-recurring monitor operation,
often referred to as a "probe", for the +apcstonith+ resource.

The cluster schedules probes for every configured resource on a node
when the node starts, in order to determine the resource's current
state before it takes any further action.
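Probes can also be requested manually.  A hedged sketch (in
Pacemaker 1.x the option is +--reprobe+; newer releases use
+--refresh+):

[source,shell]
-----
# Ask the cluster to re-probe resource state on cl-virt-1
crm_resource --reprobe --node cl-virt-1
-----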
        
From the +transition-key+, we can see that this was the 22nd action of
the 2nd graph produced by this instance of the crmd
(2668bbeb-06d5-40f9-936d-24cb7f87006a).

The third field of the +transition-key+ contains a 7, which indicates
that the job expects to find the resource inactive.  By looking at the
+rc-code+ property, we see that this was indeed the case.
        

As that is the only job recorded for this node, we can conclude that
the cluster started the resource elsewhere.

=== Complex Resource History Example ===
        
.Resource history of a pingd clone with multiple jobs
======
[source,XML]
-----
<lrm_resource id="pingd:0" type="pingd" class="ocf" provider="pacemaker">
  <lrm_rsc_op id="pingd:0_monitor_30000" operation="monitor" call-id="34"
    rc-code="0" op-status="0" interval="30000"
    crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
    transition-key="10:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    ...
    last-run="1239009741" last-rc-change="1239009741" exec-time="10" queue-time="0"/>
  <lrm_rsc_op id="pingd:0_stop_0" operation="stop"
    crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" call-id="32"
    rc-code="0" op-status="0" interval="0"
    transition-key="11:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    ...
    last-run="1239009741" last-rc-change="1239009741" exec-time="10" queue-time="0"/>
  <lrm_rsc_op id="pingd:0_start_0" operation="start" call-id="33"
    rc-code="0" op-status="0" interval="0"
    crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
    transition-key="31:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    ...
    last-run="1239009741" last-rc-change="1239009741" exec-time="10" queue-time="0" />
  <lrm_rsc_op id="pingd:0_monitor_0" operation="monitor" call-id="3"
    rc-code="0" op-status="0" interval="0"
    crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
    transition-key="23:2:7:2668bbeb-06d5-40f9-936d-24cb7f87006a"
    ...
    last-run="1239008085" last-rc-change="1239008085" exec-time="20" queue-time="0"/>
  </lrm_resource>
-----
======

When more than one job record exists, it is important to sort them by
+call-id+ before interpreting them.
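A hedged sketch of listing a resource's +call-id+ values in execution
order, assuming the CIB has been saved to a file and `xmllint` is
available:

[source,shell]
-----
# Save the live CIB, then print pingd:0's call-ids in numeric order
cibadmin --query > /tmp/cib.xml
xmllint --xpath '//lrm_resource[@id="pingd:0"]/lrm_rsc_op/@call-id' /tmp/cib.xml \
  | tr ' ' '\n' | sort -t'"' -k2 -n
-----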

Once sorted, the above example can be summarized as:

. A non-recurring monitor operation returning 7 (not running), with a +call-id+ of 3
. A stop operation returning 0 (success), with a +call-id+ of 32
. A start operation returning 0 (success), with a +call-id+ of 33
. A recurring monitor returning 0 (success), with a +call-id+ of 34


The cluster processes each job record to build up a picture of the
resource's state.  After the first and second entries, it is
considered stopped, and after the third it is considered active.

Based on the last operation, we can tell that the resource is
currently active.

Additionally, from the presence of a +stop+ operation with a lower
+call-id+ than that of the +start+ operation, we can conclude that the
resource has been restarted.  Specifically, this occurred as part of
actions 11 and 31 of transition 11 from the crmd instance with the key
+2668bbeb...+.  This information can be helpful for locating the
relevant section of the logs when looking for the source of a failure.
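For example, a hedged sketch of searching for the stop action of that
transition (the log path is an assumption and varies by distribution):

[source,shell]
-----
# Find messages mentioning action 11 of transition 11; transition keys
# typically appear in crmd log entries
grep '11:11:0:2668bbeb-06d5-40f9-936d-24cb7f87006a' /var/log/messages
-----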
