= Read-Me-First =

== The Scope of this Document ==

The purpose of this document is to definitively explain the concepts
used to configure Pacemaker.  To achieve this, it will focus
exclusively on the XML syntax used to configure the CIB.
      
For those who are allergic to XML, there exist several unified shells
and GUIs for Pacemaker. However, these tools will not be covered at all
in this document
footnote:[I hope, however, that the concepts explained here make the functionality of these tools more easily understood.]
, precisely because they hide the XML.
      
Additionally, this document is NOT a step-by-step how-to guide for
configuring a specific clustering scenario.

Although such guides exist, the purpose of this document is to provide
an understanding of the building blocks that can be used to construct
any type of Pacemaker cluster.
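
To give a feel for what is to come, below is a sketch of the skeleton
that every CIB shares.  The element names are real; the attribute
values are purely illustrative:

[source,XML]
----
<cib admin_epoch="0" epoch="0" num_updates="0">
  <configuration>
    <crm_config/>   <!-- cluster-wide options -->
    <nodes/>        <!-- the machines that make up the cluster -->
    <resources/>    <!-- the services being managed -->
    <constraints/>  <!-- rules governing where and when resources run -->
  </configuration>
  <status/>         <!-- runtime state, maintained by the cluster itself -->
</cib>
----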

== What Is Pacemaker? ==

Pacemaker is a cluster resource manager.

It achieves maximum availability for your cluster services
(aka. resources) by detecting and recovering from node- and
resource-level failures, making use of the messaging and membership
capabilities provided by your preferred cluster infrastructure (either
http://www.corosync.org/[Corosync] or
http://linux-ha.org/wiki/Heartbeat[Heartbeat]).
      
Pacemaker's key features include:

* Detection and recovery of node and service-level failures
* Storage agnostic, no requirement for shared storage
* Resource agnostic, anything that can be scripted can be clustered
* Supports http://en.wikipedia.org/wiki/STONITH[STONITH] for ensuring data integrity
* Supports large and small clusters
* Supports both http://en.wikipedia.org/wiki/Quorum_(Distributed_Systems)[quorate] and http://devresources.linux-foundation.org/dev/clusters/docs/ResourceDrivenClusters.pdf[resource driven] clusters
* Supports practically any http://en.wikipedia.org/wiki/High-availability_cluster#Node_configurations[redundancy configuration]
* Automatically replicated configuration that can be updated from any node
* Ability to specify cluster-wide service ordering, colocation and anti-colocation (see the sketch following this list)
* Support for advanced service types
** Clones: for services which need to be active on multiple nodes
** Multi-state: for services with multiple modes (e.g. master/slave, primary/secondary)
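
For example, ordering, colocation and anti-colocation are all
expressed as constraints in the CIB.  A minimal sketch, using three
hypothetical resources named +database+, +webserver+ and +backup+:

[source,XML]
----
<constraints>
  <!-- start the database before the webserver -->
  <rsc_order id="order-db-then-web" first="database" then="webserver"/>
  <!-- run the webserver on the same node as the database -->
  <rsc_colocation id="web-with-db" rsc="webserver" with-rsc="database"
                  score="INFINITY"/>
  <!-- anti-colocation: keep the webserver away from the backup job -->
  <rsc_colocation id="web-away-from-backup" rsc="webserver"
                  with-rsc="backup" score="-INFINITY"/>
</constraints>
----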

== Types of Pacemaker Clusters ==

Pacemaker makes no assumptions about your environment.  This allows it
to support practically any
http://en.wikipedia.org/wiki/High-availability_cluster#Node_configurations[redundancy configuration],
including Active/Active, Active/Passive, N+1, N+M, N-to-1 and N-to-N.

.Active/Passive Redundancy
image::images/pcmk-active-passive.png["Active/Passive Redundancy",width="10cm",height="7.5cm",align="center"]

Two-node Active/Passive clusters using Pacemaker and DRBD are a
cost-effective solution for many High Availability situations.
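
In such a setup, DRBD is typically modeled as a multi-state resource.
A minimal sketch, assuming the +ocf:linbit:drbd+ agent and a DRBD
resource named +r0+:

[source,XML]
----
<master id="ms-drbd-data">
  <!-- the cluster promotes one copy to master and demotes the other to slave -->
  <primitive id="drbd-data" class="ocf" provider="linbit" type="drbd">
    <instance_attributes id="drbd-data-params">
      <nvpair id="drbd-data-resource" name="drbd_resource" value="r0"/>
    </instance_attributes>
  </primitive>
</master>
----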

.Shared Failover
image::images/pcmk-shared-failover.png["Shared Failover",width="10cm",height="7.5cm",align="center"]

By supporting many nodes, Pacemaker can dramatically reduce hardware
costs by allowing several active/passive clusters to be combined and
share a common backup node.

.N to N Redundancy
image::images/pcmk-active-active.png["N to N Redundancy",width="10cm",height="7.5cm",align="center"]

When shared storage is available, every node can potentially be used
for failover.  Pacemaker can even run multiple copies of services to
spread out the workload.


== Pacemaker Architecture ==

At the highest level, the cluster is made up of three pieces:

* Core cluster infrastructure providing messaging and membership
  functionality (illustrated in red)       
* Non-cluster aware components (illustrated in green).
+
In a Pacemaker cluster, these pieces include not only the scripts that
know how to start, stop and monitor resources, but also a local daemon
that masks the differences between the different standards these
scripts implement.
+
* A brain (illustrated in blue) 
+
This component processes and reacts to events from the cluster (nodes
leaving or joining) and resources (e.g. monitor failures), as well as
configuration changes from the administrator.  In response to all of
these events, Pacemaker will compute the ideal state of the cluster
and plot a path to achieve it.  This may include moving resources,
stopping nodes and even forcing nodes offline with remote power
switches.
+

////
Hack to force end list
////


=== Conceptual Stack Overview ===

.Conceptual overview of the cluster stack
image::images/pcmk-overview.png["Conceptual overview of the cluster stack",width="10cm",height="7.5cm",align="center"]

When combined with Corosync, Pacemaker also supports popular open
source cluster filesystems.
footnote:[
Even though Pacemaker also supports Heartbeat, the filesystems need to
use the stack for messaging and membership and Corosync seems to be
what they're standardizing on.

Technically it would be possible for them to support Heartbeat as
well; however, there seems to be little interest in this.
]

Due to recent standardization within the cluster filesystem community,
these filesystems make use of a common distributed lock manager, which
relies on Corosync for its messaging capabilities and on Pacemaker for
its membership (which nodes are up/down) and fencing services.
        
.The Pacemaker stack when running on Corosync
image::images/pcmk-stack.png["The Pacemaker stack when running on Corosync",width="10cm",height="7.5cm",align="center"]

=== Internal Components ===

Pacemaker itself is composed of four key components (illustrated below
in the same color scheme as the previous diagram):

* CIB (aka. Cluster Information Base)
* CRMd (aka. Cluster Resource Management daemon)
* PEngine (aka. PE or Policy Engine)
* STONITHd
          
.Subsystems of a Pacemaker cluster
image::images/pcmk-internals.png["Subsystems of a Pacemaker cluster",width="10cm",height="7.5cm",align="center"]

The CIB uses XML to represent both the cluster's configuration and the
current state of all resources in the cluster.  The contents of the
CIB are automatically kept in sync across the entire cluster and are
used by the PEngine to compute the ideal state of the cluster and how
it should be achieved.

This list of instructions is then fed to the DC (Designated
Controller).  Pacemaker centralizes all cluster decision-making by
electing one of the CRMd instances to act as a master.  Should the
elected CRMd process (or the node it is on) fail, a new one is
quickly established.

The DC carries out the PEngine's instructions in the required order by
passing them to either the LRMd (Local Resource Management daemon) or
to CRMd peers on other nodes via the cluster messaging infrastructure
(which in turn passes them on to their LRMd process).

The peer nodes all report the results of their operations back to the
DC which, based on the expected and actual results, will either
execute any actions that needed to wait for the previous one to
complete, or abort processing and ask the PEngine to recalculate the
ideal cluster state based on the unexpected results.

In some cases, it may be necessary to power off nodes in order to
protect shared data or complete resource recovery.  For this,
Pacemaker comes with STONITHd.

STONITH is an acronym for Shoot-The-Other-Node-In-The-Head and is
usually implemented with a remote power switch.

In Pacemaker, STONITH devices are modeled as resources (and configured
in the CIB) so that they can be easily monitored for failure.
However, STONITHd takes care of understanding the STONITH topology,
such that its clients simply request that a node be fenced and it does
the rest.  A sketch of such a resource appears below.
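
A minimal sketch, assuming a node reachable via a hypothetical
IPMI-based power switch and the +external/ipmi+ agent (parameter names
vary from agent to agent):

[source,XML]
----
<primitive id="fence-node1" class="stonith" type="external/ipmi">
  <instance_attributes id="fence-node1-params">
    <nvpair id="fence-node1-hostname" name="hostname" value="node1"/>
    <nvpair id="fence-node1-ipaddr" name="ipaddr" value="192.168.1.10"/>
    <nvpair id="fence-node1-userid" name="userid" value="admin"/>
  </instance_attributes>
</primitive>
----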
