Repository Server Is Now Clustered
Cormac has successfully installed and configured the Kimberlite clustering
software on Linux, which we hope will increase the availability of the
repository server (cssgate). The cluster works through software and hardware
that track the state of both computers participating in this failover
cluster. Both computers share a common data store.
Normally, we are running on the primary node (called turing). If it fails
due to a hardware problem, the secondary node (called vonneumann) should
detect this, turn off turing's power to ensure data integrity for the
shared data, and assume control of the 128.208.250.8 IP address and the
domain name(s) associated with it (e.g., cssgate.tacoma.washington.edu).
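For the curious, here is a minimal sketch of how you could check which
physical node is currently answering as cssgate after you log in. It is not
part of the cluster software itself, and it assumes each node keeps its own
hostname (turing or vonneumann) even while serving the shared address.

    import socket

    # Confirm that the shared cluster address still resolves and report
    # which physical node is currently answering as cssgate.
    shared_addr = socket.gethostbyname("cssgate.tacoma.washington.edu")
    node = socket.gethostname()

    print("cssgate resolves to:", shared_addr)  # expected: 128.208.250.8 before and after a failover
    print("currently served by:", node)         # turing normally, vonneumann after a failover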
From your point of view, the system will appear to have crashed: you will
lose whatever you were working on, all running processes will be killed,
and none of your tomcat installations will be remembered. This is because
the primary node (turing) really has crashed.
However, within about five minutes
you can log in to cssgate again, recover from the failover (e.g., re-install
your tomcat/JSP project), and continue working. This is because
the old secondary node (vonneumann) is now functioning as cssgate, and
is therefore now the primary node.
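As a quick way to tell whether you need to do that recovery step, a small
sketch along these lines could probe your personal Tomcat instance. The port
number (8080) is only a placeholder; use whatever port your own installation
was assigned.

    import urllib.request

    # Probe a personal Tomcat instance on cssgate; port 8080 is a
    # placeholder for the port assigned to your own installation.
    url = "http://cssgate.tacoma.washington.edu:8080/"
    try:
        urllib.request.urlopen(url, timeout=5)
        print("Tomcat is answering; your project may still be deployed.")
    except OSError:
        print("Tomcat is not answering; re-install your tomcat/JSP project.")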
When we are on-site, we will investigate the cause of the crash, fix it,
and eventually bring the failed node (in this case, turing) back up to
participate in the cluster. If the primary node fails again later, the
cluster will switch to the secondary node, we will investigate again, and
so on. If we can do so with little impact, we may force a failover after
the investigation so that turing is always the primary.
The idea is to let you continue working --
although not transparently -- until we get a chance to investigate the
source of the failure, which may occur when the lab is not
staffed (e.g., nights, weekends and holidays).
We do know of a couple of problems with the clustering that may reduce the
high availability we had anticipated; consequently, we do not consider this
cluster to be rock-solid. We do think it is better than relying on a single
computer whose hardware has failed in the past. Please be patient with us
as we try to diagnose and solve these problems.
Over the next few months we will attempt to improve availability even
further by removing the single point of failure for the shared data and by
adopting a new data storage technology called iSCSI (if it works!).