News

Repository Server (cssgate) is up

18 Oct 2006: cssgate is currently fully operational.

For the technically oriented:

This was a difficult problem to solve, and it was uncovered while attempting to create a team account. The new Windows account did not appear on cssgate, which meant that none of the team members who requested the account could use it there.

We have known for some time that Linux or Samba has trouble handling a large number of account names; we have encountered this problem before. Until now, we were always able to resolve it by other means: restarting the services, clearing the cache, reducing the number of accounts, and so on. This time, none of those alternatives worked or was feasible.

We use Samba and its winbind service to provide login account information to all non-Windows systems on which we want to allow logins. To do this, winbind has to retrieve the Windows login account and group names and map them to Linux account names and groups. There appears to be some kind of limit in the Red Hat supported version of Samba/winbind that prevented new login accounts from being made visible to Linux.
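As a rough illustration of what "visible to Linux" means here: a program can ask the system's name service whether a login name resolves at all. The sketch below uses Python's standard `pwd` module, which goes through the same system account lookup that ordinary logins rely on. It is not something we ran during the outage; the function names `account_visible` and `missing_accounts` are hypothetical, and it assumes winbind is configured as an account source on the machine where it runs.

```python
import pwd


def account_visible(name: str) -> bool:
    """Return True if the system's account database can resolve this login name."""
    try:
        pwd.getpwnam(name)  # raises KeyError when the name cannot be found
        return True
    except KeyError:
        return False


def missing_accounts(names):
    """Return the names from the input that the system cannot resolve, in order."""
    return [n for n in names if not account_visible(n)]
```

Calling `missing_accounts` with a list of newly created account names would return the ones the system cannot see, which is the symptom described above.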

We had a similar problem with a Solaris server during the summer, and compiling the latest Samba source code (which you can't do with most Microsoft products) allowed us to fix it. A similar issue occurred with Fedora Core 5 Linux on the workstations, and that was also fixed by compiling and installing the latest version. So, when nothing else worked, this held hope for a solution.

However, we did not want to break anything else in the process on cssgate, which is actually a fail-over cluster with two nodes (two computers participate in the cluster, watching each other for failure and taking control if it is detected). We also have a test cluster, brought up over the summertime. We decided to try to fix the problem on the test cluster first to minimize the impact on the production cluster.
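For readers unfamiliar with fail-over clusters, the idea can be sketched as a toy decision rule: the service stays on the active node while that node is healthy, and moves to the other node only when the active one fails and the other is healthy. This is a simplified model for illustration, not the cluster software that runs cssgate; the `Node` class, the `choose_active` function, and the node names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


def choose_active(primary: Node, secondary: Node, current: str) -> str:
    """Return the name of the node that should hold the service next."""
    nodes = {primary.name: primary, secondary.name: secondary}
    if nodes[current].healthy:
        return current  # no fail-over while the active node is healthy
    # The active node has failed: move to the other node only if it is healthy.
    other = secondary if current == primary.name else primary
    return other.name if other.healthy else current
```

Note that in this model a secondary that cannot take over leaves the service where it is, so a planned fail-over is only as safe as the standby node's ability to participate.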

Unfortunately, the test system also failed, even with the latest source code. Hours later, we determined that the cause was a configuration problem: the PAM (Pluggable Authentication Modules) configuration for the ssh service. Once that was configured properly, everything worked, including seeing the new accounts. Installing the new version on the secondary node of the production cluster (the one that is not active) worked as well and allowed logins, so it was then installed on the primary. Normally, we would fail over from the primary with the old software to the secondary with the new software, since we knew it worked, but we encountered another problem: the secondary was not able to join the cluster. So, with that experience under our belts, we decided to install directly on the live system, since no one else was able to log in anyway. The system became operational shortly after midnight on 18 Oct 2006.

Today, we were busy finalizing the changes (e.g., new commands are available to mount Windows home directories), recording what we did, getting the secondary node operational again, creating team accounts, and opening holes in the firewall for class projects. We think we are now running a much more robust version of Samba/winbind, and that we won't have to fix this problem again for a long time. However, we may eventually fail the cluster over, which would cause an interruption of service.

