High Availability with DRBD and Heartbeat (CRM)

December 16, 2008 by David

One of the huge benefits of going with Open Source solutions is the ability to get Enterprise-grade solutions for decidedly non-Enterprise costs. We have a fairly typical Web site set up with two load balancers, two Apache/Tomcat servers and two Postgres database boxes. We wanted to ensure that in the event of a failure of any of those machines, we would automatically recover and continue providing services.

Database

We have two machines, both with Postgres 8.1 installed (the latest version provided as part of CentOS 5.2). While apparently 8.3 can work in active/active mode, we decided to stick with 8.1 to reduce dependency hell with everything else on the machines, and to pair it with DRBD instead. Setup is incredibly simple – we created an /etc/drbd.conf file containing:

global {
  usage-count no;
}

common {
  protocol C;
}

resource r0 {
  device    /dev/drbd1;
  disk      /dev/LVMGroup/LVMVolume;
  meta-disk internal;

  on <node1> {
    address   <ip address1>:7789;
  }
  on <node2> {
    address   <ip address2>:7789;
  }
}

on both nodes and ran:

# drbdadm create-md r0  <-- on both nodes
# service drbd start    <-- on both nodes
# drbdadm -- --overwrite-data-of-peer primary r0  <-- on the primary node
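
The initial sync can take a while on a large volume; to keep an eye on it, a couple of standard DRBD commands (nothing here is specific to our setup):

# watch -n1 cat /proc/drbd   <-- on either node; shows connection state and sync progress
# drbdadm state r0           <-- should report Primary/Secondary once the roles are set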

This started DRBD and allowed the primary node to sync to the secondary. For more details about this (and the heartbeat configuration below), have a look at this excellent CentOS HOWTO. Next we needed to configure heartbeat to manage the automatic failover for us. Create /etc/ha.d/ha.cf on both nodes to contain:

keepalive 2
deadtime 30
warntime 10
initdead 120
bcast   eth0
node    <node1 name as per uname -n>
node    <node2 name as per uname -n>
crm yes

The /etc/ha.d/authkeys on both nodes should contain:

auth 1
1 sha1 <passwd>
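
One gotcha worth knowing: heartbeat refuses to start if authkeys is readable by anyone other than root, so lock the file down on both nodes:

# chmod 600 /etc/ha.d/authkeys   <-- on both nodes; heartbeat will not start otherwise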

This will then result in a working heartbeat. Start the heartbeat service on both nodes, wait a few minutes, and the crm_mon command will show you a running cluster:

[root@<node1> ha.d]# crm_mon
Defaulting to one-shot mode
You need to have curses available at compile time to enable console mode

============
Last updated: Tue Dec 16 14:29:03 2008
Current DC: <node2> (33b76ea8-7368-442f-aef3-26916c567166)
2 Nodes configured.
0 Resources configured.
============

Node: <node2> (33b76ea8-7368-442f-aef3-26916c567166): online
Node: <node1> (bbccba14-0f40-4b1c-bc5d-8c03d9435a37): online

Then run hb_gui to configure the resources for heartbeat. The file that this GUI configures lives in /var/lib/heartbeat/crm and is defined via XML. While I would prefer to configure it manually, I haven't worked out how to do that yet, and the hb_gui tool is very easy to use.

Using that tool, you can create a resource group for the clustered services (click on Resources and Plus and select Group). Then within that group you need to configure four resources: a virtual IP address that can be used to communicate with the primary node, a drbddisk resource, a filesystem resource for the DRBD filesystem, and a resource for postgres (a sketch of the resulting XML follows this list). To take each in turn:

  1. IP Address – click on Plus and select Native, change the resource name to ip_<groupname>, select the group it should belong to, then select IPaddr from the list and click on Add Parameter. Enter ip and a virtual IP address for the cluster. Add another parameter nic and select the interface for this to be configured against (e.g. eth0). Then click on OK.
  2. drbddisk resource – Same procedure, but this time select drbddisk instead of IPaddr and click on Add Parameter. Then enter 1 and the name of the DRBD resource created earlier (r0 in our case).
  3. filesystem – Same again, but select Filesystem and add the following parameters:
    1. device, /dev/drbd1 (in this example)
    2. directory, /var/lib/pgsql (for postgres)
    3. type, ext3 (or the filesystem you have created on it)
  4. postgres – Lastly add a postgres resource with no parameters.
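
For the curious, what hb_gui writes on your behalf is a <group> of <primitive> elements in the CIB. A rough sketch of the generated XML for the group above (all ids, names and values are illustrative, and the exact layout varies between heartbeat versions):

<group id="group_db">
  <!-- Virtual IP the clients connect to -->
  <primitive id="ip_db" class="ocf" provider="heartbeat" type="IPaddr">
    <instance_attributes id="ia_ip_db">
      <attributes>
        <nvpair id="ip_db_ip"  name="ip"  value="192.168.0.100"/>
        <nvpair id="ip_db_nic" name="nic" value="eth0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <!-- Make this node primary for the DRBD resource -->
  <primitive id="drbddisk_db" class="heartbeat" type="drbddisk">
    <instance_attributes id="ia_drbddisk_db">
      <attributes>
        <nvpair id="drbddisk_db_1" name="1" value="r0"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <!-- Mount the DRBD device; hb_gui presented this parameter to us as
       "type", but the Filesystem agent documents it as "fstype" -->
  <primitive id="fs_db" class="ocf" provider="heartbeat" type="Filesystem">
    <instance_attributes id="ia_fs_db">
      <attributes>
        <nvpair id="fs_db_device" name="device"    value="/dev/drbd1"/>
        <nvpair id="fs_db_dir"    name="directory" value="/var/lib/pgsql"/>
        <nvpair id="fs_db_fstype" name="fstype"    value="ext3"/>
      </attributes>
    </instance_attributes>
  </primitive>
  <!-- Finally start postgres itself; we added ours through hb_gui,
       but an LSB resource for the init script looks like this -->
  <primitive id="pgsql_db" class="lsb" type="postgresql"/>
</group>

In principle a fragment like this can be loaded without the GUI, e.g. cibadmin -C -o resources -x group.xml, though we have stuck with hb_gui so far.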

Start the resource group and in a few minutes everything should be started on one of the nodes. To switch over to the other, run service heartbeat stop on the active node and everything will migrate across. Good luck: you should now have an active/passive cluster for your database. This worked for us and your mileage may vary, but if you hit any issues feel free to leave a comment and we'll update this HOWTO.

Web

Creating the clustering for the Web tier was similarly easy. We kept the two web machines as they were, with Apache and Tomcat running on both, and instead clustered the load balancers in much the same way, initially in active/passive (until we can work out the active/active settings). The key difference was that for these machines we ran the load balancing software (HAProxy) on both all the time, and the cluster just looked after the IP address. That way, if the primary load balancer failed, nothing was slowed down waiting for services to start; only the IP address had to move.
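
For illustration, a stripped-down HAProxy configuration in that spirit (listener names and server addresses are invented; our real config has rather more in it). The important detail is binding to 0.0.0.0 rather than to the virtual IP, so the passive balancer is already listening when the address moves across:

# /etc/haproxy/haproxy.cfg -- illustrative sketch
global
    daemon
    maxconn 4096

defaults
    mode http
    contimeout  5000
    clitimeout  50000
    srvtimeout  50000

# Bind on all addresses so HAProxy answers no matter which node
# currently holds the cluster's virtual IP.
listen web_farm 0.0.0.0:80
    balance roundrobin
    server web1 192.168.0.11:80 check
    server web2 192.168.0.12:80 check

The heartbeat side is then just an IPaddr resource like the one in the database section, with no filesystem, DRBD or service resources in the group.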

Cheers,

David

Managing Windows Domains

October 26, 2008 by David

With the removal of Exchange and MS Office from the company, we then wanted to look at the Domain Controllers. While a lot of Open Source systems have the ability to integrate into Active Directory, it always appeared to be a little complicated and fraught with problems. However, to get Single Sign On working, you need a single directory master. Zimbra includes OpenLDAP as one of its components, so we decided to bite the bullet and migrate the Windows desktops off the Windows Domain and onto a Samba domain instead. This would give us a greatly reduced cost base as well as the same username and password within Windows, Zimbra, RT and so on.

Installing Samba was a little complicated; there are any number of HOWTOs out there to follow, and while initially there was a big learning curve, it got easier quickly. Zimbra contains a very explicit HOWTO for integration between Samba and LDAP, so while that took a little thought, it also worked without too much heartache. It also meant that integration with the various systems was a little easier to debug.

Once we had followed the HOWTOs it was a simple matter of migrating the workstations to the new domain. Samba v3 is modelled on the pre-Active Directory (NT4-style domain) technology, so the whole process is a little more manual than with the latest Windows technology. However, v4 of Samba promises to include all of that functionality, which will make things a lot less manual.

It did mean that we could now reuse three Domain Controllers (and not have to purchase Client Access Licences for all the users to connect to them) and save the subsequent licencing costs. Creation of a new user account now occurs within Zimbra, but the login scripts have to be set manually on the Samba server. However, telling Samba to obey the PAM restrictions on the server ensured that home directories would be created automatically.
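
The smb.conf side of that is only a few lines; a sketch (share paths and script names are illustrative):

# /etc/samba/smb.conf (fragment)
[global]
    security = user
    domain logons = yes
    # Run PAM's account/session checks for Samba sessions; with
    # pam_mkhomedir in the PAM stack this is what creates home
    # directories automatically on first login.
    obey pam restrictions = yes
    # Login scripts live in the netlogon share and, as noted above,
    # still have to be assigned per user by hand.
    logon script = %U.bat

[netlogon]
    path = /var/lib/samba/netlogon
    read only = yes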

The two biggest issues we encountered:

  1. Until a recent version of Zimbra, if someone changed their password within Zimbra it would not be reflected on the Windows machines. This has now been rectified via an extension, but it did require some user education.
  2. Samba inexplicably removed a feature by which we could filter the LDAP searches on active users, without replacing it with anything similar. As a result I have not been able to find a way to stop ex-employees from being able to log on just by setting their account status in Zimbra. This is just plain stupid. Instead I have to write a script to check the Zimbra status and apply the same to the Samba accounts (a sketch of that script follows this list). Really, really irritating. If we weren't saving so much money on the Windows Servers, I would seriously reconsider choosing Samba again. The workaround is such that if a user gets disabled inadvertently (e.g. they type their password in incorrectly and get locked out) then I need to re-enable them twice.
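
A sketch of that script, for anyone facing the same problem (server name, base DN and bind credentials are placeholders, and your attribute layout may differ):

#!/bin/sh
# Disable the Samba account for any Zimbra account that is not active.
# Run from cron; add -D/-w bind options to ldapsearch as required.
ldapsearch -x -H ldap://zimbra.example.com -b dc=example,dc=com \
    '(&(objectClass=zimbraAccount)(!(zimbraAccountStatus=active)))' uid |
  awk '/^uid: /{print $2}' |
  while read user; do
    # "[D]" sets the Disabled flag on the Samba account.
    pdbedit -u "$user" -c "[D]"
  done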

Leaving the negatives aside: Microsoft makes you buy Windows Server and then Client Access Licences to be able to use the server, then more licences for anyone wanting to log in to the server (and I swear you have to be a rocket scientist to understand the Terminal Services licencing documentation). It all adds up. Samba is more stable than Windows, fast, and works well, other than the inability to hook into LDAP correctly (which I did create a workaround for). Let's hope v4 works better in that respect.

Cheers,

David

Managing Customer Communications

October 22, 2008 by David

We sell our products over the Internet and our customer service is likewise managed predominantly via email. When I started at the company we had several support addresses which delivered directly to a staff member's Inbox. These were then forwarded on to individuals within the team to answer. Obviously this was an appalling state of affairs: there was no real way to track workload or whether the customers were happy with the responses, and generally no management oversight of what was happening. It also meant that there was very little consistency in responses, as everyone basically re-wrote their responses every time. Not very efficient.

As a result we had a look around at ticketing systems that we could use to manage the customer correspondence. Because we were dealing with end customers, a traditional CRM system didn't really meet our needs. We do not have a long history with most of our customers, instead dealing with product questions, returns and the like. We wanted something more lightweight, but still capable of tracking customer history if necessary and providing management reporting. Any solution we look at now also has to integrate with LDAP: Zimbra uses LDAP internally and provides extensions to allow for single sign on, which will allow us to manage user accounts centrally.

Having used Zimbra, we now had some Linux expertise in house (more on this later) and decided to trial Request Tracker. This is a Web-based tool that receives emails sent to specific addresses and delivers them to Queues set up within the system. It also has an FAQ management system (RTFM) which integrates with Request Tracker, allowing our staff to formulate standard responses to common questions; this saves time and effort. You can define arbitrary queues and groups of users, with permissions assigned to a user/group/queue combination. It is an incredibly powerful tool and, because it is written in Perl, easily extensible (if you know Perl). It supports a multitude of databases, so it is likely to support one you are already using.
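
Behind the scenes, delivery of mail into those queues is done by RT's rt-mailgate helper from your MTA's aliases file; a minimal sketch (queue name, path and URL are placeholders):

# /etc/aliases (fragment) -- run newaliases after editing
support:         "|/usr/bin/rt-mailgate --queue Support --action correspond --url http://rt.example.com/"
support-comment: "|/usr/bin/rt-mailgate --queue Support --action comment --url http://rt.example.com/"

Each alias maps one address onto one queue; correspond goes onto the customer-visible thread, while comment stays internal.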

Of course, as with a lot of Open Source systems, this is also the major problem. Because *everything* is configurable, you have to configure everything; the options are mind-numbing. Once it is set up you should not have to play with the permissions much more, but if you do, be ready to scratch your head for a while. In addition, because the system is so easily extensible, some functionality is simply not available unless you write some Perl code. For instance, it makes sense to define a single automated response rather than creating 20-30 of them, one per queue, but if you do that there is no obvious way of stopping a particular queue from using the global one, unless you write some Perl code. There are also some extremely odd artifacts in the user interface. For instance, when replying to a ticket, rather than having the recipients checked and allowing you to uncheck people who shouldn't receive the reply, everyone is unchecked and you have to explicitly click on the ones who should not receive it. A usability expert would have a field day with that one.

While RT integrates partially with LDAP, it will not take the group membership from LDAP, so you still have to manage every user in two places. This is painful. The other integration issue we have yet to address is to tie in customer tickets with our internally developed system to manage customer profiles etc. Ideally these would be linked. We plan on revisiting this when we look at ERP implementation.
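
For reference, the usual way to get even that partial integration is the RT::Authen::ExternalAuth extension; a trimmed RT_SiteConfig.pm stanza (host, base DN and attribute names are placeholders). Note there is nothing here about groups, which is exactly the gap:

# RT_SiteConfig.pm (fragment) -- RT::Authen::ExternalAuth
Set($ExternalAuthPriority, [ 'My_LDAP' ]);
Set($ExternalInfoPriority, [ 'My_LDAP' ]);
Set($ExternalSettings, {
    'My_LDAP' => {
        'type'            => 'ldap',
        'server'          => 'ldap.example.com',
        'base'            => 'ou=people,dc=example,dc=com',
        'filter'          => '(objectClass=inetOrgPerson)',
        # Which RT fields a login is matched against, and how LDAP
        # attributes map onto RT user fields.
        'attr_match_list' => [ 'Name', 'EmailAddress' ],
        'attr_map'        => {
            'Name'         => 'uid',
            'EmailAddress' => 'mail',
            'RealName'     => 'cn',
        },
    },
});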

So is RT a good solution? Well, it is certainly very powerful and, if you can live with the geekiness (and have Perl expertise in house), I would recommend it. For us, I would drop it in a minute if there were a more user-friendly solution out there. Any suggestions?

Cheers,

David