
Login Nodes with Linux Virtual Server

Motivation

According to the usual HPC architecture, a cluster is (basically) composed of a frontend (or master node) and compute nodes. In this simplified layout, the frontend plays the role of the login node as well as the storage node, in addition to its natural role as the head of the cluster. For large clusters, this layout may limit system scalability when compute nodes perform heavy I/O workloads, or it may compromise system stability when users execute buggy applications (with memory leaks or excessive load demands). As a direct consequence, the frontend's performance may degrade dramatically, eventually leading to a system crash. For this reason, a more sophisticated layout should be used when designing large clusters: one where the login and storage services are taken away from the frontend and allocated on dedicated servers. This layout increases system robustness, since high I/O workloads do not affect users directly and user misbehaviour does not affect the cluster's operational services. A further consequence of this approach is the potential scalability of the cluster's login and storage services: by adding more login servers, the cluster is able to serve more users; and by adding more storage servers, the cluster can deliver higher I/O rates to applications. These benefits are quite interesting for large clusters, but obtaining them requires some extra work. It is not as easy as simply adding more servers! Several issues must be addressed before implementing this layout, such as DNS configuration, ARP problems, load balancing, dynamic routing, and scheduling for allocation and failover of servers.

In this article, we discuss the implementation of a scalable login service for HPC clusters. It is valid for a single login node as well as for many login nodes, in which case a failover or load-balancing service can be implemented by means of scheduled allocation rules, which can be enforced to provide a better service level to users.

In the following, we describe the initial naming conventions (DNS) and address layout in order to present the problems that arise when moving the login service away from the frontend. Then, we analyze these problems and propose a solution based on the Linux Virtual Server.

Rationale of the Problem

Let us define a cluster name, for instance, our development cluster syntagma.cmm.uchile.cl. We have two ways of publishing this name to the Internet. First, in an ideal naming configuration, the cluster fully qualified domain name (FQDN) would be a delegated sub-domain (a slave zone hosted on the frontend) of the master DNS zone (cmm.uchile.cl for our cluster name). In other words, the syntagma.cmm.uchile.cl DNS zone would answer queries delegated from the master DNS zone for syntagma nodes (such as the login nodes or the frontend itself). The second way is to have the cluster FQDN, the frontend and the login nodes individually registered in the DNS with different IPs, via individual "A" records in the master DNS table (cmm.uchile.cl in our case). With either of these two options, let us assume 2 login nodes besides the frontend node. The DNS table should answer the following queries:

syntagma.cmm.uchile.cl->IP1
frontend.syntagma.cmm.uchile.cl->IP2
login-1.syntagma.cmm.uchile.cl->IP3
login-2.syntagma.cmm.uchile.cl->IP4

If we assume IP1 and IP2 are assigned physically to the frontend server, and IP3 and IP4 to the login servers, the only way to access the login nodes is by their FQDN. For instance:

# ssh login-1.syntagma.cmm.uchile.cl

or

# ssh login-2.syntagma.cmm.uchile.cl

A user who accesses syntagma.cmm.uchile.cl will reach the frontend node.

This naming configuration presents several drawbacks: (i) there is no control over the load of each login server, since the user is responsible for choosing which one to connect to; (ii) the name of the cluster is hidden from users, since individual host names are used when accessing the cluster; and (iii) there is no friendly (and easy) way to inform users when a login node is "down". The only way is to check the cluster's monitoring web page to observe the login node state, or to receive a "host unreachable" error message when trying to access a login node.

A Proposed Solution Using Linux Virtual Server (LVS)

The Linux Virtual Server (http://www.linuxvirtualserver.org), a.k.a. LVS, is a kernel-level implementation of load-balancing services based on layer 4 (OSI) switching. It is very stable and fast when switching connections among servers. In a nutshell, its architecture defines a "director" server which receives a (TCP) connection request and forwards it to a pool of "real servers" according to a scheduling policy. In a cluster layout, the frontend plays the "director" role and the login nodes are defined as "real servers". In this way, users apparently access the cluster by its name, "syntagma.cmm.uchile.cl" (the director), but in reality they are accessing a login node (a real server). LVS defines three ways to forward connections: LVS-NAT, LVS-DR and LVS-TUN. The first one (LVS-NAT) uses Network Address Translation (NAT) to forward the incoming connection to a real server. In a cluster layout, all the traffic between the client and the login node passes through the frontend, which may become a bottleneck when concurrent users copy data to/from the login nodes. The second one (LVS-DR, direct routing) requires public IPs for all real servers, so the director only routes the incoming connection. When this approach is applied to the cluster architecture, the frontend is not required to be the gateway for the login nodes (as in the previous approach), so the connection between clients and login nodes can be divided into two streams: the incoming stream passing through the frontend, and the outgoing stream routed directly to users. This approach also leaves users the option of accessing the login nodes directly for special purposes. The last one (LVS-TUN) requires an IP-over-IP tunnel between the director and the real servers. Its main advantage is that the real servers may be on a different network from the Linux director. Since this is not the case in the cluster architecture, we will not analyse this approach further.

In this article, we propose to use LVS-DR to implement the login service, for two reasons: (i) it handles the outgoing traffic directly from the login nodes to the clients, which helps to offload traffic from the frontend server, relying instead on the layer 2 switching capacity where the login nodes are connected to the Internet; and (ii) it leaves the option of accessing the login nodes directly from "outside" the cluster, which gives the whole solution an extra degree of freedom.

Requirements

The requirements for implementing the proposed login service for HPC clusters are the following:

  1. Cluster naming handled by a DNS server hosted on the frontend.
  2. One (1) public IP for each login node.
  3. One IP for the virtual service (we assume the frontend already has one).
  4. A kernel with Linux Virtual Server (LVS) support (a quick check is sketched below).
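
As a quick check for requirement 4, one can verify that the running kernel and userland provide LVS support. The following sketch assumes a stock CentOS 6 system, using the standard ip_vs module and the ipvsadm package.

# Check that the kernel was built with IPVS support
grep -i CONFIG_IP_VS /boot/config-$(uname -r)

# Load the module and confirm it is present
modprobe ip_vs
lsmod | grep ip_vs

# The ipvsadm userland tool is also required
yum install -y ipvsadm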

Implementation

The implementation described in this article is based on Rocks Cluster version 6.0 (Mamba) and CentOS 6.2, with kernel 2.6.32-220.4.2.el6 compiled with LVS support.

Starting Configuration

We assume the public interface is always eth1 and that eth0 is the private (or service) network of the cluster. The following table illustrates the naming convention and IP addressing for the nodes.

FQDN                             dev    IP
frontend.syntagma.cmm.uchile.cl  eth1   PUB2
login-1.syntagma.cmm.uchile.cl   eth1   PUB3
login-2.syntagma.cmm.uchile.cl   eth1   PUB4

The first step is to add a sub-interface (or alias) eth1:1 on the frontend with the LVS IP (PUB1) and a /32 netmask (255.255.255.255). The second step is to add the IP PUB1 to all login nodes on the lo:1 loopback sub-interface, so that they do not discard the incoming connections addressed to it. It is very important not to forget to modify the ARP behaviour on the login nodes in order to avoid ARP conflicts between the frontend and the login nodes. Once these steps are complete, the LVS service is configured on the frontend, and voila!
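
For illustration only, the equivalent plain iproute2 commands are sketched below; the persistent configuration is done later with the rocks commands, and PUB1 is the same placeholder IP used throughout this article.

# On the frontend: add the LVS virtual IP as an alias of the public interface
ip addr add PUB1/32 dev eth1 label eth1:1

# On each login node: add the same IP on a loopback alias so that packets
# addressed to PUB1 are accepted instead of being discarded
ip addr add PUB1/32 dev lo label lo:1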

DNS

Rocks Clusters already ships with a built-in DNS server which is configured to manage the addressing of the cluster (in addition to the /etc/hosts file). This configuration is automatically created by the rocks command (rocks sync dns). Therefore, any DNS changes must be made in the /etc/named.conf.local file, which is included at the end of the automatically generated named.conf file.

The named.conf.local file includes the cluster domain definition; in our example, the syntagma.cmm.uchile.cl domain (zone). The file defining the domain DNS table is located at /etc/named/syntagma.cmm.uchile.cl.zone. To protect the DNS server from unauthorized zone transfers, we allow only the "cmm.uchile.cl" DNS server (IP_DNS_MASTER) to transfer the zone.

zone "syntagma.cmm.uchile.cl" {
    type master;
    file "/etc/named/syntagma.cmm.uchile.cl.zone";
    notify yes;
    allow-transfer { IP_DNS_MASTER; };
};
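
For reference, a minimal sketch of what /etc/named/syntagma.cmm.uchile.cl.zone could contain is shown below; the serial, TTL values and the PUBx placeholders are illustrative, and the cluster name itself points to the LVS virtual IP (PUB1).

$TTL 86400
@         IN SOA frontend.syntagma.cmm.uchile.cl. root.syntagma.cmm.uchile.cl. (
              2012060101 ; serial (illustrative)
              3600       ; refresh
              900        ; retry
              604800     ; expire
              86400 )    ; minimum TTL
          IN NS  frontend.syntagma.cmm.uchile.cl.

; the cluster name resolves to the LVS virtual IP
@         IN A   PUB1
frontend  IN A   PUB2
login-1   IN A   PUB3
login-2   IN A   PUB4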

In order to allow the frontend to answer queries for this domain, the master domain cmm.uchile.cl must delegate this zone to the frontend. Hence, the master DNS server should include the following zone in its named.conf file.

zone "syntagma.cmm.uchile.cl" {
    type slave;
    file "/wherever_the_zones_are/syntagma.cmm.uchile.cl.zone";
    masters {
       PUB1;
    };
    allow-update { PUB1; };
};
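
Once the delegation is in place, it can be checked from any external host with standard DNS tools, for example (a sketch using dig):

# Resolve a login node through the normal resolution path
dig login-1.syntagma.cmm.uchile.cl A +short

# Query the frontend's DNS server directly
dig @PUB1 login-2.syntagma.cmm.uchile.cl A +short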

Frontend/Director

First, we need to create the sub-interfaces as stated in the previous section. To do so, we use the rocks commands in order to keep the cluster configuration synchronized. We also need to overwrite the default route for the login nodes; for this, we assume the gateway for the public IPs is PUB_GW.

rocks add network lvs PUB1 255.255.255.255 mtu=1500 servedns=false
rocks add host interface syntagma iface=eth1:1 ip=PUB1 subnet=lvs
rocks add host interface login-1 iface=lo:1 ip=PUB1 subnet=lvs
rocks add host interface login-2 iface=lo:1 ip=PUB1 subnet=lvs
rocks add host route login-1 0.0.0.0 PUB_GW netmask=0.0.0.0
rocks add host route login-2 0.0.0.0 PUB_GW netmask=0.0.0.0
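
Before synchronizing, the configuration stored in the Rocks database can be reviewed with the standard list commands (a sketch, assuming Rocks 6 command names):

rocks list host interface syntagma
rocks list host interface login-1
rocks list host route login-1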

Then, we create the LVS service using LVS-DR and the lblc (locality-based least-connection) scheduling policy.

ipvsadm -A -t PUB1:22 -s lblc
ipvsadm -a -t PUB1:22 -r PUB3:22 -g
ipvsadm -a -t PUB1:22 -r PUB4:22 -g
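
The resulting virtual service table can then be inspected, and (optionally) saved so that it survives a reboot; the following is a sketch based on the standard CentOS ipvsadm tooling.

# List the virtual service and its real servers (numeric output)
ipvsadm -L -n

# Optionally persist the table and enable it at boot
service ipvsadm save
chkconfig ipvsadm on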

By activating LVS, the frontend's own SSH service will be "shadowed". So, in order to allow SSH access to the frontend, we modify the /etc/ssh/sshd_config file to serve SSH on another port, say 6969, and we implement a redirection rule so that SSH connections made from within the frontend to its private IP are redirected to the new port. This is important to allow the rocks sync config command to work. The private frontend IP is assumed to be PRIV1.

ACCEPT      all     fw      tcp 6969    -
REDIRECT    $FW     6969    tcp 22      -   PRIV1
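
For completeness, the corresponding change on the frontend side is a single directive in the SSH daemon configuration, followed by a restart of the daemon (a sketch):

# In /etc/ssh/sshd_config on the frontend, set:
#     Port 6969
# then restart the daemon:
service sshd restart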

To finish the frontend configuration, we need to synchronize the network configuration on the frontend and on all the login nodes.

rocks sync config
rocks sync host network syntagma
rocks sync host network login-1
rocks sync host network login-2 

Login Nodes (Real Servers)

As mentioned before, we need to modify the ARP resolution behaviour on the login nodes in order to avoid a resolution conflict between the frontend and each login node. This problem comes from the use of the IP PUB1 on both the frontend and the login nodes. To do so, add the following lines to /etc/sysctl.conf, so that the login nodes do not announce or answer ARP queries for PUB1 on the loopback adapter or on any other interface.

# Disable ARP for LVS
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.lo.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
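
These settings can be applied immediately, without rebooting the login nodes:

# Reload /etc/sysctl.conf on each login node
sysctl -p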

Verify that each login node has its default route via PUB_GW by issuing the "ip route show" command on each of them. It is also advisable to check access to the login nodes from outside the cluster via their respective public IPs. This can be done by using their FQDNs, which should be published by the DNS server at the frontend node.
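
A quick verification, run on a login node and from an external client respectively, could look like the following sketch ("user" is a placeholder account):

# On each login node: the default route should point to PUB_GW
ip route show | grep default

# From outside the cluster: direct access to a login node by its FQDN
ssh user@login-1.syntagma.cmm.uchile.cl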

Testing and Performance

In this section we outline how to verify and test the LVS service, as well as how to evaluate its performance in order to assess the impact of LVS on the throughput of large file transfers via SSH (TCP). The tools required to walk through this section are: ssh, iperf, tshark and iptraf.

Operational Testing
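
A minimal sketch of an operational check: successive connections to the cluster name should land on a login node chosen by the LVS scheduler, which can be observed by printing the hostname of the node actually reached ("user" is a placeholder account).

# Each connection to the virtual IP should report the hostname of a login node
for i in 1 2 3 4; do
    ssh user@syntagma.cmm.uchile.cl hostname
done

# On the frontend, the connection counters per real server can be
# observed while the test runs
ipvsadm -L -n --stats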

 

Performance Testing
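
As a sketch of the kind of measurement we have in mind, raw TCP throughput directly to a login node (measured with iperf as a baseline) can be compared against a large file transfer over SSH through the virtual IP, which exercises the LVS-DR path; hostnames, durations and file names below are illustrative.

# On the login node: start an iperf server
iperf -s

# From an external client: baseline raw TCP throughput to the login node
iperf -c login-1.syntagma.cmm.uchile.cl -t 30

# Large file transfer over SSH through the virtual IP (LVS-DR path) and
# directly to a login node, for comparison
scp bigfile.dat user@syntagma.cmm.uchile.cl:/tmp/
scp bigfile.dat user@login-1.syntagma.cmm.uchile.cl:/tmp/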