Wednesday, February 27, 2008

Lustre File System at a glance

Lustre is a very fast, multi-network, scalable cluster filesystem for Linux. It uses Object-Based Storage Devices (OBDs) to manage entire file objects (that is, inodes) instead of blocks. Its main components/sub-systems are Meta Data Servers (MDSs), Object Storage Targets (OSTs), and Lustre clients.

Lustre can be built from either a pre-packaged release or freely-available source code. Installing Lustre from a pre-existing package is quite straight-forward. Integrating Lustre into an existing kernel and building the associated Lustre software can be somewhat involved.

Lustre has been extensively tested with a selected set of distributions on i386, ia64, and x86_64 architectures, and also tested on PowerPC/PowerPC64 machines.
Lustre installation/configuration requires the following:

* Linux kernel patched with Lustre-specific patches
* Lustre modules compiled for the above kernel
* Lustre utilities required for configuration
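To check whether these pieces are already in place on an RPM-based system, here is a minimal sketch (the package naming is only indicative):

# Show the running kernel (a Lustre-patched kernel may carry a Lustre-specific suffix)
uname -r
# List any installed Lustre packages
rpm -qa | grep -i lustre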

Lustre Networking (LNET) software provides a network abstraction layer that simplifies using Lustre across multiple network types. LNET and Lustre's administrative utilities require the following packages:

  1. readline: Install the -devel version of this package, which includes the header files.

Lustre releases prior to version 1.6 also require:

  1. libxml2: Install the -devel version of this package.
  2. Python: http://www.python.org
  3. PyXML: http://pyxml.sourceforge.net/
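As an illustration, on an RPM-based distribution these prerequisites might be installed as follows (the package names are assumptions and differ between distributions and Lustre versions):

# Package names vary by distribution; the -devel packages supply the headers
yum install readline-devel libxml2-devel python PyXML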

Several advanced Lustre functions, such as server failover, require packages and configuration outside of the scope of this document.

The following RPMs/tarballs are available on the download site for every Lustre release:

  1. kernel-smp-.rpm - Lustre-patched Linux kernel RPM; use with matching Lustre Utilities and Lustre Modules package below.
  2. lustre-modules-.rpm - The Lustre kernel modules for the above kernel.
  3. lustre-.rpm - Lustre Utilities - userspace utilities for configuring and running Lustre; please use only with matching kernel RPM, above.
  4. kernel-source-.rpm - Lustre-patched Linux kernel source RPM - companion to the kernel package; not required to build or use Lustre
  5. lustre-source-.rpm - Contains the Lustre source code (which includes the kernel patches), not required to build or use Lustre.

Lustre requires a number of patches to the core Linux kernel, mostly to export new functions, add features to ext3, and add a new locking path to the VFS. You can either patch your kernel with patches from the Lustre source tarball or download a pre-patched kernel RPM along with the matching lustre-lite-utils RPM.

There are two methods to install the Lustre software:

  1. Install using RPMs (easy, but you must use a CFS kernel)
  2. Install from source code (if you need your own kernel)

Using a Pre-patched Kernel RPM with Matching Lustre-utils RPM
  1. Install the kernel RPM kernel-smp.rpm
  2. Install the Lustre modules lustre-modules-.rpm
  3. Install the utilities RPM lustre-.i386.rpm
  4. Update lilo.conf or grub.conf to boot your new kernel
  5. Reboot
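A command-level sketch of the steps above, using placeholder package filenames (substitute the exact RPMs downloaded for your release):

# Filenames below are placeholders
rpm -ivh kernel-smp-<version>.rpm
rpm -ivh lustre-modules-<version>.rpm
rpm -ivh lustre-<version>.i386.rpm
# Then update lilo.conf or grub.conf to boot the new kernel, and reboot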

The source code tarball for Lustre is located at: download site. As described above, Lustre requires a Linux kernel patched with some Lustre patches. A Lustre-patched kernel can be obtained in two ways:

  1. Patching a core kernel with the relevant kernel patches from this source code tarball following instructions at: kernel patch management wiki.
  2. Installing the pre-patched kernel RPM located at: download site

In general, CFS recommends that you obtain the pre-patched Lustre kernel RPM from download site. If you want to patch your own kernel:

1. Install the quilt utility for patch management.

2. Save a copy of the .config file from your kernel source (for example, as save-dot-config), as it will be deleted in this process!

3. Refer to the lustre/kernel_patches/which_patch file to determine which patch series best fits your kernel.

The example below shows a sample which_patch file:
SERIES            MNEMONIC                  COMMENT
vanilla-2.4.24    linux-2.4.24              patch includes UML
rhel-2.4.21       linux-rhel-2.4.21-15      RHEL3 2.4.21
2.6-suse          linux-2.6.5               SLES9 2.6.5

4. Unsupported kernels may have patches at the Lustre Users Group.

5. Follow the instructions in KernelPatchManagement to patch the kernel. Once the kernel is patched, rebuild it as follows (a combined sketch of the patch-and-rebuild sequence appears after these steps):

a. make distclean (this step is important!); make sure that jbd and ext3 support is included in the kernel or as modules

b. cp save-dot-config .config

c. make oldconfig dep bzImage modules (make sure that preemption is disabled (in 2.6), and modules are enabled)

d. update lilo.conf or grub/menu.lst

e. Reboot with the new kernel.
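Putting the patching and rebuild steps together, here is a hedged sketch of the whole sequence (the kernel and Lustre source paths and the series name are examples only; KernelPatchManagement remains the authoritative reference):

# Paths and series name are examples; pick the series that which_patch recommends
cd /usr/src/linux
cp .config /tmp/save-dot-config          # step 2: save the existing config
ln -s /path/to/lustre/lustre/kernel_patches/series/<series-from-which_patch> series
ln -s /path/to/lustre/lustre/kernel_patches/patches patches
quilt push -av                           # apply every patch in the series
make distclean
cp /tmp/save-dot-config .config
make oldconfig bzImage modules           # add "dep" only for 2.4 kernels
# Update lilo.conf or grub/menu.lst, then reboot with the new kernel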

CFS suggests that you download the tarball for Lustre source code and extract the contents as in the following example (this example assumes that you are using Lustre v1.0):

src$ tar zxvf lustre-.tar.gz

The following are sample build instructions for Lustre:

Note: If you are using a CVS checkout, you will need to run the following to create the configure script (automake version 1.5 or greater is necessary):

lustre$ sh autogen.sh

If you are building Lustre against a pre-patched kernel-source RPM for an already-built kernel RPM, there are a few preparatory steps, shown below. If you have already built the kernel yourself, you can skip them.

[linux]$ cp /boot/config-`uname -r` .config
[linux]$ make oldconfig || make menuconfig
# For 2.6 kernels
[linux]$ make include/asm
[linux]$ make include/linux/version.h
[linux]$ make SUBDIRS=scripts
# For 2.4 kernels
[linux]$ make dep

Now to build the Lustre RPMs:

$ cd /your/source/for/lustre
[lustre]$ ./configure --with-linux=/your/patched/kernel/sources
[lustre]$ make rpms

Patchless client

The Lustre client (only!) can be built against an unpatched 2.6.15-16 kernel. This results in some small performance losses, but may be worthwhile to some users for maintenance reasons. The Lustre configure script will automatically detect the unpatched kernel and disable building the servers.

[lustre]$ ./configure --with-linux=/unpatched/kernel/source


A Lustre system consists of three types of subsystems - Clients, a Metadata Server (MDS), and Object Storage Targets (OST's). All of these can co-exist on a single system or run on different systems. A Lustre client system can also optionally contain a Logical (Object) Volume manager (LOV) that can transparently manage several OST's; this component is required for file striping.

It is possible to set up the Lustre system in many different configurations using the administrative utilities provided with Lustre. Lustre includes some sample scripts in the /usr/src/lustre-1.4.7/lustre/tests directory on a system where Lustre has been installed (or the lustre/tests subdirectory of a source code installation) that enable quick setup of some simple, standard configurations.

Note - if your distribution does not contain these examples, skip down to the "Using Supplied Configuration Tools" section. Verify your mounted system as shown below.

This section provides an overview of using these scripts to set up a simple Lustre installation.

1. Single System Test using the llmount.sh script:

The simplest Lustre installation is a configuration where all three subsystems execute on a single node. You can execute the script llmount.sh, located in the /usr/src/lustre-1.4.7/lustre/tests/ directory, to set up, initialize, and start the Lustre file system on a single-node system. This script first executes a configuration script identified by the 'NAME' variable. That configuration script uses the lmc utility to generate an XML configuration file, which is then used by the lconf utility to do the actual system configuration. The llmount.sh script then loads all of the modules required by the specified configuration.

Next, the script creates small loopback filesystems in /tmp for the server nodes. You can change the size and location of these files by modifying the configuration script.

Finally, the script mounts the Lustre file system at the mount-point specified in the initial configuration script; the default is /mnt/lustre. The llmount.sh script is mostly useful for initial testing because it hides many of the background steps needed to configure Lustre; it is not intended to be used as a configuration tool for production installations. The following are all of the steps needed to configure and test Lustre on a single system:

a. Starting the System: Two initial configuration scripts are provided for a single system test. Any changes to the loopback filesystem locations or sizes, or to the Lustre filesystem mountpoint, have to be made in these scripts.

  i. local.sh: This script contains lmc commands that generate an XML (local.xml) configuration file for a system with a single MDS, OST, and Client.

  ii. lov.sh: This script contains lmc commands that generate a configuration file with an MDS, LOV, two OST's and a Client.

b. Execute the llmount.sh script as shown below, specifying setup based on either local.sh or lov.sh:

NAME={local|lov} sh llmount.sh
Sample output from executing this command on a test system looks like the following:

# NAME=local sh llmount.sh
config.portals ../utils/../portals
loading module: portals srcdir ../utils/../portals devdir libcfs
loading module: ksocknal srcdir ../utils/../portals devdir knals/socknal
loading module: obdclass srcdir ../utils/.. devdir obdclass
loading module: ptlrpc srcdir ../utils/.. devdir ptlrpc
loading module: ldlm srcdir ../utils/.. devdir ldlm
loading module: ost srcdir ../utils/.. devdir ost
loading module: fsfilt_ext3 srcdir ../utils/.. devdir obdclass
loading module: obdfilter srcdir ../utils/.. devdir obdfilter
loading module: mds srcdir ../utils/.. devdir mds
loading module: osc srcdir ../utils/.. devdir osc
loading module: mdc srcdir ../utils/.. devdir mdc
loading module: llite srcdir ../utils/.. devdir llite
The GDB module script is in /r/tmp/ogdb-localhost.localdomain
NETWORK: NET_localhost_tcp NET_localhost_tcp_UUID tcp localhost 988
LDLM: ldlm ldlm_UUID
OSD: ost1 ost1_UUID obdfilter /tmp/ost1 200000 ext3 no 0
MDSDEV: mds1 mds1_UUID /tmp/mds1 ext3 no
OSC: OSC_localhost.localdomain_ost1_MNT_localhost 2dd80_OSC_localhost.local_6c1af22326 ost1_UUID
MDC: MDC_localhost.localdomain_mds1_MNT_localhost 4d135_MDC_localhost.local_771336bde1 mds1_UUID
MTPT: MNT_localhost MNT_localhost_UUID /mnt/lustre mds1_UUID ost1_UUID
#

You can then verify that the file system has been mounted from the output of df:

# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/ubd/0             1011928    362012    598512  38% /
/dev/ubd/1             6048320   3953304   1787776  69% /r
none                    193712     16592    167120  10% /mnt/lustre

#

Note: The output of the df command following the output of the script shows that a Lustre filesystem has been mounted on the mount-point /mnt/lustre. The actual output of the script included with your Lustre installation may differ due to enhancements or additional messages, but should resemble the example.

You can also verify that the Lustre stack has been set up correctly by observing the output of find /proc/fs/lustre:

# find /proc/fs/lustre

/proc/fs/lustre
/proc/fs/lustre/llite
....
/proc/fs/lustre/ldlm/ldlm/ldlm_canceld/service_stats
/proc/fs/lustre/ldlm/ldlm/ldlm_cbd
/proc/fs/lustre/ldlm/ldlm/ldlm_cbd/service_stats

#

Note: The actual output may depend on what modules are being inserted and what OBD devices are being instantiated. Also note that the filesystem statistics presented from /proc/fs/lustre are expected to be the same as those obtained from df.
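As a hedged example, the client-side space statistics can be compared against df directly (the exact llite subdirectory name depends on the mount instance, so the wildcard path below is an assumption):

# Compare client statistics under /proc with df (path is illustrative)
cat /proc/fs/lustre/llite/*/kbytesfree
df -k /mnt/lustre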

c. Bringing down a cluster and cleanup using script llmountcleanup.sh:

Cleanup and unmounting of the filesystem can be done as shown below:

NAME= sh llmountcleanup.sh

d. Remounting the Filesystem using script llrmount.sh:

Remounting can be done with the llrmount.sh script as shown below. Using llmount.sh again will reformat the devices, so llrmount.sh should be used if you want to keep the data in the filesystem:

NAME= sh llrmount.sh

As described in earlier sections, Lustre uses clients, a metadata server, and object storage targets. It is possible to set up Lustre on either a single system or on multiple systems. The Lustre distribution comes with utilities that make it easy to create configuration files and set up Lustre for various configurations. Lustre uses three administrative utilities - lmc, lconf, and lctl - to configure nodes for any of these topologies. The lmc utility creates XML configuration files that describe a configuration. The lconf utility uses the information in such a configuration file to invoke the low-level configuration utility lctl, which actually configures the systems. Further details on these utilities can be found in the man pages. The complete configuration for the whole cluster should be kept in a single file; the same file is then used on all the cluster nodes to configure the individual nodes.

The next few sections describe the process of setting up a variety of configurations.

Note: you can use "lconf -v" to show more verbose messages when running other lconf commands.

Important: You must use fstype = ext3 for Linux 2.4 kernels, and fstype = ldiskfs for 2.6 kernels. (In 2.4, Lustre patches the ext3 driver, and in 2.6 provides its own.)

1. Client, MDS, and two OSTs on a single node:

This is a simple configuration script where the client, MDS, and the OSTs are running on a single system. The lmc utility can be used to generate a configuration file for this as shown below. All the devices in the script below are shown to be loopback devices, but you can specify any device here. The size option is required only for the loopback devices; for others the utility will extract the size from the device parameters. (See Using Real Disks below).

#!/bin/sh

# local.sh

# Create node
rm -f local.xml
lmc -m local.xml --add node --node localhost
lmc -m local.xml --add net --node localhost --nid localhost --nettype tcp

# Configure MDS
lmc -m local.xml --format --add mds --node localhost --mds mds-test --fstype ext3 --dev /tmp/mds-test --size 50000

# Configure OSTs
lmc -m local.xml --add lov --lov lov-test --mds mds-test --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m local.xml --add ost --node localhost --lov lov-test --ost ost1-test --fstype ext3 --dev /tmp/ost1-test --size 100000
lmc -m local.xml --add ost --node localhost --lov lov-test --ost ost2-test --fstype ext3 --dev /tmp/ost2-test --size 100000

# Configure client
lmc -m local.xml --add mtpt --node localhost --path /mnt/lustre --mds mds-test --lov lov-test

When this script is run, these commands create a local.xml file describing the specified configuration. The actual configuration can then be applied using the following commands:

# Configuration using lconf
$ sh local.sh
$ lconf --reformat local.xml

This command loads all the required Lustre and Portals modules, and also does all the low-level configuration of every device using lctl. The --reformat option is essential at least the first time, to initialize the filesystems on the OST's and MDS's. If it is used on any subsequent attempt to bring up the Lustre system, it will re-initialize (and therefore erase) the filesystems.
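On subsequent startups, omit --reformat so that the existing data is preserved; a minimal sketch:

# Subsequent startups: no --reformat, so the existing filesystems are kept
$ lconf local.xml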

2. Multiple Nodes

Lustre can also be set up on multiple systems, with the client on one or more systems, the MDS on another system, and the OSTs on yet other nodes. The following is an example of a configuration script that could be used to create such a setup. In the examples below, replace node-* with the hostnames of real systems. The servers, the clients, and the node running the configuration script all need to be able to resolve those hostnames into IP addresses via DNS or /etc/hosts. One common problem with some Linux setups is that the hostname is mapped in /etc/hosts to 127.0.0.1, which leaves the clients unable to communicate with the servers.
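A minimal /etc/hosts sketch that avoids this problem might look like the following (the addresses are placeholders; each node's own hostname must resolve to its real network address, not 127.0.0.1):

# Example /etc/hosts entries - the addresses are placeholders
192.168.1.10   node-mds
192.168.1.11   node-ost1
192.168.1.12   node-ost2
192.168.1.13   node-ost3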

#!/bin/sh

# config.sh

# Create nodes
rm -f config.xml
lmc -m config.xml --add net --node node-mds --nid node-mds --nettype tcp
lmc -m config.xml --add net --node node-ost1 --nid node-ost1 --nettype tcp
lmc -m config.xml --add net --node node-ost2 --nid node-ost2 --nettype tcp
lmc -m config.xml --add net --node node-ost3 --nid node-ost3 --nettype tcp
lmc -m config.xml --add net --node client --nid '*' --nettype tcp

# Configure MDS
lmc -m config.xml --add mds --node node-mds --mds mds-test --fstype ext3 --dev /tmp/mds-test --size 50000

# Configure OSTs
lmc -m config.xml --add lov --lov lov-test --mds mds-test --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m config.xml --add ost --node node-ost1 --lov lov-test --ost ost1-test --fstype ext3 --dev /tmp/ost1-test --size 100000
lmc -m config.xml --add ost --node node-ost2 --lov lov-test --ost ost2-test --fstype ext3 --dev /tmp/ost2-test --size 100000
lmc -m config.xml --add ost --node node-ost3 --lov lov-test --ost ost3-test --fstype ext3 --dev /tmp/ost3-test --size 100000

# Configure client (this is a 'generic' client used for all client mounts)
lmc -m config.xml --add mtpt --node client --path /mnt/lustre --mds mds-test --lov lov-test


# Generate the config.xml (once). Put the file in a place where all nodes can get to it.
$ sh config.sh

# Start up OST's first
$ lconf --reformat --node node-ost1 config.xml
$ lconf --reformat --node node-ost2 config.xml
$ lconf --reformat --node node-ost3 config.xml

# Then MDS's (which try to connect to the OST's)
$ lconf --reformat --node node-mds config.xml

# And finally clients (which try to connect to OST's and MDS's).
$ lconf --node client config.xml
# Or use 0-config "mount" command for clients (see below)
$ mount -t lustre node-mds:/mds-test/client /mnt/lustre

Startup of clients and servers using lconf can actually occur in any order, but startup may block until all the servers that the node needs are up and communicating.
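To shut the cluster down, lconf's cleanup mode can be used in roughly the reverse order - clients first, then the MDS, then the OSTs. A hedged sketch, assuming the same config.xml:

# Shut down in roughly the reverse order of startup
$ lconf --cleanup --node client config.xml
$ lconf --cleanup --node node-mds config.xml
$ lconf --cleanup --node node-ost1 config.xml
$ lconf --cleanup --node node-ost2 config.xml
$ lconf --cleanup --node node-ost3 config.xml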

Lustre can use any physical device that can be formatted as an ext3 filesystem. An entire SCSI drive could be specified as --dev /dev/sdb, or a single partition as /dev/sdb2. The --reformat option to lconf will completely erase and reformat the drive or partition for use with Lustre - use it with caution.
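As an illustration, the --add ost line from local.sh above could point at a real partition instead of a loopback file; --size is omitted because the size is obtained from the device itself (the device name is only an example):

# Real partition (example device) instead of a loopback file; no --size needed
lmc -m local.xml --add ost --node localhost --lov lov-test --ost ost1-test --fstype ext3 --dev /dev/sdb2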

Lustre supports mounting the Lustre file system with a simple, NFS-like mount command. The client configuration information is recorded on the MDS when the MDS is reformatted. Subsequently, when a mount command is invoked on the client, the configuration log on the MDS is read and the required devices are set up and configured on the client. The filesystem mount-point is specified in the mount command. The XML file is not needed for 0-config clients. In order to support 0-config, you will have to modify your modprobe configuration so that modprobe knows how to load the Lustre modules. The mount command will be similar to:

mount -t lustre [-o nettype=] mdshost:/mdsname/client-profile

A network type of tcp is the default. The client-profile is the client node name (passed as --node argument to lmc). Please note that for 0-config to work, the node must have a mount-point entry specified with lmc's --add mtpt command. The lustre/conf/modules.conf file illustrates the changes that need to be made to your modules.conf file for 0-config on linux-2.4:

# sample modules.conf for autoloading lustre modules on zeroconf clients

add below kptlrouter portals
add below ptlrpc ksocknal
add below llite lov osc
alias lustre llite

The module tools for linux-2.6 use a different configuration file. The entries should be added to modprobe.conf.local or to a new file in the /etc/modprobe.d/ directory:

# sample modprobe.conf.local for autoloading lustre modules on zeroconf clients

install kptlrouter /sbin/modprobe portals; /sbin/modprobe --ignore-install kptlrouter
install ptlrpc /sbin/modprobe ksocknal; /sbin/modprobe --ignore-install ptlrpc
install llite /sbin/modprobe lov; /sbin/modprobe osc; /sbin/modprobe --ignore-install llite
alias lustre llite

# sample modprobe.conf.local for autoloading lustre modules on zeroconf clients with elan

install kptlrouter /sbin/modprobe portals; /sbin/modprobe --ignore-install kptlrouter
install kqswnal /sbin/modprobe kptlrouter; /sbin/modprobe --ignore-install kqswnal
install ksocknal /sbin/modprobe kqswnal; /sbin/modprobe --ignore-install ksocknal
install ptlrpc /sbin/modprobe ksocknal; /sbin/modprobe --ignore-install ptlrpc
install llite /sbin/modprobe lov osc; /sbin/modprobe --ignore-install llite
alias lustre llite

Note that if you are working from RPMs, you must install the lustre-lite-utils RPM as well as the kernel RPM; otherwise this will fail with the following error:

LustreError: 6946:(llite_lib.c:540:lustre_fill_super()) Unable to process log:
LustreError: 6946:(llite_lib.c:603:lustre_fill_super()) Unable to process log: -clean

Once your Lustre filesystem has been mounted on a client, you can create files and directories, subject to normal Linux file/directory permissions:

touch /mnt/lustre/foo

To get a sense of the performance of the file system, run the "vmstat" utility on the OSS's while creating files on the lustre file system. Using "iozone" will give a more accurate performance measure than "dd" in our example:

oss1> vmstat 1
oss2> vmstat 1
client> dd of=/mnt/lustre/bigfile if=/dev/zero bs=1048576
or
client> cd /mnt/lustre; iozone -a

The "bo" column from vmstat will show blocks (1024 bytes) out every second, so this gives a throughput rate in KB/sec.

A user-space library version of the Lustre file system is now available for some versions of Lustre; it gives a user application linked with the library access to Lustre file systems. The key goals for the library are to provide a portable mechanism to access Lustre from different POSIX-compliant operating systems, and to provide access from microkernel-based systems and from the Windows operating system.

More information is available at LibLustreHowTo.
