Thursday, March 29, 2007

Mpxio and LSI storage

Two Sun Cluster boxes (a Sun Fire V440 and a Sun Fire V490), both running Solaris 10, connect to an LSI storage array through Fibre Channel switches. One LUN of about 1.9TB was created on the LSI box to be shared by the two servers. With MPxIO disabled on both servers, only one server recognized the LUN correctly; the other treated it as an unformatted disk, and trying to format or label the LUN with the format utility produced illegal ASC errors on writes.

When MPxIO is enabled on both servers with the command:
#stmsboot -e

the problem described above goes away: both servers can access the LUN with no errors. I don't know why. Does anybody know?
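
For anyone hitting the same setup, the multipath state can be inspected once MPxIO is on; a minimal sketch (Solaris 10 commands, output omitted):

#stmsboot -L
#mpathadm list lu

stmsboot -L prints the mapping between the per-path device names and the new scsi_vhci device names, and mpathadm list lu lists the multipathed logical units together with their path counts.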

Thursday, March 22, 2007

Stupid is as stupid does

China's Net Nanny is having some kind of emotional issue: in the last few weeks, Livejournal, Xanga, and Blogspot have all been blocked.

Blocked, unblocked, blocked again... ...

What does the old Nanny want to protect?

Tuesday, March 20, 2007

LDAP and OpenLDAP (Part III)

In the previous parts we saw that the LDAP architecture defines three more models besides the information model. This section discusses those three: the naming model, the functional model, and the security model.

Naming Model
The information model provides the basic elements used to construct the directory. The naming model describes how these elements fit together to build up the directory.

We have already met the DIT -- the Directory Information Tree -- which is perhaps the most important concept of the naming model. The examples in Part I showed how the directory is built as a tree whose root entry holds the information of the company abc.com.

LDAP provides a great deal of flexibility in the tree design, but that does not mean that everything is possible. The directory always has to be a treelike structure, i.e., every entry below the directory root has to have exactly one parent. You cannot insert an entry that has no parent.

The concept of the distinguished name is the heart of the naming model. Each entry has an attribute called the "distinguished name" (DN), which identifies the entry unambiguously. From this it is clear that the DN must be unique throughout the whole directory. The DNs are what build up the namespace and the directory information tree.

The distinguished name is a comma-separated list of components called "relative distinguished names" (RDNs). You obtain an entry's distinguished name by appending the distinguished name of its parent to the entry's relative distinguished name.

Like the DN, the RDN has to be unique, but only within its scope, i.e., among the entries under the same parent. For example, if the parent is:
DN: ou=sales, l=Europe, o=abc.com

then directly under it there can be only one entry with the RDN:
uid=usr1

resulting in the DN:
DN: uid=usr1, ou=sales, l=Europe, o=abc.com

This means that you can still have another entry with an RDN of uid=usr1 under l=Asia, resulting in the unique DN:
DN: uid=usr1, ou=sales, l=Asia, o=abc.com

As directories grow, there may come a point where it is no longer practical to hold the whole directory tree on one server. For performance reasons, we might decide to put one part of the directory tree on another directory server. Performance is not the only reason for placing parts of the directory on other servers; administrative considerations, such as allowing different policies for different parts of the tree, might also come into play. Besides partitioning, which will be explained later, these problems can be solved using referrals.

Assume that our directory server does not hold the entire directory tree and that part of the tree is located on another server. For example, imagine that the part of abc.com's directory tree covering North America has been moved to a separate server. At this point, a client searching for an entry in the sales department in San Francisco would not find anything, receiving instead an error message indicating that the entry does not exist on the server. This is not what we wanted to achieve. We need an entry that points to the location where the entry can now be found. This special entry is called a "referral."

The referral is a special entry of the object class "referral." Like the alias, the referral has a distinguished name to locate it in the directory. The referral has one required attribute: the "ref" attribute. The ref attribute is an LDAP URL pointing to the location where the real entry can be found.
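
As a sketch, a referral entry for the North America subtree could look like the following LDIF (the host name ldap-na.abc.com is made up, and OpenLDAP additionally needs the extensibleObject class so that the naming attribute can be stored):

dn: l=NorthAmerica, o=abc.com
objectClass: referral
objectClass: extensibleObject
l: NorthAmerica
ref: ldap://ldap-na.abc.com/l=NorthAmerica,o=abc.com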

Functional model
The functional model describes the operations that can be performed on the directory.

At this point, it is useful to remember that LDAP is a protocol that mediates the messages sent between client and server. The client sends a message to the server requesting a certain action against the directory. The server then executes this action on behalf of the client, sending back a message containing the result code and any result sets.

There are three groups of functions, plus one special group of "extended operations." This group is new in version 3 of LDAP and is defined in RFC 2251. Extended operations allow further functionality to be added, published in the form of RFCs or in private implementations. For example, the "StartTLS" (Transport Layer Security) operation, not among the operations of RFC 2251 itself, is defined as an extended operation (in RFC 2830).

The three groups of functions are:
1. Interrogation operations: search, compare
2. Update operations: add, delete, modify DN, modify
3. Authentication and control operations: bind, unbind, abandon

All of these operations are requests made by an LDAP client to an LDAP server. The server executes the requested operation and sends back to the client the result plus an error code.

The most complicated operation is the search operation. It can take up to eight parameters: base, scope, derefAliases, sizeLimit, timeLimit, attrOnly, searchFilter, and attributeList (a sample ldapsearch invocation follows the list):
1. Base: DN where the query should start
2. Scope: Extension of the query inside the directory information tree. The scope can have three different values:
a. baseObject: Limits the search to the base object only.
b. singleLevel: Limits the search to the immediate children of the base object (the base entry itself is not returned).
c. wholeSubtree: Extends the search to the entire subtree from the base object.
3. derefAliases: Indicates how alias dereferencing should be handled.
4. sizeLimit: Maximum number of entries a query will return. A number of "zero" means that there is no size limit.
5. timeLimit: Maximum number of seconds a query can take. A number of "zero" means that the client does not impose any time limit.
6. attrOnly: A Boolean value. Set to "true," it indicates that only attribute types are returned; set to "false," it returns attribute types and attribute values.
7. searchFilter: Defines the conditions under which a search return is successful. The conditions can be combined with the Boolean "and," "or," and "not" operators.
8. attributeList: Attributes that should be returned if the searchFilter matches. Two values have a special meaning: an empty list with no attributes and an asterisk, "*". Both values instruct the server to return all attributes of the matching entries. The asterisk allows you to specify further operational attributes to be returned.
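
In OpenLDAP's ldapsearch most of these parameters map directly onto command-line options: -b (base), -s (scope: base, one, or sub), -a (derefAliases), -z (sizeLimit), -l (timeLimit), and -A (attrOnly), followed by the filter and the attribute list. A sketch, reusing the entries from the earlier examples (-x selects simple authentication; here the search is anonymous):

#ldapsearch -x -b "o=abc.com" -s sub -z 100 -l 30 "(uid=usr1)" cn mail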

The "compare" operation tests for the presence of a particular attribute in an entry with a given distinguished name. It returns "true" if the entry contains the attribute and "false" if the entry does not contain the attribute. Now look at the parameters for the "compare" operations:
1. entry: Distinguished name of the entry you are searching for
2. ava: Attribute name-value pair you want to verify is contained in the entry ("ava" means "attribute value assertion")
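
A minimal sketch using OpenLDAP's ldapcompare (the entry and the mail value are made up; -x selects simple authentication):

#ldapcompare -x -D "uid=admin, o=abc.com" -w password "uid=usr1, ou=sales, l=Europe, o=abc.com" "mail:usr1@abc.com"

The tool reports whether the attribute value assertion matched (TRUE or FALSE).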

The "add" operation is a relative easy one, as it contains only two parameters: entry and attributeList
1. entry: Distinguished name of the new entry
2. attributeList: A list of name-value pairs of the attributes contained in the entry

The "delete" operation is still easier than the "add" operation inasmuch as it takes one parameter only, the distinguished name of the entry to be deleted.
entry: Distinguished name of the entry to be deleted
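
A minimal sketch with OpenLDAP's ldapdelete, reusing the DN from the earlier examples (-x selects simple authentication):

#ldapdelete -x -D "uid=admin, o=abc.com" -w password "uid=usr1, ou=sales, l=Europe, o=abc.com"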

The "modify" operation is more complicated than the previous two. It takes three parameters: distinguished name, type of operation, and name-value pairs:
1. entry: Distinguished name of the entry to be modified
2. operation: Type of operation to be executed on this entry, with three possible values:
add: Adds a new attribute (name,value pair)
delete: Deletes an attribute
replace: Replaces the values of an attribute
3. attributeList: The list of name-value pairs to be added, deleted, or replaced
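
In LDIF change format this corresponds to a changetype of "modify"; a sketch (the entry and the telephone number are made up, -x selects simple authentication):

#ldapmodify -x -D "uid=admin, o=abc.com" -w password
dn: uid=usr1, ou=sales, l=Europe, o=abc.com
changetype: modify
replace: telephoneNumber
telephoneNumber: +86 21 5555 0100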

The main purpose of the "bind" operation is to let the client authenticate itself against the directory server. The "bind" operation takes three parameters, version, name, and authentication:
1. version: Version of LDAP the client wishes to use
2. name: Name of the directory object the client wishes to bind to
3. authentication: Authentication choice, which has two possible values:
simple: Indicates that the password travels in clear text over the wire.
sasl: Uses the SASL mechanism as described in RFC 2222, "Simple Authentication and Security Layer".
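
With the OpenLDAP command-line tools the bind happens implicitly at the start of every operation; a sketch of a simple bind (the password is made up):

#ldapsearch -x -D "uid=usr1, ou=sales, l=Europe, o=abc.com" -w secret -b "o=abc.com" "(cn=*)"

The -x flag selects simple authentication; omitting it makes the tools attempt a SASL bind instead.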

The operation "unbind" is very simple and does not take any parameters. The "unbind" operation does not return any value from the server. The client assumes that the connection to the server is now closed. The server releases any resources allocated for the client, discards any authentication information, and closes the TCP/IP connection with the client.

Another simple operation, "abandon," informs the server that it should stop a previously requested operation. It is typically used by GUIs in the case of long-running operations to tell the server that the client is no longer interested in the result. The operation takes only one parameter:
operationID: ID of the operation to be abandoned


Security model
The security model shows how to secure the data in the directory. There are two major topics here: authentication and authorization, the latter also called "access control" in LDAP.

Before a client can access data on an LDAP server, two processes must take place: authentication and then authorization. These two processes are quite different from each other.

Authentication takes place when the client identifies itself to the server as it tries to connect. The process depends very much on the authentication mechanism used. The easiest way is to connect to the server without providing an identity at all. To such an anonymous connection, if allowed, the server grants the lowest access rights. Authentication schemes range from simple authentication with user and password up to authentication using certificates. Certificates give the server assurance that the client really is who it says it is, and they can also assure the client of the server's identity.

Once the client has been recognized by the server, it has certain access rights to the data. Authorization is the process by which the server grants the correct access rights to the previously authenticated client: the user can read and write data with restrictions that depend on the level of access granted. To define what each client may do, the server maintains access control information (ACI).

The first type of authentication is "no authentication at all," also called "anonymous bind" because the server has no idea of who actually is asking for a connection. Anonymous bind is used to access publicly available data.

After anonymous access, the simplest authentication is the basic authentication, which is also used in other protocols like HTTP. The client simply sends the user credentials across the wire. In the case of LDAP, this means the user's distinguished name and the userPassword. Both of them are sent over the network in plain view without encryption. This method may be okay in a trusted environment, but even in an intranet environment, sending around unencrypted passwords is not a good idea.

The goal of the TLS protocol is to provide both data integrity and privacy. This means that the TLS protocol guarantees that data sent between two partners arrives unmodified and that the conversation is encrypted, i.e., that a person sitting between client and server cannot read the conversation. TLS requires a reliable transport and is based upon TCP. TLS itself comprises two protocols, the TLS Record protocol and the TLS Handshake protocol. The TLS Record protocol only encapsulates the higher-level protocol, the TLS Handshake protocol; it is the Handshake protocol that provides the security mechanisms. It allows server and client to authenticate each other and to negotiate the encryption protocol and cryptographic keys, using public key mechanisms.
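
If the server has certificates configured, the OpenLDAP client tools can request TLS on a plain LDAP connection with -Z, or -ZZ to abort if TLS cannot be started; a sketch:

#ldapsearch -ZZ -x -D "uid=admin, o=abc.com" -w password -b "o=abc.com" "(objectclass=*)"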

LDAP (v2) supports a bind mechanism based on Kerberos, but it is not directly supported in LDAP (v3). By "not directly supported," we mean that it can be used as a security mechanism upon an agreement established using the SASL protocol.

The simple authentication and security layer (SASL) is a method of providing authentication services to a connection-oriented protocol such as LDAP. The SASL standard is defined in RFC 2222, "Simple Authentication and Security Layer." It makes it possible for client and server, once connected, to agree upon a security mechanism (and possibly an encryption layer) for the ongoing conversation. One of these mechanisms is Kerberos. At the time of this writing, the mechanisms supported by SASL include: anonymous, CRAM-MD5, Digest-MD5, External, Kerberos, SecurID, Secure Remote Password, S/Key, and X.509.
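
The mechanisms a particular server offers can be read from its root DSE, and a specific mechanism can be requested with -Y; a sketch (the SASL user name usr1 is made up, and the server must have DIGEST-MD5 configured):

#ldapsearch -x -b "" -s base supportedSASLMechanisms
#ldapsearch -Y DIGEST-MD5 -U usr1 -b "o=abc.com" "(cn=*)"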

Monday, March 19, 2007

LDAP and OpenLDAP (Part II)

If you only want to know how to use a directory server, you may not need to know about the models behind LDAP. But there are four of them: the information model, the naming model, the functional model, and the security model, and they help us understand what makes up an LDAP server and how LDAP works.

Information model
The basic unit of information is the entry. We have seen that the object class defines how an entry should look. Entries are made up of attributes, and just as entries are defined by object classes, so attributes are defined by the "attribute types". Everything is held together by the schema of the directory.

A directory can be broken down into small data structures called "entries" or "objects"; the organization, the organizational units, and so on in the Part I example are all objects.

Note that in the example listed in the last article each entry has several lines, each line corresponding to one attribute. In other words, an entry is a collection of attributes, with each attribute having one or more values.

LDAP has rules that specify how information is to be saved. The collection of these rules is called the schema, and it is kept in configuration files. In version 3 of LDAP a client can even query the schema that the server is using. LDAP schemas resemble those of an RDBMS but differ in some important ways:

1. LDAP software ships with a number of ready-made schemas. The user selects a schema and begins filling in data, whereas with an RDBMS the user first has to define the schema.

2. The LDAP schema is simpler: it knows nothing about complicated constructs such as joins, nor are there any triggers.

The schema contains:
1. Object-class definitions describing the different types of entries in the directory
2. Attribute-type definitions describing the different attributes of the objects
3. Syntax definitions describing how attribute values are to be compared during queries

If you write an application in an object-oriented language such as C++ or Java, the first thing you do is define classes; then you create objects that instantiate those classes. The same happens with LDAP. Every directory is configured to recognize a number of classes, called "object classes", and the objects in the directory are instances of these classes. The following is an example of object class definitions as stored in the schema files:

objectclass ( 2.5.6.0 NAME 'top'
        ABSTRACT
        MUST objectClass )

objectclass ( 2.5.6.4 NAME 'organization'
        SUP top STRUCTURAL
        MUST o
        MAY ( userPassword $ searchGuide $ seeAlso $ businessCategory $
              x121Address $ registeredAddress $ destinationIndicator $
              preferredDeliveryMethod $ telexNumber $
              teletexTerminalIdentifier $ telephoneNumber $
              internationaliSDNNumber $ facsimileTelephoneNumber $
              street $ postOfficeBox $ postalCode $ postalAddress $
              physicalDeliveryOfficeName $ st $ l $ description ) )

In this example the classes top and organization are defined. The class organization inherits from the class top via "SUP top". This inheritance is similar to that in object-oriented languages.

The "MUST" section in this definition contains those attribute which must have some value while defining objects. The value of he attributes in "MAY" section is optional while defining objects. What type of value of these attribute should have is defined in BNF(Backus Naur form), i.e., the attribute such as "userPassword" or "searchGuide" is predefined in the schema file and the valid value type of them is also defined. For example the attribute "c"(stand for country) is predefined in core.schema file as:

attributetype ( 2.5.4.6 NAME ( 'c' 'countryName' )
DESC 'RFC2256: ISO-3166 country 2-letter code'
SUP name SINGLE-VALUE )

The object class "top" is one of the most simplest object class. And all other classes derived from the class "top". The only purpose for class "top" is to ensure that every object contains the "objectClass" attribute, since it is the only required attribute in the class "top". It exists only to be inherited, as indicated as "ABSTRACT". An object is not intended to be instantiated is called "abstract object class". That means you'll never find an entry for class "top" in the directory.

objectclass ( 2.5.6.6 NAME 'person'
DESC 'RFC2256: a person'
SUP top STRUCTURAL
MUST ( sn $ cn )
MAY ( userPassword $ telephoneNumber $ seeAlso $ description ) )

The example listed above is the definition of the class "person". 2.5.6.6 is the object identifier (OID) of the class. It inherits the attribute "objectClass" from the class "top". An object of this class must have values for the attributes "sn" and "cn"; the other attributes are optional.

As mentioned previously, there are three types of object classes:
1. Abstract
2. Structural
3. Auxiliary

An abstract class exists only to be inherited from; a structural class can be instantiated as an object stored in the directory.

We could be happy with these two object types, so what is the third one for? It is for situations where you need to attach additional data to entries of an existing class, and that data does not fit well into the existing object structure. To get around this problem, you can define all the data the new object needs to hold in an auxiliary class. An object class of type "auxiliary" can be attached to entries anywhere in the directory information tree.
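
A familiar example from OpenLDAP's core.schema is "dcObject", an auxiliary class that only adds the "dc" attribute and can be attached to an existing structural entry such as an organization (quoted approximately, check your own schema files):

objectclass ( 1.3.6.1.4.1.1466.344 NAME 'dcObject'
        DESC 'RFC2247: domain component object'
        SUP top AUXILIARY
        MUST dc )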

The concept of OID is another important concept imported into LDAP from the X.500 standard. Object identifiers (OIDs) are not only used for object classes, but also for attribute types. The OID standard is not limited only to object classes and attribute types. It can be used to identify uniquely any type of object.

An OID is syntactically a string of numbers separated by dots, such as 2.5.6.6 for the object class "person" or 2.16.840.1.113730.3.2.2 for the object class "inetOrgPerson". The namespace of all OIDs forms an OID tree. A subtree of the OID tree is called an "arc". This concept greatly simplifies the administration of the OID tree.

You can get your own OID subtree from IANA (Internet Assigned Numbers Authority); their Web site (http://www.iana.org) provides further information. The syntax makes it possible to see where a certain object comes from: all you have to do is trace the route from the object to the root of the tree. For example, the attribute types inherited from the X.500 standard all sit under the arc 2.5.4, e.g., 2.5.4.0 for the attribute "objectClass."

You will need an OID subtree whenever you need to extend the directory schema. You can, of course, invent your own OIDs. However, you will run into trouble if you have to exchange information with systems that have reserved those OIDs for another purpose. So if you extend your schema, it is wise to ask for an OID and to construct your own hierarchy based on it. Remember to keep a list of which OID is used for which object to avoid collisions.

In the case of queries, you need to make comparisons in order to find, among all the entries, the one you are interested in. Comparisons are not only necessary in query operations, but also in "delete," "update," and "add" operations.

You also have to tell your directory server how to make comparisons between attribute values. You do this in the form of so-called "matching rules."

As an example, take the telephone number attribute: two differently formatted versions of the same number should be considered equal by the server, and therefore the matching rule should specify the following:
1. ignore extra white spaces
2. ignore hyphens ("-")
3. ignore opening and closing parentheses

Matching rules are defined by the standard, and each one has a name and an OID. The rule describing the matching of telephone numbers, for example, is called "telephoneNumberMatch" and has the OID 2.5.13.20.
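
In the schema files the matching rules show up in the attribute type definitions; core.schema defines the telephoneNumber attribute roughly as follows (quoted approximately, check your own schema files):

attributetype ( 2.5.4.20 NAME 'telephoneNumber'
        DESC 'RFC2256: Telephone Number'
        EQUALITY telephoneNumberMatch
        SUBSTR telephoneNumberSubstringsMatch
        SYNTAX 1.3.6.1.4.1.1466.115.121.1.50{32} )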

Sunday, March 18, 2007

LDAP and OpenLDAP (Part I)

LDAP is nothing more than a communication protocol between a client, which requests information, and a server, which stores the information or knows where it is stored. LDAP is therefore only a set of communication rules, not a product you can buy.

OpenLDAP is an open-source product that implements LDAP.

Experience has shown that the best way to understand a new tool is simply to use or play around with it. That is what we will do in this article.

Before you can enter any objects into a directory, you must first define what kinds of objects the directory will accept. This is much like designing an object-oriented database. For example, if we want to describe the company abc.com as an object, we consider abc.com to be an instance of the object class "organization".

An object class is a structure that is already defined in the standard OpenLDAP schemas, most of which derive from the original X.500 specifications. A schema is a bit like a C header file, which declares the data structures used by the C libraries. So we do not have to invent these object classes ourselves, just as we do not have to reinvent the printf function when programming in C.
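
Carrying the analogy further: just as a C file includes the headers it needs, slapd is told which schema files to load in its configuration file. A sketch (the path depends on where OpenLDAP was installed; inetorgperson.schema is needed for the person entries later in this article):

include /usr/local/etc/openldap/schema/core.schema
include /usr/local/etc/openldap/schema/cosine.schema
include /usr/local/etc/openldap/schema/inetorgperson.schema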

Just as a company can be described as an object, a department of a large company can be described as an object of the object class "organizationalUnit", and a person in the company as an object of the class "person".

A directory is made up of a number of entries, each corresponding to an object in the real world. Every object belongs to an object class, which is characterized by a number of "attributes". For example, a company has a name, a phone number, and so on. An attribute is made up of an attribute name and one or more attribute values. The attribute names, like the class names, are standardized, most of them being inherited from the X.500 protocol.

An object (entry) must have a "distinguished name" (dn) to identify itself, so the dn must be unique. How distinguished names are given to objects is explained in detail in the later parts of this series.

The directory is built up like a tree, usually called the DIT (Directory Information Tree). We will take the company abc.com as an example. When we start to store the information of abc.com, we treat abc.com as a tree whose root entry has the attribute "o" (which stands for organization) with the value "abc.com". So the root entry has the distinguished name o=abc.com.

Each department of abc.com is a subtree of the DIT of the entire enterprise and has the attribute ou (which stands for organizational unit) with the department name as its value. For example, the marketing department has the distinguished name "ou=marketing, o=abc.com"; likewise we give the IT department the distinguished name "ou=IT, o=abc.com".

Once the dn has been chosen, the object can be added to the DIT using the client tools released with the OpenLDAP software distribution, like:
#ldapmodify -a -D "uid=admin, o=abc.com" -w password
dn: o=abc.com
objectClass: top
objectClass: organization
o: abc.com
l: ShangHai

adding new entry "o=abc.com"

The parameter following -D is the administrator account of the LDAP server running on the machine, and -w gives the password of this account. Both the account and the password are stored in the server's configuration file, which can be edited with vi before starting the server.
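
A sketch of the relevant slapd.conf lines for this setup (names taken from the example above; a real setup would rather store a hashed password generated with slappasswd):

suffix   "o=abc.com"
rootdn   "uid=admin, o=abc.com"
rootpw   password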

The first line of the actual data begins with:
dn: o=abc.com
which is the distinguished name of this entry. The distinguished name is just a key to access this particular entry. It must be unique across the whole directory.

The following lines:
objectClass: top
objectClass: organization

mean that o=abc.com is an object of the object class "organization", and that "organization" is a subclass of "top". Both "organization" and "top" are declared in the schema files.

o: abc.com
l: ShangHai
The object o=abc.com has two other attributes: "o" (organization) and "l" (location), both of them defined in the schema. Reading the schema files, we note that some attributes are required while others are optional.

The last line:
adding new entry "o=abc.com"
is the output of the command, which means that the command has been executed successfully. Otherwise you would get an error message.

After the root entry (o=abc.com) has been added, the departments can be added as subtrees. For example:
# ldapmodify -a -D "cn=admin, o=abc.com" -w password
dn: ou=HR, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: HR
description: Human Resources

adding new entry "ou=HR, o=abc.com"

dn: ou=R&D, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: R&D
description: Research and Development

adding new entry "ou=R&D, o=abc.com"

dn: ou=Mkt, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: Mkt
description: Marketing

adding new entry "ou=Mkt, o=abc.com"

After all these entries have been added to the DIT, we can retrieve the information from the LDAP server using the ldapsearch command:

#ldapsearch -b "o=abc.com" "(objectclass=*)"
# extended LDIF
#
# LDAPv3
# filter: (objectclass=*)
# requesting: ALL
#
# abc.com

dn: o=abc.com
objectclass: top
objectclass: organization
o: abc.com
l: ShangHai

# HR, abc.com
dn: ou=HR, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: HR
description: Human Resources

# R&D, abc.com
dn: ou=R&D, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: R&D
description: Research and Development

# Mkt, abc.com
dn: ou=Mkt, o=abc.com
objectclass: top
objectclass: organizationalUnit
ou: Mkt
description: Marketing

# search result
search: 2
result: 0 Success

# numResponses: 5
# numEntries: 4

Now that the DIT has been created, with the root entry and the subtree entries added, we find that putting the personal information of hundreds of abc.com employees into it would be tedious, since we would have to type these entries one by one without mistakes. Fortunately, the ldapmodify command also accepts a file as input:
#cat persons.ldif
dn: uid=ZhaoJia, ou=Mkt, o=abc.com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
cn: Zhao
sn: Jia
givenName: Thomas
ou: Mkt
uid: ZhaoJia
mail: zhaojia@abc.com

dn: uid=QianYi, ou=Mkt, o=abc.com
objectClass: top
objectClass: person
objectClass: organizationalPerson
objectClass: inetOrgPerson
cn: Qian
sn: Yi
givenName: Peter
ou: Mkt
uid: QianYi
mail: QianYi@abc.com

#ldapmodify -a -D "uid=admin, o=abc.com" -w "password" -f persons.ldif
adding new entry "uid=ZhaoJia, ou=Mkt, o=abc.com"
adding new entry "uid=QianYi, ou=Mkt, o=abc.com"

Saturday, March 17, 2007

COW and ZFS relate features

Copy-on-write (sometimes referred to as "COW") is an optimization strategy used in computer programming. The fundamental idea is that if multiple callers ask for resources which are initially indistinguishable, you can give them pointers to the same resource. This fiction can be maintained until a caller tries to modify its "copy" of the resource, at which point a true private copy is created to prevent the changes becoming visible to everyone else. All of this happens transparently to the callers. The primary advantage is that if a caller never makes any modifications, no private copy need ever be created.

Copy-on-write finds its main use in virtual memory operating systems; when a process creates a copy of itself, the pages in memory that might be modified by either the process or its copy are marked copy-on-write. When one process modifies the memory, the operating system's kernel intercepts the operation and copies the memory so that changes in one process's memory are not visible to the other.

Another use is in the calloc function. This can be implemented by having a page of physical memory filled with zeroes. When the memory is allocated, the pages returned all refer to the page of zeroes and are all marked as copy-on-write. This way, the amount of physical memory allocated for the process does not increase until data is written. This is typically only done for larger allocations.

Copy-on-write can be implemented by telling the MMU that certain pages in the process's address space are read-only. When data is written to these pages, the MMU raises an exception which is handled by the kernel, which allocates new space in physical memory and makes the page being written to correspond to that new location in physical memory.

One major advantage of COW is the ability to use memory sparsely. Because the usage of physical memory only increases as data is stored in it, very efficient hash tables can be implemented which only use little more physical memory than is necessary to store the objects they contain. However, such programs run the risk of running out of virtual address space -- virtual pages unused by the hash table cannot be used by other parts of the program. The main problem with COW at the kernel level is the complexity it adds, but the concerns are similar to those raised by more basic virtual memory concerns such as swapping pages to disk; when the kernel writes to pages, it must copy them if they are marked copy-on-write.

COW is also used outside the kernel, in library, application and system code. The string class provided by C++'s Standard Template Library, for example, was specifically designed to allow copy-on-write implementations. One hazard of COW in these contexts arises in multithreaded code, where the additional locking required for objects in different threads to safely share the same representation can easily outweigh the benefits of the approach.

The COW concept is also used in virtualization/emulation software such as Bochs, QEMU, and UML for virtual disk storage. This allows a great reduction in required disk space when multiple VMs can be based on the same hard disk image, as well as increased performance as disk reads can be cached in RAM and subsequent reads served to other VMs out of the cache.

ZFS uses a copy-on-write, transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.
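
On a live system the checksum behavior is exposed as a per-dataset property; a sketch (the pool and filesystem names are made up):

# zfs set checksum=sha256 data/filesystem
# zfs get checksum data/filesystem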

The ZFS copy-on-write model has another powerful advantage: when ZFS writes new data, instead of releasing the blocks containing the old data, it can instead retain them, creating a snapshot version of the file system. ZFS snapshots are created very quickly, since all the data comprising the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.

Monday, March 12, 2007

Why ZFS for home

--from uadmin.blogspot.com

I’m getting annoyed at people that keep saying ZFS is okay for servers but I don’t need it for home. ZFS scales from one drive to an infinite number of drives and has benefits for all of them.

Let's take a look at the average home computer: a single drive holding a mix of files, with drives of up to 300GB now common. That is a lot of data to lose, and it's getting easier to lose data these days; new hard drives aren't getting any more reliable with time. Of course you can also lose things on a new hard drive just by misplacing them in one of the thousands of directories you create trying to organize your files.

What do other operating systems and file systems provide to fight this situation? In Linux you can use RAID (redundant array of inexpensive disks), so that if a hard drive fails your data is safe. The fun begins when you try to enable RAID; the obvious choices are RAID 1 (mirroring your data) or RAID 5 (which uses part of your drives for parity, protecting your data with less space overhead but requiring a minimum of 3 drives). I won't bore you with the technical details; I will just show a small sample of the commands to create RAID 1, a mirror image of one drive onto a second drive.

These instructions were taken from http://unthought.net/Software-RAID.HOWTO/Software-RAID.HOWTO-5.html#ss5.6

You have two devices of approximately same size, and you want the two to be mirrors of each other. Eventually you have more devices, which you want to keep as stand-by spare-disks, that will automatically become a part of the mirror if one of the active devices break.

Set up the /etc/raidtab file like this:

raiddev /dev/md0
raid-level 1
nr-raid-disks 2
nr-spare-disks 0
persistent-superblock 1
device /dev/sdb6
raid-disk 0
device /dev/sdc5
raid-disk 1

If you have spare disks, you can add them to the end of the device specification like

device /dev/sdd5
spare-disk 0

Remember to set the nr-spare-disks entry correspondingly.

Ok, now we're all set to start initializing the RAID. The mirror must be constructed, eg. the contents (however unimportant now, since the device is still not formatted) of the two devices must be synchronized.

Issue the

mkraid /dev/md0

command to begin the mirror initialization.

Check out the /proc/mdstat file. It should tell you that the /dev/md0 device has been started, that the mirror is being reconstructed, and an ETA of the completion of the reconstruction.

Reconstruction is done using idle I/O bandwidth. So, your system should still be fairly responsive, although your disk LEDs should be glowing nicely.

The reconstruction process is transparent, so you can actually use the device even though the mirror is currently under reconstruction.

Try formatting the device, while the reconstruction is running. It will work. Also you can mount it and use it while reconstruction is running. Of Course, if the wrong disk breaks while the reconstruction is running, you're out of luck.

Looks like fun, right? Before ZFS the situation wasn't much better in Solaris. A typical home user will look at this and say "I'll do this next week", and then next week never comes. Doing RAID 5 only gets more complex, in Linux at least; in ZFS it is just a slight change to the commands used to create a mirror. In ZFS we execute two or three commands and we are done.

# zpool create data mirror c0t0d0 c0t1d0
# zfs create data/filesystem

Done. The only complex part is getting the two device names at the end of the zpool command, and you can find those by running the Solaris format command:

#format < /dev/null
Searching for disks...done

AVAILABLE DISK SELECTIONS:
0. c0t0d0
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0
1. c1t2d0
/sbus@1f,0/SUNW,fas@1,8800000/sd@2,0
#
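
Once the pool exists, its layout and health can be checked at any time; a sketch (output omitted):

# zpool status data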

That takes care of drive failure; another problem is accidental deletion, accidentally installing a broken application, or any other change you would like to undo. Linux's answer to this is backups, either on optical media, tape, or perhaps another hard drive. This is expensive or time consuming (choose one), so the typical home user will most likely put it off until another day and won't have a backup of their data.


ZFS has snapshots; they are easy and painless and have a very low resource cost to create. Snapshots are basically a picture of your data taken in real time, and they are nearly instant in ZFS. To get these in any other OS you need to buy expensive RAID hardware or an expensive software package, something no home user will want to buy.

For example, I want to protect my mp3 collection, so I put it on a file system all its own.


# du -sh /mp3
17G /mp3
#

And then I took a snapshot of it for protection.

# time zfs snapshot data/mp3@may-1-2006
real 0m0.317s
user 0m0.017s
sys 0m0.030s
#

Not bad: a third of a second to protect 17 gigabytes of data, which can easily be restored should I make a mistake and delete or corrupt a file, or all of them.

And here is a little script I created that takes a snapshot of each of my ZFS file systems and puts a date stamp on each one. Each snapshot takes very little space, so you can make as many as you need to be safe.

#!/bin/sh
# Take a date-stamped snapshot of every ZFS filesystem on this machine.
date=`date +%b-%d-%Y`

# -H drops the header line, -o name prints just the dataset names.
for i in `/usr/sbin/zfs list -H -t filesystem -o name` ;
do /usr/sbin/zfs snapshot $i@$date ;
done

A few minutes with crontab, or your desktop's graphical crontab creator, and you can have this script execute daily with no user intervention. Below is a sample line to add to your crontab that takes snapshots at 3:15 am.

15 3 * * * /export/home/username/bin/snapshot_all

Seeing your snapshots is easy: just look in the .zfs/snapshot directory that exists in each ZFS filesystem. You can even see the individual files that make up a snapshot by changing directories further into it. This even works if the file system is shared via NFS.
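
For example, to browse the snapshot taken above and pull a single file back out (the mp3 file name is made up):

# ls /data/mp3/.zfs/snapshot/
may-1-2006
# cp /data/mp3/.zfs/snapshot/may-1-2006/some_song.mp3 /data/mp3/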

Now let's take a look at how to recover from mistakes using snapshots. First let's create a filesystem and populate it with a few files.

#zfs create data/test1
#cd /data/test1
#mkfile 100m file1 file2 file3 file4 file5
#ls
file1 file2 file3 file4 file5
#

We now have 5 files, each 100 megabytes. Let's take a snapshot, and then delete a couple of files.

# zfs snapshot data/test1@backup
# rm file2 file3
# ls
file1 file4 file5

The files are gone. Oops: a day or a month later I realize I need those files.

# cd ..
#zfs rollback data/test1@backup

So all we do is roll back the filesystem using the saved snapshot, and the files are back.

# ls
file1 file2 file3 file4 file5
#

ZFS makes it easy to create lots of filesystems. In Linux you are limited to 16 file systems per drive (yes, I know you can use the Linux volume manager, but that adds even more complexity to the RAID setup outlined above). As drives get bigger you end up with hundreds or even thousands of files and directories per drive, making it easy to lose files in the levels of directories. With ZFS there is no real limit to the number of filesystems, they all share the storage pool, and they are quick and easy to create.

#time zfs create data/word_processor_files
real 0m0.571s
user 0m0.019s
sys 0m0.040s
#

A little over half a second to create a filesystem, and you can create as many as you like.

The next problem the home user may face is running out of space. Typically the user heads down to the local electronics or computer shop and gets another hard drive, or two if they want to stay safe with RAID, in which case it's back to the RAID setup guide. Depending on the filesystem you may be able to grow it with more cryptic commands, turning your setup into a RAID 1+0, but it's pretty complicated, so most people resort to keeping the filesystems simple and moving files back and forth between them to get the space they need.

With ZFS it is only one command to add the drive(s) to the pool of storage.

#zpool add data mirror drive3 drive4

Afterward all your filesystems have access to the additional space. If money is a little tight, you can turn on compression on any filesystem you like with a simple command; all files added to the filesystem from then on are compressed, possibly using less space. Note that this usually doesn't slow down I/O at all; on some systems and workloads it actually speeds up data access.

# zfs set compression=on data/filesystem
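
To see how much you are actually saving, the compression ratio is exposed as a read-only property; a sketch:

# zfs get compressratio data/filesystem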

ZFS is so simple you can talk your grandmother through the process of creating filesystems or restoring data. This is just a small sample of what ZFS can do, but it's all just as simple as what I have shown you here. Even if you are more advanced, you can still benefit from ZFS's ease of use: no more hitting the web to study how-tos for setting up RAID or LVM. Even if you can't afford two drives in your home box, ZFS will be perfectly happy with one drive; you lose hardware redundancy, but snapshots are still there to take care of software or user-introduced filesystem problems.