Friday, April 27, 2007

A Comparison of Solaris, Linux, and FreeBSD Kernels

by Max Bruning

I spend most of my time teaching classes on Solaris internals, device drivers, and kernel crash dump analysis and debugging. When I explain to classes how various subsystems are implemented in Solaris, students often ask, "How does it work in Linux?" or, "In FreeBSD, it works like this, how about Solaris?" This article examines three of the basic subsystems of the kernel and compares their implementations in Solaris 10, Linux 2.6, and FreeBSD 5.3.

The three subsystems examined are scheduling, memory management, and file system architecture. I chose these subsystems because they are common to any operating system (not just Unix and Unix-like systems), and they tend to be the most well-understood components of the operating system.

This article does not go into in-depth detail on any of the subsystems described. For that, refer to the source code, various websites, and books on the subject.

If you search the Web for Linux, FreeBSD, and Solaris comparisons, most of the hits discuss old versions of the OSes (in some cases Solaris 2.5, Linux 2.2, and so on). Many of the "facts" are incorrect for the newest releases, and some were incorrect even for the releases they intended to describe. Of course, most of them also make value judgments on the merits of the OSes in question, and there is little information comparing the kernels themselves.

One of the more interesting aspects of the three OSes is how similar they are. Once you get past the different naming conventions, each OS takes fairly similar paths toward implementing the different concepts. Each OS supports time-shared scheduling of threads, demand paging with a not-recently-used page replacement algorithm, and a virtual file system layer to allow the implementation of different file system architectures. Ideas that originate in one OS often find their way into others. For instance, Linux also uses the concepts behind Solaris's slab memory allocator. Much of the terminology seen in the FreeBSD source is also present in Solaris. With Sun's move to open source Solaris, I expect to see much more cross-fertilization of features. Currently, the LXR project provides a source cross-reference browser for FreeBSD, Linux, and other Unix-related OSes, available at fxr.watson.org. It would be great to see OpenSolaris source added to that site.

Scheduling and Schedulers

The basic unit of scheduling in Solaris is the kthread_t; in FreeBSD, the thread; and in Linux, the task_struct. Solaris represents each process as a proc_t, and each thread within the process has a kthread_t. Linux represents processes (and threads) by task_struct structures. A single-threaded process in Linux has a single task_struct. A single-threaded process in Solaris has a proc_t, a single kthread_t, and a klwp_t. The klwp_t provides a save area for threads switching between user and kernel modes. A single-threaded process in FreeBSD has a proc struct, a thread struct, and a ksegrp struct. The ksegrp is a "kernel scheduling entity group." Effectively, all three OSes schedule threads, where a thread is a kthread_t in Solaris, a thread structure in FreeBSD, and a task_struct in Linux.
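To make those relationships concrete, here is a highly simplified sketch in C of the Solaris-style and Linux-style arrangements (the FreeBSD proc/thread/ksegrp trio is analogous). The structure and field names are invented for illustration and do not match the actual kernel headers of any of the three systems.

/*
 * Illustrative sketch only: invented, simplified stand-ins for the real
 * scheduling entities described above.  Field names do not match the
 * actual Solaris, Linux, or FreeBSD definitions.
 */

struct proc_sketch;
struct mm_sketch;

/* Solaris-style: a process owns kernel threads; each thread has an LWP. */
struct klwp_sketch {
    /* register save area used when switching between user and kernel mode */
    unsigned long saved_regs[32];
};

struct kthread_sketch {
    int                    pri;        /* global dispatch priority */
    struct klwp_sketch    *lwp;        /* user/kernel switch save area */
    struct proc_sketch    *proc;       /* owning process */
    struct kthread_sketch *next;       /* next thread in the same process */
};

struct proc_sketch {
    struct kthread_sketch *threads;    /* list of kthreads in this process */
};

/* Linux-style: one task_struct per thread; threads of a process share state. */
struct task_sketch {
    int               pid;
    int               prio;            /* 0..139; lower means better */
    struct mm_sketch *mm;              /* address space shared with siblings */
};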

Scheduling decisions are based on priority. In Linux and FreeBSD, the lower the numerical priority value, the better: a value closer to 0 represents a higher priority. In Solaris it is the reverse: the higher the value, the higher the priority. Table 1 shows the priority values of the different OSes.

Table 1. Scheduling Priorities in Solaris, Linux, and FreeBSD

Solaris
  Priorities   Scheduling Class
  0-59         Time Shared, Interactive, Fixed, Fair Share Scheduler
  60-99        System Class
  100-159      Real-Time (note that real-time is higher than system threads)
  160-169      Low-level Interrupts

Linux
  Priorities   Scheduling Class
  0-99         System threads, real-time (SCHED_FIFO, SCHED_RR)
  100-139      User priorities (SCHED_NORMAL)

FreeBSD
  Priorities   Scheduling Class
  0-63         Interrupt
  64-127       Top-half kernel
  128-159      Real-time user (system threads have better priority)
  160-223      Time-share user
  224-255      Idle user

All three OSes favor interactive threads/processes. Interactive threads run at better priority than compute-bound threads, but tend to run for shorter time slices. Solaris, FreeBSD, and Linux all use a per-CPU "runqueue." FreeBSD and Linux use an "active" queue and an "expired" queue. Threads are scheduled in priority order from the active queue. A thread moves from the active queue to the expired queue when it uses up its time slice (and possibly at other times to avoid starvation). When the active queue is empty, the kernel swaps the active and expired queues. FreeBSD has a third queue for "idle" threads. Threads run on this queue only when the other two queues are empty. Solaris uses a "dispatch queue" per CPU. If a thread uses up its time slice, the kernel gives it a new priority and returns it to the dispatch queue. The "runqueues" for all three OSes have separate linked lists of runnable threads for different priorities. (FreeBSD uses one list per four priorities, while Solaris and Linux use a separate list for each priority.)
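As a concrete illustration of that arrangement, here is a minimal sketch in C of an active/expired pair of per-priority run lists, with the swap that happens when the active array empties. It is not taken from any of the three kernels; the type names and the 140-level priority range are assumptions used only for illustration.

/*
 * Minimal sketch of "array of per-priority run lists plus active/expired
 * swap".  NPRIO, thread_t, and the queue types are invented names.
 */
#include <stddef.h>

#define NPRIO 140                      /* e.g. Linux 2.6 uses 140 levels */

typedef struct thread {
    struct thread *next;               /* link within one priority list */
    int            prio;               /* 0 = best priority in this sketch */
} thread_t;

typedef struct prio_array {
    thread_t *queues[NPRIO];           /* one list head per priority */
} prio_array_t;

typedef struct cpu_rq {
    prio_array_t  arrays[2];
    prio_array_t *active;              /* threads still holding a time slice */
    prio_array_t *expired;             /* threads that used up their slice */
} cpu_rq_t;

/* Pick the best-priority runnable thread; swap arrays when active is empty. */
static thread_t *pick_next(cpu_rq_t *rq)
{
    for (int pass = 0; pass < 2; pass++) {
        for (int p = 0; p < NPRIO; p++) {
            thread_t *t = rq->active->queues[p];
            if (t != NULL) {
                rq->active->queues[p] = t->next;   /* dequeue the head */
                return t;
            }
        }
        /* Active array empty: swap with expired and try once more. */
        prio_array_t *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    return NULL;                       /* nothing runnable on this CPU */
}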

Linux and FreeBSD use an arithmetic calculation based on run time versus sleep time of a thread (as a measure of "interactive-ness") to arrive at a priority for the thread. Solaris performs a table lookup. None of the three OSes support "gang scheduling." Rather than schedule n threads, each OS schedules, in effect, the next thread to run. All three OSes have mechanisms to take advantage of caching (warm affinity) and load balancing. For hyperthreaded CPUs, FreeBSD has a mechanism to help keep threads on the same CPU node (though possibly a different hyperthread). Solaris has a similar mechanism, but it is under control of the user and application, and is not restricted to hyperthreads (called "processor sets" in Solaris and "processor groups" in FreeBSD).
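Purely to illustrate the "run time versus sleep time" idea, the toy calculation below awards a bonus to threads that spend most of their time sleeping and a penalty to CPU hogs. The constants and the formula are invented; the real Linux 2.6 and FreeBSD calculations differ in their details, and Solaris, as noted, uses table lookups instead.

/*
 * Illustration only: a toy interactivity heuristic, not the actual
 * Linux or FreeBSD formula.
 */
#define MAX_BONUS 10                   /* +/- adjustment around the base priority */

static int effective_priority(int base_prio,
                              unsigned long run_ticks,
                              unsigned long sleep_ticks)
{
    unsigned long total = run_ticks + sleep_ticks;
    if (total == 0)
        return base_prio;

    /* Mostly-sleeping (interactive) threads earn a bonus; CPU hogs a penalty. */
    long bonus = (long)(sleep_ticks * (2 * MAX_BONUS) / total) - MAX_BONUS;
    return base_prio - (int)bonus;     /* lower value = better priority here */
}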

One of the big differences between Solaris and the other two OSes is the capability to support multiple "scheduling classes" on the system at the same time. All three OSes support Posix SCHED_FIFO, SCHED_RR, and SCHED_OTHER (or SCHED_NORMAL). SCHED_FIFO and SCHED_RR typically result in "realtime" threads. (Note that Solaris and Linux support kernel preemption in support of realtime threads.) Solaris has support for a "fixed priority" class, a "system class" for system threads (such as page-out threads), an "interactive" class used for threads running in a windowing environment under control of the X server, and the Fair Share Scheduler in support of resource management. See priocntl(1) for information about using the classes, as well as an overview of the features of each class. See FSS(7) for an overview specific to the Fair Share Scheduler. The scheduler on FreeBSD is chosen at compile time, and on Linux the scheduler depends on the version of Linux.

The ability to add new scheduling classes to the system comes with a price. Every place in the kernel where a scheduling decision can be made (except for the actual act of choosing the thread to run) involves an indirect function call into scheduling-class-specific code. For instance, when a thread is going to sleep, it calls scheduling-class-dependent code that does whatever is necessary for sleeping in that class. On Linux and FreeBSD, the scheduling code simply does the needed action; there is no need for an indirect call. The extra layer means there is slightly more overhead for scheduling on Solaris (but more features).
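The sketch below shows the general shape of that indirection: generic code calls through a per-class table of function pointers rather than into fixed scheduler code. The names are invented for illustration; this is not the real Solaris class-operations vector.

/*
 * Sketch of scheduling-class indirection.  Operation names are invented;
 * Solaris keeps a comparable table of per-class function pointers and
 * dispatches through it.
 */
typedef struct sched_class_ops {
    void (*cl_sleep)(void *thread);        /* called when a thread blocks */
    void (*cl_wakeup)(void *thread);       /* called when it becomes runnable */
    void (*cl_tick)(void *thread);         /* called from the clock tick */
} sched_class_ops_t;

typedef struct sched_thread {
    const sched_class_ops_t *t_clops;      /* this thread's scheduling class */
    void                    *t_class_data; /* class-private per-thread state */
} sched_thread_t;

/* Generic code never knows which class it is talking to. */
static void thread_sleep(sched_thread_t *t)
{
    t->t_clops->cl_sleep(t);               /* indirect, class-specific call */
}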

Memory Management and Paging

In Solaris, every process has an "address space" made up of logical section divisions called "segments." The segments of a process address space are viewable via pmap(1). Solaris divides the memory management code and data structures into platform-independent and platform-specific parts. The platform-specific portions of memory management are in the HAT, or hardware address translation, layer. FreeBSD describes its process address space by a vmspace, divided into logical sections called regions. Hardware-dependent portions are in the "pmap" (physical map) module, and "vmap" routines handle the hardware-independent portions and data structures. Linux uses a memory descriptor that divides the process address space into logical sections called "memory areas." Linux also has a pmap command to examine a process address space.

Linux divides the machine-dependent layer from the machine-independent layer at a much higher level in the software. On Solaris and FreeBSD, much of the code dealing with, for instance, page fault handling is machine-independent. On Linux, the code to handle page faults is machine-dependent pretty much from the beginning of the fault handling. A consequence of this is that Linux can execute much of its paging code more quickly, because there is less data abstraction (layering) in the code. However, the cost is that a change in the underlying hardware or model requires more changes to the code. Solaris and FreeBSD isolate such changes to the HAT and pmap layers, respectively.

Segments, regions, and memory areas are delimited by:

  • Virtual address of the start of the area.
  • Their location within an object/file that the segment/region/memory area maps.
  • Permissions.
  • Size of the mapping.

For instance, the text of a program is in a segment/region/memory area. The mechanisms in the three OSes to manage address spaces are very similar, but the names of data structures are completely different. Again, more of the Linux code is machine-dependent than is true of the other two OSes.
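A hypothetical descriptor capturing the fields listed above might look like the following. The names are illustrative only and do not correspond to the real seg, vm_map_entry, or vm_area_struct definitions.

/*
 * Invented descriptor for a Solaris segment, FreeBSD region, or Linux
 * memory area, per the list above.  Illustration only.
 */
#include <stddef.h>
#include <sys/types.h>

typedef struct mapping_desc {
    void   *md_start;        /* virtual address where the mapping begins */
    size_t  md_size;         /* size of the mapping in bytes */
    off_t   md_offset;       /* offset within the backing object/file */
    void   *md_object;       /* mapped object (file, anonymous memory, ...) */
    int     md_prot;         /* permissions, e.g. read/write/execute bits */
} mapping_desc_t;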

Paging

All three operating systems use a variation of a least recently used algorithm for page stealing/replacement. All three have a daemon process/thread to do page replacement. On FreeBSD, the vm_pageout daemon wakes up periodically and when free memory becomes low. When available memory goes below certain thresholds, vm_pageout runs a routine (vm_pageout_scan) to scan memory and try to free some pages. The vm_pageout_scan routine may need to write modified pages asynchronously to disk before freeing them. There is one such daemon regardless of the number of CPUs. Solaris also has a pageout daemon that runs periodically and in response to low-free-memory situations. Paging thresholds in Solaris are automatically calibrated at system startup so that the daemon does not overuse the CPU or flood the disk with page-out requests. The FreeBSD daemon, for the most part, uses hard-coded or tunable values to determine its paging thresholds. Linux also uses an LRU algorithm that is dynamically tuned while it runs. On Linux, there can be multiple kswapd daemons, as many as one per CPU. All three OSes use a global working-set policy (as opposed to a per-process working set).

FreeBSD has several page lists for keeping track of recently used pages. These track "active," "inactive," "cached," and "free" pages. Pages move between these linked lists depending on their uses. Frequently accessed pages will tend to stay on the active list. Data pages of a process that exits can be immediately placed on the free list. FreeBSD may swap entire processes out if vm_pageout_scan cannot keep up with load (for example, if the system is low on memory). If the memory shortage is severe enough, vm_pageout_scan will kill the largest process on the system.

Linux also uses different linked lists of pages to facilitate an LRU-style algorithm. Linux divides physical memory into (possibly multiple sets of) three "zones:" one for DMA pages, one for normal pages, and one for dynamically allocated memory. These zones seem to be very much an implementation detail caused by x86 architectural constraints. Pages move between "hot," "cold," and "free" lists. Movement between the lists is very similar to the mechanism on FreeBSD. Frequently accessed pages will be on the "hot" list. Free pages will be on the "cold" or "free" list.

Solaris uses a free list, hashed list, and vnode page list to maintain its variation of an LRU replacement algorithm. Instead of scanning the vnode or hash page lists (more or less the equivalent of the "active"/"hot" lists in the FreeBSD/Linux implementations), Solaris scans all pages, using a "two-handed clock" algorithm as described in Solaris Internals and elsewhere. The two hands stay a fixed distance apart. The front hand ages the page by clearing the reference bit(s) for the page. If no process has referenced the page since the front hand visited it, the back hand will free the page (first asynchronously writing the page to disk if it is modified).
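A simplified sketch of the two-handed clock follows. The page structure, constants, and helper functions are placeholders chosen for illustration; the real Solaris pageout scanner is considerably more elaborate.

/*
 * Toy two-handed clock.  Everything here is a placeholder, not the real
 * Solaris pageout code.
 */
#include <stdbool.h>
#include <stddef.h>

#define NPAGES     1024                 /* toy physical page array */
#define HANDSPREAD 256                  /* fixed distance between the hands */

typedef struct page {
    bool referenced;                    /* reference bit, cleared by front hand */
    bool modified;                      /* page is dirty */
} page_t;

static page_t pages[NPAGES];

static void write_page_to_disk(page_t *p) { (void)p; /* placeholder */ }
static void free_page(page_t *p)          { (void)p; /* placeholder */ }

/* One step of the scan: the front hand ages, the back hand reclaims. */
static void clock_scan_step(size_t front)
{
    size_t back = (front + NPAGES - HANDSPREAD) % NPAGES;

    pages[front].referenced = false;    /* front hand: clear the reference bit */

    /* Back hand: if the page has not been referenced since the front hand
     * passed it, clean it (if dirty) and free it. */
    if (!pages[back].referenced) {
        if (pages[back].modified)
            write_page_to_disk(&pages[back]);   /* asynchronous in reality */
        free_page(&pages[back]);
    }
}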

All three operating systems take NUMA locality into account during paging. The I/O buffer cache and the virtual memory page cache are merged into one system page cache on all three OSes. The system page cache is used for reads/writes of files as well as for mmapped files and the text and data of applications.

File Systems

All three operating systems use a data abstraction layer to hide file system implementation details from applications. In all three OSes, you use open, close, read, write, stat, etc. system calls to access files, regardless of the underlying implementation and organization of file data. Solaris and FreeBSD call this mechanism VFS ("virtual file system") and the principal data structure is the vnode, or "virtual node." Every file being accessed in Solaris or FreeBSD has a vnode assigned to it. In addition to generic file information, the vnode contains pointers to file-system-specific information. Linux uses a similar mechanism, also called VFS (for "virtual file switch"). In Linux, the file-system-independent data structure is an inode. This structure is similar to the vnode on Solaris/FreeBSD. (Note that there is an inode structure in Solaris/FreeBSD as well, but it holds file-system-dependent data for UFS file systems.) Linux has two different structures, one for file operations and the other for inode operations. Solaris and FreeBSD combine these as "vnode operations."
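The following sketch shows the general VFS idea: a generic node structure carries a pointer to a table of file-system-specific operations, and common code dispatches through it. The names are invented for illustration and are not the actual vnode or inode operation vectors of any of the three OSes.

/*
 * VFS-style indirection sketch.  ufs_vnodeops, nfs_vnodeops, etc. would be
 * concrete tables supplied by each file system implementation.
 */
#include <stddef.h>
#include <sys/types.h>

struct sketch_vnode;

typedef struct sketch_vnodeops {
    int     (*vop_open)(struct sketch_vnode *vp, int flags);
    ssize_t (*vop_read)(struct sketch_vnode *vp, void *buf,
                        size_t len, off_t off);
    int     (*vop_close)(struct sketch_vnode *vp);
} sketch_vnodeops_t;

typedef struct sketch_vnode {
    const sketch_vnodeops_t *v_ops;   /* points at the owning FS's table */
    void                    *v_data;  /* file-system-private data (e.g. inode) */
} sketch_vnode_t;

/* The generic read path neither knows nor cares which file system it hits. */
static ssize_t vn_read(sketch_vnode_t *vp, void *buf, size_t len, off_t off)
{
    return vp->v_ops->vop_read(vp, buf, len, off);
}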

VFS allows the implementation of many file system types on the system. This means that there is no reason that one of these operating systems could not access the file systems of the other OSes. Of course, this requires the relevant file system routines and data structures to be ported to the VFS of the OS in question. All three OSes allow the stacking of file systems. Table 2 lists file system types implemented in each OS, but it does not show all file system types.

Table 2. Partial List of File System Types

Solaris
  ufs       Default local file system (based on the BSD Fast File System)
  nfs       Remote files
  proc      /proc files; see proc(4)
  namefs    Name file system; allows opening of doors/streams as files
  ctfs      Contract file system used with the Service Management Facility
  tmpfs     Uses anonymous space (memory/swap) for temporary files
  swapfs    Keeps track of anonymous space (data, heap, stack, etc.)
  objfs     Keeps track of kernel modules; see objfs(7FS)
  devfs     Keeps track of /devices files; see devfs(7FS)

FreeBSD
  ufs       Default local file system (UFS2, based on the BSD Fast File System)
  devfs     Keeps track of /dev files
  ext2      Linux ext2 file system
  nfs       Remote files
  ntfs      Windows NT file system
  smbfs     Samba file system
  portalfs  Mounts a process onto a directory
  kernfs    Files containing various system information

Linux
  ext3      Journaling file system derived from ext2
  ext2      Standard non-journaling local file system
  afs       AFS client support for remote file sharing
  nfs       Remote files
  coda      Another networked file system
  procfs    Processes, processors, buses, platform specifics
  reiserfs  Journaling file system

Conclusions


Solaris, FreeBSD, and Linux are obviously benefiting from each other. With Solaris going open source, I expect this to continue at a faster rate. My impression is that change is most rapid in Linux. The benefit of this is that new technology is incorporated into the system quickly. Unfortunately, the documentation (and possibly some robustness) sometimes lags behind. Linux has many developers, and sometimes it shows. FreeBSD has been around (in some sense) the longest of the three systems. Solaris has its basis in a combination of BSD Unix and AT&T Bell Labs Unix. Solaris uses more data abstraction layering, and generally could support additional features quite easily because of this. However, most of the layering in the kernel is undocumented. Source code access will probably change this.

A brief example to highlight the differences is page fault handling. In Solaris, when a page fault occurs, the code starts in a platform-specific trap handler, then calls a generic as_fault() routine. This routine determines the segment where the fault occurred and calls a "segment driver" to handle the fault. The segment driver calls into file system code. The file system code calls into the device driver to bring in the page. When the page-in is complete, the segment driver calls the HAT layer to update page table entries (or their equivalent). On Linux, when a page fault occurs, the kernel calls the code to handle the fault and is almost immediately in platform-specific code. This means the fault-handling code can be quicker in Linux, but the Linux code may not be as easily extended or ported.
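To show what "layered" means here, the sketch below walks a fault through an invented machine-independent layer that locates the segment and hands the fault to a per-segment driver. Every name is hypothetical; the real as_fault()/segment-driver interfaces are considerably richer.

/*
 * Rough sketch of a layered, Solaris-style fault path.  All names invented.
 */
struct seg_sketch;

typedef struct seg_ops_sketch {
    /* Each segment driver (vnode-backed, anonymous, ...) supplies its own. */
    int (*fault)(struct seg_sketch *seg, void *vaddr);
} seg_ops_sketch_t;

typedef struct seg_sketch {
    void                   *base;      /* start of the segment */
    unsigned long           size;
    const seg_ops_sketch_t *ops;
} seg_sketch_t;

/* Machine-independent: find the segment and hand the fault to its driver. */
static int as_fault_sketch(seg_sketch_t *segs, int nsegs, void *vaddr)
{
    for (int i = 0; i < nsegs; i++) {
        char *base = segs[i].base;
        if ((char *)vaddr >= base && (char *)vaddr < base + segs[i].size)
            return segs[i].ops->fault(&segs[i], vaddr);  /* indirect call */
    }
    return -1;  /* no mapping: this would become a SIGSEGV */
}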

Kernel visibility and debugging tools are critical to a correct understanding of system behavior. Yes, you can read the source code, but I maintain that you can easily misread the code. Having tools available to test your hypotheses about how the code works is invaluable. In this respect, I see Solaris, with kmdb, mdb, and DTrace, as a clear winner. I have been "reverse engineering" Solaris for years. I find that I can usually answer a question by using the tools faster than I can answer the same question by reading source code. With Linux, I don't have as many options for this. FreeBSD allows the use of gdb on kernel crash dumps. gdb can set breakpoints, single-step, and examine and modify data and code. On Linux, this is also possible once you download and install the tools.

Max Bruning currently teaches and consults on Solaris internals, device drivers, kernel (as well as application) crash analysis and debugging, networking internals, and specialized topics. Contact him at max at bruningsystems dot com or http://mbruning.blogspot.com/.

Tuesday, April 17, 2007

Limitations of File System for Solaris OE

Description

This document describes the limitations of the filesystem for the Solaris[TM] Operating Environment.

The effect of OpenBoot[TM] PROM (OBP) revisions on filesystem size limitations is also discussed.

Definitions:

  1. 1 Mbyte (megabyte) = 2^20 bytes (1,048,576 bytes)

  2. 1 Gbyte (gigabyte) = 2^30 bytes (1,073,741,824 bytes)

  3. 1 Tbyte (terabyte) = 2^40 bytes (1,099,511,627,776 bytes)

In general, if this document refers to a limitation of (for instance) 2 Gbyte, "2 Gbyte - 1 byte" (2,147,483,647 bytes) is meant.

To improve readability, this has been abbreviated to "2 Gbyte".

The symbol "~" is used to denote "approximately".

Maximum size of a single file and a filesystem.

This also applies to Solaris x86, but there may be some issues with disk drives larger than 30 Gbyte. This is due to hardware limitations with some PC motherboard/disk configurations.

OS Release                               Single file    File system
Solaris 2.5.1                            2 Gbyte        1 Tbyte
Solaris 2.6 - 9 12/02 (U3)               ~1012 Gbyte    1 Tbyte
Solaris 9 08/03 (U4) - Solaris 10 FCS    ~1023 Gbyte    16 Tbyte

  1. A single file in Solaris 2.6 through Solaris 9 (U3) is limited to approximately 1012 Gbyte because the file must fit inside the 1 Tbyte filesystem. The filesystem is nominally 1 Tbyte, but due to the overhead in such a large filesystem, the largest single file ends up being about 1012 Gbyte. (Part of this is a bug, but even if it were bug-free, a single file could not be 1 Tbyte due to the filesystem overhead.) The overhead in the filesystem includes, amongst other things, items such as superblock backups and inode tables. The example given here uses 1024 Kbyte (1 Mbyte) for the number of bytes per inode (nbpi) within the UFS filesystem. If nbpi is set to a lower value, more filesystem space will be allocated to inode tables and less will be available to store data. With nbpi=8 Kbyte, the maximum single file size would be smaller than 1012 Gbyte. The file in this example also cannot use any of the minfree area set up on the filesystem (which is reserved for "root only" use). On a 1 Tbyte filesystem, minfree is set to 1%, which translates to about 10 Gbyte. (A rough back-of-the-envelope version of this calculation is sketched after this list.)
  2. A safe assumption is that the limit on the size of a single file is the size of the filesystem, minus 1% to 2% overhead. In Solaris 2.6, the swap and tmpfs filesystems are still limited to 2 Gbyte. This is not the total amount of swap; it is a limit per swap slice or per swap file. A swap slice or file may be defined as larger than 2 Gbyte, but any space above 2 Gbyte in that slice or file will not be accessible, and the size of the slice or file will be reported by the swap command as 2 Gbyte. There can be multiple swap slices or files totaling more than 2 Gbyte. Any later release of Solaris running a 32-bit kernel has the same limitation. Later releases of Solaris running a 64-bit kernel do not have this limitation. See the "USAGE" paragraph in the Solaris 8 swap(1M) manual page for the new limits.

  3. Solaris 9 Update 4 introduced multiterabyte UFS. The maximum individual file size is still the same as before (~1 Tbyte), since increasing it would require radical on-disk format changes. The total filesystem size can now be up to 16 Tbyte. The -T option is specified to the newfs command to create such a filesystem. See the newfs(1M) manpage for additional information. There is also a limit of 1 million files per Tbyte (for instance, a 4 Tbyte UFS filesystem would have a 4-million-file limit). This is done to keep fsck times reasonable (even when logging is enabled).

  4. Multiterabyte UFS functionality can also be added to earlier releases of Solaris 9 by installing the UFS patch 113454-09 or later. See the Special Install Instructions in the patch README for a list of additional patches required to get the full Multiterabyte functionality.

  5. The maximum single file size in a multiterabyte filesystem (one greater than 1 Tbyte) is 1 Tbyte minus 500 Mbyte, or 1023.5 Gbyte. As a rule of thumb, this should be taken as 1023 Gbyte.

  6. A multiterabyte UFS filesystem is not bootable (this means the root filesystem cannot be a multiterabyte filesystem).

  7. A multiterabyte UFS filesystem is not mountable under any 32-bit Solaris kernel.
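As a rough illustration of the arithmetic in item 1, the small program below subtracts an assumed ~1% minfree reservation and a small assumed metadata overhead from a nominal 1 Tbyte filesystem. The percentages are guesses chosen to land near the ~10 Gbyte and ~1012 Gbyte figures quoted above, not values measured from a real UFS filesystem.

/*
 * Back-of-the-envelope overhead calculation for item 1 above.
 * The overhead percentages are assumptions for illustration only.
 */
#include <stdio.h>

int main(void)
{
    const double TB = 1099511627776.0;          /* 2^40 bytes */
    const double GB = 1073741824.0;             /* 2^30 bytes */

    double fs_size  = 1.0 * TB;                 /* nominal 1 Tbyte filesystem */
    double minfree  = 0.01  * fs_size;          /* ~1% reserved for root */
    double metadata = 0.002 * fs_size;          /* assumed ~0.2% for inode tables, etc. */

    double usable = fs_size - minfree - metadata;

    printf("minfree  ~ %.0f Gbyte\n", minfree / GB);   /* ~10 Gbyte */
    printf("max file ~ %.0f Gbyte\n", usable / GB);    /* ~1012 Gbyte ballpark */
    return 0;
}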

Limitations in combination with the OBP.

Excluding the multiterabyte root filesystem limitation detailed above, the root filesystem has limits in Solaris[TM] 2.x that are not imposed on any other filesystem. This limiting factor is a combination of two things: the OS release and the OBP (OpenBoot PROM) level.

To establish the OS release, examine the /etc/release file. To establish the OBP level, use the command prtconf -V (capital "V").

Here is a list of the various possible configurations.

OBP level                OS Release                                            Max root filesystem size
OBP 3.1beta1 or newer    Solaris 2.5.1 with kernel patch 103640-08 or newer    No limit
OBP 3.1beta1 or newer    Solaris 2.6                                           No limit
OBP 3.0 or earlier       Solaris 2.5.1 with kernel patch 103640-07 or older    2 Gbyte
OBP 3.0 or earlier       Solaris 2.6 on an Ultra (sun4u)                       4 Gbyte

Solaris 10 and ZFS

ZFS is now fully integrated in Solaris 10 U2 and above.

Limitations of a single file and filesystem in ZFS:

http://www.opensolaris.org/os/community/zfs/faq/

What limits does ZFS have?

- The limitations of ZFS are designed to be so large that they will never be encountered in any practical operation.

- ZFS can store 16 Exabytes in each storage pool, file system, file, or file attribute.

- ZFS can store billions of names: files or directories in a directory, file systems in a file system, or snapshots of a file system.

- ZFS can store trillions of items: files in a file system, file systems, volumes, or snapshots in a pool.

Tuesday, April 10, 2007

Migrating UFS to ZFS on the Fly: Mission Impossible, or a Mission That Will Be Completed?

Although ZFS has already proved to be powerful and flexible, few of our customers plan to migrate their UFS filesystems to ZFS.

One reason for this is that most of them believe ZFS works well in the lab, but they are not convinced it can perform well in their production systems. After all, ZFS is still a new thing compared with other mature filesystems, including UFS. Time can change this: I believe they will change their minds as more real-world data is provided by users who are already working with ZFS.

The other important reason is probably that they will not risk the long downtime required to copy several terabytes of data without intelligent backup tools such as VERITAS NetBackup or automatic filesystem replication tools.

If Sun Microsystems does not provide tools for migrating UFS to ZFS on the fly, ZFS may well remain mission impossible for users with several terabytes, or even hundreds of gigabytes, of data. What a pity!

Friday, April 6, 2007

Sun Sets New World Records With Enhanced UltraSPARC IV+ Servers Running Solaris

Sun Boosts Performance on UltraSPARC Servers, Showing Sun's UltraSPARC Leadership
SANTA CLARA, Calif. April 3, 2007 Sun Microsystems, Inc. (NASDAQ: SUNW) today announced the availability of faster 1.95GHz and 2.1GHz UltraSPARC(R) IV+ processors for its popular Sun Fire servers. New Sun Fire servers with UltraSPARC IV+ 1.95GHz and 2.1GHz processors offer world record performance, easy application portability and industry leading investment protection. The new servers are powered by the Solaris Operating System (OS), which delivers customers unbroken binary compatibility, thus ensuring that existing applications will run on the new UltraSPARC IV+ processors without the need to re-code or re-compile.


According to industry analyst reports, Sun has gained market share for four consecutive quarters. SPARC-based systems represent a significant portion of that growth and the new 1.95GHz and 2.1 GHz UltraSPARC IV+ processors demonstrate Sun's ongoing commitment to improving the performance of the SPARC architecture and the protection of customer investment.

"SPARC, with Solaris, is the engine that drives our server business, and we're committed to continual improvements on our SPARC-based products," said John Fowler, executive vice president, Systems Group, Sun Microsystems. "The growing demand for UltraSPARC IV+ servers has helped Sun build tremendous market momentum in our systems line-up, and steal significant market share at the expense of our competitors."

Sun Sets New World Records


The Sun Fire E2900 with UltraSPARC IV+ 1.95GHz (US-IV+1.95GHz) set a new world record for a single application server, with a Sun Fire T2000 as the database server, on SPECjAppServer2004, achieving >1781 JOPS - the highest 2-node result to date. The Sun Fire (US-IV+1.95GHz) uses six Solaris Containers, which - through consolidation - improve datacenter efficiency and promote higher levels of system utilization.1

The Sun Fire E6900 (24 processors, 48 cores, 48 threads) with UltraSPARC IV+ 1.95GHz set a new world-record for the SAP-SD 2-Tier Standard Application benchmark for systems with 24 or fewer processors as of 04/02/07, achieving 6160 users.2

Sun Fire UltraSPARC IV+ servers now own over 75 world-record benchmarks.

The Sun Fire V490, V890, E2900, E4900, E6900, E20K and E25K servers, powered by new 1.95GHz and 2.1GHz UltraSPARC IV+ processors have up to 2x the life of comparable IBM servers and up to 1/3 better TCO.

When compared to previous generations, the new UltraSPARC IV+ processor has shown 2X performance over the UltraSPARC IV and 5X performance over the UltraSPARC III.

Sun Fire UltraSPARC platforms are designed to help customers with CRM, business intelligence/data warehousing, and enterprise applications using large databases. The systems are ideal for virtualization environments using a combination of fault isolated hard partitions and flexible Solaris Containers. Solaris 10 Containers can consolidate and virtualize hundreds of applications on a single system so customers save in energy, space and complexity. Solaris Container management is superior to HP and IBM partitioning strategies as it requires less overhead, while providing resource flexibility down to a single processor.

Thursday, April 5, 2007

0-Day in Solaris 10 and 11

telnet -l "-froot" ip_addr
will get you root on most Solaris 10/11 systems with default configurations.

So disabling the telnet service once the system is installed is advised:
# svcadm disable telnet

Sun Introduced Netra X4200 M2 server

By Chip Brookshaw

March 27, 2007 - Telecommunications customers need a unique combination of stability and horsepower--the IT equivalent of a tank that accelerates like a sports car. Sun has met those requirements for more than a decade with the industry's broadest line of carrier grade servers and blades.

Last year, Sun completely refreshed its telco line with new processors and added ATCA support. That momentum continues today with the announcement of the Sun Netra X4200 M2 server. The new server makes it even easier for telco customers to maintain standardization without sacrificing performance.

The 2U, 20-inch Sun Netra X4200 M2 server is Sun's first NEBS Level 3-certified rack server powered by AMD Opteron processors. It delivers several key advantages:

* Broadest OS support on the market
* Highest storage and memory capacity in its class
* Support for Sun's new 10 GbE Networking Technology

The Sun Netra X4200 M2 server is designed to power solutions such as media servers, voice over IP (VoIP) solutions, signaling gateway controllers, element management systems, network traffic analysis systems, operation management systems, and data management systems.

"Adding the Netra X4200 M2 server to the Netra server family extends the choice of platforms for our customers and partners," says Baljeet Grewal, Sun product line manager. "Companies can now have a single vendor for their UltraSPARC, x64, Solaris, Linux, and Windows based carrier-grade servers."
Scalability and Investment Protection

The Sun Netra X4200 M2 server supports Solaris 10, Red Hat, and SUSE Linux, and Windows operating systems--the broadest platform support in the industry. The server also offers scalability and investment protection with:

* Up to four 146GB SAS drives with RAID 0 and 1 support
* Up to two dual-core AMD Opteron processors
* Upgradeability to future quad-core AMD Opteron processors

The two-socket Sun Netra X4200 M2 server is available now, starting at $9,845. The one-socket version will be available in May, starting at $6,145.
Ready for 10 GbE

Last month Sun debuted its 10 GbE Networking Technology with the introduction of the Sun Multithreaded Networking Card, the first of its kind from a major vendor. The Sun Netra X4200 M2 server ships ready to support the networking card. Sun intends to make its leading 10GbE Networking Technology available across the Netra portfolio in the near future.

Sun's new networking technology brings high-performance multithreading to the entire stack, from the operating system to the processor to the network wire. With 10Gb/s bandwidth across the stack, businesses can provide services to customers faster and more efficiently, lowering overall costs.
New Customer and ISV Adoption

Customers are already experiencing how the Sun Netra X4200 M2 server can add high performance and platform flexibility to their solutions.

"Siemens Networks has been validating the Sun Netra X4200 M2 server in our laboratories for several months now," says Joachim Ungruh, senior vice president, VoIP Solutions, Siemens Networks. "While we are still in the platform verification phase, it has exceeded our performance expectations. The Sun Netra X4200 M2 will significantly enhance the competitiveness of our consumer and business VoIP solution offerings, continuing to reinforce Siemens' strategy to leverage the technology advancements of market-leading computing platforms."

A number of ISVs are developing applications for the Sun Netra X4200 M2 server, including Adax, AppGate Network Security, Surf, and Appium.

The recently announced Sun Netra Data-Plane Software (NDPS) Suite is also gaining support from leading ISVs, including Teja Technologies, SDC Labs, and Surf. NDPS enables telco customers to achieve breakthrough performance for data-plane functions on Netra CoolThreads servers such as the Netra T2000 server and the Netra CP3060 UltraSPARC T1 ATCA blade.
Designed for Demanding Environments

The Sun Netra X4200 M2 server expands Sun's selection of carrier-grade servers and blades--the broadest line in the industry. As with other Netra servers, the Sun Netra X4200 M2 server is ruggedized, making it ideal for the most demanding applications in the toughest environments.

Monday, April 2, 2007

Transaction file system and COW

NTFS is a transaction-based file system. Is that a joke? No, NTFS really is such a file system, even though it performs rather poorly. The use of activity logging and transaction management of file system changes allows the file system to maintain its internal integrity by preventing incomplete transactions from being implemented. One key to the operation of the transaction system is the process employed to check for and undo transactions that were not properly completed. This is sometimes called transaction recovery. Recovery is performed on NTFS volumes each time they are mounted on the system. Most commonly, this occurs when the system is booted or rebooted.

ZFS is also a transaction-based file system. In contrast to NTFS, ZFS performs only copy-on-write operations. This means that the blocks containing the in-use data on disk are never modified. The changed information is written to alternate blocks, and the block pointer to the in-use data is only moved once the write transactions are complete. This happens all the way up the file system block structure to the top block, called the uberblock.


Transactions select unused blocks to write modified data and only then change the location to which the preceding block points. If the machine were to suffer a power outage in the middle of a data write, no corruption occurs because the pointer to the "good" data is not moved until the entire write is complete. (Note: the pointer to the data is the only thing that is moved.) This eliminates the need for a journaling or logging file system and any need for an fsck or mirror resync when a machine reboots unexpectedly.
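A conceptual sketch of that update sequence follows, with invented structures. Real ZFS maintains a checksummed tree of block pointers rooted at the uberblock, which this toy code does not attempt to model.

/*
 * Conceptual copy-on-write update.  Structures and functions are invented
 * placeholders, not ZFS code.
 */
#include <stdlib.h>
#include <string.h>

#define BLOCK_SIZE 4096

typedef struct block {
    unsigned char data[BLOCK_SIZE];
} block_t;

/* Placeholder for the real allocator: hand out a fresh, unused block. */
static block_t *alloc_unused_block(void)
{
    return malloc(sizeof(block_t));
}

/*
 * Never touch the live block.  Write the new contents to an unused block
 * first; only after that write is complete is the parent's pointer switched
 * to the new copy.  If power fails before the final step, the old pointer
 * still refers to consistent data.
 */
static int cow_update(block_t **parent_ptr, const void *newdata, size_t len)
{
    if (len > BLOCK_SIZE)
        return -1;

    block_t *newblk = alloc_unused_block();
    if (newblk == NULL)
        return -1;

    memcpy(newblk->data, newdata, len);    /* step 1: write the new copy */
    /* (real ZFS would checksum and flush before committing) */

    *parent_ptr = newblk;                  /* step 2: repoint the parent */
    /* the old block is now unreferenced and can be reused later */
    return 0;
}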