Archive for the ‘NFS’ category

InnoDB: Unable to lock ./ibdata1, error: 37

June 1st, 2010

I recently hit the error “InnoDB: Unable to lock ./ibdata1, error: 37” while moving my data directory from local disk to NFS. When I restarted MySQL, the restart failed and the error appeared in the error log. All permissions were set properly, so I suspected that the OS was unable to lock the file and that the problem was related to NFS. I therefore restarted the following services:

1) /etc/init.d/portmap restart
2) /etc/init.d/nfslock restart

Then I restarted MySQL and it worked.
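On Linux, errno 37 is ENOLCK ("No locks available"), which fits a lock manager that is not reachable. Before restarting anything, it can help to check whether the NFS lock manager is registered with the portmapper. The following is a minimal sketch: the function name has_nlockmgr and the hostname "filer" are made up for illustration, and it assumes the lock manager shows up under the service name nlockmgr in the rpcinfo -p listing.

```shell
# Reads `rpcinfo -p` output on stdin; succeeds if the NFS lock manager
# (service name "nlockmgr") is registered with the portmapper.
has_nlockmgr() {
    awk '$NF == "nlockmgr" { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on the NFS client, and against the server ("filer" is a placeholder):
#   rpcinfo -p | has_nlockmgr || echo "lock manager not registered locally"
#   rpcinfo -p filer | has_nlockmgr || echo "server has no lock manager"
```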

MYSQL WITH LINUX OPERATING SYSTEMS

November 5th, 2009

LINUX SETTINGS
ENLARGING A CLIENT’S TRANSPORT SOCKET BUFFERS
Enlarging the transport socket buffers that Linux uses for NFS traffic helps reduce resource contention on the client, reduces performance variance, and improves maximum data and operation throughput. In future releases of the client, the following procedure will not be necessary, because the client will automatically choose an optimal socket buffer size.

1. Log in as root on the client and enter:
cd /proc/sys/net/core
echo 262143 > rmem_max
echo 262143 > wmem_max
echo 262143 > rmem_default
echo 262143 > wmem_default

2. Remount the NFS file systems on the client.

This is especially useful for NFS over UDP and when using Gigabit Ethernet. Consider adding this to a system startup script that runs before the system mounts NFS file systems. The recommended size (262,143 bytes) is the largest safe socket buffer size that NetApp has tested. On clients with 16MB of memory or less, leave the default socket buffer size setting to conserve memory.
Red Hat versions after 7.2 contain a file named /etc/sysctl.conf where changes such as this can be added so that they are applied after every system reboot.
3. Add the following lines to the /etc/sysctl.conf file on these Red Hat systems:
net.core.rmem_max = 262143
net.core.wmem_max = 262143
net.core.rmem_default = 262143
net.core.wmem_default = 262143
kernel.shmmax=2147483648
kernel.shmall=2147483648
kernel.msgmni=2048
kernel.msgmax=65536
kernel.sem=250 32000 32 1024
fs.file-max=65536
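
Each key in /etc/sysctl.conf corresponds to a file under /proc/sys, with the dots replaced by slashes, which is why the echo commands and the sysctl.conf lines set the same values. A small sketch of that mapping follows; the helper name sysctl_key_to_proc is made up, and it assumes keys that contain no dots inside a single component (some per-interface keys do).

```shell
# Map a sysctl key (as written in /etc/sysctl.conf) to its /proc/sys path.
sysctl_key_to_proc() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Usage (as root):
#   sysctl -p                                        # apply /etc/sysctl.conf now
#   cat "$(sysctl_key_to_proc net.core.rmem_max)"    # confirm the live value
```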

OTHER TCP ENHANCEMENTS
The following settings can help reduce the amount of work that clients and storage systems do when running NFS over TCP:
echo 0 > /proc/sys/net/ipv4/tcp_sack
echo 0 > /proc/sys/net/ipv4/tcp_timestamps
These operations disable optional features of TCP to save a little processing time and network bandwidth.
When building kernels, ensure that CONFIG_SYNCOOKIES is disabled. SYN cookies slow down TCP connections by adding extra processing on both ends of the socket. Some Linux distributors provide kernels with SYN cookies enabled. Linux 2.2 and 2.4 kernels support large TCP windows (RFC 1323) by default. No modification is required to enable large TCP windows.

LINUX NETWORKING—FULL DUPLEX AND AUTO-NEGOTIATION

Most network interface cards use auto-negotiation to obtain the fastest settings allowed by the card and the switch port to which it attaches. Chipset incompatibilities may sometimes result in constant renegotiation, negotiating half duplex, or a slow speed. When diagnosing a network problem, make sure that the Ethernet settings are as expected before looking for other problems. Avoid hard coding the settings to solve auto-negotiation problems, because it only masks a deeper problem. Switch and card vendors should be able to help resolve these problems.

LINUX NETWORKING—GIGABIT ETHERNET NETWORK ADAPTERS
If Linux servers are using high-performance networking (gigabit or faster), be sure to provide enough CPU and memory bandwidth to handle the interrupt and data rate. The NFS client software and the gigabit driver reduce the resources available to the application, so make sure that resources are adequate. Most gigabit cards that support 64-bit PCI or higher should provide good performance.
Any database that uses NetApp storage should use Gigabit Ethernet on both the storage system and the database server to achieve optimal performance.
NetApp has found that the following Gigabit Ethernet cards work well with Linux:
• SysKonnect: The SysKonnect SK-98XX series cards work very well with Linux and support single- and dual-fiber and copper interfaces for better performance and availability. A mature driver for this card exists in the 2.4 kernel source distribution.
• Broadcom: Many cards and switches use this chipset, including the ubiquitous 3Com solutions. This provides a high probability of compatibility between network switches and Linux clients. The driver software for this chipset appeared in the 2.4.19 Linux kernel and is included in Red Hat distributions with earlier 2.4 kernels. Make sure that the chipset firmware is up to date.
• AceNIC Tigon II: Several cards, such as the NetGear GA620T, use this chipset, but none are still being manufactured. A mature and actively maintained driver for this chipset exists in the kernel source distribution.
• Intel® EEPro/1000: This appears to be the fastest gigabit card available for systems based on Intel, but the card’s driver software is included only in recent kernel source distributions (2.4.20 and later) and may be somewhat unstable. The card’s driver software for earlier kernels can be found on the Intel Web site. There are reports that the jumbo frame MTU for Intel cards is only 8998 bytes, not the standard 9000 bytes.

LINUX NETWORKING—JUMBO FRAMES WITH GBE
All of the cards just described support the jumbo frames option of Gigabit Ethernet. Using jumbo frames can improve performance in environments where Linux NFS clients and NetApp systems are together on an unrouted network. Consult the command reference for each switch to make sure that it is capable of handling jumbo frames. There are some known problems in Linux drivers and the networking layer when using the maximum frame size (9000 bytes). If unexpected performance slowdowns occur when using jumbo frames, try reducing the MTU to 8960 bytes.
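
One way to check whether a given MTU really passes end to end is to ping with fragmentation prohibited and a payload sized to fill exactly one frame. The sketch below only does the arithmetic; the helper name ping_payload and the hostname "filer" are placeholders, and it assumes IPv4 without IP options (20-byte IP header plus 8-byte ICMP header).

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
ping_payload() {
    echo $(( $1 - 28 ))
}

# Usage ("filer" is a placeholder hostname; -M do prohibits fragmentation):
#   ping -M do -c 3 -s "$(ping_payload 9000)" filer
#   ping -M do -c 3 -s "$(ping_payload 8960)" filer
```

If the first ping fails while the second succeeds, some device on the path is not handling full 9000-byte frames.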

mount: wrong fs type, bad option

November 3rd, 2009

Basically, the packages that the NFS client requires were missing. After installing them, I was able to mount the NFS partition:

#yum install nfs-*
#yum install nfs-common

Installing:
nfs-utils x86_64 1:1.0.9-42.el5 base 389 k
nfs-utils-lib i386 1.0.8-7.6.el5 base 55 k
nfs-utils-lib-devel i386 1.0.8-7.6.el5 base 57 k
nfs-utils-lib-devel x86_64 1.0.8-7.6.el5 base 63 k

Optimizing NFS Performance

October 31st, 2009

Careful analysis of your environment, both from the client and from the server point of view, is the first step necessary for optimal NFS performance. The first sections will address issues that are generally important to the client. Later (Section 5.3 and beyond), server side issues will be discussed. In both cases, these issues will not be limited exclusively to one side or the other, but it is useful to separate the two in order to get a clearer picture of cause and effect.

Aside from the general network configuration – appropriate network capacity, faster NICs, full duplex settings in order to reduce collisions, agreement in network speed among the switches and hubs, etc. – one of the most important client optimization settings is the NFS data transfer buffer size, specified by the mount command options rsize and wsize.

5.1. Setting Block Size to Optimize Transfer Speeds

The mount command options rsize and wsize specify the size of the chunks of data that the client and server pass back and forth to each other. If no rsize and wsize options are specified, the default varies by which version of NFS we are using. The most common default is 4K (4096 bytes), although for TCP-based mounts in 2.2 kernels, and for all mounts beginning with 2.4 kernels, the server specifies the default block size.

The theoretical limit for the NFS V2 protocol is 8K. For the V3 protocol, the limit is specific to the server. On the Linux server, the maximum block size is defined by the value of the kernel constant NFSSVC_MAXBLKSIZE, found in the Linux kernel source file ./include/linux/nfsd/const.h. The current maximum block size for the kernel, as of 2.4.17, is 8K (8192 bytes), but the patch set implementing NFS over TCP/IP transport in the 2.4 series, as of this writing, uses a value of 32K (defined in the patch as 32*1024) for the maximum block size.

All 2.4 clients currently support up to 32K block transfer sizes, allowing the standard 32K block transfers across NFS mounts from other servers, such as Solaris, without client modification.

The defaults may be too big or too small, depending on the specific combination of hardware and kernels. On the one hand, some combinations of Linux kernels and network cards (largely on older machines) cannot handle blocks that large. On the other hand, if they can handle larger blocks, a bigger size might be faster.

You will want to experiment and find an rsize and wsize that works and is as fast as possible. You can test the speed of your options with some simple commands, if your network environment is not heavily used. Note that your results may vary widely unless you resort to using more complex benchmarks, such as Bonnie, Bonnie++, or IOzone.

The first of these commands transfers 16384 blocks of 16k each from the special file /dev/zero (which if you read it just spits out zeros really fast) to the mounted partition. We will time it to see how long it takes. So, from the client machine, type:

# time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384

This creates a 256MB file of zeroed bytes. In general, you should create a file that’s at least twice as large as the system RAM on the server, but make sure you have enough disk space! Then read back the file into the great black hole on the client machine (/dev/null) by typing the following:

# time dd if=/mnt/home/testfile of=/dev/null bs=16k

Repeat this a few times and average how long it takes. Be sure to unmount and remount the filesystem each time (both on the client and, if you are zealous, locally on the server as well), which should clear out any caches.
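
To size the test file at twice the server's RAM, the dd count can be derived from the RAM size and the dd block size. This is a rough sketch with a made-up helper name, dd_count_for_ram; the RAM figure is something you supply yourself.

```shell
# Number of dd blocks needed to write a file twice the size of the server RAM.
#   $1 = server RAM in MB, $2 = dd block size in KB
dd_count_for_ram() {
    echo $(( 2 * $1 * 1024 / $2 ))
}

# Usage, for a server with 512MB of RAM and 16k blocks:
#   time dd if=/dev/zero of=/mnt/home/testfile bs=16k count="$(dd_count_for_ram 512 16)"
```

With 128MB of RAM and 16k blocks this gives 16384, the count used in the command above.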

Then unmount, and mount again with a larger and smaller block size. They should be multiples of 1024, and not larger than the maximum block size allowed by your system. Note that NFS Version 2 is limited to a maximum of 8K, regardless of the maximum block size defined by NFSSVC_MAXBLKSIZE; Version 3 will support up to 64K, if permitted. The block size should be a power of two since most of the parameters that would constrain it (such as file system block sizes and network packet size) are also powers of two. However, some users have reported better successes with block sizes that are not powers of two but are still multiples of the file system block size and the network packet size.

Directly after mounting with a larger size, cd into the mounted file system and do things like ls, explore the filesystem a bit to make sure everything is as it should. If the rsize/wsize is too large the symptoms are very odd and not 100% obvious. A typical symptom is incomplete file lists when doing ls, and no error messages, or reading files failing mysteriously with no error messages. After establishing that the given rsize/wsize works you can do the speed tests again. Different server platforms are likely to have different optimal sizes.

Remember to edit /etc/fstab to reflect the rsize/wsize you found to be the most desirable.
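
The remount-and-measure cycle is easy to script. Below is a sketch only: candidate_sizes is a made-up helper that lists power-of-two block sizes, and the loop in the comments uses placeholder names (server:/home, /mnt/home) that need to be adapted to your setup.

```shell
# Print power-of-two block sizes from $1 up to and including $2 (in bytes).
candidate_sizes() {
    size=$1
    while [ "$size" -le "$2" ]; do
        echo "$size"
        size=$(( size * 2 ))
    done
}

# Usage sketch (as root; export and mount point are placeholders):
#   for s in $(candidate_sizes 1024 32768); do
#       umount /mnt/home
#       mount -o rsize=$s,wsize=$s server:/home /mnt/home
#       echo "rsize=wsize=$s"
#       time dd if=/dev/zero of=/mnt/home/testfile bs=16k count=16384
#   done
```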

If your results seem inconsistent, or doubtful, you may need to analyze your network more extensively while varying the rsize and wsize values. In that case, here are several pointers to benchmarks that may prove useful:

The easiest benchmark with the widest coverage, including an extensive spread of file sizes, and of IO types – reads, writes, rereads, and rewrites, random access, etc. – seems to be IOzone. A recommended invocation of IOzone (for which you must have root privileges) includes unmounting and remounting the directory under test, in order to clear out the caches between tests, and including the file close time in the measurements. Assuming you’ve already exported /tmp to everyone from the server foo, and that you’ve installed IOzone in the local directory, this should work:

# echo "foo:/tmp /mnt/foo nfs rw,hard,intr,rsize=8192,wsize=8192 0 0" \
    >> /etc/fstab
# mkdir /mnt/foo
# mount /mnt/foo
# ./iozone -a -R -c -U /mnt/foo -f /mnt/foo/testfile > logfile

The benchmark should take 2-3 hours at most, but of course you will need to run it for each value of rsize and wsize that is of interest. The web site gives full documentation of the parameters, but the specific options used above are:

  • -a: Full automatic mode, which tests file sizes of 64K to 512M, using record sizes of 4K to 16M
  • -R: Generate report in Excel spreadsheet form (The “surface plot” option for graphs is best)
  • -c: Include the file close time in the tests, which will pick up the NFS version 3 commit time
  • -U: Use the given mount point to unmount and remount between tests; it clears out caches
  • -f: When using unmount, you have to locate the test file in the mounted file system

5.2. Packet Size and Network Drivers

While many Linux network card drivers are excellent, some are quite shoddy, including a few drivers for some fairly standard cards. It is worth experimenting with your network card directly to find out how it can best handle traffic.

Try pinging back and forth between the two machines with large packets using the -f and -s options with ping (see ping(8) for more details) and see if a lot of packets get dropped, or if they take a long time for a reply. If so, you may have a problem with the performance of your network card.

For a more extensive analysis of NFS behavior in particular, use the nfsstat command to look at nfs transactions, client and server statistics, network statistics, and so forth. The -o net option will show you the number of dropped packets in relation to the total number of transactions. In UDP transactions, the most important statistic is the number of retransmissions, due to dropped packets, socket buffer overflows, general server congestion, timeouts, etc. This will have a tremendously important effect on NFS performance, and should be carefully monitored. Note that nfsstat does not yet implement the -z option, which would zero out all counters, so you must look at the current nfsstat counter values prior to running the benchmarks.
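
Because the counters cannot be zeroed, one practical approach is to save the nfsstat output before and after a benchmark and compare the retransmission ratio. The following sketch computes that ratio from a single snapshot. The helper name retrans_percent is made up, and it assumes the client RPC statistics are printed as a header line beginning with "calls retrans" followed by a line of values, as nfsstat -rc does on the systems I have seen.

```shell
# Reads `nfsstat -rc` output on stdin and prints the percentage of client
# RPC calls that were retransmitted.
retrans_percent() {
    awk '
        $1 == "calls" && $2 == "retrans" { getline; calls = $1; retrans = $2 }
        END {
            if (calls > 0) printf "%.2f\n", 100 * retrans / calls
            else print "0.00"
        }'
}

# Usage:
#   nfsstat -rc > before.txt     # snapshot before the benchmark
#   ... run the benchmark ...
#   nfsstat -rc | retrans_percent
```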

To correct network problems, you may wish to reconfigure the packet size that your network card uses. Very often there is a constraint somewhere else in the network (such as a router) that causes a smaller maximum packet size between two machines than what the network cards on the machines are actually capable of. TCP should autodiscover the appropriate packet size for a network, but UDP will simply stay at a default value. So determining the appropriate packet size is especially important if you are using NFS over UDP.

You can test for the network packet size using the tracepath command: From the client machine, execute:

$ tracepath server
 1:  server (x.x.x.x)                     0.274ms pmtu 1500
 1:  x.x.x.x (x.x.x.x)                    0.494ms
 2:  x.x.x.x (x.x.x.x)                    0.424ms
 3:  x.x.x.x (x.x.x.x)                    1.042ms
 4:  server (x.x.x.x)                     0.421ms reached
     Resume: pmtu 1500 hops 4 back 4
$

and the path MTU should be reported at the bottom. You can then set the MTU on your network card equal to the path MTU, by using the MTU option to ifconfig, and see if fewer packets get dropped. See the ifconfig man pages for details on how to reset the MTU.
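
The path MTU can also be pulled out of the tracepath output automatically. This is a small sketch that assumes the summary line looks like the "Resume: pmtu ..." line shown above; path_mtu is a made-up helper name and eth0 is a placeholder interface.

```shell
# Reads tracepath output on stdin and prints the path MTU from the
# "Resume:" summary line.
path_mtu() {
    awk '/Resume:/ {
        for (i = 1; i < NF; i++)
            if ($i == "pmtu") { print $(i + 1); exit }
    }'
}

# Usage (as root; eth0 is a placeholder interface name):
#   ifconfig eth0 mtu "$(tracepath server | path_mtu)"
```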

In addition, netstat -s will give the statistics collected for traffic across all supported protocols. You may also look at /proc/net/snmp for information about current network behavior; see the next section for more details.

5.3. Overflow of Fragmented Packets

Using an rsize or wsize larger than your network’s MTU (often set to 1500, in many networks) will cause IP packet fragmentation when using NFS over UDP. IP packet fragmentation and reassembly require a significant amount of CPU resource at both ends of a network connection. In addition, packet fragmentation also exposes your network traffic to greater unreliability, since a complete RPC request must be retransmitted if a UDP packet fragment is dropped for any reason. Any increase of RPC retransmissions, along with the possibility of increased timeouts, are the single worst impediment to performance for NFS over UDP.
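
To get a feel for the numbers, the fragment count per request can be estimated from the block size and the MTU. The sketch below is an approximation under stated assumptions: IPv4 with a 20-byte header, an 8-byte UDP header, and the RPC and NFS headers ignored (they add somewhat more). The helper name udp_fragments is made up.

```shell
# Approximate number of IPv4 fragments needed for one UDP datagram.
#   $1 = UDP payload in bytes (for example the rsize or wsize), $2 = MTU in bytes
udp_fragments() {
    per_frag=$(( ($2 - 20) / 8 * 8 ))   # IP payload per fragment, a multiple of 8
    total=$(( $1 + 8 ))                 # payload plus the 8-byte UDP header
    echo $(( (total + per_frag - 1) / per_frag ))
}

# Usage:
#   udp_fragments 8192 1500     # an 8K block on standard Ethernet
#   udp_fragments 8192 9000     # the same block with jumbo frames
```

An 8K block on a 1500-byte MTU comes out at 6 fragments, and losing any one of them forces the whole request to be resent.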

Packets may be dropped for many reasons. If your network topology is complex, fragment routes may differ, and may not all arrive at the Server for reassembly. NFS Server capacity may also be an issue, since the kernel has a limit on how many fragments it can buffer before it starts throwing away packets. With kernels that support the /proc filesystem, you can monitor the files /proc/sys/net/ipv4/ipfrag_high_thresh and /proc/sys/net/ipv4/ipfrag_low_thresh. Once the number of unprocessed, fragmented packets reaches the number specified by ipfrag_high_thresh (in bytes), the kernel will simply start throwing away fragmented packets until the number of incomplete packets reaches the number specified by ipfrag_low_thresh.

Another counter to monitor is IP: ReasmFails in the file /proc/net/snmp; this is the number of fragment reassembly failures. If it goes up too quickly during heavy file activity, you may have a problem.
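
In /proc/net/snmp each protocol has a header line of column names followed by a line of values, so the counter has to be looked up by column. A sketch, with reasm_fails as a made-up helper name:

```shell
# Reads /proc/net/snmp on stdin and prints the Ip ReasmFails counter.
# The first "Ip:" line names the columns; the second holds the values.
reasm_fails() {
    awk '
        $1 == "Ip:" && !col { for (i = 2; i <= NF; i++) if ($i == "ReasmFails") col = i; next }
        $1 == "Ip:" && col  { print $col; exit }'
}

# Usage: sample the counter before and after a period of heavy file activity.
#   reasm_fails < /proc/net/snmp
```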

5.4. NFS Over TCP

A new feature, available for both 2.4 and 2.5 kernels but not yet integrated into the mainstream kernel at the time of this writing, is NFS over TCP. Using TCP has a distinct advantage and a distinct disadvantage over UDP. The advantage is that it works far better than UDP on lossy networks. When using TCP, a single dropped packet can be retransmitted, without the retransmission of the entire RPC request, resulting in better performance on lossy networks. In addition, TCP will handle network speed differences better than UDP, due to the underlying flow control at the network level.

The disadvantage of using TCP is that it is not a stateless protocol like UDP. If your server crashes in the middle of a packet transmission, the client will hang and any shares will need to be unmounted and remounted.

The overhead incurred by the TCP protocol will result in somewhat slower performance than UDP under ideal network conditions, but the cost is not severe, and is often not noticeable without careful measurement. If you are using gigabit ethernet from end to end, you might also investigate the usage of jumbo frames, since the high speed network may allow the larger frame sizes without encountering increased collision rates, particularly if you have set the network to full duplex.

5.5. Timeout and Retransmission Values

Two mount command options, timeo and retrans, control the behavior of UDP requests when encountering client timeouts due to dropped packets, network congestion, and so forth. The -o timeo option allows designation of the length of time, in tenths of seconds, that the client will wait until it decides it will not get a reply from the server, and must try to send the request again. The default value is 7 tenths of a second. The -o retrans option allows designation of the number of timeouts allowed before the client gives up, and displays the Server not responding message. The default value is 3 attempts. Once the client displays this message, it will continue to try to send the request, but only once before displaying the error message if another timeout occurs. When the client reestablishes contact, it will fall back to using the correct retrans value, and will display the Server OK message.
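
As a rough guide to how long the client waits before printing the message, the sketch below adds up the timeouts. It rests on an assumption not stated above: that, as the nfs(5) man page describes for UDP, the timeout doubles after each retransmission up to a cap of 60 seconds. The helper name major_timeout_tenths is made up.

```shell
# Tenths of a second until the "not responding" message, assuming the timeout
# doubles after each retransmission and is capped at 60 seconds (600 tenths).
#   $1 = timeo (tenths of a second), $2 = retrans
major_timeout_tenths() {
    t=$1
    total=0
    n=0
    while [ "$n" -lt "$2" ]; do
        total=$(( total + t ))
        t=$(( t * 2 ))
        if [ "$t" -gt 600 ]; then t=600; fi
        n=$(( n + 1 ))
    done
    echo "$total"
}

# Usage:
#   major_timeout_tenths 7 3     # the defaults described above
#   major_timeout_tenths 14 5    # a more patient setting
```

Under that assumption the defaults add up to 49 tenths, or just under five seconds.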

If you are already encountering excessive retransmissions (see the output of the nfsstat command), or want to increase the block transfer size without encountering timeouts and retransmissions, you may want to adjust these values. The specific adjustment will depend upon your environment, and in most cases, the current defaults are appropriate.

5.6. Number of Instances of the NFSD Server Daemon

Most startup scripts, Linux and otherwise, start 8 instances of nfsd. In the early days of NFS, Sun decided on this number as a rule of thumb, and everyone else copied. There are no good measures of how many instances are optimal, but a more heavily-trafficked server may require more. You should use at the very least one daemon per processor, but four to eight per processor may be a better rule of thumb. If you are using a 2.4 or higher kernel and you want to see how heavily each nfsd thread is being used, you can look at the file /proc/net/rpc/nfsd. The last ten numbers on the line that begins with "th" in that file indicate the number of seconds that the thread usage was at that percentage of the maximum allowable. If you have a large number in the top three deciles, you may wish to increase the number of nfsd instances. This is done upon starting nfsd using the number of instances as the command line option, and is specified in the NFS startup script (/etc/rc.d/init.d/nfs on Red Hat) as RPCNFSDCOUNT. See the nfsd(8) man page for more information.
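
The top three deciles can be summed with a one-line awk filter. This is a sketch that assumes the "th" line ends with the ten histogram values described above (kernels that no longer fill in this histogram will simply report zeros); busy_seconds is a made-up helper name.

```shell
# Reads /proc/net/rpc/nfsd on stdin and prints the total number of seconds
# the nfsd threads spent in the top three usage deciles (last three values
# on the "th" line).
busy_seconds() {
    awk '$1 == "th" { printf "%.3f\n", $(NF - 2) + $(NF - 1) + $NF }'
}

# Usage:
#   busy_seconds < /proc/net/rpc/nfsd
```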

5.7. Memory Limits on the Input Queue

On 2.2 and 2.4 kernels, the socket input queue, where requests sit while they are currently being processed, has a small default size limit (rmem_default) of 64k. This queue is important for clients with heavy read loads, and servers with heavy write loads. As an example, if you are running 8 instances of nfsd on the server, each will only have 8k to store write requests while it processes them. In addition, the socket output queue – important for clients with heavy write loads and servers with heavy read loads – also has a small default size (wmem_default).

Several published runs of the NFS benchmark SPECsfs97 specify usage of a much higher value for both the read and write value sets, [rw]mem_default and [rw]mem_max. You might consider increasing these values to at least 256k. The read and write limits are set in the proc file system using (for example) the files /proc/sys/net/core/rmem_default and /proc/sys/net/core/rmem_max. The rmem_default value can be increased in three steps; the following method is a bit of a hack but should work and should not cause any problems:

  1. Increase the size listed in the files:
    # echo 262144 > /proc/sys/net/core/rmem_default
    # echo 262144 > /proc/sys/net/core/rmem_max
  2. Restart NFS via the method described in your distribution’s documentation.
  3. You might return the size limits to their normal size in case other kernel systems depend on it. This last step may be necessary because machines have been reported to crash or have issues when these variables are left changed for long periods of time.
    # echo 65536 > /proc/sys/net/core/rmem_default
    # echo 65536 > /proc/sys/net/core/rmem_max
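
Rather than hard-coding 65536 when restoring, the previous values can be recorded at the moment they are changed. The following is a sketch only: swap_tunable and CORE_DIR are made-up names, and CORE_DIR exists purely so the function can be tried against a scratch directory first.

```shell
# Directory holding the tunables; overridable so the function can be tried
# against a scratch directory instead of the live kernel settings.
CORE_DIR=${CORE_DIR:-/proc/sys/net/core}

# Set a tunable and print its previous value so it can be restored later.
#   $1 = file name (for example rmem_default), $2 = new value
swap_tunable() {
    old=$(cat "$CORE_DIR/$1") || return 1
    echo "$2" > "$CORE_DIR/$1" || return 1
    echo "$old"
}

# Usage (as root):
#   old_default=$(swap_tunable rmem_default 262144)
#   old_max=$(swap_tunable rmem_max 262144)
#   ... restart NFS as described in your distribution's documentation ...
#   swap_tunable rmem_default "$old_default" > /dev/null
#   swap_tunable rmem_max "$old_max" > /dev/null
```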

5.8. Turning Off Autonegotiation of NICs and Hubs

If network cards auto-negotiate badly with hubs and switches, and ports run at different speeds, or with different duplex configurations, performance will be severely impacted due to excessive collisions, dropped packets, etc. If you see excessive numbers of dropped packets in the nfsstat output, or poor network performance in general, try playing around with the network speed and duplex settings. If possible, concentrate on establishing a 100BaseT full duplex subnet; the virtual elimination of collisions in full duplex will remove the most severe performance inhibitor for NFS over UDP. Be careful when turning off autonegotiation on a card: The hub or switch that the card is attached to will then resort to other mechanisms (such as parallel detection) to determine the duplex settings, and some cards default to half duplex because it is more likely to be supported by an old hub. The best solution, if the driver supports it, is to force the card to negotiate 100BaseT full duplex.
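
Before changing anything, it is worth recording what the link actually negotiated. The sketch below condenses the output of ethtool, which is an assumption on my part since the text above does not name a tool; link_summary is a made-up helper name and eth0 is a placeholder interface.

```shell
# Reads `ethtool <interface>` output on stdin and prints the negotiated
# speed, duplex, and auto-negotiation state on a single line.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed = $2 }
        /^[ \t]*Duplex:/           { duplex = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        END { printf "speed=%s duplex=%s autoneg=%s\n", speed, duplex, autoneg }'
}

# Usage (eth0 is a placeholder interface name):
#   ethtool eth0 | link_summary
```

A result such as duplex=Half on a switched network is a strong hint that negotiation went wrong.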

5.9. Synchronous vs. Asynchronous Behavior in NFS

The default export behavior for both NFS Version 2 and Version 3 protocols, used by exportfs in nfs-utils versions prior to nfs-utils-1.0.1 is “asynchronous”. This default permits the server to reply to client requests as soon as it has processed the request and handed it off to the local file system, without waiting for the data to be written to stable storage. This is indicated by the async option denoted in the server’s export list. It yields better performance at the cost of possible data corruption if the server reboots while still holding unwritten data and/or metadata in its caches. This possible data corruption is not detectable at the time of occurrence, since the async option instructs the server to lie to the client, telling the client that all data has indeed been written to the stable storage, regardless of the protocol used.

In order to conform with “synchronous” behavior, used as the default for most proprietary systems supporting NFS (Solaris, HP-UX, RS/6000, etc.), and now used as the default in the latest version of exportfs, the Linux Server’s file system must be exported with the sync option. Note that specifying synchronous exports will result in no option being seen in the server’s export list:

  • Export a couple file systems to everyone, using slightly different options:
    # /usr/sbin/exportfs -o rw,sync *:/usr/local
    # /usr/sbin/exportfs -o rw *:/tmp
  • Now we can see what the exported file system parameters look like:
    # /usr/sbin/exportfs -v
    /usr/local *(rw)
    /tmp *(rw,async)

If your kernel is compiled with the /proc filesystem, then the file /proc/fs/nfs/exports will also show the full list of export options.
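
To audit a server for exports that still use async, the verbose export list can be filtered. This sketch assumes each export appears on one line with its options in parentheses, as in the listing above; async_exports is a made-up helper name.

```shell
# Reads `exportfs -v` output on stdin and prints the path of every export
# whose option list contains "async".
async_exports() {
    awk '/[(,]async[,)]/ { print $1 }'
}

# Usage (on the server):
#   /usr/sbin/exportfs -v | async_exports
```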

When synchronous behavior is specified, the server will not complete (that is, reply to the client) an NFS version 2 protocol request until the local file system has written all data/metadata to the disk. The server will complete a synchronous NFS version 3 request without this delay, and will return the status of the data in order to inform the client as to what data should be maintained in its caches, and what data is safe to discard. There are 3 possible status values, defined in an enumerated type, nfs3_stable_how, in include/linux/nfs.h. The values, along with the subsequent actions taken due to these results, are as follows:

  • NFS_UNSTABLE: Data/Metadata was not committed to stable storage on the server, and must be cached on the client until a subsequent client commit request assures that the server does send data to stable storage.
  • NFS_DATA_SYNC: Metadata was not sent to stable storage, and must be cached on the client. A subsequent commit is necessary, as is required above.
  • NFS_FILE_SYNC: No data/metadata need be cached, and a subsequent commit need not be sent for the range covered by this request.

In addition to the above definition of synchronous behavior, the client may explicitly insist on total synchronous behavior, regardless of the protocol, by opening all files with the O_SYNC option. In this case, all replies to client requests will wait until the data has hit the server’s disk, regardless of the protocol used (meaning that, in NFS version 3, all requests will be NFS_FILE_SYNC requests, and will require that the Server returns this status). In that case, the performance of NFS Version 2 and NFS Version 3 will be virtually identical.

If, however, the old default async behavior is used, the O_SYNC option has no effect at all in either version of NFS, since the server will reply to the client without waiting for the write to complete. In that case the performance differences between versions will also disappear.

Finally, note that, for NFS version 3 protocol requests, a subsequent commit request from the NFS client at file close time, or at fsync() time, will force the server to write any previously unwritten data/metadata to the disk, and the server will not reply to the client until this has been completed, as long as sync behavior is followed. If async is used, the commit is essentially a no-op, since the server once again lies to the client, telling the client that the data has been sent to stable storage. This again exposes the client and server to data corruption, since cached data may be discarded on the client due to its belief that the server now has the data maintained in stable storage.

5.10. Non-NFS-Related Means of Enhancing Server Performance

In general, server performance and server disk access speed will have an important effect on NFS performance. Offering general guidelines for setting up a well-functioning file server is outside the scope of this document, but a few hints may be worth mentioning:

  • If you have access to RAID arrays, use RAID 1/0 for both write speed and redundancy; RAID 5 gives you good read speeds but lousy write speeds.
  • A journalling filesystem will drastically reduce your reboot time in the event of a system crash. Currently, ext3 will work correctly with NFS version 3. In addition, Reiserfs version 3.6 will work with NFS version 3 on 2.4.7 or later kernels (patches are available for previous kernels). Earlier versions of Reiserfs did not include room for generation numbers in the inode, exposing the possibility of undetected data corruption during a server reboot.
  • Additionally, journalled file systems can be configured to maximize performance by taking advantage of the fact that journal updates are all that is necessary for data protection. One example is using ext3 with data=journal so that all updates go first to the journal, and later to the main file system. Once the journal has been updated, the NFS server can safely issue the reply to the clients, and the main file system update can occur at the server’s leisure. The journal in a journalling file system may also reside on a separate device such as a flash memory card so that journal updates normally require no seeking. With only rotational delay imposing a cost, this gives reasonably good synchronous IO performance. Note that ext3 currently supports journal relocation, and ReiserFS will (officially) support it soon. The Reiserfs tool package found at ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.x.0k.tar.gz contains the reiserfstune tool, which will allow journal relocation. It does, however, require a kernel patch which has not yet been officially released as of January, 2002.
  • Using an automounter (such as autofs or amd) may prevent hangs if you cross-mount files on your machines (whether on purpose or by oversight) and one of those machines goes down. See the Automount Mini-HOWTO for details.
  • Some manufacturers (Network Appliance, Hewlett Packard, and others) provide NFS accelerators in the form of Non-Volatile RAM. NVRAM will boost access speed to stable storage up to the equivalent of async access.

What values can be tuned to improve NFS performance?

October 31st, 2009

In most situations the default NFS configuration should allow file access at acceptable speeds. However there may be situations that require tuning of the NFS client for specific circumstances.

An example command to mount the remote export from a NFS server sharing large media files could be:

mount -t nfs -o noacl,nocto,rsize=32768,wsize=32768 mediaserver:/exports /mnt/mediaserver/
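
The argument to -o has to be a single comma-separated word with no spaces in it, otherwise mount treats the rest as separate arguments. When the option list is assembled in a script, a tiny helper avoids that mistake. This is a sketch, and join_opts is a made-up name.

```shell
# Join the arguments into one comma-separated option string with no spaces.
join_opts() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Usage:
#   mount -t nfs -o "$(join_opts noacl nocto rsize=32768 wsize=32768)" \
#       mediaserver:/exports /mnt/mediaserver/
```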

Below are possible mount options and their advantages and disadvantages:

noacl

With this mount option the client will not attempt to use Access Control Lists (ACLs). This saves the ACL read and write requests that would otherwise be made when fetching ACL permissions for a file.

Red Hat Enterprise Linux 3 fully supports NFS ACLs from the server. Red Hat Enterprise Linux 4 update 2 and below does not implement ACLs from the NFS server.

nocto

This option should reduce how often the NFS client checks or revalidates file attributes on the NFS server. Use it only if the files are static and are not expected to change.

rsize

The default read buffer size is 4096 bytes. Increasing this value may improve performance, depending on the size of the data being read. Recommended values for this parameter are powers of 2 (4096, 8192, …). Large values may not work with NFS version 2.

wsize

The default write buffer size is 4096 bytes. Increasing this value may improve performance, depending on the size of the data being written.
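Since rsize and wsize are expected to be powers of 2, a small check before building the option string can catch typos. This is only a sketch; the function names is_power_of_two and nfs_size_opts are made up for illustration and are not part of any NFS tool.

```shell
#!/bin/sh
# Succeed if the argument is a positive power of two.
is_power_of_two() {
    n=$1
    # Reject non-numeric or non-positive input.
    [ "$n" -gt 0 ] 2>/dev/null || return 1
    # Halve the value while it is even; a power of two ends at 1.
    while [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
    done
    [ "$n" -eq 1 ]
}

# Print "rsize=R,wsize=W", refusing sizes that are not powers of two.
nfs_size_opts() {
    for size in "$1" "$2"; do
        is_power_of_two "$size" || {
            echo "not a power of two: $size" >&2
            return 1
        }
    done
    echo "rsize=$1,wsize=$2"
}

nfs_size_opts 32768 32768
```

The resulting string can be appended to the other options passed to mount -o.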

In addition to the tuning parameters listed above, consider the following as well:

noatime

Setting this option stops the NFS server from updating the inode's access time. Most applications do not need this value, so the update can usually be disabled safely.

nodiratime

Setting this option stops the NFS server from updating the directory access time. It is the directory equivalent of ‘noatime’.

noac

This disables client-side caching of inode attributes. It effectively overrides the default attribute cache timeouts (acregmin=3 and acregmax=60 for files, acdirmin=30 and acdirmax=60 for directories, all in seconds). This can actually cause a performance slowdown, but in certain circumstances it helps when multiple NFS clients are actively writing to the same share.
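To see which of these options a client is really using, the mounts table can be inspected. The sketch below assumes the usual /proc/mounts layout (device, mount point, type, options); the function names are hypothetical, and a different file can be passed in place of /proc/mounts for testing.

```shell
#!/bin/sh
# Print "mountpoint: options" for every NFS mount in a mounts table.
nfs_mount_opts() {
    awk '$3 == "nfs" || $3 == "nfs4" { print $2 ": " $4 }' "${1:-/proc/mounts}"
}

# Succeed if the given mount point is mounted with the noac option.
# $1 = mount point, $2 = optional mounts table (default /proc/mounts)
uses_noac() {
    nfs_mount_opts "${2:-/proc/mounts}" |
        awk -v mp="$1:" '$1 == mp { print $2 }' |
        tr ',' '\n' | grep -qx 'noac'
}
```

For example, `uses_noac /data && echo "attribute caching is off"` reports whether a share mounted on /data has attribute caching disabled.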

Please note that there is no one-size-fits-all approach to performance tuning.  Different servers handling different workloads need to be tuned differently.  Please test and monitor both your server and client systems as you test different tuning options.

NetApp NFS mount for Red Hat Linux 5.2

October 31st, 2009

Just another post from my mails, where I have collected some best practices for mounting NFS shares in RHEL.

Automounter

An automounter can cause a lot of network chatter, so it is best to disable the automounter on your client and set up static mounts before taking a network trace. Automounters depend on the availability of several network infrastructure services. If any of these services is not reliable or performs poorly, it can adversely affect the performance and availability of your NFS clients. When diagnosing an NFS client problem, triple-check your automounter configuration first. It is often wise to disable the automounter before drilling into client problem diagnosis.

LINUX KERNEL TUNING FOR KNFS
sunrpc.tcp_slot_table_entries = 128

Increasing this parameter from the default of 16 to the maximum of 128 increases the number of in-flight Remote Procedure Calls (I/Os). Be sure to edit /etc/init.d/netfs to call /sbin/sysctl -p in the first line of the script so that sunrpc.tcp_slot_table_entries is set before NFS mounts any file systems. If NFS mounts the file systems before this parameter is set, the default value of 16 will be in force.
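A quick way to confirm that the setting took effect before the mounts happened is to read it back. This sketch assumes the value is exposed at /proc/sys/sunrpc/tcp_slot_table_entries (which requires the sunrpc module to be loaded); check_slot_table is a made-up helper name, and the second argument exists only so the function can be tried against an ordinary file.

```shell
#!/bin/sh
# Report whether the RPC slot table setting has reached the wanted value.
# $1 = wanted value (default 128), $2 = file to read (default: /proc entry)
check_slot_table() {
    want=${1:-128}
    file=${2:-/proc/sys/sunrpc/tcp_slot_table_entries}
    if [ ! -r "$file" ]; then
        echo "cannot read $file (is the sunrpc module loaded?)" >&2
        return 2
    fi
    have=$(cat "$file")
    if [ "$have" -ge "$want" ]; then
        echo "ok: tcp_slot_table_entries=$have"
    else
        echo "low: tcp_slot_table_entries=$have, wanted $want"
        return 1
    fi
}
```

Running `check_slot_table` with no arguments on the client after boot shows whether the default of 16 is still in force.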

Mount options
rw,bg,hard,intr,rsize=32768,wsize=32768,vers=3,proto=tcp,timeo=600,retrans=2

Kernel Tuning
Most modern Linux distributions contain a file called /etc/sysctl.conf where you can add changes such as this so they will be executed after every system reboot. Add these lines to your /etc/sysctl.conf file on your client systems:

net.core.rmem_default = 262144
Default TCP receive window size (Default buffer size)
Improve network performance for IP-based protocols

net.core.rmem_max = 16777216
Max. TCP receive window size (Max. buffer size)
Improve network performance for IP-based protocols

net.core.wmem_default = 262144
Default TCP send window size (Default buffer size)
Improve network performance for IP-based protocols

net.core.wmem_max = 16777216
Max. TCP send window size (Max. buffer size)
Improve network performance for IP-based protocols

net.ipv4.tcp_rmem = 4096 262144 16777216
Autotuning for TCP receive window size (min, default, and max values; for TCP sockets the default here takes precedence over rmem_default)
Improve network performance for IP-based protocols

net.ipv4.tcp_wmem = 4096 262144 16777216
Autotuning for TCP send window size (min, default, and max values; for TCP sockets the default here takes precedence over wmem_default)
Improve network performance for IP-based protocols

net.ipv4.tcp_window_scaling = 1
TCP window scaling, allows a TCP window size greater than 65535 bytes to be used
This is enabled by default (value 1); make sure that it does not get disabled (value 0).

net.ipv4.tcp_syncookies = 0
Disables generation of SYN (crypto) cookies
Helps to reduce CPU overhead

net.ipv4.tcp_timestamps = 0
Disables the RTTM feature introduced in RFC 1323
Helps to reduce CPU overhead. Prevents adding 10-byte overhead to the TCP header

net.ipv4.tcp_sack = 0
Disables selective ACK
Helps to reduce CPU overhead
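The settings above can be collected into one file in the "key = value" form that sysctl expects. The following is a sketch only: write_nfs_sysctl is a hypothetical helper, and the default output path ./nfs-tuning.conf is a placeholder so that nothing under /etc is touched by accident.

```shell
#!/bin/sh
# Write the settings discussed above to a sysctl-style file.
# $1 = output file (placeholder default, not a system path)
write_nfs_sysctl() {
    out=${1:-./nfs-tuning.conf}
    cat > "$out" <<'EOF'
net.core.rmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_default = 262144
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 262144 16777216
net.ipv4.tcp_wmem = 4096 262144 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
EOF
}
```

After reviewing the generated file, it can be loaded as root with `sysctl -p ./nfs-tuning.conf`, or its lines can be appended to /etc/sysctl.conf so they survive a reboot.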

Tuning on NFS and network

October 31st, 2009

Without any tuning
##########################################################
192.168.1.1:/vol/data /data nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noac,vers=3,timeo=600 0 0
[root@server data]#
[root@server /]# cd /data
[root@server data]# time dd if=/dev/zero of=bigfile1 bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 156.126 seconds, 6.7 MB/s
real 2m36.932s
user 0m0.000s
sys 0m1.476s

After tuning /etc/fstab (vi /etc/fstab)
########################################
192.168.1.1:/vol/data /data nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,noatime,vers=3,timeo=600 0 0

With Default setting
#############################################
echo 131071 > /proc/sys/net/core/rmem_default
echo 131071 > /proc/sys/net/core/rmem_max

[root@server data]# time dd if=/dev/zero of=bigfile1 bs=1024k count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 27.0813 seconds, 38.7 MB/s
real 0m27.091s
user 0m0.000s
sys 0m1.996s

Increase the sizes listed in the files
#######################################################
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/rmem_max

After doing more tuning on /etc/fstab
##############################################
192.168.1.1:/vol/data /data nfs rw,bg,hard,nointr,rsize=65536,wsize=65536,tcp,noatime,vers=3,timeo=600 0 0

[root@server gwire_master]# time dd if=/dev/zero of=bigfile1 bs=1024k count=1000

1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 14.0536 seconds, 74.6 MB/s
real 0m14.089s
user 0m0.000s
sys 0m2.168s
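To compare the three runs, the throughput can be recomputed from the byte count and elapsed time that dd prints. This is a small sketch with made-up helper names; it uses 1 MB = 1,000,000 bytes, which is how the dd output above reports its rate.

```shell
#!/bin/sh
# Throughput in MB/s from a byte count and elapsed seconds.
mb_per_sec() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

# Ratio between two elapsed times for the same amount of data.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

mb_per_sec 1048576000 156.126   # run without tuning
mb_per_sec 1048576000 27.0813   # run after the first fstab change
mb_per_sec 1048576000 14.0536   # run after the final changes
speedup 156.126 14.0536         # first run versus last run
```

The first three lines of output match the 6.7, 38.7 and 74.6 MB/s figures in the dd logs, and the last line shows that the final configuration finished roughly 11 times faster than the untuned one.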