Performance Tuning Tips
The following is for reference only; you do not need to apply all of these settings to your physical machine.
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.4.0-64-generic root=/dev/mapper/vxRailSIM--vg-root ro intel_iommu=on hugepagesz=1GB hugepages=4 default_hugepagesz=1GB transparent_hugepage=never
vmlinuz is the name of the Linux kernel executable: a compressed, bootable Linux kernel image. The kernel is the program that constitutes the central core of the operating system.
root Root filesystem
ro Mount root device read-only on boot
intel_iommu is the kernel parameter used to enable or disable support for the Input-Output Memory Management Unit. It is a device built into some motherboards which is used by processors that support it to direct I/O to physical addresses.
HugePages is a feature integrated into the Linux kernel 2.6. Enabling HugePages makes it possible for the operating system to support memory pages greater than the default (usually 4KB).
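These parameters normally end up on the kernel command line via GRUB. A minimal sketch, assuming a Debian/Ubuntu-style /etc/default/grub as referenced later in this page (the hugepage count of 4 is illustrative; size it to your RAM):
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on hugepagesz=1GB hugepages=4 default_hugepagesz=1GB transparent_hugepage=never"
# apply the change and reboot
sudo update-grub
sudo reboot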
When a process uses memory, the CPU marks that RAM as used by the process. For efficiency, the CPU allocates RAM in chunks of 4K bytes (the default on many platforms). These chunks are called pages, and they can be swapped out to disk.
Since process address spaces are virtual, the CPU and the operating system have to remember which page belongs to which process and where it is stored. Obviously, the more pages there are, the longer it takes to find where the memory is mapped. When a process uses 1GB of memory, that is 262144 entries to look up (1GB / 4K); if each Page Table Entry consumes 8 bytes, that is 2MB (262144 * 8) of look-up tables.
Most current CPU architectures support bigger pages (so the CPU/OS has fewer entries to look up). These are called Huge Pages (on Linux), Super Pages (on BSD) or Large Pages (on Windows), but they are all the same thing.
The Irqbalance daemon is enabled by default. It is designed to distribute hardware interrupts across CPUs in a multi-core system in order to increase performance. However, it can cause the CPU running the VM to stall, resulting in dropped Rx packets. When irqbalance is disabled, all interrupts are handled by cpu0, so the VPP VM should NOT run on cpu0.
Disable irqbalance by setting ENABLED="0" in the default configuration file (/etc/default/irqbalance):
ENABLED="0"
ONESHOT="0"
This daemon is enabled by default and periodically forces interrupts to be handled by CPUs in an even, fair manner. However in realtime deployments, applications are typically dedicated and bound to specific CPUs, so the irqbalance daemon is not required.
$ service irqbalance stop
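On systemd-based hosts, a sketch for stopping irqbalance and keeping it from starting at boot (standard systemctl commands, no irqbalance-specific flags assumed):
systemctl stop irqbalance
systemctl disable irqbalance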
Speedstep is a CPU feature that dynamically adjusts the processor frequency to meet processing needs, decreasing the frequency under low CPU load. Turboboost overclocks a core when demand for CPU is high, and it requires that Speedstep is enabled. While these two features are good for power saving, they can introduce variance in dataplane performance when there is a burst of packets, so for consistency of behavior they should be disabled. For maximum performance, Speedstep and Turboboost can both be enabled; BIOS changes alone are likely not sufficient to enable Turboboost, and the host OS may also need changes to support running at higher clock speeds. The specific configuration changes required differ between Ubuntu, CentOS, RedHat, etc. On Ubuntu, "performance" mode should be set for all CPU cores in these files:
for CPUFREQ in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor;
do [ -f $CPUFREQ ] || continue;
echo -n performance > $CPUFREQ;
done
grep -E '^model name|^cpu MHz' /proc/cpuinfo
Avoiding CPU speed scaling: http://www.servernoobs.com/avoiding-cpu-speed-scaling-in-modern-linux-distributions-running-cpu-at-full-speed-tips
Flush dirty data less often: vm.dirty_writeback_centisecs defaults to 500 (500 * 1/100 s = 5 seconds).
echo 10000 > /proc/sys/vm/dirty_writeback_centisecs
https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
$ sysctl -a |grep dirty
vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.
vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.
Dirty page: a logical write occurs when data is modified in a page in the buffer cache; a physical write occurs when the page is written from the buffer cache to disk. When a page is modified in the buffer cache, it is not immediately written back to disk; instead, the page is marked as dirty.
pdflush is a set of kernel threads which are responsible for writing the dirty pages to disk, either explicitly in response to a sync() call, or implicitly in cases when the page cache runs out of pages, if the pages have been in memory for too long, or there are too many dirty pages in the page cache (as specified by /proc/sys/vm/dirty_ratio).
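As an example, the writeback knobs discussed above can be set persistently in /etc/sysctl.conf; the values below are only illustrative, not recommendations:
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10
vm.dirty_expire_centisecs = 3000
vm.dirty_writeback_centisecs = 10000
Reload with "sudo sysctl -p".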
Set it in /etc/sysctl.conf by appending "vm.swappiness = 0" and running "sudo sysctl -p" to reload the values.
The Linux kernel has quite a number of tunable options. One of those is vm.swappiness, a parameter that helps guide the kernel in making decisions about memory. "vm" in this case means "virtual memory," which doesn't mean memory allocated by a hypervisor but refers to the addressing scheme the Linux kernel uses to handle memory; even on a physical host you have "virtual memory" within the OS.
Memory on a Linux box is used for a number of different things. One use is internal buffers for things like network stacks, SCSI queues, etc. Another is the obvious use by applications. The third big use is as disk cache, where RAM not used for buffers or applications is used to make disk read accesses faster. All of these uses are important, so when RAM is scarce, how does the kernel decide what's more important and what should be sent to the swap file? The kernel buffers always stay in main memory, because they have to. Applications and cache don't need to stay in RAM, though: the cache can be dropped, and the applications can be paged out to the swap file. Dropping cache means a potential performance hit, and likewise with paging applications out.
The vm.swappiness parameter helps the kernel decide what to do. Setting it to the maximum of 100 makes the kernel swap very aggressively; setting it to 0 makes the kernel swap only to protect against an out-of-memory condition. The default is 60, which means that some swapping will occur. In this context we're talking about swapping inside the guest OS, not at the hypervisor level. Swapping is bad in a virtual environment at any level: when the guest OS starts swapping on a physical server it only affects that server, but in a virtual environment it causes problems for all the workloads.
Improving disk I/O performance in QEMU 2.5 with the qcow2 L2 cache
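The article above tunes the qcow2 metadata (L2) cache through the l2-cache-size drive option. A minimal sketch, assuming an image named disk.qcow2 (with the default 64 KB cluster size, roughly 1 MB of L2 cache covers 8 GB of virtual disk):
-drive file=disk.qcow2,format=qcow2,l2-cache-size=4194304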
KSM is a memory-saving de-duplication feature that merges anonymous (private) pages (not pagecache ones). Although it started this way, KSM is currently suitable for more than Virtual Machine use, as it can be useful to any application which generates many instances of the same data. Enabling KSM: KSM only operates on those areas of the address space which an application has advised to be likely candidates for merging, by using the madvise(2) system call: int madvise(addr, length, MADV_MERGEABLE). The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel that advice and restore unshared pages, whereupon KSM unmerges whatever it merged in that range. Note: this unmerging call may suddenly require more memory than is available - possibly failing with EAGAIN, but more probably arousing the Out-Of-Memory killer.
Set the following parameters if you use KSM:
echo 1 > /sys/kernel/mm/ksm/run
Adjust the sleep time of KSM:
echo 200 > /sys/kernel/mm/ksm/sleep_millisecs
It is suggested to disable KSM and the daemon [ksmd] if you only care about the performance of the VM.
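A sketch for doing that, using the values documented for the KSM sysfs interface (0 stops ksmd scanning, 2 additionally unmerges all pages that were already merged):
echo 0 > /sys/kernel/mm/ksm/run
# or, to also unmerge everything KSM has merged so far:
echo 2 > /sys/kernel/mm/ksm/run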
echo 1 > /proc/sys/kernel/numa_balancing
From the Virtualization Tuning and Optimization Guide: automatic NUMA balancing improves the performance of applications running on NUMA hardware systems, without any manual tuning required, for Red Hat Enterprise Linux 7 guests. Automatic NUMA balancing moves tasks, which can be threads or processes, closer to the memory they are accessing.
In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect).
In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly, with a distinct performance advantage. At the same time, it can also access any memory module belonging to another processor over a shared bus (or some other type of interconnect).
Modern multiprocessor systems mix these basic architectures.
The optimal configuration is (usually) as follows:
On the host, set elevator=deadline.
Use virtio and only virtio.
Use raw LVs whenever possible; qcow2 adds overhead, and files on a filesystem also add overhead.
In the VM use elevator=noop; both in host and VM, use noatime,nodiratime in fstab wherever possible.
Make sure the virtio drivers are up to date, especially the Windows ones.
Debian based distros are (arguably) not as good as Fedora and RHEL for QEMU/KVM. Not to start a flamewar, but most of the development and testing is done on Fedora and RHEL, and in my own experience, there have been lots of issues on Ubuntu and Debian that I couldn't reproduce on Fedora and RHEL. You can ignore this particular bullet if you want, but if you're looking for a solution, a quick benchmark on another distro is usually worth a try.
Try setting "deadline" as the I/O scheduler for your host's disks before starting KVM:
for f in /sys/block/sd*/queue/scheduler
do echo "deadline" > $f
done
If you have I/O bound load, it might be your best choice.
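To make the deadline scheduler persist across reboots, one option is to add elevator=deadline to the kernel command line in /etc/default/grub (a sketch assuming a Debian/Ubuntu-style GRUB setup; append to the existing variable rather than replacing it):
GRUB_CMDLINE_LINUX_DEFAULT="elevator=deadline"
sudo update-grub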
Virtio is a virtualization standard for network and disk device drivers where just the guest's device driver "knows" it is running in a virtual environment, and cooperates with the hypervisor.
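A sketch of QEMU options that use virtio for both disk and network; the image path, tap interface name and MAC address are illustrative assumptions:
-drive file=/var/lib/images/guest.img,if=virtio,format=raw
-netdev tap,id=net0,ifname=tap0,script=no,downscript=no
-device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56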
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo never > /sys/kernel/mm/transparent_hugepage/enabled
Permanently disable transparent hugepages as in item 1, or install the sysfsutils package (apt install sysfsutils) and append a line with that setting to /etc/sysfs.conf: kernel/mm/transparent_hugepage/enabled = never. This is the cleanest solution, because it keeps all the sysfs configuration in one place instead of relying on custom start-up scripts. The other approaches, with scripts and conditional expressions, are suitable if you don't know through which path the kernel will expose that setting, i.e. if you don't even have a rough idea of the kernel version running on the affected machine.
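A sketch of the resulting /etc/sysfs.conf, covering both of the settings shown above:
kernel/mm/transparent_hugepage/enabled = never
kernel/mm/transparent_hugepage/defrag = never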
https://www.ibm.com/developerworks/cn/linux/l-cn-hugetlb/
echo 0 > /proc/sys/vm/zone_reclaim_mode
A guest operating system can sometimes cause zone reclaim to occur when you pin memory. For example, a guest operating system causes zone reclaim in the following situations:
- When you configure the guest operating system to use huge pages.
- When you use Kernel same-page merging (KSM) to share memory pages between guest operating systems.
Configuring huge pages and running KSM are both best practices for KVM environments. Therefore, to optimize performance in KVM environments, disable zone reclaim. Zone reclaim can be made more aggressive by enabling write-back of dirty pages or the swapping of anonymous pages, but in practice doing so has often resulted in significant performance issues.
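To make the setting from above persistent, the same value can go into /etc/sysctl.conf (a sketch):
vm.zone_reclaim_mode = 0
Reload with "sudo sysctl -p".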
-realtime mlock=off
-object memory-backend-ram,size=65536M,policy=bind,prealloc=on,host-nodes=0,id=ram-node0 -numa node,nodeid=0,cpus=0-7,memdev=ram-node0
-object memory-backend-ram,size=32768M,policy=bind,prealloc=on,host-nodes=0,id=ram-node1 -numa node,nodeid=0,cpus=0-7,memdev=ram-node1
-object memory-backend-ram,size=32768M,policy=bind,prealloc=on,host-nodes=1,id=ram-node2 -numa node,nodeid=1,cpus=0-7,memdev=ram-node2
-object memory-backend-ram,size=32768M,policy=bind,prealloc=on,host-nodes=1,id=ram-node3 -numa node,nodeid=1,cpus=0-7,memdev=ram-node3
Notes:
a. -numa node,nodeid=0,cpus=0,memdev=ram-node0 creates a guest NUMA node.
b. -object memory-backend-ram,size=256M,id=ram-node0 tells QEMU from which host NUMA node it should allocate that guest memory region.
c. The -object size=[memory] value must match the memory given to QEMU's -m [memory] option, and both must fit within the node's free memory, which can be checked with the command "numactl --hardware".
numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 128895 MB
node 0 free: 68283 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 129020 MB
node 1 free: 22851 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
d. In -numa node,cpus=[...], the cpus value refers to guest vCPU numbers, not physical CPU cores or sockets; in this example its value is 0-7.
e. Also take care with "-smp 8,sockets=2,cores=4,threads=1": the smp value must equal sockets * cores * threads.
Combining the options above, we can instruct QEMU to create a guest NUMA node that is tied to a host NUMA node.
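A minimal sketch putting the pieces together for a two-node guest; the sizes, vCPU ranges and image path are illustrative assumptions, and -m must match the total of the memory backends as noted above:
qemu-system-x86_64 -enable-kvm -m 8192 -smp 8,sockets=2,cores=4,threads=1 \
  -object memory-backend-ram,size=4096M,policy=bind,prealloc=on,host-nodes=0,id=ram-node0 \
  -numa node,nodeid=0,cpus=0-3,memdev=ram-node0 \
  -object memory-backend-ram,size=4096M,policy=bind,prealloc=on,host-nodes=1,id=ram-node1 \
  -numa node,nodeid=1,cpus=4-7,memdev=ram-node1 \
  -drive file=/var/lib/images/guest.img,if=virtio,format=raw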
The IVSHMEM library facilitates fast zero-copy data sharing among virtual machines (host-to-guest or guest-to-guest) by means of QEMU's IVSHMEM mechanism. The library works by providing a command line for QEMU to map several hugepages into a single IVSHMEM device. For the guest to know what is inside any given IVSHMEM device, a metadata file is also mapped into the IVSHMEM segment. No work needs to be done by the guest application to map IVSHMEM devices into memory.
A typical DPDK (Data Plane Development Kit: a set of data plane libraries and network interface controller drivers for fast packet processing) IVSHMEM use case looks like the following.
If we now run lspci in the guest, the ivshmem PCI device is listed. For performance reasons, it is best to pin host processes and QEMU processes to different cores so that they do not interfere with each other. If NUMA support is enabled, it is also desirable to keep the host process's hugepage memory and the QEMU process on the same NUMA node. For the best performance across all NUMA nodes, each QEMU core should be pinned to a host CPU core on the appropriate NUMA node, and QEMU's virtual NUMA nodes should be set up to correspond to physical NUMA nodes.
The QEMU IVSHMEM command line creation should be considered the last step before starting the virtual machine. Currently, there is no hot plug support for QEMU IVSHMEM devices, so one cannot add additional memory to an IVSHMEM device once it has been created. Therefore, the correct sequence for running an IVSHMEM application is to run the host application first, obtain the command lines for each IVSHMEM device, and then run all QEMU instances with guest applications afterwards. It is important to note that once QEMU is started, it holds on to the hugepages it uses for IVSHMEM devices. As a result, if the user wishes to shut down or restart the IVSHMEM host application, it is not enough to simply shut the application down; the virtual machine must also be shut down (if not, it will hold onto outdated host data).
On the host:
./ivshmem-server -S /tmp/ivshmem_socket -m /tmp/share/ -l 20G -n 1
QEMU options:
-device ivshmem-doorbell,vectors=1,chardev=id -chardev socket,path=/tmp/ivshmem_socket,id=id
https://github.com/qemu/qemu/blob/master/docs/specs/ivshmem-spec.txt
Check whether the compute node supports huge memory pages:
grep -m1 -E "pse|pdpe1gb" /proc/cpuinfo
Configure huge pages by adding the hugepage settings to the kernel parameter list in /etc/default/grub, and at the same time disable THP (see item 1). Run the command "update-grub" and reboot the OS to make it take effect. Create the hugepage folders:
mkdir /dev/hugepages_node0
mkdir /dev/hugepages_node1
Mount huge page:
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages_node0
mount -t hugetlbfs -o pagesize=1G none /dev/hugepages_node1
Bind the two nodes to the 4 QEMU processes:
-mem-path /dev/hugepages_node0
-mem-path /dev/hugepages_node1
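An alternative sketch that binds each guest NUMA node to its own hugepage mount from within a single QEMU process, using memory-backend-file instead of the global -mem-path; sizes and vCPU ranges are illustrative assumptions:
-object memory-backend-file,id=ram-node0,size=32768M,mem-path=/dev/hugepages_node0,share=on,prealloc=on \
-numa node,nodeid=0,cpus=0-7,memdev=ram-node0 \
-object memory-backend-file,id=ram-node1,size=32768M,mem-path=/dev/hugepages_node1,share=on,prealloc=on \
-numa node,nodeid=1,cpus=8-15,memdev=ram-node1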
Checking Hugepage usage
grep -i huge /proc/meminfo
or
cat /sys/devices/system/node/node*/hugepages/hugepages-*kB/nr_hugepages
example on Dell C6320:
$cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
0
$cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
0
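If the counters show 0, hugepages can be reserved per node at runtime (a sketch with an illustrative count of 4; 1G pages often need to be reserved early, ideally on the kernel command line as in item 1):
echo 4 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 4 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages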
| Cache mode | Page Cache | Write-back cache | Write flush | Description |
|---|---|---|---|---|
| Direct-sync | NO | NO | N/A | No improvement |
| Write-through | YES | NO | YES | Improves I/O performance through the host OS cache |
| none/off | NO | YES | N/A | Disables the host page cache; improves write performance, ensures data safety |
| Write-back | YES | YES | YES | Improves read/write performance, but might lose data |
| unsafe | YES | YES | NO | Improves read/write performance, does not ensure data safety |

"none" is the best option, so add cache=none to the QEMU options.
AIO (asynchronous I/O) comes in two forms: native aio (kernel AIO) and threaded aio (user-space AIO emulated by POSIX thread workers). The kernel approach performs slightly better than the user-space one, so native is chosen in most cases: aio=native
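Combining the recommendations above (raw LV, virtio, cache=none, aio=native) into a single drive option; the LV path is an illustrative assumption, and aio=native generally requires cache=none:
-drive file=/dev/vg0/guest-lv,format=raw,if=virtio,cache=none,aio=native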
At a high level, all of a VM's vCPUs should stay within the same NUMA node whenever possible, to reduce memory access overhead. Looking more closely at the physical CPU, threads of the same core share the L2 cache and cores of the same socket share the L3 cache, so a VM's vCPUs should also stay on the same core and the same socket where possible, increasing cache hit rates and therefore performance. The concrete pinning strategy should be decided based on the host's physical CPU characteristics and the number of vCPUs in the VM.
- If the VM has no more than 2 vCPUs and the physical CPU supports hyper-threading, the vCPUs can be bound to the same core.
- If the number of vCPUs is no more than the total number of logical processors in one physical socket, the vCPUs can be bound to the same socket.
- If the number of vCPUs is no more than the total number of logical processors in one NUMA node, the vCPUs can be bound to the same node.
- In practice, if a physical processor is so heavily loaded that the VM becomes sluggish, rebind the VM to other, less busy processors.
- Bind multiple cores to the QEMU process
Option: numactl --physcpubind=xxxx --localalloc qemu-application. List all core_id/socket_id mappings before you bind.
cores = [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
sockets = [0, 1]

| Core | Socket 0 | Socket 1 |
|---|---|---|
| Core 0 | [0, 20] | [10, 30] |
| Core 1 | [1, 21] | [11, 31] |
| Core 2 | [2, 22] | [12, 32] |
| Core 3 | [3, 23] | [13, 33] |
| Core 4 | [4, 24] | [14, 34] |
| Core 8 | [5, 25] | [15, 35] |
| Core 9 | [6, 26] | [16, 36] |
| Core 10 | [7, 27] | [17, 37] |
| Core 11 | [8, 28] | [18, 38] |
| Core 12 | [9, 29] | [19, 39] |

http://www.glennklockwood.com/hpc-howtos/process-affinity.html
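A sketch for obtaining this CPU/core/socket/node mapping on the host and then binding QEMU to one socket's cores with local memory allocation (the core list below is taken from the example table above):
lscpu -e          # shows CPU, CORE, SOCKET and NODE columns
numactl --physcpubind=10-19,30-39 --localalloc qemu-system-x86_64 <qemu options>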
RAMDisk is a program that takes a portion of your system memory and uses it as a disk drive. The more RAM your computer has, the larger the RAMDisk you can create.
mkdir /mnt/ramdisk[x]
chmod [value] /mnt/ramdisk[x]
mount -t tmpfs -o size=[yy] tmpfs /mnt/ramdisk[x]
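A concrete sketch with illustrative values (a 1 GB tmpfs RAM disk, world-writable):
mkdir /mnt/ramdisk0
chmod 777 /mnt/ramdisk0
mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk0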
To test write speed of RAM disk, you can use dd utility.
sudo dd if=/dev/zero of=/mnt/ramdisk0/zero bs=4k count=10000
To test read speed, run:
sudo dd if=/mnt/ramdisk0/zero of=/dev/null bs=4k count=10000
mount -t [TYPE] -o size=[SIZE] [FSTYPE] [MOUNTPOINT]
Substitute the following attributes for your own values:
* [TYPE] is the filesystem type to use: either tmpfs or ramfs.
* [SIZE] is the size to use for the filesystem. Remember that ramfs does not have a physical limit; the size given is only a starting size.
* [FSTYPE] is the device/source field of the mount command; for tmpfs or ramfs this is just a placeholder name (commonly tmpfs, ramfs, or none).
* [MOUNTPOINT] is the directory where the RAM disk is mounted.
_**tmpfs**_ is a common name for a temporary file storage facility on many Unix-like operating systems. It is intended to appear as a mounted file system, but stored in volatile memory instead of a persistent storage device.
_**ramfs**_ is a very simple FileSystem that exports Linux's disk caching mechanisms (the page cache and dentry cache) as a dynamically resizable RAM-based filesystem. Normally all files are cached in memory by Linux.
http://osr507doc.sco.com/man/html.HW/ramdisk.HW.html
http://leeon.me/a/linux-ramdisk-tmpfs-ramfs
mount -o nobarrier,nodiscard,data=writeback /dev/sd /sd
noatime Do not update inode access times on this filesystem (e.g., for faster access on the news spool to speed up news servers). This works for all inode types (directories too), so implies nodiratime.
nodiratime Do not update directory inode access times on this filesystem. If the noatime option is set, this option is not needed.
nobarrier Enable/disable the use of block-layer write barriers. Write barriers ensure that certain IOs make it through the device cache and are on persistent storage. If disabled on a device with a volatile (non-battery-backed) write-back cache, the nobarrier option will lead to filesystem corruption on a system crash or power loss.
nodiscard Controls whether ext4 should issue discard/TRIM commands to the underlying block device when blocks are freed. This is useful for SSD devices and sparse/thinly-provisioned LUNs, but it is off by default until sufficient testing has been done.
data=writeback Data ordering is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal. This is rumored to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.
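Putting these options together, a sketch of an /etc/fstab entry for a non-root data filesystem; the device and mount point are illustrative assumptions, and nobarrier/data=writeback trade safety for throughput as described above:
/dev/sdb1  /data  ext4  noatime,nodiratime,nobarrier,nodiscard,data=writeback  0  2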
http://searchenterpriselinux.techtarget.com/tip/Tuning-the-Ext4-file-system-for-optimal-performance
https://docs.salixos.org/wiki/How_to_Tune_an_SSD
Reference link: https://www.kernel.org/doc/Documentation/sysctl/vm.txt https://wiki.fd.io/view/VPP/How_To_Optimize_Performance_(System_Tuning)
http://download.qemu-project.org/qemu-doc.html