Is there a way to tell the Linux kernel to only use a certain percentage of memory for the buffer cache? I know /proc/sys/vm/drop_caches can be used to clear the cache temporarily, but is there any permanent setting that prevents it from growing to more than e.g. 50% of main memory?
The reason I want to do this is that I have a server running a Ceph OSD which constantly serves data from disk and manages to use up the entire physical memory as buffer cache within a few hours. At the same time, I need to run applications that will allocate a large amount (several tens of GB) of physical memory. Contrary to popular belief (see the advice given on nearly all questions concerning the buffer cache), automatically freeing memory by discarding clean cache entries is not instantaneous: starting my application can take up to a minute when the buffer cache is full (*), while after clearing the cache (using echo 3 > /proc/sys/vm/drop_caches) the same application starts nearly instantaneously.
(*) During this minute of startup time, the application is faulting in new memory but spends 100% of its time in the kernel, according to VTune, in a function called pageblock_pfn_to_page. This function seems to be related to the memory compaction needed to find huge pages, which leads me to believe that fragmentation is the actual problem.
Answers:
Method 1
If you do not want an absolute limit but just want to pressure the kernel to reclaim cached data faster, look at vm.vfs_cache_pressure:
This variable controls the tendency of the kernel to reclaim the memory which is used for caching of VFS caches, versus pagecache and swap. Increasing this value increases the rate at which VFS caches are reclaimed.
The default is 100. Values below 100 make the kernel prefer to retain the dentry and inode caches; values above 100 (for example, 200) make it prefer to reclaim them. You can also analyze your memory usage with the slabtop command. In your case, the dentry and *_inode_cache values are probably high.
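A minimal sketch of inspecting and persisting this setting. The helper name vfs_pressure_conf_line and the file name 90-vfs-cache-pressure.conf are illustrative assumptions, not something from the answer above, and the commands that need root are left commented out.

```shell
#!/bin/sh
# Print a sysctl.d-style line for the requested vfs_cache_pressure value.
# Anything that is not a non-negative integer is rejected.
vfs_pressure_conf_line() {
    case "$1" in
        ''|*[!0-9]*)
            echo "error: value must be a non-negative integer" >&2
            return 1
            ;;
    esac
    printf 'vm.vfs_cache_pressure = %s\n' "$1"
}

# Read the current value (no root needed):
#   cat /proc/sys/vm/vfs_cache_pressure
# Apply a new value until the next reboot (root):
#   sysctl -w vm.vfs_cache_pressure=150
# Persist it (root; the file name is a hypothetical choice):
#   vfs_pressure_conf_line 150 > /etc/sysctl.d/90-vfs-cache-pressure.conf
# Check whether the dentry/inode slabs are actually large (usually needs root):
#   slabtop -o -s c | head -n 15
```

Calling vfs_pressure_conf_line 150 prints the line to store; whether 150 is a sensible value for a given workload has to be found by experiment.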
If you want an absolute limit, you should look up cgroups. Place the Ceph OSD server within a cgroup and limit the maximum memory it can use by setting the memory.limit_in_bytes parameter for the cgroup.
memory.memsw.limit_in_bytes sets the maximum amount for the sum of memory and swap usage. If no units are specified, the value is interpreted as bytes. However, it is possible to use suffixes to represent larger units: k or K for kilobytes, m or M for megabytes, and g or G for gigabytes.
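To make the suffix arithmetic concrete, here is a small sketch that converts a size such as 50G into bytes, treating each suffix as a power of 1024. The function name to_bytes, the group name ceph, and the cgroup v1 mount point in the comments are assumptions for illustration only.

```shell
#!/bin/sh
# Convert "<number>[k|K|m|M|g|G]" into a byte count (1k = 1024 bytes).
to_bytes() {
    tb_num=${1%[kKmMgG]}            # strip one optional unit suffix
    case "$tb_num" in
        ''|*[!0-9]*) return 1 ;;    # not a plain non-negative integer
    esac
    case "$1" in
        *[kK]) echo $((tb_num * 1024)) ;;
        *[mM]) echo $((tb_num * 1024 * 1024)) ;;
        *[gG]) echo $((tb_num * 1024 * 1024 * 1024)) ;;
        *)     echo "$tb_num" ;;
    esac
}

# Hypothetical use with a cgroup v1 memory controller mounted at the usual
# place and an existing group called "ceph" (root required):
#   to_bytes 50G > /sys/fs/cgroup/memory/ceph/memory.limit_in_bytes
```

The kernel accepts the suffixed form directly, so the conversion is only needed when a script wants to compare or log limits as plain numbers.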
References:
[1] – GlusterFS Linux Kernel Tuning
[2] – RHEL 6 Resource Management Guide
Method 2
I don’t know about a percentage, but you can set a time limit so the cache is dropped every x minutes.
First, in a terminal, run
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
This clears the current caches.
Then make it a cron job:
Press Alt-F2, type gksudo gedit /etc/crontab, then add this line near the bottom:
*/15 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches
This clears the cache every 15 minutes. You can set it to 1 or 5 minutes if you really want to by changing the first field to * or */5 instead of */15.
To see your free RAM, excluding cache (this relies on the older free output format, whose third line is "-/+ buffers/cache"):
free -m | sed -n -e '3p' | grep -Po "\d+$"
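On kernels that expose a MemAvailable line in /proc/meminfo (3.14 and later), the same figure can be read without depending on the layout of free. This is only a sketch; the function name mem_available_mib and its optional file argument are assumptions added so the parsing can be tried on a saved copy of the file.

```shell
#!/bin/sh
# Print the kernel's estimate of memory available to new applications,
# in MiB, from a meminfo-style file (defaults to /proc/meminfo).
mem_available_mib() {
    awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Running mem_available_mib with no argument reads the live system; it prints nothing if the kernel does not provide MemAvailable.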
Method 3
If Ceph OSD is one separate process, you could use cgroups to control the resources used by the process:
Create a cgroup named, say, group1 with a memory limit (50 GB in this example; other limits such as CPU are also supported, so the cpu controller is included here as well):
cgcreate -g memory,cpu:group1
cgset -r memory.limit_in_bytes=$((50*1024*1024*1024)) group1
Then, if your app is already running, move it into this cgroup:
cgclassify -g memory,cpu:group1 $(pidof your_app_name)
Or execute your app within this cgroup:
cgexec -g memory,cpu:group1 your_app_name
Method 4
I think your hunch at the very end of your question is on the right track. I’d suspect either (A) NUMA-aware memory allocation migrating pages between CPUs or, more likely, (B) the defrag code of transparent hugepages trying to find contiguous, aligned regions.
Hugepages and transparent hugepages have been credited with marked performance improvements on certain workloads and blamed for consuming enormous amounts of CPU time without providing much benefit.
It’d help to know which kernel you’re running, the contents of /proc/meminfo (or at least the HugePages_* values), and, if possible, more of the VTune profiler call graph referencing pageblock_pfn_to_page().
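A short sketch for collecting the hugepage-related values requested here. The function name hugepage_info is an assumption, and the optional file argument exists only so the filter can be run against a saved copy of /proc/meminfo.

```shell
#!/bin/sh
# Print only the hugepage-related lines of a meminfo-style file
# (defaults to /proc/meminfo).
hugepage_info() {
    grep -E '^(AnonHugePages|HugePages_[A-Za-z]+|Hugepagesize):' "${1:-/proc/meminfo}"
}

# The kernel version can be reported alongside it with:
#   uname -r
```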
Also, if you’d indulge my guess, try disabling hugepage defrag with:
echo never > /sys/kernel/mm/transparent_hugepage/defrag
(it may be this instead, depending on your kernel:)
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
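To check which mode is in effect before and after the change, the sysfs files can be read back: the kernel lists every option and marks the active one with square brackets. The helper below is a sketch, and the name thp_active_setting is an assumption.

```shell
#!/bin/sh
# Print the active option from a transparent hugepage sysfs file,
# i.e. the word the kernel encloses in [brackets].
thp_active_setting() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$1"
}

# Typical use on a kernel with the standard sysfs path:
#   thp_active_setting /sys/kernel/mm/transparent_hugepage/enabled
#   thp_active_setting /sys/kernel/mm/transparent_hugepage/defrag
```

Values written to these files do not survive a reboot, so a setting that helps would need to be reapplied at boot.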
Lastly, is this app using many tens of gigs of ram something you wrote? What language?
Since you used the term “faulting in memory pages,” I’m guessing you’re familiar enough with operating system design and virtual memory. I struggle to envision a situation or application that would be faulting so aggressively without reading in lots of I/O, almost always from the buffer cache that you’re trying to limit.
(If you’re curious, check out mmap(2) flags like MAP_ANONYMOUS and MAP_POPULATE and mincore(2) which can be used to see which virtual pages actually have a mapped physical page.)
Good Luck!
Method 5
You can also use vm.min_free_kbytes (/proc/sys/vm/min_free_kbytes), which sets the minimum amount of memory the kernel tries to keep free, to stop the Linux kernel from using too much memory for buffering. You will need to experiment with different values to find one that is acceptable for your use case.
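One way to pick candidate values for that experiment is to express them as a percentage of total RAM. The sketch below does only the arithmetic; the function name percent_of_memtotal_kb is an assumption, and the 2% in the comment is an arbitrary illustration rather than a recommendation. Setting min_free_kbytes too high can leave the system short of usable memory, so small steps are advisable.

```shell
#!/bin/sh
# Print <percent>% of MemTotal in kB, read from a meminfo-style file
# (defaults to /proc/meminfo).
percent_of_memtotal_kb() {
    awk -v p="$1" '$1 == "MemTotal:" { printf "%d\n", $2 * p / 100 }' "${2:-/proc/meminfo}"
}

# Read the current reserve (no root needed):
#   cat /proc/sys/vm/min_free_kbytes
# Hypothetical trial with 2% of RAM (root; takes effect immediately):
#   sysctl -w vm.min_free_kbytes="$(percent_of_memtotal_kb 2)"
```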
Method 6
tuned is a dynamic adaptive system tuning daemon that tunes system settings dynamically depending on usage.
$ man tuned
See the related documentation and configuration files:
/etc/tuned
/etc/tuned/*.conf
/usr/share/doc/tuned-2.4.1
/usr/share/doc/tuned-2.4.1/TIPS.txt
This parameter may be useful for you:
** Set flushing to once per 5 minutes ** (the value is in hundredths of a second, so 5 minutes is 30000)
echo "30000" > /proc/sys/vm/dirty_writeback_centisecs
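Because dirty_writeback_centisecs is expressed in hundredths of a second, it is easy to be off by a factor of ten. A tiny conversion helper, with the assumed name seconds_to_centisecs, makes the intended interval explicit.

```shell
#!/bin/sh
# Convert a whole number of seconds into centiseconds.
seconds_to_centisecs() {
    echo $(($1 * 100))
}

# Read the current flush interval (no root needed):
#   cat /proc/sys/vm/dirty_writeback_centisecs
# Set it to five minutes (root):
#   seconds_to_centisecs 300 > /proc/sys/vm/dirty_writeback_centisecs
```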
Additional Info
The sync command flushes the buffer, i.e., forces all unwritten data to be written to disk, and can be used when one wants to be sure that everything is safely written. In traditional UNIX systems, there is a program called update running in the background which does a sync every 30 seconds, so it is usually not necessary to use sync. Linux has an additional daemon, bdflush, which does a more imperfect sync more frequently to avoid the sudden freeze due to heavy disk I/O that sync sometimes causes.
Under Linux, bdflush is started by update. There is usually no reason to worry about it, but if bdflush happens to die for some reason, the kernel will warn about this, and you should start it by hand (/sbin/update).
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.