I had a directory containing around 5 million files. When I ran the ls command inside this directory, my system consumed a huge amount of memory and hung after some time. Is there a more efficient way to list the files than the ls command?
Answers:
Method 1
Avoid sorting by using:
ls --sort=none # "do not sort; list entries in directory order"
Or, equivalently:
ls -U
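To illustrate what "directory order" means in practice, here is a minimal C sketch that streams entries as readdir() returns them, without collecting or sorting anything first. It is not how ls is implemented; the function name list_unsorted is made up for this example, and unlike ls -U it also prints hidden files.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Stream the entries of `path` to `out` in directory order, one per line,
 * without collecting or sorting them first. Returns the number of entries
 * printed, or -1 if the directory cannot be opened.
 * (Hypothetical helper, written only for illustration.) */
long list_unsorted(const char *path, FILE *out)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    long count = 0;
    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        /* Skip the "." and ".." pseudo-entries. */
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;
        fprintf(out, "%s\n", entry->d_name);
        count++;
    }
    closedir(dir);
    return count;
}
```

Calling list_unsorted(".", stdout) prints each name as soon as it is read, so memory use stays flat no matter how many files the directory holds.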
Method 2
ls sorts the files before listing them, which becomes a huge overhead when a directory holds more than a million files. One suggestion is to use strace or find to list the files, but those options also seemed infeasible for my problem since I had 5 million files. After a bit of googling, I found that listing the directory with getdents() directly is supposed to be faster: ls, find and Python libraries use readdir(), which is slower even though it calls getdents() underneath.
The following C code lists the files using getdents():
/*
 * List directories using getdents() because ls, find and Python libraries
 * use readdir() which is slower (but uses getdents() underneath).
 *
 * Compile with
 * ]$ gcc getdents.c -o getdents
 */
#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* Record layout returned by the getdents() system call. */
struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE (1024*1024*5)

int
main(int argc, char *argv[])
{
    int fd, nread;
    static char buf[BUF_SIZE];  /* static: keeps the 5 MB buffer off the stack */
    struct linux_dirent *d;
    int bpos;
    char d_type;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        handle_error("open");

    for ( ; ; ) {
        /* Fill buf with as many directory records as fit. */
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            handle_error("getdents");
        if (nread == 0)     /* end of directory */
            break;

        for (bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            /* The file type is stored in the last byte of each record. */
            d_type = *(buf + bpos + d->d_reclen - 1);
            /* Print regular files only. */
            if (d->d_ino != 0 && d_type == DT_REG) {
                printf("%s\n", (char *)d->d_name);
            }
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}
Save the C program above as getdents.c in the directory whose files need to be listed, then run the commands below. The program also accepts a directory path as its first argument.
gcc getdents.c -o getdents
./getdents
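The program above relies on SYS_getdents, which is not defined on every architecture (arm64, for instance, only provides getdents64). As a hedged alternative, here is a sketch of the same loop built on SYS_getdents64, using the record layout given in the getdents64(2) man page. The function name list_with_getdents64 and the 1 MB buffer size are choices made for this example, and it prints every entry rather than regular files only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for syscall() and O_DIRECTORY */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout documented in the getdents64(2) man page. */
struct linux_dirent64 {
    uint64_t       d_ino;
    int64_t        d_off;
    unsigned short d_reclen;
    unsigned char  d_type;      /* explicit field, unlike the legacy layout */
    char           d_name[];
};

/* Print every entry of `path` (except "." and "..") to `out`.
 * Returns the number of entries printed, or -1 on error.
 * (Hypothetical helper, written only for illustration.) */
long list_with_getdents64(const char *path, FILE *out)
{
    static _Alignas(8) char buf[1024 * 1024];
    long count = 0;

    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        return -1;

    for (;;) {
        /* Each call fills buf with as many whole records as fit. */
        long nread = syscall(SYS_getdents64, fd, buf, sizeof buf);
        if (nread == -1) {
            close(fd);
            return -1;
        }
        if (nread == 0)         /* end of directory */
            break;

        for (long bpos = 0; bpos < nread;) {
            struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + bpos);
            if (strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0) {
                fprintf(out, "%s\n", d->d_name);
                count++;
            }
            bpos += d->d_reclen;    /* advance to the next record */
        }
    }

    close(fd);
    return count;
}
```

A caller could use it as list_with_getdents64(".", stdout). Because d_type is a named field here, filtering on d->d_type == DT_REG (with <dirent.h> included) would reproduce the regular-files-only behaviour of the original program.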
Timings example: getdents can be much faster than ls -f, depending on the system configuration. Here are some timings demonstrating a 40x speed increase for listing a directory containing about 500k files over an NFS mount in a compute cluster. Each command was run 10 times in immediate succession, first getdents, then ls -f. The first run is significantly slower than all others, probably due to NFS caching page faults. (Aside: over this mount, the d_type field is unreliable, in the sense that many files appear as “unknown” type.)
command: getdents $bigdir
usr:0.08 sys:0.96 wall:280.79 CPU:0%
usr:0.06 sys:0.18 wall:0.25 CPU:97%
usr:0.05 sys:0.16 wall:0.21 CPU:99%
usr:0.04 sys:0.18 wall:0.23 CPU:98%
usr:0.05 sys:0.20 wall:0.26 CPU:99%
usr:0.04 sys:0.18 wall:0.22 CPU:99%
usr:0.04 sys:0.17 wall:0.22 CPU:99%
usr:0.04 sys:0.20 wall:0.25 CPU:99%
usr:0.06 sys:0.18 wall:0.25 CPU:98%
usr:0.06 sys:0.18 wall:0.25 CPU:98%

command: /bin/ls -f $bigdir
usr:0.53 sys:8.39 wall:8.97 CPU:99%
usr:0.53 sys:7.65 wall:8.20 CPU:99%
usr:0.44 sys:7.91 wall:8.36 CPU:99%
usr:0.50 sys:8.00 wall:8.51 CPU:100%
usr:0.41 sys:7.73 wall:8.15 CPU:99%
usr:0.47 sys:8.84 wall:9.32 CPU:99%
usr:0.57 sys:9.78 wall:10.36 CPU:99%
usr:0.53 sys:10.75 wall:11.29 CPU:99%
usr:0.46 sys:8.76 wall:9.25 CPU:99%
usr:0.50 sys:8.58 wall:9.13 CPU:99%
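The aside about d_type being unreliable matters for the getdents program, because it prints only entries whose type is DT_REG and would silently skip files reported as "unknown". One possible workaround, sketched here under the assumption that an extra stat per unknown entry is acceptable, is to fall back to fstatat() when the type is DT_UNKNOWN. The helper name is_regular_file is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for DT_* constants and fstatat() */
#endif
#include <dirent.h>
#include <fcntl.h>
#include <sys/stat.h>

/* Return 1 if the entry `name` inside the directory open on `dirfd` is a
 * regular file, 0 if it is not, and -1 if its type cannot be determined.
 * `d_type` is the type byte reported by getdents() or readdir().
 * (Hypothetical helper, written only for illustration.) */
int is_regular_file(int dirfd, const char *name, unsigned char d_type)
{
    if (d_type == DT_REG)
        return 1;
    if (d_type != DT_UNKNOWN)
        return 0;

    /* The filesystem did not report a type, so ask for it explicitly.
     * AT_SYMLINK_NOFOLLOW makes the check apply to the entry itself
     * rather than to a symlink target. */
    struct stat st;
    if (fstatat(dirfd, name, &st, AT_SYMLINK_NOFOLLOW) == -1)
        return -1;
    return S_ISREG(st.st_mode) ? 1 : 0;
}
```

In the getdents loop, the test d_type == DT_REG could be replaced by is_regular_file(fd, d->d_name, d_type) == 1. The fallback costs one extra system call for each unknown entry, so on a mount where most entries are unknown it may give back much of the speed advantage.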
Method 3
The most likely reason it is slow is file type colouring, which can require an extra stat call per file. You can avoid this by running \ls or /bin/ls, which bypasses any shell alias that turns the colour options on.
If you really have that many files in a directory, using find instead is also a good option.
Method 4
I find that echo * works much faster than ls. YMMV.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.