CentOS 5.9
I came across an issue the other day where a directory had a lot of files. To count them, I ran ls -l /foo/foo2/ | wc -l
Turns out that there were over 1 million files in a single directory (long story — the root cause is getting fixed).
My question is: is there a faster way to do the count? What would be the most efficient way to get the count?
Answers:
Method 1
Short answer:
ls -afq | wc -l
(This includes . and .., so subtract 2.)
When you list the files in a directory, three common things might happen:
1. Enumerating the file names in the directory. This is inescapable: there is no way to count the files in a directory without enumerating them.
2. Sorting the file names. Shell wildcards and the ls command do that.
3. Calling stat to retrieve metadata about each directory entry, such as whether it is a directory.
#3 is the most expensive by far, because it requires loading an inode for each file. In comparison all the file names needed for #1 are compactly stored in a few blocks. #2 wastes some CPU time but it is often not a deal breaker.
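The gap between enumerating names (#1) and loading per-file metadata (#3) can be sketched in Python 3. This is only an illustration and not part of the original answer: the function names are hypothetical, and it assumes Python 3.6 or later for `os.scandir` used as a context manager.

```python
import os

def count_names_only(path="."):
    """Count entries by reading the directory itself (step #1); no per-file stat."""
    with os.scandir(path) as it:
        return sum(1 for _ in it)

def count_with_stat(path="."):
    """Count entries and also lstat each one (step #3), as ls -l or ls --color must."""
    n = 0
    for name in os.listdir(path):
        os.lstat(os.path.join(path, name))  # forces an inode lookup per entry
        n += 1
    return n
```

Both functions return the same number (neither counts . or ..); on a very large directory the second one should be the slower of the two because of the extra lookup per entry.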
If there are no newlines in file names, a simple ls -A | wc -l tells you how many files there are in the directory. Beware that if you have an alias for ls, this may trigger a call to stat (e.g. ls --color or ls -F need to know the file type, which requires a call to stat), so from the command line, call command ls -A | wc -l or \ls -A | wc -l to avoid an alias.
If there are newlines in the file name, whether newlines are listed or not depends on the Unix variant. GNU coreutils and BusyBox default to displaying ? for a newline, so they’re safe.
Call ls -f to list the entries without sorting them (#2). This automatically turns on -a (at least on modern systems). The -f option is in POSIX but with optional status; most implementations support it, but not BusyBox. The option -q replaces non-printable characters including newlines by ?; it’s POSIX but isn’t supported by BusyBox, so omit it if you need BusyBox support at the expense of overcounting files whose name contains a newline character.
If the directory has no subdirectories, then most versions of find will not call stat on its entries (leaf directory optimization: a directory that has a link count of 2 cannot have subdirectories, so find doesn’t need to look up the metadata of the entries unless a condition such as -type requires it). So find . | wc -l is a portable, fast way to count files in a directory provided that the directory has no subdirectories and that no file name contains a newline.
If the directory has no subdirectories but file names may contain newlines, try one of these (the second one should be faster if it’s supported, but may not be noticeably so).
find -print0 | tr -dc '\0' | wc -c
find -printf a | wc -c
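Why counting NUL terminators is safer than counting lines can be shown with a small Python 3 sketch. The function names are hypothetical, and the demo at the bottom assumes a filesystem that allows newlines in file names (typical on Linux, not on Windows).

```python
import os
import tempfile

def count_by_lines(names):
    """What a raw listing piped to wc -l measures: newline characters."""
    listing = "".join(name + "\n" for name in names)
    return listing.count("\n")

def count_by_nul(names):
    """What find -print0 | tr -dc '\\0' | wc -c measures: NUL terminators."""
    listing = "".join(name + "\0" for name in names)
    return listing.count("\0")

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "plain"), "w").close()
        open(os.path.join(d, "two\nlines"), "w").close()
        names = os.listdir(d)
        print(count_by_lines(names), count_by_nul(names))
```

With one ordinary name and one name containing a newline, the line-based count is 3 while the NUL-based count is the correct 2.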
On the other hand, don’t use find if the directory has subdirectories: even find . -maxdepth 1 calls stat on every entry (at least with GNU find and BusyBox find). You avoid sorting (#2) but you pay the price of an inode lookup (#3) which kills performance.
In the shell without external tools, you can count the files in the current directory with set -- *; echo $#. This misses dot files (files whose name begins with .) and reports 1 instead of 0 in an empty directory. This is the fastest way to count files in small directories because it doesn’t require starting an external program, but (except in zsh) wastes time for larger directories due to the sorting step (#2).
- In bash, this is a reliable way to count the files in the current directory:
shopt -s dotglob nullglob
a=(*)
echo ${#a[@]}
- In ksh93, this is a reliable way to count the files in the current directory:
FIGNORE='@(.|..)'
a=(~(N)*)
echo ${#a[@]}
- In zsh, this is a reliable way to count the files in the current directory:
a=(*(DNoN))
echo $#a
If you have the mark_dirs option set, make sure to turn it off: a=(*(DNoN^M)).
- In any POSIX shell, this is a reliable way to count the files in the current directory:
total=0
set -- *
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
set -- .[!.]*
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
set -- ..?*
if [ $# -ne 1 ] || [ -e "$1" ] || [ -L "$1" ]; then total=$((total+$#)); fi
echo "$total"
All of these methods sort the file names, except for the zsh one.
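The same two pitfalls of a bare * (dot files skipped, empty directory miscounted) can be compared in Python 3, whose glob module also skips dot files by default. This is a sketch for comparison only; the function names are hypothetical and `glob.escape` assumes Python 3.4 or later.

```python
import glob
import os

def count_like_star(path="."):
    """Like the shell's *: names beginning with a dot are not matched."""
    return len(glob.glob(os.path.join(glob.escape(path), "*")))

def count_all(path="."):
    """Like bash with dotglob and nullglob: every entry except . and .."""
    return len(os.listdir(path))
```

Unlike set -- * in a POSIX shell, both functions return 0 for an empty directory, because an unmatched pattern yields an empty list rather than the pattern itself.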
Method 2
find /foo/foo2/ -maxdepth 1 | wc -l
This is considerably faster on my machine, but the directory itself (.) is added to the count.
Method 3
ls -1U before the pipe should use slightly fewer resources, as it makes no attempt to sort the file entries; it just reads them in the order in which they are stored in the directory on disk. It also produces less output, meaning slightly less work for wc.
You could also use ls -f which is more or less a shortcut for ls -1aU.
I don’t know if there is a resource-efficient way to do it via a command without piping though.
Method 4
Another point of comparison. While not a shell one-liner, this C program doesn’t do anything superfluous. Note that hidden files are ignored to match the output of ls|wc -l (ls -l|wc -l is off by one due to the total blocks in the first line of output).
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <error.h>
#include <errno.h>

int main(int argc, char *argv[])
{
    int file_count = 0;
    DIR *dirp;
    struct dirent *entry;

    if (argc < 2)
        error(EXIT_FAILURE, 0, "missing argument");

    if (!(dirp = opendir(argv[1])))
        error(EXIT_FAILURE, errno, "could not open '%s'", argv[1]);

    while ((entry = readdir(dirp)) != NULL) {
        if (entry->d_name[0] == '.') { /* ignore hidden files */
            continue;
        }
        file_count++;
    }
    closedir(dirp);

    printf("%d\n", file_count);
    return 0;
}
Method 5
You could try perl -e 'opendir($dh,".");$i=0;while(readdir $dh){$i++};print "$i\n";'
It’d be interesting to compare timings with your shell pipe.
Method 6
From this answer, I can think of this one as a possible solution.
/*
 * List directories using getdents() because ls, find and Python libraries
 * use readdir() which is slower (but uses getdents() underneath).
 *
 * Compile with
 * ]$ gcc getdents.c -o getdents
 */
#define _GNU_SOURCE
#include <dirent.h>     /* Defines DT_* constants */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/syscall.h>

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

struct linux_dirent {
    long           d_ino;
    off_t          d_off;
    unsigned short d_reclen;
    char           d_name[];
};

#define BUF_SIZE (1024*1024*5)

int
main(int argc, char *argv[])
{
    int fd, nread;
    char buf[BUF_SIZE];
    struct linux_dirent *d;
    int bpos;
    char d_type;

    fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
    if (fd == -1)
        handle_error("open");

    for ( ; ; ) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1)
            handle_error("getdents");

        if (nread == 0)
            break;

        for (bpos = 0; bpos < nread;) {
            d = (struct linux_dirent *) (buf + bpos);
            /* the entry type is stored in the last byte of each record */
            d_type = *(buf + bpos + d->d_reclen - 1);
            if (d->d_ino != 0 && d_type == DT_REG) {
                printf("%s\n", (char *)d->d_name);
            }
            bpos += d->d_reclen;
        }
    }

    exit(EXIT_SUCCESS);
}
Copy the C program above into the directory in which the files need to be listed, then run these commands:
gcc getdents.c -o getdents
./getdents | wc -l
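For comparison, the same filter (count only entries whose type is a regular file) can be sketched in Python 3 with `os.scandir`. The function name is hypothetical. According to the Python documentation, `DirEntry.is_file()` can usually answer from the directory entry's type field without an extra system call, though that depends on the platform and filesystem.

```python
import os

def count_regular_files(path="."):
    """Count entries that are regular files, similar to the d_type == DT_REG test."""
    with os.scandir(path) as it:
        # follow_symlinks=False: a symlink pointing at a file is not counted
        return sum(1 for entry in it if entry.is_file(follow_symlinks=False))
```

This returns the count directly, so no pipe to wc -l is needed.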
Method 7
A bash-only solution that doesn’t require any external program, though I don’t know how efficient it is:
list=(*)
echo "${#list[@]}"
Method 8
os.listdir() in Python can do the work for you. It returns a list of the contents of the directory, excluding the special ‘.’ and ‘..’ entries. There is also no need to worry about files with special characters such as ‘\n’ in the name.
python -c 'import os;print len(os.listdir("."))'
The following is the time taken by the above Python command compared with the ‘ls -Af’ command.
~/test$ time ls -Af |wc -l
399144
real 0m0.300s
user 0m0.104s
sys 0m0.240s
~/test$ time python -c 'import os;print len(os.listdir("."))'
399142
real 0m0.249s
user 0m0.064s
sys 0m0.180s
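To repeat this kind of comparison from inside Python rather than with the shell's time, a minimal timing helper might look like the sketch below. The helper name `timed` is hypothetical and the sketch assumes Python 3; keep in mind that a first run reads the directory from disk while later runs are likely served from the cache, so compare like with like.

```python
import os
import time

def timed(fn, *args):
    """Return (result, elapsed_seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    names, secs = timed(os.listdir, ".")
    print(len(names), "entries in", round(secs, 3), "s")
```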
Method 9
Probably the most resource efficient way would involve no outside process invocations. So I’d wager on…
cglb() ( c=0 ; set --
    tglb() { [ -e "$2" ] || [ -L "$2" ] &&
        c=$(($c+$#-1))
    }
    for glb in '.?*' \*
    do  tglb $1 ${glb##.*} ${glb#\*}
        set -- ..
    done
    echo $c
)
Method 10
After fixing the issue in @Joel’s answer, where it counted . as a file:
find /foo/foo2 -maxdepth 1 | tail -n +2 | wc -l
tail simply removes the first line, meaning that . isn’t counted anymore.
Method 11
ls -1 | wc -l comes immediately to my mind. Whether ls -1U is faster than ls -1 is purely academic: the difference should be negligible except for very large directories.
Method 12
To exclude subdirectories from the count, here’s a variation on the accepted answer from Gilles:
echo $(( $( ls -afq target | wc -l ) - $( ls -od target | cut -f2 -d' ') ))
The outer $(( )) arithmetic expansion subtracts the output of the second $( ) subshell from the first $( ). The first $( ) is exactly Gilles’ from above. The second $( ) outputs the count of directories “linking” to the target. This comes from ls -od (substitute ls -ld if desired), where the column that lists the count of hard links has that as a special meaning for directories. The “link” count includes ., .., and any subdirectories.
I didn’t test performance, but it would seem to be similar. It adds a stat of the target directory, and some overhead for the added subshell and pipe.
Method 13
A bit late answer (after 6 years), but…
The fastest way is to run ls -l on the parent directory and check the link-count column for the given subdirectory.
Demo: say I want to count the number of files and directories in my /usr/lib directory.
So, entering ls -l /usr produces:
total 0
drwxr-xr-x 978 root wheel 31296 29 apr 2019 bin
drwxr-xr-x 267 root wheel 8544 30 okt 2018 include
drwxr-xr-x 312 root wheel 9984 23 jan 2019 lib
drwxr-xr-x 240 root wheel 7680 29 apr 2019 libexec
drwxr-xr-x 17 root wheel 544 14 nov 2018 local
drwxr-xr-x 248 root wheel 7936 23 jan 2019 sbin
drwxr-xr-x 47 root wheel 1504 4 okt 2018 share
drwxr-xr-x 5 root wheel 160 25 okt 2017 standalone
The number just after the permissions is the link count. On the system in this example, a directory's link count equals the number of entries inside it, so /usr/lib has 312 entries. This is filesystem-dependent: on many Linux filesystems, such as ext3 and ext4, a directory's link count is 2 plus the number of its subdirectories, so regular files are not reflected in it.
Let's verify:
$ ls -1a /usr/lib | wc -l
312
To avoid listing the other directories in the parent, use -d, e.g.
$ ls -ld /usr/lib
drwxr-xr-x 312 root wheel 9984 23 jan 2019 /usr/lib
#          ^^^ - the number of entries in /usr/lib (including . and ..)
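Whether the link count tracks all entries, as in this demo, or only ., .. and subdirectories, as Method 12 describes, depends on the filesystem. A small Python 3 sketch can report all three numbers so you can check which rule holds on your system. The function name is hypothetical and no particular filesystem behaviour is assumed.

```python
import os

def link_count_report(path="."):
    """Return (st_nlink, entries including . and .., number of subdirectories)."""
    nlink = os.stat(path).st_nlink
    entries = 2   # . and .., which os.scandir does not return
    subdirs = 0
    with os.scandir(path) as it:
        for entry in it:
            entries += 1
            if entry.is_dir(follow_symlinks=False):
                subdirs += 1
    return nlink, entries, subdirs
```

If the first number equals the second, the link-count shortcut works on that filesystem; if it equals 2 plus the third, it only counts subdirectories.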
Method 14
I would think echo * would be more efficient than any ‘ls’ command:
echo * | wc -w
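One caveat with this approach is that wc -w counts whitespace-separated words rather than names, so any file name containing a space is counted more than once (and * also skips dot files). A short Python 3 sketch of the word-counting effect, with a hypothetical function name:

```python
def count_words(names):
    """What echo * | wc -w measures: whitespace-separated words, not names."""
    return len(" ".join(names).split())

if __name__ == "__main__":
    names = ["a.txt", "my file.txt"]
    print(len(names), count_words(names))
```

For the two names in the demo, the word count is 3 even though there are only 2 files.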
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.