I have a directory with about 100000 small text files (each 1 to 3 lines long). The directory isn’t very big in total size (< 2GB). This data lives on a professionally administered NFS server. The server runs Linux. I think the filesystem is ext3, but I don’t know for sure. Also, I don’t have root access to the server.
These files are the output of a large scale scientific experiment, over which I don’t have control. However, I have to analyze the results.
Any I/O operation or processing in this directory is very, very slow. Opening a file (open in Python), reading from an open file, and closing a file are all very slow. In bash, ls, du, etc. don’t work.
The question is:
What is the maximum number of files a directory on Linux can hold while still being practical to process (open, read, etc.)? I understand that the answer depends on many things: fs type, kernel version, server version, hardware, etc. I just want a rule of thumb, if possible.
As you surmise, it does depend on many things, mostly the filesystem type and options and to some extent the kernel version. In the ext2/ext3/ext4 series, there was a major improvement when the dir_index option appeared (some time after the initial release of ext3): it makes directories be stored as search trees (logarithmic time access) rather than linear lists (linear time access). This isn’t something you can see over NFS, but if you have some contact with the admins you can ask them to run tune2fs -l /dev/something | grep features (perhaps even convince them to upgrade?). Only the number of files matters, not their size.
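Since only the number of entries matters, it can help to measure that number before choosing a layout. The following is a minimal Python sketch, not a definitive tool: the function name `count_entries` is made up for illustration, and it counts every directory entry (files and subdirectories alike) by streaming them, without sorting the names or calling stat on each one. On a slow NFS mount it may still take a while.

```python
import os


def count_entries(path):
    """Count the entries in `path` by streaming them one at a time.

    Nothing is sorted and no per-entry stat() call is made; subdirectories
    are counted just like regular files.
    """
    count = 0
    with os.scandir(path) as entries:
        for _ in entries:
            count += 1
    return count
```

For example, `count_entries("/path/to/results")` (a placeholder path) returns an integer that can be compared against the per-directory limits discussed in this answer.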
Even with dir_index, 100000 feels large. Ideally, get the authors of the program that creates the files to add a level of subdirectories. For no performance degradation, I would recommend a limit of about 1000 files per directory for ext2 or ext3 without dir_index and 20000 with dir_index or reiserfs. If you can’t control how the files are created, move them into separate directories before doing anything else.
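One way to move the files into separate directories is sketched below in Python. It rests on assumptions that are not part of the question: the function name `shard_directory`, the choice of 256 buckets, and the use of a CRC32 of the file name to pick a bucket are all illustrative. It also assumes the source directory contains only files, that the destination lies outside the source, and that both are on the same filesystem so `os.rename` can move entries without copying. With about 100000 files, 256 buckets gives roughly 400 files per subdirectory, under the 1000-file rule of thumb.

```python
import os
import zlib


def shard_directory(src, dst, buckets=256):
    """Move every entry of `src` into `dst/<bucket>/` and return the count moved.

    The bucket is derived from a CRC32 of the file name, so the same name
    always maps to the same subdirectory (an illustrative choice).
    """
    # Collect the names first so the directory is not modified while it is
    # being iterated.
    with os.scandir(src) as entries:
        names = [entry.name for entry in entries]

    created = set()
    moved = 0
    for name in names:
        bucket = zlib.crc32(os.fsencode(name)) % buckets
        bucket_dir = os.path.join(dst, "%03d" % bucket)
        if bucket_dir not in created:
            os.makedirs(bucket_dir, exist_ok=True)
            created.add(bucket_dir)
        # os.rename only works within a single filesystem.
        os.rename(os.path.join(src, name), os.path.join(bucket_dir, name))
        moved += 1
    return moved
```

Because the bucket depends only on the name, an analysis script can later locate a given file by recomputing the same CRC32 instead of searching every subdirectory.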