A lot of command-line utilities can take their input either from a pipe or as a filename argument. For long shell scripts, I find starting the chain off with a cat makes it more readable, especially if the first command would need multi-line arguments.
Compare
sed s/bla/blaha/ data | grep blah | grep -n babla
and
cat data | sed s/bla/blaha/ | grep blah | grep -n babla
Is the latter method less efficient? If so, is the difference enough to care about if the script is run, say, once a second? The difference in readability is not huge.
Answers:
Method 1
The “definitive” answer is of course brought to you by The Useless Use of cat Award.
The purpose of cat is to concatenate (or “catenate”) files. If it’s only one file, concatenating it with nothing at all is a waste of time, and costs you a process.
Instantiating cat just so your code reads differently makes for one more process and one more set of input/output streams that are not needed. Typically the real hold-up in your scripts is going to be inefficient loops and actual processing. On most modern systems, one extra cat is not going to kill your performance, but there is almost always another way to write your code.
Most programs, as you note, are able to accept an argument for the input file. However, there is always the shell’s input redirection operator <, which can be used wherever a STDIN stream is expected. It saves you one process because the file is opened by the shell process that is already running.
You can even get creative with WHERE you write it. Normally it would be placed at the end of a command before you specify any output redirects or pipes like this:
sed s/blah/blaha/ < data | pipe
But it doesn’t have to be that way. It can even come first. For instance your example code could be written like this:
< data \
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
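As a quick sanity check, here is a minimal sketch showing that the placement of the redirection does not change the result. The sample lines and the temporary file from mktemp are made up for illustration; only the pipeline itself comes from the question.

```shell
# Hypothetical sample data; the temporary file stands in for "data".
data=$(mktemp)
printf '%s\n' 'bla one' 'nothing here' 'babla two' > "$data"

# The same pipeline fed three ways: via cat, with the redirection
# after the first command, and with the redirection written first.
with_cat=$(cat "$data" | sed s/bla/blaha/ | grep blah | grep -n babla)
redirect_last=$(sed s/bla/blaha/ < "$data" | grep blah | grep -n babla)
redirect_first=$(< "$data" sed s/bla/blaha/ | grep blah | grep -n babla)

rm -f "$data"

# All three variables now hold the same text: 2:bablaha two
printf '%s\n' "$with_cat" "$redirect_last" "$redirect_first"
```

Only the first form starts an extra cat process; the other two let the shell open the file itself.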
If script readability is your concern and your code is messy enough that adding a line for cat is expected to make it easier to follow, there are other ways to clean up your code. One that I use a lot, and that helps make scripts easy to figure out later, is breaking up pipes into logical sets and saving them in functions. The script code then becomes very natural, and any one part of the pipeline is easier to debug.
function fix_blahs () {
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
}
fix_blahs < data
You could then continue with fix_blahs < data | fix_frogs | reorder | format_for_sql. A pipeline that reads like that is really easy to follow, and the individual components can be debugged easily in their respective functions.
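A rough sketch of what such a chain of functions might look like follows. Only fix_blahs comes from the answer; the bodies of fix_frogs, reorder and format_for_sql are invented placeholders (the answer only names them), and the sample data is made up.

```shell
# fix_blahs is the function from the answer above.
fix_blahs() {
    sed s/bla/blaha/ |
    grep blah |
    grep -n babla
}

# Placeholder body (assumption): correct one misspelling.
fix_frogs() {
    sed 's/forg/frog/g'
}

# Placeholder body (assumption): plain lexical sort.
reorder() {
    sort
}

# Placeholder body (assumption): double any embedded single quotes,
# then wrap each line in single quotes.
format_for_sql() {
    sed "s/'/''/g; s/.*/'&'/"
}

# Hypothetical sample data in a temporary file.
data=$(mktemp)
printf '%s\n' 'babla forg' 'bla' 'babla ant' > "$data"

result=$(fix_blahs < "$data" | fix_frogs | reorder | format_for_sql)
rm -f "$data"

printf '%s\n' "$result"
```

Each stage can also be exercised on its own, for example with printf 'forg\n' | fix_frogs, which is what makes this layout easy to debug.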
Method 2
Here’s a summary of some of the drawbacks of:
cat $file | cmd
over
< $file cmd
- First, a note: there are (intentionally for the purpose of the discussion) missing double quotes around $file above. In the case of cat, that’s always a problem except for zsh; in the case of the redirection, that’s only a problem for bash or ksh88 and, for some other shells (including bash in POSIX mode) only when interactive (not in scripts).
- The most often cited drawback is the extra process being spawned. Note that if cmd is builtin, that’s even 2 processes in some shells like bash.
- Still on the performance front, except in shells where cat is builtin, that’s also an extra command being executed (and of course loaded, and initialised (and the libraries it’s linked to as well)).
- Still on the performance front, for large files, that means the system will have to alternately schedule the cat and cmd processes and constantly fill up and empty the pipe buffer. Even if cmd does 1GB large read() system calls at a time, control will have to go back and forth between cat and cmd because a pipe can’t hold more than a few kilobytes of data at a time.
- Some cmds (like wc -c) can do some optimisations when their stdin is a regular file which they can’t do with cat | cmd as their stdin is just a pipe then. With cat and a pipe, it also means they cannot seek() within the file. For commands like tac or tail, that makes a huge difference in performance as that means that with cat they need to store the whole input in memory.
- The cat $file, and even its more correct version cat -- "$file" won’t work properly for some specific file names like - (or --help or anything starting with - if you forget the --). If one insists on using cat, he should probably use cat < "$file" | cmd instead for reliability.
- If $file cannot be opened for reading (access denied, doesn’t exist…), < "$file" cmd will report a consistent error message (by the shell) and not run cmd, while cat $file | cmd will still run cmd but with its stdin looking like it’s an empty file. That also means that in things like < file cmd > file2, file2 is not clobbered if file can’t be opened.
  Or in other words you can choose the order in which the input and output files are opened as opposed to cmd file > file2 where the output file is always opened (by the shell) before the input file (by cmd), which is hardly ever preferable.
  Note however that it won’t help in cmd1 < file | cmd2 > file2 where cmd1 and cmd2 and their redirections are performed concurrently and independently and which you’d need to write as { cmd1 | cmd2; } < file > file2 or (cmd1 | cmd2 > file2) < file for instance to avoid file2 being clobbered and cmd1 and cmd2 being run if file can’t be opened.
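The last point, about what happens when the input file cannot be opened, can be observed with a small experiment. This is only a sketch: the path in missing is assumed not to exist, and marker and run_consumer are hypothetical names for a scratch file and a stand-in for cmd.

```shell
# Assumption: this path does not exist on the machine running the sketch.
missing=/nonexistent/input.$$
# Scratch file that records whether the stand-in command was started.
marker=$(mktemp)

# Stand-in for cmd: note that it ran, then drain its stdin.
run_consumer='echo ran >> "$1"; cat > /dev/null'

# Redirection: the shell fails to open the file and never starts cmd.
# The braces only serve to silence the shell's own error message.
: > "$marker"
{ < "$missing" sh -c "$run_consumer" sh "$marker"; } 2> /dev/null || true
redirect_runs=$(grep -c ran "$marker" || true)

# cat: cat fails, yet cmd is still started and sees what looks like
# an empty input.
: > "$marker"
cat "$missing" 2> /dev/null | sh -c "$run_consumer" sh "$marker"
cat_runs=$(grep -c ran "$marker" || true)

rm -f "$marker"

echo "runs with redirection: $redirect_runs, runs with cat: $cat_runs"
```

With the redirection the stand-in is started zero times; with cat it is started once even though there was nothing to read.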
Method 3
Putting <file on the end of a pipeline is less readable than having cat file at the start. Natural English reads from left to right.
Putting <file at the start of the pipeline is also less readable than cat, I would say. A word is more readable than a symbol, especially a symbol which seems to point the wrong way.
Using cat preserves the command | command | command format.
Method 4
One thing that the other answers here don’t seem to have directly addressed is that using cat like this isn’t “useless” in the sense that “an extraneous cat process is spawned that does no work”; it’s useless in the sense that “a cat process is spawned that does only unnecessary work”.
In the case of these two:
sed 's/foo/bar/' somefile
<somefile sed 's/foo/bar/'
the shell starts a sed process that reads from somefile or stdin (respectively) and then does some processing – it reads up until it hits a newline, replaces the first ‘foo’ (if any) on that line with ‘bar’, then prints that line to stdout and loops.
In the case of:
cat somefile | sed 's/foo/bar/'
The shell spawns a cat process and a sed process, and wires cat’s stdout to sed’s stdin. The cat process reads a chunk of several kilobytes or maybe megabytes out of the file, then writes that out to its stdout, where the sed command picks up from there as in the second example above. While sed is processing that chunk, cat is reading another chunk and writing it to its stdout for sed to work on next.
In other words, the extra work necessitated by adding the cat command isn’t just the extra work of spawning an extra cat process, it’s also the extra work of reading and writing the bytes of the file twice instead of once. Now, practically speaking and on modern systems, that doesn’t make a huge difference – it may make your system do a few microseconds of unnecessary work. But if it’s for a script that you plan on distributing, potentially to people using it on machines that are already underpowered, a few microseconds can add up over a lot of iterations.
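The difference in wiring can be made visible by asking a command what its stdin actually is. The following is a small sketch that assumes the system provides /dev/stdin (Linux, the BSDs and macOS do); stdin_kind and the temporary file are hypothetical names used only here.

```shell
# Hypothetical sample file.
data=$(mktemp)
printf 'foo\n' > "$data"

# Small script that reports what kind of object its own stdin is.
# Assumption: /dev/stdin exists and refers to file descriptor 0.
stdin_kind='if [ -p /dev/stdin ]; then echo pipe
elif [ -f /dev/stdin ]; then echo file
else echo other; fi'

# With a redirection the command reads the regular file directly.
via_redirect=$(sh -c "$stdin_kind" < "$data")

# With cat in front, the command only ever sees a pipe.
via_cat=$(cat "$data" | sh -c "$stdin_kind")

rm -f "$data"

echo "redirect: $via_redirect, cat: $via_cat"
```

In the redirected case the command holds the file itself and could seek in it; in the cat case every byte has to be copied through the pipe first.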
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.