A related question is here.
I often have to edit a large file by removing a few lines from the middle of it. I know which lines I wish to remove and I typically do the following:
sed "linenum1,linenum2 d" input.txt > input.temp
or in-line by adding the -i option. Since I know the line numbers, is there a command to avoid stream-editing and just remove the particular lines? input.txt can be as large as 50 GB.
Answers:
Method 1
What you could do to avoid writing a copy of the file is to write the file over itself like:
{
sed "$l1,$l2 d" < file
perl -le 'truncate STDOUT, tell STDOUT'
} 1<> file
Dangerous as you’ve no backup copy there.
Or avoiding sed, stealing part of manatwork’s idea:
{
head -n "$(($l1 - 1))"
head -n "$(($l2 - $l1 + 1))" > /dev/null
cat
perl -le 'truncate STDOUT, tell STDOUT'
} < file 1<> file
That could still be improved, because you’re overwriting the first $l1 - 1 lines with themselves when you don’t need to. Avoiding that takes slightly more involved programming, for instance doing everything in perl, which may end up less efficient:
perl -ne 'BEGIN{($l1,$l2) = ($ENV{"l1"}, $ENV{"l2"})}
if ($. == $l1) {$s = tell(STDIN) - length; next}
if ($. == $l2) {seek STDOUT, $s, 0; $/ = 32768; next}
if ($. > $l2) {print}
END {truncate STDOUT, tell STDOUT}' < file 1<> file
Some timings for removing lines 1000000 to 1000050 from the output of seq 1e7:
- sed -i "$l1,$l2 d" file: 16.2s
- 1st solution: 1.25s
- 2nd solution: 0.057s
- 3rd solution: 0.48s
They all work on the same principle: we open two file descriptors to the file, one in read-only mode (0) using < file short for 0< file and one in read-write mode (1) using 1<> file (<> file would be 0<> file). Those file descriptors point to two open file descriptions that will have each a current cursor position within the file associated with them.
In the second solution for instance, the first head -n "$(($l1 - 1))" will read $l1 - 1 lines worth of data from fd 0 and write that data to fd 1. So at the end of that command, the cursor on both open file descriptions associated with fds 0 and 1 will be at the start of the $l1th line.
Then, in head -n "$(($l2 - $l1 + 1))" > /dev/null, head will read $l2 - $l1 + 1 lines from the same open file description through its fd 0 which is still associated to it, so the cursor on fd 0 will move to the beginning of the line after the $l2 one.
But its fd 1 has been redirected to /dev/null, so upon writing to fd 1, it will not move the cursor in the open file description pointed to by {...}’s fd 1.
So, upon starting cat, the cursor on the open file description pointed to by fd 0 will be at the start of the next line after $l2, while the cursor on fd 1 will still be at the beginning of the $l1th line. Or said otherwise, that second head will have skipped those lines to remove on input but not on output. Now cat will overwrite the $l1th line with the next line after $l2 and so on.
cat will return when it reaches the end of file on fd 0. But fd 1 will point to somewhere in the file that has not been overwritten yet. That part has to go away, it corresponds to the space occupied by the deleted lines now shifted to the end of the file. What we need is to truncate the file at the exact location where that fd 1 points to now.
That’s done with the ftruncate system call. Unfortunately, there’s no standard Unix utility to do that, so we resort to perl. tell STDOUT gives us the current cursor position associated with fd 1. And we truncate the file at that offset using perl’s interface to the ftruncate system call: truncate.
In the third solution, we replace the writing to fd 1 of the first head command with one lseek system call.
Method 2
Using sed is a good approach: It is clear, it streams the file (no problem with long files), and can easily be generalized to do more. But if you want a simple way to edit the file in-place, the easiest thing is to use ed or ex:
(echo 10,31d; echo wq) | ed input.txt
A better approach, guaranteed to work with files of unlimited size (and for lines as long as your RAM allows) is the following perl one-liner which edits the file in place:
perl -n -i -e 'print if $. < 10 || $. > 31' input.txt
Explanation:
-n: Apply the script to each line. Produce no other output.
-i: Edit the file in-place (use -i.bck to make a backup).
-e ...: Print each line, except lines 10 to 31.
Method 3
If you need to read and write 50GiB, that will take a long time, regardless of what you do. And unless the lines are of fixed length, or you have some other way to know where the lines to be deleted are, there is no way around reading the file up to the last line to be deleted. Maybe a custom program that just counts newlines and later copies full blocks is a bit faster than sed(1), but I believe that is not your bottleneck. Try using time(1) to find out how the time is apportioned.
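To illustrate the fixed-length case mentioned above: when every line has the same byte length, the offsets follow from arithmetic and nothing has to be scanned for newlines. This is only a sketch under that assumption. The names reclen, src and dst are made up for the demo, and head -c is widely available but not required by POSIX.

```shell
src=$(mktemp) dst=$(mktemp)
printf '%07d\n' 1 2 3 4 5 6 > "$src"        # six lines of exactly 8 bytes each
reclen=8 l1=2 l2=4
{
  head -c "$(( (l1 - 1) * reclen ))" "$src"   # bytes before line l1
  tail -c "+$(( l2 * reclen + 1 ))" "$src"    # bytes from line l2+1 onward
} > "$dst"
fixed=$(tr '\n' ' ' < "$dst")
echo "$fixed"
rm -f "$src" "$dst"
```

This still copies everything that is kept, so the total I/O is the same as with sed. Only the newline counting goes away.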
Method 4
You can use Vim in Ex mode:
ex -sc '1d2|x' input.txt
- 1 move to first line
- 2 select 2 lines
- d delete
- x save and close
Method 5
In the special case that the content of the lines which should be deleted is unique in the file, another option might be using grep -v and the content of the line rather than the line numbers. This applies, for instance, if only one unique line should be deleted (the deletion of a single line was for instance asked in this duplicate thread), or many lines which all have the same unique content.
Here is an example
grep -v "content of lines to delete" input.txt > input.tmp
I have made some benchmarks with a file which contains 340 000 lines.
The way with grep seems to be around 15 times faster than the sed method in this case.
Here are the commands and the timings:
time sed -i "/CDGA_00004.pdbqt.gz.tar/d" /tmp/input.txt
real 0m0.711s
user 0m0.179s
sys 0m0.530s

time perl -ni -e 'print unless /CDGA_00004.pdbqt.gz.tar/' /tmp/input.txt
real 0m0.105s
user 0m0.088s
sys 0m0.016s

time (grep -v CDGA_00004.pdbqt.gz.tar /tmp/input.txt > /tmp/input.tmp; mv /tmp/input.tmp /tmp/input.txt )
real 0m0.046s
user 0m0.014s
sys 0m0.019s
I have tried both with and without the setting LC_ALL=C, it does not change the timings. The search string (CDGA_00004.pdbqt.gz.tar) is somewhere in the middle of the file.
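One detail worth a sketch: in the commands above the dots in the search string are regular-expression wildcards. Assuming the intent is to match the text literally, grep -F avoids accidental matches. The three sample lines below are invented for the demo.

```shell
tmp=$(mktemp)
printf '%s\n' 'keep a.b' \
              'drop CDGA_00004.pdbqt.gz.tar' \
              'keep CDGA_00004xpdbqt.gz.tar' > "$tmp"
# -F matches the string literally; without it, each "." matches any character
# and the third line would be dropped as well.
grep -vF 'CDGA_00004.pdbqt.gz.tar' "$tmp" > "$tmp.new" && mv "$tmp.new" "$tmp"
remaining=$(tr '\n' '|' < "$tmp")
echo "$remaining"
rm -f "$tmp"
```

The && before mv means the original is only replaced when grep exits successfully, that is, when at least one line was kept.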
Method 6
Would this help?
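perl -e '
$num1 = 5;
$num2 = 10000;
open IN, "<", "input_file.txt" or die "input_file.txt: $!";
open OUT, ">", "output_file.txt" or die "output_file.txt: $!";
# Copy the lines before $num1 (scalar forces one line per read).
print OUT scalar <IN> for (1 .. $num1 - 1);
# Read and discard lines $num1 through $num2.
scalar <IN> for ($num1 .. $num2);
# Copy the rest in 64 KiB blocks instead of slurping it into memory.
$/ = \65536;
print OUT $_ while <IN>;
close IN;
close OUT;
'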
This removes any lines between 5 and 10000 inclusive. Change the numbers to fit your needs. Can’t see an efficient way of doing it in situ, though (i.e. this approach will have to print to a different output file).
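For comparison, the same copy-to-a-new-file idea can be sketched with head and tail alone. The file names src and dst and the small line numbers are assumptions for the demo, not part of the answer above.

```shell
src=$(mktemp) dst=$(mktemp)
seq 12 > "$src"
num1=5 num2=8
{
  head -n "$((num1 - 1))" "$src"      # lines 1 .. num1-1
  tail -n "+$((num2 + 1))" "$src"     # lines num2+1 .. end
} > "$dst"
copied=$(tr '\n' ' ' < "$dst")
echo "$copied"
rm -f "$src" "$dst"
```

The trade-off is that tail has to read the file from the start again to find line num2+1, so the beginning of the file is read twice.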
Method 7
If you want to edit the file in place, most shell tools won’t help you because when you open a file for writing, you only have a choice of truncating it (>) or appending to it (>>), not overwriting existing contents. dd is a notable exception. See Is there a way to modify a file in-place?
export LC_ALL=C
lines_to_keep=$((linenum1 - 1))
lines_to_skip=$((linenum2 - linenum1 + 1))
deleted_bytes=$({ { head -n "$lines_to_keep"
head -n "$lines_to_skip" >&3;
cat
} <big_file | dd of=big_file conv=notrunc;
} 3>&1 | wc -c)
dd if=/dev/null of=big_file bs=1 seek="$(($(wc -c <big_file) - $deleted_bytes))"
(Warning: untested!)
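The two dd behaviours that the script above depends on can be checked separately on a scratch file. This is a small sketch; the file content and the offsets are made up for the demo.

```shell
tmp=$(mktemp)
printf 'abcdefgh\n' > "$tmp"
# Overwrite 2 bytes at offset 2 without truncating the rest of the file.
printf 'XY' | dd of="$tmp" conv=notrunc bs=1 seek=2 2>/dev/null
overwritten=$(cat "$tmp")
# Copying nothing to offset 4 (no conv=notrunc) cuts the file to 4 bytes.
dd if=/dev/null of="$tmp" bs=1 seek=4 2>/dev/null
truncated=$(cat "$tmp")
echo "$overwritten $truncated"
rm -f "$tmp"
```

The first call is the in-place overwrite, and the second is the truncation trick used in the last line of the script.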
Method 8
This is nice and simple:
perl -i -n -e 'print unless $.==13' /path/to/your/file
to remove e.g. line 13 from /path/to/your/file
Method 9
You could add a quit (q) instruction to your sed command so that sed stops processing the file when linenum2 is reached.
sed 'linenum1,linenum2d;linenum2q' file
Method 10
Note that this is a reply to a different question that was marked a duplicate.
The question was how to remove line 4125889 from in.csv.
You can either do things unsafely, in which case you may be fast but may lose the whole file, or you depend on the speed of the editor you are using.
I recommend:
echo '\013\003y' | VED_FTMPFIR=. ved +4125878 in.csv
where you need 3x the file size and end with in.csv and in.csv.bak
or:
echo '\013\003!' | VED_FTMPFIR=. ved +4125878 in.csv
where you need 2x the file size and the resulting file will be written in place.
where you need 2x the file size and the resulting file will be written in place.
Note that you need a POSIX compliant shell (echo) implementation to get the escapes properly expanded. The editor ved is part of the schily tools and available at:
http://sourceforge.net/projects/schilytools/files/
in schily-*.tar.bz2
It uses the fastest swap file mechanism I am aware of.
The VED_FTMPFIR=. environment setting puts the swap file in the current directory. Select any directory that holds sufficient space.
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.