Non-line-oriented tool for string replacement?

I recently asked a question about how to remove the newline character if it occurs after another specific character.

Unix text-processing tools are very powerful, but almost all of them deal with lines of text, which is fine most of the time when the input fits in the available memory.

But what should I do if I wish to replace a text sequence in a huge file that doesn’t contain any newlines?

For instance, how can I replace <foobar> with \n<foobar> without reading the input line by line, given that there is only one line and it is 2.5G characters long?

Answers:

Method 1

The first thing that occurs to me when facing this type of problem is to change the record separator. In most tools, this is set to \n by default, but that can be changed. For example:

  1. Perl
    perl -0x3E -pe 's/<foobar>/\n$&/' file

    Explanation

    • -0 : this sets the input record separator to a character given its hexadecimal value. In this case, I am setting it to > whose hex value is 3E. The general format is -0xHEX_VALUE. This is just a trick to break the line into manageable chunks.
    • -pe : print each input line after applying the script given by -e.
    • s/<foobar>/\n$&/ : a simple substitution. The $& is whatever was matched, in this case <foobar>.
  2. awk
    awk '{gsub(/foobar>/,"\n<foobar>");printf "%s",$0};' RS="<" file

    Explanation

    • RS="<" : set the input record separator to <.
    • gsub(/foobar>/,"\n<foobar>") : substitute all cases of foobar> with \n<foobar>. Note that because RS has been set to <, all < are removed from the input file (that’s how awk works) so we need to match foobar> (without a <) and replace with \n<foobar>.
    • printf "%s",$0 : print the current “line” after the substitution. $0 is the current record in awk so it will hold whatever was before the <.

I tested these on a 2.3 GB, single-line file created with these commands:

for i in {1..900000}; do printf "blah blah <foobar>blah blah"; done > file
for i in {1..100}; do cat file >> file1; done
mv file1 file

Both the awk and the perl used negligible amounts of memory.
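
The record-separator trick is not specific to perl or awk. As a rough sketch, the same idea could be written in Python along these lines. The function names iter_records and replace_by_record and the chunk size are my own illustrative choices, not part of any tool mentioned above, and memory use stays small only as long as the separator occurs reasonably often in the input:

```python
import io


def iter_records(stream, sep, chunk_size=1 << 16):
    """Yield records from a binary stream; each ends with sep, except possibly the last."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(sep)
        buf = parts.pop()  # incomplete tail, kept until more data arrives
        for part in parts:
            yield part + sep
    if buf:
        yield buf


def replace_by_record(src, dst, sep, old, new):
    """Apply a plain byte-string replacement to each record in turn."""
    for record in iter_records(src, sep):
        dst.write(record.replace(old, new))


if __name__ == "__main__":
    # Small in-memory stand-in for the huge single-line file.
    src = io.BytesIO(b"blah blah <foobar>blah blah" * 3)
    dst = io.BytesIO()
    replace_by_record(src, dst, b">", b"<foobar>", b"\n<foobar>")
    print(dst.getvalue())
```

As with the perl version, this works because the search string ends with the separator, so a match can never be split across two records.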

Method 2

gsar (general search and replace) is a very useful tool for exactly this purpose.

Most answers to this question use record-based tools and various tricks to make them adapt to the problem, such as switching the default record separator character to something assumed to be occurring frequently enough in the input not to make each record too large to handle.

In many cases this is perfectly fine and even readable. I do like problems that can be solved easily and efficiently with tools that are available everywhere, such as awk, tr, sed and the Bourne shell.

Performing a binary search and replace in an arbitrary huge file with random contents is not a good fit for these standard Unix tools.

Some of you may think this is cheating, but I don’t see how using the right tool for the job can be wrong. In this case it is a C program called gsar that is licensed under GPL v2, so it surprises me quite a bit that there is no package for this very useful tool in Gentoo, Red Hat, or Ubuntu.

gsar uses a binary variant of the Boyer-Moore string search algorithm.

Usage is straightforward:

gsar -F '-s<foobar>' '-r:x0A<foobar>'

where -F means “filter” mode, i.e. read stdin and write to stdout. There are methods to operate on files as well. -s specifies the search string and -r the replacement. The colon notation can be used to specify arbitrary byte values.

Case-insensitive mode is supported (-i), but there is no support for regular expressions, since the algorithm uses the length of the search string to optimize the search.
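
To illustrate what a record-free, fixed-string filter has to do, here is a minimal Python sketch. It is not gsar's code and does not use Boyer-Moore; it relies on Python's built-in bytes.find, and the name stream_replace and the chunk size are hypothetical. The key point is that only the last len(old) - 1 bytes of each chunk are held back, in case a match straddles a chunk boundary:

```python
import io


def stream_replace(src, dst, old, new, chunk_size=1 << 16):
    """Copy src to dst, replacing each occurrence of the byte string old with new."""
    if not old:
        raise ValueError("search string must not be empty")
    keep = len(old) - 1  # longest partial match that can sit at a chunk edge
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        start = 0
        while True:
            pos = buf.find(old, start)
            if pos == -1:
                break
            dst.write(buf[start:pos])
            dst.write(new)
            start = pos + len(old)
        # Everything before `safe` can no longer be part of a match.
        safe = max(start, len(buf) - keep)
        dst.write(buf[start:safe])
        buf = buf[safe:]
    dst.write(buf)  # leftover tail that never completed a match


if __name__ == "__main__":
    out = io.BytesIO()
    stream_replace(io.BytesIO(b"xx<foobar>yy"), out, b"<foobar>", b"\n<foobar>")
    print(out.getvalue())
```

Memory use is bounded by the chunk size plus the length of the search string, regardless of whether the input contains any newlines or other separators.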

The tool can also be used just for searching, a bit like grep. gsar -b outputs the byte offsets of the matched search string, and gsar -l prints filename and number of matches if any, a bit like combining grep -l with wc.
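
For the search-only case, a comparable result can be approximated in Python. This is only a sketch of the idea of reporting byte offsets: the function names are made up, the output format is not gsar's, and matches are counted without overlap, which is an assumption on my part:

```python
import mmap


def match_offsets(data, needle):
    """Yield the byte offset of each non-overlapping occurrence of needle in data."""
    pos = data.find(needle)
    while pos != -1:
        yield pos
        pos = data.find(needle, pos + len(needle))


def file_match_offsets(path, needle):
    """Same as match_offsets, but over a memory-mapped file (the file must not be empty)."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            yield from match_offsets(m, needle)
```

Counting the matches is then just sum(1 for _ in file_match_offsets(path, needle)).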

The tool was written by Tormod Tjaberg (initial) and Hans Peter Verne (improvements).

Method 3

In the narrow case where target and replacement strings are of the same length, memory mapping can come to the rescue. This is especially useful if the replacement needs to be performed in-place. You’re basically mapping a file into a process’s virtual memory, and the address space for 64-bit addressing is huge. Note that the file is not necessarily mapped into physical memory all at once, so files that are several times the size of the physical memory available on the machine can be dealt with.

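Here’s a Python example that replaces foobar with XXXXXX:

#!/usr/bin/env python3
import mmap

TARGET = b'foobar'
REPLACEMENT = b'XXXXXX'  # must be exactly as long as TARGET

# Open in binary read/write mode: mmap works on bytes, not str.
with open('test.file', 'r+b') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as m:
        pos = m.find(TARGET)
        # find() returns -1 when nothing is found; offset 0 is a valid match.
        while pos != -1:
            m[pos:pos + len(REPLACEMENT)] = REPLACEMENT
            pos = m.find(TARGET, pos + len(REPLACEMENT))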

Method 4

There are many tools for this:

dd is what you want to use if you want to block a file off, that is, to reliably read a fixed number of bytes a fixed number of times. It portably handles blocking and unblocking file streams:

tr -dc '[:graph:]' </dev/urandom |
dd bs=32 count=1 cbs=8 conv=unblock,sync 2>/dev/null

###OUTPUT###

UI(#Q5e
BKX2?A:Z
RAxGm:qv
t!;/v!)N
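
To make the unblocking step concrete, here is a small Python sketch of what conv=unblock with cbs=8 does to a block of input: cut it into fixed-size records, drop trailing spaces, and end each record with a newline. The function name unblock is my own, and the sketch ignores dd's other conversions such as sync:

```python
def unblock(block, cbs):
    """Split block into cbs-byte records, strip trailing spaces, newline-terminate each."""
    out = bytearray()
    for i in range(0, len(block), cbs):
        out += block[i:i + cbs].rstrip(b" ") + b"\n"
    return bytes(out)


if __name__ == "__main__":
    # 16 bytes in, two newline-terminated records out.
    print(unblock(b"UI(#Q5e BKX2?A:Z", 8))
```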

I also use tr above because it can handle converting any ASCII byte to any other (or, in this case, deleting any ASCII byte that isn’t a non-space printable character). It’s what I used in answer to your other question this morning, in fact, when I did:

tr '>\n' '\n>' | sed 's/^>*//' | tr '\n>' '>\n'

There are many similar. That list should provide a lowest common-denominator subset with which you might become familiar.

But if I were going to do text processing on a 2.5 GB binary file, I might start with od. It can give you an octal dump or any of several other formats. You can specify all kinds of options, but I’ll just do one byte per line in a C-escaped format.

The data you’ll get from od will be regular at whatever interval you specify – as I show below. But first – here’s an answer to your question:

printf 'first\nnewline\ttab spacefoobar\0null' |
od -A n -t c -v -w1 |
sed 's/^ \{1,3\}//;s/\\$/&&/;/ /bd
     /\\[0nt]/!{H;$!d};{:d
    x;s/\n//g}'

That little bit above delimits on newlines, nulls, tabs and <spaces> while preserving the C escaped string for the delimiter. Note the H and x functions used – every time sed encounters a delimiter it swaps out the contents of its memory buffers. In this way sed only retains as much information as it must to reliably delimit the file and does not succumb to buffer overruns – does not, that is, so long as it actually encounters its delimiters. For so long as it does, sed will continue to process its input and od will continue to provide it until it encounters EOF.
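
The same buffering idea, holding only the bytes seen since the last delimiter, can be sketched in Python. This is an illustration rather than a translation of the sed script: it works on raw bytes instead of od's escaped text, the names are hypothetical, and looping over single bytes in Python would be slow on a multi-gigabyte file:

```python
import io

DELIMITERS = b"\n\0\t "  # newline, NUL, tab, space


def split_before_delimiters(stream, chunk_size=1 << 16):
    """Yield pieces of a binary stream, starting a new piece at every delimiter byte."""
    piece = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for byte in chunk:
            if byte in DELIMITERS and piece:
                yield bytes(piece)  # flush what was collected so far
                piece.clear()
            piece.append(byte)  # the delimiter stays at the front of the next piece
    if piece:
        yield bytes(piece)


if __name__ == "__main__":
    data = io.BytesIO(b"first\nnewline\ttab spacefoobar\0null")
    for part in split_before_delimiters(data):
        print(part)
```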

As is, its output looks like this:

first
\nnewline
\ttab
 spacefoobar
\0null

So if I want foobar:

printf ... | od ... | sed ... |
sed 's/foobar/\
&\
/g'

###OUTPUT###

first
\nnewline
\ttab
 space
foobar

\0null

Now if you want to make use of the C escapes, it’s pretty easy, because sed has already doubled every single backslash in its input, so printf run from xargs will have no trouble producing the output to your specification. But xargs eats shell quotes, so you’ll need to quote the input a second time:

printf 'nl\ntab\tspace foobarfoobar\0null' |
PIPELINE |
sed 's/./\\&/g' |
xargs printf %b |
cat -A

###OUTPUT###

nl$
tab^Ispace $
foobar$
$
foobar$
^@null%

That could have as easily been saved to a shell variable and output later in identical fashion. The last sed inserts a backslash before every character in its input, and that’s all.

And here’s what it all looks like before sed ever gets hold of it:

printf 'nl\ntab\tspace foobarfoobar\0null' |
od -A n -t c -v -w1

   n
   l
  \n
   t
   a
   b
  \t
   s
   p
   a
   c
   e

   f
   o
   o
   b
   a
   r
   f
   o
   o
   b
   a
   r
  \0
   n
   u
   l
   l

Method 5

Awk operates on successive records. It can use any character as the record separator (except the null byte on many implementations). Some implementations support arbitrary regular expressions (not matching the empty string) as the record separator, but this can be unwieldy because the record separator is truncated from the end of each record before it is stowed into $0 (GNU awk sets the variable RT to the record separator that was stripped from the end of the current record). Note that print terminates its output with the output record separator ORS which is a newline by default and set independently from the input record separator RS.

awk -v RS=, 'NR==1 {printf "input up to the first comma: %s\n", $0}'
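
The relationship between a record and the separator that was stripped from it (what GNU awk exposes as RT) can be illustrated in Python with re.split and a capturing group. This is only a loose analogy: it works on an in-memory string rather than a stream, the function name is invented, and it assumes the separator pattern contains no capturing groups of its own:

```python
import re


def records_with_terminators(text, separator_pattern):
    """Return (record, terminator) pairs, loosely like $0 and GNU awk's RT."""
    # A capturing group makes re.split keep the separators in the result:
    # [record, sep, record, sep, ..., last record]
    parts = re.split("(" + separator_pattern + ")", text)
    records = parts[0::2]
    terminators = parts[1::2] + [""]  # the last record has no terminator
    return list(zip(records, terminators))


if __name__ == "__main__":
    for record, terminator in records_with_terminators("a,b;c", "[,;]"):
        print(repr(record), repr(terminator))
```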

You can effectively select a different character as the record separator for other tools (sort, sed, …) by swapping newlines with that character with tr.

tr '\n,' ',\n' |
sed 's/foo/bar/' |
sort |
tr '\n,' ',\n'
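
The swap-process-swap-back pattern can be sketched in Python with a byte translation table. The function name process_comma_records is hypothetical, and unlike the pipeline this version holds the whole input in memory, so it only illustrates the idea:

```python
# Table that exchanges newline and comma, like tr '\n,' ',\n'.
SWAP = bytes.maketrans(b"\n,", b",\n")


def process_comma_records(data):
    """Treat commas as record separators: replace the first foo per record, then sort."""
    swapped = data.translate(SWAP)  # commas become newlines and vice versa
    lines = swapped.split(b"\n")
    lines = sorted(line.replace(b"foo", b"bar", 1) for line in lines)
    return b"\n".join(lines).translate(SWAP)  # swap back


if __name__ == "__main__":
    print(process_comma_records(b"zfoo,afoo,m"))
```

Because the table simply exchanges the two bytes, applying it twice restores the original data, which is why the same tr command appears at both ends of the pipeline.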

Many GNU text utilities support using a null byte instead of a newline as the separator.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
