Finding text between two specific characters or strings

Say I have lines like this:

*[234]*
*[23]*
*[1453]*

where * represents any string (except a string of the form [number]). How can I parse these lines with a command line utility and extract the number between brackets?

More generally, which of these tools (cut, sed, grep or awk) would be appropriate for such a task?

Answers:

Method 1

If you have GNU grep, you can use its -o option to search for a regex and output only the matching part. (Other grep implementations can only show the whole line.) If there are several matches on one line, they are printed on separate lines.

grep -o '\[[0-9]*\]'
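As a quick illustration of the one-match-per-output-line behaviour, here is a minimal sketch run against a made-up input line. It assumes a grep that supports -o (GNU grep does), writes the brackets escaped so they match literally, and uses a hypothetical helper name, bracketed_numbers.

```shell
# Hypothetical helper: print every [digits] group found on stdin,
# one per output line. Assumes a grep with -o support (e.g. GNU grep).
bracketed_numbers() {
    grep -o '\[[0-9]*\]'
}

# Two matches on one input line come out as two output lines.
printf '%s\n' 'foo[234]bar[23]baz' | bracketed_numbers
```

With GNU grep this should print [234] and [23] on separate lines, brackets included.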

If you only want the digits and not the brackets, it’s a little harder; you need to use a zero-width assertion: a regexp that matches the empty string, but only if it is preceded, or followed as the case may be, by a bracket. Zero-width assertions are only available in Perl syntax.

grep -P -o '(?<=\[)[0-9]*(?=\])'

With sed, you need to turn off printing with -n, and match the whole line and retain only the matching part. If there are several possible matches on one line, only the last match is printed. See Extracting a regex matched with ‘sed’ without printing the surrounding characters for more details on using sed here.

sed -n 's/^.*\(\[[0-9]*\]\).*/\1/p'

or if you only want the digits and not the brackets:

sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
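To make the "only the last match is printed" behaviour concrete, here is a small sketch of the digits-only sed approach on invented input. The helper name extract_last_bracketed and the sample lines are assumptions for illustration; the brackets and the capture group are written with backslashes, as basic regular expressions in sed require.

```shell
# Hypothetical helper: print the digits inside the last [number] on each line.
# The leading .* is greedy, so it swallows any earlier [number] groups.
extract_last_bracketed() {
    sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
}

# The third line has two bracketed numbers; the fourth has none and is skipped.
printf '%s\n' 'foo[234]bar' 'x[23]y' 'a[1]b[1453]c' 'no brackets' | extract_last_bracketed
```

This prints 234, 23 and 1453, one per line. The line without brackets produces no output because -n suppresses default printing and the p flag only fires when the substitution succeeds.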

Without grep -o, Perl is the tool of choice here if you want something that’s both simple and comprehensible. On every line (-n), if the line contains a match for \[[0-9]*\], then print that match ($&) and a newline (-l).

perl -l -ne '/\[[0-9]*\]/ and print $&'

If you only want the digits, put parentheses in the regex to delimit a group, and print only that group.

perl -l -ne '/\[([0-9]*)\]/ and print $1'

P.S. If you only want to require one or more digits between the brackets, change [0-9]* to [0-9][0-9]*, or to [0-9]+ in Perl.
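The difference between the two quantifiers only shows up on empty brackets. The sketch below compares them using the sed form; the helper names and the sample line a[]b are made up for illustration.

```shell
# Hypothetical helpers: same sed command, differing only in the digit quantifier.
digits_zero_or_more() {
    sed -n 's/^.*\[\([0-9]*\)\].*/\1/p'
}
digits_one_or_more() {
    sed -n 's/^.*\[\([0-9][0-9]*\)\].*/\1/p'
}

# Count the output lines each variant produces for a line with empty brackets.
printf '%s\n' 'a[]b' | digits_zero_or_more | wc -l   # one (empty) line is printed
printf '%s\n' 'a[]b' | digits_one_or_more | wc -l    # nothing is printed
```

With [0-9]* the empty brackets still count as a match, so an empty line is emitted. With [0-9][0-9]* the line is skipped.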

Method 2

You can’t do it with cut.

tr -c -d '0123456789\012'
sed 's/[^0-9]*//g'
awk -F'[^0-9]+' '{ print $1$2$3 }'
grep -o -E '[0-9]+'

tr is the most natural fit for the problem and would probably run the fastest, but I think you would need gigantic inputs to separate any of these options in terms of speed.
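One caveat worth checking before using these: they keep every digit on the line, not just the digits between brackets. A minimal sketch on a made-up line with a stray digit outside the brackets (the helper names are hypothetical):

```shell
# Hypothetical helpers wrapping the tr and sed variants.
# \012 is the octal escape for newline, so line breaks survive the deletion.
all_digits_tr() {
    tr -c -d '0123456789\012'
}
all_digits_sed() {
    sed 's/[^0-9]*//g'
}

# Made-up input: one digit outside the brackets, three inside.
printf '%s\n' 'v2[234]x' | all_digits_tr
printf '%s\n' 'v2[234]x' | all_digits_sed
```

Both commands print 2234 here, so these approaches only give the bracketed number when the rest of the line contains no digits.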

Method 3

If you mean extract a set of consecutive digits between non-digit characters, I guess sed and awk are the best (although grep is also able to give you the matched characters):

sed: you can of course match the digits, but it’s perhaps interesting to do the opposite, remove the non-digits (works as long as there is only one number per line):

$ echo nn3334nn | sed -e 's/[^[:digit:]]*//g'
3334

grep: you can match consecutive digits

$ echo nn3334nn | grep -o '[[:digit:]]*'
3334

I don’t give an example for awk because I have no experience with it. It is interesting to note that, although sed is a Swiss Army knife, grep gives you a simpler, more readable way to do this, which also works for more than one number on each input line (the -o option prints only the matching parts of the input, each one on its own line):

$ echo dna42dna54dna | grep -o '[[:digit:]]*'
42
54
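This answer leaves awk out, so here is a hedged sketch of one way it could be done; it is not part of the original answer. It relies on the standard awk match() function and the RSTART and RLENGTH variables it sets. The helper name first_bracketed_awk and the sample lines are assumptions.

```shell
# Hypothetical helper: print the digits of the first [number] on each line.
# match() sets RSTART/RLENGTH to the position and length of the match,
# brackets included, so the digits start one character later and are
# two characters shorter than the match.
first_bracketed_awk() {
    awk 'match($0, /\[[0-9]+\]/) { print substr($0, RSTART + 1, RLENGTH - 2) }'
}

# The third line has digits outside the brackets, which are ignored.
printf '%s\n' 'foo[234]bar' 'x[23]y' 'dna42dna[1453]' | first_bracketed_awk
```

This prints 234, 23 and 1453. Unlike the digit-stripping approaches, it looks specifically for the brackets, and lines without a bracketed number produce no output.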

Method 4

Since it has been said that this cannot be done with cut, I will show that it is easily possible to produce a solution that is at least not worse than some of the others, even though I do not endorse the use of cut as the “best” (or even a particularly good) solution. It should be said that any solution not looking specifically for *[ and ]* around the digits makes simplifying assumptions and is therefore prone to failure on examples more complex than the one given by the asker (e.g. digits outside *[ and ]*, which should not be shown). This solution checks at least for the brackets, and it could be extended to check the asterisks as well (left as an exercise to the reader):

cut -f 2 -d '[' myfile.txt | cut -f 1 -d ']'

This makes use of the -d option, which specifies a delimiter. Obviously you could also pipe into the cut expression instead of reading from a file. While cut is probably pretty fast, since it is simple (no regex engine), you have to invoke it at least twice (or a few more times to check for *), which creates some process overhead. The one real advantage of this solution is that it is rather readable, especially for casual users not well versed in regex constructs.
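A piped variant might look like the sketch below. It adds the -s option, which is an addition to the command shown in this answer: by default cut passes lines without the delimiter through unchanged, and -s suppresses them. The helper name cut_between_brackets and the sample lines are made up.

```shell
# Hypothetical helper: text between the first '[' and the following ']'.
# -s drops lines that lack the delimiter instead of passing them through.
cut_between_brackets() {
    cut -s -d '[' -f 2 | cut -s -d ']' -f 1
}

# The middle line has no brackets and is dropped by the first cut.
printf '%s\n' 'foo[234]bar' 'no brackets here' 'x[23]y' | cut_between_brackets
```

This prints 234 and 23. Note that cut does not check that the text between the brackets is numeric, so a line such as a[bc]d would yield bc.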

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
