Remove characters if they do not follow specified patterns

I want to clean up some files and make the way they are written more uniform.

So, my input looks something like this:

$a$h$l )r
^9 ^5 l
 urd

The thing is, some spaces are “unnecessary” and make comparing the files difficult. For this reason, I want to remove all spaces, unless they follow directly after one of the following characters:

  • $
  • ^
  • T
  • iN (N being a variable, any character 1 byte long)
  • oN (N being a variable, as above)
  • s
  • sN (N being a variable, as above)
  • @
  • !
  • /
  • (
  • )
  • =N (N being a variable, as above)
  • %N (N being a variable, as above)

So, an example input might be:

:
$ $ $N
$  $  $a
sa  s l r
*56 l r
o1 o 2
%%x v

Where the wanted output would be:

:
$ $ $N
$ $ $a
sa s lr
*56lr
o1 o 2
%%xv

For the %%x v case, the space is removed because it’s the third character following the initial %, where the second % acts as the variable.

I’m using a GNU/Linux operating system.

Answers:

Method 1

I think I get it now – thank you.

With extended regexp to handle the optionals for the extra chars in N a little easier (note the example input used here is slightly different than your own in the question):

sed -Ee's|([sio=%]..)?([@!T()^$/].)? *|\1\2|g' \
<<""
:
$ $ $N
$  $  $a
sa  s    l r
*56 l r
o1 o 2
%%xv

:
$ $ $N
$ $ $a
sa s  lr
*56lr
o1 o 2
%%xv

You’ll need a GNU/BSD/AST sed to use that. An equivalent BRE would read like:

sed 's|\([soi=%]..\)\{0,1\}\([@!T()^$/].\)\{0,1\} *|\1\2|g'
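
One way to gain some confidence that the two forms behave the same is to run both over the same input and compare the results. The sketch below assumes a sed that accepts -E (GNU or BSD); the function names clean_ere and clean_bre are hypothetical helpers, and the sample lines are taken from the question.

```shell
# ERE form (assumes a sed that supports -E, such as GNU or BSD sed).
clean_ere() {
    sed -E 's|([sio=%]..)?([@!T()^$/].)? *|\1\2|g'
}

# BRE form of the same expression: groups and intervals are backslash-escaped.
clean_bre() {
    sed 's|\([sio=%]..\)\{0,1\}\([@!T()^$/].\)\{0,1\} *|\1\2|g'
}

# Sample lines from the question (single quotes keep $ and % literal).
sample='$  $  $a
sa  s l r
*56 l r
o1 o 2
%%x v'

ere_out=$(printf '%s\n' "$sample" | clean_ere)
bre_out=$(printf '%s\n' "$sample" | clean_bre)

if [ "$ere_out" = "$bre_out" ]; then
    echo "ERE and BRE forms agree"
else
    echo "ERE and BRE forms differ"
fi
```

If the two outputs ever differ on your own files, that points at an input the expressions treat differently and is worth inspecting by hand.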

The trick is to make all matches ultimately optional – so that no one portion of a pattern takes precedence. Because you are only actually removing data – and not inserting (which would have to be handled a lot differently) you won’t have any issues with the matches for null-strings in the interim between your match targets. Who cares how many null-strings are removed?

sed’s regex scans pattern space globally, from left to right. If there were some possibility of overlap between the matches it wouldn’t work very well, because a global substitution never goes back over text an earlier match has already consumed. But there is only one case I can think of for that, and it is handled here. Anyway, the space is always on the right side, and there is always some non-space on the left. It is possible, though, that N could be one of the single-char delims you name, but in that case the one space is still preserved as it should be.
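
The case where N is itself one of the single-character delimiters can be checked with a small sketch. The inputs here are made up for illustration (they are not from the question), and clean is a hypothetical helper wrapping the ERE command; a sed with -E support is assumed.

```shell
# Hypothetical helper wrapping the ERE command from this answer.
clean() {
    sed -E 's|([sio=%]..)?([@!T()^$/].)? *|\1\2|g'
}

# Illustrative inputs: N is '$', ')' and '^' respectively, and each
# sN / =N / %N token is followed by two or three spaces.
printf '%s\n' 's$  x' '=)  l' '%^   r' | clean
```

With GNU sed this prints `s$ x`, `=) l` and `%^ r`: the three-character group consumes the delimiter as N together with the one space that follows it, so the delimiter is never matched a second time on its own.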

As it scans it checks input against the patterns – the first it might match is the 3-char one, the second is the 2 char one, and the third is a single – a space (though that match might continue for any length).

When any of these is found, sed replaces either of the first two kinds of match with itself (it passes through unchanged), while the third is removed completely. And all in one go.
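
The three kinds of match described above can be seen one at a time in the following sketch. The inputs are invented for illustration, clean is a hypothetical helper name, and a sed with -E support is assumed.

```shell
# Hypothetical helper wrapping the ERE command from this answer.
clean() {
    sed -E 's|([sio=%]..)?([@!T()^$/].)? *|\1\2|g'
}

# 3-char match: "o1 " passes through, the extra space after it is dropped.
three=$(printf '%s\n' 'o1  x' | clean)

# 2-char match: "^ " passes through, the extra spaces after it are dropped.
two=$(printf '%s\n' '^   l' | clean)

# Spaces only: nothing protects them, so every space is dropped.
bare=$(printf '%s\n' 'u  r  d' | clean)

printf '%s\n' "$three" "$two" "$bare"
```

With GNU sed the three lines printed are `o1 x`, `^ l` and `urd`.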

Method 2

Maybe something like:

perl -pe 's{((?:[ios=%].|[\$^T@!/()])+.)| }{$1}g'


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
