How to manipulate a CSV file with sed or awk?

How can I do the following to a CSV file using sed or awk?

  • Delete a column
  • Duplicate a column
  • Move a column

I have a big table with over 200 rows, and I’m not that familiar with sed.

Answers:

Method 1

This depends on whether your CSV file uses commas only for delimiters, or if you have madness like:

field one,"field,two",field three
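
A quick way to see the problem is to count the fields awk finds when it splits on every comma. This is only a small sketch, and the count_fields helper name is made up for illustration:

```shell
# count_fields [FILE...]: print the number of comma-separated fields awk sees per line.
# (count_fields is a hypothetical helper name, not part of any tool.)
count_fields() {
    awk -F, '{ print NF }' "$@"
}

# The comma inside the quotes is treated as a delimiter,
# so three logical fields are counted as four.
printf '%s\n' 'field one,"field,two",field three' | count_fields
```

If this prints the number of columns you expect for every row, the simple approaches below should be safe for your file.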

This assumes you’re using a simple CSV file:

Removing a column

You can get rid of a single column many ways; I used column 2 as an example. The easiest way is probably to use cut, which lets you specify a delimiter -d and which fields you want to print -f; this tells it to split on commas and output field 1, and fields 3 through the end:

$ cut -d, -f1,3- /path/to/your/file

If you actually need to use sed, you can write a regular expression that matches the first n-1 fields, the nth field, and the rest, and skip outputting the nth (here n is 2, so the first group is matched 1 time: {1}). The -E option turns on extended regular expressions, so the groups and the {1} need no backslashes:

$ sed -E 's/(([^,]+,){1})[^,]+,(.*)/\1\3/' /path/to/your/file

There are a number of ways to do this in awk, none of them particularly elegant. You can use a for loop, but dealing with the trailing comma is a pain; ignoring that, it’d be something like:

$ awk -F, '{for(i=1; i<=NF; i++) if(i != 2) printf "%s,", $i; print ""}' /path/to/your/file
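
If the trailing comma matters, one way around it is to print the separator before each field instead of after it. The following is a sketch under the same simple-CSV assumption; drop_col is a hypothetical helper name, and the column number is passed in with -v so it works for any column:

```shell
# drop_col N [FILE...]: print every comma-separated field except field N.
# (drop_col is a hypothetical helper name used only for this sketch.)
drop_col() {
    n=$1
    shift
    awk -F, -v n="$n" '{
        sep = ""                      # no separator before the first printed field
        for (i = 1; i <= NF; i++) {
            if (i == n) continue      # skip the unwanted column
            printf "%s%s", sep, $i
            sep = ","                 # every later field is preceded by a comma
        }
        print ""                      # end the record with a newline
    }' "$@"
}

printf 'a,b,c,d\n' | drop_col 2
```

With a file, the call would look like drop_col 2 /path/to/your/file.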

I find it easier to output field 1 and then use substr to pull off everything after field 2:

$ awk -F, '{print $1 "," substr($0, length($1)+length($2)+3)}' /path/to/your/file

This gets annoying for columns further along, though.

Duplicating a column

In sed this is essentially the same expression as before, but you also capture the target column and include that group multiple times in the replacement:

$ sed -E 's/(([^,]+,){1})([^,]+,)(.*)/\1\3\3\4/' /path/to/your/file

In awk, the for-loop version would be something like this (again ignoring the trailing comma):

$ awk -F, '{
for(i=1; i<=NF; i++) {
    if(i == 2) printf "%s,", $i;   # print column 2 an extra time
    printf "%s,", $i
}
print ""
}' /path/to/your/file

The substr way:

$ awk -F, '{print $1 "," $2 "," substr($0, length($1)+2)}' /path/to/your/file

(tcdyl came up with a better method in his answer)

Moving a column

I think the sed solution follows naturally from the others, but it starts to get ridiculously long.
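
To give an idea of the shape, here is a sketch for the simplest case, swapping columns 2 and 3. It assumes a simple CSV with no quoted commas and a sed that supports -E; swap_2_3 is a hypothetical helper name:

```shell
# swap_2_3 [FILE...]: swap the second and third comma-separated columns.
# (swap_2_3 is a hypothetical helper name used only for this sketch.)
swap_2_3() {
    # \1 = first field plus its comma, \2 = second field, \3 = third field;
    # anything after the third field is left untouched.
    sed -E 's/^([^,]*,)([^,]*),([^,]*)/\1\3,\2/' "$@"
}

printf 'a,b,c,d\n' | swap_2_3
```

Lines with fewer than three fields do not match and pass through unchanged. Moving a column further than one position needs more capture groups, which is where the expression gets long.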

Method 2

awk is your best bet. awk prints fields by number, so…

awk 'BEGIN { FS=","; OFS=","; } {print $1,$2,$3}' file

To remove a column, don’t print it:

awk 'BEGIN { FS=","; OFS=","; } {print $1,$3}' file

To change the order:

awk 'BEGIN { FS=","; OFS=","; } {print $3,$1,$2}' file

Redirect to an output file:

awk 'BEGIN { FS=","; OFS=","; } {print $3,$1,$2}' file > output.file

awk can format the output as well.

Awk format output
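
As a rough illustration of formatting, awk's printf takes C-style format strings. The sketch below assumes a two-column CSV; show_table is a hypothetical helper name and the column widths are arbitrary:

```shell
# show_table [FILE...]: print the first two CSV columns as aligned text.
# (show_table is a hypothetical helper name used only for this sketch.)
show_table() {
    # %-10s left-justifies in a 10-character column; %5s right-justifies in 5.
    awk -F, '{ printf "%-10s|%5s\n", $1, $2 }' "$@"
}

printf 'apple,3\nkiwi,12\n' | show_table
```

Unlike print, printf does not add a newline or use OFS, so both have to be written into the format string.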

Method 3

Aside from how to cut and re-arrange the fields (covered in the other answers), there is the issue of quirky CSV fields.

If your data falls into this “quirky” category, a bit of pre- and post-filtering can take care of it. The filters shown below require that the characters \x01, \x02, \x03, and \x04 do not appear anywhere in your data.

Here are the filters wrapped around a simple awk field dump.

Note: field-five has an invalid/incomplete “quoted field” layout, but it is benign at the end of a row (depending on the CSV parser). Of course, it would cause unexpected results if it were swapped away from its current end-of-row position.

Update: user121196 pointed out a bug when a comma precedes a trailing quote. Here is the fix.

The data

cat <<'EOF' >file
field one,"fie,ld,two",field"three","field,\",four","field,five
"15111 N. Hayden Rd., Ste 160,",""
EOF

The code

sed -r 's/^/,/; s/\\"/\x01/g; s/,"([^"]*)"/,\x02\1\x03/g; s/,"/,\x02/; :MC; s/\x02([^\x03]*),([^\x03]*)/\x02\1\x04\2/g; tMC; s/^,// ' file |
  awk -F, '{ for(i=1; i<=NF; i++) printf "%s\n", $i; print NL}' |
    sed -r 's/\x01/\\"/g; s/(\x02|\x03)/"/g; s/\x04/,/g'

The output:

field one
"fie,ld,two"
field"three"
"field,\",four"
"field,five

"15111 N. Hayden Rd., Ste 160,"
""

Here is the pre-filter, expanded with comments.
The post-filter is just a reversal of \x01, \x02, \x03, and \x04.

sed -r '
    s/^/,/                # add a leading comma delimiter
    s/\\"/\x01/g          # obfuscate escaped quotation-mark (\")
    s/,"([^"]*)"/,\x02\1\x03/g    # obfuscate quotation-marks
    s/,"/,\x02/           # when no trailing quote on last field
    :MC                   # obfuscate commas embedded in quotes
    s/\x02([^\x03]*),([^\x03]*)/\x02\1\x04\2/g
    tMC
    s/^,//                # remove spurious leading delimiter
'

Method 4

Given a space-delimited file in the following format:

1 2 3 4 5

You can remove field 2 with awk like so:

awk '{ sub($2,""); print}' file

which returns

1  3 4 5

Replace column 2 with column n where appropriate.
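
One caveat: sub() treats its first argument as a regular expression and replaces the first match anywhere in the line, which is not necessarily inside field 2. A sketch of a variant that blanks the field by assignment instead (blank_col is a hypothetical helper name):

```shell
# blank_col N [FILE...]: empty field N of a whitespace-delimited file.
# (blank_col is a hypothetical helper name used only for this sketch.)
blank_col() {
    n=$1
    shift
    # Assigning to $n rebuilds the record from its fields joined by OFS (a space),
    # so only field N is affected, whatever text it contains.
    awk -v n="$n" '{ $n = ""; print }' "$@"
}

printf '12 2 3\n' | awk '{ sub($2, ""); print }'   # prints "1 2 3": the 2 inside 12 was removed
printf '12 2 3\n' | blank_col 2                    # prints "12  3"
```

Note that rebuilding the record also collapses any runs of whitespace between the other fields to single spaces.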

To duplicate column 2,

awk '{ col = $2 " " $2; $2 = col; print }' file

which returns

1 2 2 3 4 5

To switch column 2 and 3,

awk '{temp = $2; $2 = $3; $3 = temp; print}' file

which returns

1 3 2 4 5

awk is generally very good at dealing with the concept of fields. If you’re dealing with a CSV, and not a space-delimited file, you can simply use

awk -F,

to define your field separator as a comma, instead of whitespace (which is the default). There are a number of good awk resources online, one of which I list as a source below.
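
Keep in mind that when a field is assigned to, awk rebuilds the line using the output separator OFS, which defaults to a space, so for CSV it most likely needs to be set too. A sketch of the column swap for comma-delimited input (swap_cols is a hypothetical helper name):

```shell
# swap_cols A B [FILE...]: swap comma-separated columns A and B.
# (swap_cols is a hypothetical helper name used only for this sketch.)
swap_cols() {
    a=$1
    b=$2
    shift 2
    # -F, splits the input on commas; OFS="," joins the rebuilt record with commas too.
    awk -F, -v OFS=, -v a="$a" -v b="$b" '{ t = $a; $a = $b; $b = t; print }' "$@"
}

printf '1,2,3,4,5\n' | swap_cols 2 3
```

Without the OFS setting, the same swap would print the fields separated by spaces instead of commas.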

Source for #3


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
