Turning separate lines into a comma separated list with quoted entries

I have the following data (a list of R packages parsed from an R Markdown file) that I want to turn into a list I can pass to R to install:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

I want to turn the list into a list of the form:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I currently have a bash pipeline that goes from the raw file to the list above:

grep 'library(' Presentation.Rmd \
| grep -v '#' \
| cut -f2 -d\( \
| tr -d ')' \
| sort | uniq

I want to add a step to turn the newlines into the comma-separated list. I’ve tried adding tr '\n' '","', which fails. I’ve also tried a number of the following Stack Overflow answers, which also fail:

This produces library(stringr)))phics) as the result.

This produces ,% as the result.

This answer (with the -i flag removed), produces output identical to the input.

Answers:


Method 1

You can add quotes with sed and then merge the lines with paste, like this:

sed 's/^\|$/"/g'|paste -sd, -

If you are running a GNU coreutils based system (i.e. Linux), you can omit the trailing '-'.

If your input data has DOS-style line endings (as @phk suggested), you can modify the command as follows:

sed 's/\r//;s/^\|$/"/g'|paste -sd, -
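
For reference, the question asked for single quotes and a comma followed by a space, while the commands above emit double quotes and a bare comma. A minimal sketch of one way to get the requested form, assuming the entries themselves never contain a comma (quote_join is a hypothetical helper name, not something from the answer):

```shell
# quote_join: hypothetical helper, reads one entry per line on stdin.
quote_join() {
    # 1. wrap every line in single quotes
    # 2. join all lines with a comma (paste -d takes single-character delimiters)
    # 3. widen each comma to comma+space (assumes no commas inside entries)
    sed "s/.*/'&'/" | paste -sd, - | sed 's/,/, /g'
}

printf 'd3heatmap\ndata.table\nggplot2\n' | quote_join
```

The second sed is only there because paste cannot emit a two-character separator; drop it if a bare comma is acceptable.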

Method 2

Using awk:

awk 'BEGIN { ORS="" } { print p"'"'"'"$0"'"'"'"; p=", " } END { print "\n" }' /path/to/list

Alternative with less shell escaping and therefore more readable:

awk 'BEGIN { ORS="" } { print p"\047"$0"\047"; p=", " } END { print "\n" }' /path/to/list

Output:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

Explanation:

The awk script itself without all the escaping is BEGIN { ORS="" } { print p"'"$0"'"; p=", " } END { print "\n" }. After the first entry is printed, the variable p is set (before that it acts like an empty string). Every entry (or in awk-speak: record) is prefixed with p and printed with single quotes around it. The awk output record separator variable ORS is not needed (since the prefix does that job), so it is set to be empty at the BEGINning. Oh, and we might want our output to END with a newline (e.g. so it works with further text-processing tools); should this not be needed, the part with END and everything after it (inside the single quotes) can be removed.
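
The same prefix trick can also be written without any quote escaping by handing the quote character to awk through -v. This is only a sketch of that variation; the function name quote_join_awk and the variable name q are made up for illustration:

```shell
# quote_join_awk: hypothetical helper, reads one entry per line on stdin.
quote_join_awk() {
    # q holds the single quote; p is empty for the first record and ", " afterwards,
    # so the separator is printed before every record except the first.
    awk -v q="'" 'BEGIN { ORS = "" } { print p q $0 q; p = ", " } END { print "\n" }'
}

printf 'ggplot2\nplotly\nscales\n' | quote_join_awk
```

Changing q or the value assigned to p is enough to switch to double quotes or a different separator.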

Note

If you have Windows/DOS-style line endings (\r\n), you have to convert them to UNIX style (\n) first. To do this you can put tr -d '\015' at the beginning of your pipeline:

tr -d '\015' < /path/to/input.list | awk […] > /path/to/output

(Assuming you don’t have any use for \r characters in your file. Very safe assumption here.)

Alternatively, simply run dos2unix /path/to/input.list once to convert the file in-place.
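
To see what the tr step does, here is a small sketch that counts bytes before and after stripping the carriage returns. The helper name strip_cr is an assumption for illustration only:

```shell
# strip_cr: hypothetical helper that deletes every carriage return (octal 015) from stdin.
strip_cr() {
    tr -d '\015'
}

# Two entries with DOS line endings: 6 bytes going in, 4 bytes after stripping.
printf 'a\r\nb\r\n' | wc -c
printf 'a\r\nb\r\n' | strip_cr | wc -c
```

If the two counts differ for your own file, it contained carriage returns.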

Method 3

As @don_crissti’s linked answer shows, the paste option borders on incredibly fast — the linux kernel’s piping is more efficient than I would have believed if I hadn’t just now tried it. Remarkably, if you can be happy with a single comma separating your list items rather than a comma+space, a paste pipeline

(paste -d\' /dev/null - /dev/null | paste -sd, -) <input

is faster than even a reasonable flex program(!)

%option 8bit main fast
%%
.*  { printf("'%s'",yytext); }
\n/(.|\n) { printf(", "); }

But if just decent performance is acceptable (and if you’re not running a stress test, you won’t be able to measure any constant-factor differences, they’re all instant) and you want both flexibility with your separators and reasonable one-liner-y-ness,

sed "s/.*/'&'/;H;1h;"'$!d;x;s/\n/, /g'

is your ticket. Yes, it looks like line noise, but the H;1h;$!d;x idiom is the right way to slurp up everything. Once you can recognize that, the whole thing actually gets easy to read: it’s s/.*/'&'/ followed by a slurp and a s/\n/, /g.
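
Spelled out with one -e per step, the same command might look like the sketch below. The wrapper name slurp_join is hypothetical, and the \n in the last substitution is assumed to be understood by your sed (GNU sed does):

```shell
# slurp_join: hypothetical helper, reads one entry per line on stdin.
slurp_join() {
    # -e 1: wrap the current line in single quotes
    # -e 2: H appends the line to the hold space, 1h starts the hold space cleanly on line 1,
    #       $!d discards every line but the last, x brings the collected lines back
    # -e 3: turn the embedded newlines into ", "
    sed -e "s/.*/'&'/" -e 'H;1h;$!d;x' -e 's/\n/, /g'
}

printf 'reshape2\nscales\nstringr\n' | slurp_join
```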


edit: bordering on the absurd, it’s fairly easy to get flex to beat everything else hollow, just tell stdio you don’t need the builtin multithread/signalhandler sync:

%option 8bit main fast
%%
.+  { putchar_unlocked('\'');
      fwrite_unlocked(yytext,yyleng,1,stdout);
      putchar_unlocked('\''); }
\n/(.|\n) { fwrite_unlocked(", ",2,1,stdout); }

and under stress that’s 2-3x quicker than the paste pipelines, which are themselves at least 5x quicker than everything else.

Method 4

I think the following should do just fine, assuming your data is in the file text:

d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr

Let’s use arrays which have the substitution down cold:

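#!/bin/bash
# Word-split the file into an array (assumes entries contain no whitespace)
input=( $(cat text) )
# Build a single string: each entry in single quotes, followed by a comma
output=$(
for i in "${input[@]}"
        do
        printf "'%s'," "$i"
done
)
# Drop the trailing comma (a negative length needs bash 4.2 or newer)
output=${output:0:-1}
# Put a space after each remaining comma
echo "${output//,/, }"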

The output of the script should be as follows:

'd3heatmap', 'data.table', 'ggplot2', 'htmltools', 'htmlwidgets', 'metricsgraphics', 'networkD3', 'plotly', 'reshape2', 'scales', 'stringr'

I believe this was what you were looking for?

Method 5

Python

Python one-liner:

$ python -c "import sys; print(','.join([repr(l.strip()) for l in sys.stdin]))" < input.txt                               
'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'

This works in a simple way: we redirect input.txt into stdin using the shell’s < operator, then read each line into a list, with .strip() removing the newlines and repr() creating a quoted representation of each line. The list is then joined into one big string via the .join() function, with , as the separator.

Alternatively, we could use + to concatenate quotes to each stripped line:

python -c "import sys;sq='\'';print(','.join([sq+l.strip()+sq for l in sys.stdin]))" < input.txt

Perl

Essentially the same idea as before: read all lines, strip the trailing newline, enclose each in single quotes, stuff everything into the array @cvs, and print out the array values joined with commas.

$ perl -ne 'chomp; $sq = "\047" ; push @cvs,"$sq$_$sq";END{ print join(",",@cvs)   }'  input.txt
'd3heatmap','data.table','ggplot2','htmltools','htmlwidgets','metricsgraphics','networkD3','plotly','reshape2','scales','stringr'

Method 6

I often have a very similar scenario: I copy a column from Excel and want to convert the content into a comma separated list (for later usage in a SQL query like ... WHERE col_name IN <comma-separated-list-here>).

This is what I have in my .bashrc:

function lbl {
    TMPFILE=$(mktemp)
    cat $1 > $TMPFILE
    dos2unix $TMPFILE
    (echo "("; cat $TMPFILE; echo ")") | tr '\n' ',' | sed -e 's/(,/(/' -e 's/,)/)/' -e 's/),/)/'
    rm $TMPFILE
}

I then run lbl (“line by line”) on the cmd line which waits for input, paste the content from the clipboard, press <C-D> and the function returns the input surrounded with (). This looks like so:

$ lbl
1
2
3
dos2unix: converting file /tmp/tmp.OGM6UahLTE to Unix format ...
(1,2,3)

(I don’t remember why I put the dos2unix in here, presumably because this often causes trouble in my company’s setup.)
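
A shorter variant without the temporary file is sketched below. It only reads stdin, strips carriage returns with tr instead of dos2unix, and joins with paste; the name lbl_paste is made up here and this is not the function from my .bashrc:

```shell
# lbl_paste: hypothetical stdin-only variant of lbl.
lbl_paste() {
    # Remove carriage returns, join the lines with commas, wrap the result in parentheses.
    tr -d '\015' | paste -sd, - | sed 's/.*/(&)/'
}

printf '1\r\n2\r\n3\r\n' | lbl_paste
```

As with lbl, you can run it without a pipe, paste the clipboard content, and finish with <C-D>.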

Method 7

Some versions of sed behave a little differently, but on my Mac, I can handle everything but the “uniq” in sed:

sed -n -e '
# Skip commented library lines
/#/b
# Handle library lines
/library(/{
    # Replace line with just quoted filename and comma
    # Extra quoting is due to command-line use of a quote
    s/library(\([^)]*\))/'\''\1'\'', /
    # Exchange with hold, append new entry, remove the new-line
    x; G; s/\n//
    ${
        # If last line, remove trailing comma, print, quit
        s/, $//; p; b
    }
    # Save into hold
    x
}
${
    # Last line not library
    # Exchange with hold, remove trailing comma, print
    x; s/, $//; p
}
'

Unfortunately to fix the unique part you have to do something like:

grep library Presentation.Rmd | sort -u | sed -n -e '...'

–Paul
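
If keeping everything inside a single sed script is not a requirement, a rough end-to-end sketch is to let sed do only the extraction and quoting, then hand deduplication to sort -u and joining to paste. The function name extract_packages is hypothetical, and the sketch assumes package names contain no commas:

```shell
# extract_packages: hypothetical helper; $1 is the path to the R Markdown file.
extract_packages() {
    grep 'library(' "$1" \
        | grep -v '#' \
        | sed -n "s/.*library(\([^)]*\)).*/'\1'/p" \
        | sort -u \
        | paste -sd, - \
        | sed 's/,/, /g'
}

# Usage (the file name is taken from the question):
# extract_packages Presentation.Rmd
```

Like the pipeline in the question, this drops any line containing a #, not only lines that start with one.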

Method 8

It is funny that, for a plain-text list of R packages to be installed in R, nobody proposed using that list directly in R; instead everyone fights with bash, perl, python, awk, sed or whatever to put quotes and commas in the list. This is not necessary at all, and moreover it does not solve how to input and use the transformed list in R.

You can simply load the plain text file (say, packages.txt) as a data frame with a single variable, which you can extract as a vector directly usable by install.packages. So converting it into a usable R object and installing that list is just:

df <- read.delim("packages.txt", header=F, strip.white=T, stringsAsFactors=F)
install.packages(df$V1)

Or without an external file:

packages <-" 
d3heatmap
data.table
ggplot2
htmltools
htmlwidgets
metricsgraphics
networkD3
plotly
reshape2
scales
stringr
"
df <- read.delim(textConnection(packages), 
header=F, strip.white=T, stringsAsFactors=F)
install.packages(df$V1)


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
