Intersection of two arrays in BASH

I have two arrays like this:

A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)

The arrays are not sorted and might possibly even contain duplicated elements.

  1. I would like to make the intersection of these two arrays and store the elements in another array. How would I do that?
  2. Also, how would I get the list of elements that appear in B and are not available in A?

Answers:


Method 1

comm(1) is a tool that compares two lists and can give you the intersection or difference between two lists. The lists need to be sorted, but that’s easy to achieve.

To get your arrays into a sorted list suitable for comm:

$ printf '%s\n' "${A[@]}" | LC_ALL=C sort

That will turn array A into a sorted list. Do the same for B.

To use comm to return the intersection:

$ comm -1 -2 file1 file2

-1 -2 says to remove entries unique to file1 (A) and entries unique to file2 (B), leaving only the intersection of the two.

To have it return what is in file2 (B) but not file1 (A):

$ comm -1 -3 file1 file2

-1 -3 says to remove entries unique to file1 and entries common to both, leaving only those unique to file2.

To feed two pipelines into comm, use the “Process Substitution” feature of bash:

$ comm -1 -2 <(pipeline1) <(pipeline2)
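As a small illustration of the flags and of process substitution, here is a sketch with two already-sorted literal lists. The array names first and second and the variables common and only_second are made up for this example.

```shell
# Two small, already sorted lists (hypothetical sample data).
first=(a b c)
second=(b c d)

# -12 suppresses columns 1 and 2, leaving lines common to both inputs.
common=$(comm -12 <(printf '%s\n' "${first[@]}") <(printf '%s\n' "${second[@]}"))

# -13 suppresses columns 1 and 3, leaving lines unique to the second input.
only_second=$(comm -13 <(printf '%s\n' "${first[@]}") <(printf '%s\n' "${second[@]}"))

printf 'common:\n%s\n' "$common"
printf 'only in second:\n%s\n' "$only_second"
```

Each <(...) is replaced by a file name that comm can read, so no temporary files are needed.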

To capture this in an array:

$ C=($(command))

Putting it all together:

# 1. Intersection
$ C=($(comm -12 <(printf '%s\n' "${A[@]}" | LC_ALL=C sort) <(printf '%s\n' "${B[@]}" | LC_ALL=C sort)))

# 2. B - A
$ D=($(comm -13 <(printf '%s\n' "${A[@]}" | LC_ALL=C sort) <(printf '%s\n' "${B[@]}" | LC_ALL=C sort)))
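Since the question says the arrays may contain duplicates, one possible variant sorts with -u so each list becomes a set, and reads the result with mapfile instead of relying on word splitting. This is only a sketch: it assumes bash 4 or later (for mapfile), the helper name sorted is invented here, and the sample data is shortened for readability.

```shell
# Shortened sample data with duplicates (hypothetical values).
A=(vol-c vol-a vol-b vol-a)
B=(vol-b vol-d vol-c vol-c)

# Hypothetical helper: print arguments one per line, sorted and de-duplicated.
sorted() { printf '%s\n' "$@" | LC_ALL=C sort -u; }

# 1. Intersection
mapfile -t C < <(comm -12 <(sorted "${A[@]}") <(sorted "${B[@]}"))

# 2. B - A
mapfile -t D < <(comm -13 <(sorted "${A[@]}") <(sorted "${B[@]}"))

printf 'intersection: %s\n' "${C[*]}"
printf 'B - A: %s\n' "${D[*]}"
```

mapfile -t stores one line per array element, so elements containing spaces or glob characters would survive intact, which C=($(command)) does not guarantee.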

Method 2

There is a rather elegant and efficient approach using uniq, but it requires first eliminating duplicates from each array so that only unique items remain. If you need to preserve duplicates, the only way is by looping through both arrays and comparing.

Consider we have two arrays:

A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)

First of all, let's transform these arrays into sets. We do this because intersection is a mathematical operation defined on sets, and a set is a collection of distinct (unique) objects. To be honest, I don't know what "intersection" means for lists or sequences. We can pick a subsequence out of a sequence, but that operation (selection) has a slightly different meaning.

So, let's transform!

$ A=($(echo ${A[@]} | sed 's/ /\n/g' | sort | uniq))
$ B=($(echo ${B[@]} | sed 's/ /\n/g' | sort | uniq))
  1. Intersection:
    $ echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d

    If you want to store the elements in another array:

    $ intersection_set=($(echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d))

    $ echo ${intersection_set[@]}
    vol-175a3b54 vol-71600106 vol-98c2bbef

    uniq -d means show only duplicated lines. uniq compares adjacent lines only, which is why the input has to be sorted first.

  2. Get the list of elements that appear in B and are not available in A, i.e. B \ A:
    $ echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d | xargs echo ${B[@]} | sed 's/ /\n/g' | sort | uniq -u

    Or, with saving in a variable:

    $ subtraction_set=$(echo ${A[@]} ${B[@]} | sed 's/ /\n/g' | sort | uniq -d | xargs echo ${B[@]} | sed 's/ /\n/g' | sort | uniq -u)
    
    $ echo $subtraction_set
    vol-27991850 vol-2a19386a vol-615e1222 vol-7320102b vol-8f6226cc vol-b846c5cf vol-e38d0c94

    Thus, we first take the intersection of A and B (which is simply the set of duplicates between them), call it A ∩ B. We then combine B with A ∩ B and keep only the elements that appear once (uniq -u), which drops everything B shares with A. In other words, B \ A = !(B ∩ (A ∩ B)), where ! means "the elements of B not in".
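To see why the de-duplication step matters, here is a small sketch with made-up data in which A contains a duplicate. The variable names naive and deduped are invented for this illustration.

```shell
# Hypothetical sample data: x appears twice in A and never in B.
A=(x x y)
B=(y z)

# Without de-duplicating first, uniq -d also reports x,
# because it only sees that x occurs twice in the combined list.
naive=$(printf '%s\n' "${A[@]}" "${B[@]}" | sort | uniq -d | xargs)

# De-duplicating each array first (sort -u) leaves only the true intersection.
deduped=$( { printf '%s\n' "${A[@]}" | sort -u
             printf '%s\n' "${B[@]}" | sort -u; } | sort | uniq -d | xargs)

echo "without dedup: $naive"     # without dedup: x y
echo "with dedup:    $deduped"   # with dedup:    y
```

uniq -d cannot tell a duplicate inside one array from an element shared by both, so the per-array sort | uniq step is what makes the result a real intersection.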

P.S. uniq was written by Richard M. Stallman and David MacKenzie.

Method 3

You can get all elements that are in both A and B by looping through both arrays and comparing:

A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)

intersections=()

for item1 in "${A[@]}"; do
    for item2 in "${B[@]}"; do
        if [[ $item1 == "$item2" ]]; then
            intersections+=( "$item1" )
            break
        fi
    done
done

printf '%s\n' "${intersections[@]}"

You can get all elements in B but not in A in a similar manner:

A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618 vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)

not_in_a=()

for item1 in "${B[@]}"; do
    for item2 in "${A[@]}"; do
        [[ $item1 == "$item2" ]] && continue 2
    done

    # If we reached here, nothing matched.
    not_in_a+=( "$item1" )
done

printf '%s\n' "${not_in_a[@]}"
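For larger arrays, the nested loops can be replaced by a lookup table. The following is a sketch of a different technique, not the method above: it uses associative arrays, so it assumes bash 4 or later and non-empty elements. The names in_a and seen are invented here; seen makes duplicates in B count only once.

```shell
A=(vol-175a3b54 vol-382c477b vol-71600106 vol-98c2bbef vol-98c2bbef)
B=(vol-175a3b54 vol-e38d0c94 vol-98c2bbef vol-71600106 vol-e38d0c94)

declare -A in_a=() seen=()

# Record every element of A as a key for constant-time lookup.
for item in "${A[@]}"; do
    in_a[$item]=1
done

intersections=()
not_in_a=()

# Walk B once, in its original order, skipping repeated elements.
for item in "${B[@]}"; do
    [[ ${seen[$item]+x} ]] && continue
    seen[$item]=1
    if [[ ${in_a[$item]+x} ]]; then
        intersections+=( "$item" )
    else
        not_in_a+=( "$item" )
    fi
done

printf 'in both: %s\n' "${intersections[*]}"
printf 'only in B: %s\n' "${not_in_a[*]}"
```

Each array is traversed once, and the results keep the order in which the elements first appear in B.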

Method 4

Ignoring efficiency, here is an approach:

declare -a intersect
declare -a b_only
for bvol in "${B[@]}"
do
    in_both=""
    for avol in "${A[@]}"
    do
        [ "$bvol" = "$avol" ] && in_both=Yes
    done
    if [ "$in_both" ]
    then
        intersect+=("$bvol")
    else
        b_only+=("$bvol")
    fi
done
echo "intersection=${intersect[*]}"
echo "In B only=${b_only[@]}"
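If the membership test is needed in more than one place, the inner loop could be factored into a function. This is a sketch only: the function name contains and the short sample arrays are made up for the example.

```shell
# Hypothetical helper: succeed if the first argument equals any later argument.
contains() {
    local needle=$1 item
    shift
    for item in "$@"; do
        [[ $item == "$needle" ]] && return 0
    done
    return 1
}

# Shortened sample data (hypothetical values).
A=(vol-a vol-b vol-c)
B=(vol-b vol-d)

intersect=()
b_only=()
for bvol in "${B[@]}"; do
    if contains "$bvol" "${A[@]}"; then
        intersect+=("$bvol")
    else
        b_only+=("$bvol")
    fi
done

echo "intersection=${intersect[*]}"
echo "In B only=${b_only[*]}"
```

Unlike the loop above, the function returns as soon as a match is found instead of scanning the rest of A.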

Method 5

My pure bash way

Since these variables contain only vol-XXX, where XXX is a hexadecimal number, there is a quick way using bash arrays:

unset A B a b c i                    # Only useful for re-testing...

A=(vol-175a3b54 vol-382c477b vol-8c027acf vol-93d6fed0 vol-71600106 vol-79f7970e
   vol-e3d6a894 vol-d9d6a8ae vol-8dbbc2fa vol-98c2bbef vol-ae7ed9e3 vol-5540e618
   vol-9e3bbed3 vol-993bbed4 vol-a83bbee5 vol-ff52deb2)
B=(vol-175a3b54 vol-e38d0c94 vol-2a19386a vol-b846c5cf vol-98c2bbef vol-7320102b
   vol-8f6226cc vol-27991850 vol-71600106 vol-615e1222)

for i in ${A[@]#vol-};do
    [ "${a[$((16#$i))]}" ] && echo Duplicate vol-$i in A
    ((a[$((16#$i))]++))
    ((c[$((16#$i))]++))
  done
for i in ${B[@]#vol-};do
    [ "${b[$((16#$i))]}" ] && echo Duplicate vol-$i in B
    ((b[$((16#$i))]++))
    [ "${c[$((16#$i))]}" ] && echo Present in A and B: vol-$i
    ((c[$((16#$i))]++))
  done

This should output:

Present in A and B: vol-175a3b54
Present in A and B: vol-98c2bbef
Present in A and B: vol-71600106

At this point, your bash environment contains:

set | grep ^c=
c=([391789396]="2" [664344656]="1" [706295914]="1" [942425979]="1" [1430316568]="1"
[1633554978]="1" [1902117126]="2" [1931481131]="1" [2046269198]="1" [2348972751]="1"
[2377892602]="1" [2405574348]="1" [2480340688]="1" [2562898927]="2" [2570829524]="1"
[2654715603]="1" [2822487781]="1" [2927548899]="1" [3091645903]="1" [3654723758]="1"
[3817671828]="1" [3822495892]="1" [4283621042]="1")
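The indices in this listing are the decimal values of the hexadecimal suffixes. As a sketch of that round trip, using one ID from the question (the variable names id, hex, idx and back are invented here):

```shell
id=vol-175a3b54

# Strip the vol- prefix, leaving the hexadecimal part.
hex=${id#vol-}

# 16#... tells bash arithmetic to read the digits as base 16.
idx=$((16#$hex))

# Convert the index back to an 8-digit, zero-padded hexadecimal ID.
printf -v back 'vol-%08x' "$idx"

echo "$idx"    # 391789396
echo "$back"   # vol-175a3b54
```

Bash indexed arrays are sparse, so using such large numbers as indices does not allocate the slots in between.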

So you could:

for i in ${!b[@]};do
    [ ${c[$i]} -eq 1 ] &&
        printf "Present only in B: vol-%08x\n" $i
  done

This will render:

Present only in B: vol-27991850
Present only in B: vol-2a19386a
Present only in B: vol-615e1222
Present only in B: vol-7320102b
Present only in B: vol-8f6226cc
Present only in B: vol-b846c5cf
Present only in B: vol-e38d0c94

But this is numerically sorted! If you want the original order, you could:

for i in ${B[@]#vol-};do
    [ ${c[$((16#$i))]} -eq 1 ] && printf "Present in B only: vol-%s\n" $i
  done

So the volumes are displayed in the same order as submitted:

Present in B only: vol-e38d0c94
Present in B only: vol-2a19386a
Present in B only: vol-b846c5cf
Present in B only: vol-7320102b
Present in B only: vol-8f6226cc
Present in B only: vol-27991850
Present in B only: vol-615e1222

or

for i in ${!a[@]};do
    [ ${c[$i]} -eq 1 ] && printf "Present only in A: vol-%08x\n" $i
  done

for showing only in A:

Present only in A: vol-382c477b
Present only in A: vol-5540e618
Present only in A: vol-79f7970e
Present only in A: vol-8c027acf
Present only in A: vol-8dbbc2fa
Present only in A: vol-93d6fed0
Present only in A: vol-993bbed4
Present only in A: vol-9e3bbed3
Present only in A: vol-a83bbee5
Present only in A: vol-ae7ed9e3
Present only in A: vol-d9d6a8ae
Present only in A: vol-e3d6a894
Present only in A: vol-ff52deb2

or even:

for i in ${!b[@]};do
    [ ${c[$i]} -eq 2 ] && printf "Present in both A and B: vol-%08x\n" $i
  done

will re-print:

Present in both A and B: vol-175a3b54
Present in both A and B: vol-71600106
Present in both A and B: vol-98c2bbef


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
