How to count the features shared by each pair of groups in a pandas DataFrame in Python

Suppose I have the following dataframe where the first column is a facility and the second column is a product produced by the facility.

import pandas as pd

d = {'Facility': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)
df

  Facility Product
0        A      P1
1        A      P2
2        A      P3
3        B      P1
4        B      P2
5        B      P4
6        C      P1
7        C      P3
8        C      P5

My goal is to find out how many products each pair of facilities has in common. For instance, in the example above, Facility A and Facility B both produce P1 and P2, so I would like to get 2 as the result for that pair.
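To make the target concrete, the count for a single pair can be checked by hand with a set intersection. This is only a minimal sketch for one pair, not a solution for all pairs; the name `products_by_facility` is introduced here for illustration.

```python
import pandas as pd

d = {'Facility': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)

# Collect the set of products made by each facility
# (products_by_facility is an illustrative name).
products_by_facility = df.groupby('Facility')['Product'].apply(set)

# Products that facilities A and B have in common.
shared_ab = products_by_facility['A'] & products_by_facility['B']
print(len(shared_ab))  # 2, because A and B both make P1 and P2
```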

I have written the following code snippet, however, its execution is pretty slow. I was wondering if someone could show me a better way to achieve this.

#get all the product information
products = df['Product'].unique()
#create pair dataframe
colNames  =   ['Fac1', 'Fac2', 'Count']
linkedFacilities = pd.DataFrame(columns = colNames)
for p in products:
    #get a sub data frame with the corresponding product
    #(exact match, so that e.g. 'P1' does not also match 'P10')
    myDf = df[df['Product'] == p]
    myList = [(f1, f2)
              for f1 in myDf.Facility
              for f2 in myDf.Facility if f1 != f2]
    temp = pd.DataFrame(data=myList, columns=colNames[:2])
    #for each identified product between two facilities, starts adding a column
    temp['Count'] = 1
    linkedFacilities = pd.concat([linkedFacilities, temp], ignore_index=True)

Then, I accumulate each row and eliminate the duplicates.

import numpy as np

#count rows per ordered pair and name the resulting column "Count"
grouped = linkedFacilities.groupby(["Fac1", "Fac2"]).size().reset_index(name="Count")
#sort each pair so that (B, A) matches (A, B), then drop the mirrored duplicates
m = pd.DataFrame(np.sort(grouped[['Fac1','Fac2']], axis=1), index=grouped.index).duplicated()
grouped = grouped[~m]
grouped

A   B   2
A   C   2
B   C   1

Answers:

Method 1

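Here’s one option using merge and value_counts. Merging the frame with itself on Product pairs up the facilities that make the same product. Every pair then appears in both orders, so convert the index entries to frozensets and use duplicated to keep one row per pair: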

counts = (df.merge(df, on='Product', suffixes=('_1', '_2'))
          .query('Facility_1!=Facility_2')
          .value_counts(['Facility_1','Facility_2']))
counts = (counts.reset_index(name='Number of shared products')
          .loc[pd.Series(counts.index.map(frozenset)).duplicated(keep='last')]
          .reset_index(drop=True))

Output:

  Facility_1 Facility_2  Number of shared products
0          A          B                          2
1          A          C                          2
2          B          C                          1
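As a variation on the same self-merge idea, the mirrored pairs can be avoided up front by keeping only rows where the first facility sorts before the second. The sketch below is not the answer's original code; it assumes the facility labels are comparable (here, strings) and that each (Facility, Product) row appears only once. The names `pairs` and `shared` are illustrative.

```python
import pandas as pd

d = {'Facility': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
     'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)

# Self-merge on Product, keeping each unordered pair once
# (assumes facility labels can be compared with <).
pairs = (df.merge(df, on='Product', suffixes=('_1', '_2'))
           .query('Facility_1 < Facility_2'))

# Count the products shared by each remaining pair.
shared = (pairs.groupby(['Facility_1', 'Facility_2'])
               .size()
               .reset_index(name='Number of shared products'))
print(shared)
```

For the example data this should give the same three pairs and counts as the output above, without the frozenset step.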

Method 2

Here is an approach using a grouped aggregation:

from itertools import combinations

(df
 .groupby('Product')['Facility']
 .agg(lambda x: list(map(frozenset, combinations(set(x), r=2))))
 .explode()
 .value_counts()
 )

Output:

(A, C)    2
(A, B)    2
(C, B)    1
Name: Facility, dtype: int64

Alternatively, for a consistent ordering of the facilities:

from itertools import combinations

out = (df
 .groupby('Product')['Facility']
 .agg(lambda x: list(combinations(sorted(set(x)), r=2)))
 .explode()
 .value_counts()
 )

Output:

(A, B)    2
(A, C)    2
(B, C)    1
Name: Facility, dtype: int64

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.