Suppose I have the following dataframe where the first column is a facility and the second column is a product produced by the facility.
d = {'Facility': ['A', 'A', 'A', 'B', 'B','B', 'C', 'C', 'C'],
'Product': ['P1', 'P2', 'P3', 'P1', 'P2', 'P4', 'P1', 'P3', 'P5']}
df = pd.DataFrame(data=d)
df
Facility Product
0 A P1
1 A P2
2 A P3
3 B P1
4 B P2
5 B P4
6 C P1
7 C P3
8 C P5
My goal is to figure out how many same product is produced by each pair of facility. For instance, for the given example, Facility A and Facility B produce both P1 and P2, thus, I would like to get two as a results.
I have written the following code snippet, however, its execution is pretty slow. I was wondering if someone could show me a better way to achieve this.
#get all the product information
products = df['Product'].unique()
#create pair dataframe
colNames = ['Fac1', 'Fac2', 'Count']
linkedFacilities = pd.DataFrame(columns = colNames)
for p in products:
#get a sub data frame with the corresponding product
myDf = df[df['Product'].str.contains(p)]
myList = [ (f1, f2)
for f1 in myDf.Facility
for f2 in myDf.Facility if f1 != f2]
temp = pd.DataFrame(data=myList, columns=colNames[:2])
#for each identified product between two facilities, starts adding a column
temp['Count'] = 1
linkedFacilities = pd.concat([linkedFacilities, temp],ignore_index=True)
Then, I accumulate each row and eliminate the duplicates.
grouped = linkedFacilities.groupby(["Fac1", "Fac2"]).size().reset_index()
grouped.rename(index={0: "Fac1", 1: "Fac2", 2: "Count"})
m = pd.DataFrame(np.sort(grouped[['Fac1','Fac2']], axis=1), index=grouped.index).duplicated()
grouped = grouped[~m]
grouped
A B 2
A C 2
B C 1
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
Here’s one option using merge and value_counts. Then remove pairs by making the index frozensets and filter out duplicates using duplicated:
counts = (df.merge(df, on='Product', suffixes=('_1', '_2'))
.query('Facility_1!=Facility_2')
.value_counts(['Facility_1','Facility_2']))
counts = (counts.reset_index(name='Number of shared products')
.loc[pd.Series(counts.index.map(frozenset)).duplicated(keep='last')]
.reset_index(drop=True))
Output:
Facility_1 Facility_2 Number of shared products
0 A B 2
1 A C 2
2 B C 1
Method 2
Here is an approach using a grouped aggregation:
from itertools import combinations
(df
.groupby('Product')['Facility']
.agg(lambda x: list(map(frozenset, combinations(set(x), r=2))))
.explode()
.value_counts()
)
Output:
(A, C) 2 (A, B) 2 (C, B) 1 Name: Facility, dtype: int64
Alternative, for a consistent ordering of the facilities:
from itertools import combinations
out = (df
.groupby('Product')['Facility']
.agg(lambda x: list(combinations(sorted(set(x)), r=2)))
.explode()
.value_counts()
)
Output:
(A, B) 2 (A, C) 2 (B, C) 1 Name: Facility, dtype: int64
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0