I am curious what would be an efficient way of removing duplicates from data objects like this:
testdata =[ ['9034968', 'ETH'], ['14160113', 'ETH'], ['9034968', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15724032', 'ETH'], ['15481740', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['10307528', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['15481740', 'ETH'], ['15379365', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15379365', 'ETH'] ]
For each pair, the numeric string on the left together with the type on the right determines whether an element is unique. The return value should be a list of lists in the same form as testdata, but with only the unique values kept.
Answers:
Method 1
You can use a set:
unique_data = [list(x) for x in set(tuple(x) for x in testdata)]
You can also see this page which benchmarks a variety of methods that either preserve or don’t preserve order.
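The set-based one-liner above does not keep the original order of the pairs. If order matters, a minimal sketch of an order-preserving variant could look like the following; the function name unique_pairs_in_order and the shortened sample list are illustrative only and are not part of the original answer.

```python
def unique_pairs_in_order(pairs):
    """Return the first occurrence of each pair, keeping input order."""
    seen = set()
    result = []
    for pair in pairs:
        key = tuple(pair)  # lists are unhashable, so use a tuple as the set key
        if key not in seen:
            seen.add(key)
            result.append(list(pair))
    return result


# Shortened sample in the same shape as testdata.
sample = [
    ['9034968', 'ETH'],
    ['14160113', 'ETH'],
    ['9034968', 'ETH'],
    ['11111', 'NOT'],
    ['11111', 'NOT'],
]
print(unique_pairs_in_order(sample))
# [['9034968', 'ETH'], ['14160113', 'ETH'], ['11111', 'NOT']]
```

Each pair is visited once and set membership checks are constant time on average, so this should scale linearly with the length of the input.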
Method 2
I tried @Mark’s answer and got an error. Converting the list and each of its elements into a tuple made it work. Not sure if this is the best way, though.
list(map(list, set(map(lambda i: tuple(i), testdata))))
Of course the same thing can be expressed using a list comprehension instead.
[list(i) for i in set(tuple(i) for i in testdata)]
I am using Python 2.6.2.
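The answer does not say which error occurred. A likely candidate is the TypeError that Python raises when a list is placed in a set, because lists are not hashable. The sketch below, written for Python 3, only illustrates that behaviour and why converting to tuples avoids it; the variable names are made up for the example.

```python
pair_list = ['9034968', 'ETH']

try:
    {pair_list}  # a set can only hold hashable members, and a list is not hashable
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

pair_tuple = tuple(pair_list)
# Equal tuples hash the same, so the set keeps only one of them.
unique = {pair_tuple, ('9034968', 'ETH')}

print(list_is_hashable)  # False
print(unique)            # {('9034968', 'ETH')}
```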
Update
@Mark has since changed his answer. His current answer uses tuples and will work. So will mine 🙂
Update 2
Thanks to @Mark. I have changed my answer to return a list of lists rather than a list of tuples.
Method 3
testdata = [ ['9034968', 'ETH'], ['14160113', 'ETH'], ['9034968', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15724032', 'ETH'], ['15481740', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['10307528', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['15481740', 'ETH'], ['15379365', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15379365', 'ETH'] ]
# Join each pair into a single string so it can be stored in a set.
concatData = [x[0] + x[1] for x in testdata]
print(concatData)
# The built-in set replaces the sets module, which no longer exists in Python 3.
uniqueSet = set(concatData)
# Split each string back apart; this assumes the type code is exactly 3 characters long.
uniqueList = [[t[0:-3], t[-3:]] for t in uniqueSet]
print(uniqueList)
Method 4
Expanding a bit on @Mark Byers solution, you can also just do one list comprehension and typecast to get what you need:
testdata = list(set(tuple(x) for x in testdata))
Also, if you don’t like list comprehensions, which many find confusing, you can do the same with a for loop:
for i, e in enumerate(testdata):
    testdata[i] = tuple(e)
testdata = list(set(testdata))
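Both variants above return tuples in whatever order the set happens to iterate. If a reproducible order and inner lists are wanted, one possible sketch is to sort the unique tuples before converting them back. The sort key used here (type first, then the id as an integer) is an arbitrary choice for illustration, not something the answer specifies.

```python
sample = [
    ['9555269', 'NOT'],
    ['11111', 'NOT'],
    ['9555269', 'NOT'],
    ['15379365', 'ETH'],
]

# Deduplicate via a set of tuples, then sort by type and numeric id.
unique_sorted = sorted(
    {tuple(pair) for pair in sample},
    key=lambda p: (p[1], int(p[0])),
)

# Convert the tuples back into lists to match the requested output shape.
result = [list(p) for p in unique_sorted]
print(result)
# [['15379365', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT']]
```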
Method 5
Use unique in numpy to solve this:
import numpy as np
np.unique(np.array(testdata), axis=0)
Note that the axis keyword must be specified; otherwise the array is flattened first.
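Alternatively, use vstack on the set of unique rows (wrapped in a list here, since recent NumPy versions expect a sequence rather than a bare set):
np.vstack(list({tuple(row) for row in testdata}))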
Method 6
If you have a list of objects, then you can modify @Mark Byers’ answer to:
unique_data = [list(x) for x in set(tuple(x.testList) for x in testdata)]
where testdata is a list of objects, each of which has a list attribute named testList.
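To make the object case concrete, here is a small sketch. The Record class is hypothetical and exists only for this example; the only thing taken from the answer is the testList attribute name.

```python
class Record:
    """Hypothetical container; only the testList attribute comes from the answer."""

    def __init__(self, testList):
        self.testList = testList


records = [
    Record(['9034968', 'ETH']),
    Record(['11111', 'NOT']),
    Record(['9034968', 'ETH']),
]

# Same idea as before: hash each object's list as a tuple, then convert back.
unique_data = [list(x) for x in set(tuple(r.testList) for r in records)]

# sorted() is used only to make the printed order reproducible.
print(sorted(unique_data))
# [['11111', 'NOT'], ['9034968', 'ETH']]
```

Note that this returns the unique lists, not the original objects.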
Method 7
I was about to post my own take on this until I noticed that @pyfunc had already come up with something similar. I’ll post my take on this problem anyway in case it’s helpful.
testdata =[ ['9034968', 'ETH'], ['14160113', 'ETH'], ['9034968', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15724032', 'ETH'], ['15481740', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['10307528', 'ETH'], ['15481757', 'ETH'], ['15481724', 'ETH'], ['15481740', 'ETH'], ['15379365', 'ETH'], ['11111', 'NOT'], ['9555269', 'NOT'], ['15379365', 'ETH']
]
flatdata = [p[0] + "%" + p[1] for p in testdata]
flatdata = list(set(flatdata))
testdata = [p.split("%") for p in flatdata]
print(testdata)
Basically, you concatenate each element of your list into a single string using a list comprehension, so that you have a list of single strings. This is then much easier to turn into a set, which makes it unique. Then you simply split it on the other end and convert it back to your original list.
I don’t know how this compares in terms of performance but it’s a simple and easy-to-understand solution I think.
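One caveat with the join-and-split approach is that it only round-trips correctly if the separator never appears in the data. A hedged sketch of the same idea with that check added is shown below; the function name dedupe_via_join and its sep parameter are assumptions made for illustration.

```python
def dedupe_via_join(pairs, sep="%"):
    """Deduplicate two-element string pairs by joining them on sep."""
    for left, right in pairs:
        # If sep occurs inside a field, split() would return the wrong pieces.
        if sep in left or sep in right:
            raise ValueError("separator %r occurs in pair %r" % (sep, [left, right]))
    flat = {left + sep + right for left, right in pairs}
    return [s.split(sep) for s in flat]


sample = [['9034968', 'ETH'], ['11111', 'NOT'], ['9034968', 'ETH']]

# sorted() is used only to make the printed order reproducible.
print(sorted(dedupe_via_join(sample)))
# [['11111', 'NOT'], ['9034968', 'ETH']]
```

Converting each pair to a tuple, as in the earlier methods, avoids the separator question entirely.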
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.