- If I have an RDD, how do I tell whether the data is in key:value format? Is there a way to find out, something like type(object), which tells me an object’s type? I tried print type(rdd.take(1)), but it just says <type 'list'>.
- Let’s say I have data like (x,1),(x,2),(y,1),(y,3), and I use groupByKey and get (x,(1,2)),(y,(1,3)). Is there a way to define (1,2) and (1,3) as values where x and y are keys? Or does a key have to be a single value? I noticed that if I use reduceByKey with a sum function to get the data ((x,3),(y,4)), then it becomes much easier to define this data as a key-value pair.
Answers:
Method 1
Python is a dynamically typed language, and PySpark doesn’t use any special type for key-value pairs. The only requirement for an object to be considered valid data for PairRDD operations is that it can be unpacked as follows:
k, v = kv
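As a rough way to apply that unpacking rule to the first question, the check can be wrapped in a small helper. This is only a sketch: is_key_value is a hypothetical name, not part of PySpark, and it tests structure only (anything that unpacks into exactly two items passes).

```python
def is_key_value(obj):
    """Return True if obj can be unpacked as `k, v = obj` (hypothetical helper)."""
    try:
        k, v = obj
    except (TypeError, ValueError):
        # TypeError: obj is not iterable; ValueError: wrong number of items
        return False
    return True


print(is_key_value(("x", 1)))       # a plain pair
print(is_key_value(("x", (1, 2))))  # the value may itself be a tuple
print(is_key_value(("x", 1, 2)))    # three items: not a pair
print(is_key_value(42))             # not iterable: not a pair
```

With a real RDD, one would presumably call it on a single element, for example is_key_value(rdd.first()). Note that a two-character string such as "ab" also passes, so the check says nothing about whether the pair is meaningful.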
Typically you would use a two-element tuple due to its semantics (an immutable object of fixed size) and its similarity to Scala Product classes. But this is just a convention, and nothing stops you from doing something like this:
# key_value.py
class KeyValue(object):
    def __init__(self, k, v):
        self.k = k
        self.v = v

    def __iter__(self):
        # Yield the key, then the value, so that `k, v = obj` works
        for x in [self.k, self.v]:
            yield x
from operator import add
from key_value import KeyValue

# sc is an existing SparkContext
rdd = sc.parallelize(
    [KeyValue("foo", 1), KeyValue("foo", 2), KeyValue("bar", 0)])
rdd.reduceByKey(add).collect()
## [('bar', 0), ('foo', 3)]
This makes an arbitrary class behave like a key-value pair. So, once again, if something can be correctly unpacked as a pair of objects, then it is a valid key-value pair. Implementing the __len__ and __getitem__ magic methods should work as well. Probably the most elegant way to handle this is to use namedtuples.
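The two alternatives mentioned above can be sketched in plain Python, without a Spark context, to show that both unpack as a pair. The names Pair and SeqPair are made up for illustration; in an actual PySpark job such classes would also need to be picklable and importable on the workers, which is an assumption not tested here.

```python
from collections import namedtuple

# A namedtuple is still a tuple, so it unpacks like one
Pair = namedtuple("Pair", ["key", "value"])


class SeqPair(object):
    """Pair-like class built on __len__ and __getitem__ (hypothetical)."""

    def __init__(self, k, v):
        self._items = (k, v)

    def __len__(self):
        return 2

    def __getitem__(self, i):
        # Raises IndexError for i >= 2, which ends the unpacking iteration
        return self._items[i]


k1, v1 = Pair("foo", 1)
k2, v2 = SeqPair("bar", 0)
print(k1, v1)
print(k2, v2)
```

The namedtuple version has the advantage that fields stay readable by name (pair.key, pair.value) while the object remains an ordinary immutable tuple.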
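Also, rdd.take(n) returns a list of at most n elements, so type(rdd.take(1)) will always be list. To see what an element looks like, check the type of the element itself, for example type(rdd.first()).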
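The difference between the container and its elements can be illustrated with a plain list standing in for the result of rdd.take(1). The sample data is assumed, chosen to match the question, so that the snippet runs without Spark.

```python
# Stand-in for rdd.take(1): a list holding one (key, value) element
sample = [("x", 1)]

container_type = type(sample)    # type of what take() hands back
element_type = type(sample[0])   # type of the record inside it

print(container_type.__name__)
print(element_type.__name__)
```

Here the container is a list regardless of what the RDD holds, while the element type reflects the actual records.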
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.