How to find median and quantiles using Spark
How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median.
I’d like to create a function that takes a (sorted) list as its argument and outputs a list containing each element’s corresponding percentile.
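To make the percentile function concrete, here is a minimal local sketch in plain Python (not Spark). The name `percentile_ranks` is hypothetical, and it assumes one particular definition: an element's percentile is the percentage of values in the list that are less than or equal to it, so tied values share the same percentile. Other definitions exist and would give different numbers.

```python
from bisect import bisect_right


def percentile_ranks(sorted_values):
    """Return the percentile of each element of an ascending sorted list.

    Assumed definition: percentile = 100 * (count of values <= element) / n.
    """
    n = len(sorted_values)
    if n == 0:
        return []
    # bisect_right returns the number of entries <= v in a sorted list,
    # so each lookup is O(log n) instead of a full scan.
    return [100.0 * bisect_right(sorted_values, v) / n for v in sorted_values]


if __name__ == "__main__":
    print(percentile_ranks([1, 2, 2, 4]))
```

The input must already be sorted in ascending order; on unsorted input the binary search gives meaningless results.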