rdd Archives - Magenaut

How to find median and quantiles using Spark

August 20, 2022 by Magenaut

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect in order to find the median.

‘PipelinedRDD’ object has no attribute ‘toDF’ in PySpark

August 17, 2022 by Magenaut

I’m trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark.
I’ve just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).

PySpark DataFrames – way to enumerate without converting to Pandas?

August 14, 2022 by Magenaut

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records, so that I can access a record by its index (or select a group of records by an index range).

Spark union of multiple RDDs

August 14, 2022 by Magenaut

In my Pig code I do this:

Reduce a key-value pair into a key-list pair with Apache Spark

August 13, 2022 by Magenaut

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor: