apache-spark Archives - Page 5 of 6

Spark DataFrame: Computing row-wise mean (or any aggregate operation)

August 15, 2022 by Magenaut

I have a Spark DataFrame loaded in memory, and I want to take the mean (or any other aggregate operation) across the columns of each row. How would I do that? (In NumPy, this is known as applying an operation over axis=1.)
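To make the axis=1 idea concrete, here is a minimal plain-Python sketch of a row-wise mean. It is only a model of the semantics, not Spark code: the rows, the column names a, b, c and the helper row_means are all hypothetical. In PySpark the same shape is usually expressed by adding the column expressions together and dividing by the number of columns, but that is an assumption about the eventual solution, not something the question states.

```python
def row_means(rows, columns):
    """Return the mean of the named columns for each row (axis=1 in NumPy terms)."""
    return [sum(row[c] for c in columns) / len(columns) for row in rows]


# Hypothetical data standing in for DataFrame rows.
rows = [
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"a": 4.0, "b": 5.0, "c": 9.0},
]

means = row_means(rows, ["a", "b", "c"])
print(means)
```

Any other aggregate (max, min, sum) fits the same pattern by swapping the expression inside the list comprehension.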

GroupBy column and filter rows with maximum value in Pyspark

August 14, 2022 by Magenaut

I am almost certain this has been asked before, but a search through Stack Overflow did not answer my question. This is not a duplicate of [2], since I want the maximum value, not the most frequent item. I am new to PySpark and trying to do something really simple: I want to groupBy column “A” and then keep only the row of each group that has the maximum value in column “B”. Like this:

PySpark DataFrames – way to enumerate without converting to Pandas?

August 14, 2022 by Magenaut

I have a very big pyspark.sql.dataframe.DataFrame named df.
I need some way of enumerating records, so that I can access a record by a given index (or select a group of records by index range).

Passing a data frame column and external list to udf under withColumn

August 14, 2022 by Magenaut

I have a Spark dataframe with the following structure. The bodyText_token column holds the tokens (the processed set of words). I also have a nested list of defined keywords.

Pivot String column on Pyspark Dataframe

August 14, 2022 by Magenaut

I have a simple dataframe like this:

Spark union of multiple RDDs

August 14, 2022 by Magenaut

In my pig code I do this:

How to determine if object is a valid key-value pair in PySpark

August 14, 2022 by Magenaut

If I have an RDD, how can I tell whether its data is in key:value format? Is there a way to check this, something like how type(object) tells me an object’s type? I tried print type(rdd.take(1)), but it just says <type 'list'>. Let’s say I have data like (x,1),(x,2),(y,1),(y,3) and I use groupByKey and …
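Since take(n) always returns a Python list, the type of the list itself says nothing; the elements are what need inspecting. The following is a minimal sketch of that check, assuming that "key-value" means each record is a 2-element tuple or list. The helper names looks_like_pair and sample_is_key_value are hypothetical, and the literal list stands in for the result of rdd.take(4).

```python
def looks_like_pair(record):
    """True if the record is a 2-element tuple or list, i.e. shaped like (key, value)."""
    return isinstance(record, (tuple, list)) and len(record) == 2


def sample_is_key_value(sample):
    """True if the sample is non-empty and every record in it is pair-shaped."""
    return len(sample) > 0 and all(looks_like_pair(r) for r in sample)


# Stand-in for rdd.take(4) on the data from the question.
sample = [("x", 1), ("x", 2), ("y", 1), ("y", 3)]

print(sample_is_key_value(sample))     # pair-shaped records
print(sample_is_key_value([1, 2, 3]))  # plain values, not pairs
```

A check on a sample only shows that the inspected records are pair-shaped; it does not guarantee anything about the rest of the data.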

Pyspark: Parse a column of json strings

August 14, 2022 by Magenaut

I have a PySpark dataframe consisting of one column, called json, where each row is a Unicode string of JSON. I’d like to parse each row and return a new dataframe where each row is the parsed JSON.
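As a rough illustration of the per-row transformation being asked for, here is a plain-Python sketch that parses a list of JSON strings with the standard json module. It models only the row-level logic, not a Spark job: the sample strings and the helper parse_rows are hypothetical, and mapping malformed rows to None is an assumption made for this example.

```python
import json


def parse_rows(json_strings):
    """Parse each JSON string into a Python object; malformed rows become None."""
    parsed = []
    for text in json_strings:
        try:
            parsed.append(json.loads(text))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            parsed.append(None)
    return parsed


# Hypothetical contents of the single "json" column.
rows = ['{"a": 1, "b": "x"}', '{"a": 2, "b": "y"}', "not json"]

parsed_rows = parse_rows(rows)
print(parsed_rows)
```

In Spark itself the same function body could be applied per record, or the parsing could be delegated to Spark's own JSON reader; which fits better depends on whether the schema is known in advance.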

How to use a Scala class inside Pyspark

August 14, 2022 by Magenaut

I’ve been searching for a while for a way to use a Scala class in PySpark, and I haven’t found any documentation or guide on the subject.

Reduce a key-value pair into a key-list pair with Apache Spark

August 13, 2022 by Magenaut

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something of the flavor:
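One way to get this shape out of a reduce-style operation is to wrap each value in a single-element list first and then merge lists by concatenation. The sketch below models that in plain Python: reduce_by_key is a hypothetical stand-in that imitates the per-key merging of reduceByKey, and the sample pairs are made up. In PySpark the equivalent pipeline would plausibly be mapValues followed by reduceByKey, with groupByKey as the more direct alternative, but treat that as an assumption rather than the accepted answer.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for reduceByKey: merge the values of each key with func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input pairs (K, V1), (K, V2), ...
pairs = [("K", "V1"), ("K", "V2"), ("J", "W1"), ("K", "V3")]

# Step 1: wrap every value in a list so that merging is list concatenation.
as_lists = [(key, [value]) for key, value in pairs]

# Step 2: concatenate the lists that share a key.
grouped = reduce_by_key(as_lists, lambda left, right: left + right)
print(grouped)
```

The wrapping step matters because the merge function must take and return the same type; concatenating two lists satisfies that, while appending a bare value to a list does not. On a real cluster the order of values inside each list is not guaranteed.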