apache-spark-sql Archives - Page 3 of 3

Create Spark DataFrame. Can not infer schema for type

August 15, 2022 by Magenaut

Could someone help me solve this problem I have with a Spark DataFrame?

Spark DataFrame: Computing row-wise mean (or any aggregate operation)

August 15, 2022 by Magenaut

I have a Spark DataFrame loaded in memory, and I want to take the mean (or any other aggregate operation) across the columns of each row. How would I do that? (In NumPy, this is known as applying an operation over axis=1.)

GroupBy column and filter rows with maximum value in Pyspark

August 14, 2022 by Magenaut

I am almost certain this has been asked before, but a search through Stack Overflow did not answer my question. It is not a duplicate of [2], since I want the maximum value, not the most frequent item. I am new to PySpark and trying to do something really simple: I want to groupBy column “A” and then keep only the row of each group that has the maximum value in column “B”. Like this:

Passing a data frame column and external list to udf under withColumn

August 14, 2022 by Magenaut

I have a Spark DataFrame with the following structure. The bodyText_token column holds the tokens (a processed set of words), and I have a nested list of defined keywords.

Pivot String column on Pyspark Dataframe

August 14, 2022 by Magenaut

I have a simple dataframe like this:

How to use a Scala class inside Pyspark

August 14, 2022 by Magenaut

I’ve been searching for a while for a way to use a Scala class in PySpark, and I haven’t found any documentation or guide on the subject.

Dividing complex rows of dataframe to simple rows in Pyspark

August 13, 2022 by Magenaut

I have this code:

Why is Apache-Spark – Python so slow locally as compared to pandas?

August 12, 2022 by Magenaut

A Spark newbie here.
I recently started playing around with Spark on my local machine on two cores by using the command:

PySpark converting a column of type ‘map’ to multiple columns in a dataframe

August 12, 2022 by Magenaut

Input

I have a column Parameters of type map of the form:

>>> from pyspark.sql import SQLContext
>>> sqlContext = SQLContext(sc)
>>> d = [{'Parameters': {'foo': '1', 'bar': '2', 'baz': 'aaa'}}]
>>> df = sqlContext.createDataFrame(d)
>>> df.collect()
[Row(Parameters={'foo': '1', 'bar': '2', 'baz': 'aaa'})]

Output

I want to reshape it in pyspark so that all the …

How to explode multiple columns of a dataframe in pyspark

August 11, 2022 by Magenaut

I have a dataframe whose columns contain lists, similar to the following. The lists are not the same length in every column.