apache-spark-sql Archives

How do I add a new column to a Spark DataFrame (using PySpark)?

August 17, 2022 by Magenaut

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.

‘PipelinedRDD’ object has no attribute ‘toDF’ in PySpark

August 17, 2022 by Magenaut

I’m trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark.
I’ve just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).

Encode and assemble multiple features in PySpark

August 17, 2022 by Magenaut

I have a Python class that I’m using to load and process some data in Spark. Among the various things I need to do, I’m generating a list of dummy variables derived from several columns in a Spark DataFrame. My problem is that I’m not sure how to properly define a user-defined function (UDF) to accomplish what I need.

Count number of non-NaN entries in each column of Spark DataFrame with PySpark

August 17, 2022 by Magenaut

I have a very large dataset that is loaded in Hive. It consists of about 1.9 million rows and 1450 columns. I need to determine the “coverage” of each column, that is, the fraction of rows that have non-NaN values for that column.
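To make the “coverage” definition concrete, here is a minimal plain-Python sketch. It is not Spark code and not the answer from the original post. The function name `column_coverage`, the sample rows, and the choice to count `None` as missing alongside NaN are all assumptions made for illustration.

```python
import math


def column_coverage(rows, columns):
    """Return {column: fraction of rows whose value is neither None nor NaN}."""
    total = len(rows)
    if total == 0:
        # No rows at all: report zero coverage rather than dividing by zero.
        return {name: 0.0 for name in columns}

    coverage = {}
    for name in columns:
        present = 0
        for row in rows:
            value = row.get(name)
            # Assumption: None counts as missing, just like NaN.
            is_nan = isinstance(value, float) and math.isnan(value)
            if value is not None and not is_nan:
                present += 1
        coverage[name] = present / total
    return coverage


# Hypothetical sample data: four rows, two columns.
rows = [
    {"a": 1.0, "b": float("nan")},
    {"a": 2.0, "b": 3.0},
    {"a": None, "b": 4.0},
    {"a": 5.0, "b": float("nan")},
]
result = column_coverage(rows, ["a", "b"])
print(result)
```

On a dataset of the size described, the same per-column count would normally be pushed into Spark as a single aggregation instead of looping in Python. The sketch only pins down what is being counted.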

Does Spark predicate pushdown work with JDBC?

August 17, 2022 by Magenaut

According to this

Explode in PySpark

August 17, 2022 by Magenaut

I would like to transform from a DataFrame that contains lists of words into a DataFrame with each word in its own row.

Retrieve top n in each group of a DataFrame in PySpark

August 16, 2022 by Magenaut

There’s a DataFrame in PySpark with data as below:

How to export a table DataFrame in PySpark to CSV?

August 16, 2022 by Magenaut

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame object (I have called it “table”) to a CSV file so I can manipulate it and plot the columns. How do I export the DataFrame “table” to a CSV file?

How can we JOIN two Spark SQL dataframes using a SQL-esque “LIKE” criterion?

August 16, 2022 by Magenaut

We are using the PySpark libraries interfacing with Spark 1.3.1.

PySpark: explode JSON in a column into multiple columns

August 15, 2022 by Magenaut

The data looks like this –