apache-spark Archives - Page 2 of 6

Spark Dataframe distinguish columns with duplicated name

August 18, 2022 by Magenaut

As far as I know, a Spark DataFrame can have multiple columns with the same name, as shown in the DataFrame snapshot below:

How to change a dataframe column from String type to Double type in PySpark?

August 18, 2022 by Magenaut

I have a DataFrame with a column of type String.
I want to change the column type to Double in PySpark.

How to change dataframe column names in pyspark?

August 18, 2022 by Magenaut

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with a simple command:

How to perform union on two DataFrames with different amounts of columns in spark?

August 17, 2022 by Magenaut

I have 2 DataFrames:

How to turn off INFO logging in Spark?

August 17, 2022 by Magenaut

I installed Spark using the AWS EC2 guide. I can launch the program fine using the bin/pyspark script to get to the Spark prompt, and I can also complete the Quick Start guide successfully.

Pyspark: Split multiple array columns into rows

August 17, 2022 by Magenaut

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as is.
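
To make the desired transformation concrete, here is a minimal sketch in plain Python (not the Spark API). It assumes the single row is represented as a dict; the function name `split_list_columns` and the sample column names `a`, `b`, `c` are hypothetical and chosen only for illustration.

```python
def split_list_columns(row):
    """Expand one dict-shaped row into one row per list position.

    Non-list values are copied unchanged into every output row.
    """
    list_cols = [name for name, value in row.items() if isinstance(value, list)]
    if not list_cols:
        # Nothing to split: return the row as the only output row.
        return [dict(row)]

    length = len(row[list_cols[0]])
    # The question states that all list columns share one length;
    # fail loudly if that assumption does not hold.
    if any(len(row[name]) != length for name in list_cols):
        raise ValueError("list columns must all have the same length")

    return [
        {
            name: (value[i] if name in list_cols else value)
            for name, value in row.items()
        }
        for i in range(length)
    ]


# Hypothetical input: "a" is a single value, "b" and "c" are lists.
row = {"a": 1, "b": [1, 2, 3], "c": [7, 8, 9]}
result = split_list_columns(row)
```

In PySpark itself, one common approach to the same reshaping is to zip the array columns together and explode the zipped column (for example with `arrays_zip` and `explode` from `pyspark.sql.functions`), though the right choice depends on the Spark version in use.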

creating spark data structure from multiline record

August 17, 2022 by Magenaut

I’m trying to read a Retrosheet event file into Spark. The event file is structured as follows.

How do I add a new column to a Spark DataFrame (using PySpark)?

August 17, 2022 by Magenaut

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.

‘PipelinedRDD’ object has no attribute ‘toDF’ in PySpark

August 17, 2022 by Magenaut

I’m trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark.
I’ve just installed a fresh Spark 1.5.0 on Ubuntu 14.04 (no spark-env.sh configured).

Encode and assemble multiple features in PySpark

August 17, 2022 by Magenaut

I have a Python class that I’m using to load and process some data in Spark. Among various things I need to do, I’m generating a list of dummy variables derived from various columns in a Spark dataframe. My problem is that I’m not sure how to properly define a User Defined Function to accomplish what I need.