apache-spark Archives

Read and group json files by date element using pyspark

August 22, 2022 by Magenaut

I have multiple JSON files (about 10 TB) in an S3 bucket, and I need to organize these files by a date element present in every JSON document.

Convert pyspark string to date format

August 21, 2022 by Magenaut

I have a PySpark DataFrame with a string column in the format MM-dd-yyyy, and I am attempting to convert it into a date column.

How to split Vector into columns – using PySpark

August 21, 2022 by Magenaut

Context: I have a DataFrame with 2 columns, word and vector, where the column type of “vector” is VectorUDT.

How to add a constant column in a Spark DataFrame?

August 20, 2022 by Magenaut

I want to add a column in a DataFrame with some arbitrary value (that is the same for each row). I get an error when I use withColumn as follows:

How to find median and quantiles using Spark

August 20, 2022 by Magenaut

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median.

How to use JDBC source to write and read data in (Py)Spark?

August 20, 2022 by Magenaut

The goal of this question is to document:

Calling Java/Scala function from a task

August 20, 2022 by Magenaut

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transforamtion. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Transpose column to row with Spark

August 19, 2022 by Magenaut

I’m trying to transpose some columns of my table to rows.
I’m using Python and Spark 1.5.0. Here is my initial table:

Load CSV file with Spark

August 19, 2022 by Magenaut

I’m new to Spark and I’m trying to read CSV data from a file with Spark.
Here’s what I am doing:

How to link PyCharm with PySpark?

August 18, 2022 by Magenaut

I’m new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook: