pyspark Archives - Page 3 of 6

Configuring Spark to work with Jupyter Notebook and Anaconda

August 17, 2022 by Magenaut

I’ve spent a few days now trying to make Spark work with my Jupyter Notebook and Anaconda. Here’s what my .bash_profile looks like:

Does Spark predicate pushdown work with JDBC?

August 17, 2022 by Magenaut

According to this

Spark Error – Unsupported class file major version

August 17, 2022 by Magenaut

I’m trying to install Spark on my Mac. I’ve used Homebrew to install Spark 2.4.0 and Scala. I’ve installed PySpark in my Anaconda environment and am using PyCharm for development. I’ve exported to my bash profile:

Explode in PySpark

August 17, 2022 by Magenaut

I would like to transform from a DataFrame that contains lists of words into a DataFrame with each word in its own row.

I can’t seem to get --py-files on Spark to work

August 16, 2022 by Magenaut

I’m having a problem using Python on Spark. My application has some dependencies, such as numpy, pandas, astropy, etc. I cannot use virtualenv to create an environment with all dependencies, since the nodes on the cluster do not have any common mountpoint or filesystem besides HDFS. Therefore I am stuck with using spark-submit --py-files. I package the contents of site-packages in a ZIP file and submit the job with the --py-files=dependencies.zip option (as suggested in Easiest way to install Python dependencies on Spark executor nodes?). However, the nodes on the cluster still do not seem to see the modules inside, and they throw an ImportError such as this when importing numpy.

collect_list by preserving order based on another variable

August 16, 2022 by Magenaut

I am trying to create a new column of lists in PySpark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:

importing pyspark in python shell

August 16, 2022 by Magenaut

This is a copy of someone else’s question on another forum that was never answered, so I thought I’d re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

Retrieve top n in each group of a DataFrame in PySpark

August 16, 2022 by Magenaut

There’s a DataFrame in PySpark with data as below:

How can we JOIN two Spark SQL dataframes using a SQL-esque “LIKE” criterion?

August 16, 2022 by Magenaut

We are using the PySpark libraries interfacing with Spark 1.3.1.

Rename nested field in Spark DataFrame

August 16, 2022 by Magenaut

Having a dataframe df in Spark: