How to split a Vector into columns using PySpark
Context: I have a DataFrame with two columns, word and vector, where the type of the "vector" column is VectorUDT.
I have a Python class that I use to load and process data in Spark. Among other things, I need to generate a set of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a user-defined function (UDF) to accomplish this.
I am new to Spark SQL DataFrames and ML on them (PySpark).
How can I create a custom tokenizer that, for example, removes stop words and uses some libraries from NLTK? Can I extend the default one?