How to split a Vector into columns using PySpark
Context: I have a DataFrame with two columns, word and vector, where the type of the "vector" column is VectorUDT.
I have a Python class that I use to load and process data in Spark. Among other things, I need to generate a set of dummy variables derived from various columns in a Spark DataFrame. My problem is that I'm not sure how to properly define a user-defined function (UDF) to accomplish this.
I am new to Spark SQL DataFrames and ML on them (PySpark).
How can I create a custom tokenizer that, for example, removes stop words and uses some libraries from NLTK? Can I extend the default one?