I have a dataframe which consists lists in columns similar to the following. The length of the lists in all columns is not same.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
I want to explode the dataframe in such a way that i get the following output-
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
How can I achieve this?
Answers:
Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.
Method 1
PySpark has added an arrays_zip function in 2.4, which eliminates the need for a Python UDF to zip the arrays.
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))
.withColumn("new", F.explode("new"))
.select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Method 2
This works,
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Use udf with zip. Those columns needed to explode have to be merged before exploding.
combine = F.udf(lambda x, y: list(zip(x, y)),
ArrayType(StructType([StructField("subs", StringType()),
StructField("grades", StringType())])))
df = df.withColumn("new", combine("Subjects", "Grades"))
.withColumn("new", F.explode("new"))
.select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Method 3
Arriving late to the party 🙂
The simplest way to go is by using inline that doesn’t have python API but is supported by selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
+----+---+---------+------+
|Name|Age| Subjects|Grades|
+----+---+---------+------+
| Bob| 16| Maths| A|
| Bob| 16| Physics| B|
| Bob| 16|Chemistry| C|
+----+---+---------+------+
Method 4
Have you tried this
df.select(explode(split(col("Subjects"))).alias("Subjects")).show()
you can convert the data frame to an RDD.
For an RDD you can use a flatMap function to separate the Subjects.
Method 5
Copy/paste function if you need to repeat this quickly and easily across a large number of columns in a dataset
cols = ["word", "stem", "pos", "ner"]
def explode_cols(self, data, cols):
data = data.withColumn('exp_combo', f.arrays_zip(*cols))
data = data.withColumn('exp_combo', f.explode('exp_combo'))
for col in cols:
data = data.withColumn(col, f.col('exp_combo.' + col))
return data.drop(f.col('exp_combo'))
result = explode_cols(data, cols)
Your welcome 🙂
All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0