How can we JOIN two Spark SQL dataframes using a SQL-esque “LIKE” criterion?

We are using the PySpark libraries interfacing with Spark 1.3.1.

We have two dataframes, documents_df := {document_id, document_text} and keywords_df := {keyword}. We would like to JOIN the two dataframes and return a resulting dataframe with {document_id, keyword} pairs, using the criteria that the keyword_df.keyword appears in the document_df.document_text string.

In PostgreSQL, for example, we could achieve this using an ON clause of the form:

document_df.document_text ilike '%' || keyword_df.keyword || '%'

In PySpark however, I cannot get any form of join syntax to work. Has anybody achieved something like this before?

With kind regards,

Will

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

It is possible in a two different ways but generally speaking not recommended. First lets create a dummy data:

from pyspark.sql import Row

document_row = Row("document_id", "document_text")
keyword_row = Row("keyword") 

documents_df = sc.parallelize([
    document_row(1L, "apache spark is the best"),
    document_row(2L, "erlang rocks"),
    document_row(3L, "but haskell is better")
]).toDF()

keywords_df = sc.parallelize([
    keyword_row("erlang"),
    keyword_row("haskell"),
    keyword_row("spark")
]).toDF()
  1. Hive UDFs
    documents_df.registerTempTable("documents")
    keywords_df.registerTempTable("keywords")
    
    query = """SELECT document_id, keyword
        FROM documents JOIN keywords
        ON document_text LIKE CONCAT('%', keyword, '%')"""
    
    like_with_hive_udf = sqlContext.sql(query)
    like_with_hive_udf.show()
    
    ## +-----------+-------+
    ## |document_id|keyword|
    ## +-----------+-------+
    ## |          1|  spark|
    ## |          2| erlang|
    ## |          3|haskell|
    ## +-----------+-------+
  2. Python UDF
    from pyspark.sql.functions import udf, col 
    from pyspark.sql.types import BooleanType
    
    # Of you can replace `in` with a regular expression
    contains = udf(lambda s, q: q in s, BooleanType())
    
    like_with_python_udf = (documents_df.join(keywords_df)
        .where(contains(col("document_text"), col("keyword")))
        .select(col("document_id"), col("keyword")))
    like_with_python_udf.show()
    
    ## +-----------+-------+
    ## |document_id|keyword|
    ## +-----------+-------+
    ## |          1|  spark|
    ## |          2| erlang|
    ## |          3|haskell|
    ## +-----------+-------+

Why not recommended? Because in both cases it requires a Cartesian product:

like_with_hive_udf.explain()

## TungstenProject [document_id#2L,keyword#4]
##  Filter document_text#3 LIKE concat(%,keyword#4,%)
##   CartesianProduct
##    Scan PhysicalRDD[document_id#2L,document_text#3]
##    Scan PhysicalRDD[keyword#4]

like_with_python_udf.explain()

## TungstenProject [document_id#2L,keyword#4]
##  Filter pythonUDF#13
##   !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3,keyword#4), ...
##    CartesianProduct
##     Scan PhysicalRDD[document_id#2L,document_text#3]
##     Scan PhysicalRDD[keyword#4]

There are other ways to achieve a similar effect without a full Cartesian.

  1. Join on tokenized document – useful if keywords list is to large to be handled in a memory of a single machine
    from pyspark.ml.feature import Tokenizer
    from pyspark.sql.functions import explode
    
    tokenizer = Tokenizer(inputCol="document_text", outputCol="words")
    
    tokenized = (tokenizer.transform(documents_df)
        .select(col("document_id"), explode(col("words")).alias("token")))
    
    like_with_tokenizer = (tokenized
        .join(keywords_df, col("token") == col("keyword"))
        .drop("token"))
    
    like_with_tokenizer.show()
    
    ## +-----------+-------+
    ## |document_id|keyword|
    ## +-----------+-------+
    ## |          3|haskell|
    ## |          1|  spark|
    ## |          2| erlang|
    ## +-----------+-------+

    This requires shuffle but not Cartesian:

    like_with_tokenizer.explain()
    
    ## TungstenProject [document_id#2L,keyword#4]
    ##  SortMergeJoin [token#29], [keyword#4]
    ##   TungstenSort [token#29 ASC], false, 0
    ##    TungstenExchange hashpartitioning(token#29)
    ##     TungstenProject [document_id#2L,token#29]
    ##      !Generate explode(words#27), true, false, [document_id#2L, ...
    ##       ConvertToSafe
    ##        TungstenProject [document_id#2L,UDF(document_text#3) AS words#27]
    ##         Scan PhysicalRDD[document_id#2L,document_text#3]
    ##   TungstenSort [keyword#4 ASC], false, 0
    ##    TungstenExchange hashpartitioning(keyword#4)
    ##     ConvertToUnsafe
    ##      Scan PhysicalRDD[keyword#4]
  2. Python UDF and broadcast variable – if keywords list is relatively small
    from pyspark.sql.types import ArrayType, StringType
    
    keywords = sc.broadcast(set(
        keywords_df.map(lambda row: row[0]).collect()))
    
    bd_contains = udf(
        lambda s: list(set(s.split()) & keywords.value), 
        ArrayType(StringType()))
    
    
    like_with_bd = (documents_df.select(
        col("document_id"), 
        explode(bd_contains(col("document_text"))).alias("keyword")))
    
    like_with_bd.show()
    
    ## +-----------+-------+
    ## |document_id|keyword|
    ## +-----------+-------+
    ## |          1|  spark|
    ## |          2| erlang|
    ## |          3|haskell|
    ## +-----------+-------+

    It requires neither shuffle nor Cartesian but you still have to transfer broadcast variable to each worker node.

    like_with_bd.explain()
    
    ## TungstenProject [document_id#2L,keyword#46]
    ##  !Generate explode(pythonUDF#47), true, false, ...
    ##   ConvertToSafe
    ##    TungstenProject [document_id#2L,pythonUDF#47]
    ##     !BatchPythonEvaluation PythonUDF#<lambda>(document_text#3), ...
    ##      Scan PhysicalRDD[document_id#2L,document_text#3]
  3. Since Spark 1.6.0 you can mark a small data frame using sql.functions.broadcast to get a similar effect as above without using UDFs and explicit broadcast variables. Reusing tokenized data:
    from pyspark.sql.functions import broadcast
    
    like_with_tokenizer_and_bd = (broadcast(tokenized)
        .join(keywords_df, col("token") == col("keyword"))
        .drop("token"))
    
    like_with_tokenizer.explain()
    
    ## TungstenProject [document_id#3L,keyword#5]
    ##  BroadcastHashJoin [token#10], [keyword#5], BuildLeft
    ##   TungstenProject [document_id#3L,token#10]
    ##    !Generate explode(words#8), true, false, ...
    ##     ConvertToSafe
    ##      TungstenProject [document_id#3L,UDF(document_text#4) AS words#8]
    ##       Scan PhysicalRDD[document_id#3L,document_text#4]
    ##   ConvertToUnsafe
    ##    Scan PhysicalRDD[keyword#5]

Related:

Method 2

Precise way of doing this is as below:(Bit slow but accurate)

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

from pyspark.sql import Row


def string_match_percentage(col_1, col_2, confidence):
    s = col_1.lower()
    t = col_2.lower()

    global row, col
    rows = len(s) + 1
    cols = len(t) + 1
    array_diffrence = np.zeros((rows, cols), dtype=int)

    for i in range(1, rows):
        for k in range(1, cols):
            array_diffrence[i][0] = i
            array_diffrence[0][k] = k

    for col in range(1, cols):
        for row in range(1, rows):
            if s<div class="su-row"></div> == t[col - 1]:
                cost = 0
            else:
                cost = 2
            array_diffrence<div class="su-row"></div>[col] = min(array_diffrence<div class="su-row"></div>[col] + 1,
                                            array_diffrence<div class="su-row"></div>[col - 1] + 1,
                                            array_diffrence<div class="su-row"></div>[col - 1] + cost)
    match_percentage = ((len(s) + len(t)) - array_diffrence<div class="su-row"></div>[col]) / (len(s) + len(t)) * 100
    if match_percentage >= confidence:
        return True
    else:
        return False


document_row = Row("document_id", "document_text")
keyword_row = Row("keyword")

documents_df = sc.parallelize([
    document_row(1, "google llc"),
    document_row(2, "blackfiled llc"),
    document_row(3, "yahoo llc")
]).toDF()

keywords_df = sc.parallelize([
    keyword_row("yahoo"),
    keyword_row("google"),
    keyword_row("apple")
]).toDF()

conditional_contains = udf(lambda s, q: string_match_percentage(s, q, confidence=70), BooleanType())

like_joined_df = (documents_df.crossJoin(keywords_df)
                        .where(conditional_contains(col("document_text"), col("keyword")))
                        .select(col("document_id"), col("keyword"), col("document_text")))
like_joined_df.show()

Output:

# +-----------+-------+-------------+
# |document_id|keyword|document_text|
# +-----------+-------+-------------+
# |          1| google|   google llc|
# |          3|  yahoo|    yahoo llc|
# +-----------+-------+-------------+


All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x