sampleclean.clean.deduplication.join

BroadcastJoin

class BroadcastJoin extends SimilarityJoin

A Broadcast Join is an implementation of a Similarity Join that uses an optimization called Prefix Filtering. In a distributed environment, this optimization involves broadcasting a series of maps to each node.

Note: the algorithm may collect large RDDs into maps held in driver memory, so Java heap errors can occur. If they do, increase the memory allocated to the driver through the Spark configuration property spark.driver.memory.
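The Prefix Filtering idea mentioned above can be illustrated without Spark. The following is a minimal single-machine sketch, not the library's implementation: `PrefixFilterSketch`, `Record`, and every method name are hypothetical, and plain Jaccard similarity is assumed. Tokens are sorted by a global order (rarest first), and only the first `n - ceil(t * n) + 1` tokens of each record are indexed; two records whose Jaccard similarity is at least `t` must share a token in those prefixes, so only records that collide in the index need a full comparison. The inverted index built here is roughly the kind of map that a distributed version would broadcast.

```scala
object PrefixFilterSketch {
  // Hypothetical record shape: (record id, token set).
  type Record = (Int, Set[String])

  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // Global token order: rarest tokens first, ties broken alphabetically.
  def tokenRank(records: Seq[Record]): Map[String, Int] = {
    val freq = records.flatMap(_._2).groupBy(identity).map { case (t, occ) => t -> occ.size }
    freq.toSeq.sortBy { case (t, n) => (n, t) }.map(_._1).zipWithIndex.toMap
  }

  // Prefix of length n - ceil(t * n) + 1 under the global order.
  // The small epsilon guards against floating-point error in t * n.
  def prefix(tokens: Set[String], rank: Map[String, Int], threshold: Double): Seq[String] = {
    val sorted = tokens.toSeq.sortBy(rank)
    val len = sorted.size - math.ceil(threshold * sorted.size - 1e-9).toInt + 1
    sorted.take(len)
  }

  def similarPairs(records: Seq[Record], threshold: Double): Set[(Int, Int)] = {
    val rank = tokenRank(records)
    // Inverted index: prefix token -> ids of records whose prefix contains it.
    val index: Map[String, Seq[Int]] = records
      .flatMap { case (id, toks) => prefix(toks, rank, threshold).map(_ -> id) }
      .groupBy(_._1)
      .map { case (t, ps) => t -> ps.map(_._2) }
    val byId = records.toMap
    // Candidate pairs are records that share at least one prefix token.
    val candidates = for {
      ids <- index.values.toSeq
      i <- ids
      j <- ids
      if i < j
    } yield (i, j)
    // Verify each candidate with the exact similarity.
    candidates.toSet.filter { case (i, j) => jaccard(byId(i), byId(j)) >= threshold }
  }
}
```

With three records `1 -> Set("a", "b", "c", "d")`, `2 -> Set("a", "b", "c", "e")`, and `3 -> Set("x", "y", "z")` and a threshold of 0.5, only records 1 and 2 share a prefix token, so only that pair is compared in full.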

Linear Supertypes
SimilarityJoin, Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new BroadcastJoin(sc: SparkContext, featurizer: AnnotatedSimilarityFeaturizer, weighted: Boolean = false)

    sc

    Spark Context

    featurizer

    Similarity Featurizer optimized for Prefix Filtering

    weighted

    If set to true, the algorithm automatically calculates token weights. By default, token weights are derived from token IDF (inverse document frequency) values.

    Adding weights to the join may make pair comparisons more reliable, but it can also add overhead to the algorithm. However, optimizations such as Prefix Filtering, used in some implementations of AnnotatedSimilarityFeaturizer, may actually reduce overhead when the dataset contains many common tokens.
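To make the effect of IDF-based token weights concrete, here is a small sketch in plain Scala. It assumes the common definition `idf(t) = log(N / df(t))` and a weighted Jaccard similarity; the formula actually used by the library may differ, and `IdfWeightSketch` and its methods are hypothetical names introduced only for illustration.

```scala
object IdfWeightSketch {
  // idf(t) = log(N / df(t)), where df(t) is the number of records containing t.
  // Each record is a Set, so a token is counted at most once per record.
  def idfWeights(docs: Seq[Set[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs.flatten.groupBy(identity).map { case (t, occ) => t -> math.log(n / occ.size) }
  }

  // Weighted Jaccard: total weight of shared tokens over total weight of all tokens.
  // Tokens missing from the weight map count as 0.
  def weightedJaccard(a: Set[String], b: Set[String], w: Map[String, Double]): Double = {
    val union = (a union b).toSeq.map(w.getOrElse(_, 0.0)).sum
    val inter = (a intersect b).toSeq.map(w.getOrElse(_, 0.0)).sum
    if (union == 0.0) 0.0 else inter / union
  }
}
```

For the records `Set("inc", "apple")`, `Set("inc", "google")`, and `Set("inc", "apple", "corp")`, the token `inc` appears in every record and so receives weight 0. The first two records then have a weighted similarity of 0, although their unweighted Jaccard similarity is 1/3, which shows how weighting discounts matches on very common tokens.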

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. def join(rddA: RDD[Row], rddB: RDD[Row], sampleA: Boolean = false): RDD[(Row, Row)]

    Perform a Broadcast Join


    rddA

    First RDD of rows

    rddB

    Second RDD of rows

    sampleA

    true if rddA is a sample of rddB

    returns

    an RDD with pairs of similar rows.

    Definition Classes
    BroadcastJoin → SimilarityJoin
    Annotations
    @Override()
  15. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. final def notify(): Unit

    Definition Classes
    AnyRef
  17. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  18. def setSimilarityFeaturizer(newSimilarity: String): Unit

    Definition Classes
    SimilarityJoin
  19. var simfeature: AnnotatedSimilarityFeaturizer

    Definition Classes
    SimilarityJoin
  20. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  21. def toString(): String

    Definition Classes
    AnyRef → Any
  22. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
