sampleclean.clean.deduplication.join

SimilarityJoin

class SimilarityJoin extends Serializable

Implements a Similarity Join, which in this context consists of a blocking step followed by a matching step. A Similarity Join compares two datasets and produces a single dataset containing the unique records from the union of both.

The uniqueness criterion is defined by a similarity function of class AnnotatedSimilarityFeaturizer.
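The blocking and matching steps can be illustrated with a minimal sketch on plain Scala collections. It does not use Spark or the SampleClean classes; `Record`, `tokens`, `candidates`, and `similarPairs` are hypothetical names, and token-sharing blocking with a Jaccard threshold is an assumed choice for illustration, not necessarily what the library does.

```scala
object BlockMatchSketch {
  // Hypothetical record type: an id plus one free-text field.
  final case class Record(id: Int, text: String)

  // Lowercased whitespace-separated tokens of a record.
  def tokens(r: Record): Set[String] =
    r.text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

  // Plain (unweighted) Jaccard similarity of two token sets.
  def jaccard(a: Set[String], b: Set[String]): Double =
    if (a.isEmpty && b.isEmpty) 1.0
    else (a intersect b).size.toDouble / (a union b).size

  // Blocking: only pairs that share at least one token become candidates.
  def candidates(left: Seq[Record], right: Seq[Record]): Set[(Record, Record)] = {
    // Inverted index from token to the right-hand records containing it.
    val index: Map[String, Seq[Record]] =
      right.flatMap(b => tokens(b).toSeq.map(t => t -> b))
        .groupBy(_._1)
        .map { case (t, pairs) => t -> pairs.map(_._2) }
    left.flatMap { a =>
      tokens(a).toSeq
        .flatMap(t => index.getOrElse(t, Seq.empty[Record]))
        .map(b => (a, b))
    }.toSet
  }

  // Matching: keep the candidate pairs whose similarity reaches the threshold.
  def similarPairs(left: Seq[Record], right: Seq[Record],
                   threshold: Double): Set[(Record, Record)] =
    candidates(left, right).filter { case (a, b) =>
      jaccard(tokens(a), tokens(b)) >= threshold
    }
}
```

Blocking keeps the expensive comparison away from pairs that cannot match: records with no token in common never reach `jaccard`.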

Linear Supertypes
Serializable, Serializable, AnyRef, Any
Known Subclasses

Instance Constructors

  1. new SimilarityJoin(sc: SparkContext, simfeature: AnnotatedSimilarityFeaturizer, weighted: Boolean = false)

    sc

    Spark Context

    simfeature

    Similarity Featurizer used to decide whether a pair of records is similar or not. Featurizing a pair of rows in this context will return 1.0 if the pair is similar or 0.0 otherwise.

    weighted

    If set to true, the algorithm automatically calculates token weights. By default, token weights are derived from token IDF (inverse document frequency) values.

    Weighting may make pair comparisons more reliable at the cost of extra overhead. However, optimizations such as Prefix Filtering, used in some implementations of AnnotatedSimilarityFeaturizer, may actually reduce overhead when the dataset contains many common tokens.
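To make the effect of the weighted option concrete, here is a small sketch of IDF-based token weights and a weighted Jaccard score on plain Scala collections. The formula idf(t) = ln(N / df(t)) and the names `idfWeights` and `weightedJaccard` are assumptions for illustration; the library's actual weighting may differ.

```scala
object IdfWeightSketch {
  // Assumed weighting: idf(t) = ln(N / df(t)), where N is the number of
  // records and df(t) is the number of records containing token t.
  def idfWeights(docs: Seq[Set[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    docs.flatMap(_.toSeq)
      .groupBy(identity)
      .map { case (token, occurrences) => token -> math.log(n / occurrences.size) }
  }

  // Weighted Jaccard: total weight of shared tokens over total weight of all tokens.
  def weightedJaccard(a: Set[String], b: Set[String],
                      w: Map[String, Double]): Double = {
    // Convert to Seq before mapping so that equal weights are not collapsed.
    def total(s: Set[String]): Double = s.toSeq.map(t => w.getOrElse(t, 0.0)).sum
    val unionWeight = total(a union b)
    if (unionWeight == 0.0) 0.0 else total(a intersect b) / unionWeight
  }
}
```

Under this weighting, a token that appears in most records contributes little to the score, so two records that share only a common token score lower than they would with unweighted Jaccard.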

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  9. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  10. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  11. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  12. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  13. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  14. def join(rddA: RDD[Row], rddB: RDD[Row], sampleA: Boolean = false): RDD[(Row, Row)]

    Join two RDDs. The resulting RDD will contain pairs of rows that are considered similar by the AnnotatedSimilarityFeaturizer. If rddA is the same as rddB, then the algorithm will trigger a self-join.

    The default implementation is naive; subclasses should override it with an optimized one.

    rddA

    First RDD of rows

    rddB

    Second RDD of rows

    sampleA

    True if rddA is a sample of rddB

    returns

    an RDD with pairs of similar rows.

  15. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. final def notify(): Unit

    Definition Classes
    AnyRef
  17. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  18. def setSimilarityFeaturizer(newSimilarity: String): Unit

  19. var simfeature: AnnotatedSimilarityFeaturizer

    Similarity Featurizer used to decide whether a pair of records is similar or not. Featurizing a pair of rows in this context will return 1.0 if the pair is similar or 0.0 otherwise.

  20. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  21. def toString(): String

    Definition Classes
    AnyRef → Any
  22. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  23. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  24. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
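The behavior documented for `join` above (all pairs the featurizer scores as similar, with a self-join when both inputs are the same dataset) can be sketched without Spark as follows. `Row` here is a hypothetical stand-in for Spark's `Row`, the `similar` function stands in for the featurizer's 1.0 / 0.0 result, and emitting each unordered pair once in the self-join case is an assumption, not confirmed library behavior.

```scala
object NaiveJoinSketch {
  // Hypothetical stand-in for org.apache.spark.sql.Row.
  type Row = Seq[String]

  // Naive quadratic join: score every pair and keep those scored 1.0.
  def join(a: IndexedSeq[Row], b: IndexedSeq[Row],
           similar: (Row, Row) => Double): Seq[(Row, Row)] =
    if (a eq b) {
      // Self-join (assumed semantics): skip (x, x) and mirrored duplicates.
      for {
        i <- a.indices
        j <- (i + 1) until a.size
        if similar(a(i), a(j)) == 1.0
      } yield (a(i), a(j))
    } else {
      // General case: compare every row of a with every row of b.
      for {
        x <- a
        y <- b
        if similar(x, y) == 1.0
      } yield (x, y)
    }
}
```

This compares every pair, which is the cost an optimized subclass would aim to avoid, for example by blocking before matching.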
