sampleclean.clean.featurize

AnnotatedSimilarityFeaturizer

abstract class AnnotatedSimilarityFeaturizer extends Featurizer

One particular use for similarity featurizers is in Similarity Joins. A Similarity Join compares pairs of rows and decides whether each pair is similar. To compare a pair of rows, a Similarity Featurizer applies some criterion and outputs a single pair feature: 1.0 if the rows are similar, 0.0 otherwise.

A special class of Similarity Featurizers has properties that allow for a type of optimization called Prefix Filtering.

We encode this logic into AnnotatedSimilarityFeaturizer.
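
As a rough illustration of why Prefix Filtering helps, the sketch below assumes Jaccard similarity over token sets and a lexicographic global token order; this page states neither, and `PrefixFilterSketch`, `prefixLength`, `prefix` and `mayBeSimilar` are hypothetical names, not part of the sampleclean API. Under those assumptions, two token sets can only reach the threshold if their short sorted prefixes share at least one token, so many pairs can be rejected without computing a full similarity.

```scala
object PrefixFilterSketch {
  // Prefix length for Jaccard threshold t and set size n: n - ceil(t * n) + 1.
  // The small epsilon guards against floating-point error pushing ceil one too high.
  def prefixLength(size: Int, threshold: Double): Int =
    size - math.ceil(threshold * size - 1e-9).toInt + 1

  // Sort the distinct tokens by a fixed global order (here: lexicographic)
  // and keep only the prefix.
  def prefix(tokens: Seq[String], threshold: Double): Seq[String] = {
    val sorted = tokens.distinct.sorted
    sorted.take(prefixLength(sorted.size, threshold))
  }

  // Necessary (not sufficient) condition: the two prefixes share a token.
  // Pairs that fail this test cannot reach the threshold and can be skipped.
  def mayBeSimilar(a: Seq[String], b: Seq[String], threshold: Double): Boolean =
    prefix(a, threshold).intersect(prefix(b, threshold)).nonEmpty
}
```

Pairs that pass `mayBeSimilar` are only candidates; they still need a full similarity computation to confirm the match.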

Annotations
@serializable()
Linear Supertypes
Featurizer, AnyRef, Any

Instance Constructors

  1. new AnnotatedSimilarityFeaturizer(copyConst: AnnotatedSimilarityFeaturizer)

  2. new AnnotatedSimilarityFeaturizer(colNames: List[String], context: List[String], tokenizer: Tokenizer, threshold: Double, minSize: Int, schemaMap: Map[Int, Int] = null)

    colNames

    the names of the columns that will be used for pairwise comparisons

    context

    names of all columns for the existing dataset

    tokenizer

    Function used to tokenize Strings

    threshold

    similarity threshold; its meaning depends on the similarity function used

    minSize

    used by Prefix Filtering to filter out pairs of records that are similar but too short to be considered "strongly" similar. For example, [Bob, bob] could be considered similar by the algorithm but not by a person, since there is no information about Bob's last name. In this case, minSize could be set to 2: pair members with fewer than 2 tokens will be omitted.

    schemaMap

    maps columns of one table to columns of the other. If null, both tables are assumed to have the same schema.
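
The effect of minSize can be sketched as follows. This is a minimal illustration, not the library's code: it assumes a whitespace tokenizer and assumes a pair is dropped as soon as either member has too few tokens. `MinSizeSketch`, `tokenize`, `longEnough` and `keepPair` are hypothetical names.

```scala
object MinSizeSketch {
  // Assumed tokenizer: split on whitespace and drop empty tokens.
  def tokenize(s: String): Seq[String] =
    s.trim.split("\\s+").filter(_.nonEmpty).toSeq

  // A record qualifies only if it has at least minSize tokens.
  def longEnough(record: String, minSize: Int): Boolean =
    tokenize(record).size >= minSize

  // Assumption: a pair is kept only when both members are long enough.
  def keepPair(a: String, b: String, minSize: Int): Boolean =
    longEnough(a, minSize) && longEnough(b, minSize)
}
```

With minSize set to 2, `keepPair("Bob", "bob", 2)` is false, while `keepPair("Bob Smith", "bob smith", 2)` is true.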

Abstract Value Members

  1. abstract def optimizedSimilarity(tokens1: Seq[String], tokens2: Seq[String], thresh: Double, tokenWeights: Map[String, Double]): (Boolean, Double)

    Calculates similarity between two lists of tokens.

    Uses a known threshold > 0 for optimization. If no threshold > 0 is specified, use similarity() instead.

    tokens1

    first token list.

    tokens2

    second token list.

    thresh

    specified threshold.

    tokenWeights

    token-to-weight map

    returns

    (true if they are similar, similarity)

  2. abstract def similarity(tokens1: Seq[String], tokens2: Seq[String], tokenWeights: Map[String, Double]): Double

    Calculates similarity between two lists of tokens.

    tokens1

    first token list.

    tokens2

    second token list.

    tokenWeights

    token-to-weight map
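
One possible shape for the two abstract members above is a weighted Jaccard similarity. The sketch below is standalone and does not extend the real class. It assumes non-negative weights, a default weight of 1.0 for tokens missing from the map, and returns 0.0 as the similarity when a pair is rejected early; none of these choices is confirmed by this page. The early exit relies on the fact that weighted Jaccard can never exceed the ratio of the smaller total weight to the larger one.

```scala
object WeightedJaccardSketch {
  // Total weight of a token set. Missing tokens default to 1.0 (an assumption).
  // An iterator is used so that equal weights are not collapsed by Set.map.
  private def weight(tokens: Set[String], w: Map[String, Double]): Double =
    tokens.iterator.map(t => w.getOrElse(t, 1.0)).sum

  // Weighted Jaccard: weight of the intersection over weight of the union.
  def similarity(tokens1: Seq[String], tokens2: Seq[String],
                 tokenWeights: Map[String, Double]): Double = {
    val (a, b) = (tokens1.toSet, tokens2.toSet)
    val unionWeight = weight(a | b, tokenWeights)
    if (unionWeight == 0.0) 0.0 else weight(a & b, tokenWeights) / unionWeight
  }

  // Threshold-aware variant: skip the full computation when the size bound
  // min(w1, w2) / max(w1, w2) already falls below the threshold.
  def optimizedSimilarity(tokens1: Seq[String], tokens2: Seq[String], thresh: Double,
                          tokenWeights: Map[String, Double]): (Boolean, Double) = {
    val w1 = weight(tokens1.toSet, tokenWeights)
    val w2 = weight(tokens2.toSet, tokenWeights)
    val larger = math.max(w1, w2)
    val upperBound = if (larger == 0.0) 0.0 else math.min(w1, w2) / larger
    if (upperBound < thresh) (false, 0.0) // cannot reach the threshold
    else {
      val sim = similarity(tokens1, tokens2, tokenWeights)
      (sim >= thresh, sim)
    }
  }
}
```

In this sketch the second tuple element is only meaningful when the pair survives the early exit, which is one reason to call similarity() directly when no positive threshold is available.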

Concrete Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. var cols: List[Int]

    Definition Classes
    Featurizer
  9. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  11. def featurize[K, V](rows: Set[Row], params: Map[K, V] = null): (Set[Row], Array[Double])

    Takes a set of rows and a token-to-weight map and determines whether the rows are similar.

    K

    String (tokens)

    V

    Double (weights)

    rows

    rows to compare. If the set contains more than two rows, the algorithm compares the first and the last row in the set.

    params

    token-to-weight map

    returns

    The first element of the tuple is a set of primary keys and the second element is Array(1.0) if the pair is similar, Array(0.0) otherwise.

    Definition Classes
    AnnotatedSimilarityFeaturizer → Featurizer
  12. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  13. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  14. def getSimilarityDouble[K, V](rows: Set[Row], params: Map[K, V] = null): (Set[Row], Double)

    Takes a set of rows and a token-to-weight map and calculates the similarity value of the rows.

    K

    String (tokens)

    V

    Double (weights)

    rows

    rows to compare. If the set contains more than two rows, the algorithm compares the first and the last row in the set.

    params

    token-to-weight map

    returns

    The first element of the tuple is a set of primary keys and the second element is the similarity value.

  15. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  16. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  17. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  18. final def notify(): Unit

    Definition Classes
    AnyRef
  19. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  20. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  21. var threshold: Double

    similarity threshold; its meaning depends on the similarity function used

  22. def toString(): String

    Definition Classes
    AnnotatedSimilarityFeaturizer → AnyRef → Any
  23. var tokenizer: Tokenizer

    Function used to tokenize Strings

  24. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  25. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  26. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
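
The contract described for featurize and getSimilarityDouble above can be sketched without Spark. This is an illustration under stated assumptions, not the library's implementation: `Record` is a hypothetical stand-in for a row, rows are passed as a Seq so that "first" and "last" are well defined, the tokenizer lower-cases and splits on whitespace, the similarity function is unweighted Jaccard, and the token-to-weight map is left out.

```scala
object FeaturizeSketch {
  // Hypothetical stand-in for a row: a primary key plus the text to compare.
  final case class Record(id: String, text: String)

  // Assumed tokenizer: lower-case, then split on whitespace.
  private def tokens(r: Record): Set[String] =
    r.text.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

  // Assumed similarity function: unweighted Jaccard over token sets.
  private def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a | b).size
    if (unionSize == 0) 0.0 else (a & b).size.toDouble / unionSize
  }

  // Compares the first and the last row and returns their keys with the raw similarity.
  def getSimilarityDouble(rows: Seq[Record]): (Set[String], Double) = {
    require(rows.size >= 2, "need at least two rows to compare")
    val (first, last) = (rows.head, rows.last)
    (Set(first.id, last.id), jaccard(tokens(first), tokens(last)))
  }

  // Same comparison, but the similarity is collapsed into a single 1.0 / 0.0 feature.
  def featurize(rows: Seq[Record], threshold: Double): (Set[String], Array[Double]) = {
    val (ids, sim) = getSimilarityDouble(rows)
    (ids, Array(if (sim >= threshold) 1.0 else 0.0))
  }
}
```

The only difference between the two methods in this sketch is the second tuple element: a raw similarity in one case, a thresholded one-element feature array in the other.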
