sampleclean.clean.deduplication

RecordDeduplication

object RecordDeduplication

Linear Supertypes
AnyRef, Any

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def deduplication(scc: SampleCleanContext, sampleName: String, colNames: List[String], threshold: Double = 0.9, weighting: Boolean = true): RecordDeduplication

    This method builds a Record Deduplication algorithm that resolves duplicates automatically. It uses several default values and is designed for simple deduplication tasks. For more flexibility in parameters (such as setting a Similarity Featurizer and Tokenizer), refer to the RecordDeduplication class.

    This algorithm uses the Jaccard Similarity for pairwise comparisons and a word tokenizer.
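    As a rough illustration of the comparison described above, the following standalone sketch computes the Jaccard similarity of two records after word tokenization. It does not use the SampleClean API: the names JaccardSketch, tokenize, jaccard and isDuplicate are hypothetical, the whitespace-based tokenizer is an assumption, and so is the choice of >= when comparing against the threshold.

    ```scala
    object JaccardSketch {
      // Hypothetical word tokenizer (assumption): lowercase, then split on whitespace.
      def tokenize(s: String): Set[String] =
        s.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

      // Jaccard similarity |A ∩ B| / |A ∪ B|; two empty token sets are treated as identical.
      def jaccard(a: Set[String], b: Set[String]): Double = {
        val union = a union b
        if (union.isEmpty) 1.0 else (a intersect b).size.toDouble / union.size
      }

      // Flags a pair as a candidate duplicate when its similarity reaches the threshold.
      // The default 0.9 mirrors the default in the signature above; >= is an assumption.
      def isDuplicate(r1: String, r2: String, threshold: Double = 0.9): Boolean =
        jaccard(tokenize(r1), tokenize(r2)) >= threshold
    }
    ```

    With this sketch, "Acme Corp Inc" and "acme corp" share two of three distinct tokens, so their similarity is 2/3 and the pair falls below the default threshold of 0.9.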

    scc

    SampleClean Context

    sampleName
    colNames

    names of attributes that will be used for deduplication

    threshold

    Similarity threshold used in the algorithm. Must be between 0.0 and 1.0; defaults to 0.9.

    weighting

    If set to true, the algorithm automatically calculates token weights. By default, token weights are derived from token IDF (inverse document frequency) values.

    Adding weights into the join might lead to more reliable pair comparisons and speed up the algorithm if there is an abundance of common words in the dataset.
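    To make the effect of weighting concrete, here is a minimal standalone sketch of an IDF-weighted Jaccard similarity. It is not the library's implementation: WeightedJaccardSketch and its members are hypothetical names, the smoothed formula ln(1 + N / df) is one common IDF variant chosen here as an assumption, and tokens missing from the weight map are given weight 0.

    ```scala
    object WeightedJaccardSketch {
      // Hypothetical word tokenizer (assumption): lowercase, then split on whitespace.
      def tokenize(s: String): Set[String] =
        s.toLowerCase.split("\\s+").filter(_.nonEmpty).toSet

      // Smoothed IDF per token: ln(1 + N / df), where df counts the records containing the token.
      // The exact formula used by the library is not stated in this page; this is an assumption.
      def idfWeights(records: Seq[String]): Map[String, Double] = {
        val n = records.size.toDouble
        records
          .flatMap(tokenize)      // each record contributes a token at most once
          .groupBy(identity)
          .map { case (token, occurrences) => token -> math.log(1.0 + n / occurrences.size) }
      }

      // Weighted Jaccard: total weight of shared tokens over total weight of all tokens.
      def weightedJaccard(a: Set[String], b: Set[String], w: Map[String, Double]): Double = {
        // Convert to Seq before mapping so equal weights are not collapsed by Set semantics.
        def total(ts: Set[String]): Double = ts.toSeq.map(t => w.getOrElse(t, 0.0)).sum
        val denom = total(a union b)
        if (denom == 0.0) 1.0 else total(a intersect b) / denom
      }
    }
    ```

    In a small corpus such as Seq("acme corp", "acme inc", "globex corp", "acme ltd"), the frequent token "acme" receives a lower weight than the rare token "inc", so "acme corp" and "acme inc" score lower under the weighted measure than under plain Jaccard (1/3). This is the sense in which common words contribute less to a match.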

  9. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  11. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  12. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  13. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  14. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  15. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  16. final def notify(): Unit

    Definition Classes
    AnyRef
  17. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  18. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  19. def toString(): String

    Definition Classes
    AnyRef → Any
  20. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  21. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  22. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
