sampleclean.api

SampleCleanContext

class SampleCleanContext extends AnyRef

Analogous to the SparkContext, the SampleCleanContext provides a handle to the current session and the basic API for manipulating its data structures. The data is assumed to reside initially in a Hive store.

In its current implementation, the SampleCleanContext supports either persisting the data in Hive or keeping it in memory as an RDD.
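
As a rough illustration of how a session might look, the sketch below strings together the constructor and a few of the members documented on this page. The application name, the local master setting, and the table names restaurants and restaurants_sample are hypothetical. The sketch also assumes a Spark release that still provides SchemaRDD, SampleClean on the classpath, and a Hive store that already contains the base table.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import sampleclean.api.SampleCleanContext

object SampleCleanSessionSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical application name and master; adjust for your cluster.
    val conf = new SparkConf().setAppName("sampleclean-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val scc = new SampleCleanContext(sc)

    try {
      // "restaurants" and "restaurants_sample" are hypothetical table names.
      // persist = true keeps the sample in Hive; persist = false would keep
      // it in memory as an RDD only.
      val (clean, dirty) =
        scc.initialize("restaurants", "restaurants_sample", 0.1, persist = true)
      println(s"clean rows: ${clean.count()}, dirty rows: ${dirty.count()}")

      // The query may be rewritten by SampleClean before it is executed.
      val result = scc.hql("SELECT COUNT(*) FROM restaurants")
      result.collect().foreach(println)
    } finally {
      // Drop the temporary Hive tables so the samples do not outlive the session.
      scc.closeHiveSession()
      sc.stop()
    }
  }
}
```

Calling closeHiveSession in a finally block is a design choice of this sketch, not a requirement stated by the API; it simply makes sure the cleanup runs even if a query fails.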

Annotations
@serializable()
Linear Supertypes
AnyRef, Any

Instance Constructors

  1. new SampleCleanContext(sc: SparkContext)

    sc

    an existing Spark Context

Value Members

  1. final def !=(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  2. final def !=(arg0: Any): Boolean

    Definition Classes
    Any
  3. final def ##(): Int

    Definition Classes
    AnyRef → Any
  4. final def ==(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  5. final def ==(arg0: Any): Boolean

    Definition Classes
    Any
  6. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  7. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def closeHiveSession(): Unit

    Cleans up after initializeHive by dropping any temporary tables in Hive. If you use Hive and do not call this function, the samples persist between sessions because they are written to disk.

  9. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  10. def equals(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  11. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  12. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  13. def getCleanSample(sampleTable: String): SchemaRDD

    Returns an RDD which points to a sample of a base table.

    sampleTable

    sample table name

  14. def getHiveContext(): HiveContext

    Returns the HiveContext

  15. def getSamplingRatio(tableName: String): Double

    Returns the sampling ratio used to create a sample table. The table should exist in Hive. The clean sample name can be obtained through getCleanSampleName on the class value qb.

    tableName

    table name

  16. def hashCode(): Int

    Definition Classes
    AnyRef → Any
  17. def hql(query: String): SchemaRDD

    Executes a HiveQL query, potentially rewriting it for SampleClean.

  18. def initialize(baseTable: String, sampleTable: String, samplingRatio: Double = 1.0, persist: Boolean = true): (SchemaRDD, SchemaRDD)

    Initializes the clean and dirty samples as SchemaRDDs and returns them as a tuple (Clean, Dirty). The persist flag controls whether the RDD is also persisted in Hive.

    baseTable

    name of base table

    sampleTable

    name of sample table

    samplingRatio

    sampling ratio between 0.0 and 1.0

    persist

    set to true to persist RDD in HIVE (default = true)

  19. def initializeConsistent(baseTable: String, sampleTable: String, onKey: String, samplingRatio: Double = 1.0, persist: Boolean = true): (SchemaRDD, SchemaRDD)

    Initializes the clean and dirty consistently hashed samples as SchemaRDDs and returns them as a tuple (Clean, Dirty). The persist flag controls whether the RDD is also persisted in Hive.

    baseTable

    name of base table

    sampleTable

    name of sample table

    onKey

    name of column that contains unique identifiers

    samplingRatio

    sampling ratio between 0.0 and 1.0

    persist

    set to true to persist RDD in HIVE (default = true)

  20. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  21. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  22. final def notify(): Unit

    Definition Classes
    AnyRef
  23. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  24. def resetSample(sampleTable: String): SchemaRDD

    Resets the clean sample to the initial dirty sample

    sampleTable

    sample table name

  25. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  26. def toString(): String

    Definition Classes
    AnyRef → Any
  27. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  28. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  29. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  30. def writeToParent(sampleName: String): SchemaRDD

    Given a working set, this function applies its changes back to the parent table.
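
The idea behind the consistently hashed samples produced by initializeConsistent can be illustrated without Spark. This page does not document how SampleClean hashes the onKey column, so the hash-and-threshold rule below is an assumption chosen purely for illustration, and the names ConsistentSampleSketch, inSample, and buckets are hypothetical. The point of the sketch is that membership in the sample depends only on the key and the sampling ratio, never on row order.

```scala
object ConsistentSampleSketch {
  // Keep a row when its key hashes into the lowest `ratio` fraction of buckets.
  // This rule is an illustrative assumption, not SampleClean's implementation.
  def inSample(key: String, ratio: Double, buckets: Int = 10000): Boolean = {
    require(ratio >= 0.0 && ratio <= 1.0, "ratio must be between 0.0 and 1.0")
    // Map the (possibly negative) hash code to a bucket in [0, buckets).
    val bucket = ((key.hashCode % buckets) + buckets) % buckets
    bucket < ratio * buckets
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 20).map(i => s"id-$i")

    // Sampling the same keys in a different order selects the same set.
    val forward = keys.filter(k => inSample(k, 0.25)).toSet
    val backward = keys.reverse.filter(k => inSample(k, 0.25)).toSet
    println(s"same sample regardless of order: ${forward == backward}")

    // A smaller ratio selects a subset of what a larger ratio selects.
    val small = keys.filter(k => inSample(k, 0.1)).toSet
    println(s"0.1 sample is a subset of 0.25 sample: ${small.subsetOf(forward)}")
  }
}
```

Under this rule, two tables sampled on the same key column with the same ratio keep the same keys, which is one plausible reason to sample on a column of unique identifiers.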
