Big Data Types is a library that can safely convert types between different Big Data systems.
The power of the library
The library implements a few abstract types that can hold any kind of structure and, using type-class derivation, it can convert between multiple types without any code relating them directly. In other words, there is no need to implement a transformation from type `A` to type `B`; the library will do it for you.
As an example, let's say we have a generic type called `Generic`, and we want to convert from type `A` to type `B`. If we implement the conversion from `A` to `Generic` and the conversion from `Generic` to `B`, we can automatically convert from `A` to `B`, even though there is not a single line of code mixing `A` and `B`.
We can also do the opposite: we can convert from `B` to `A` by implementing the conversion from `B` to `Generic` and the conversion from `Generic` to `A`. Now we can convert between `A` and `B` as we wish.
Now comes the good part. If we introduce a new type `C` and want conversions with the existing types, without `Generic` we would need to convert from `A` to `C` and from `B` to `C`, plus the opposite directions (4 new implementations). If we then introduce `D`, we would need the conversions from `A` to `D`, from `B` to `D`, and from `C` to `D`, plus their opposites (6 new implementations). In general, connecting `n` types directly requires `n·(n-1)` conversions, which is neither scalable nor maintainable.
Having this `Generic` type means that when we introduce `C`, we only need to implement the conversions from `C` to `Generic` and from `Generic` to `C`, without worrying at all about the other types or their implementations.
Moreover, the new conversion is likely to be very similar to existing ones, so we can reuse some of the code.
It is important to know that one of these types is Scala types themselves. So if we want to convert from Scala types (like case classes) to another type, we only need to implement `Generic -> newType`.
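The whole idea can be sketched with plain functions (hypothetical `A`, `B`, and `Generic` types, not the library's real API): each type only knows how to talk to `Generic`, yet `A -> B` and `B -> A` fall out for free.

```scala
// Hypothetical types: A and B never reference each other.
sealed trait Generic
case class GenericInt(value: Int) extends Generic

case class A(number: Int)
case class B(number: Int)

object Conversions {
  def aToGeneric(a: A): Generic = GenericInt(a.number)                   // A -> Generic
  def genericToA(g: Generic): A = g match { case GenericInt(v) => A(v) } // Generic -> A
  def bToGeneric(b: B): Generic = GenericInt(b.number)                   // B -> Generic
  def genericToB(g: Generic): B = g match { case GenericInt(v) => B(v) } // Generic -> B

  // Derived conversions: not a single line of code mixes A and B directly.
  def aToB(a: A): B = genericToB(aToGeneric(a))
  def bToA(b: B): A = genericToA(bToGeneric(b))
}
```

In the library this wiring happens implicitly through type-classes, but the shape of the composition is the same.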
How the library works
Modules
As mentioned, the library has multiple modules, each of them representing a different system with its own types. Each module implements the conversion from and to `Generic`.
For now, the modules are `core` (for Scala types and common code), `BigQuery`, `Cassandra`, `Circe`, and `Spark`.
To use the library, only the modules that are needed should be imported. For example, if we want to convert from Scala types to `BigQuery` types, we only need to import the `BigQuery` module (the `core` module is always included as a dependency).
If we want to convert from `Spark` to `BigQuery`, we need to import both the `Spark` and `BigQuery` modules.
Generic type
The `Generic` type is called `SqlType`, and it's implemented as a sealed trait that can hold any kind of structure. In Scala 3 this type is implemented as an enum, but both represent the same thing.
Repeated values
Usually, there are two ways of implementing a repeated value like an array. Some systems use a container type like `Array` or `List`, and others flag a basic type as `repeated`. This `SqlType` implementation uses the latter, so any basic type has a `mode` that can be `Required`, `Nullable`, or `Repeated`. This is closer to the `BigQuery` implementation.
This implementation does not allow a type to be `Nullable` and `Repeated` at the same time, but a `Repeated` type can have 0 elements.
Nested values
The `SqlStruct` can hold a list of records, including other `SqlStruct`s, meaning that we can have nested structures.
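As a rough illustration (simplified stand-ins, not the library's exact definitions), a mode-flagged, nestable type model could look like this:

```scala
// Simplified stand-ins (not the library's exact ADT).
sealed trait SqlMode
case object Required extends SqlMode
case object Nullable extends SqlMode
case object Repeated extends SqlMode

sealed trait SqlTypeSketch
case class SqlIntSketch(mode: SqlMode) extends SqlTypeSketch
case class SqlStringSketch(mode: SqlMode) extends SqlTypeSketch
// A struct holds named fields, and a field can itself be another struct,
// which is what makes nested structures possible.
case class SqlStructSketch(fields: List[(String, SqlTypeSketch)], mode: SqlMode) extends SqlTypeSketch

object SqlTypeSketchDemo {
  // Something like BigQuery's ARRAY<STRUCT<id INT64, name STRING>>:
  // the struct itself is the repeated element, and it may hold 0 elements.
  val repeatedRecord: SqlTypeSketch =
    SqlStructSketch(
      List("id" -> SqlIntSketch(Required), "name" -> SqlStringSketch(Nullable)),
      Repeated
    )
}
```

Because `mode` is a single value, `Nullable` and `Repeated` cannot be combined, exactly as described above.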
Type-class derivation
Type-classes are a way of implementing "ad-hoc polymorphism": we can implement behaviour for a type without having to modify the type itself. In Scala 2 we achieve this through implicits (Scala 3 uses `given` instances).
The interesting part of type-classes for this library is that we can derive a type-class for a type without having to implement it.
For example, we can create a simple type-class:
```scala
trait MyTypeClass[A] {
  def doSomething(a: A): String
}
```
A type-class is always a trait with a generic type parameter.
Then, we can implement our type-class for the `Int` type:
```scala
implicit val myTypeClassForInt: MyTypeClass[Int] = new MyTypeClass[Int] {
  override def doSomething(a: Int): String = "This is my int " + a.toString
}
```
Since Scala 2.12, there is a simplified syntax (SAM conversion) for this when the trait has a single abstract method:
```scala
implicit val myTypeClassForInt: MyTypeClass[Int] = (a: Int) => "This is my int " + a.toString
```
We can do the same for other types:

```scala
implicit val myTypeClassForString: MyTypeClass[String] = new MyTypeClass[String] {
  override def doSomething(a: String): String = "This is my String " + a
}
```
Now, if we want to use our type-class with a `List[Int]` or a `List[String]`, we would need to implement it for both `List[Int]` and `List[String]`.
But if we implement the type-class for `List[A]`, where `A` is any type, the compiler can derive the implementations for `List[Int]` and `List[String]` automatically, and for lists of any other type already implemented.
```scala
implicit def myTypeClassForList[A](implicit myTypeClassForA: MyTypeClass[A]): MyTypeClass[List[A]] =
  new MyTypeClass[List[A]] {
    override def doSomething(a: List[A]): String = a.map(myTypeClassForA.doSomething).mkString(",")
  }
```
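Putting the pieces together, the compiler now resolves `MyTypeClass[List[Int]]` (and even `MyTypeClass[List[List[Int]]]`) with no extra code. The snippet below repeats the definitions in a self-contained object and adds a small summoner method:

```scala
object MyTypeClassDemo {
  trait MyTypeClass[A] { def doSomething(a: A): String }

  implicit val myTypeClassForInt: MyTypeClass[Int] =
    (a: Int) => "This is my int " + a.toString

  implicit def myTypeClassForList[A](implicit tc: MyTypeClass[A]): MyTypeClass[List[A]] =
    (a: List[A]) => a.map(tc.doSomething).mkString(",")

  // Summoner: asks the compiler to find (or derive) the instance for A.
  def doSomething[A](a: A)(implicit tc: MyTypeClass[A]): String = tc.doSomething(a)
}
```

`MyTypeClassDemo.doSomething(List(1, 2))` returns `"This is my int 1,This is my int 2"`, even though we never wrote an instance for `List[Int]`.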
Similarly, if we want to use a `case class` like:

```scala
case class MyClass(a: Int, b: String)
```
We would need to implement the type-class for `MyClass`. But if we implement the type-class for a generic `Product` type, the compiler can derive the implementation for `MyClass` automatically, and for any other `case class` whose field types are already implemented.
Implementing the conversion for a `Product` type is more complex than implementing it for a `List`, and in Scala 2 we usually use the Shapeless library to do it. In Scala 3, the language itself allows us to derive the type-class for a `Product` type, so Shapeless is not needed.
In big-data-types, the implementations for all basic types, including iterables and `Product` types, can be found here for Scala 2 and here for Scala 3.
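To see what such a derivation saves us, this is roughly the instance the compiler would generate for `MyClass`, hand-written here for illustration (field by field, delegating to each field's instance):

```scala
object ProductSketch {
  trait MyTypeClass[A] { def doSomething(a: A): String }

  implicit val forInt: MyTypeClass[Int] = (a: Int) => "This is my int " + a.toString
  implicit val forString: MyTypeClass[String] = (a: String) => "This is my String " + a

  case class MyClass(a: Int, b: String)

  // What a Product derivation effectively writes for us: visit each field,
  // delegate to that field's instance, and combine the results.
  implicit val forMyClass: MyTypeClass[MyClass] =
    (m: MyClass) =>
      List(
        implicitly[MyTypeClass[Int]].doSomething(m.a),
        implicitly[MyTypeClass[String]].doSomething(m.b)
      ).mkString(", ")
}
```

Shapeless (Scala 2) or `Mirror`-based derivation (Scala 3) generates exactly this kind of field-by-field instance for any case class whose fields already have instances.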
Implementing a new type
To implement a new type, we need to implement the conversions from and to the `Generic` type. There is a complete step-by-step guide, with examples, in the official documentation.
As a quick example, let's say we want to implement a new type called `MyType`. We need to implement the conversions `MyType -> Generic` and `Generic -> MyType`.
Both conversions are not strictly needed: if we only need `Scala -> MyType`, we only have to implement `Generic -> MyType`, because the library already has the conversion `Scala -> Generic`. The same happens with other types; for example, `BigQuery -> MyType` will also be ready automatically.
To do that, we need a type-class that works with our type. This will be different depending on the type we want to implement. For example:
```scala
trait GenericToMyType[A] {
  def getType: MyTypeObject
}
```
Maybe our type works with a `List` at the top level, as Spark does, so instead we would do:
```scala
trait GenericToMyType[A] {
  def getType: List[MyTypeObject]
}
```
`getType` can be renamed to anything meaningful, like `toMyType` or `myTypeSchema`.
And we need to implement this type-class for all the (Generic) `SqlType` types:
- Scala 2
- Scala 3
```scala
implicit val genericToMyTypeForInt: GenericToMyType[SqlInt] = new GenericToMyType[SqlInt] {
  override def getType: MyTypeObject = MyIntType
}
```
```scala
given GenericToMyType[SqlInt] = new GenericToMyType[SqlInt] {
  override def getType: MyTypeObject = MyIntType
}
```
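A self-contained sketch of the whole pattern, with simplified stand-ins for the library's generic types (the real `SqlType` instances carry modes and more cases):

```scala
object MyTypeSketch {
  // Simplified stand-ins for the library's generic types.
  sealed trait SqlType
  case object SqlInt extends SqlType
  case object SqlString extends SqlType

  // The new system's own representation.
  sealed trait MyTypeObject
  case object MyIntType extends MyTypeObject
  case object MyStringType extends MyTypeObject

  trait GenericToMyType[A] { def getType: MyTypeObject }

  object GenericToMyType {
    // Summoner, so we can write GenericToMyType[SqlInt.type].getType
    def apply[A](implicit instance: GenericToMyType[A]): GenericToMyType[A] = instance

    // One instance per generic type; the compiler picks the right one.
    implicit val genericToMyTypeForInt: GenericToMyType[SqlInt.type] =
      new GenericToMyType[SqlInt.type] { def getType: MyTypeObject = MyIntType }
    implicit val genericToMyTypeForString: GenericToMyType[SqlString.type] =
      new GenericToMyType[SqlString.type] { def getType: MyTypeObject = MyStringType }
  }
}
```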
Using conversions
The defined type-classes allow you to convert `MyType -> Generic` by doing this:

```scala
val int: SqlInt = SqlTypeConversion[MyIntType].getType
```
And `Generic -> MyType` by doing this:

```scala
val int: MyIntType = SqlTypeToBigQuery[SqlInt].getType
```
This works well when we deal with `case class` definitions and don't have an instance of them. For example, a `case class` definition can be converted into a `BigQuery` schema, ready to be used for table creation.
But sometimes our types work with instances rather than definitions, and we need to use them to convert to other types. There is another type-class, available on all implemented types, that works with instances. In general, it can be implemented by reusing code from the previous one, but this one expects an argument of the type we want to convert from.
```scala
trait SqlInstanceToMyType[A] {
  def myTypeSchema(value: A): MyTypeObject
}
```
Implementing this type-class allows us to use the conversion like this:

```scala
val mySchema: MyTypeObject = SqlInstanceToMyType.myTypeSchema(theOtherType)
```
But this syntax is not very friendly, so we can use extension methods to make it more readable.
Extension methods
In Scala 2, extension methods are created through implicit classes, which allow us to add new methods to existing types.
In the library, we implement extension methods for `Generic -> SpecificType`, and the interesting part, again, is that we don't need to implement `A -> B` directly: the compiler can derive it for us.
- Scala 2
- Scala 3
```scala
implicit class InstanceSyntax[A: SqlInstanceToMyType](value: A) {
  def asMyType: MyTypeObject = SqlInstanceToMyType[A].myTypeSchema(value)
}
```
```scala
extension [A: SqlInstanceToMyType](value: A) {
  def asMyType: MyTypeObject = SqlInstanceToMyType[A].myTypeSchema(value)
}
```
And suddenly, we can use the conversion like this:

```scala
val mySchema: MyTypeObject = theOtherType.asMyType
```
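Here is the complete instance-based pattern end to end, in a self-contained form with hypothetical types, so the mechanics can be run in isolation:

```scala
object SyntaxSketch {
  // Hypothetical source and target types.
  case class OtherType(name: String)
  case class MyTypeObject(schema: String)

  // Instance-based type-class: receives a value of the type being converted.
  trait SqlInstanceToMyType[A] { def myTypeSchema(value: A): MyTypeObject }

  object SqlInstanceToMyType {
    def apply[A](implicit instance: SqlInstanceToMyType[A]): SqlInstanceToMyType[A] = instance

    implicit val otherTypeToMyType: SqlInstanceToMyType[OtherType] =
      (value: OtherType) => MyTypeObject("schema of " + value.name)
  }

  // Scala 2 extension method: any A with an SqlInstanceToMyType instance gets .asMyType
  implicit class InstanceSyntax[A: SqlInstanceToMyType](value: A) {
    def asMyType: MyTypeObject = SqlInstanceToMyType[A].myTypeSchema(value)
  }
}
```

With `import SyntaxSketch._` in scope, `OtherType("users").asMyType` produces `MyTypeObject("schema of users")`.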
This syntax is much easier to use. For example, if we work with Spark and BigQuery, we can do the following:

```scala
val sparkDf: DataFrame = ???
val bigQuerySchema = sparkDf.schema.asBigQuery
```
More types to come
The library has only a few types implemented (BigQuery, Spark, Cassandra, and Circe), but implementing a new one is fairly easy, and it automatically gets methods that can convert it into any other type already implemented. As this grows, the number of available conversions grows quadratically, and the library becomes more powerful.
Some types that could be potentially implemented:
- Avro
- Parquet
- Athena (AWS)
- Redshift (AWS)
- Snowflake
- RDS (relational databases)
- Protobuf
- ElasticSearch templates
- ...
Some types could have some restrictions, but they could be implemented differently. For example, a type conversion could be implemented as a `String` conversion, the string being a "Create table" statement for a specific database; then, automatically, any other type could be printed as a "Create table" statement.