## Sunday, June 18, 2017

### Domain models, Algebraic laws and Unit tests

In a domain model, when you have a domain element that forms an algebraic abstraction honoring certain laws, you can get rid of many of your explicitly written unit tests just by checking the laws. Of course you have to squint hard and discover the lawful abstraction that hides behind your concrete domain element.

Consider this simple abstraction for Money that keeps track of amounts in various currencies.

scala> import Money._
import Money._

// 1000 USD
scala> val m = Money(1000, USD)
m: laws.Money = (USD,1000)

scala> val n = add(m, Money(248, AUD))
n: laws.Money = (AUD,248),(USD,1000)

scala> val p = add(n, Money(230, USD))
p: laws.Money = (AUD,248),(USD,1230)

// value of the money in base currency (USD)
scala> p.toBaseCurrency
res1: BigDecimal = 1418.48

// debit amount
scala> val q = Money(-250, USD)
q: laws.Money = (USD,-250)

scala> val r = add(p, q)
r: laws.Money = (AUD,248),(USD,980)

The valuation of Money is done in terms of its base currency which is usually USD. One of the possible implementations of Money is the following (some parts elided for future explanations) ..

sealed trait Currency
case object USD extends Currency
case object AUD extends Currency
case object JPY extends Currency
case object INR extends Currency

class Money private[laws] (val items: Map[Currency, BigDecimal]) {
  // value in the base currency (USD); a currency without a known rate is valued at 1
  def toBaseCurrency: BigDecimal =
    items.foldLeft(BigDecimal(0)) { case (a, (ccy, amount)) =>
      a + Money.exchangeRateWithUSD.get(ccy).getOrElse(BigDecimal(1)) * amount
    }

  override def toString = items.toList.mkString(",")
}

object Money {
  final val zeroMoney = new Money(Map.empty[Currency, BigDecimal])

  def apply(amount: BigDecimal, ccy: Currency) = new Money(Map(ccy -> amount))

  // elided for now, matches the add(m, n) usage in the session above
  def add(m: Money, n: Money): Money = ???

  final val exchangeRateWithUSD: Map[Currency, BigDecimal] =
    Map(AUD -> 0.76, JPY -> 0.009, INR -> 0.016, USD -> 1.0)
}

Needless to say we will have quite a number of unit tests that check for addition of Money, including the boundary cases of adding to zeroMoney.

It's not very hard to see that the type Money forms a Monoid under the add operation. Or to speak a bit loosely we can say that Money is a Monoid under the add operation.

A Monoid has laws that every instance needs to honor - associativity, left identity and right identity. And when your model element needs to honor the laws of algebra, it's always recommended to include the verification of the laws as part of your test suite. Besides validating the sanity of your abstractions, one side-effect of verifying laws is that you can get rid of many of your explicitly written unit tests for the operation that forms the Monoid. They will be automatically verified when verifying the laws of Monoid[Money].
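To make the three laws concrete, here is a minimal, dependency-free sketch that writes them as plain predicates. The names SimpleMonoid, associative, leftIdentity and rightIdentity are hypothetical and are not the Cats API; integer addition and subtraction are used only as a lawful and an unlawful example.

```scala
// Illustrative sketch only: SimpleMonoid is a hypothetical stand-in, not cats.Monoid.
object MonoidLawsSketch {
  trait SimpleMonoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // (x + y) + z == x + (y + z)
  def associative[A](m: SimpleMonoid[A])(x: A, y: A, z: A): Boolean =
    m.combine(m.combine(x, y), z) == m.combine(x, m.combine(y, z))

  // empty + x == x
  def leftIdentity[A](m: SimpleMonoid[A])(x: A): Boolean =
    m.combine(m.empty, x) == x

  // x + empty == x
  def rightIdentity[A](m: SimpleMonoid[A])(x: A): Boolean =
    m.combine(x, m.empty) == x

  // A lawful instance: integers under addition
  val intAddition: SimpleMonoid[Int] = new SimpleMonoid[Int] {
    val empty = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // An unlawful candidate: subtraction is neither associative nor left-neutral
  val intSubtraction: SimpleMonoid[Int] = new SimpleMonoid[Int] {
    val empty = 0
    def combine(x: Int, y: Int): Int = x - y
  }
}
```

A property based test runs predicates like these against many generated values, so a single law check stands in for a whole family of hand-written addition tests.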

Here's how we define Monoid[Money] using Cats ..

val MoneyAddMonoid: Monoid[Money] = new Monoid[Money] {
  def combine(m: Money, n: Money): Money = add(m, n)
  def empty: Money = zeroMoney
}

and the implementation of the previously elided add operation on Money using Monoid on Map ..

object Money {
  //..

  // merge the per-currency amounts using the Monoid instance for Map
  def add(m: Money, n: Money): Money =
    new Money(m.items |+| n.items)

  //..

}
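For readers who do not have Cats at hand, the effect of combining two maps with `|+|` can be approximated in plain Scala: amounts for a currency present in both maps are summed and all other entries are carried over. The sketch below is an assumption-laden stand-in (string currency codes, hypothetical names merge and toBase) that replays the session from the start of the post.

```scala
// Illustrative sketch: plain-Scala approximation of merging currency maps.
object MapMergeSketch {
  type Amounts = Map[String, BigDecimal]

  // Sum amounts for shared currencies, keep the rest untouched
  def merge(x: Amounts, y: Amounts): Amounts =
    y.foldLeft(x) { case (acc, (ccy, amt)) =>
      acc.updated(ccy, acc.getOrElse(ccy, BigDecimal(0)) + amt)
    }

  // Rates copied from the post's exchangeRateWithUSD table
  val rates: Amounts = Map(
    "AUD" -> BigDecimal("0.76"),
    "JPY" -> BigDecimal("0.009"),
    "INR" -> BigDecimal("0.016"),
    "USD" -> BigDecimal(1)
  )

  // Value in USD; a currency without a known rate is valued at 1
  def toBase(items: Amounts): BigDecimal =
    items.foldLeft(BigDecimal(0)) { case (a, (ccy, amt)) =>
      a + rates.getOrElse(ccy, BigDecimal(1)) * amt
    }

  val m: Amounts = Map("USD" -> BigDecimal(1000))
  val n: Amounts = merge(m, Map("AUD" -> BigDecimal(248)))
  val p: Amounts = merge(n, Map("USD" -> BigDecimal(230)))
}
```

With these inputs `p` holds 248 AUD and 1230 USD, and `toBase(p)` comes to 1418.48, matching the values shown in the session.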

Now we can verify the laws of Monoid[Money] using specs2 and ScalaCheck and the helper classes that Cats offers ..

import cats._
import kernel.laws.GroupLaws
import org.scalacheck.{ Arbitrary, Gen }
import Arbitrary.arbitrary

class MoneySpec extends CatsSpec { def is = s2"""

  This is a specification for validating laws of Money

  (Money) should
     form a monoid under addition    $e1
  """

  implicit lazy val arbCurrency: Arbitrary[Currency] = Arbitrary {
    Gen.oneOf(AUD, USD, INR, JPY)
  }

  implicit def moneyArbitrary: Arbitrary[Money] = Arbitrary {
    for {
      i <- Arbitrary.arbitrary[Map[Currency, BigDecimal]]
    } yield new Money(i)
  }

  def e1 = checkAll("Money", GroupLaws[Money].monoid(Money.MoneyAddMonoid))
}

and running the test suite will verify the Monoid laws for Monoid[Money] ..

[info] This is a specification for validating laws of Money
[info]
[info] (Money) should
[info] form a monoid under addition monoid laws must hold for Money
[info] + monoid.associativity
[info] + monoid.combineAll
[info] + monoid.combineAll(Nil) == id
[info] + monoid.combineAllOption
[info] + monoid.combineN(a, 0) == id
[info] + monoid.combineN(a, 1) == a
[info] + monoid.combineN(a, 2) == a |+| a
[info] + monoid.isEmpty
[info] + monoid.leftIdentity
[info] + monoid.rightIdentity
[info] + monoid.serializable

In summary ..

• strive to find abstractions in your domain model that are constrained by algebraic laws

• check all laws as part of your test suite

• you will find that you can get rid of quite a few explicitly written unit tests just by checking the laws of your abstraction

• and of course use property based testing for unit tests

In case you want to take a look at the full code base, it's there on my GitHub repo.

In the next post we will take the next step towards modeling with generic algebraic code using the Monoid pattern from this example. Code written in parametric form without depending on specialized concrete types can be more robust, easier to test and easier to reason about. I have also discussed this at length in my book Functional and Reactive Domain Modeling. I plan to supplement the materials covered there with more examples and code patterns ..

## Saturday, June 13, 2015

### Baking a π can teach you a bit of Parametricity

Even though I got my copy of Prof.
Eugenia Cheng's awesome How to Bake π a couple of weeks back, I started reading it only over this weekend. I am only on page 19, enjoying all the stuff regarding cookies that Prof. Cheng is using to explain abstraction. This is a beautiful piece of explanation and if you are a programmer you may get some extra mileage out of the concepts that she explains here. Let's see if we can unravel a few of them ..

She starts with a real life situation such as:

"If Grandma gives you five cookies and Grandpa gives you five cookies, how many cookies will you have ?"

Let's model this as a box of cookies that you get from each of your Grandma and Grandpa, which you need to count to find the total. In Scala we may have something like the following ..

case class CookieBox(count: Int)

and we can define a function that gives you a CookieBox containing the total number of cookies from the 2 boxes that we pass to the function ..

def howManyCookies(gm: CookieBox, gp: CookieBox) = {
  CookieBox(gm.count + gp.count)
}

and we use howManyCookies to find the count ..

scala> val gm = CookieBox(5)
gm: CookieBox = CookieBox(5)

scala> val gp = CookieBox(5)
gp: CookieBox = CookieBox(5)

scala> howManyCookies(gm, gp)
res5: CookieBox = CookieBox(10)

.. so we have 10 cookies from our Grandma & Grandpa .. Perfect!

The problem is .. the child answers: "None, because I'll eat them all". To model this let's add a function eat to our CookieBox abstraction ..

case class CookieBox(count: Int) {
  // let's assume n <= count for simplicity
  def eat(n: Int): CookieBox = CookieBox(count - n)
}

So instead of answering the question the correct way, the child cheats and implements howManyCookies as ..

def howManyCookies(gm: CookieBox, gp: CookieBox) = {
  CookieBox(gm.eat(gm.count).count + gp.eat(gp.count).count)
}

and we get the following ..

scala> howManyCookies(gm, gp)
res6: CookieBox = CookieBox(0)

Prof. Cheng continues ..
"The trouble here is that cookies do not obey the rules of logic, so using math to study them doesn't quite work. .. We could impose an extra rule on the situation by adding '... and you're not allowed to eat the cookies'. If you're not allowed to eat them, what's the point of them being cookies ?"

This is profound indeed. When we are asked to count some stuff, it really doesn't matter if they are cookies or stones or pastries. The only property we need here is the ability to add together the 2 things that we are handed. The fact that we have implemented howManyCookies in terms of CookieBox gives the little child the opportunity to cheat by using the eat function. More information is actually hurting us here: being concrete with data types creates more avenues for incorrect implementation.

Prof. Cheng is succinct here when she explains ..

"We could treat the cookies as just things rather than cookies. We lose some resemblance to reality, but we gain scope and with it efficiency. The point of numbers is that we can reason about 'things' without having to change the reasoning depending on what 'thing' we are thinking about."

Yes, she is talking about generalization, being polymorphic over what we count. We just need the ability to add 2 "things", be it cookies, monkeys or anchovies. In programming we model this with parametric polymorphism, and use a universal quantification over the set of types for which we implement the behavior.

def howMany[A](gm: A, gp: A) = //..

We have made the implementation parametric and got rid of the concrete data type CookieBox. But how do we add the capability to sum the 2 objects and get the result ? You got it right - we already have an abstraction that makes this algebra available to a generic data type. Monoids FTW .. and it doesn't get simpler than this ..

trait Monoid[T] {
  def zero: T
  def append(t1: T, t2: T): T
}

zero is the identity element and append is an associative binary function over 2 objects of the type.
So given a monoid instance for our data type, we can model howMany in a completely generic way irrespective of whether A is a CookieBox or Monkey.

def howMany[A : Monoid](gm: A, gp: A): A = implicitly[Monoid[A]].append(gm, gp)

Implementing a monoid for CookieBox is also simple ..

object CookieBox {
  implicit val CookieBoxMonoid: Monoid[CookieBox] = new Monoid[CookieBox] {
    val zero = CookieBox(0)
    def append(i: CookieBox, j: CookieBox) = CookieBox(i.count + j.count)
  }
}

With the above implementation of howMany, the little child will not be able to cheat. By providing a simpler data type we have made the implementation more robust and reusable across multiple data types.

Next time someone wants me to explain parametricity, I will point them to Page 19 of How to Bake π.

## Thursday, March 26, 2015

### Randomization and Probabilistic Techniques to scale up Machine Learning

Some time back I blogged about the possibilities that probabilistic techniques and randomization bring to the paradigm of stream computing. Architectures based on big data relate not only to high volume storage but also to low latency velocities, and this is exactly where stream computing has a role to play. I discussed a few data structures like bloom filters, count min sketch and hyperloglog and algorithms like Locality Sensitive Hashing that use probabilistic techniques to reduce the search and storage space while processing huge volumes of data.

Of late, I have been studying some of the theories behind machine learning algorithms and how they can be used in conjunction with the petabytes of data that we generate every day. And the same thing strikes here - there are algorithms that can model the most perfect classifier. But you need randomization and probabilistic techniques to make them scale, even at the expense of a small amount of inaccuracy creeping into your model.
In most cases we will see that the small inaccuracy that creeps into your algorithm because of probabilistic bounds can be compensated by the ability to process more data within the specified computational timeline. This is true even for some of the basic algorithms like matrix multiplication that form the core of machine learning models.

The contents of this post are nothing original or new. It's just to share some of my thoughts in learning the usage of approximation techniques in building machine learning classifiers.

# Matrix Multiplication

Beyond these specialized data structures and algorithms, randomization has been found to be quite effective for processing large data sets even for standard algorithms like matrix multiplication, polynomial identity verification or min cut identification from large graphs. In all such cases the best available algorithms have a computational complexity which works well for a small data set but doesn't scale well enough with the volumes of data.

Consider a case where we are given 3 matrices, $A$, $B$ and $C$, and we need to verify if $AB = C$. The standard algorithm for matrix multiplication takes $\Theta(n^3)$ operations and there's also a sophisticated algorithm that works in $O(n^{2.37})$ operations. Instead let's consider some randomization and choose a random vector $\bar{r} = (r_1, r_2, .. r_n) \in \{0, 1\}^n$. Now we can compute $AB\bar{r}$ by first computing $B\bar{r}$ and then $A(B\bar{r})$. And then we compute $C\bar{r}$. If we find $A(B\bar{r}) \neq C\bar{r}$, then $AB \neq C$. Otherwise we return $AB = C$. Instead of matrix-matrix multiplication our randomized algorithm uses matrix-vector multiplication, which can be done in $\Theta(n^2)$ operations the standard way.

Obviously a $\Theta(n^2)$ algorithm has a lower computational complexity than $\Theta(n^3)$ and scales better with larger data sets. Now the question is how accurate is this algorithm ? Is it guaranteed to give the correct answer every time we run it ?
As with other probabilistic algorithms, there's a chance that our algorithm will return a wrong result. But as long as we can show that the chance is minimal and can be reduced by tuning some parameters, we should be fine.

It can be shown that if $AB \neq C$ and if $\bar{r}$ is chosen uniformly at random from $\{0, 1\}^n$ then $Pr(AB\bar{r} = C\bar{r}) \le 1/2$. But the trick is that we can run our randomized algorithm many times, choosing $\bar{r}$ with replacement from $\{0, 1\}^n$. If for any of these trials we get $AB\bar{r} \neq C\bar{r}$, then we can conclude $AB \neq C$. And the probability that we get $AB\bar{r} = C\bar{r}$ for all $k$ trials despite $AB \neq C$ is at most $2^{-k}$. So for $100$ trials, the chance of error is at most $2^{-100}$, which we can see is really small. The detailed proof of this analysis can be found in the excellent book Probability and Computing by Michael Mitzenmacher & Eli Upfal.

Matrix multiplication is something that's used heavily, especially in implementing machine learning classifiers. And if we can tolerate that little chance of error we get an algorithm with lower computational complexity that scales much better.

# Stochastic Gradient Descent

Consider another use case from core machine learning classifier design. Gradient descent is a standard way to minimize the empirical risk for measuring training set performance. The empirical risk is given by the following equation:

$$E_n(f_w) = \frac{1}{n}\sum_{i=1}^{n} l(f_w(x_i), y_i)$$

where $l$ is the loss function that measures the cost of predicting $f_w(x_i)$ when the actual answer is $y_i$, averaged over the $n$ training examples, and $f_w(x)$ is the function parameterized by the weight vector $w$. Each iteration of gradient descent updates the weights $w$ on the basis of the gradient of $E_n(f_w)$ according to the following iterative step:

$$w_{t+1} = w_t - \gamma \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)$$

where $z_i = (x_i, y_i)$ is the $i$-th training example, $Q(z, w) = l(f_w(x), y)$ is the loss on a single example and $\gamma$ is an adequately chosen gain.
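As a rough illustration of the batch update, here is a minimal sketch for a one-dimensional linear model $f_w(x) = wx$ with squared loss. The model, the loss and the names gradQ, batchStep and descend are assumptions made for this example and are not prescribed by the text.

```scala
// Illustrative sketch: batch gradient descent for f_w(x) = w * x with squared loss.
object GradientDescentSketch {
  // Gradient with respect to w of Q((x, y), w) = (w * x - y)^2 / 2
  def gradQ(x: Double, y: Double, w: Double): Double = (w * x - y) * x

  // One batch step: average the gradient over every training example
  def batchStep(data: Seq[(Double, Double)], w: Double, gamma: Double): Double = {
    val meanGradient = data.map { case (x, y) => gradQ(x, y, w) }.sum / data.size
    w - gamma * meanGradient
  }

  // Repeat the batch step a fixed number of times, starting from w0
  def descend(data: Seq[(Double, Double)], w0: Double, gamma: Double, steps: Int): Double =
    (1 to steps).foldLeft(w0)((w, _) => batchStep(data, w, gamma))
}
```

For data drawn from $y = 2x$, repeated steps move $w$ towards 2. The cost of each step grows linearly with the number of examples because `batchStep` touches all of them.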
Note that a single update step for the parameter runs through all the training examples and this gets repeated for every update step that you do before convergence. Compare this with Stochastic Gradient Descent (SGD) where the update step is given by the following:

$$w_{t+1} = w_t - \gamma \nabla_w Q(z_t, w_t)$$

Note that instead of running through all the examples and computing the exact gradient, SGD computes the gradient based on one randomly picked example $z_t$. So, SGD does a noisy approximation to the true gradient. But since it does not have to process all the examples in every iteration it scales better with a large data set.

In this paper on Large Scale Machine Learning With Stochastic Gradient Descent, Léon Bottou classifies the error in building the classifier into 3 components:

• Approximation Error, which comes from the fact that the function $f$ that we choose is different from the optimal function $f^*$ and we approximate using a few examples

• Estimation Error, which comes from the fact that we have a finite number of training examples and would have gone away with an infinite number of them

• Optimization Error, which comes from the fact that we are using an inferior algorithm to estimate the gradient

With normal gradient descent we will have a low optimization error since we run through all the training examples in every iteration to compute the gradient, which is clearly superior to the algorithm of SGD that does a noisy approximation. But SGD will report lower approximation and estimation errors since we will be able to process a larger dataset within the stipulated computation time. So it's a tradeoff that we make using SGD, but clearly we scale better with larger data sets.

# Singular Value Decomposition

Singular Value Decomposition is a dimensionality reduction technique to unearth a smaller number of intrinsic concepts from a high dimensional matrix by removing unnecessary information.
It does so by projecting the original matrix on to lower dimensions such that the reconstruction error is minimized. What this means is that given a matrix $A$ we decompose it into lower dimensional matrices by removing the less important information. And we do this in such a way that we can reconstruct a fairly close approximation to $A$ from those lower dimensional matrices. In theory SVD gives the best possible projection in terms of reconstruction error (optimal low rank approximation). But in practice it suffers from scalability problems with large data sets. It generates dense singular vectors even if the original matrix is a sparse one and hence is computationally inefficient, taking cubic time in the size of the data.

This can be addressed by another algorithm, the CUR algorithm, which allows a larger reconstruction error in exchange for less computation time. CUR decomposes the original matrix into ones of lower dimensions but uses a randomized algorithm to select columns and rows based on their probability distribution. Now it can be shown that CUR reconstruction is just an additive term away from SVD reconstruction, and it's a probabilistic bound subject to the condition that we select a specific range of columns and rows from $A$. The computational bound of CUR is of the order of the data set, which is much less than that of SVD (which as I mentioned earlier is cubic). This is yet another example where we apply randomization and probabilistic techniques to scale our algorithm better for larger data sets in exchange for a small amount of inaccuracy.
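To make the randomized check from the Matrix Multiplication section concrete, here is a minimal sketch of the repeated trial procedure described there. It assumes small, non-empty square matrices of Long values stored as nested vectors; the names mulVec, trial and probablyEqual are hypothetical.

```scala
import scala.util.Random

// Illustrative sketch: randomized verification of AB = C with 0/1 vectors.
object RandomizedProductCheck {
  type Matrix = Vector[Vector[Long]]

  // Matrix-vector product, Theta(n^2) operations for an n x n matrix
  def mulVec(m: Matrix, v: Vector[Long]): Vector[Long] =
    m.map(row => row.zip(v).map { case (a, b) => a * b }.sum)

  // One trial: draw r uniformly from {0, 1}^n and compare A(Br) with Cr
  def trial(a: Matrix, b: Matrix, c: Matrix, rng: Random): Boolean = {
    val r = Vector.fill(c.head.size)(rng.nextInt(2).toLong)
    mulVec(a, mulVec(b, r)) == mulVec(c, r)
  }

  // Report "AB = C" only if all k independent trials agree
  def probablyEqual(a: Matrix, b: Matrix, c: Matrix, k: Int, rng: Random): Boolean =
    (1 to k).forall(_ => trial(a, b, c, rng))
}
```

When $AB = C$ every trial agrees, so the answer is always right. When $AB \neq C$ a wrong "equal" answer needs all $k$ trials to miss the difference, which happens with probability at most $2^{-k}$.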

These are only a few instances of probabilistic bounds being applied to solve real world machine learning problems. There are lots more. In fact I find that the scalability of machine learning has a very direct correlation with the application of probabilistic techniques to the model. As I mentioned earlier, the point of this post is to share some of my thoughts as I continue to learn techniques to scale up machine learning models. Feel free to share your ideas, thoughts and discussions in comments.

## Wednesday, February 11, 2015

### Functional Patterns in Domain Modeling - Composing a domain workflow with statically checked invariants

I have been doing quite a bit of domain modeling using functional programming, mostly in Scala. And as it happens when you work on something for a long period of time, you tend to identify more and more patterns that come up repeatedly within your implementations. You may ignore these as patterns the first time, get a feeling of mere coincidence the next time, but the third time really gives you that aha! moment and you feel like documenting it as a design pattern. In the course of my learnings I have started blogging on some of these patterns - you can find the earlier ones in the series in:

• Functional Patterns in Domain Modeling - The Specification Pattern

• Functional Patterns in Domain Modeling - Immutable Aggregates and Functional Updates

• Functional Patterns in Domain Modeling - Anemic Models and Compositional Domain Behaviors

In this continuing series of functional patterns in domain modeling, I will go through yet another idiom which has been quite a common occurrence in my explorations across various domain models. You will find many of these patterns explained in detail in my upcoming book on Functional and Reactive Domain Modeling, the early access edition of which is already published by Manning.

One of the things that I strive to achieve in implementing domain models is to use the type system to encode as much domain logic as possible. If you can use the type system effectively then you get the benefits of parametricity, which not only makes your code generic, concise and polymorphic, but also makes it self-testing. But that's another story which we can discuss in another post. In this post I will talk about a pattern that helps you design domain workflows compositionally, and also enables implementing domain invariants within the workflow, all done statically with a little help from the type system.

As an example let's consider a loan processing system (simplified for illustration purposes) typically followed by banks issuing loans to customers. A typical simplified workflow looks like the following :-

# The Domain Model

The details of each process are not important - we will focus on how we compose the sequence and ensure that the API verifies statically that the correct sequence is followed. Let's start with a domain model for the loan application - we will keep on enriching it as we traverse the workflow.

case class LoanApplication private[Loans](
// date of application
date: Date,
// name of applicant
name: String,
// purpose of loan
purpose: String,
// intended period of repayment in years
repayIn: Int,
// actually sanctioned repayment period in years
actualRepaymentYears: Option[Int] = None,
// actual start date of loan repayment
startDate: Option[Date] = None,
// loan application number
loanNo: Option[String] = None,
// emi
emi: Option[BigDecimal] = None
)

Note we have a bunch of attributes that are defined as optional and will be filled out later as the loan application moves through the stages of the workflow. Also we have declared the constructor of the class private and we will have a smart constructor to create an instance of the class.

# Wiring the workflow with Kleisli

Here are the various domain behaviors modeling the stages of the workflow .. I will be using the scalaz library for the Kleisli implementation.

def applyLoan(name: String, purpose: String, repayIn: Int,
  date: Date = today) =
  LoanApplication(date, name, purpose, repayIn)

def approve = Kleisli[Option, LoanApplication, LoanApplication] { l =>
  // .. some logic to approve
  l.copy(
    loanNo = scala.util.Random.nextString(10).some,
    actualRepaymentYears = 15.some,
    startDate = today.some
  ).some
}

def enrich = Kleisli[Option, LoanApplication, LoanApplication] { l =>
  //.. may be some logic here
  val x = for {
    y <- l.actualRepaymentYears
    s <- l.startDate
  } yield (y, s)

  l.copy(emi = x.map { case (y, s) => calculateEMI(y, s) }).some
}

applyLoan is the smart constructor that creates the initial instance of LoanApplication. The other 2 functions approve and enrich perform the approval and enrichment steps of the workflow. Note both of them return an enriched version of the LoanApplication within a Kleisli, so that we can use the power of Kleisli composition and wire them together to model the workflow ..

val l = applyLoan("john", "house building", 10)
val op = approve andThen enrich
op run l

When you have a sequence to model that takes an initial object and then applies a chain of functions, you can use plain function composition like h(g(f(x))) or using the point free notation, (h compose g compose f) or using the more readable order (f andThen g andThen h). But in the above case we need to have effects along with the composition - we are returning Option from each stage of the workflow. So here instead of plain composition we need effectful composition of functions and that's exactly what Kleisli offers. The andThen combinator in the above code snippet is actually a Kleisli composition aka function composition with effects.
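To see what effectful composition amounts to, here is a minimal, dependency-free sketch of a Kleisli arrow specialised to Option. OptK and the two sample steps are hypothetical stand-ins written for this illustration; they are not the scalaz Kleisli API.

```scala
// Illustrative sketch: a tiny Option-only stand-in for Kleisli composition.
object KleisliSketch {
  final case class OptK[A, B](run: A => Option[B]) {
    // Feed the result of this step into the next one, stopping at the first None
    def andThen[C](next: OptK[B, C]): OptK[A, C] =
      OptK(a => run(a).flatMap(next.run))
  }

  // Each step may fail, which is modelled by returning None
  val parse: OptK[String, Int] =
    OptK(s => scala.util.Try(s.trim.toInt).toOption)

  val reciprocal: OptK[Int, Double] =
    OptK(n => if (n == 0) None else Some(1.0 / n))

  val pipeline: OptK[String, Double] = parse andThen reciprocal
}
```

Here `pipeline.run("4")` yields `Some(0.25)`, while a non-numeric input or "0" short-circuits to `None`. Plain function composition could not chain the two steps directly because the output of `parse` is an `Option[Int]`, not an `Int`.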

So we have everything the workflow needs and clients use our API to construct workflows for processing loan applications. But one of the qualities of good API design is to design it in such a way that it becomes difficult for the client to use it in the wrong way. Consider what happens with the above design of the workflow if we invoke the sequence as enrich andThen approve. This violates the domain invariant that states that enrichment is a process that happens after the approval. Approval of the application generates some information which the enrichment process needs to use. But because our types align, the compiler will be perfectly happy to accept this semantically invalid composition. And we will have the error reported at run time in this case.

Remembering that we have a static type system at our disposal, can we do better ?

# Phantom Types in the Mix

Let's throw in some more types and see if we can tag in some more information for the compiler to help us. Let's tag each state of the workflow with a separate type ..

trait Applied
trait Approved
trait Enriched

Finally make the main model LoanApplication parameterized on a type that indicates which state it is in. And we have some helpful type aliases ..

case class LoanApplication[Status] private[Loans]( //..

type LoanApplied  = LoanApplication[Applied]
type LoanApproved = LoanApplication[Approved]
type LoanEnriched = LoanApplication[Enriched]

These types will have no role in modeling domain behaviors - they will just be used to dispatch to the correct state of the sequence that the domain invariants mandate. The workflow functions need to be modified slightly to take care of this ..

def applyLoan(name: String, purpose: String, repayIn: Int,
  date: Date = today) =
  LoanApplication[Applied](date, name, purpose, repayIn)

def approve = Kleisli[Option, LoanApplied, LoanApproved] { l =>
  l.copy(
    loanNo = scala.util.Random.nextString(10).some,
    actualRepaymentYears = 15.some,
    startDate = today.some
  ).some.map(identity[LoanApproved])
}

def enrich = Kleisli[Option, LoanApproved, LoanEnriched] { l =>
  val x = for {
    y <- l.actualRepaymentYears
    s <- l.startDate
  } yield (y, s)

  l.copy(emi = x.map { case (y, s) => calculateEMI(y, s) }).some.map(identity[LoanEnriched])
}

Note how we use the phantom types within the Kleisli and ensure statically that the sequence can flow only in one direction - that which is mandated by the domain invariant. So now an invocation of enrich andThen approve will result in a compilation error because the types don't match. So once again yay! for having the correct encoding of domain logic with proper types.
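The same idea can be tried out without any library. The sketch below is a simplified, hypothetical version of the workflow: Loan, Step and the flat EMI calculation are made up for the illustration and stand in for LoanApplication, Kleisli and calculateEMI.

```scala
// Illustrative sketch: phantom types enforcing the order of workflow steps.
object PhantomWorkflowSketch {
  sealed trait Applied
  sealed trait Approved
  sealed trait Enriched

  // S is a phantom type: it tags the state but never appears in a field
  final case class Loan[S](
    name: String,
    years: Option[Int] = None,
    emi: Option[BigDecimal] = None
  )

  // Option-only stand-in for a Kleisli arrow
  final case class Step[A, B](run: A => Option[B]) {
    def andThen[C](next: Step[B, C]): Step[A, C] =
      Step(a => run(a).flatMap(next.run))
  }

  def applyLoan(name: String): Loan[Applied] = Loan[Applied](name)

  val approve: Step[Loan[Applied], Loan[Approved]] =
    Step(l => Some(Loan[Approved](l.name, years = Some(15))))

  // Hypothetical flat instalment, for illustration only
  val enrich: Step[Loan[Approved], Loan[Enriched]] =
    Step(l => l.years.map(y => Loan[Enriched](l.name, l.years, Some(BigDecimal(1500) / y))))

  val workflow: Step[Loan[Applied], Loan[Enriched]] = approve andThen enrich

  // The line below is rejected by the compiler: enrich produces a Loan[Enriched],
  // but approve only accepts a Loan[Applied].
  // val wrong = enrich andThen approve
}
```

Running `workflow.run(applyLoan("john"))` produces an enriched loan, while uncommenting the last line turns the invalid ordering into a type error rather than a run time failure.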

## Thursday, January 01, 2015

### Probabilistic techniques, data streams and online learning - Looking forward to a bigger 2015

I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factors for this will be more and more prevalence of big data and the necessity to process them in near real time using minimal (or constant) memory bandwidth. You are given data streams where possibly you will see every data only once in your lifetime and you need to churn out analytics from them in real time. You cannot afford to store all of them in a database on disk since it will incur an unrealistic performance penalty to serve queries in real time. And you cannot afford to store all information in memory even if you add RAM at your own will. You need to find clever ways to optimize your storage, employ algorithms and data structures that use sublinear space and yet deliver information in real time.

Many such data structures are already being used quite heavily for specialized processing of data streams ..

These data structures are becoming more and more useful as we prepare to embrace and process larger data sets with fairly strict online requirements. And it has started making a difference. Take for example Impala, the open source analytic database from Cloudera that works on top of Hadoop. Impala's NDV aggregate function (number of distinct values) uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space. This blog post has the details of the performance improvement that it offers in comparison to the standard distinct count. The immensely popular NoSQL store Redis also offers a HyperLogLog implementation that you can use to get an approximation on the cardinality of a set using randomization. Salvatore has the details here on the implementation of HyperLogLog algorithm in Redis.

The most important reason these algorithms and data structures are becoming popular is the increased focus on our "online" requirements. We are not only processing bigger and bigger data sets, we need results faster too. We just cannot afford to push all analytics to the batch mode and expect results to come out after an overnight batch run. Various architectural paradigms like the lambda architecture also aim to address this niche area. But before investing in such complex architectures, often some neat data structures that use probabilistic techniques and randomization may offer the much lighter weight solution that you are looking for.

Consider processing the Twitter stream and generating analytics (of whatever form) online. This means that immediately after seeing one twitter feed you must be able to predict something and update your model at the same time. Which means you need to memorize the data that you see in the feed, apply it to update your model and yet cannot store the entire hose that you have seen so far. This is online learning and is the essence of techniques like stochastic gradient descent that help you do this - the model is capable of making up to date predictions after every data that you see. John Myles White has an excellent presentation on this topic.

Consider this other problem of detecting similarities between documents. When you are doing this on a Web scale you will have to deal with millions of documents to find the similar sets. There are techniques like minhash which enable you to compress documents into signature matrices. But even then the scale becomes too big to be processed and reported to the user in a meaningful amount of time. As an example (from Mining Massive Datasets), if you process 1 million documents using signatures of length 250, you still have to use 1000 bytes per document - the total comes to 1 gigabyte which very well fits into the memory of a standard laptop. But when you check for similar pairs, you need to process (1,000,000 choose 2) or half a trillion pairs of documents, which will take almost 6 days to compute all similarities on a laptop. Enter probabilistic techniques: the locality sensitive hashing (LSH) algorithm fits this problem like a charm. Detecting similarity is a problem that arises in recommender systems with collaborative filtering and LSH can be used there as well. The basic idea of LSH as applied to similarity detection is to hash multiple times and identify candidate pairs that qualify for similarity checking. The idea is to reduce the search space using probabilistic techniques so that we can eliminate a class of candidates which have a very low chance of being similar.
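The back-of-the-envelope numbers in the document similarity example can be reproduced with a few lines of arithmetic. Two figures below are assumptions that the paragraph does not state explicitly: 4 bytes per signature entry (which gives the 1000 bytes per document) and a cost of one microsecond per pair comparison.

```scala
// Illustrative sketch: arithmetic behind the all-pairs similarity estimate.
object PairCountSketch {
  val docs: Long = 1000000L

  // 250 signature entries at an assumed 4 bytes each = 1000 bytes per document
  val bytesPerDocument: Long = 250L * 4
  val totalBytes: Long = docs * bytesPerDocument

  // Number of unordered pairs: n * (n - 1) / 2
  val pairs: Long = docs * (docs - 1) / 2

  // Assumed cost of one microsecond per comparison, converted to days
  val days: Double = pairs.toDouble / 1e6 / 86400
}
```

Under these assumptions `totalBytes` is 1,000,000,000 (about 1 gigabyte), `pairs` is 499,999,500,000 (about half a trillion) and `days` is a little under 6, in line with the figures quoted from the book.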

Here I have only scratched the surface of the areas where we apply randomization and probabilistic techniques to solve problems that are very real today. There are plentiful other areas in data mining, graph clustering, machine learning and big data processing where similar techniques are employed to reduce the curse of dimensionality and provide practical solution at scale. 2014 has already seen a big surge in terms of popularizing these techniques. I expect 2015 to be bigger and more mainstream in terms of their usage.

Personally I have been exploring data stream algorithms a lot and have prepared a collection of some useful references. Feel free to share it in case you find it useful. I hope to do something more meaningful with stream processing data structures and online learning in 2015. Have a very happy and joyous new year ..

## Monday, November 03, 2014

### Functional and Reactive Domain Modeling

Manning has launched the MEAP of my upcoming book on Domain Modeling.

The first time I was formally introduced to the topic was way back when I played around with Eric Evans' awesome text on the subject of Domain Driven Design. In the book he discusses various object lifecycle patterns like the Factory, Aggregate or Repository that help separate concerns when you are implementing the various interactions between the elements of the domain model. Entities are artifacts with identities, value objects are pure values, while services model the coarse level use cases of the model components.

Traditionally we followed the recommendations of Eric in our real world implementations and used the object oriented paradigm for modeling all interactions. We started talking about rich domain models and anemic domain models. The rich model espoused a richer agglomeration of state and behavior within the model, while the anemic model preferred to keep them decoupled. Martin Fowler sums this up in his post on Anemic Domain Models ..
"The basic symptom of an Anemic Domain Model is that at first blush it looks like the real thing. There are objects, many named after the nouns in the domain space, and these objects are connected with the rich relationships and structure that true domain models have. The catch comes when you look at the behavior, and you realize that there is hardly any behavior on these objects, making them little more than bags of getters and setters. Indeed often these models come with design rules that say that you are not to put any domain logic in the domain objects. Instead there are a set of service objects which capture all the domain logic. These services live on top of the domain model and use the domain model for data."

## Go Functional

In Functional and Reactive Domain Modeling I look at the problem through a different lens. The primary focus of the book is to encourage building domain models using the principles of functional programming. It's a completely orthogonal approach to OO and focuses on verbs first (as opposed to nouns first in OO), algebra first (as opposed to objects in OO), function composition first (as opposed to object composition in OO), and lightweight objects as ADTs (instead of rich class models).

## Go Reactive

The term "reactive" has recently become quite popular in describing systems that are responsive, scalable and adaptive. In implementing complex domain models, we find many areas which can be made more responsive by implementing them as non blocking operations instead of the standard blocking function calls. Using higher level concurrency primitives like actors, futures and data flow based computing, we can compose asynchronous operations and increase the net throughput of the model. The second part of the book discusses how you can combine the principles of functional programming with the reactive way of implementing behaviors. Application design that combines functional programming with asynchronous, non-blocking communication between the participating entities promises to be a potent combination for developing performant systems that are relatively easy to manage, maintain and evolve. But designing and implementing such models needs a different way of thinking. The behaviors that you implement have to be composable using pure functions, and they form the building blocks of bigger abstractions that communicate with each other using non-blocking asynchronous message passing.
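As a small illustration of composing asynchronous operations, here is a sketch using the standard library's `Future`. The two lookup functions and the account number are hypothetical stand-ins for calls that would normally hit a remote service or a database; this is not code from the book.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object ReactiveSketch {
  // Hypothetical non-blocking lookups returning hard-coded values.
  def fetchBalance(accountNo: String): Future[BigDecimal] = Future(BigDecimal(1000))
  def fetchCreditLimit(accountNo: String): Future[BigDecimal] = Future(BigDecimal(500))

  // Both futures are started before the for-comprehension, so they run
  // concurrently; the result is itself a Future and no caller thread blocks.
  def availableFunds(accountNo: String): Future[BigDecimal] = {
    val balance = fetchBalance(accountNo)
    val limit = fetchCreditLimit(accountNo)
    for { b <- balance; l <- limit } yield b + l
  }
}
```

Creating the futures inside the for-comprehension instead would sequence them, which is a common way to lose the throughput benefit without noticing.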

## Functional meets Reactive

Functional and Reactive Domain Modeling teaches you how to think of the domain model in terms of pure functions and how to compose them to build larger abstractions. You will start with the basics of functional programming and gradually progress to the advanced concepts and patterns that you need to know to implement complex domain models. The book demonstrates how advanced FP patterns like algebraic data types, typeclass based design and isolation of side-effects can make your model compose for readability and verifiability. On the subject of reactive modeling, the book focuses on higher order concurrency patterns like actors and futures. It uses the Akka framework as the reference implementation and demonstrates how advanced architectural patterns like event sourcing and command-query-responsibility-segregation can be put to great use while implementing scalable models. You will learn techniques that are radically different from those of standard RDBMS based applications, which rely on mutation of records. You’ll also pick up important patterns like using asynchronous messaging for interaction based on non blocking concurrency and model persistence, which delivers the speed of in-memory processing along with suitable guarantees of reliability.
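To give a flavor of how event sourcing differs from mutating records, here is a minimal sketch in plain Scala, without Akka. The event types, the `Account` shape and the function names are all hypothetical and chosen only for illustration.

```scala
object EventSourcingSketch {
  // Hypothetical events for an account. State is derived, never updated in place.
  sealed trait Event
  final case class Opened(accountNo: String) extends Event
  final case class Credited(amount: BigDecimal) extends Event
  final case class Debited(amount: BigDecimal) extends Event

  final case class Account(accountNo: String, balance: BigDecimal)

  // Applying one event to the current state yields the next state.
  def applyEvent(state: Option[Account], event: Event): Option[Account] =
    (state, event) match {
      case (None, Opened(no))          => Some(Account(no, BigDecimal(0)))
      case (Some(a), Credited(amount)) => Some(a.copy(balance = a.balance + amount))
      case (Some(a), Debited(amount))  => Some(a.copy(balance = a.balance - amount))
      case _                           => state // ignore events that do not apply
    }

  // The current state is a left fold over the event log.
  def replay(log: Seq[Event]): Option[Account] =
    log.foldLeft(Option.empty[Account])(applyEvent)
}
```

Because the log is the source of truth, the same fold can rebuild the state at any point in time by replaying a prefix of the events.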

Looking forward to an exciting journey with the book. I am sure you will also find interest in the topics that I discuss there. And feel free to jump on to AuthorOnline and fire questions that we all can discuss. I am sure this will also lead to an overall improvement in the quality of the book.

## Monday, May 12, 2014

### Functional Patterns in Domain Modeling - Anemic Models and Compositional Domain Behaviors

I was looking at the presentation that Dean Wampler made recently regarding domain driven design, anemic domain models and how using functional programming principles helps ameliorate some of the problems there. There are some statements that he made which, I am sure, made many OO practitioners chuckle. They contradict popular beliefs that encourage OOP as the primary way of modeling using DDD principles.

One statement that resonates a lot with my thinking is "DDD encourages understanding of the domain, but don't implement the models". DDD does a great job in encouraging developers to understand the underlying domain model and ensuring a uniform vocabulary throughout the lifecycle of design and implementation. This is what design patterns also do, by giving you a vocabulary that you can freely exchange with your fellow developers without influencing any bit of the implementation of the underlying pattern.

On the flip side, trying to implement DDD concepts using standard techniques of OO with joined state and behavior often gives you a muddled mutable model. The model may be rich in the sense that you will find all concepts related to the particular domain abstraction baked into the class you are modeling. But it makes the class fragile as well, since the abstraction becomes more locally focused and loses the global perspective of reusability and composability. As a result, when you try to compose multiple abstractions within the domain service layer, that layer becomes heavily polluted with glue code that resolves the impedance mismatch between class boundaries.

So when Dean claims "Models should be anemic", I think he means to avoid this bundling of state and behavior within the domain object that gives you a false sense of security about the richness of the model. He encourages the practice of keeping only state in the domain objects while you model behaviors using standalone functions.

One other strawman argument that I come across very frequently is that bundling state and behavior by modeling the latter as methods of the class increases encapsulation. If you are still a believer in this school of thought, have a look at Scott Meyers' excellent article which he wrote as early as 2000. He eschews the view that a class is the right level of modularization and encourages more powerful module systems as better containers of your domain behaviors.

As a continuation of my series on functional domain modeling, we continue with the example of the earlier posts and explore the theme that Dean discusses ..

Here's the anemic domain model of the Order abstraction ..

case class Order(orderNo: String, orderDate: Date, customer: Customer,
  lineItems: Vector[LineItem], shipTo: ShipTo,
  netOrderValue: Option[BigDecimal] = None, status: OrderStatus = Placed)

In the earlier posts we discussed how to implement the Specification and Aggregate Patterns of DDD using functional programming principles. We also discussed how to do functional updates of aggregates using data structures like Lens. In this post we will use these as the building blocks, use more functional patterns and build larger behaviors that model the ubiquitous language of the domain. After all, one of the basic principles behind DDD is to lift the domain model vocabulary into your implementation so that the functionality becomes apparent to the developer maintaining your model.
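For readers who have not seen the earlier posts, a lens can be sketched in a few lines. This is a minimal hand-rolled version, not the scalaz `Lens` used in the actual code; the `ShipTo` and `Order` types here are simplified, hypothetical stand-ins, and the `set(s, a)` argument order is chosen to match the calls that appear later in this post.

```scala
object LensSketch {
  // A minimal lens: a getter plus a setter that returns an updated copy.
  final case class Lens[S, A](get: S => A, set: (S, A) => S) {
    def modify(s: S)(f: A => A): S = set(s, f(get(s)))

    // Compose with a lens that focuses deeper into the structure.
    def andThen[B](inner: Lens[A, B]): Lens[S, B] =
      Lens[S, B](s => inner.get(get(s)), (s, b) => set(s, inner.set(get(s), b)))
  }

  // Simplified, hypothetical stand-ins for the domain types of this post.
  final case class ShipTo(name: String, city: String)
  final case class Order(orderNo: String, shipTo: ShipTo)

  val orderShipTo: Lens[Order, ShipTo] =
    Lens[Order, ShipTo](_.shipTo, (o, s) => o.copy(shipTo = s))
  val shipToCity: Lens[ShipTo, String] =
    Lens[ShipTo, String](_.city, (s, c) => s.copy(city = c))

  // Functional update of a nested field without touching the original value.
  val orderCity: Lens[Order, String] = orderShipTo andThen shipToCity
}
```

The payoff is in composition: `orderCity` updates a field two levels deep without any nested `copy` calls at the use site.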

The core idea is to validate the assumption that building domain behaviors as standalone functions leads to an effective realization of the domain model according to the principles of DDD. The base classes of the model contain only the states that can be mutated functionally. All domain behaviors are modeled through functions that reside within the module that represents the aggregate.
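In its simplest form, that arrangement could look like the sketch below: data-only case classes, with behaviors as functions in the enclosing module. The types are trimmed-down, hypothetical versions of the ones in this post, and `validate` and `grossValue` are invented names used only to show the shape.

```scala
object OrderModule {
  // State only: no behavior lives on these types.
  sealed trait OrderStatus
  case object Placed extends OrderStatus
  case object Validated extends OrderStatus

  final case class LineItem(sku: String, qty: Int, unitPrice: BigDecimal)
  final case class Order(orderNo: String, lineItems: Vector[LineItem],
    status: OrderStatus = Placed)

  // Behaviors are standalone functions of the module; each returns a new value.
  def validate(order: Order): Either[String, Order] =
    if (order.lineItems.isEmpty) Left(s"Order ${order.orderNo} has no line items")
    else Right(order.copy(status = Validated))

  def grossValue(order: Order): BigDecimal =
    order.lineItems.map(i => i.unitPrice * i.qty).sum
}
```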

Functions compose and that's precisely how we will chain a sequence of domain behaviors to build bigger abstractions out of smaller ones. Here's a small function that values an Order. Note it returns a Kleisli, which essentially gives us a composition over monadic functions. So instead of composing a -> b and b -> c, which we do with normal function composition, we can do the same over a -> m b and b -> m c, where m is a monad. Composition with effects, if you may say so.
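If Kleisli is new to you, this stripped-down version, specialized to `Either` and written without scalaz, shows the mechanics. It is a sketch for intuition only: the real scalaz `Kleisli` is generic in the monad, and the `parse` and `reciprocal` arrows are toy examples unrelated to the order model.

```scala
object KleisliSketch {
  type Status[A] = Either[String, A]

  // A Kleisli arrow wraps a function A => Status[B]. andThen feeds a
  // successful result into the next arrow and stops at the first Left.
  final case class Kleisli[A, B](run: A => Status[B]) {
    def andThen[C](next: Kleisli[B, C]): Kleisli[A, C] =
      Kleisli[A, C](a => run(a).flatMap(next.run))
  }

  val parse: Kleisli[String, Int] =
    Kleisli[String, Int](s => s.toIntOption.toRight(s"not a number: $s"))

  val reciprocal: Kleisli[Int, Double] =
    Kleisli[Int, Double](n => if (n == 0) Left("division by zero") else Right(1.0 / n))

  // Composition reads like ordinary function composition; the failure
  // handling is buried inside andThen.
  val parseThenInvert: Kleisli[String, Double] = parse andThen reciprocal
}
```

Neither arrow mentions the other's failure case, yet the composed arrow reports whichever failure happens first.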

def valueOrder = Kleisli[ProcessingStatus, Order, Order] {order =>
  val o = orderLineItems.set(
    order,
    setLineItemValues(order.lineItems)
  )
  o.lineItems.map(_.value).sequenceU match {
    case Some(_) => right(o)
    case _ => left("Missing value for items")
  }
}

But what does that buy us ? What exactly do we gain from these functional patterns ? It's the power to abstract over families of similar abstractions like applicatives and monads. Well, that may sound a bit rhetorical and it needs a separate post to justify their use. Stated simply, they encapsulate effects and side-effects of your computation so that you can focus on the domain behavior itself. Have a look at the process function below - it's actually a composition of monadic functions in action. But all the machinery that does the processing of effects and side-effects is abstracted within the Kleisli itself so that the user level implementation is simple and concise.

With Kleisli it's the power to compose over monadic functions. Every domain behavior has a chance of failure, which we model using the Either monad - here ProcessingStatus is just a type alias for this .. type ProcessingStatus[S] = \/[String, S]. Using the Kleisli, we don't have to write any code for handling failures. As you will see below, the composition is just like the normal functions - the design pattern takes care of alternate flows.

Once the Order is valued, we need to apply discounts to qualifying items. It's another behavior that follows the same pattern of implementation as valueOrder.

def applyDiscounts = Kleisli[ProcessingStatus, Order, Order] {order =>
  val o = orderLineItems.set(
    order,
    setLineItemValues(order.lineItems)
  )
  o.lineItems.map(_.discount).sequenceU match {
    case Some(_) => right(o)
    case _ => left("Missing discount for items")
  }
}

Finally we check out the Order ..

def checkOut = Kleisli[ProcessingStatus, Order, Order] {order =>
  val netOrderValue = order.lineItems.foldLeft(BigDecimal(0).some) {(s, i) =>
    s |+| (i.value |+| i.discount.map(d => Tags.Multiplication(BigDecimal(-1)) |+| Tags.Multiplication(d)))
  }
  right(orderNetValue.set(order, netOrderValue))
}
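The arithmetic inside checkOut can be spelled out in plain Scala without the scalaz operators. The sketch below assumes that `value` and `discount` are optional amounts on each line item and that a missing operand acts as the identity of the addition; `NetValueSketch`, `plus` and the trimmed `LineItem` are hypothetical names for illustration.

```scala
object NetValueSketch {
  // Trimmed, hypothetical line item: both amounts may be missing.
  final case class LineItem(value: Option[BigDecimal], discount: Option[BigDecimal])

  // Option-aware addition: None is the identity, so a missing operand
  // leaves the other side unchanged.
  def plus(x: Option[BigDecimal], y: Option[BigDecimal]): Option[BigDecimal] =
    (x, y) match {
      case (Some(a), Some(b)) => Some(a + b)
      case (a, None)          => a
      case (None, b)          => b
    }

  // Net value = sum over items of (value - discount), starting from zero.
  def netOrderValue(items: Vector[LineItem]): Option[BigDecimal] =
    items.foldLeft(Option(BigDecimal(0))) { (sum, i) =>
      plus(sum, plus(i.value, i.discount.map(d => -d)))
    }
}
```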

And here's the service method that composes all of the above domain behaviors into the big abstraction. We don't have any object to instantiate. Just plain function composition that results in an expression modeling the entire flow of events. And it's the cleanliness of abstraction that makes the code readable and succinct.

def process(order: Order) = {
  (valueOrder andThen
    applyDiscounts andThen checkOut) =<< right(orderStatus.set(order, Validated))
}

In case you are interested in the full source code of this small example, feel free to take a peek at my GitHub repo.