ScalaFP: Understanding Monoids In Scala Pragmatically

Knoldus Blogs

As we discussed in our previous post, monoids are semigroups, meaning they have the closure and associativity properties, along with an identity element. So our question now is: why do we require the identity element? Let’s add it to the questions remaining from our previous post:

  1. How can we use monoids with Scala?

  2. Where do we require Monoids?

Now let’s answer these questions one by one.

Q. Why do we require the identity element?

Ans: By now we know that monoids perform binary operations on elements of the same type and return a result of the same type. In our semigroups example, we were dealing with the case class Money. Let’s suppose we have a list of elements, i.e. money: List[Money]. Now the company wants to know the total expenses it incurred in the last financial year. The…
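A minimal sketch of why the identity element matters when totalling a list. The names Monoid, amount, moneyMonoid, total and MoneyMonoidDemo are assumptions made for illustration; the fields of Money in the original post are not shown here, and a library such as Scalaz or Cats would supply its own Monoid type class.

```scala
// Hypothetical type class: an associative operation plus an identity element.
trait Monoid[A] {
  def empty: A                 // identity element
  def combine(x: A, y: A): A   // associative binary operation
}

// Assumed shape of Money; the real case class may differ.
final case class Money(amount: BigDecimal)

object MoneyMonoidDemo {
  implicit val moneyMonoid: Monoid[Money] = new Monoid[Money] {
    val empty: Money = Money(BigDecimal(0))
    def combine(x: Money, y: Money): Money = Money(x.amount + y.amount)
  }

  // Folding from the identity means an empty list still has a well-defined total.
  def total[A](xs: List[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)

  val expenses: List[Money] =
    List(Money(BigDecimal(10)), Money(BigDecimal(20)), Money(BigDecimal(30)))
}
```

With only a semigroup there is no value to return for an empty list; the identity element gives the fold its starting point.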



Installing and Running Presto

Hi Folks!
In my previous blog, I talked about Getting Introduced with Presto.
In today’s blog, I shall talk about setting up (installing) and running Presto.

The basic pre-requisites for setting up Presto are:

  • Linux or Mac OS X
  • Java 8, 64-bit
  • Python 2.4+


  1. Download the Presto Tarball from here
  2. Unpack the Tarball
  3. After unpacking you will see a directory presto-server-0.175 which we will call the installation directory.


Inside the installation directory, create a directory called etc. This directory will hold the following configurations:

  1. Node Properties: environmental configuration specific to each node
  2. JVM Config: command line options for the Java Virtual Machine
  3. Config Properties: configuration for the Presto server
  4. Catalog Properties: configuration for Connectors (data sources)
  5. Log Properties: configuration of the log levels

Now we will set up the above properties one by one.

Step 1: Setting up Node Properties

Create the node properties file inside the etc folder. This file will contain the configuration specific to each node. Given below is a description of the properties we need to set in this file:

  • node.environment: The name of the Presto environment. All the nodes in the cluster must have an identical environment name.
  • The unique identifier for every node.
  • The path of the data directory.

Note: Presto will store the logs and other data at the location specified as the data directory. It is recommended to create the data directory outside the installation directory, as this allows it to be preserved easily during upgrades.
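As a rough sketch, a minimal etc/node.properties could look like the following. The file name and the three property names (node.environment, node.id, node.data-dir) follow the Presto deployment documentation; the values are placeholders chosen for illustration and should be treated as assumptions.

```properties
# Name of the environment; must be identical on every node of the cluster
node.environment=production
# Unique identifier of this node (placeholder value)
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
# Data directory, kept outside the installation directory (placeholder path)
node.data-dir=/var/presto/data
```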


Getting Introduced with Presto

Hi Folks!

In today’s blog I will introduce you to an open source distributed SQL query engine: Presto. It is designed for running SQL queries over big data (petabytes of data). It was designed by the people at Facebook.

Quoting its formal definition: “Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.”

The motive behind the inception of Presto was to enable interactive analytics that approaches the speed of commercial data warehouses, with the power to scale to the size of organisations matching Facebook.

Presto is a distributed query engine that runs on a cluster of machines. A full setup includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyses and plans the query execution, then distributes the processing to the workers.


Idea behind inception of Presto
Working with terabytes or petabytes of data, one is likely to use tools that interact with Hadoop and HDFS. Presto was designed as an alternative to tools that query HDFS using pipelines of MapReduce jobs such as Hive or Pig, but Presto is not limited to accessing HDFS. Presto can be and has been extended to operate over different kinds of data sources including traditional relational databases and other data sources such as Cassandra.

Capabilities of Presto

  • Allows querying data where it resides, whether in Hive, Cassandra, relational databases or even proprietary data stores.
  • Allows a single Presto query to combine data from multiple sources.
  • Faster response times, breaking the myth that one must choose between “having fast analytics using an expensive commercial solution or using a slow “free” solution that requires excessive hardware.”

Credibility

Facebook uses Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day over several internal data stores, including their 300PB data warehouse.

Connectors in Presto
Presto supports pluggable connectors that provide data for queries. There are several pre-existing connectors, and Presto also provides the ability to plug in custom connectors. It supports the following connectors:

  • Hadoop / Hive
    (Apache Hadoop 1.x, Apache Hadoop 2.x, Cloudera CDH 4, Cloudera CDH 5)
  • Cassandra
    (Cassandra 2.x is required. This connector is completely independent of the Hive connector and only requires an existing Cassandra installation.)
  • TPC-H
    (The connector dynamically generates data that can be used for experimenting with Presto)

Before we go further: while analyzing a tool for its features, it is equally important to know what it is not capable of. This helps in determining its use cases and usability.

What Presto is Not
Presto is not a general-purpose relational database.
It is not a replacement for databases like MySQL, PostgreSQL or Oracle.
Presto is not designed to handle Online Transaction Processing (OLTP)

Competitors vs Presto

  • Presto continues to lead in BI-type queries, while Spark leads performance-wise in large analytics queries. Presto scales better than Hive and Spark for concurrent dashboard queries. Production enterprise BI user bases may be on the order of 100s or 1,000s of users, so support for concurrent query workloads is critical. Benchmarks show that Presto performed the best (that is, showed the least query degradation) as the concurrent query workload increased, and showed the best results in user concurrency testing.
  • Another advantage of Presto over Spark and Impala is that it gets ready in minutes.
  • Presto works directly on files in S3, requiring no ETL transformations.

In my next blog, I will discuss how to get started with Presto.

Happy Reading!


The Dominant APIs of Spark: Datasets, DataFrames and RDDs

While working with Spark, we often come across three APIs: DataFrames, Datasets and RDDs. In this blog I will discuss the three in terms of use cases, performance and optimization. It is essential to keep in mind that seamless transformation is available between DataFrames, Datasets and RDDs. Underneath, the RDD forms the foundation of both DataFrames and Datasets.

The order in which the three were introduced is roughly as follows:

RDD (Spark 1.0) -> DataFrame (Spark 1.3) -> Dataset (Spark 1.6)

Let us begin with the Resilient Distributed Dataset (RDD).


The crux of Spark lies in the RDD. It is an immutable distributed collection of elements, partitioned across the nodes of the cluster, that can be operated on in parallel with a low-level API allowing easy transformations and actions.

Use cases

–        Working on unstructured data like streams

–        Data manipulation that involves constructs of functional programming

–        Data access and processing that is free of schema impositions

–        Tasks that require low-level transformations and actions

Salient Features of RDDs

–        Versatile:

It can easily and efficiently process structured as well as unstructured data.

It is available in several programming languages like Java, Scala, Python and R.

–        Distributed collection:

It is based on MapReduce operations, which are widely popular for processing and generating large sets of data in parallel using a distributed algorithm on a cluster. It allows us to write parallel computations with the help of high-level operators, without the overhead of handling work distribution and fault tolerance ourselves.

–        Immutable:

RDDs are collections of records which are partitioned. A partition is the primitive unit of parallelism in an RDD, and every partition forms a logical division of data which is immutable and generated through transformations on existing partitions.

–        Fault tolerant:

In case of the loss of an RDD partition, one can redo the transformations on that same partition to achieve the same computation results, rather than replicating data across multiple nodes.

–        Lazy evaluations:

All transformations are lazy: they do not compute their results right away. Instead, Spark remembers the transformations applied to the dataset. The transformations are performed as and when required, and the result is returned to the caller program.
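The idea of lazy evaluation can be sketched with a plain Scala collection view. This is only an analogy built on the standard library, not Spark code: the view plays the role of a transformation, and forcing it with toList plays the role of an action. The names LazyViewDemo, evaluations and doubled are assumptions for illustration.

```scala
object LazyViewDemo {
  // Counts how many times the mapping function has actually run.
  var evaluations: Int = 0

  val numbers: List[Int] = List(1, 2, 3, 4, 5)

  // Like a transformation: this only records what should be done.
  val doubled = numbers.view.map { n => evaluations += 1; n * 2 }

  // Nothing has been computed yet at this point.
  val evaluationsBeforeAction: Int = evaluations

  // Like an action: forcing the view triggers the computation.
  val result: List[Int] = doubled.toList
}
```

The counter stays at zero until the view is forced, which mirrors how Spark defers work until an action needs a result.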

Drawbacks for RDDs

No inbuilt optimization engine: when working with structured data, RDDs do not take advantage of Spark’s advanced optimizers (the Catalyst optimizer and the Tungsten execution engine).

While working with RDDs, developers need to optimize each RDD based on its characteristic attributes.

Also, unlike DataFrames and Datasets, RDDs don’t infer the schema of the ingested data, and the user is required to specify it explicitly.

Let us move a step ahead and discuss DataFrames and Datasets.


DataFrames are an immutable distributed collection of data organised in a relational manner, that is, into named columns, drawing a parallel to tables in a relational database. The idea is to superimpose a structure on a distributed collection of data in order to allow efficient and easier processing. Along with DataFrames, Spark also uses the Catalyst optimizer.

Salient Features:

–        It is conceptually equivalent to a table in a relational database, but has richer optimizations.

–        Can process structured and unstructured data formats (Avro, CSV, Elasticsearch, and Cassandra) and storage systems (HDFS, Hive tables, MySQL)

–        It empowers SQL queries and the DataFrame API.

Drawbacks of DataFrames

The DataFrame API does not support compile-time safety, which limits the user from manipulating data when the structure of the data is not known.

Also, after transforming a domain object into a DataFrame, the user cannot regenerate it.


Datasets take on two distinct API characteristics, namely strongly typed and untyped. A DataFrame can be seen as a collection of the generic type Dataset[Row], where Row is a generic, untyped JVM object. Unlike DataFrames, Datasets are by default a collection of strongly typed JVM objects. In Java they are mapped by a class, and in Scala they are governed by a case class.

Datasets provide static typing and runtime type safety. In layman’s terms, Datasets (and, to a lesser extent, DataFrames) allow us to catch errors at compile time. Another advantage is that DataFrames render a structured view of semi-structured data as a collection of Dataset[Row].

At the core of the API is the encoder, which is responsible for conversion between JVM objects and the tabular representation. This representation is stored in the Tungsten binary format, improving memory utilisation.

Salient Features:

It has the best of both the above APIs, such as:

– Functional Programming

– Type Safety

– Query Optimisation

– Encoding

Drawbacks of Datasets

It requires type casting into Strings. Querying currently requires specifying the class fields as Strings, and later casting the column into the required data type.

Let us now compare the type safety of the APIs:


                  SQL        DataFrames      Datasets
Syntax Error      Runtime    Compile Time    Compile Time
Analysis Error    Runtime    Runtime         Compile Time
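The difference between catching an analysis error at runtime and at compile time can be illustrated with a plain Scala analogy. This is not Spark code: the untyped Map stands in for row-like, name-based column access, and the case class stands in for a typed Dataset element. All names (TypeSafetyDemo, Person, row) are assumptions for illustration.

```scala
import scala.util.Try

object TypeSafetyDemo {
  final case class Person(name: String, age: Int)

  // Untyped, row-like access: a misspelled field name is only detected when the code runs.
  val row: Map[String, Any] = Map("name" -> "Asha", "age" -> 30)
  val untypedLookup: Try[Any] = Try(row("agee")) // fails at runtime

  // Typed access: writing person.agee here would be rejected by the compiler.
  val person: Person = Person("Asha", 30)
  val typedAge: Int = person.age
}
```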

Let us go a step ahead and discuss the Performance and Optimisation.

Performance Optimisation:

– The DataFrame and Dataset APIs use Catalyst to generate optimised logical and physical plans, whether the code is written in Java, Scala or Python.

– Also, the typed Dataset[T] API is optimised for engineering tasks, while the DataFrame is faster and suitable for interactive analysis.

Space Optimisation:

Encoders in the Dataset API efficiently serialize and deserialize JVM objects and generate compact bytecode. Smaller bytecode ensures faster execution speeds.

Having discussed all the important aspects of the Spark APIs, the blog would remain incomplete if I did not discuss the use cases of each of them against the others.

When to use DataFrames or Datasets:

  • Rich semantics
  • High-level abstractions
  • Domain-specific APIs
  • Processing of high-level operations: filters, maps etc.
  • Columnar access and lambda functions on semi-structured data

When to use Datasets:

  • High degree of type safety
  • Take advantage of typed JVM objects
  • Take advantage of the Catalyst optimizer
  • Save space
  • Faster execution

When to use RDDs:

  • Low-level functionality
  • Tight control
Happy Reading !!



Web Designing using Pure.css

In today’s blog I will introduce Pure.css (simply referred to as “Pure”), its use cases and its advantages over its counterparts. The blog will get you acquainted with the basics of Pure. I shall discuss the basic idea behind its inception, its different components and, finally, how to integrate and use it in web projects.

What is Pure.css?

Pure.css is a set of responsive CSS modules built by the YUI team. It is a small framework made up of CSS modules. All its modules put together weigh less than 3.8KB if served minified and gzipped. One can save even more bytes by using only one or two of the modules.

Pure’s minimal styling interference ensures that one can customise the look and feel of web elements to meet one’s needs without much fuss.

What is Pure.css an alternative to?

Twitter Bootstrap, Zurb Foundation etc.

How is it Better?

There are several popular and widely accepted CSS frameworks available, but they are very heavy because they contain lots of CSS code. On top of that, one needs to customise the vanilla CSS code with personal styling, making it even more bloated.

Pure fixes this issue by shipping only the CSS modules required by almost every web project. Pure allows you to write your own styles on top of it for customisation.

  • Pure is ridiculously tiny. The entire set of modules clocks in at 3.8KB minified and gzipped. That is pretty tiny compared to other CSS frameworks.
  • Unlike other popular CSS frameworks like Bootstrap, Pure does not come with any JavaScript plugin out of the box. In fact, there are only a few instances where it uses JavaScript, like drop-down menus and the fixed top menu.


Writing documentation from a user perspective has always been a challenging job for developers. Markdown allows developers to write docs using an easy-to-read, easy-to-write plain text format.

But the challenge does not end here: providing users with offline documentation in a book-like format is also becoming essential and popular. Also, a large number of organisations contributing to open source do not want to spend extra resources on creating separate documentation for GitHub, the website etc.

In this blog I am going to discuss simple ways in which your markdown files can be used to cater to your website and other documentation needs.

Let us discuss how we can utilise our markdown files to produce documentation for a website. There are several Maven plugins; we will be talking about two of them, namely:

1. Maven Doxia

2. Markdown-page-generator Plugin

Maven Doxia

Going by the apache definition ‘Doxia is a content generation framework which aims to provide its users with powerful techniques for generating static and dynamic content: Doxia can be used in web-based publishing context to generate static sites, in addition to being incorporated into dynamic content generation systems like blogs, wikis and content management systems.’

Discussing Doxia at great length would take several blogs in itself, so let us just discuss how to use Doxia to convert markdown files to HTML.

Step 1: Create a Maven project

Step 2: Add the following plugin to the pom.xml
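As a hedged sketch of what such a configuration can look like, the fragment below adds the Doxia Markdown module as a dependency of the Maven Site plugin. The version numbers are assumptions and should be replaced with whatever your build uses. With this setup, Doxia conventionally picks up markdown files placed under src/site/markdown with the .md extension.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-site-plugin</artifactId>
  <!-- version is an assumption; use the one your build requires -->
  <version>3.4</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.maven.doxia</groupId>
      <artifactId>doxia-module-markdown</artifactId>
      <!-- version is an assumption -->
      <version>1.6</version>
    </dependency>
  </dependencies>
</plugin>
```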





















Step 3: Add the markdown files in the following structure


Step 4: Run the mvn site command to generate the HTML files from the markdown files

Markdown-page-generator Plugin

This Plugin creates static HTML pages with Maven and Markdown. Underlying it is a pegdown Markdown processor. The plugin is simple to use and integrate. It allows you to configure the input and output directories, which files to copy and which pegdown options are used. We can also include a custom header and footer and general title.

Adding headers and footers into generated html(s) allows us to customize and style the html(s).

The default configuration of input and output directories can be easily overridden:

inputDirectory : ${project.basedir}/src/main/resources/markdown/

outputDirectory : ${}/html/

Let us discuss the steps to convert markdown files to HTML using this plugin:

Step 1: Create a Maven project

Step 2: Add the plugin to pom.xml
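A minimal sketch of the plugin declaration, based on the plugin's published coordinates. The version number and the header and footer file paths are assumptions; the configuration block is optional and only needed when a custom header and footer are wanted.

```xml
<plugin>
  <groupId>com.ruleoftech</groupId>
  <artifactId>markdown-page-generator-plugin</artifactId>
  <!-- version is an assumption -->
  <version>0.10</version>
  <executions>
    <execution>
      <phase>process-sources</phase>
      <goals>
        <goal>generate</goal>
      </goals>
    </execution>
  </executions>
  <!-- optional: custom header and footer (paths are placeholders) -->
  <configuration>
    <headerHtmlFile>${project.basedir}/src/main/resources/markdown/html/header.html</headerHtmlFile>
    <footerHtmlFile>${project.basedir}/src/main/resources/markdown/html/footer.html</footerHtmlFile>
  </configuration>
</plugin>
```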


















This is a simple configuration to generate html(s).

Alternatively we can add custom header and footer



















Additional information regarding the input and output directories can be added.

Step 3: Run mvn install to generate the HTML.

Let us now discuss how we can generate offline or book-form documentation from these markdown files.

We can use plugins like Doxia and the Maven PDF plugin to generate books from these markdown files.

Below is a quick guide to generating a PDF document from your markdown files.

Step 1: Create a Maven Project

Step 2: Add all markdown files in a folder

Step 3: Add the following plugin to your pom.xml
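A hedged sketch of a Maven PDF plugin declaration, following the usage shown in the plugin's own documentation. The execution id, the phase binding and the configuration values are illustrative choices rather than requirements; the plugin can also be invoked directly from the command line without an execution.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-pdf-plugin</artifactId>
  <executions>
    <execution>
      <id>pdf</id>
      <!-- binding to the site phase is one possible choice -->
      <phase>site</phase>
      <goals>
        <goal>pdf</goal>
      </goals>
      <configuration>
        <outputDirectory>${project.reporting.outputDirectory}</outputDirectory>
        <includeReports>false</includeReports>
      </configuration>
    </execution>
  </executions>
</plugin>
```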


















Step 4: Add a file called pdf.xml. This file acts as a book descriptor and is optional. It works like an index which can be used to sequence the content in the PDF book. If this file is absent, the entire folder of markdown files is written into the PDF book in no particular order beyond what is specified in site.xml.

Step 5: Run the command mvn pdf:pdf to generate the PDF.

Please note that by default, the PDF plugin generates a single PDF document which aggregates all your site documents. If you wish to generate each site document individually, you need to add the parameter -Daggregate=false on the command line.

A sample pdf.xml is given below:

<document xmlns="…">

  <meta>
    <title>PDF Plugin Demo</title>
  </meta>

  <toc name="Table of Contents">
    <item name="Introduction" ref=""/>
    <item name="Usage" ref=""/>
    <item name="FAQ" ref=""/>
  </toc>

  <cover>
    <coverSubTitle>v. ${project.version}</coverSubTitle>
    <coverType>User Guide</coverType>
    <projectLogo>Path of Image</projectLogo>
    <companyName>Knoldus Software LLP</companyName>
    <companyLogo>Path of Image</companyLogo>
  </cover>

</document>



There are several other tools like markdown-pp and docbkx which can also be used for book generation and documentation.

Happy Reading !





Functors in Scala

While programming in Scala we often come across the term Functor. A functor is an extremely simple but powerful concept. In this blog, let us discuss it in more detail.

Theoretically, a functor is a type of mapping between categories. Given two categories A and B, a functor F maps the objects or entities of A to objects or entities of B. We can simply call it a function of objects or entities.

In programming, functors come into play when we have types or values wrapped inside contexts or containers. This wrapping inside a context blocks the application of normal functions to the values, because the result of applying a function depends on the context.

The solution to the above scenario is a function that knows how to apply functions to values wrapped in a context. Internally, this function should be able to:

  • unwrap(fetch) the value from the context
  • apply the function onto the value
  • re-wrap the resultant into context
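The three steps above can be sketched with a hand-written container. Box and BoxDemo are hypothetical names introduced only for this illustration; they are not part of Scala or Scalaz.

```scala
// A hypothetical single-value context.
final case class Box[A](value: A) {
  def map[B](f: A => B): Box[B] = {
    val unwrapped = value      // 1. unwrap the value from the context
    val result = f(unwrapped)  // 2. apply the function to the value
    Box(result)                // 3. re-wrap the result into the context
  }
}

object BoxDemo {
  val boxed: Box[Int] = Box(20)
  val incremented: Box[Int] = boxed.map(_ + 1)
  val described: Box[String] = incremented.map(n => s"value is $n")
}
```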

A Functor is defined by a type constructor F[_] together with a function

                                  map: (A => B) => F[A] => F[B]

where the following holds true :

  1. map identity means the same thing as identity
  2. (map f) compose (map g) is equivalent of map ( f compose g)

Functors are not built into Scala, i.e. they do not exist in Scala as an abstraction, but functor-like behavior is built into Scala. The invocation of map on various types in Scala is consistent with the definition of Functors. The example below illustrates the point:

Example :
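A minimal sketch using only the standard library: map on List and Option is checked against the two laws stated above for a couple of sample values. This demonstrates the laws on specific inputs rather than proving them, and the names FunctorLawsDemo, f and g are assumptions for illustration.

```scala
object FunctorLawsDemo {
  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2

  val xs: List[Int] = List(1, 2, 3)
  val some: Option[Int] = Some(10)

  // Law 1: mapping identity leaves the structure unchanged.
  val identityHolds: Boolean =
    xs.map(identity) == xs && some.map(identity) == some

  // Law 2: (map f) compose (map g) equals map (f compose g),
  // i.e. mapping g and then f is the same as mapping their composition once.
  val compositionHolds: Boolean =
    xs.map(g).map(f) == xs.map(f compose g) &&
      some.map(g).map(f) == some.map(f compose g)
}
```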


But certain libraries like Scalaz provide us with a concrete notion of Functors. Scalaz provides a type class Functor[F[_]] which defines map as

                                 def map[A, B](fa: F[A])(f: A => B): F[B]

along with a trait Functor which formulates the above stated laws.

Let us look at an implementation with an example:

First of all we need to import Scalaz

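import scalaz.Functor

// A simple container holding two values of the same type
case class Container[A](first: A, second: A)

implicit val demoFunctor: Functor[Container] = new Functor[Container] {

   // Apply f to both values and wrap the results in a new Container
   def map[A, B](fa: Container[A])(f: A => B): Container[B] = Container(f(fa.first), f(fa.second))

}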


There are many more examples of Functors, like Option, Stream, List etc.

implicit val OptionFunctor: Functor[Option] = new Functor[Option] {

    // Delegate to the map already defined on Option
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa map f

}

implicit val ListFunctor: Functor[List] = new Functor[List] {

    // Delegate to the map already defined on List
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa map f

}
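To see what the type class buys us without pulling in a library, here is a small standalone sketch. MiniFunctor is a hypothetical trait written for this illustration; it mirrors the shape of the map signature shown earlier but is not the Scalaz trait itself, and incrementAll is likewise an assumed helper.

```scala
// A hypothetical, library-free stand-in for a Functor type class.
trait MiniFunctor[F[_]] {
  def map[A, B](fa: F[A])(f: A => B): F[B]
}

object MiniFunctor {
  implicit val listFunctor: MiniFunctor[List] = new MiniFunctor[List] {
    def map[A, B](fa: List[A])(f: A => B): List[B] = fa.map(f)
  }

  implicit val optionFunctor: MiniFunctor[Option] = new MiniFunctor[Option] {
    def map[A, B](fa: Option[A])(f: A => B): Option[B] = fa.map(f)
  }

  // Written once, works for any F that has a MiniFunctor instance in scope.
  def incrementAll[F[_]](fa: F[Int])(implicit functor: MiniFunctor[F]): F[Int] =
    functor.map(fa)(_ + 1)
}
```

The point of the abstraction is the last method: code such as incrementAll is written against the type class and reused across containers.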


In my next blog I shall be discussing more about Functors and Applicatives.

Happy Reading !