Monday, September 07, 2015

Elasticsearch Notes

I have recently been playing with a lot of open source tool sets to figure out core solutions for different product ideas that I have. One of the technologies I have used recently is Elasticsearch. Elasticsearch is essentially a NoSQL indexing solution that lets you build Lucene indexes on top of massive data sets, especially string-based documents. This blog post is just a bunch of notes that I have compiled.

What is Elasticsearch?
Elasticsearch is a document store: documents are stored in an index, which is spread across a cluster as multiple shards. Sharding is the concept of partitioning data based on some metric within the data:
Now Elasticsearch exposes an HTTP request-response API to query the individual documents stored in the index.
In my case I created a 2 node cluster as shown in the following image:

After this step I created an index called imdb_test. Initially I wanted to create a graphing tool to showcase the connections I have on Facebook and the relations between them... but then I decided nah.... I will go with a more open public API search for JSON feeds. Note: I used the Sense console provided with Marvel for the GET/POST/TRACE HTTP commands to store/parse/write the data.

POST /imdb_test
{
  "mappings": {
    "post": {
      "properties": {
        "movie_id":               { "type": "integer" },
        "movie_name":             { "type": "string" },
        "movie_box_office_value": { "type": "long" },
        "movie_date_of_release":  { "type": "date" }
      }
    }
  }
}

A note here is that the nodes can be reconfigured with an API call:
PUT /_cluster/settings { "persistent" : { "discovery.zen.minimum_master_nodes" : 2 } }
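The value 2 is not arbitrary: to avoid a split-brain, minimum_master_nodes is conventionally set to a quorum of the master-eligible nodes, i.e. n/2 + 1 (which gives 2 for my 2-node cluster, though a 3-node cluster is safer for this). A quick sketch of the rule (the helper name is mine):

```python
def min_master_nodes(master_eligible: int) -> int:
    """Quorum rule used for discovery.zen.minimum_master_nodes: n // 2 + 1."""
    return master_eligible // 2 + 1

# For my 2-node cluster this yields 2; a 3-node cluster also needs 2.
print(min_master_nodes(2))  # 2
print(min_master_nodes(3))  # 2
```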
In this case I created an index with the following fields --> movie_id, movie_name, movie_box_office_value, movie_date_of_release, with the data types shown above.

Next I pulled the required JSON feeds for the documents from the IMDb open APIs. A couple of the queries I used are as follows:

Using a mix and match of some of the data, and generating random box office and date values in my temporary C# parser, I dynamically created a few entries for my Elasticsearch document repository. A couple of the POSTs are as follows:

POST imdb_test/post { "movie_name":"Star Wars: Episode VI - Return of the Jedi", "movie_id":1, "movie_box_office_value":7000000, "movie_date_of_release":"2005-07-01" }
POST imdb_test/post { "movie_name":"Terminator", "movie_id":2, "movie_box_office_value":10000000, "movie_date_of_release":"1994-06-04" }
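The same documents can be generated programmatically instead of hand-writing each POST. A rough Python sketch of how the entry bodies could be built (the helper name and the localhost URL are my assumptions; the commented-out request would need an Elasticsearch node listening on port 9200):

```python
import json

def make_movie_doc(movie_id, name, box_office, release_date):
    """Build one document body for the imdb_test/post type."""
    return {
        "movie_id": movie_id,
        "movie_name": name,
        "movie_box_office_value": box_office,
        "movie_date_of_release": release_date,
    }

doc = make_movie_doc(2, "Terminator", 10000000, "1994-06-04")
body = json.dumps(doc)

# With a local node this body would be sent as:
#   POST http://localhost:9200/imdb_test/post
# e.g. via urllib.request or the requests library.
print(body)
```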

After generating a couple of data files for my index.... I was able to query the list with specific filters / basic queries using _search, like GET /imdb_test/_search, or more complex queries like:
GET imdb_test/post/_search
{
  "query": {
    "match": { "movie_name": "terminator" }
  },
  "aggs": {
    "all_words": {
      "terms": { "field": "movie_name" }
    }
  }
}
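To sanity-check what such a request returns, here is a rough local equivalent in Python: the match query works on the analyzed movie_name (lowercased word tokens), and the terms aggregation just counts each indexed token across the matching docs. This is only an illustration of the idea, not how Lucene actually stores anything:

```python
from collections import Counter

docs = [
    {"movie_name": "Star Wars: Episode VI - Return of the Jedi"},
    {"movie_name": "Terminator"},
]

def analyze(text):
    """Crude stand-in for the standard analyzer: lowercase word tokens."""
    cleaned = "".join(c if c.isalnum() else " " for c in text.lower())
    return cleaned.split()

# "match" on movie_name: a doc matches if any query token is among its tokens
query_tokens = analyze("terminator")
hits = [d for d in docs if set(query_tokens) & set(analyze(d["movie_name"]))]

# "terms" agg on the analyzed field: bucket counts per token
buckets = Counter(tok for d in hits for tok in analyze(d["movie_name"]))
print(hits)     # only the Terminator document
print(buckets)  # Counter({'terminator': 1})
```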
I stopped short after checking out my routing using GET commands:
GET /imdb_test/post/Terminator
The next step was to create a SPA to generate a dashboard from the result set. The primary advantage of Elasticsearch is its querying ability over massive data volumes, which makes it useful for document repositories like mine, for blogging, and even for geo-based analysis of data. It works with JSON documents and also ships with a really cool analytics dashboard, called Kibana, to showcase the metrics of the environment. We can also alias our index: in my example we could categorize the movies into Horror/Action/Comedy etc. genres by aliasing, which is a pretty handy feature.

On that note, I took the above Elasticsearch settings and applied them to a Life Sciences need. I stored my metadata results for the graphs inside Elasticsearch and pulled in the required metrics. In my case I needed the graphs to be jazzy --> so I had 3 index sets with two separate aliases each, and I changed the data feeds accordingly. For the genetic algorithm, I basically leveraged the JSON feed from Karsten @ I also got a good idea for the dashboard from Colin @ which was more for hospital management.

Here is a sneak peek at the dashboard:-

Note: The red, green and yellow were meant to mimic a heat map, with trends that are hot - red, lukewarm - green and shallow - yellow. I used simple SVG rects for that portion.
The visualizations were done using D3 and the GET requests were made from Angular.....

Monday, January 26, 2015

Data Implosion Continued....

This post is part 2 of my previous post on Data Implosion and not Explosion. Of late I have been reading a lot of books on data, especially related to Data Science, and one of the books I am on the verge of completing is Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville. The reason I purchased this book was to get a deeper insight into what Data Scientists do and what is required in order to become one. The unfortunate part of this book is that getting a deeper insight into Data Science from it is nearly impossible. The reason is that the majority of the formulae (which I would love to understand and apply) are not defined clearly, and most of them do not define the coefficients used. Coming from a Computer Science background, I am familiar with core aspects like Big O notation and how to derive it from one's code, so reading through those parts was a breeze, since I leverage these concepts on a regular basis (or whenever I actively code). The statistical representation of data, barring frequency tables (and a few other aspects like mean, average and standard deviation), is understandable, but formulae involving weights and sample spaces are not clearly represented and come across garbled. I would, however, definitely recommend the chapter on Excel, which I think has been written fantabulously and is a must read for data geeks. The sections on clustering, and on when to leverage certain principles like sampling, were again very shaky for me to read and understand. So I am going to talk about cluster sampling in this article:
1. Let us consider a population of 500 people.
2. Divide this sample space into clusters: For example->
    A] number of men (140)
    B] number of women (150)
    C] number of children (210)
3. Each cluster is now a sample in its own right; we will not combine them but keep them delineated after step 2.
4. From a random population sample, we choose a subset of men, women and children from our respective clusters and apply the necessary parameters we need to gauge. A couple of examples here would be:
   A] What time does one sleep?
   B] What time does one have lunch?
   C] How many times in a month does one shop? etc..

5. Let us take only [A] for our current problem.
--> I choose 20 from each subset as a sample
Let us consider Men-
Space Definition        Number of Men who sleep        Proportion
1pm - 5pm                          4                   4/20 = 0.20
6pm - 10pm                        12                  12/20 = 0.60
post 10pm                          4                   4/20 = 0.20
Total                             20                          1.00

From this we can apply the same to the overall Men's cluster and gather the required metrics.
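The steps above can be sketched in a few lines of Python (the numbers are the ones from the table; the variable names are mine):

```python
# Within-cluster sample: 20 men, bucketed by the time they go to sleep
men_sample = {"1pm-5pm": 4, "6pm-10pm": 12, "post 10pm": 4}

total = sum(men_sample.values())  # 20
proportions = {k: v / total for k, v in men_sample.items()}

# Project the sample proportions back onto the whole men's cluster (140 men)
cluster_size = 140
estimate = {k: round(p * cluster_size) for k, p in proportions.items()}

print(proportions)  # {'1pm-5pm': 0.2, '6pm-10pm': 0.6, 'post 10pm': 0.2}
print(estimate)     # {'1pm-5pm': 28, '6pm-10pm': 84, 'post 10pm': 28}
```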

Now cluster sampling can actually bombard your analysis if your data points are redundant, making it not worth the time and effort. But as I mentioned earlier, a Data Implosion Mechanic is someone who will not deduce these aspects from the data volume but rather force the metrics that need to be imposed on the data. I could be wrong here, but this is my take on it... Let's move on to Monte Carlo simulations a bit....
Let us consider [B], allocate 2 time slots, 12pm-1pm and 1pm-2pm, and say that none of the people who took the survey were willing to respond to this specific question.
So let us randomize a 12pm - 1pm (1) or 1pm - 2pm (0) value for each of the 150 women:
In this case, taking the average of the 150 entries, say 44% of the women take their meal between 12pm - 1pm and the remaining 56% take their meal after 1pm. So if I were to set up a sandwich shop, I might consider going after the 1pm slot..... (Since this is more of a random application of metrics, you could cross-reference the data results and then try your experiment.)
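That Monte Carlo step is easy to reproduce: assign each of the 150 women a random 1 (12pm-1pm) or 0 (1pm-2pm) and average the draws. A minimal sketch (the 44% figure in my text was just one such random outcome; a seeded run below produces its own number):

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

women = [random.randint(0, 1) for _ in range(150)]  # 1 = 12pm-1pm, 0 = 1pm-2pm
share_noon = sum(women) / len(women)

print(f"{share_noon:.0%} eat between 12pm-1pm, "
      f"{1 - share_noon:.0%} after 1pm")
```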


Wednesday, October 22, 2014

Data Implosion and not Explosion

I was planning on writing this article a while ago but never got down to it.... but now I have some time on my hands to write about how data has changed over the years.

Nowadays data has been imploding, and what I want to convey by implosion is that the manner in which data is collated and published to end users, with various transformations along the way, creates a sense of cause and effect. You have massive data sets, but the information in them can be correlated in so many different ways that eventually one is unable to figure out how to get the required information into the end user's field of vision.

Consider Big Data: whenever we feel that the volume of data has reached its significant end in terms of storage limitations, we can introduce a (big data) solution in order to contain the explosion and ensure that there is no data loss.... hence in this case we can always ensure that, in the eventuality of a massive uptake of data, we can control the size limitations by doing something concrete. But now comes the important aspect of how multiple streams of well-contained data bump into each other and make information ambiguous to end users.

Let us consider a database which contains the list of movies released in the past 5 years..... This is a massive amount of data, but we still have containers in place to collect it. Now let's say we need to figure out which movies were successful (not only in monetary terms) and which were not. Initially this sounds simple, and we think this endeavor is going to be easy, but as we look into the data there are many metrics which need to be considered here:

A] Box Office Revenue per region
B] Number of Theaters released in per region
C] Actors in the movie
D] Production Budget
E] Director of the Movie
F] Studio responsible for the movie
G] Genre of the movie
H] Region based dissemination of people who watch the particular genre
I] Video Rentals and Digital sales

And the list goes on.... even though all these data points are present, it is still difficult to decipher the data and figure out which movie can be considered a success (box office, critically acclaimed, fan acclaimed etc...) or not, because the data being stored is imploding at such a massive scale that it becomes indeterministic to even collate a meaningful representation of the data from the container.
Example here:
The above figure showcases how the metrics are gathered and the manner in which they collide

Even creating hierarchies becomes an incredible effort.

Now the important aspect here is to pre-determine what one can gain from the massive volume of data and create a standard or norm which will be the end-all for all metrics guidelines. This will eventually have to be a criterion decided among the masses, not by an individual entity, which comes back to my earlier statement that data is imploding at a massive rate and not exploding. We need to introduce implosion mechanisms, and maybe that will eventually become a job by itself. Think about it --> Data Implosion Handler @ xyz co..... interesting, yet eventually bound to happen.

Tuesday, August 19, 2014

Blue Screen of Death --> Microsoft Power Map

I was fiddling around with Power Map and wanted to see how far I could go ahead with the mapping visualization.......
I basically leveraged the Power Station data file available at:

After downloading this Excel file, I started to create a Power Map with the basic idea of having 3 layers.
1. The first layer would take into account all the counties based on the Power Transmitted
2. The second layer would consist of all the companies based on the power distribution
3. The third layer considers the power distribution based on Plant Name....

Now this Excel file contains roughly 20,000 rows...... My machine is a Windows 8 box with an Intel i7 core and 16 GB of RAM. I was good up to step 1 in the steps described above, but when I hit step 2 I noticed that my memory usage just bloated like crazy...... I then decided, like a crazy person, to try step 3, and that's when my machine crashed. Unfortunately I could not take screenshots, and I would not like to reproduce this issue. But just to give a general idea: even playing around with roughly 20,000 rows of data using the above layering, with probably close to 8 GB of free memory on my box, killed my machine, with disk usage going to 100%. I decided to work on a smaller subset and show you what exactly I was trying to produce (this is for the state of California with a much smaller subset of counties):

 Step 1 is shown in the above diagram

Step 2 is as shown in the above diagram

The above figure is very close to what I wanted to do, but I used a clustered column instead. I would definitely recommend going slow while creating a Power Map. It provides mind-blowing visualizations, but it is dependent on the amount of memory one can leverage at any given point in time and can blow up if the resultant data set is too large. The above examples were created using about 500 rows of data. I was far more cautious as I did not want to blow up my machine again. But at the end of the day, the map was beyond imagination in terms of visualization of data points. Post a comment on this post if this has happened to you as well!!

Monday, August 11, 2014

Redshift Experience

Big Data - the keyword given to solutions that can handle massive amounts of data, usually at petabyte scale or greater. There are several big data solutions out there, and all of them have unique characteristics which are useful in different scenarios. I was looking into Cloudera's Hadoop-ecosystem components like Impala, Sentry and HBase. All of these vary based on the use case. For some of my clients I have leveraged Amazon Redshift and Cassandra (and hopefully soon Apache Hadoop). The architecture of these systems differs, but the end goal is the storage and processing of vast amounts of data down to second- or millisecond-level result generation. Focusing on this aspect, I am going to give a more detailed insight into Redshift, which is a node-based petabyte-scale database, as well as a high-level overview of what I recently implemented.
Note: The above diagram is from the Redshift Warehousing article (

If you pay close attention to the diagram above, the compute nodes are responsible for all the data processing and query transactions on the database, and the data nodes contain the actual node slices. The Leader Node is the service that monitors the data connections against the Redshift cluster and is also responsible for query processing (almost like a syntax checker on the query and the functions leveraged). It then transfers the query across to the compute nodes, whose main responsibility is to find the data slices on the data nodes and communicate with one another to determine how the transaction needs to be executed. It is similar to a job broker, except that this is more real time than non-real time.

It is similar to the analogy of using a bucket..... Consider this:

You take a bucket and keep filling water into it; eventually the bucket gets filled..... but what happens when there is an enormous amount of water that needs to be contained? Either grab a massive bucket or use multiple buckets to store the water (the second option is what depicts the Redshift architecture....).

The concept of scaling out implies not only adding a bucket for storage but also a mechanism to ensure that the pipeline flow goes to an empty bucket, which is nothing but our compute node. But there is the price of the bucket and the cost of the mechanism required to populate the bucket as well....

The costing of data nodes and compute nodes are as follows:

As the data node volume keeps increasing, reserve a data node and then utilize it as required. While creating the Redshift cluster, ensure that the cluster type is Multi-Node and not Single-Node for large data projects.
The tables are stored with a columnar partitioning algorithm on the different nodes, governed by the distribution key of the respective table. The database used within Redshift is based on Postgres (the version keeps getting updated per release); this is handled by Amazon. On a quick note, whenever you are adding a node to the cluster the database will go down, so ensure that all your ETLs or import/export jobs are stopped during this time frame.
Even though daily snapshots are taken of the Redshift cluster, a key note here: if one is ever going to delete the cluster and recreate it from a previous snapshot, ensure that the new cluster is created from the required snapshot first, and only then delete the original Redshift cluster.
Even though Redshift has a lot of data integration tool partnerships, I would definitely recommend leveraging Talend ETL. I might give a breakdown of a Talend ETL job in an upcoming article.
For reporting on the other hand there are several analytics tools like Pentaho, Microstrategy, Tableau, Birst etc..., I would recommend using Tableau/Pentaho/Microstrategy for analytical dashboarding needs based on the cost consideration.
Let us get back to Redshift: the storage mechanism is similar to the Tabular VertiPaq engine model that Microsoft leverages, and the processing is done using MPP, i.e. Massively Parallel Processing. The encryption is two-fold. The first fold is at the database level, where data encryption keys encrypt the data blocks in the cluster: each data block is assigned a randomly generated AES-256 key, and these keys are in turn encrypted using the database key for the cluster. The other fold is at the data level, commonly known as encoding. Here is a flat table of which encoding to use for which data type:-
Data type                                  Encoding
int, date, datetime, timestamp             delta
varchar or char up to 50 characters        runlength
varchar > 50 characters                    text255
smallint, decimal                          mostly8
others                                     raw
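The table above can be folded into a tiny helper for deciding the ENCODE attribute when writing CREATE TABLE DDL (this is just my reading of the table, not an official API):

```python
def suggest_encoding(data_type: str, char_length: int = 0) -> str:
    """Map a column's data type to the compression encoding from the table above."""
    t = data_type.lower()
    if t in ("int", "date", "datetime", "timestamp"):
        return "delta"
    if t in ("varchar", "char"):
        return "runlength" if char_length <= 50 else "text255"
    if t in ("smallint", "decimal"):
        return "mostly8"
    return "raw"

# e.g. a movie_name varchar(200) column in a Redshift copy of my imdb data:
print(suggest_encoding("varchar", 200))  # text255
print(suggest_encoding("date"))          # delta
```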
The cons of using Redshift are as follows:
  • You need an Administrator to monitor the Redshift cluster at all times
  • The power of Redshift lies in the manner the database is queried, so a SQL developer/DBA with understanding of the Redshift internals is definitely a must
  • Upfront cost is slightly on the higher side but over a period in time the cost will be justified with more nodes being added to the cluster
Redshift supports both JDBC and ODBC drivers, so you have options if you want to query the database. The Postgres driver can be found as follows:
Like any other traditional RDBMS, if there is a constant stream of queries running on the same table, locking of the table is inevitable, so care must be taken to ensure that the tables are not locked, using the following statement:
                --> select * from STL_TR_CONFLICT;
Also ensure that the statistics are rebuilt on a regular basis using the following command:
                --> ANALYZE table_name;
for all the tables on which queries are performed on a periodic basis.
Also enable auditing on the cluster using the Amazon Redshift console.
A good practice is to ensure that the data warehouse schema conforms to a star schema to improve query performance.
This is it for now but stay tuned to more updates in this regard!!!

Thursday, May 08, 2014

SSAS Cube issues ..... Incorrect Measure Values

Have you ever noticed that your measure values in SSAS do not correspond to the values in the data warehouse? This is a big hindrance.... because one wastes precious development time extrapolating data between the cube and the warehouse. I am going to create a quick checklist of issues to look at if one does come across this problem!!!

Problem Statement:
Let us consider a Fact table called FactInternetSales with a fact called internetsalesamount
select sum(internetsalesamount) from FactInternetSales;

Let us say that this value is 25000.
Now if we run this against the cube .... lets call it Sales.
select [Measures].[internetsalesamount] on 0 from Sales;
Now the value returned from this query is 510......
Why is this happening? Just follow this checklist to rectify the issue as quickly as possible....
1. Go to the cube and in the [internetsalesamount] properties, change the aggregation value to count instead of sum. Validate whether the measure value matches the count in the datawarehouse i.e. select count(1) from FactInternetSales.... and ensure that the count matches.
2. Check the dimensions that are slicing the cube and ensure that all the dimension attribute values have a 1:1 or 1: many mapping with the FactInternetSales. Majority of the time, there could be a dimension value missing in the dimension which is present in the cube. If this is the case, try and figure out why the dimension value is missing and add it back in.
3. Sometimes the datatype used as part of the measure could also lead to such problems so just a quick cheat sheet to leverage:
int (in datawarehouse) --> corresponding measure should be bigint
decimal --> double
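Point 3 is easy to demonstrate: if the measure's data type is a 32-bit int, a large SUM silently wraps around, which is exactly how a big warehouse total can come back as a small, wrong number from the cube. A quick simulation of the 32-bit wraparound (plain Python ints do not overflow, so the wrap is applied by hand; the sample total is made up):

```python
def wrap_int32(n: int) -> int:
    """Simulate storing n in a signed 32-bit integer (two's complement)."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

true_total = 3_000_000_000           # sum of internetsalesamount in the warehouse
cube_total = wrap_int32(true_total)  # what an int-typed measure would report

print(true_total)  # 3000000000
print(cube_total)  # -1294967296  <- garbage, hence the bigint recommendation
```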
4. Constant changes to the fact table..... now if the fact table constantly changes, it is best practice to change the processing model to ROLAP instead of the MOLAP (I know you end up with more processing time but nothing beats real time data especially for the analysts)
5. Check the SQL query that is generated in populating the Fact in SSAS... this might be an inconclusive in majority of the scenarios but its still worth a shot
6. Final step in the checklist, if all else fails try try and succeed........

Friday, February 07, 2014

New CEO for Microsoft

It was recently announced that Satya Nadella (a Microsoft veteran) has become the CEO of Microsoft. Even though I have a great sense of pride that a person of Indian origin, a first-generation American immigrant, is now probably the most powerful person in the I.T. industry today, Microsoft placed a safe bet in naming Nadella as its CEO. I feel that a better choice would have been Sundar Pichai (VP @ Google) or even Robin Li (the founder and CEO of Baidu). Of course, from an experience point of view Nadella needs absolutely no introduction; his resume speaks for itself. He created and successfully handled a multitude of Microsoft's unique silos or divisions. But infusing new ideas, decoupled from the thought process that Microsoft has always had, would have paved the way for a new era of Microsoft revolutionizing the tech industry. This is a factor that will grind in Microsoft's cogs, because Nadella has clearly been part of the thought process that has been detrimental to Microsoft and probably will not deviate from it (old habits die hard). Another big letdown was the fact that my idol Bill Gates is back in a technical role. It shows the level of confidence that the old guard at Microsoft has in the new CEO. Of course, Microsoft has pushed this across as a highlight of Nadella becoming CEO, since the legendary Bill Gates, with all his experience of building an empire, is back at the tech table. But to me this is more of a smoke screen than anything else. It makes Nadella look a bit weak, in retrospect, at what should have been a tribute to the old era. Bill Gates and Steve Ballmer could have come out with their heads held high, with a great sense of achievement at building a global entity from scratch and handing over the reins to Nadella, but that is not the case, as Mr. Gates is back in a full-time capacity and Ballmer will definitely be present for an interim period until Nadella settles down. This is why Microsoft should have gone with a fresh CEO, a person who can enable Microsoft's key strengths as well as infuse new ideas. I was rooting for Sundar from the moment his name was dropped into the hat. Who knows, this might have been the handshake that the two biggest giants in the industry, Google and Microsoft, might have required. Another person I thought of when Microsoft was searching for a new CEO was Mark Zuckerberg. Sure, he has Facebook to handle, but just imagine if he were named the CEO of Microsoft, and Facebook no longer remained a unique identity but became a part of Microsoft. That truly would have been an earth-shattering event: two giants coming together as one integrated unit which would continue to create magic (a dream that never became reality). These are my thoughts and I felt like expressing them, but you never know, maybe the Nadella-Gates combination might hit a "six" (as they call it in cricket).