Wednesday, October 22, 2014

Data Implosion and not Explosion

I was planning on writing this article a while ago but never got down to it.... but I do have some time on my hands now to write about how data has changed over the years. Data nowadays has been imploding, and what I mean by implosion is that the manner in which data is collated and published to end users, with various transformations along the way, creates a sense of cause and effect. You have massive data sets, but the information in them can be correlated in so many different ways that eventually one is unable to figure out how to get the required information into the end user's field of vision.

Let us consider Big Data solutions.... whenever we feel that the volume of data has reached its limit in terms of storage, we can introduce a Big Data solution in order to contain the explosion and ensure that there is no data loss. So in the event of a massive uptake in data we can always control the size limitations by doing something concrete. But now comes the important aspect: how multiple streams of well-contained data bump into each other and make information ambiguous to end users.

Let us consider a database which contains the list of movies released in the past 5 years..... This is a massive amount of data, but we still have containers in place to collect it. Now let's say we need to figure out which movies were successful (not only in monetary terms) and which were not. Initially this sounds simple, but as we look into the data there are so many metrics which need to be considered here:

A] Box Office Revenue per region
B] Number of Theaters released in per region
C] Actors in the movie
D] Production Budget
E] Director of the Movie
F] Studio responsible for the movie
G] Genre of the movie
H] Region based dissemination of people who watch the particular genre
I] Video Rentals and Digital sales

And the list goes on.... Even though all these data points are present, it is still difficult to decipher the data and figure out which movie can be considered a success (box office, critically acclaimed, fan acclaimed, etc.) and which cannot, because the data being stored is imploding at such a massive scale that it becomes almost non-deterministic to collate a meaningful representation of the data from the container.
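To make this concrete, here is a minimal sketch of what a "success" query might look like once a few of these metrics are stitched together. The table and column names (movie, box_office, production) are purely hypothetical, and the "revenue greater than twice the budget" rule is just an arbitrary cut-off for illustration:

    -- Hypothetical schema:
    --   movie(movie_id, title, genre, release_year)
    --   box_office(movie_id, region, revenue, theater_count)
    --   production(movie_id, budget, studio, director)
    select m.title,
           sum(b.revenue)  as worldwide_revenue,
           max(p.budget)   as production_budget,
           case when sum(b.revenue) > 2 * max(p.budget)
                then 'commercial success'
                else 'under-performer'
           end             as verdict
    from movie m
    join box_office b on b.movie_id = m.movie_id
    join production p on p.movie_id = m.movie_id
    where m.release_year >= 2009
    group by m.title;

Even this simple cut ignores critical acclaim, fan acclaim, genre, regional viewing habits, rentals and digital sales, which is exactly the point: every extra stream you fold in changes what "success" means.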
Example: the figure here showcases how the metrics are gathered and the manner in which they collide.

Even creating hierarchies becomes an incredible effort.


Now the important aspect here is to pre-determine what scenarios one can gain from the massive volume of data and create a standard or norm which will be the reference point for all metrics guidelines. This will eventually have to be a criterion decided among the masses and not by an individual entity, which comes back to my earlier statement that data is imploding at a massive rate and not exploding. We need to introduce implosion mechanisms, and maybe that will eventually be a job by itself. Think about it--> Data Implosion Handler @ xyz co..... interesting, yet eventually bound to happen.

Tuesday, August 19, 2014

Blue Screen of Death --> Microsoft Power Map

I was fiddling around with Power Map and wanted to see how far I could push the mapping visualization.......
I basically leveraged the Power Station data file available at:
http://office.microsoft.com/en-us/excel-help/redir/XT104048055.aspx?CTT=5&origin=HA104091224

After downloading this Excel file, I started to create a Power Map with the basic idea of having 3 layers:
1. The first layer would take into account all the counties based on the Power Transmitted
2. The second layer would consist of all the companies based on the Power distribution
3. The third layer considers the power distribution based on Plant Name....

Now this Excel file contains roughly 20,000 rows...... My machine is a Windows 8 box with an Intel Core i7 and 16 GB of RAM. I was fine up to step 1 above, but when I hit step 2 I noticed that memory usage just bloated like crazy...... I decided, like a crazy person, to try step 3 anyway, and that's when my machine crashed. Unfortunately I could not take screenshots, and I would not like to reproduce the issue. But just to give a general idea: even playing around with roughly 20,000 rows of data, using the above layering and with probably close to 8 GB of free space on my box, killed my machine, with disk usage going to 100%. I decided to work on a smaller subset and show you exactly what I was trying to produce (this is for the state of California with a much smaller subset of counties):


 Step 1 is shown in the above diagram


Step 2 is as shown in the above diagram


The above figure is very close to what I wanted to do, except that I used a clustered column instead. I would definitely recommend going slow while creating a Power Map. It provides mind-blowing visualization, but it is dependent on the amount of memory one can leverage at any given point in time and can blow up if the resultant data set is too large. The above examples were created using about 500 rows of data; I was far more cautious as I did not want to blow up my machine again. But at the end of the day, the map was beyond imagination in terms of visualization of data points. Post a comment on this post if this has happened to you as well!!

Monday, August 11, 2014

Redshift Experience

Big Data - the keyword given to solutions that can handle massive amounts of data, usually at petabyte scale or greater. There are several Big Data solutions out there, and all of them have unique characteristics which make them useful in different scenarios. I was looking into components of Cloudera's Hadoop distribution like Impala, Sentry and HBase; all of these vary based on the use case. For some of my clients I have leveraged Amazon Redshift and Cassandra (and hopefully soon Apache Hadoop). The architectures of these systems differ, but the end goal is the storage and processing of vast amounts of data, down to second or millisecond result generation. Focusing on this aspect, I am going to give a more detailed insight into Redshift, which is a node-based, petabyte-scale database, as well as a high-level overview of what I recently implemented.
Note: The above diagram is from the Redshift Warehousing article (http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html)

If you pay close attention to the diagram above, the compute nodes are responsible for all the data processing and query transactions on the database, and the data nodes contain the actual node slices. The leader node is the service that monitors the connections against the Redshift cluster and is also responsible for query processing (almost like a syntax checker on the query and the functions leveraged). It then hands the query across to the compute nodes, whose main responsibility is to find the data slices on the data nodes and communicate with one another to determine how the transaction needs to be executed. It is similar to a job broker, except that it operates in near real time.
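A quick way to actually watch the leader node do this planning work is to ask for the query plan before running anything heavy. This is just a sketch (dist_sales is a placeholder table name): the leader node parses and plans the statement, and the compute nodes then execute the plan steps against their slices.

    explain
    select region, sum(revenue)
    from dist_sales
    group by region;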

It is similar to the analogy of a bucket..... Consider this:

You take a bucket and keep filling it with water; eventually the bucket gets filled..... But what happens when there is an enormous amount of water that needs to be contained? Either grab a massive bucket or use multiple buckets to store the water (the second option is what actually depicts the Redshift architecture....).

The concept of scaling out implies not only adding a bucket for storage but also a mechanism to ensure that the flow in the pipeline goes to an empty bucket, which is nothing but our compute node. But there is the price of the bucket, plus the cost of the mechanism required to populate the bucket as well....

The costing of data nodes and compute nodes is as follows:


 
 
As the data volume keeps increasing, reserve a data node ahead of time and then utilize it as required. While creating the Redshift cluster, ensure that the cluster type is Multi-Node and not Single-Node for large data-related projects.
 
The tables are stored with a columnar partitioning algorithm across the different nodes, governed by the distribution key of the respective table. The database used within Redshift is Postgres (the version keeps getting updated with each release); this is handled by Amazon. On a quick note, whenever you add a node to the cluster the database will go down, so ensure that all your ETLs and import/export jobs are stopped during that time frame.
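As a rough sketch of how the distribution key is declared (the table and column names here are mine, not from any particular project), a table definition might look like this:

    -- Rows are spread across the node slices by the DISTKEY column
    -- and stored on disk in SORTKEY order.
    create table fact_sales
    (
        sale_id     bigint        not null,
        customer_id int           not null,
        sale_date   date          not null,
        amount      decimal(18,2)
    )
    diststyle key
    distkey (customer_id)
    sortkey (sale_date);

Picking a distribution key that matches your most common join column keeps related rows on the same slice and saves a lot of cross-node traffic.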
 
Even though daily snapshots are taken of the Redshift cluster, a key note here: if you are ever going to delete the cluster and recreate it from a previous snapshot, ensure that the new cluster is created from the required snapshot first, and only then delete the original Redshift cluster.
 
Even though Redshift has a lot of data integration tool partnerships, I would definitely recommend leveraging Talend ETL. I might give a breakdown of a Talend ETL job in an upcoming article.
For reporting, on the other hand, there are several analytics tools like Pentaho, MicroStrategy, Tableau, Birst, etc.; I would recommend using Tableau/Pentaho/MicroStrategy for analytical dashboarding needs, based on cost considerations.
 
Let us get back to Redshift. The storage mechanism is similar to the Tabular VertiPaq engine model that Microsoft leverages, and the processing is done using MPP, i.e. Massively Parallel Processing, engines. The encryption is two-fold. The first fold is at the database level: each data block in the cluster is assigned a randomly generated AES-256 key, and these keys are in turn encrypted using the database key for the cluster. The other fold is at the data level, commonly known as encoding. Here is a flat table of which encoding should be used for which data type:

Data type                                  Encoding
int, date, datetime, timestamp             delta
varchar or char up to 50 characters        runlength
varchar > 50 characters                    text255
smallint, decimal                          mostly8
others                                     raw
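A minimal sketch of applying these encodings at table creation time; the column names are illustrative and the choices simply follow the table above:

    create table stage_orders
    (
        order_id    int          encode delta,      -- int/date/timestamp --> delta
        order_date  date         encode delta,
        status      varchar(20)  encode runlength,  -- short varchar --> runlength
        comments    varchar(255) encode text255,    -- longer varchar --> text255
        qty         smallint     encode mostly8,    -- smallint/decimal --> mostly8
        misc        varchar(max) encode raw         -- everything else --> raw
    );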
 
The cons of using Redshift are as follows:
  • You need an administrator to monitor the Redshift cluster at all times
  • The power of Redshift lies in the manner in which the database is queried, so a SQL developer/DBA with an understanding of the Redshift internals is definitely a must
  • The upfront cost is slightly on the higher side, but over time the cost is justified as more nodes are added to the cluster
 
Redshift supports both JDBC and ODBC drivers, so you can definitely query the database directly. The Postgres driver can be found as follows:
 
Like any other traditional RDBMS, if there is a constant stream of queries running against the same table, locking of the table is inevitable, so care must be taken to ensure that the tables are not locked, using the following statement:

                --> select * from STL_TR_CONFLICT;

Also, to ensure that the statistics are built on a regular basis, run the following command:

                --> analyze table_name;

for all the tables on which queries are performed on a periodic basis.
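To figure out which tables actually need an ANALYZE, the SVV_TABLE_INFO system view exposes a stats_off percentage per table. This is a sketch based on that view as documented by AWS; the 10 percent threshold is just my own arbitrary cut-off:

    -- Tables whose statistics are more than 10% stale
    select "table", stats_off, tbl_rows
    from svv_table_info
    where stats_off > 10
    order by stats_off desc;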
 
 
Also enable auditing on the cluster using the Amazon Redshift console.
A good practice is to ensure that the data warehouse schema conforms to a star schema to improve query performance.
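One way to get there on Redshift (again a sketch with made-up names) is to replicate the small, frequently joined dimensions to every node with DISTSTYLE ALL, while the fact table stays distributed on the shared join key (as in the fact_sales sketch earlier), so star-join queries avoid redistributing data:

    -- Small, frequently joined dimension replicated to every node
    create table dim_customer
    (
        customer_id int          not null,
        customer_nm varchar(100)
    )
    diststyle all;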
This is it for now, but stay tuned for more updates in this regard!!!
 
 
 

Thursday, May 08, 2014

SSAS Cube issues ..... Incorrect Measure Values

Have you ever noticed that your measure values in SSAS do not correspond to the values in the data warehouse? This is a big hindrance, because one ends up wasting precious development time reconciling data between the cube and the warehouse. I am just going to create a quick checklist of things to look at if you do come across this issue!!!

Problem Statement:
Let us consider a fact table called FactInternetSales with a fact called InternetSalesAmount:

select sum(InternetSalesAmount) from FactInternetSales;

Let us say that this value is 25000.
Now if we run this against the cube (let's call it Sales):

select [Measures].[InternetSalesAmount] on 0 from [Sales];

Now the value returned from this query is 510......
Why is this happening? Just follow this checklist to make sure you can rectify the issue as soon as you possibly can:
1. Go to the cube and, in the [InternetSalesAmount] properties, change the aggregation to Count instead of Sum. Validate whether the measure value matches the count in the data warehouse, i.e. select count(1) from FactInternetSales, and ensure that the counts match.
2. Check the dimensions that are slicing the cube and ensure that all the dimension attribute values have a 1:1 or 1:many mapping with FactInternetSales. The majority of the time there is a dimension value present in the fact table but missing from the dimension. If this is the case, try to figure out why the dimension value is missing and add it back in (see the sketch after this checklist for a quick way to find such orphan keys).
3. Sometimes the data type used for the measure could also lead to such problems, so here is a quick cheat sheet to leverage:
int (in the data warehouse) --> the corresponding measure should be BigInt
decimal --> Double
4. Constant changes to the fact table: if the fact table constantly changes, it is best practice to change the storage mode to ROLAP instead of MOLAP (I know queries end up taking longer, but nothing beats real-time data, especially for the analysts).
5. Check the SQL query that is generated to populate the fact in SSAS. This might be inconclusive in the majority of scenarios, but it's still worth a shot.
6. Final step in the checklist: if all else fails, try, try and succeed........
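For step 2 above, here is a quick way to spot fact rows whose dimension key has no match in the dimension. This is a sketch using AdventureWorks-style names (DimCustomer/CustomerKey); swap in whichever dimension you are validating:

    -- Fact rows pointing at a customer key that is missing from the dimension
    select f.CustomerKey, count(*) as orphan_rows
    from FactInternetSales f
    left join DimCustomer d on d.CustomerKey = f.CustomerKey
    where d.CustomerKey is null
    group by f.CustomerKey;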

Friday, February 07, 2014

New CEO for Microsoft

It was recently announced that Satya Nadella (a Microsoft veteran) has become the CEO of Microsoft. Even though I have a great sense of pride that a person of Indian origin, or a first-generation American immigrant of Indian origin, is now probably the biggest and most powerful person in the I.T. industry today, Microsoft placed a safe bet in naming Nadella as its CEO. I feel that a better choice would have been Sundar Pichai (VP @ Google) or even Robin Li (the founder and CEO of Baidu). Of course, from an experience point of view Nadella needs absolutely no introduction; his resume speaks for itself. He created and successfully handled a multitude of Microsoft's unique silos, or divisions. But infusing new ideas, decoupled from the earlier thought process that Microsoft has always had, would have paved the way for a new era of Microsoft revolutionizing the tech industry. This is a factor that will weigh on Microsoft's cogs, because clearly Nadella has been part of the very thought process that has been detrimental to Microsoft and probably will not deviate from it (old habits die hard).

Another big letdown was the fact that my idol Bill Gates is back in a technical role. It just shows the level of confidence the old guard at Microsoft has in the new CEO. Of course Microsoft has pushed it across as a highlight of Nadella becoming CEO, since the legendary Bill Gates, with all his experience of building an empire, is back at the tech table. But to me this is more of a smoke screen than anything else. It makes Nadella look a bit weak in retrospect, when it should have been considered a tribute to the old era. Bill Gates and Steve Ballmer could have walked out with their heads held high, with a great sense of achievement at building a global entity from scratch and handing over the reins to Nadella, but that is not the case, as Mr. Gates is back in a full-time capacity and Ballmer will definitely be present for an interim period until Nadella settles down.

This is the reason Microsoft should have gone with a fresh CEO, a person who could enable Microsoft's key strengths as well as infuse new ideas. I was rooting for Sundar since his name was dropped into the hat. Who knows, this might have been the handshake that the two biggest giants in the industry, Google and Microsoft, required. Another person I thought of when Microsoft was searching for a new CEO was Mark Zuckerberg. Sure, he has Facebook to handle, but just imagine if he were named the CEO of Microsoft and Facebook no longer remained a separate entity but became a part of Microsoft. That truly would have been an earth-shattering event: two giants coming together as one integrated unit which would continue to create magic (a dream that never became reality). These are my thoughts and I felt like expressing them, but you never know, the Nadella-Gates combination might just hit a "six" (as they call it in cricket).

Tuesday, January 21, 2014

SSIS SFTP handling using WinSCP

Just a follow on to the earlier post.....
// Requires the WinSCP .NET assembly to be referenced in the script task (see the earlier FTPS post for setup notes).
try
{
    // Build the full local file path from the package variables
    string file = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
    string username = Dts.Variables["strFTPUserName"].Value.ToString();
    string password = Dts.Variables["strFTPPassword"].Value.ToString();

    // Setup session options for SFTP
    SessionOptions sessionOptions = new SessionOptions
    {
        HostName = Dts.Variables["strFTPHost"].Value.ToString(),
        UserName = username,
        Password = password,
        Protocol = Protocol.Sftp,
        PortNumber = int.Parse(Dts.Variables["strFTPPort"].Value.ToString()),
        FtpMode = FtpMode.Active,
        FtpSecure = FtpSecure.None,
        SshHostKeyFingerprint = Dts.Variables["strFTPFingerprint"].Value.ToString()
    };

    using (Session session = new Session())
    {
        session.ExecutablePath = Dts.Variables["strWinSCPExecutable"].Value.ToString();

        // Connect
        session.Open(sessionOptions);

        // Upload the file in binary mode to the remote path
        TransferOptions transferOptions = new TransferOptions();
        transferOptions.TransferMode = TransferMode.Binary;
        TransferOperationResult transferResult;
        transferResult = session.PutFiles(file, Dts.Variables["strRemotePath"].Value.ToString());

        // Throw on any error
        transferResult.Check();
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}
catch (Exception e)
{
    Dts.Variables["strExceptions"].Value = e.Message + ":" + e.StackTrace;
    Dts.TaskResult = (int)ScriptResults.Failure;
}

Friday, January 17, 2014

SSIS FTPS File handling

The following script task showcases both FTP and FTPS based connections for uploading a file (it can be modified to perform more operations). My earlier approach was to leverage FtpWebRequest, but the "AUTH SSL" command was taking a huge amount of time, so I decided to fall back on WinSCP instead. Also ensure that WinSCP.exe is added to the PATH environment variable and that WinSCP.dll is placed in the GAC.
public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
{
    public void Main()
    {
        /* Traditional FTP (kept for reference)
        ConnectionManager mgr = Dts.Connections.Add("FTP");
        try
        {
            mgr.Properties["ServerName"].SetValue(mgr, "ftp://" + Dts.Variables["strFTPHost"].Value.ToString());
            mgr.Properties["ServerUserName"].SetValue(mgr, Dts.Variables["strFTPUserName"].Value.ToString());
            mgr.Properties["ServerPassword"].SetValue(mgr, Dts.Variables["strFTPPassword"].Value.ToString());
            mgr.Properties["ServerPort"].SetValue(mgr, Dts.Variables["strFTPPort"].Value.ToString());
            mgr.Properties["Timeout"].SetValue(mgr, Dts.Variables["strFTPTimeout"].Value.ToString());
            mgr.Properties["UsePassiveMode"].SetValue(mgr, true);
            mgr.ConnectionString = Dts.Variables["strFTPHost"].Value.ToString() + ":" + Dts.Variables["strFTPPort"].Value.ToString();
            FtpClientConnection ftpConn = new FtpClientConnection(mgr.AcquireConnection(null));
            ftpConn.Connect();
            string[] fileLocation = new string[1];
            fileLocation[0] = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
            ftpConn.SendFiles(fileLocation, Dts.Variables["strRemotePath"].Value.ToString(), true, false);
            ftpConn.Close();
        }
        catch (Exception exp)
        {
            Dts.Variables["strExceptions"].Value = exp.Message + ":" + exp.StackTrace;
        }
        */

        // FTPS using WinSCP
        try
        {
            // Build the full local file path from the package variables
            string file = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
            string username = Dts.Variables["strFTPUserName"].Value.ToString();
            string password = Dts.Variables["strFTPPassword"].Value.ToString();

            // Setup session options for implicit FTPS
            SessionOptions sessionOptions = new SessionOptions
            {
                HostName = Dts.Variables["strFTPHost"].Value.ToString(),
                UserName = username,
                Password = password,
                Protocol = Protocol.Ftp,
                PortNumber = int.Parse(Dts.Variables["strFTPPort"].Value.ToString()),
                FtpMode = FtpMode.Passive,
                FtpSecure = FtpSecure.Implicit,
                TlsHostCertificateFingerprint = Dts.Variables["strFTPFingerprint"].Value.ToString()
            };

            using (Session session = new Session())
            {
                session.ExecutablePath = Dts.Variables["strWinSCPExecutable"].Value.ToString();

                // Connect
                session.Open(sessionOptions);

                // Upload the file in binary mode to the remote path
                TransferOptions transferOptions = new TransferOptions();
                transferOptions.TransferMode = TransferMode.Binary;
                TransferOperationResult transferResult;
                transferResult = session.PutFiles(file, Dts.Variables["strRemotePath"].Value.ToString());

                // Throw on any error
                transferResult.Check();
            }
        }
        catch (Exception e)
        {
            Dts.Variables["strExceptions"].Value = e.Message + ":" + e.StackTrace;
            Dts.TaskResult = (int)ScriptResults.Failure;
            return;
        }

        Dts.TaskResult = (int)ScriptResults.Success;
    }
}

The variables used in the above code are self-explanatory.... Remember to grab the certificate fingerprint using WinSCP beforehand, or check with the FTP system admin.