Tuesday, August 19, 2014

Blue Screen of Death --> Microsoft Power Map

I was fiddling around with Power Map and wanted to see how far I could push the mapping visualization.
I basically leveraged the Power Station data file available at:
http://office.microsoft.com/en-us/excel-help/redir/XT104048055.aspx?CTT=5&origin=HA104091224

After downloading this Excel file, I started to create a Power Map with the basic idea of having three layers:
1. The first layer would take into account all the counties, based on the power transmitted
2. The second layer would consist of all the companies, based on the power distribution
3. The third layer considers the power distribution based on Plant Name

Now this Excel file contains roughly 20,000 rows. My machine is a Windows 8 box with an Intel Core i7 and 16 GB of RAM. I was fine up to step 1 in the list above, but when I hit step 2 I noticed that memory usage bloated like crazy. Like a crazy person I decided to try step 3 anyway, and that's when my machine crashed. Unfortunately I could not take screenshots, and I would rather not reproduce the issue. But to give a general idea: even playing around with roughly 20,000 rows of data, using the above layering and with close to 8 GB of free memory on my box, killed my machine, with disk usage going to 100% as well. I decided to work on a smaller subset and show you what exactly I was trying to produce (this is for the state of California with a much smaller subset of counties):


 Step 1 is shown in the above diagram


Step 2 is as shown in the above diagram


The above figure is very close to what I wanted, except that I used a clustered column instead. I would definitely recommend going slow while creating a Power Map. It provides mind-blowing visualization, but it is dependent on the amount of memory available at any given point in time and can blow up if the resultant data set is too large. The above examples were created using about 500 rows of data; I was far more cautious this time, as I did not want to blow up my machine again. But at the end of the day, the map was beyond imagination in terms of visualizing the data points. Post a comment if this has happened to you as well!!

Monday, August 11, 2014

Redshift Experience

Big Data: the term given to solutions that can handle massive amounts of data, usually a petabyte or more. There are several big data solutions out there, and each has unique characteristics that make it useful in different scenarios. I was looking into components of Cloudera's Hadoop distribution like Impala, Sentry and HBase; all of these vary based on the use case. For some of my clients I have leveraged Amazon Redshift and Cassandra (and hopefully soon Apache Hadoop). The architectures of these systems differ, but the end goal is the storage and processing of vast amounts of data, with results generated in seconds or even milliseconds. Focusing on this aspect, I am going to give a more detailed insight into Redshift, which is a node-based, petabyte-scale data warehouse, along with a high-level overview of what I recently implemented.
Note: The above diagram is from the Amazon Redshift documentation (http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html)

If you pay close attention to the diagram above, the compute nodes are responsible for all the data processing and query transactions on the database, and the data nodes contain the actual node slices. The leader node is the service that monitors the client connections against the Redshift cluster and is also responsible for query processing (almost like a syntax checker on the query and the functions leveraged). It then hands the query across to the compute nodes, whose main responsibility is to find the data slices on the data nodes and communicate with one another to determine how the transaction needs to be executed. It behaves much like a job broker, except that it operates in real time.

It is similar to the analogy of a bucket. Consider this:

You take a bucket and keep filling it with water; eventually the bucket gets full. But what happens when there is an enormous amount of water that needs to be contained? Either grab a massive bucket or use multiple buckets to store the water (the second option is what the Redshift architecture depicts).

The concept of scaling out implies not only adding a bucket for storage but also a mechanism to ensure that the flow in the pipeline is routed to an empty bucket, which is nothing but our compute node. But there is the price of the bucket, and the cost of the mechanism required to populate the bucket, as well.

The costing of data nodes and compute nodes is published on the Amazon Redshift pricing page and varies by node type.
As the data volume keeps increasing, reserve a data node and then utilize it as required. While creating the Redshift cluster, ensure that the cluster type is Multi-Node and not Single-Node for large data projects.
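Cluster creation can also be scripted instead of clicking through the console. Below is a rough C# sketch using the AWS SDK for .NET; the cluster identifier, node type, credentials and database name are placeholders, and the exact request properties should be checked against the SDK version you have installed.

// Hedged sketch: create a multi-node Redshift cluster with the AWS SDK for .NET.
// Every value below is a placeholder -- verify property names against your SDK version.
using Amazon;
using Amazon.Redshift;
using Amazon.Redshift.Model;

class CreateRedshiftCluster
{
    static void Main()
    {
        var client = new AmazonRedshiftClient(RegionEndpoint.USEast1);

        var request = new CreateClusterRequest
        {
            ClusterIdentifier = "dw-cluster",       // placeholder name
            NodeType = "dw1.xlarge",                // pick based on storage vs. compute needs
            ClusterType = "multi-node",             // not "single-node" for large projects
            NumberOfNodes = 4,                      // compute nodes in the cluster
            MasterUsername = "admin",               // placeholder
            MasterUserPassword = "ChangeMe123!",    // placeholder
            DBName = "analytics"                    // placeholder
        };

        client.CreateCluster(request);
    }
}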
 
The tables are stored on the different nodes using a columnar partitioning algorithm, governed by the distribution key of the respective table. The database used within Redshift is Postgres-based (the version keeps getting updated with each release; this is handled by Amazon). On a quick note, whenever you are adding a node to the cluster the database will go down, so ensure that all your ETL or import/export jobs are stopped during that time frame.
 
Even though daily snapshots are taken of the Redshift cluster, a key note here: if you ever need to delete the cluster and recreate it from a previous snapshot, create the new cluster from the required snapshot first and only then delete the original Redshift cluster.
 
Even though Redshift has a lot of data integration tool partnerships, I would definitely recommend leveraging Talend ETL. I might give a breakdown of a Talend ETL job in an upcoming article.
For reporting, on the other hand, there are several analytics tools like Pentaho, MicroStrategy, Tableau, Birst, etc. I would recommend Tableau, Pentaho or MicroStrategy for analytical dashboarding needs, depending on cost considerations.
 
Let us get back to Redshift. The storage mechanism is similar to the Tabular (VertiPaq) engine model that Microsoft leverages, and the processing is done using an MPP (Massively Parallel Processing) engine. The encryption is twofold. The first fold is at the database level, where each data block in the cluster is assigned a randomly generated AES-256 key, and those block keys are in turn encrypted using the cluster's database key. The other fold is at the data level, commonly known as encoding (column compression). Here is a flat mapping of which encoding to use for which data type:
int, date, datetime, timestamp        --> delta
varchar or char up to 50 characters   --> runlength
varchar > 50 characters               --> text255
smallint, decimal                     --> mostly8
everything else                       --> raw
 
The cons of using Redshift are as follows:
  • You need an administrator to monitor the Redshift cluster at all times
  • The power of Redshift lies in the manner in which the database is queried, so a SQL developer/DBA with an understanding of the Redshift internals is definitely a must
  • The upfront cost is slightly on the higher side, but over time the cost is justified as more nodes are added to the cluster
 
Redshift supports both JDBC and ODBC drivers, so you can query the database from practically any client; the standard PostgreSQL drivers work as well.
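As an example, here is a rough C# sketch that connects over ODBC and creates a table whose distribution key and column encodings follow the notes above. The "redshift" DSN, credentials, and the table and column names are all assumptions for illustration; only the distkey/sortkey/encode syntax is standard Redshift DDL.

// Hedged sketch: connect to Redshift over ODBC (assumes a DSN named "redshift"
// pointing at the cluster) and create a table with an explicit distribution key
// and per-column encodings. Table and column names are illustrative only.
using System;
using System.Data.Odbc;

class RedshiftOdbcDemo
{
    static void Main()
    {
        string connectionString = "DSN=redshift;UID=admin;PWD=ChangeMe123!";

        using (OdbcConnection connection = new OdbcConnection(connectionString))
        {
            connection.Open();

            string ddl = @"
                create table fact_power_usage
                (
                    usage_id    bigint        encode delta,
                    usage_date  date          encode delta,
                    county      varchar(50)   encode runlength,
                    notes       varchar(255)  encode text255,
                    usage_qty   decimal(18,2) encode mostly8
                )
                distkey (county)
                sortkey (usage_date);";

            using (OdbcCommand create = new OdbcCommand(ddl, connection))
            {
                create.ExecuteNonQuery();
            }

            // Quick sanity check that the leader node accepts and plans queries.
            using (OdbcCommand check = new OdbcCommand("select count(1) from fact_power_usage;", connection))
            {
                Console.WriteLine(check.ExecuteScalar());
            }
        }
    }
}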
 
Like any other traditional RDBMS, if a constant stream of queries runs against the same table, locking of the table is inevitable, so care must be taken to ensure that the tables are not locked. Lock conflicts can be checked using the following statement:

                --> select * from stl_tr_conflict;

Also ensure that the statistics are rebuilt on a regular basis by running:

                --> analyze table_name;

for all the tables on which queries are performed on a periodic basis.
 
 
Also enable auditing on the cluster using the Amazon Redshift console.
A good practice is to ensure that the data warehouse schema conforms to a star schema to improve query performance.
This is it for now, but stay tuned for more updates in this regard!!!
 
 
 

Thursday, May 08, 2014

SSAS Cube issues ..... Incorrect Measure Values

Have you ever noticed that a measure value in SSAS does not correspond to the value in the data warehouse? This is a big hindrance, because one ends up wasting precious development time reconciling data between the cube and the warehouse. I am going to put together a quick checklist of things to look at if you come across this issue!!!

Problem Statement:
Let us consider a fact table called FactInternetSales with a measure column called internetsalesamount:
select sum(internetsalesamount) from FactInternetSales;

Let us say that this value is 25000.
Now let us run the equivalent query against the cube, which we will call Sales:
select [Measures].[internetsalesamount] on 0 from [Sales];
The value returned from this query is 510.
Why is this happening? Work through the following checklist to rectify the issue as quickly as possible (a small reconciliation sketch follows the checklist):
1. Go to the cube and, in the [internetsalesamount] properties, change the aggregation function to Count instead of Sum. Validate that the measure value matches the row count in the data warehouse, i.e. select count(1) from FactInternetSales.
2. Check the dimensions that slice the cube and ensure that all the dimension attribute values have a 1:1 or 1:many mapping with FactInternetSales. The majority of the time there is a dimension key referenced by the fact table that is missing from the dimension. If this is the case, figure out why that member is missing and add it back in.
3. Sometimes the data type used for the measure can also lead to such problems, so here is a quick cheat sheet:
int (in the data warehouse) --> the corresponding measure should be bigint
decimal --> double
4. Constant changes to the fact table: if the fact table changes constantly, it is best practice to change the storage mode to ROLAP instead of MOLAP (you end up with slower queries, but nothing beats real-time data, especially for the analysts).
5. Check the SQL query that SSAS generates to populate the measure group. This is inconclusive in the majority of scenarios, but it is still worth a shot.
6. Final step in the checklist: if all else fails, try, try and succeed........
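To make the warehouse-versus-cube comparison from the problem statement repeatable, it can be scripted. A rough sketch, assuming local default SQL Server and SSAS instances; the connection strings, database names and cube/measure names are placeholders that should be swapped for your own.

// Hedged sketch: query the same number from the warehouse (T-SQL) and the
// cube (MDX) so any mismatch is easy to reproduce on demand.
// Connection strings and object names below are placeholders.
using System;
using System.Data.SqlClient;
using Microsoft.AnalysisServices.AdomdClient;

class MeasureReconciliation
{
    static void Main()
    {
        // 1. Sum straight from the fact table in the warehouse.
        object warehouseTotal;
        using (SqlConnection sql = new SqlConnection("Data Source=.;Initial Catalog=DW;Integrated Security=SSPI"))
        using (SqlCommand cmd = new SqlCommand("select sum(internetsalesamount) from FactInternetSales;", sql))
        {
            sql.Open();
            warehouseTotal = cmd.ExecuteScalar();
        }

        // 2. The same measure from the cube via MDX.
        object cubeTotal;
        using (AdomdConnection ssas = new AdomdConnection("Data Source=.;Catalog=SalesOlap"))
        using (AdomdCommand mdx = new AdomdCommand("select [Measures].[internetsalesamount] on 0 from [Sales];", ssas))
        {
            ssas.Open();
            CellSet cells = mdx.ExecuteCellSet();
            cubeTotal = cells.Cells[0].Value;
        }

        Console.WriteLine("Warehouse: {0}   Cube: {1}", warehouseTotal, cubeTotal);
    }
}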

Friday, February 07, 2014

New CEO for Microsoft

It was recently announced that Satya Nadella (a Microsoft veteran) has become the CEO of Microsoft. Even though I have a great sense of pride that a person of Indian origin, a first-generation American immigrant, now holds probably the most powerful position in the I.T. industry today, Microsoft placed a safe bet in naming Nadella as its CEO. I feel that a better choice would have been Sundar Pichai (a VP at Google) or even Robin Li (the founder and CEO of Baidu). Of course, from an experience point of view Nadella needs absolutely no introduction; his resume speaks for itself. He created and successfully ran a multitude of Microsoft's silos, or divisions. But infusing new ideas, decoupled from the thought process that Microsoft has always had, would have paved the way for a new era of Microsoft revolutionizing the tech industry. This is a factor that will weigh on Microsoft's cogs, because Nadella has clearly been part of the very thought process that has been detrimental to Microsoft, and he probably will not deviate from it (old habits die hard).

Another big letdown was the fact that my idol Bill Gates is back in a technical role. It shows the level of confidence the old guard at Microsoft has in the new CEO. Of course Microsoft has pushed it across as a highlight of Nadella becoming the CEO, since the legendary Bill Gates, with all his experience of building an empire, is back at the tech table. But to me this is more of a smoke screen than anything else; in retrospect it makes Nadella look a bit weak in what should have been a tribute to the old era. Bill Gates and Steve Ballmer could have walked away with their heads held high, with a great sense of achievement at building a global entity from scratch and handing over the reins to Nadella, but that is not the case, as Mr. Gates is back in a full-time capacity and Ballmer will definitely be present for an interim period until Nadella settles down.

This is the reason Microsoft should have gone with a fresh CEO, a person who could play to Microsoft's key strengths as well as infuse new ideas. I was rooting for Sundar from the moment his name was dropped into the hat; who knows, this might have been the handshake that the two biggest giants in the industry, Google and Microsoft, required. Another person who came to mind when Microsoft was searching for a new CEO was Mark Zuckerberg. Sure, he has Facebook to handle, but just imagine if he had been named CEO of Microsoft and Facebook no longer remained a separate entity but became a part of Microsoft. That truly would have been an earth-shattering event: two giants coming together as one integrated unit that would continue to create magic (a dream that never became reality). These are my thoughts and I felt like expressing them, but you never know, maybe the Nadella-Gates combination might hit a "six" (as they call it in cricket).

Tuesday, January 21, 2014

SSIS SFTP handling using WinSCP

Just a follow-on to the earlier post: the same WinSCP approach, this time over SFTP.
try
{
    string file = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
    string username = Dts.Variables["strFTPUserName"].Value.ToString();
    string password = Dts.Variables["strFTPPassword"].Value.ToString();

    // Set up the SFTP session options
    SessionOptions sessionOptions = new SessionOptions
    {
        HostName = Dts.Variables["strFTPHost"].Value.ToString(),
        UserName = username,
        Password = password,
        Protocol = Protocol.Sftp,
        PortNumber = int.Parse(Dts.Variables["strFTPPort"].Value.ToString()),
        FtpMode = FtpMode.Active,
        FtpSecure = FtpSecure.None,
        SshHostKeyFingerprint = Dts.Variables["strFTPFingerprint"].Value.ToString()
    };

    using (Session session = new Session())
    {
        session.ExecutablePath = Dts.Variables["strWinSCPExecutable"].Value.ToString();

        // Connect
        session.Open(sessionOptions);

        // Upload the file in binary mode
        TransferOptions transferOptions = new TransferOptions();
        transferOptions.TransferMode = TransferMode.Binary;

        TransferOperationResult transferResult;
        transferResult = session.PutFiles(file, Dts.Variables["strRemotePath"].Value.ToString());

        // Throw on any error
        transferResult.Check();
    }
}
catch (Exception e)
{
    Dts.Variables["strExceptions"].Value = e.Message + ":" + e.StackTrace;
    Dts.TaskResult = (int)ScriptResults.Failure;
}

Friday, January 17, 2014

SSIS FTPS File handling

The following script task showcases both FTP- and FTPS-based connections for uploading a file (it can be modified to perform other operations). My earlier approach was to leverage FtpWebRequest, but the "AUTH SSL" command was taking a huge amount of time, so I decided to fall back on WinSCP instead. Also ensure that WinSCP.exe is added to the PATH environment variable and that the WinSCP .NET assembly (WinSCP.dll) is placed in the GAC.
public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
{
    public void Main()
    {
        /* Traditional FTP (kept for reference):
        ConnectionManager mgr = Dts.Connections.Add("FTP");
        try
        {
            mgr.Properties["ServerName"].SetValue(mgr, "ftp://" + Dts.Variables["strFTPHost"].Value.ToString());
            mgr.Properties["ServerUserName"].SetValue(mgr, Dts.Variables["strFTPUserName"].Value.ToString());
            mgr.Properties["ServerPassword"].SetValue(mgr, Dts.Variables["strFTPPassword"].Value.ToString());
            mgr.Properties["ServerPort"].SetValue(mgr, Dts.Variables["strFTPPort"].Value.ToString());
            mgr.Properties["Timeout"].SetValue(mgr, Dts.Variables["strFTPTimeout"].Value.ToString());
            mgr.Properties["UsePassiveMode"].SetValue(mgr, true);
            mgr.ConnectionString = Dts.Variables["strFTPHost"].Value.ToString() + ":" + Dts.Variables["strFTPPort"].Value.ToString();
            FtpClientConnection ftpConn = new FtpClientConnection(mgr.AcquireConnection(null));
            ftpConn.Connect();
            string[] fileLocation = new string[1];
            fileLocation[0] = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
            ftpConn.SendFiles(fileLocation, Dts.Variables["strRemotePath"].Value.ToString(), true, false);
            ftpConn.Close();
        }
        catch (Exception exp)
        {
            Dts.Variables["strExceptions"].Value = exp.Message + ":" + exp.StackTrace;
        }
        */

        /* FTPS using WinSCP */
        try
        {
            string file = Dts.Variables["strLocalDirectory"].Value.ToString() + "\\" + Dts.Variables["strLocalFile"].Value.ToString();
            string username = Dts.Variables["strFTPUserName"].Value.ToString();
            string password = Dts.Variables["strFTPPassword"].Value.ToString();

            // Set up the FTPS session options
            SessionOptions sessionOptions = new SessionOptions
            {
                HostName = Dts.Variables["strFTPHost"].Value.ToString(),
                UserName = username,
                Password = password,
                Protocol = Protocol.Ftp,
                PortNumber = int.Parse(Dts.Variables["strFTPPort"].Value.ToString()),
                FtpMode = FtpMode.Passive,
                FtpSecure = FtpSecure.Implicit,
                TlsHostCertificateFingerprint = Dts.Variables["strFTPFingerprint"].Value.ToString()
            };

            using (Session session = new Session())
            {
                session.ExecutablePath = Dts.Variables["strWinSCPExecutable"].Value.ToString();

                // Connect
                session.Open(sessionOptions);

                // Upload the file in binary mode
                TransferOptions transferOptions = new TransferOptions();
                transferOptions.TransferMode = TransferMode.Binary;

                TransferOperationResult transferResult;
                transferResult = session.PutFiles(file, Dts.Variables["strRemotePath"].Value.ToString());

                // Throw on any error
                transferResult.Check();
            }

            Dts.TaskResult = (int)ScriptResults.Success;
        }
        catch (Exception e)
        {
            Dts.Variables["strExceptions"].Value = e.Message + ":" + e.StackTrace;
            Dts.TaskResult = (int)ScriptResults.Failure;
        }
    }
}

The variables used in the above code are self-explanatory. Remember to grab the certificate fingerprint using WinSCP first, or check with the FTP system admin.

Thursday, September 19, 2013

Rhino - ETL

Using Rhino-ETL (a C#-based framework) for developing standard ETL jobs is pretty easy, and one can do a lot of fun stuff with the underlying source data. I just wrote up a quick console app to generate data into a text file and push the same data into a SQL Server table as well as into an external file. Here are the steps (for the external file push):
1. Create a new C# console application solution in Visual Studio.
2. Target the .NET Framework as shown in the screenshot below in your project properties:

3. Create three sub-folders underneath your project as shown in the following screenshot:
DataObjects --> contains the class files associated with each table/file in your environment. Example: if your source file contains student data, you would create a class file called Student with the individual properties exposed (the nouns).
Operations --> contains the class files for the activities (the verbs) that need to be performed on the DataObjects. Example: writing the student data to a database, reading the student data from a file, etc.
WorkFolder --> contains the external file sources to interact with, e.g. a flat file, a CSV or a TSV. In this case it will be student.txt.


Let's write some code to move a student record from one flat file into another (as simple as it sounds).

4. Create a class file called StudentRecord.cs for the pipe-delimited input and declare the required entity attributes. student.txt contains records in the following format:
StudentId|StudentName|StudentAddress|StudentClassId|StudentMarks //header
1|Ishwar|TestAddress|1|85 //row
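A minimal sketch of what StudentRecord.cs could look like, assuming the FileHelpers-style attributes that Rhino-ETL's FluentFile API relies on (the RhinoEtlDemo namespace is a placeholder):

// Hedged sketch of DataObjects/StudentRecord.cs.
// FluentFile wraps FileHelpers, hence the DelimitedRecord attribute;
// fields must be declared in the same order as the columns in the file.
using FileHelpers;

namespace RhinoEtlDemo.DataObjects
{
    [DelimitedRecord("|")]
    [IgnoreFirst(1)] // skip the header row
    public class StudentRecord
    {
        public int StudentId;
        public string StudentName;
        public string StudentAddress;
        public int StudentClassId;
        public int StudentMarks;
    }
}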

5. Create a class file called NewStudentRecord.cs, which contains the attributes that need to be transferred to the new file. The output will be tab-separated, in the following format:
StudentId\tStudentName\tStudentAddress\tStudentClass
1\tIshwar\tTestAddress\t1
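A matching sketch for NewStudentRecord.cs, again assuming FileHelpers-style attributes, this time tab-delimited:

// Hedged sketch of DataObjects/NewStudentRecord.cs (tab-separated output record).
using FileHelpers;

namespace RhinoEtlDemo.DataObjects
{
    [DelimitedRecord("\t")]
    public class NewStudentRecord
    {
        public int StudentId;
        public string StudentName;
        public string StudentAddress;
        public int StudentClass;
    }
}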

Let us now create the action called student write, i.e. the operation that writes the records out to an output file called studentoutput.txt. I am creating a new C# class file called StudentWriteFile, along the lines of the following snippet.
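Here is a rough sketch of Operations/StudentWriteFile.cs, assuming the usual Rhino-ETL AbstractOperation / FluentFile pattern; the constructor parameter and namespaces are my own placeholders:

// Hedged sketch of Operations/StudentWriteFile.cs: writes each incoming row
// out as a NewStudentRecord using Rhino-ETL's FluentFile wrapper.
using System.Collections.Generic;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Files;
using Rhino.Etl.Core.Operations;
using RhinoEtlDemo.DataObjects;

namespace RhinoEtlDemo.Operations
{
    public class StudentWriteFile : AbstractOperation
    {
        private readonly string outputPath;

        public StudentWriteFile(string outputPath)
        {
            this.outputPath = outputPath;
        }

        public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
        {
            using (FileEngine file = FluentFile.For<NewStudentRecord>().To(outputPath))
            {
                foreach (Row row in rows)
                {
                    file.Write(row.ToObject(typeof(NewStudentRecord)));
                    yield return row; // pass the row along in case another operation follows
                }
            }
        }
    }
}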

Now let us go about writing the main program, which wires the read and write operations together. The settings values point to the settings file that I created, which contains the absolute paths of the student.txt and studentoutput.txt files.
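A hedged sketch of the process class follows. ReadStudentFile is a hypothetical read operation, and the appSettings keys ("StudentFile", "StudentOutputFile") are made-up names standing in for the settings described above:

// Hedged sketch of the main ETL process. ReadStudentFile is a hypothetical
// operation streaming StudentRecord rows out of student.txt; the appSettings
// keys below are assumptions standing in for the settings files mentioned above.
using System.Collections.Generic;
using System.Configuration;
using Rhino.Etl.Core;
using Rhino.Etl.Core.Files;
using Rhino.Etl.Core.Operations;
using RhinoEtlDemo.DataObjects;
using RhinoEtlDemo.Operations;

namespace RhinoEtlDemo
{
    public class ReadStudentFile : AbstractOperation
    {
        private readonly string inputPath;

        public ReadStudentFile(string inputPath)
        {
            this.inputPath = inputPath;
        }

        public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
        {
            using (FileEngine file = FluentFile.For<StudentRecord>().From(inputPath))
            {
                foreach (object record in file)
                {
                    yield return Row.FromObject(record);
                }
            }
        }
    }

    public class MainProgram : EtlProcess
    {
        protected override void Initialize()
        {
            Register(new ReadStudentFile(ConfigurationManager.AppSettings["StudentFile"]));
            Register(new StudentWriteFile(ConfigurationManager.AppSettings["StudentOutputFile"]));
        }
    }
}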

After which, in your Main method, just kick off the process in the following manner:
 new MainProgram().Execute();

and you will have your first Rhino-ETL job to rock and roll with.