Wednesday, June 15, 2016

Microsoft acquires LinkedIn

On Monday, June 13, 2016, Microsoft announced its acquisition of LinkedIn. This is a major game changer in the world of IT. But before we get to some of the advantages of this acquisition, it is worth noting that Microsoft had actually been working on a LinkedIn competitor on its Dynamics CRM platform. The idea was to expand the footprint of its CRM solution as well as create something unique with it. That effort started in early 2012, well before the actual acquisition of LinkedIn. Here are my thoughts on where this acquisition will lead Microsoft and LinkedIn:


  • Microsoft gains a huge database of professionals and organizations across various streams: This alone is the most massive gain for Microsoft. It can start targeting professionals and organizations to move onto or join the Microsoft platform, which could bolster its sales by a huge margin.
  • Microsoft integration of LinkedIn ads with Bing: Just imagine an organization trying to establish a marketing campaign. With LinkedIn ads and Bing ads integrated, an organization will have more opportunities to get page views or clicks, and more potential to turn CTRs into conversions. This could come with an increase in campaign cost, but it might still be a really profitable decision.
  • Changes in technology trends at Microsoft will accelerate: Since LinkedIn is built on a foundation of cloud computing and Big Data, it has made significant strides in architecture and open-source technology (Azkaban, Voldemort, increased usage of Kafka and RabbitMQ, etc.). If these can be converted into products in the Azure suite, Microsoft might make a sizable profit on them. This will also accelerate the trend of embracing open source over closed technology stacks.
  • Added stream of revenue: In an interview with Satya Nadella and Jeff Weiner, Satya mentioned the integration with O365 and Azure as the major driver, and Jeff mentioned that it made sense for LinkedIn to sell at this point. But I think the real driver for this deal was the significant monetary gain Microsoft stands to make from ads and from revenue generated by LinkedIn subscribers. LinkedIn had major competition coming its way in the form of more local or regional corporate social websites, but this deal gives it a major edge over its competitors due to Microsoft's global reach. Azure and O365 could potentially serve as outlets for Microsoft to create LinkedIn-based apps similar to Yammer.
  • Allow organizations to create internal social platforms using the technology gained from LinkedIn: Microsoft could potentially make a configurable LinkedIn app for all its devices, including the Xbox, that organizations can tap into to create internal social networking platforms.

It would be fun to predict exactly where this acquisition will lead Microsoft. Probably a story to tell another day.

Tuesday, May 24, 2016

R - Notes

The following are my notes from studying R, meant as a reference point for myself.
Just a few pointers to anyone preparing for or studying R:
  • Take a quick look at your statistical math basics before proceeding
  • Before applying any formula to your base data, try to understand what the formula is and how it was derived (this will make it easier to understand)
  • Use it in tandem with the Data Analysis ToolPak in Excel
  • Refer to the cheat sheets available on https://www.rstudio.com/resources/cheatsheets/
  • Segregate the workbench for each module
  • There are best practices that can be incorporated while programming in R
  • Try and jot notes when and where one can... 
  • Refer to existing data-sets embedded in R before jumping into a data.gov file
  • Refer to R programs written already in Azure ML

rnorm() by default has mean 0 and standard deviation 1
head() output is rounded to the current print precision
*default settings in R can be modified by the options() function
example:
options(digits = 15)
#will display 15 digits (max digits for this option --> 22, min --> 0)
#options(digits = 30) fails with: Error in options(digits = 30) :
#  invalid 'digits' parameter, allowed 0...22

#Infinity Operations
Inf/0 --> Inf
Inf * 0 --> NaN
Inf + 0 + (0/0) --> NaN
Inf + 0  --> Inf

*ls() lists all the variables stored in R memory at a given point in time
*rm() removes objects from that list

*To pull up the documentation for a command in R, use ? followed by the function that needs to be leveraged:
?c()
?runif()
?max()

*Functions and Datastructures
sin()
integrate()
plot()
paste()

*Again, there are single-valued functions and multi-valued functions

*A special type of vector is called a factor
gl() --> generate factor levels (e.g. gl(2, 3) gives 2 levels repeated 3 times each)

*creating a function in R
test <- function(x)
{
  x = 5   # the argument is overridden in this toy example
  return(x * x + (x^2))
}

*for loop in R:
for (i in 1:5) {
  print(i^2)
}

*lapply() vs sapply() --> lapply() always returns a list; sapply() simplifies the result to a vector or matrix where possible

*Binding elements
rbind() --> bind elements in a matrix in a row manner
cbind() --> bind elements in a matrix in a columnar manner
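As a minimal sketch of the difference between the two, with made-up vectors:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

rbind(a, b)   # 2 x 3 matrix: each vector becomes a row
cbind(a, b)   # 3 x 2 matrix: each vector becomes a column
```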

*Every vector/matrix has a data mode....
logical
numeric

*Can be found using mode()

*dimensions in matrices
dim() --> defines the number of rows and columns in a matrix

*can be used with dimnames(), rownames(), colnames()

*Navigating through R package libraries is really bad...

*HMISC --> Harrell misc... Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, and recoding variables.

*By default, R reads and writes files relative to the working directory

getwd() --> get working directory
setwd()

*to read a file in table format:
testfile <- read.table("filename", header = TRUE)
read.csv
read.csv2
read.fwf (fixed width file)

*readLines() --> reads the lines of a file into a character vector
scan() --> reads the contents of a file into a list or vector

*file() connections can create connections to files for read/write purposes
write.table(line, file = "myfile", append = TRUE)
f1 <- file("myfile", open = "w")
close(f1)--> close the file connection

write.table(dataFieldName,filename)
write.csv
write.csv2
base::sink --> send R output to a file
dump() --> write R objects out as R code
dput() --> save complicated R objects (in ASCII format)
dget() --> inverse of dput()
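A quick sketch of the dput()/dget() round trip (the file name here is arbitrary):

```r
x <- list(a = 1:3, b = "hi")
dput(x, file = "x.R")    # writes the object out as ASCII R code
y <- dget("x.R")         # reads it back in
identical(x, y)          # TRUE
```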

*file in conjunction with open="w" option
R has its own internal binary object
use save() & load() for binary format

*RODBC Package
Common Functions
odbcDriverConnect(Connection)
sqlQuery()
sqlTable()
sqlFetch()
sqlColumns()
close(Connection)

*Specify the driver version (TDS_Version=8.0) and which port to use (default: 1433).
Ex:
sqlFetch(conn,"Tablename")
query <- "SELECT t1, t2 FROM test"
sqlQuery(conn, query)
sqlColumns(conn,"Tablename")
sqlColumns(conn,"Tablename")[, c("COLUMN_NAME", "TYPE_NAME")]
check dimensions of a table using dim()

*summary() -> gives a range of stats on the underlying vector,list,matrix


Which function should you use to display the structure of an R object?
str()

log(dataframe) to investigate the data on a log scale


Calculate Groups
tapply()
aggregate()
by()
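As a quick sketch of group-wise calculation, using the built-in iris data set; all three compute the mean petal length per species:

```r
tapply(iris$Petal.Length, iris$Species, mean)
aggregate(Petal.Length ~ Species, data = iris, FUN = mean)
by(iris$Petal.Length, iris$Species, mean)
```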


attach()
detach()


Convert counts to proportions using prop.table()
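For example, turning a table of counts into proportions with the built-in mtcars data:

```r
counts <- table(mtcars$cyl)   # number of cars per cylinder count
prop.table(counts)            # same table, expressed as proportions summing to 1
```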

Simulations in R
MCMC (Markov Chain Monte Carlo)
Encryption
Performance Testing
Drawback --> Uncertainty

Pseudo Random Number Generator - The Mersenne Twister
Mersenne Prime

set.seed(number)
rnorm(3)


Uniform distribution - runif(5,min=1,max=2)
Normal distribution - rnorm(5,mean=2,sd=1)
Gamma distribution - rgamma(5,shape=2,rate=1)
Binomial distribution -rbinom(5,size=100,prob=.3)

Multinomial Distribution - rmultinom(5,size=100,prob=c(.2,.4,.7))

Regression:
eruption.lm = lm(eruptions ~ waiting, data=faithful)
coeffs = coefficients(eruption.lm)
coeffs
coeffs[1]
coeffs[2]
waiting = 80           # the waiting time 
duration = coeffs[1] + coeffs[2]*waiting 
duration --> Predicted value
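The manual calculation above can also be cross-checked with predict(), which returns the same fitted value:

```r
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
predict(eruption.lm, newdata = data.frame(waiting = 80))
```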

Load ggplot2 using library(ggplot2)

Compare models using ANOVA
X1 <- lm(y ~ x1 + x2 + x3 + x4, data = mydata)
Y1 <- lm(y ~ x1 + x2, data = mydata)
anova(X1, Y1)




Saturday, February 13, 2016

Hadoop Installation on Win 10 OS

Setting the Hadoop files prior to Spark installation on Win 10:
1. Ensure that your JAVA_HOME is properly set. A recommended approach is to navigate to the installed Java folder in Program Files and copy its contents into a new folder
you can locate easily, e.g. C:\Projects\Java.
2. Create a user variable called JAVA_HOME and enter "C:\Projects\Java"
3. Add to the path system variable the following entry: "C:\Projects\Java\Bin;"
4. Create a HADOOP_HOME variable and specify the root path that contains all the Hadoop files, e.g. "C:\Projects\Hadoop"
5. Add to the path variable the bin location for your Hadoop repository: "C:\Projects\Hadoop\bin" <Keep track of your Hadoop installs like C:\Projects\Hadoop\2_5_0\bin>
6. Once these variables are set, open command prompt as an administrator and run the following commands to ensure that everything is set correctly:
A] java
B] javac
C] hadoop
D] hadoop version
7. Also ensure your winutils.exe is in the Hadoop bin location.
< Download the same from - https://www.barik.net/archive/2015/01/19/172716/>
8. An error related to the configuration location might also occur. Add the following to the hadoop-env.cmd file to rectify the issue:
set HADOOP_IDENT_STRING=%USERNAME%
set HADOOP_PREFIX=C:\Projects\Hadoop
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin

9. Another issue I faced with the Hadoop 2.6.0 install was with hadoop.dll. I had to recompile the source using MS Visual Studio to generate the hadoop.dll and pdb files, and replaced the hadoop.dll which came along with the install.
10. Another error that I faced was "The system cannot find the batch label specified - nodemanager". Replace all "\n" line endings in the yarn.cmd file with "\r\n".
11. Similarly, replace the "\n" line endings in the hadoop.cmd file with "\r\n".

12. Yarn-site.xml change is as shown in the screenshot below:

13. Make changes to the core-site.xml as shown in the screenshot below:


14. Make the configuration changes as per the answer here :
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows/23959201#23959201
15. Download Eclipse Helios for your Win OS to generate the jars required for your MapReduce applications. Use JDK 1.7.0_71 and not the 1.8+ versions to compile your Hadoop MapReduce programs.
16. Kickstart your Hadoop dfs and yarn daemons, add data from any of your data sources, and get ready to map-reduce the heck out of it... < A quick note: after formatting your namenode, it defaults to a tmp folder along with your machine name... in my case it is C:\tmp\hadoop-myPC\dfs\data>

Monday, December 21, 2015

Tableau Dashboards Published...

A few Tableau dashboards I have published of late, to give a flavor of the different visualizations possible within Tableau:


Sunday, November 22, 2015

Cyclotron's Android App

Just created a Xamarin Android mobile application. Extremely easy to use, and it did not require much reading of the resources to understand how to go about building it. The first iteration is as shown in the figure below:
Though the emulator (Nexus 5 through 7) did not render it as clearly as I wanted, it is still a great start for v1.0. The next version will integrate with Google Maps. As soon as one opens the app, you get a splash screen and then navigate to Cyclotron's main menu, from where you can navigate to the layouts. Support will also be part of the next iteration of the app, along with the login. A few more images were added to the individual activities.

Probably leverage this article as the initial help on how to use the app... The follow-up items are as follows:
1. Integration with Google Maps
2. Synchronization with Cyclotron's support database
3. Login for Support
4. Tweak the UI
5. Replicate for iOS


Tuesday, October 20, 2015

Excel Regular Expression Parsing

Sample piece of code to parse Excel files with regular expressions using ExcelQuery (the first piece does not loop through all the rows, which makes it more efficient: it evaluates only the first row of each result set. The second piece goes through all the columns and finds a match):

//This piece uses ExcelQuery -->

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;
using System.Web.UI;
using System.Text.RegularExpressions;
using System.Web.UI.WebControls;
using Scanning;
using System.Data;
using Innovative.Data;

//Install ExcelQuery
namespace WebApplication1
{
    public partial class WebForm1 : System.Web.UI.Page
    {
         protected void Page_Load(object sender, EventArgs e)
        {

            ExcelQuery excelQuery = new ExcelQuery("C:\\Test\\Test.xlsx");
            excelQuery.HeaderHasFieldNames = true;
            excelQuery.Provider = ExcelQuery.ConnectionStringItems.Providers.Jet12;
            excelQuery.ExcelVersion = ExcelQuery.ConnectionStringItems.ExcelVersions.Excel2007to2013;
            excelQuery.ConnectionProperties = "IMEX=0";
            bool test = searchResults(excelQuery);
            Response.Write(test.ToString());
                  
        }


        protected bool searchResults(ExcelQuery  excelQuery)
        {
            var x = excelQuery.GetSheets();
            foreach (string y in x)
            {
                string sql = "SELECT * FROM [" + y + "]";
                Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
                DataSet data = excelQuery.ExecuteDataSet(sql, "Table");

                // Only the first row of each table is inspected, by design
                foreach (DataTable dt in data.Tables)
                {
                    var rowAsString = string.Join(", ", dt.Rows[0].ItemArray);

                    Match match = numberRegex.Match(rowAsString);
                    if (match.Success)
                    {
                        return true;
                    }
                }

           }
           
            return false;
         }

    }
}   


//This piece uses ClosedXML and loops through columns


            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook("C:\\Test.xlsx");
            //List results;

            foreach (IXLWorksheet worksheet in wbk.Worksheets)
            {
                //var test  = worksheet.ToString();
                var companyRange = worksheet.RangeUsed();
                var companyTable = companyRange.AsTable();

                if (companyTable != null)
                {
                 var results = companyTable.DataRange.Rows()
                       .Where(companyRow => companyRow.Cells()   // check every column in the row
                           .Any(cell => numberRegex.IsMatch(cell.GetString())))
                       .ToList();

                        int cnt = results.Count;

                        if (cnt > 0)
                        {
                            Response.Write("Testing");
                        }

                    }
                }

//This Piece just loops through Cells Used

byte[] array = File.ReadAllBytes("C:\\Projects\\Sheet1test.xlsx");
            Stream stream = new MemoryStream(array);
            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook(stream);
            foreach (IXLWorksheet worksheet in wbk.Worksheets)
            {
      
                    IXLCells cells = worksheet.Columns().CellsUsed();
                    var results = cells.Where(cellval => numberRegex.IsMatch(cellval.GetString()) == true).ToList();

                    int cnt = results.Count;

                    if (cnt > 0)
                    {
                        Response.Write("Testing");
                        return;
                    }
           }