Tuesday, May 24, 2016

R - Notes

The following are basically my notes while studying R and is meant as a reference point for myself
Just a few pointers to anyone preparing for R or studying R:
  • Take a quick look at your statistical math basics before proceeding
  • Before applying any formula on your base data, try to understand what the formula is and how it was derived (this will make it easier for one to understand)
  • Use it in tangent with the Data Analysis in Excel
  • Refer to the cheat sheets available on https://www.rstudio.com/resources/cheatsheets/
  • Segregate the workbench for each module
  • There are best practices that can be incorporated while programming in R
  • Try and jot notes when and where one can... 
  • Refer to existing data-sets embedded in R before jumping into a data.gov file
  • Refer to R programs written already in Azure ML

rnorm() by default has mean 0 and variance 1
head() has its own built in precision
*default settings in R can be modified by the options() function
options(digits = 15)
#will display 15 digits (Max digit for option display --> 22 and min digit --> 0): Error if > 22 --> Error in options(digits = 30) :
#invalid 'digits' parameter, allowed 0...22

#Infinity Operations
Inf/0 --> Inf
Inf * 0 --> Inf
Inf + 0 + (0/0) --> NaN
Inf + 0  --> Inf

*The ls() lists all the variable stored in R memory at a given point in time
*rm() will remove contents from the list

*To figure out the commands in R use the following command ? followed by the function that needs to be leveraged:

*Functions and Datastructures

*Again single valued functions and multi valued functions

*A special vector is called a factor
gl() --> generate levels

*creating a function in R
test<-function p="" x="">
return (x*x+(x^2))

*for loop in R

l*apply() vs sapply()

*Binding elements
rbind() --> bind elements in a matrix in a row manner
cbind() --> bind elements in a matrix in a columnar manner

*Every vector/matrix has a data mode....

*Can be found using mode()

*dimensions in matrices
=defines the number of rows and columns in a matrix

*can be used with dimnames(),rownames(),columnnames()

*Navigating through R package libraries really bad....

*HMISC --> Harrell misc... Contains many functions useful for data analysis, high-level graphics, utility operations, functions for computing sample size and power, importing and annotating datasets, imputing missing values, advanced table making, variable clustering, character string manipulation, conversion of R objects to LaTeX code, and recoding variables.

*R search path is the R working directory

getwd() --> get working directory

*to read in a table format:
testfile <- filename="" p="" read.table="">
read.fwf (fixed width file)

scan()--> reads a content of a file into a list or vector

f*ile() connections can create connections to files for read/write purposes
f1<-file p="">
close(f1)--> close the file connection

base::sink                Send R Output to a File
dput() --> save complicated R objects (in ASCII format)
dget() --> inverse of dput()

*file in conjunction with open="w" option
R has its own internal binary object
use save() & load() for binary format

*RODBC Package
Common Functions

*specify the version of the driver TDS_Version=8.0 and which port to use default:1433.
query<- from="" p="" selet="" t1="" t2="" test="">
check dimensions of a table using dim()

*summary() -> gives a range of stats on the underlying vector,list,matrix

Which function should you use to display the structure of an R object?

Log(dataframe) to investigate the data

Calculate Groups


Convert to frequency using prop.table()

Simulations in R
MCMC (Markov Chain Monte Carlo)
Performance Testing
Drawback --> Uncertainity

Pseudo Random Number Generator - The Mersenne Twister
Mersenne Prime


Uniform distribution - runif(5,min=1,max=2)
Normal distribution - rnorm(5,mean=2,sd=1)
Gamma distribution - rgamma(5,shape=2,rate=1)
Binomial distribution -rbinom(5,size=100,prob=.3)

Multinomial Distribution - rmultinom(5,size=100,prob=c(.2,.4,.7))

eruption.lm = lm(eruptions ~ waiting, data=faithful)
coeffs = coefficients(eruption.lm)
waiting = 80           # the waiting time 
duration = coeffs[1] + coeffs[2]*waiting 
duration --> Predicted value

loadd ggplot2 or ggplot using load("gplot")

Compare models using ANOVA
X1 <- nbsp="" span="" style="font-family: 'Lucida Console', 'courier new', monospace; font-size: 13px; line-height: 19.5px;">lm(y ~ x1 + x2 + x3 + x4, data=mydata)
Y1 <- lm="" span="" x1="" x2="" y="">
anova(X1, Y1)

Saturday, February 13, 2016

Hadoop Installation on Win 10 OS

Setting the Hadoop files prior to Spark installation on Win 10:
1. Ensure that your JAVA_HOME is properly set. A recommended approach here is to navigate to the installed Java folder in Program Files and copy the contents into a new folder
you can locate easily for eg:- C:\Projects\Java.
2. Create a user variable called JAVA_HOME and enter "C:\Projects\Java"
3. Add to the path system variable the following entry: "C:\Projects\Java\Bin;"
4. Create a HADOOP_HOME variable and specify the root path that contains all the Hadoop files for eg:- "C:\Projects\Hadoop"
5. Add to the path variable the bin location for your Hadoop repository: "C:\Projects\Hadoop\bin" <Keep track of your Hadoop installs like C:\Projects\Hadoop\2_5_0\bin>
6. Once these variables are set, open command prompt as an administrator and run the following commands to ensure that everything is set correctly:
A] java
B] javac
C] Hadoop
D] Hadoop Version
7. Also ensure your winutils.exe is in the Hadoop bin location.
< Download the same from - https://www.barik.net/archive/2015/01/19/172716/>
8. Also an error might related to the onfiguration location might occur -Add the following to the hadoop-env.cmd file to rectify the issue:
set HADOOP_PREFIX=C:\Projects\Hadoop

9. Another issue that I did face while leveraging Hadoop 2.6.0 install was the issue with the hadoop.dll. I had to recompile the source using MS VS to generate the hadoop.dll and pdb files and replaced the hadoop.dll which came along with the install.
10. Another error that I faced was "The system cannot find the batch label specified - nodemanager". Replace all the "\n" characters in the Yarn.cmd file to "\r\n".
11. Also replace the "\n" characters in the Hadoop.cmd file to "\r\n".

12. Yarn-site.xml change is as shown in the screenshot below:

13. Make changes to the core-site.xml as shown in the screenshot below:

14. Make the configuration changes as per the answer here :
15. Download Eclipse Helios for your Win OS to generate the jar's required for your map reduce applications. Use jdk1.7.0_71 and not the 1.8+ versions to compile your hadoop mapreduce programs.
16. Kickstart your Hadoop dfs and yarn and add data from any of your data sources and get ready to map reduce the heck out of it.... < A quick note,after formatting your named node it defaults to a tmp folder along with your machine name... in my case it is C:\tmp\hadoop-myPC\dfs\data>

Monday, December 21, 2015

Tableau Dashboards Published...

A few Tableau dashboards I have published off late to give a flair for different visualizations within Tableau:

Sunday, November 22, 2015

Cyclotron's Android App

Just created a Xamarin Android mobile application . Extremely easy to use and did not require much reading the resources to understand how to go about building it. The first iteration is as shown in the figure below:
Though the emulator (Nexus 5 through 7) did not render as clearly I wanted it to, but still a great start for v1.0. The next version would basically integrate with Google maps. As soon as one clicks the app, you get a splash screen and then navigate to Cyclotron's main menu from where you can navigate to the layouts. The support aspect would also be a part of the next iteration of the app along with the Login. There were a few more images added to the individual activities .

Probably leverage this article as the initial help on how to use the app.... The follow up items are as follows:
1. Integration with Google Maps
2. Synchronization with Cyclotron's support database
3. Login for Support
4. Tweak the UI
5. Replicate for IOS

Tuesday, October 20, 2015

Excel Regular Expression Parsing

Sample piece of code to parse Excel files with regular expressions using Excel Query (The first piece does not loop through all the rows to make it more efficient and fetches the resultant set based on the first row. The second piece goes through all the columns and finds a match):

//This piece uses ExcelQuery -->

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;
using System.Web.UI;
using System.Text.RegularExpressions;
using System.Web.UI.WebControls;
using Scanning;
using System.Data;
using Innovative.Data;

//Install ExcelQuery
namespace WebApplication1
    public partial class WebForm1 : System.Web.UI.Page
         protected void Page_Load(object sender, EventArgs e)

            ExcelQuery excelQuery = new ExcelQuery("C:\\Test\\Test.xlsx");
            excelQuery.HeaderHasFieldNames = true;
                     excelQuery.Provider = ExcelQuery.ConnectionStringItems.Providers.Jet12;
                     excelQuery.ExcelVersion = ExcelQuery.ConnectionStringItems.ExcelVersions.Excel2007to2013;
            excelQuery.ConnectionProperties = "IMEX=0";
            bool test = searchResults(excelQuery);

        protected bool searchResults(ExcelQuery  excelQuery)
            var x = excelQuery.GetSheets();
            foreach (string y in x)
                int i = 1,a=1;
                string sql = "SELECT * FROM [" + y+"]";
                Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
                DataSet data = excelQuery.ExecuteDataSet(sql,"Table");
                while (i <= 1)
                    foreach (DataTable dt in data.Tables)

                            var rowAsString = string.Join(", ", dt.Rows[0].ItemArray);

                           Match match = numberRegex.Match(rowAsString);
                            if (match.Success)
                                return true;

                       } while (a <= 1);

            return false;


//This piece uses ClosedXML and loops through columns

            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook("C:\\Test.xlsx");
            //List results;

            foreach (IXLWorksheet worksheet in wbk.Worksheets)
                //var test  = worksheet.ToString();
                var companyRange = worksheet.RangeUsed();
                var companyTable = companyRange.AsTable();

                if (companyTable != null)
                 var results = companyTable.DataRange.Rows()
                       .Where(companyRow => numberRegex.IsMatch(companyRow.Field(x).GetString()) == true)

                        int cnt = results.Count;

                        if (cnt > 0)


//This Piece just loops through Cells Used

byte[] array = File.ReadAllBytes("C:\\Projects\\Sheet1test.xlsx");
            Stream stream = new MemoryStream(array);
            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook(stream);
            foreach (IXLWorksheet worksheet in wbk.Worksheets)
                    IXLCells cells = worksheet.Columns().CellsUsed();
                    var results = cells.Where(cellval => numberRegex.IsMatch(cellval.GetString()) == true).ToList();

                    int cnt = results.Count;

                    if (cnt > 0)