Saturday, February 13, 2016

Hadoop Installation on Win 10 OS

Setting up the Hadoop files prior to a Spark installation on Windows 10:
1. Ensure that JAVA_HOME is properly set. A recommended approach is to navigate to the installed Java folder in Program Files and copy its contents into a new folder you can locate easily, e.g. C:\Projects\Java.
2. Create a user variable called JAVA_HOME and set it to "C:\Projects\Java".
3. Add the following entry to the Path system variable: "C:\Projects\Java\bin;"
4. Create a HADOOP_HOME variable and specify the root path that contains all the Hadoop files, e.g. "C:\Projects\Hadoop".
5. Add the bin location of your Hadoop repository to the Path variable: "C:\Projects\Hadoop\bin" (keep track of your Hadoop installs with versioned paths like C:\Projects\Hadoop\2_5_0\bin).
6. Once these variables are set, open a command prompt as an administrator and run the following commands to ensure that everything is set correctly:
A] java
B] javac
C] hadoop
D] hadoop version
7. Also ensure your winutils.exe is in the Hadoop bin location.
(Download it from https://www.barik.net/archive/2015/01/19/172716/)
8. An error related to the configuration location might also occur. Add the following to the hadoop-env.cmd file to rectify the issue:
set HADOOP_IDENT_STRING=%USERNAME%
set HADOOP_PREFIX=C:\Projects\Hadoop
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin

9. Another issue I faced with the Hadoop 2.6.0 install was with hadoop.dll. I had to recompile the source using Microsoft Visual Studio to generate the hadoop.dll and .pdb files, and then replaced the hadoop.dll that came with the install.
10. Another error I faced was "The system cannot find the batch label specified - nodemanager". Replace all the "\n" (LF) line endings in the yarn.cmd file with "\r\n" (CRLF).
11. Also replace the "\n" line endings in the hadoop.cmd file with "\r\n" (one way to script this is sketched below).
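One way to do the conversion, assuming PowerShell 5 on Windows 10 (Notepad++'s EOL conversion to Windows CR LF works just as well); the file path below is an assumption based on the layout above:

# Replace any bare LF with CRLF without touching existing CRLF pairs (path is hypothetical)
$f = "C:\Projects\Hadoop\bin\yarn.cmd"
(Get-Content $f -Raw) -replace "(?<!`r)`n", "`r`n" | Set-Content $f -NoNewline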

12. Make the yarn-site.xml changes as shown below:
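(The original screenshot has not survived. The following is a representative single-node yarn-site.xml based on the standard settings from the Stack Overflow answer linked in step 14; treat the values as an assumption rather than the exact original.)

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>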

13. Make changes to the core-site.xml as shown below:
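(Again the screenshot is missing; a typical single-node core-site.xml, assuming HDFS on the default port 9000, looks like this.)

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>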


14. Make the configuration changes as per the answer here:
http://stackoverflow.com/questions/18630019/running-apache-hadoop-2-1-0-on-windows/23959201#23959201
15. Download Eclipse Helios for your Windows OS to generate the jars required for your MapReduce applications. Use JDK 1.7.0_71, not the 1.8+ versions, to compile your Hadoop MapReduce programs.
16. Kick-start your Hadoop DFS and YARN, add data from any of your data sources, and get ready to MapReduce the heck out of it (a minimal command sequence follows below). A quick note: after formatting your namenode, the data directory defaults to a tmp folder tagged with your machine name; in my case it is C:\tmp\hadoop-myPC\dfs\data.
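A minimal sketch of that kick-start sequence, assuming the paths set earlier in this post (the sample data path is hypothetical):

hdfs namenode -format
%HADOOP_PREFIX%\sbin\start-dfs.cmd
%HADOOP_PREFIX%\sbin\start-yarn.cmd
hdfs dfs -mkdir /input
hdfs dfs -put C:\Data\sample.txt /input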

Monday, December 21, 2015

Tableau Dashboards Published...

A few Tableau dashboards I have published of late, to give a feel for the different visualizations possible within Tableau:







Sunday, November 22, 2015

Cyclotron's Android App

Just created a Xamarin Android mobile application. Xamarin was extremely easy to use and did not require much reading of the resources to understand how to go about building the app. The first iteration is as shown in the figure below:
Though the emulator (Nexus 5 through 7) did not render it as clearly as I wanted, it is still a great start for v1.0. The next version will integrate with Google Maps. As soon as you open the app, you get a splash screen and then navigate to Cyclotron's main menu, from where you can navigate to the layouts. Support will also be part of the next iteration of the app, along with the login. A few more images were added to the individual activities.

I will probably leverage this article as the initial help on how to use the app. The follow-up items are as follows:
1. Integration with Google Maps
2. Synchronization with Cyclotron's support database
3. Login for Support
4. Tweak the UI
5. Replicate for iOS


Tuesday, October 20, 2015

Excel Regular Expression Parsing

Sample code for parsing Excel files with regular expressions. The first snippet uses ExcelQuery and, for efficiency, does not loop through all the rows: it fetches the result set and checks only its first row. The snippets after it use ClosedXML and scan the used rows and cells for a match:

// This piece uses ExcelQuery

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.IO;
using System.Text;
using System.Web.UI;
using System.Text.RegularExpressions;
using System.Web.UI.WebControls;
using Scanning;
using System.Data;
using Innovative.Data;

//Install ExcelQuery
namespace WebApplication1
{
    public partial class WebForm1 : System.Web.UI.Page
    {
         protected void Page_Load(object sender, EventArgs e)
        {

            ExcelQuery excelQuery = new ExcelQuery("C:\\Test\\Test.xlsx");
            excelQuery.HeaderHasFieldNames = true;
            excelQuery.Provider = ExcelQuery.ConnectionStringItems.Providers.Jet12;
            excelQuery.ExcelVersion = ExcelQuery.ConnectionStringItems.ExcelVersions.Excel2007to2013;
            excelQuery.ConnectionProperties = "IMEX=0";
            bool test = searchResults(excelQuery);
            Response.Write(test.ToString());
        }


        protected bool searchResults(ExcelQuery excelQuery)
        {
            // Matches SSN-style values such as 123-45-6789.
            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);

            foreach (string sheet in excelQuery.GetSheets())
            {
                string sql = "SELECT * FROM [" + sheet + "]";
                DataSet data = excelQuery.ExecuteDataSet(sql, "Table");

                foreach (DataTable dt in data.Tables)
                {
                    if (dt.Rows.Count == 0)
                        continue;

                    // Only the first row of each table is inspected, which keeps the scan fast.
                    var rowAsString = string.Join(", ", dt.Rows[0].ItemArray);
                    if (numberRegex.Match(rowAsString).Success)
                    {
                        return true;
                    }
                }
            }

            return false;
        }

    }
}   


// This piece uses ClosedXML and loops through all the table rows


            // Requires: using ClosedXML.Excel;
            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook("C:\\Test.xlsx");

            foreach (IXLWorksheet worksheet in wbk.Worksheets)
            {
                var companyRange = worksheet.RangeUsed();
                var companyTable = companyRange.AsTable();

                if (companyTable != null)
                {
                    // Check every cell of every data row for a match.
                    var results = companyTable.DataRange.Rows()
                        .Where(companyRow => companyRow.Cells()
                            .Any(cell => numberRegex.IsMatch(cell.GetString())))
                        .ToList();

                    if (results.Count > 0)
                    {
                        Response.Write("Testing");
                    }
                }
            }

// This piece just loops through the used cells

            // Load the workbook from a byte array via a MemoryStream.
            byte[] array = File.ReadAllBytes("C:\\Projects\\Sheet1test.xlsx");
            Stream stream = new MemoryStream(array);
            Regex numberRegex = new Regex(@"\d{3}-\d{2}-\d{4}", RegexOptions.IgnoreCase);
            XLWorkbook wbk = new XLWorkbook(stream);

            foreach (IXLWorksheet worksheet in wbk.Worksheets)
            {
                IXLCells cells = worksheet.Columns().CellsUsed();
                var results = cells.Where(cellval => numberRegex.IsMatch(cellval.GetString())).ToList();

                if (results.Count > 0)
                {
                    Response.Write("Testing");
                    return;
                }
            }
          


Monday, September 07, 2015

Elasticsearch Notes

I have recently been playing with a lot of open source tool sets to figure out core solutions for different product ideas that I have. One of the technologies I have used recently is Elasticsearch, a NoSQL-style indexing solution that lets you run Lucene indexes on top of massive data sets, especially string-based documents. This blog post is just a bunch of notes that I have compiled.

What is Elasticsearch?
Elasticsearch is a document store: documents are stored in an index that lives on a cluster and is split across multiple shards. Sharding is the concept of partitioning data based on some metric within the data.
Elasticsearch exposes an HTTP-based request/response API to query the individual documents stored in the index.
In my case I created a two-node cluster (the original screenshot of the setup is not reproduced here):
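A quick way to verify a cluster like this from Sense is via the stock cluster APIs (these are standard Elasticsearch endpoints, not commands from the original post):

GET /_cluster/health
GET /_cat/nodes?v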

After this step I created an index called imdb_test. Initially I wanted to create a graphing tool to showcase the connections I had on Facebook and the relations between them... but then I decided nah... I will go with a more open public API search for JSON feeds. Note: I used the Sense console provided by Marvel to issue the HTTP commands (GET/POST/PUT) to store/parse/write the data.


POST /imdb_test
{
  "mappings": {
    "post": {
      "_routing": {
        "required": true,
        "path": "movie_name"
      },
      "properties": {
        "movie_id": { "type": "integer" },
        "movie_name": { "type": "string" },
        "movie_box_office_value": { "type": "integer" },
        "movie_date_of_release": { "type": "date" }
      }
    }
  }
}

A note here is that the cluster can be reconfigured with an API call:
PUT /_cluster/settings
{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}
In this case I created an index with the following fields --> movie_id, movie_name, movie_box_office_value, movie_date_of_release, with the data types as shown in the mapping above.

Next I pulled the required JSON feeds for the documents from the IMDB open APIs. One of the queries I used is as follows:
http://www.imdb.com/xml/find?json=1&nr=1&nm=on&q=Disney

Using a mix and match of some of the data, and generating random box office and date values in my temporary C# parser, I dynamically created a few entries for my Elasticsearch document repository (a sketch of the parser follows the POSTs below). A couple of the POSTs are as follows:


POST imdb_test/post { "movie_name":"Star Wars: Episode VI - Return of the Jedi", "movie_id":1, "movie_box_office_value":7000000, "movie_date_of_release":"2005-07-01" }
POST imdb_test/post { "movie_name":"Terminator", "movie_id":2, "movie_box_office_value":10000000, "movie_date_of_release":"1994-06-04" }
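For reference, a minimal sketch of what that throwaway C# parser could look like; the class name, movie list and endpoint URL are all assumptions rather than the original code:

using System;
using System.Net;

class MoviePoster
{
    static void Main()
    {
        var rand = new Random();
        // Hypothetical seed list; the real names came from the IMDB open API feed.
        string[] movies = { "Terminator", "Star Wars: Episode VI - Return of the Jedi" };

        using (var client = new WebClient())
        {
            client.Headers[HttpRequestHeader.ContentType] = "application/json";
            for (int i = 0; i < movies.Length; i++)
            {
                // Random box office value and release date, as described above.
                var date = new DateTime(1990 + rand.Next(0, 20), rand.Next(1, 13), rand.Next(1, 29));
                string json = "{ \"movie_name\":\"" + movies[i] + "\"," +
                              " \"movie_id\":" + (i + 1) + "," +
                              " \"movie_box_office_value\":" + (rand.Next(1, 20) * 1000000) + "," +
                              " \"movie_date_of_release\":\"" + date.ToString("yyyy-MM-dd") + "\" }";
                // Assumes a local node on the default port 9200.
                client.UploadString("http://localhost:9200/imdb_test/post", "POST", json);
            }
        }
    }
}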


After generating a couple of data files for my index... I was able to query the list with specific filters or basic queries using _search, like GET /imdb_test/_search, or with more complex queries like:
GET imdb_test/post/_search
{
  "query": {
    "match": {
      "user_name": "terminator"
    }
  }
  ,
  "aggs": {
    "all_words": {
      "terms": {"field":"movie_name"}
    }
  }
}
I stopped short after checking out my routing using the GET commands:
GET /imdb_test/post/Terminator
The next step was to create a SPA to generate a dashboard from the resultant set. The primary advantage of Elasticsearch is its querying ability over massive data volumes, which makes it useful for document repositories (as in my case), blogging, and even geo-based analysis of data. It works with JSON documents and also ships with a really cool analytics dashboard called Kibana to showcase the metrics of the environment. We can also alias our indexes; in my example, the movies could be categorized into Horror/Action/Comedy etc. genres by aliasing, which is a pretty handy feature (see the sketch below).
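A filtered-alias sketch in Sense: this assumes a hypothetical genre field had been added to the mapping, and the alias name and filter values are illustrative rather than from the original post:

POST /_aliases
{
  "actions": [
    { "add": { "index": "imdb_test", "alias": "imdb_action", "filter": { "term": { "genre": "action" } } } }
  ]
}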

[9/16/2015]
On that note, I took the above Elasticsearch settings and applied them to a Life Sciences need. I stored the metadata results for the graphs inside Elasticsearch and pulled in the required metrics. In my case I needed the graphs to be jazzy --> so I had three index sets with two separate aliases each, and I changed the data feeds to match. For the genetic algorithm, I basically leveraged the JSON feed from Karsten @ http://www.karstenahnert.com/. I also got a good idea for the dashboard from Colin @ http://colinwhite.net/Dash2.5/, which was more for hospital management.

Here is a sneak peek at the dashboard:


Note: The red, green and yellow were meant to mimic a heat map: trends that are hot in red, lukewarm in green and shallow in yellow. Used simple SVG rects for that portion.
The visualizations were done using D3, and the GETs were fetched using Angular.