Posts

Showing posts with the label Big Data

Modern Day Data Warehousing

With ever-growing data volumes and the need to make rapid decisions on your data sets, numerous technologies have emerged in the past 5 years that try to blend the transactional pipelines, the data storage and the analytics together. The modern-day data warehouse dumps the data into more of a containerized approach, moving away from the traditional star and snowflake schemas of the past. Not that these data warehouse schemas have gone extinct; the approach is simply more in tune with rapid assembly and disassembly of data in containers. The core question is whether to de-containerize the data as soon as it comes into the storage layer, or to do it in an ad hoc fashion, with the containers being pipelined further into data virtualization stores, data marts, or even just reporting on top of the data brought into the data lake, a virtualized data warehouse, an in-memory data mart, etc. Almost all modern day architectures have s...

Hadoop Installation on Win 10 OS

Setting up the Hadoop files prior to Spark installation on Win 10:
1. Ensure that your JAVA_HOME is properly set. A recommended approach is to navigate to the installed Java folder in Program Files and copy its contents into a new folder you can locate easily, e.g. C:\Projects\Java.
2. Create a user variable called JAVA_HOME and enter "C:\Projects\Java".
3. Add the following entry to the Path system variable: "C:\Projects\Java\bin;"
4. Create a HADOOP_HOME variable and specify the root path that contains all the Hadoop files, e.g. "C:\Projects\Hadoop".
5. Add the bin location of your Hadoop install to the Path variable: "C:\Projects\Hadoop\bin" (keep track of your Hadoop installs, e.g. C:\Projects\Hadoop\2_5_0\bin).
6. Once these variables are set, open a command prompt as an administrator and run the following commands to ensure that everything is set correctly: A] java B] javac C] hadoop D] hadoop version
7. ...
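The variable setup in steps 1 to 5 can also be sanity-checked from a script before running the commands in step 6. The following is a minimal sketch, not part of the original post: the function name check_hadoop_env is hypothetical, and it only assumes that JAVA_HOME and HADOOP_HOME should be set and that their bin folders should appear on the Path.

```python
import os


def check_hadoop_env(env=None):
    """Return a list of problems found with the Java/Hadoop variables.

    Hypothetical helper: it checks only what the steps above describe,
    namely that JAVA_HOME and HADOOP_HOME are set and that each one's
    bin folder is on the Path.
    """
    env = os.environ if env is None else env
    problems = []
    # Windows separates Path entries with ';' and ignores case in paths.
    path_entries = {
        entry.strip().rstrip("\\").lower()
        for entry in env.get("PATH", "").split(";")
        if entry.strip()
    }
    for var in ("JAVA_HOME", "HADOOP_HOME"):
        home = env.get(var, "").strip()
        if not home:
            problems.append(f"{var} is not set")
            continue
        bin_dir = home.rstrip("\\").lower() + "\\bin"
        if bin_dir not in path_entries:
            problems.append(f"{var}\\bin is not on PATH")
    return problems


if __name__ == "__main__":
    issues = check_hadoop_env()
    print("Environment looks fine" if not issues else "\n".join(issues))
```

An empty list means both variables are set and both bin folders are on the Path. A new command prompt is needed for changed variables to be visible to the script.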

Elasticsearch Notes

I have recently been playing with a lot of open source tool sets to figure out core solutions for different product ideas that I have. One of the recent technologies I have used is Elasticsearch. Elasticsearch is basically a NoSQL-based indexing solution that allows one to use Lucene indexes on top of massive data sets, especially string-based documents. This blog post is just a bunch of notes that I have compiled. What is Elasticsearch? Elasticsearch is a document store in which documents are kept in an index, and each index is split into multiple shards spread across a cluster. Sharding is basically the concept of partitioning data based on some metric within the data. Elasticsearch exposes an HTTP-based request-response interface to query the individual documents stored in the index. In my case I created a 2-node cluster. After this step I created an index called imdb_search. Initially I wanted to create a graphing tool to showcase the connections that I had on Facebook and the relatio...
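To make the sharding idea in these notes concrete, here is a small sketch of hash-based partitioning: each document id is hashed and the result, modulo the number of shards, picks the shard. This is an illustration only. The function names are hypothetical, and crc32 is a stand-in for the hash; Elasticsearch applies its own hash function to a routing value (the document id by default).

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number using a stable hash."""
    # crc32 is a stand-in chosen because it is stable across runs;
    # it is not the hash function Elasticsearch itself uses.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Group document ids by the shard they would be routed to."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


if __name__ == "__main__":
    # Hypothetical document ids for an index such as imdb_search.
    ids = [f"movie-{n}" for n in range(10)]
    for shard, members in partition(ids, 3).items():
        print(shard, members)
```

Because the shard is derived from the id alone, the same document always lands on the same shard, which is also why the number of shards cannot change freely once documents have been routed this way.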