This post is a part 2 to my previous post on Data Implosion and not Explosion. Of late I have been reading a lot of books on data, especially related to Data Science, and one I am on the verge of completing is Developing Analytic Talent: Becoming a Data Scientist by Vincent Granville. The reason I purchased this book was to get a deeper insight into what Data Scientists do and what is required in order to become one. The unfortunate part of this book is that the deeper insight into Data Science is largely incomprehensible. The reason is that the majority of the formulae (which I would love to understand and apply) are not defined clearly, and most of them do not define the coefficients they use. Coming from a Computer Science background, I am familiar with core aspects like Big O notation and how to derive it from one's code, so reading through those parts was a cakewalk since I leverage these concepts on a regular basis (or whenever I actively code). The statistical representation of data is also understandable as far as frequency tables go (and a few other aspects like the mean and standard deviation), but formulae involving things like weights and sample spaces are not clearly represented and come out garbled. That said, I would definitely recommend the chapter on Excel, which I think has been written fantastically and is a must-read for data geeks. The sections on clustering and on when to leverage certain principles like sampling were again very shaky for me to read and understand. However, I am going to talk about cluster sampling in this article:

1. Let us consider a population of 500 people.

2. Divide this sample space into clusters: For example->

A] number of men (140)

B] number of women (150)

C] number of children (210)

3. Each cluster is now a sample in its own right; we keep the clusters separate after step 2 rather than combining them.

4. From each cluster we randomly choose a subset of men, women and children and measure the parameters we want to gauge. A couple of examples here would be:

A] What time does one sleep?

B] What time does one have lunch?

C] How many times in a month does one shop? etc..

5. Let us take only [A] for our current problem.

--> I choose 20 people from each cluster as a sample
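The cluster set-up in steps 1-5 can be sketched in Python; the population and cluster sizes are from the text, while the integer member IDs and the use of `random.sample` are just stand-ins for real survey respondents and whatever selection method one actually uses:

```python
import random

random.seed(42)  # fixed seed so the draw is reproducible

# Clusters from step 2; member IDs are placeholders for real people
clusters = {
    "men": list(range(140)),
    "women": list(range(150)),
    "children": list(range(210)),
}

# Step 5: draw 20 from each cluster, without replacement
samples = {name: random.sample(members, 20)
           for name, members in clusters.items()}

for name, sample in samples.items():
    print(name, len(sample))
```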

Let us consider the men:

Sleep time slot     Number of men    Proportion
-----------------------------------------------
1pm - 5pm                 4          4/20 = 0.20
6pm - 10pm               12          12/20 = 0.60
post 10pm                 4          4/20 = 0.20
-----------------------------------------------
Total                    20          1.00

From this sample we can extrapolate to the overall men's cluster and gather the required metrics.
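The tally above can be sketched as follows; the slot counts are taken from the table, and projecting the sample proportions onto the full cluster of 140 men is the extrapolation step just described:

```python
# Tally from the table: sleep-time slot -> number of the 20 sampled
# men who fall in that slot
sample_size = 20
sleep_counts = {"1pm - 5pm": 4, "6pm - 10pm": 12, "post 10pm": 4}

# Proportion of the sample in each slot (these sum to 1)
proportions = {slot: n / sample_size for slot, n in sleep_counts.items()}

# Project the sample proportions onto the full men's cluster
cluster_size = 140
estimates = {slot: round(p * cluster_size) for slot, p in proportions.items()}

for slot in sleep_counts:
    print(f"{slot}: proportion {proportions[slot]:.2f}, "
          f"estimated {estimates[slot]} of {cluster_size} men")
```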

Now cluster sampling can actually derail your analysis if your data points are redundant, in which case it is not worth the time and effort. But as I mentioned earlier, a Data Implosion Mechanic is someone who will not deduce these aspects from the data volume but rather force onto the data the metrics that need to be imposed. I could be wrong here but this is my take on it... Let's move on to Monte Carlo simulations a bit....

Let us consider [B]. We allocate 2 time slots, 12pm-1pm and 1pm-2pm, and let us say that none of the people who took the survey were willing to answer this specific question.

So let us randomly assign each of the 150 women either 12pm-1pm (1) or 1pm-2pm (0):

Averaging the 150 entries, say we find that 44% of the women take their meal between 12pm and 1pm and the remaining 56% take it after 1pm. So if I were to set up a sandwich shop, I might consider targeting the 1pm slot for opening..... (Since this is more of a random application of metrics, you could reverse the data results and then try your experiment again.)
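A minimal Monte Carlo sketch of this lunch-slot experiment, assuming each woman is assigned a fair random 1 (12pm-1pm) or 0 (1pm-2pm); the 44%/56% split mentioned above is just one possible outcome of such a run, and the seed here is arbitrary:

```python
import random

random.seed(7)  # fixed seed so the simulated survey is reproducible
n_women = 150

# 1 -> lunch between 12pm-1pm, 0 -> lunch between 1pm-2pm
draws = [random.randint(0, 1) for _ in range(n_women)]

share_12_to_1 = sum(draws) / n_women
share_after_1 = 1 - share_12_to_1

print(f"12pm-1pm: {share_12_to_1:.0%}, after 1pm: {share_after_1:.0%}")
```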
