Modern Day Data Warehousing
With ever-growing data volumes and the need to make rapid decisions on your data sets, numerous technologies have emerged in the past 5 years that try to blend transactional pipelines, data storage and analytics together. The modern day data warehouse leans toward dumping data into containers, moving away from the traditional star and snowflake schemas of the past. Not that these data warehouse schemas have gone extinct; rather, the newer approach is more in tune with rapid assembly and disassembly of data in containers. The core question is whether to de-containerize the data as soon as it comes into the storage layer, or to do it in an ad hoc fashion, with the containers being pipelined further into data virtualization stores, data marts, or even just reporting on top of the data brought into the data lake, a virtualized data warehouse, an in-memory data mart, etc.
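To make that choice concrete, here is a minimal Python sketch of the two options. It assumes the "containers" are raw JSON strings; the event fields and the function names (unpack, load_eagerly, revenue_by_region_ad_hoc) are made up for illustration and do not come from any particular product.

```python
import json

# Hypothetical raw "containers": JSON strings as they might land in a data lake.
RAW_EVENTS = [
    '{"order_id": 1, "customer": {"id": "c1", "region": "EU"}, "amount": 20.0}',
    '{"order_id": 2, "customer": {"id": "c2", "region": "US"}, "amount": 35.5}',
    '{"order_id": 3, "customer": {"id": "c1", "region": "EU"}, "amount": 4.5}',
]


def unpack(raw):
    """De-containerize one raw JSON document into a flat row."""
    doc = json.loads(raw)
    return {
        "order_id": doc["order_id"],
        "customer_id": doc["customer"]["id"],
        "region": doc["customer"]["region"],
        "amount": doc["amount"],
    }


def load_eagerly(raw_events):
    """Option 1: unpack every container as soon as it arrives."""
    return [unpack(raw) for raw in raw_events]


def revenue_by_region_from_rows(rows):
    """Aggregate already-flattened rows into total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def revenue_by_region_ad_hoc(raw_events):
    """Option 2: keep the containers raw and unpack only at query time."""
    return revenue_by_region_from_rows(unpack(raw) for raw in raw_events)


eager_rows = load_eagerly(RAW_EVENTS)
print(revenue_by_region_from_rows(eager_rows))
print(revenue_by_region_ad_hoc(RAW_EVENTS))
```

Both paths give the same answer. The difference is where the unpacking cost lands: once at load time (closer to a classic warehouse load), or on every query against the raw store (closer to reporting straight off the data lake).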
Many modern day architectures have started moving down this path, including Netflix, Facebook (which introduced its own on-disk data storage format derived from Hive), Microsoft with CosmosDB, predominantly used by the Bing search team, and several others. Real-time analytics with modern day queuing systems like Kafka, RabbitMQ, Kinesis, ZeroMQ etc. is taking the analytics world by storm.
On the virtualized side you have Denodo, Delphix, Dremio etc., which are used by companies like Virgin Orbit, Chrysler, Biogen and AAA, and which are also making a significant impact in the market nowadays. The general rule of thumb is to place the data virtualization layer on top of the data warehouse or data lake, the key aspect being that this store will be the vault for all the analytics within an organization. Also, in virtualized stores you typically maintain strict copies of the source so as to keep total control of the data and improve the data governance of the overall architecture. A data warehouse and a data virtualization layer play different roles that go hand in hand in a BI ecosystem.
According to Wikipedia's definition of a data lake: "A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics and machine learning."
A couple of really good reads on modern day data warehousing are as follows:
https://cloud.google.com/solutions/build-a-data-lake-on-gcp
https://medium.com/netflix-techblog/building-and-scaling-data-lineage-at-netflix-to-improve-data-infrastructure-reliability-and-1a52526a7977
https://keen.io/blog/architecture-of-giants-data-stacks-at-facebook-netflix-airbnb-and-pinterest/
https://engineering.fb.com/core-data/scaling-the-facebook-data-warehouse-to-300-pb/
https://customers.microsoft.com/en-gb/story/bing-ads-partner-professional-services-azure-sql-data-warehouse
https://cloudblogs.microsoft.com/sqlserver/2016/07/12/the-elastic-future-of-data-warehousing/
Note: I am not blogging as much as I want to, but I will keep posting stuff I find interesting once in a while.