Redshift Experience
Big Data - the keyword given to solutions that can handle massive amounts of data, usually at the petabyte scale or beyond. There are several big data solutions out there, and each has unique characteristics that make it useful in different scenarios. I have been looking into components of Cloudera's Hadoop distribution such as Impala, Sentry and HBase; all of these vary based on the use case. For some of my clients I have leveraged Amazon Redshift and Cassandra (and hopefully soon Apache Hadoop). The architectures of these systems differ, but the end goal is the same: store and process vast amounts of data while returning results in seconds or milliseconds. With that in mind, I am going to give a more detailed insight into Redshift, a node-based, petabyte-scale data warehouse, along with a high-level overview of what I recently implemented.
Note: The above diagram is from the Redshift Warehousing article (http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html)
If you pay close attention to the diagram above, the compute nodes are responsible for all the data processing and query transactions against the database, and the data nodes contain the actual node slices. The Leader Node is the service that manages the client connections to the Redshift cluster and is also responsible for query processing (acting almost like a syntax checker on the query and the functions it leverages). It then distributes the query to the Compute Nodes, whose main responsibility is to locate the data slices on the data nodes and communicate with one another to determine how the transaction should be executed. It is similar to a job broker, except that this happens in real time rather than in batch.
It is similar to the analogy of using a bucket. Consider this:
You take a bucket and keep pouring water into it; eventually the bucket gets filled. But what happens when there is an enormous amount of water that needs to be contained? You either grab a massive bucket or use multiple buckets to store the water (the second option is what the Redshift architecture actually depicts).
Scaling out implies not only adding a bucket for storage but also a mechanism to ensure that the incoming flow is routed to an empty bucket, which is nothing but our compute node. Of course, there is the price of the bucket as well as the cost of the mechanism required to fill it.
The costing of data nodes and compute nodes is as follows:
As the data volume keeps increasing, reserve a data node and then utilize it as required. While creating the Redshift cluster, ensure that the cluster type is Multi-Node and not Single-Node for projects involving large data volumes.
The tables are stored across the different nodes with a columnar partitioning algorithm governed by the distribution key of the respective table (a minimal example of declaring one follows). The database underlying Redshift is Postgres (the version keeps getting updated with each release), and this is handled by Amazon. On a quick note, whenever you add a node to the cluster the database will go down, so ensure that all your ETLs or import/export jobs are stopped during this time frame.
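Here is a minimal sketch of how a distribution key (and a sort key) can be declared; the table and column names are hypothetical and only for illustration, not from my actual implementation:

create table sales_fact (
    sale_id     bigint,
    customer_id integer,
    sale_date   date,
    amount      decimal(12,2)
)
distkey (customer_id)   -- rows are spread across node slices by customer_id
sortkey (sale_date);    -- rows are stored on disk ordered by sale_date

Picking a distribution key that is frequently used in joins keeps matching rows on the same slices and cuts down on data shuffling between the compute nodes.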
Even though daily snapshots are taken of the Redshift cluster, a key note here: if you ever plan to delete the cluster and recreate it from a previous snapshot, first create the new cluster from the required snapshot and only then delete the original Redshift cluster.
Even though Redshift has a lot of data integration tool partnerships, I would definitely recommend leveraging Talend for ETL. I might give a breakdown of a Talend ETL in an upcoming article.
For reporting, on the other hand, there are several analytics tools like Pentaho, MicroStrategy, Tableau, Birst etc.; I would recommend Tableau, Pentaho or MicroStrategy for analytical dashboarding needs, depending on cost considerations.
Let us get back to Redshift. The storage mechanism is similar to the Tabular VertiPaq engine model that Microsoft leverages, and the processing is done using MPP, i.e. Massively Parallel Processing. The encryption is two-fold: the first fold is at the database level, where data encryption keys encrypt the data blocks in the cluster. Each data block is assigned a randomly generated AES-256 key, and these keys are in turn encrypted using the database key for the cluster. The other fold is at the data level, commonly known as encoding (column compression). Below is a flat table of which encoding to use for which data type (a short example of applying these follows the table):
Data type | Encoding
int, date, datetime, timestamp | delta
varchar or char up to 50 characters | runlength
varchar > 50 characters | text255
smallint, decimal | mostly8
Others | raw
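As a rough sketch of how these encodings are applied in practice (the table and columns here are made up purely for illustration), the ENCODE keyword is specified per column when the table is created:

create table web_events (
    event_id   bigint        encode delta,
    event_date date          encode delta,
    country    char(20)      encode runlength,
    page_url   varchar(300)  encode text255,
    hits       smallint      encode mostly8
)
distkey (event_id)
sortkey (event_date);

If you are unsure which encoding fits an already loaded table, running ANALYZE COMPRESSION against it will make Redshift suggest an encoding per column based on a sample of the data.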
The cons of using Redshift are as follows:
- You need an Administrator to monitor the Redshift cluster at all times
- The power of Redshift lies in the manner the database is queried, so a SQL developer/DBA with understanding of the Redshift internals is definitely a must
- Upfront cost is slightly on the higher side, but over a period of time the cost will be justified as more nodes are added to the cluster
Redshift supports both JDBC and ODBC drivers, so you can connect with either if you want to query the database directly. The Postgres driver can be found as follows:
Like any other traditional RDBMS, if there is a constant stream of queries running against the same table, locking of the table is inevitable, so care must be taken to check for lock conflicts using the following statement:
-->select * from STL_TR_CONFLICT
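If you want to dig a little deeper, the same system tables can be queried with filters; this is only a sketch assuming the standard columns documented for STL_TR_CONFLICT and STV_LOCKS:

-- most recent serializable-isolation conflicts, newest first
select xact_id, process_id, table_id, abort_time
from stl_tr_conflict
order by abort_time desc
limit 20;

-- locks currently held on tables in the cluster
select table_id, lock_owner_pid, lock_status
from stv_locks;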
Also ensure that the statistics are rebuilt on a regular basis by running the following command:
-->Analyze Table_Name
for all the tables that are queried on a periodic basis.
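A small sketch of what such a periodic maintenance run can look like (the schema and table names are hypothetical); pairing ANALYZE with VACUUM also reclaims space and re-sorts rows after heavy loads or deletes:

analyze public.sales_fact;   -- refresh planner statistics
vacuum public.sales_fact;    -- reclaim deleted space and re-sort rows
analyze public.customer_dim;
vacuum public.customer_dim;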
Also enable auditing on the cluster using the Amazon Redshift console.
A good practice is to ensure that the data warehouse schema conforms to a star schema to improve query performance, as the sketch below shows.
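One way this plays out in Redshift (again a sketch with made-up tables, not my exact schema) is to distribute the fact table and its largest dimension on the shared join key so the join is co-located, and to replicate small dimensions to every node with DISTSTYLE ALL:

create table customer_dim (
    customer_id   integer,
    customer_name varchar(100)
)
distkey (customer_id);    -- same distribution key as the fact table above

create table date_dim (
    date_key       date,
    fiscal_quarter char(2)
)
diststyle all;            -- small dimension copied to every node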
That is it for now, but stay tuned for more updates in this regard!