Why Azure might overtake AWS in its data services offerings
- Compliance
Microsoft enterprise
cloud services are independently validated through certifications and
attestations, as well as third-party audits. In-scope services within the
Microsoft Cloud meet key international and industry-specific compliance
standards, such as ISO/IEC 27001 and ISO/IEC 27018, FedRAMP, and SOC 1 and SOC
2. They also meet regional and country-specific standards and contractual
commitments, including the EU Model Clauses, UK G-Cloud, Singapore MTCS, and
Australia CCSL (IRAP). In addition, rigorous third-party audits, such as by the
British Standards Institution and Deloitte, validate the adherence of our cloud
services to the strict requirements these standards mandate.
- Security
With security tightly integrated with its Active Directory offerings, Microsoft
continues to evolve security and data integrity in Azure. Core advantages of
security in Azure are as follows:
- Tightly integrated with Windows Active Directory
- Simplified cloud access using Single Sign On
- Single and Multi Factor authentication support
- Rich protocol support, e.g. Federated Authentication (WS-Federation), SAML, OAuth 2.0 (version 1.0 still supported), OpenID Connect, Graph API, and Web API 2.0 (in conjunction with Authentication_JSON_AppService.axd and the Authorize attribute); see the token-acquisition sketch after this list
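As an illustration of the OAuth 2.0 support listed above, here is a minimal sketch of acquiring a token from Azure AD with the client credentials flow; the tenant ID, client ID, secret, and scope are placeholder assumptions, and in practice a library such as MSAL would typically handle this for you.

```python
import requests

# Placeholder values: substitute your own Azure AD tenant and app registration.
TENANT_ID = "<tenant-id>"
CLIENT_ID = "<client-id>"
CLIENT_SECRET = "<client-secret>"

# Azure AD v2.0 token endpoint for the tenant.
token_url = f"https://login.microsoftonline.com/{TENANT_ID}/oauth2/v2.0/token"

payload = {
    "grant_type": "client_credentials",
    "client_id": CLIENT_ID,
    "client_secret": CLIENT_SECRET,
    # Request a token for Microsoft Graph; any registered API scope works here.
    "scope": "https://graph.microsoft.com/.default",
}

response = requests.post(token_url, data=payload)
response.raise_for_status()
access_token = response.json()["access_token"]

# The bearer token can now be sent on calls to Graph API or your own Web API.
print(access_token[:40], "...")
```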
- Data/BI Offerings in the Cloud
Microsoft Azure has essentially all the components required to support
data-related and business intelligence workloads in various formats.
The core database
and data collection/integration formats in Azure are as follows:
- Data Factory
Data
Factory is a cloud-based data integration service that orchestrates and
automates the movement and transformation of data. You can create data integration solutions
using the Data Factory service that can ingest data from various data stores,
transform/process the data, and publish the resulting data to data stores.
Data Factory service
allows you to create data pipelines that move and transform data, and then run
the pipelines on a specified schedule (hourly, daily, weekly, etc.). It also
provides rich visualizations to display the lineage and dependencies between your
data pipelines, and lets you monitor all your pipelines from a single unified view
to easily pinpoint issues and set up monitoring alerts.
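To give a feel for what a pipeline definition looks like, here is a minimal sketch that builds a copy-pipeline definition as a Python dictionary and serializes it to JSON; the pipeline name, dataset names, schedule, and dates are illustrative assumptions rather than a definitive template.

```python
import json

# Illustrative pipeline definition: copy data from one linked dataset to another
# on an hourly schedule. Names and properties here are placeholder assumptions.
pipeline = {
    "name": "CopyCustomerDataPipeline",
    "properties": {
        "description": "Move raw customer data into a curated store every hour",
        "activities": [
            {
                "name": "CopyCustomerData",
                "type": "Copy",
                "inputs": [{"name": "RawCustomerDataset"}],
                "outputs": [{"name": "CuratedCustomerDataset"}],
                "scheduler": {"frequency": "Hour", "interval": 1},
            }
        ],
        "start": "2017-01-01T00:00:00Z",
        "end": "2017-12-31T00:00:00Z",
    },
}

# Serialize the definition so it can be deployed with your preferred tooling
# (portal, PowerShell, or the management SDK).
with open("copy_customer_pipeline.json", "w") as f:
    json.dump(pipeline, f, indent=2)
```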
- SQL Server Integration Services (SSIS)
Leveraging SSIS, the core ETL tool used for development across various teams,
one can move data into and out of Azure, to on-premises or other cloud
environments, based on one's needs. SSIS can integrate databases and data
warehouses in the Azure cloud and also enables individuals to drive
template-based development efforts with ease.
Typically compared with AWS data pipeline:
AWS Data Pipeline is
a web service that helps you reliably process and move data between different
AWS compute and storage services, as well as on-premises data sources, at
specified intervals. With AWS Data Pipeline, you can regularly access your data
where it’s stored, transform and process it at scale, and efficiently transfer
the results to AWS services such as Amazon S3, Amazon RDS, Amazon DynamoDB, and
Amazon EMR.
A pipeline schedules
and runs tasks. You upload your pipeline definition to the pipeline, and then
activate the pipeline. You can edit the pipeline definition for a running
pipeline and activate the pipeline again for it to take effect. You can
deactivate the pipeline, modify a data source, and then activate the pipeline
again. When you are finished with your pipeline, you can delete it. Task Runner polls
for tasks and then performs those tasks. For example, Task Runner could copy
log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed
and runs automatically on resources created by your pipeline definitions. You can
write a custom task runner application, or you can use the Task Runner
application that is provided by AWS Data Pipeline.
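As a rough illustration of how a pipeline is defined and activated programmatically, the boto3 sketch below creates a pipeline and uploads a minimal definition; the objects shown (a daily schedule and a shell activity) are simplified assumptions rather than a production-ready definition, which would also need a role and a compute resource or worker group to run on.

```python
import boto3

client = boto3.client("datapipeline", region_name="us-east-1")

# Register an empty pipeline; uniqueId guards against accidental duplicates.
pipeline = client.create_pipeline(
    name="nightly-log-copy",
    uniqueId="nightly-log-copy-v1",
    description="Illustrative pipeline: run a shell command once per day",
)
pipeline_id = pipeline["pipelineId"]

# A minimal, simplified definition: a daily schedule plus one ShellCommandActivity.
# A real definition would also set roles, log locations, and a runsOn resource.
pipeline_objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "cron"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ],
    },
    {
        "id": "DailySchedule",
        "name": "DailySchedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 day"},
            {"key": "startDateTime", "stringValue": "2017-01-01T00:00:00"},
        ],
    },
    {
        "id": "CopyLogs",
        "name": "CopyLogs",
        "fields": [
            {"key": "type", "stringValue": "ShellCommandActivity"},
            {"key": "command", "stringValue": "echo copying logs"},
            {"key": "schedule", "refValue": "DailySchedule"},
        ],
    },
]

client.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
client.activate_pipeline(pipelineId=pipeline_id)
```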
- Azure SQL Data Warehouse
Azure SQL Data Warehouse stores data as files and blobs, which makes it easy to
bring dimensional data and measures together within the platform. Core
documentation can be found here: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-what-is
Typically Compared with AWS Redshift:
Redshift supports two kinds of sort keys: compound and interleaved. A compound
sort key is a combination of multiple columns, one primary column and one or
more secondary columns. A compound sort key helps with joins and WHERE
conditions; however, performance drops when a query touches only the secondary
columns without the primary column. A compound sort key is the default sort
type. In an interleaved sort, each column is given equal weight. Both compound
and interleaved sort keys require a re-index to keep query performance high.
The architectures of these systems differ, but the end goal is the same:
storing and processing vast amounts of data with results returned in seconds
or milliseconds. Focusing on this aspect, I am going to give a more detailed
insight into Redshift, which is a node-based, petabyte-scale database, as well
as a high-level overview of what I recently implemented.
Note: The architecture diagram referenced here is from the Redshift system architecture documentation (http://docs.aws.amazon.com/redshift/latest/dg/c_high_level_system_architecture.html)
If you pay close attention to that diagram, the compute nodes are responsible
for all the data processing and query transactions on the database, and the
data nodes contain the actual node slices. The Leader Node is the service that
monitors the data connections against the Redshift cluster and is also
responsible for query processing (almost like a syntax checker on the query
and the functions leveraged). It then transfers the query across to the
Compute Nodes, whose main responsibility is to find the data slices on the
data nodes and communicate with one another to determine how the transaction
needs to be executed. It is similar to a job broker, except that this is more
real time than non-real time.
It is similar to the analogy of using a bucket. Consider this: you take a
bucket and keep filling it with water, and eventually the bucket gets filled.
But what happens when there is an enormous amount of water that needs to be
contained? Either grab a massive bucket or use multiple buckets to store the
water (the second option is what the Redshift architecture actually depicts).
The concept of scaling here implies not only adding a bucket for storage but
also a mechanism to ensure that the pipeline flow goes to an empty bucket,
which is nothing but our compute node. But there is the price of the bucket,
and the cost of the mechanism required to populate the bucket, as well.
The cons of using Redshift are as follows:
- You need an Administrator to monitor the Redshift cluster at all times
- The power of Redshift lies in the manner the database is queried, so a SQL developer/DBA with understanding of the Redshift internals is definitely a must
- Upfront cost is slightly on the higher side, but over a period of time the cost will be justified as more nodes are added to the cluster
Redshift's major advantage is the fact that it supports both JDBC and ODBC
drivers, so you can query the database with standard tooling. Because Redshift
speaks the Postgres protocol, a standard Postgres driver can be used to
connect, for example:
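Here is a minimal sketch of connecting with psycopg2, a Postgres driver for Python, creating a table with a compound sort key as discussed above, and running a query; the cluster endpoint, credentials, table, and columns are placeholder assumptions.

```python
import psycopg2

# Placeholder connection details for a Redshift cluster endpoint.
conn = psycopg2.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,                      # Redshift's default port
    dbname="analytics",
    user="admin",
    password="<password>",
)
conn.autocommit = True
cur = conn.cursor()

# Compound sort key: event_date is the primary sort column, customer_id secondary.
cur.execute("""
    CREATE TABLE IF NOT EXISTS sales_events (
        event_date   DATE,
        customer_id  BIGINT,
        amount       DECIMAL(12, 2)
    )
    COMPOUND SORTKEY (event_date, customer_id);
""")

# Queries that filter on the leading sort column benefit most from this layout.
cur.execute(
    "SELECT customer_id, SUM(amount) FROM sales_events "
    "WHERE event_date >= %s GROUP BY customer_id;",
    ("2017-01-01",),
)
for customer_id, total in cur.fetchall():
    print(customer_id, total)

cur.close()
conn.close()
```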
- Azure Redis Cache
Azure Redis Cache is
based on the popular open-source Redis cache. It gives you access to a secure,
dedicated Redis cache, managed by Microsoft and accessible from any application
within Azure.
Azure Redis Cache is
available in the following tiers:
- Basic: single node, multiple sizes, ideal for development/test and non-critical workloads. The Basic tier has no SLA.
- Standard: a replicated cache in a two-node primary/secondary configuration managed by Microsoft, with a high-availability SLA.
- Premium: includes a high-availability SLA and all the Standard-tier features and more, such as better performance than Basic or Standard-tier caches, bigger workloads, disaster recovery, and enhanced security. Additional features include:
  - Redis persistence allows you to persist data stored in the Redis cache. You can also take snapshots and back up the data, which you can load in case of a failure.
  - Redis cluster automatically shards data across multiple Redis nodes, so you can create workloads of bigger memory sizes (greater than 53 GB) and get better performance.
  - Azure Virtual Network (VNET) deployment provides enhanced security and isolation for your Azure Redis Cache, as well as subnets, access control policies, and other features to further restrict access.
Basic and Standard caches are available in sizes up to 53 GB, and Premium caches are available in sizes up to 530 GB, with more on request.
Azure Redis Cache
helps your application become more responsive even as user load increases. It
leverages the low-latency, high-throughput capabilities of the Redis engine.
This separate, distributed cache layer allows your data tier to scale
independently for more efficient use of compute resources in your application
layer.
Unlike traditional
caches which deal only with key-value pairs, Redis is popular for its highly
performant data types. Redis also supports running atomic operations on these
types, like appending to a string; incrementing the value in a hash; pushing to
a list; computing set intersection, union, and difference; or getting the member
with the highest ranking in a sorted set. Other features include support for
transactions, pub/sub, Lua scripting, keys with a limited time-to-live, and
configuration settings to make Redis behave more like a traditional cache.
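As a quick illustration of these data types and atomic operations, here is a small sketch using the redis-py client against an Azure Redis Cache endpoint (Azure exposes the cache over SSL on port 6380); the host name, access key, and keys used are placeholder assumptions.

```python
import redis

# Placeholder Azure Redis Cache endpoint; the cache is served over SSL on port 6380.
cache = redis.StrictRedis(
    host="mycache.redis.cache.windows.net",
    port=6380,
    password="<access-key>",
    ssl=True,
)

# String with a time-to-live, behaving like a traditional cache entry.
cache.set("session:42", "alice", ex=300)

# Atomic increment on a hash field.
cache.hincrby("page:stats", "views", 1)

# Lists, sets, and sorted sets are first-class data types.
cache.rpush("recent:logins", "alice", "bob")
cache.sadd("premium:users", "alice")
cache.zadd("leaderboard", {"alice": 1500, "bob": 1200})

# Get the member with the highest ranking in the sorted set.
top = cache.zrevrange("leaderboard", 0, 0, withscores=True)
print(top)
```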
Another key aspect of Redis's success is the healthy, vibrant open-source
ecosystem built around it.
This is reflected in the diverse set of Redis clients available across multiple
languages. This allows it to be used by nearly any workload you would build
inside of Azure.
The AWS counterpart, ElastiCache Redis, has the following characteristics:
- Simple query language; no complex query features
- Is not reachable from other regions out of the box
- You are always limited to the amount of memory available
- Sharding over multiple instances is only possible within your application; Redis does not do anything here
- You pay per instance, regardless of load or the number of requests
- If you want redundancy of the data, you need to set up replication (not possible between different regions)
- You need to do the work yourself to achieve high availability
- Redis/memcached are in-memory stores and should generally be faster than DynamoDB
- Exposed as an API
Typically compared with AWS DynamoDB
In DynamoDB, data is
partitioned automatically by its hash key. That’s why you will need to choose a
hash key if you’re implementing a GSI. The partitioning logic depends upon two
things: table size and throughput.
DynamoDB supports the following data types:
- Scalar: Number, String, Binary, Boolean, and Null
- Multi-valued: String Set, Number Set, and Binary Set
- Document: List and Map
Other DynamoDB characteristics:
- Has a query language that is able to do more complex things (greater than, between, etc.), as shown in the sketch after this list
- Is reachable via an external internet-facing API (different regions are reachable without any changes or your own infrastructure)
- Permissions based on tables or even rows are possible
- Scales in terms of data size to infinity
- You pay per request: low request numbers mean a smaller bill, high request numbers mean a higher bill
- Reads and writes differ in cost
- Data is saved redundantly by AWS in multiple facilities
- DynamoDB is highly available out of the box
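To illustrate the query-language point above, here is a small boto3 sketch that queries a hypothetical DynamoDB table with a between condition on the range key; the table name, key schema, and attributes are assumptions made for the example.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")

# Hypothetical table with hash key "customer_id" and range key "order_date".
table = dynamodb.Table("Orders")

# Query with a range condition: all orders for one customer within a date window.
response = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("CUST-1001")
        & Key("order_date").between("2017-01-01", "2017-03-31")
    )
)

for item in response["Items"]:
    print(item["order_date"], item.get("total"))
```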
- Azure DocumentDB
A planet-scale, highly available NoSQL database service on Microsoft Azure.
Azure DocumentDB is Microsoft's multi-tenant distributed database service for
managing JSON documents at Internet scale, and it is now generally available
to Azure developers. Its indexing subsystem enables automatic indexing of
documents without requiring a schema or secondary indices. Uniquely,
DocumentDB provides real-time consistent queries in the face of very high
rates of document updates. As a multi-tenant service, DocumentDB is designed
to operate within extremely frugal resource budgets while providing
predictable performance and robust resource isolation to its tenants. Its
capabilities span document representation, query language, document indexing
approach, and core index support.
Azure
DocumentDB guarantees less than 10 ms latencies on reads and less than 15 ms
latencies on writes for at least 99% of requests. DocumentDB leverages a
write-optimized, latch-free database engine designed for high-performance
solid-state drives to run in the cloud at global scale. Read and write requests
are always served from your local region while data can be distributed to other
regions across the globe. Scale
data throughput
and storage independently and elastically—not just within one region, but
across many geographically distributed regions. Add capacity to serve millions
of requests per second at a fraction of the cost of other popular NoSQL
databases.
Easily
build apps at planet scale without the hassle of complex, multiple data center
configurations. Designed as a globally
distributed database system, DocumentDB automatically replicates all of your data to any
number of regions worldwide. Your apps can serve data from a region closest to
your users for fast, uninterrupted access.
Query
using familiar SQL and JavaScript syntax over document and key-value data
without dealing with schema or secondary indices, ever. Azure DocumentDB is a
truly schema-agnostic
database capable
of automatically indexing JSON documents. Define your business logic as stored
procedures, triggers, and user-defined functions entirely in JavaScript, and
have them executed directly inside the database engine. Standardized SLA's for
infrastructure throughpout
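For example, a document query using the familiar SQL syntax might look like the sketch below, written against the Python Cosmos DB SDK (the successor client to the original DocumentDB SDK); the account endpoint, key, database, container, and query are placeholder assumptions.

```python
from azure.cosmos import CosmosClient

# Placeholder account endpoint and key for a DocumentDB / Cosmos DB account.
client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",
    credential="<primary-key>",
)

container = client.get_database_client("storedb").get_container_client("orders")

# SQL-style query over JSON documents; no schema or secondary index definitions needed.
results = container.query_items(
    query="SELECT c.id, c.total FROM c WHERE c.total > @min_total",
    parameters=[{"name": "@min_total", "value": 100}],
    enable_cross_partition_query=True,
)

for doc in results:
    print(doc["id"], doc["total"])
```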
Typically compared with AWS DynamoDB
- Azure HDInsight
Easily spin up enterprise-grade, open source cluster types,
guaranteed with the industry’s best 99.9% SLA and 24/7 support. Our SLA covers
your entire Azure big data solution, not just the virtual machine instances.
HDInsight is architected for full redundancy and high availability, including
head node replication, data geo-replication, and built-in standby NameNode,
making HDInsight resilient to critical failures not addressed in standard
Hadoop implementations. Azure also offers cluster monitoring and 24/7
enterprise support backed by Microsoft and Hortonworks with 37 combined
committers for Hadoop core—more than all other managed cloud providers
combined—ready to support your deployment with the ability to fix and commit
code back to Hadoop. Use rich
productivity suites for Hadoop and Spark with your preferred development
environment such as Visual
Studio, Eclipse, and IntelliJ for Scala, Python, R, Java,
and .Net support. Data scientists gain the ability to combine code, statistical
equations, and visualizations to tell a story about their data through
integration with the two most popular notebooks, Jupyter and Zeppelin. HDInsight is also the only
managed cloud Hadoop solution with integration to Microsoft R Server.
Multi-threaded math libraries and transparent parallelization in R Server
handle up to 1000x more data and up to 50x faster speeds than open source R,
helping you train more accurate models for better predictions than previously
possible.
A
thriving market of independent software vendors (ISVs) provides value-added
solutions across the broader ecosystem of Hadoop. Because every cluster is
extended with edge nodes and script action, HDInsight lets you spin up Hadoop
and Spark clusters that are pre-integrated and pre-tuned with any ISV
application out-of-the-box, including Datameer, Cask, AtScale, and StreamSets.
Integration
with the Azure Load Balancer:
Azure
Load Balancer delivers high availability and network performance to your
applications. It is a Layer 4 (TCP, UDP) load balancer that distributes
incoming traffic among healthy instances of services defined in a load-balanced
set.
Azure
Load Balancer can be configured to:
- Load balance incoming Internet traffic to virtual machines. This configuration is known as Internet-facing load balancing.
- Load balance traffic between virtual machines in a virtual network, between virtual machines in cloud services, or between on-premises computers and virtual machines in a cross-premises virtual network. This configuration is known as internal load balancing.
- Forward external traffic to a specific virtual machine.
All
resources in the cloud need a public IP address to be reachable from the
Internet. The cloud infrastructure in Azure uses non-routable IP addresses for
its resources. Azure uses network address translation (NAT) with public IP
addresses to communicate to the Internet.
- Note: this is not similar to the Elastic Load Balancer used with EMR
- Shows live changes in analytics
- Azure in itself is very user-friendly, and HDInsight is a great addition
- Works seamlessly in tandem with other Microsoft technologies, such as Power BI and Excel, as end clients
- When loading large volumes of data, issues in the data might stop the load midway due to either data corruption errors or some latency issue; the entire load then needs to be restarted from the beginning
- Azure HDInsight is a service that provisions Apache Hadoop in the Azure cloud, providing a software framework designed to manage, analyze and report on big data.
- HDInsight clusters are configured to store data directly in Azure Blob storage, which provides low latency and increased elasticity in performance and cost choices.
- Unlike the first edition of HDInsight, it is now delivered on Linux (as Hadoop should be), which means access to HDP features. The cluster can be accessed via Ambari in the web browser, or directly via SSH.
- HDInsight has always been an elastic platform for data processing. In today’s platform, it’s even more scalable. Not only can nodes be added and removed from a running cluster, but individual node size can be controlled which means the cluster can be highly optimized to run the specific jobs that are scheduled.
- In its initial form, there were many options for developing HDInsight processing jobs. Today, however, there are really great options available that enable developers to build data processing applications in whatever environment they prefer. For Windows developers, HDInsight has a rich plugin for Visual Studio that supports the creation of Hive, Pig, and Storm applications. For Linux or Windows developers, HDInsight has plugins for both IntelliJ IDEA and Eclipse, two very popular open-source Java IDE platforms. HDInsight also supports PowerShell, Bash, and Windows command inputs to allow for scripting of job workflows.
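As a simple example of the kind of processing job you might develop and submit to an HDInsight Spark cluster, here is a minimal PySpark word-count sketch; the wasb:// input path is a placeholder, and the same code would run unchanged on any Spark cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split

# Start (or attach to) a Spark session on the cluster.
spark = SparkSession.builder.appName("wasb-wordcount").getOrCreate()

# Placeholder path; on HDInsight the default storage is typically an Azure Blob
# container addressed with the wasb:// scheme.
lines = spark.read.text("wasb:///example/data/logs/*.log")

# Classic word count expressed with the DataFrame API.
counts = (
    lines.select(explode(split(col("value"), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(col("count").desc())
)

counts.show(20)
spark.stop()
```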
Typically Compared with AWS EMR
Amazon
EMR provides a managed Hadoop framework that makes it easy, fast, and
cost-effective to process vast amounts of data across dynamically scalable
Amazon EC2 instances. You can also run other popular distributed frameworks
such as Apache Spark, HBase, Presto, and Flink in
Amazon EMR, and interact with data in other AWS data stores such as Amazon S3
and Amazon DynamoDB.
Amazon
EMR securely and reliably handles a broad set of big data use cases, including
log analysis, web indexing, data transformations (ETL), machine learning,
financial analysis, scientific simulation, and bioinformatics.
AWS EMR is more mature when compared to HDInsight; however, HDInsight
development continues to progress at a rapid pace compared to the phased
approach for AWS EMR.
AWS EMR pricing is also higher than Azure HDInsight in terms of storage usage.
EMR release velocity is better than that of Azure HDInsight.
- Amazon EMR provides a managed Hadoop framework that simplifies big data processing.
- Other popular distributed frameworks such as Apache Spark and Presto can also be run in Amazon EMR.
- Pricing of Amazon EMR is simple and predictable: payment is at an hourly rate. A 10-node Hadoop cluster can be launched for as little as $0.15 per hour. Because Amazon EMR has native support for Amazon EC2 Spot and Reserved Instances, 50-80% can also be saved on the cost of the underlying instances.
- It is also popular due to its ease of use. When a cluster is launched on Amazon EMR, the web service allocates the virtual server instances and configures them with the needed software for you. Within minutes you can have a cluster configured and ready to run your Hadoop application.
- It is resizable: the number of virtual server instances can be easily expanded or contracted depending on processing needs.
- Amazon EMR integrates with popular business intelligence (BI) tools such as Tableau, MicroStrategy, and Datameer. For more information, see Use Business Intelligence Tools with Amazon EMR.
- You can run Amazon EMR in an Amazon VPC in which you configure networking and security rules. Amazon EMR also supports IAM users and roles, which you can use to control access to your cluster and set permissions that restrict what others can do on the cluster. For more information, see Configure Access to the Cluster.
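To make the launch workflow concrete, here is a boto3 sketch that spins up a small EMR cluster with Spark installed; the release label, instance types, log bucket, and roles are placeholder assumptions (the default EMR roles are assumed to already exist in the account).

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a small, transient cluster; values here are illustrative placeholders.
response = emr.run_job_flow(
    Name="demo-spark-cluster",
    ReleaseLabel="emr-5.8.0",
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    LogUri="s3://my-demo-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,
        "TerminationProtected": False,
    },
    # Default EMR service/instance roles, assumed to exist in the account.
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster starting:", response["JobFlowId"])
```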