Amazon EMR is a service for fast, cost-effective processing of large amounts of data. It uses Apache Hadoop, an open-source data processing framework, and runs on Amazon EC2 and Amazon S3.
It makes it possible to process large amounts of data efficiently in workloads such as:
- data mining
- machine learning
- financial analysis
Amazon EMR saves us the time-consuming configuration, provisioning and management of Hadoop clusters and of the computing power we need. As a result, we can freely build workflows and monitor the progress of big data analysis.
The main unit when using the service is a cluster, which consists of nodes that can perform different functions, i.e. they can be of different types. Amazon EMR installs different software components on each type of node, thereby assigning it a specific role in the framework (Apache Hadoop). There are 3 types of nodes:
- Master node – manages the cluster: it coordinates the distribution of data and tasks among the other nodes, monitors the progress of the analysis and checks the health of the entire cluster.
- Core node – contains software components that run tasks and store data in the cluster's Hadoop Distributed File System (HDFS).
- Task node – contains software components that only run tasks and do not store data. This type of node is optional.
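The three node types can be sketched as a cluster layout in Python. This is a minimal illustration, not a definitive setup: the helper name `build_instance_groups`, the group names and the instance type are assumptions made for the example. The dictionary keys follow the instance group shape that the boto3 EMR client's `run_job_flow` call accepts under `Instances["InstanceGroups"]`, but the sketch only builds the data and does not contact AWS.

```python
from typing import Dict, List


def build_instance_groups(
    instance_type: str, core_count: int, task_count: int = 0
) -> List[Dict]:
    """Describe a cluster with one master, some core and optional task nodes.

    Hypothetical helper for illustration only; it builds plain dictionaries
    and makes no AWS calls.
    """
    if core_count < 1:
        raise ValueError("a multi-node cluster needs at least one core node")

    groups = [
        # Exactly one master node manages the cluster.
        {
            "Name": "Master",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        # Core nodes run tasks and store data.
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]

    # Task nodes are optional: they run tasks but do not store data.
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


if __name__ == "__main__":
    # "m5.xlarge" is only an example instance type.
    for group in build_instance_groups("m5.xlarge", core_count=2, task_count=3):
        print(group["InstanceRole"], group["InstanceCount"])
```

Keeping the layout in a function like this also makes it easy to reuse the same configuration when building further clusters.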
Once the cluster is built, we can submit work to it. The next step is data analysis.
Now that we know the benefits of using Amazon EMR, let's turn to security. The need for strong data protection is undeniable, especially for sensitive data. The service uses safeguards such as:
- Amazon VPC,
- Security Groups,
- AWS CloudTrail,
- Amazon EC2 Key Pairs.
In addition, the service is integrated with Amazon CloudWatch, which monitors cluster metrics and activity. To manage changes in the cluster, we can also use tools such as the AWS CLI, SDKs, the API or the AWS console itself. An additional advantage is the ability to reuse an existing configuration when building new clusters.
How much does the solution cost?
Cost estimation is simple.
The service applies:
- per-second billing with a minimum of 60 seconds. As a result, a 10-node cluster running for 10 hours costs the same as a 100-node cluster running for 1 hour.
- an hourly rate that depends on factors such as the instance type or CPU. Usage is measured to the nearest second and the time is shown in decimal form.
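The billing rules above can be checked with a short calculation. The sketch below assumes a single flat hourly rate per node. The rate of 0.25 is a made-up figure, not an actual AWS price, and the function names are hypothetical.

```python
MINIMUM_BILLED_SECONDS = 60  # per-second billing with a one-minute minimum


def billed_hours(runtime_seconds: int) -> float:
    """Return the billed time in decimal hours, applying the 60-second minimum."""
    seconds = max(runtime_seconds, MINIMUM_BILLED_SECONDS)
    return seconds / 3600


def cluster_cost(nodes: int, runtime_seconds: int, hourly_rate_per_node: float) -> float:
    """Estimate cost as nodes x billed hours x hourly rate (illustrative model)."""
    return nodes * billed_hours(runtime_seconds) * hourly_rate_per_node


if __name__ == "__main__":
    rate = 0.25  # hypothetical hourly price per node, for illustration only

    small_and_long = cluster_cost(nodes=10, runtime_seconds=10 * 3600, hourly_rate_per_node=rate)
    large_and_short = cluster_cost(nodes=100, runtime_seconds=1 * 3600, hourly_rate_per_node=rate)

    # Both clusters consume 100 node-hours, so the estimates match.
    print(small_and_long, large_and_short)

    # 90 minutes of usage expressed in decimal form.
    print(billed_hours(5400))
```

In this model the cost depends only on the total number of node-hours, which is why a larger cluster that finishes sooner does not have to cost more.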
You can check the availability of the service in individual AWS Regions here.
Besides the obvious advantage of Amazon EMR, which is lower and better-optimized cost of data analysis, there are several other reasons to adopt it.
- Integration with other AWS services allows them to be combined quickly and easily, which in turn translates into faster deployment.
- It is highly available and scalable, which is critical for big data workloads.
- It is secure, thanks to the previously mentioned integration with AWS services, including those responsible for security, which ensures a high level of protection for your data.
We also recommend the video from re:Invent.