How to Create Amazon EMR and Install Dependencies Through Bootstrapping

A guide to creating an Amazon EMR cluster through the web console and through Infrastructure as Code (Terraform), and installing external dependencies through bootstrapping.

What is Amazon EMR

Amazon EMR stands for Amazon Elastic MapReduce. It is a managed AWS service for big data processing and analysis: it lets you create cluster platforms with minimal configuration to run big data frameworks like Apache Spark, Apache Hadoop, and Apache Hive for processing and analyzing huge volumes of data. Amazon EMR is also very useful and efficient for transforming and moving huge volumes of data into and out of AWS data stores such as Amazon S3 (Simple Storage Service).

What is Bootstrapping

By definition from Wikipedia, bootstrapping is a self-starting process that is supposed to continue or grow without external input. In computing, a bootstrap program is the first code that runs when a computer system starts, and it loads the operating system.


Why do We Need Bootstrapping in Amazon EMR

When we create an Amazon EMR cluster, external dependency modules are not installed by default. If an application runs scripts that use external libraries like PySpark, pandas, or matplotlib and those packages are not installed, we get an error like:
ImportError: No module named pyspark

To resolve these errors and run the application on Amazon EMR without any issues, we can use bootstrapping to install external dependencies on the cluster instances. Bootstrap scripts run on each cluster instance as soon as it is launched, and Amazon EMR installs the applications only after the bootstrap actions complete. When the application starts processing data, the machines can then find the dependent modules and process the data as the application script expects. Bootstrap actions also run on any new nodes added to a running cluster. We can write custom scripts that define the required bootstrap actions and reference them while creating the EMR cluster.

Create Custom Bootstrap Scripts

Create a bootstrap script that installs all the external dependencies while the Amazon EMR cluster is being created.

Let us take an example application that uses the PySpark, pandas, boto3, and requests packages. Here we will use conda to install the packages on the EMR cluster.

Create bootstrap_script.sh with the following code:

#!/usr/bin/env bash

# install Miniconda into the hadoop user's home directory
wget --quiet https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh \
    && /bin/bash ~/miniconda.sh -b -p $HOME/conda

# put conda on the PATH for this script and for future login shells
export PATH=$HOME/conda/bin:$PATH
echo -e '\nexport PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc

# install notebook tooling and pandas with conda
conda install -y notebook=5.7.* jupyter=1.0.* pandas

# install findspark and the remaining packages with pip
pip install --upgrade pip
pip install findspark
pip install boto3
pip install requests

Now, we will create another script to bind conda to spark. Create pyspark_config.sh with the following code:

#!/usr/bin/env bash

# bind conda to spark
echo -e "\nexport PYSPARK_PYTHON=/home/hadoop/conda/bin/python" >> /etc/spark/conf/spark-env.sh
echo "export PYSPARK_DRIVER_PYTHON=/home/hadoop/conda/bin/jupyter" >> /etc/spark/conf/spark-env.sh
echo "export PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port 8888 --ip=\"0.0.0.0\"'" >> /etc/spark/conf/spark-env.sh

Create an S3 bucket or use an existing bucket to store the two shell scripts we created. Create a bootstrap folder in the S3 bucket and upload the two shell scripts.
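
For example, with the AWS CLI (the bucket name my-emr-artifacts is a placeholder; substitute your own bucket):

# upload both scripts to a bootstrap/ prefix in the bucket
aws s3 cp bootstrap_script.sh s3://my-emr-artifacts/bootstrap/bootstrap_script.sh
aws s3 cp pyspark_config.sh s3://my-emr-artifacts/bootstrap/pyspark_config.sh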

How to Create an EMR Cluster through Web Console

To create an Amazon EMR cluster with bootstrapping, follow the steps below:

Step 1: From the AWS console, search for EMR and go to the EMR console.

Step 2: Choose Create cluster to create a new EMR cluster.

Step 3: Click on Go to advanced options.

Step 4: Select the EMR release label and applications as per your requirements.

Step 5: Select step type “Custom JAR” under Steps and add a step. Provide a name for the step. In JAR location, enter command-runner.jar. command-runner.jar is located on the AMI and is on the shell login path environment, so the full path to command-runner.jar is not required. Add arguments to copy the pyspark_config.sh script from S3 to the home directory, as shown below.
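
For example, the arguments for this step could be the following (using the placeholder bucket name from earlier):

# copy the configuration script from S3 to the hadoop home directory
aws s3 cp s3://my-emr-artifacts/bootstrap/pyspark_config.sh /home/hadoop/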

Step 6: Once pyspark_config.sh is copied, PySpark is configured by running the shell script. Add another step, again using command-runner.jar, to run the pyspark_config.sh script, as shown below.
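
The arguments for this second step could be:

# run the copied script as root so it can modify /etc/spark/conf/spark-env.sh
sudo bash /home/hadoop/pyspark_config.sh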

Step 7: Select the VPC, subnet, and the master and core node configurations.

Step 8: Enable logging and provide the S3 bucket path to store the logs.

Step 9: Add a bootstrap action. Select Custom action and configure it. Provide a name for the bootstrap action and the S3 URI of bootstrap_script.sh.

Step 10: Add the master and core security groups and create the cluster.

The Amazon EMR cluster is ready within minutes with all the external dependencies installed on every node instance.

How to Create an EMR Cluster through Infrastructure as Code

As a best practice, it is recommended to use Infrastructure as Code. In this example, we will use Terraform to create the Amazon EMR cluster and install external dependencies through bootstrapping. Follow the steps below to create EMR and install external dependencies through Terraform.

Step 1: Create a directory bootstrap and add the two shell scripts – bootstrap_script.sh and pyspark_config.sh.

This will be the folder structure.
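
Based on the files created in the following steps, the layout looks roughly like this:

.
├── bootstrap
│   ├── bootstrap_script.sh
│   └── pyspark_config.sh
├── connection.tf
├── main.tf
├── terraform.tfvars
├── variables.tf
└── version.tf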


Step 2: Create a version.tf file to define the Terraform and AWS provider versions to be used.

terraform {
  required_version = ">= 0.12"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.15"
    }
  }
}

Step 3: Create a connection.tf file to define the provider and the AWS region to be used.

provider "aws" {
profile = "default"
region = var.aws_region
}

Step 4: Create variables.tf to define all the variables and terraform.tfvars to define the values of the variables to be used for deployment.
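
As a partial sketch, variables.tf could declare a few of the variables referenced in main.tf below (the remaining variables follow the same pattern, and terraform.tfvars simply assigns a value to each one):

variable "aws_region" {
  description = "AWS region to deploy the cluster into"
  type        = string
}

variable "name" {
  description = "Name of the EMR cluster"
  type        = string
}

variable "release_label" {
  description = "EMR release label, for example emr-6.2.0"
  type        = string
}

variable "applications" {
  description = "Applications to install on the cluster, for example [\"Hadoop\", \"Spark\"]"
  type        = list(string)
}

variable "s3_bucket_name" {
  description = "S3 bucket that holds the bootstrap scripts"
  type        = string
}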

Step 5: Create main.tf to define resources to be deployed.

# s3 object for bootstrap script
resource "aws_s3_bucket_object" "bootstrap_action" {
  bucket     = var.s3_bucket_name
  key        = "bootstrap/bootstrap_script.sh"  
  source     = "bootstrap/bootstrap_script.sh"
}

# s3 object for pyspark configuration script
resource "aws_s3_bucket_object" "pyspark_config" { 
  bucket     = var.s3_bucket_name  
  key        = "bootstrap/pyspark_config.sh"  
  source     = "bootstrap/pyspark_config.sh"
}

# create EMR cluster with bootstrap
resource "aws_emr_cluster" "emr-spark-cluster" {
  name                              = var.name
  release_label                     = var.release_label
  applications                      = var.applications
  termination_protection            = false
  keep_job_flow_alive_when_no_steps = true

  ec2_attributes {
    subnet_id                         = var.subnet_id
    emr_managed_master_security_group = var.emr_master_sg
    emr_managed_slave_security_group  = var.emr_slave_sg
    instance_profile                  = var.emr_ec2_instance_profile_arn
  }

  ebs_root_volume_size = "12"

  master_instance_group {
    name           = "EMR master"
    instance_type  = var.master_instance_type
    instance_count = "1"

    ebs_config {
      size                 = var.master_ebs_size
      type                 = var.master_ebs_type
      volumes_per_instance = var.master_volumes_per_instance
    }
  }

  core_instance_group {
    name           = "EMR core"
    instance_type  = var.core_instance_type
    instance_count = var.core_instance_count

    ebs_config {
      size                 = var.core_ebs_size
      type                 = var.core_instance_ebs_type
      volumes_per_instance = var.core_instance_volumes_per_instance
    }
  }

  service_role     = var.emr_service_role_arn
  autoscaling_role = var.emr_autoscaling_role_arn

  log_uri = var.s3_bucket_logs
  tags = {name = var.name}
  bootstrap_action {
    name = "Bootstrap"
    path = "s3://${var.s3_bucket_name}/bootstrap/bootstrap_script.sh"
  }

  step {
    name              = "Copy pyspark script"
    action_on_failure = "CONTINUE"

    hadoop_jar_step {
      jar  = "command-runner.jar"
      args = ["aws", "s3", "cp", "s3://${var.s3_bucket_name}/bootstrap/pyspark_config.sh", "/home/hadoop/"]
    }
  }

  step {
    name              = "Configure pyspark"
    action_on_failure = "CONTINUE"

    hadoop_jar_step {
      jar  = "command-runner.jar"
      args = ["sudo", "bash", "/home/hadoop/pyspark_config.sh"]
    }
  }

  configurations_json = <<EOF
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
EOF
}

Step 6: Run terraform init to initialize the working directory and download the providers.

Step 7: Run terraform plan to review the resources to be added, changed, or destroyed.

Step 8: Run terraform apply to create the Amazon EMR cluster.
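
In short, from the project directory:

# initialize the working directory and download the AWS provider
terraform init

# review the execution plan
terraform plan

# create the S3 objects and the EMR cluster
terraform apply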

You can use either of the above approaches as per your preference. Feel free to reach out with any questions or feedback.
