How to Set Up an AWS Instance, Docker, and Jupyter (The Solution to the "It Only Works on My Computer" Problem)

How To Get a BASH shell in Windows, Set Up an AWS EC2 Instance to Run in that Shell, Install Docker in that Instance, and Run Jupyter Notebook in Docker

What is Jupyter and Why Do You Want to Use Jupyter?

Jupyter is a notebook environment that is easy to share with others, particularly people working on something with you. It is a standard tool in industry and academia, and it supports more than forty programming languages.

Why run Jupyter on Docker?

We want to run Jupyter on Docker because (1) Docker packages Jupyter and its dependencies into an image, so getting a newer version is just a matter of pulling a newer image, and (2) in case we work with data sets that our laptops can’t handle, we can run the same image on an Amazon Web Services EC2 machine. Installing Docker on Windows is cumbersome, so Windows users install Git Bash (Bash stands for Bourne Again Shell) and work from there. Ultimately, we will run Jupyter in a Docker container, which runs on our AWS instance, which we reach from Git Bash: Git Bash > AWS > Docker.

Steps

Install Git Bash (if on Windows)

If on Windows, install Git Bash. Go to git-scm.com and download the installer. The rest is self-explanatory.

Go to aws.amazon.com and create an account if you don’t have one already

Configure a Key Pair

In your bash shell, run

ssh-keygen -t rsa

When it asks for a file in which to save the key and then for a passphrase, provide nothing, i.e. just press Enter, unless you need your data to be especially secure. Security wasn’t a concern in class. Verify your newly created SSH key by running

cat ~/.ssh/id_rsa.pub

This should output something that looks like the following. (I don’t use this key, so it’s not possible to hack any of my stuff with the output below.)

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDIN8mZglB4XWCtv/VyErkvCa/RrOgRd6pVomXyisWrsNLeSbZdpGxMu6EseY/u4CIPRb1caz3PgU0p5vg7qJ+65Jp0OxtSYy7xu/CPyMcDkUEsvHRILKg0aPNzTj2vNP3vD7ceXZAAvaPhAJ3Cl66lgTTNyw6aELF9J1eJvqwSBlxY9Csva+QmFui5SY7jn+ft7w5i8Dkfm/6Wrl92BIkRZUJX9Vks/HmQvmGKIA3NY1jdcFLUgrBVe7wzBbTif+8S3+Xte//UDA7SmI3+JHzX1JnPxWxaHABQWBhFikfryJDx8IHK6iMaZYxbiDh2nw2pY+JkoWZNTGlPN+F2fYft
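
The key generation step can also be done without the interactive prompts. The following is a minimal sketch, not the method used in class: the scratch directory and the file name id_rsa_demo are hypothetical and were chosen so that an existing ~/.ssh/id_rsa is never overwritten.

```shell
# Hypothetical scratch directory, so an existing ~/.ssh/id_rsa is never overwritten.
key_dir="$(mktemp -d)"
key_file="$key_dir/id_rsa_demo"

# -N "" supplies the empty passphrase, -f the output file, -q suppresses the banner.
ssh-keygen -q -t rsa -N "" -f "$key_file"

# The .pub file is the half that gets pasted into AWS; the other file stays private.
key_type="$(cut -d ' ' -f 1 "$key_file.pub")"
echo "public key type: $key_type"

# Show the key length and fingerprint for a quick sanity check.
ssh-keygen -l -f "$key_file.pub"
```

Passing -N "" and -f is equivalent to pressing Enter at the passphrase prompt and typing a file name at the file prompt.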

Go to AWS > EC2 > Key Pairs > Import Key Pair. Set the name to whatever you’d like; in class, we set it to jan_2018_unex_213. Copy the public key printed by the cat command above and paste it into the Public Key Contents input box. Press the Import button. When you connect to AWS from Git Bash on your computer, AWS will check that your computer holds the private key that matches this public key.
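
The same import can probably be done from the command line instead of the console. This is a sketch under assumptions: the AWS CLI is installed and configured with credentials, and the function name import_public_key is made up for illustration.

```shell
# Sketch: import an existing public key under a chosen name.
# Assumes the AWS CLI is installed and configured with credentials.
import_public_key() {
  local key_name="$1"
  local pub_key_path="${2:-$HOME/.ssh/id_rsa.pub}"
  # fileb:// hands the file to the CLI as raw bytes; how key material must be
  # passed may differ between CLI versions, so check the help for yours.
  aws ec2 import-key-pair \
    --key-name "$key_name" \
    --public-key-material "fileb://$pub_key_path"
}
```

For the name used in class, the call would be import_public_key jan_2018_unex_213.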

Create a New Security Group

AWS > EC2 > Security Groups > Create Security Group > Add Rule. Make sure that the Inbound tab is selected because you are creating inbound rules. Each rule has four fields: Type, Protocol, Port Range, and Source. In class, I set the security group name to ucla_data_sci and the description to ssh jupyter docker mongo.

Type	Protocol	Port Range	Source	Comments (not a field in AWS)
Custom TCP	TCP (do not touch)	8888	Anywhere	Jupyter
Custom TCP	TCP (do not touch)	2376	Anywhere	Docker (daemon TLS port)
Custom TCP	TCP (do not touch)	27016	Anywhere	mongo (not db)
SSH	TCP (do not touch)	22	Anywhere	SSH
HTTP	TCP (do not touch)	80	Anywhere	Not created in class originally, but UCLA would not let us connect to port 8888 for some reason
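
The rules in the table can also be scripted. The sketch below assumes the AWS CLI is configured and that the account has a default VPC, so that the group can be addressed by name; the function name create_class_security_group is hypothetical.

```shell
# Sketch: create the group and open one TCP port per table row.
# Assumes a default VPC, where security groups can be addressed by name.
create_class_security_group() {
  local group_name="${1:-ucla_data_sci}"
  local description="${2:-ssh jupyter docker mongo}"
  local port

  aws ec2 create-security-group \
    --group-name "$group_name" \
    --description "$description"

  # 22 SSH, 80 HTTP, 2376 Docker, 8888 Jupyter, 27016 mongo
  for port in 22 80 2376 8888 27016; do
    aws ec2 authorize-security-group-ingress \
      --group-name "$group_name" \
      --protocol tcp \
      --port "$port" \
      --cidr 0.0.0.0/0
  done
}
```

The CIDR 0.0.0.0/0 corresponds to the "Anywhere" source in the table.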

Create a new EC2 Instance, Configure Docker, and Pull the jupyter/scipy-notebook Image using AWS CLI and AWS Cloudformation

Using the AWS CLI and CloudFormation is much less manual, but it has a startup cost. There is a separate post on how to set up the AWS CLI.

Save Template

Save the template below to a file (the create-stack command later assumes it is named Sa-206.yaml). It is in YAML format.

# template can be in json or yaml format; using yaml
# because yaml allows comments
# https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-ec2-instance.html

# A stack is a collection of AWS resources that you can manage as a single unit.
# All the resources in a stack are defined by the stack's AWS CloudFormation template.
---
AWSTemplateFormatVersion: '2010-09-09'
Description: "UCLA-specifications"
# The Resources section is the only required section.
# It specifies the stack resources and their properties,
# such as an EC2 instance or S3 bucket. You can refer to
# these resources in the Resources and Outputs sections of the template.
Resources:
    MyEC2Instance:
        Type: AWS::EC2::Instance
        Properties:
            ImageId: ami-005bdb005fb00e791 #Ubuntu 18.04
            KeyName: april-2019
            InstanceType: t2.micro
            BlockDeviceMappings:
            # Lists start w/ -
              - DeviceName: /dev/sda1
                Ebs: #BlockDevice          
                    VolumeType: gp2
                    VolumeSize: 30
            SecurityGroups:
              - !Ref MySecurityGroup
            UserData:
                Fn::Base64: !Sub |
                    #!/bin/bash
                    # bash commands to be run the first time the instance boots
                    curl -sSL https://get.docker.com/ | sh
                    sudo docker pull jupyter/pyspark-notebook
                    # add the ubuntu user to the docker group so we don't need to
                    # prefix docker commands with sudo
                    # the full path /usr/sbin/usermod was required here for some reason
                    sudo /usr/sbin/usermod -aG docker ubuntu
                    sudo reboot
    MySecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
            GroupName: 2019-may
            GroupDescription: security group, simple low-level security, port 8888 for jupyter
            SecurityGroupIngress:
              - Description: jupyter
                IpProtocol: tcp
                FromPort: 8888
                ToPort: 8888
                CidrIp: 0.0.0.0/0
              - Description: ssh
                IpProtocol: tcp
                FromPort: 22
                ToPort: 22
                CidrIp: 0.0.0.0/0
              - Description: docker
                IpProtocol: tcp
                FromPort: 2376
                ToPort: 2376
                CidrIp: 0.0.0.0/0
              - Description: mongo (not db)
                IpProtocol: tcp
                FromPort: 27016
                ToPort: 27016
                CidrIp: 0.0.0.0/0
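
Before creating a stack, the saved template can be checked with the CLI. This is a small sketch that assumes the AWS CLI is configured and that the template was saved as ./Sa-206.yaml; the function name validate_template is made up. As far as I know, this check catches syntax problems but not every invalid property value.

```shell
# Sketch: ask CloudFormation to parse the template without creating anything.
validate_template() {
  local template_path="${1:-./Sa-206.yaml}"
  aws cloudformation validate-template --template-body "file://$template_path"
}
```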

Create Stack

Run this command in Bash:

$ aws cloudformation create-stack --template-body file://./Sa-206.yaml --stack-name Sa-206p
{
    "StackId": "arn:aws:cloudformation:us-west-2:858891845818:stack/Sa-206p/6bee8050-6930-11e9-ab2a-0aeb9ab2aebe"
}
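
The StackId that comes back is an ARN, and its pieces are useful for follow-up commands. The sketch below pulls the region and stack name out of the ARN shown above and defines a hypothetical helper, wait_for_stack, that blocks until creation finishes. It assumes the AWS CLI is configured.

```shell
# StackId as printed by create-stack above.
stack_id="arn:aws:cloudformation:us-west-2:858891845818:stack/Sa-206p/6bee8050-6930-11e9-ab2a-0aeb9ab2aebe"

# An ARN is colon-separated: arn:partition:service:region:account:resource
region="$(printf '%s' "$stack_id" | cut -d ':' -f 4)"
resource="$(printf '%s' "$stack_id" | cut -d ':' -f 6)"
# The resource part looks like stack/<name>/<uuid>; keep the middle piece.
stack_name="$(printf '%s' "$resource" | cut -d '/' -f 2)"
echo "region=$region stack=$stack_name"

# Block until creation finishes, then print the final status.
wait_for_stack() {
  aws cloudformation wait stack-create-complete \
    --region "$region" --stack-name "$stack_name"
  aws cloudformation describe-stacks \
    --region "$region" --stack-name "$stack_name" \
    --query 'Stacks[0].StackStatus' --output text
}
```

Calling wait_for_stack after create-stack avoids refreshing the console by hand.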

You can confirm that it worked by going to AWS Management Console > CloudFormation and AWS Management Console > EC2.

[Screenshot: the CloudFormation stack in the AWS console]

[Screenshot: the EC2 instance in the AWS console]

From AWS documentation:

A stack is a collection of AWS resources that you can manage as a single unit. In other words, you can create, update, or delete a collection of resources by creating, updating, or deleting stacks. All the resources in a stack are defined by the stack’s AWS CloudFormation template. A stack, for instance, can include all the resources required to run a web application, such as a web server, a database, and networking rules. If you no longer require that web application, you can simply delete the stack, and all of its related resources are deleted.
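
Deleting the stack, as the quote describes, can be sketched as follows. The function name delete_stack is hypothetical, and the AWS CLI is assumed to be configured.

```shell
# Sketch: delete a stack and everything it created.
delete_stack() {
  local stack_name="$1"
  aws cloudformation delete-stack --stack-name "$stack_name"
  # delete-stack returns immediately; wait until the resources are gone.
  aws cloudformation wait stack-delete-complete --stack-name "$stack_name"
}
```

For the stack created above, the call would be delete_stack Sa-206p, which removes both the instance and the security group defined in the template.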

How to Manually Create EC2 Instance

AWS > EC2 Dashboard > Instances > Launch Instances. AWS will prompt you to “Choose AMI”. AMI stands for Amazon Machine Image, which contains the software you need to run your sandbox machine. The teacher recommended the latest stable Ubuntu Server release that was free. After selecting an AMI, you’ll be prompted to “Choose Instance Type”. Select t2.micro, which was sufficient for the class. After selecting the instance type, you’ll be prompted to “Configure Instance”. You can leave the defaults and go on to the next tab. The fourth tab is “Add Storage”. We opted for 30GB, the maximum amount that was free. The fifth tab, “Add Tags”, we ignored. The sixth tab, “Configure Security Group”, was important: we selected the group we created in a previous step. Finally, AWS will take you to the “Review and Launch” page, where you can verify that you selected the options you intended and click the Launch button.
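
For comparison, the console choices above map roughly onto a single CLI call. This is a sketch under assumptions, not what was done in class: the AWS CLI is configured, the account has a default VPC (so the security group can be given by name), and the function name launch_class_instance is made up.

```shell
# Sketch: launch one t2.micro with a 30 GB gp2 root volume.
launch_class_instance() {
  local ami_id="$1" key_name="$2" group_name="$3"
  aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type t2.micro \
    --key-name "$key_name" \
    --security-groups "$group_name" \
    --block-device-mappings 'DeviceName=/dev/sda1,Ebs={VolumeSize=30,VolumeType=gp2}' \
    --count 1
}
```

A call might look like launch_class_instance ami-005bdb005fb00e791 jan_2018_unex_213 ucla_data_sci. AMI IDs are specific to a region, so the ID from the template only applies to the region it was taken from.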

Note About Jupyter Notebook Security

The security group that we assigned to this instance is open to the world. We are not that concerned about intruders. However, we do have some security in the form of tokens. You will see later that when Jupyter starts in Docker, it generates a security token that you will need to access your Jupyter notebook. The token is like a password.

Configure the New EC2 Instance for Using Docker

SSH into the EC2 instance you just created by running in Git Bash

ssh ubuntu@<ipv4 public address>

The IPv4 public address can be found in AWS > EC2 > Instances. Copy the IPv4 address and paste it into Git Bash (Shift+Insert, because Ctrl+V doesn’t paste in Git Bash). The first time you connect, ssh will ask whether you’re sure you want to continue connecting. Type yes.

To install docker, run

curl -sSL https://get.docker.com/ | sh

The shell will tell you to run

sudo usermod -aG docker ubuntu

Run that command. This adds the Ubuntu user to the Docker group. In order to force the changes to take effect, you need to reboot. Reboot by running

sudo reboot

Adding the Ubuntu user to the Docker group makes it so that sudo (short for super user do) is no longer required to issue commands to the docker client. We tested this by running

docker -v

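which prints the version of the Docker client you’re using. Because docker -v does not contact the Docker daemon, a stricter check is a command that does, such as docker ps.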
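
Group membership can also be checked directly after the reboot. The helper below is a sketch; the name in_group is made up, and it relies only on the standard id command.

```shell
# Sketch: succeed if the given user belongs to the given group.
in_group() {
  local user="$1" group="$2"
  # id -nG prints the user's group names separated by spaces;
  # put them one per line and look for an exact match.
  id -nG "$user" | tr ' ' '\n' | grep -qx -- "$group"
}
```

On the instance, in_group ubuntu docker && docker ps would list containers only if the ubuntu user really is in the docker group.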

Pull the jupyter/scipy-notebook Image

Run

docker pull jupyter/scipy-notebook

docker pull pulls an image or a repository from a registry. In this case, we are pulling the Docker image jupyter/scipy-notebook from Project Jupyter’s public Docker Hub account. After pulling this image, you do not need to pull it again because it is now in your Docker image cache. Anytime you run a new Jupyter container, Docker will create the container from the image in your cache. You can vaguely think of the image as a class and a container as an instance of that class. “Vaguely” because the teacher, Josh, says so in his book Docker for Data Science. I don’t know Docker well enough to know why that’s only a vague analogy.
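
The caching behaviour can be made explicit with a small check before pulling. This is a sketch; the function name ensure_image is hypothetical, and it assumes a Docker version that has the docker image inspect subcommand.

```shell
# Sketch: pull the image only if it is not already in the local cache.
ensure_image() {
  local image="${1:-jupyter/scipy-notebook}"
  if docker image inspect "$image" >/dev/null 2>&1; then
    echo "$image already cached"
  else
    docker pull "$image"
  fi
}
```

Running ensure_image a second time should report that the image is already cached instead of downloading it again.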

Run the jupyter/scipy-notebook Image

The command is

docker run -v /home/ubuntu:/home/jovyan -p 8888:8888 -d jupyter/scipy-notebook

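In class, we used 80:8888 in place of 8888:8888 because UCLA did not let us connect via port 8888. Port 80 is the HTTP port. Because of the -d flag, docker run itself prints only the container ID; the link to paste into your browser comes from the container, for example from docker logs <container id>. When mapping to port 80, I just replaced the 8888 in the link with 80. The link should look like this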

http://localhost:8888/?token=fc8ff7effaefa09be57ba60a90b669c7f023ffe8c08d1e04 :: /home/jovyan
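
Starting the container and fetching the tokenized link can be wrapped in two helpers. This is a sketch with made-up function names (start_jupyter, jupyter_links). It assumes the link appears in the container's startup log, which is the case for the Jupyter images I have seen, though the exact log format varies between versions.

```shell
# Sketch: start the notebook container and print its container ID.
start_jupyter() {
  local host_port="${1:-8888}"
  local image="${2:-jupyter/scipy-notebook}"
  # -d detaches, -v shares the ubuntu home directory with the notebook user,
  # -p maps <host port>:<container port>.
  docker run -d \
    -v /home/ubuntu:/home/jovyan \
    -p "${host_port}:8888" \
    "$image"
}

# Sketch: extract tokenized URLs from the container's log.
jupyter_links() {
  local container_id="$1"
  docker logs "$container_id" 2>&1 | grep -o 'http://[^ ]*token=[^ ]*'
}
```

A typical sequence would be container_id="$(start_jupyter 80)", a short pause while Jupyter starts, and then jupyter_links "$container_id".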

I replaced localhost with the IPv4 public address of my AWS instance. This is because in the browser, localhost means my own computer, whereas Jupyter is actually running on the AWS instance.
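
The substitution can be sketched as a small helper. The function name public_jupyter_url is made up, and 203.0.113.10 in the usage note is a placeholder address, not a real instance.

```shell
# Sketch: turn the link printed on the instance into one usable from a laptop.
public_jupyter_url() {
  local listed="$1" public_host="$2" host_port="${3:-8888}"
  # Drop the trailing " :: /home/jovyan" that the listing appends.
  local url="${listed%% *}"
  # Swap the container-side host and port for the public ones.
  printf '%s\n' "$url" | sed "s|//localhost:8888/|//${public_host}:${host_port}/|"
}
```

For example, public_jupyter_url 'http://localhost:8888/?token=abc :: /home/jovyan' 203.0.113.10 80 prints http://203.0.113.10:80/?token=abc.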

How to Set Up a Domain Name for Jupyter

If we only did what is outlined above, we would need the IPv4 address of our instance and our Jupyter notebook’s token to access our notebook. IP addresses and tokens are not memorable; most people have to look them up. However, we need neither if we associate our IP address with a domain name and set up a password.

[Screenshot: browser showing the domain name in the address bar and a request for a password]

How to Set Up a Domain Name

1. Buy a Domain Name

I bought a domain from the web hosting company 1and1 only for its promotional offer: in February 2018, you could get a domain for a year for only $1. Just go to 1and1.com; buying a domain name there should be self-explanatory.

2. Set Your Domain’s IP Address to Your AWS Instance’s IPv4 Address

With 1and1, I just logged in to 1and1.com and edited my domain’s DNS settings so that the domain points to my instance’s IPv4 address.
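
Once the DNS change has propagated, it can be checked from a Linux shell such as the one on the instance. This sketch assumes the getent command is available (it is part of glibc, so it may be missing in Git Bash); the function name resolves_to is made up, and example.com and 203.0.113.10 in the usage note are placeholders.

```shell
# Sketch: succeed if the domain's first IPv4 address equals the expected one.
resolves_to() {
  local domain="$1" expected_ip="$2" resolved
  # getent ahostsv4 prints lines that start with an IPv4 address.
  resolved="$(getent ahostsv4 "$domain" | awk '{print $1}' | head -n 1)"
  [ "$resolved" = "$expected_ip" ]
}
```

A call such as resolves_to example.com 203.0.113.10 && echo "DNS is set" prints the message only when the domain already points at the instance.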