Efficient File Transfer

Introduction

This page offers suggestions on how to easily and quickly keep files on your own file system in sync with files on /nobackup on one of the HPC clusters. There are three main elements:

  • setting up your file systems in a smart way
  • syncing with rsync
  • making it work smoothly without having to enter a password every time.

The aim is to make syncing easy and smooth, so that you can delete files on /nobackup, or let them expire, without worrying about whether you have copied them back. This works if you are using a Mac or a Linux workstation.

If you have any suggestions or tips on how this might be accomplished on a Windows PC, please let us know and we’ll update this documentation accordingly.

Notation: For clarity, commands to be issued on the HPC machine are prefixed with the prompt [jsmith@login1 ~]$ , matching what you would see when logged in to the ARC systems. Commands to be entered on your local system are prefixed with [jsmith@your_system ~]$ , and commands that can be entered on either machine with [jsmith@hostname ~]$ .

Setting up The File Systems

We’ll assume you have your data in one or a few folders in your local $HOME directory.

The idea is to have your local file system (your workstation) and your remote file system (on the HPC service) look the same. We will use symbolic links to do so.

Let’s assume your ID is jsmith on both systems and your data is in your local $HOME/data .

  • log in to the HPC cluster
  • create a directory on /nobackup
  • make a symbolic link in your HPC home directory that points to that directory
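The steps in the list can be rehearsed safely before touching the cluster. The sketch below is only a stand-in: a throwaway directory plays the role of both /nobackup/jsmith and the HPC home directory, and the variable names (sandbox, scratch, home) are made up for illustration.

```shell
# A throwaway directory stands in for the cluster's file systems.
sandbox=$(mktemp -d)
scratch="$sandbox/nobackup/jsmith"   # plays the role of /nobackup/jsmith
home="$sandbox/home/jsmith"          # plays the role of the HPC $HOME
mkdir -p "$home"

# Create the data directory on the scratch area, then link to it from "home".
mkdir -p "$scratch/data"
ln -s "$scratch/data" "$home/data"

# Inside the link, pwd reports the path through the link,
# while pwd -P reports the real location on the scratch area.
cd "$home/data"
pwd      # ends in home/jsmith/data
pwd -P   # ends in nobackup/jsmith/data
```

On the cluster itself the same two steps would be mkdir -p /nobackup/jsmith/data followed by ln -s /nobackup/jsmith/data ~/data , typed at the [jsmith@login1 ~]$ prompt.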

You now have a link called data in your HPC home directory. If you cd into it, it shows the contents of /nobackup/jsmith/data (nothing at the moment), but if you type pwd it tells you that you are in $HOME/data .

Syncing

Now we want to sync some data. Because both filesystems look the same it is much easier to do.

Let’s say you want to sync data in your local system’s $HOME/data/project1 to the HPC cluster. Here is the command to do it.
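A minimal sketch of such a command, wrapped in a small helper so that the folder name is the only thing left to type. It uses the options explained under Rsync Options, minus the arcfour cipher, which recent OpenSSH releases have removed. The host name arc2.example.com, the variable REMOTE and the function name push_to_hpc are assumptions, not part of the original instructions.

```shell
# Run on your local system. REMOTE is a hypothetical login address:
# replace it with your own user name and cluster host.
REMOTE="jsmith@arc2.example.com"

# Push one folder from the local ~/data to ~/data on the cluster.
# There is deliberately no trailing slash after the folder name.
push_to_hpc() {
    rsync --bwlimit=40000 --times --perms --recursive \
          --progress --stats \
          "$HOME/data/$1" "$REMOTE:data/"
}

# Example: push_to_hpc project1
```

Calling push_to_hpc project1 copies $HOME/data/project1 to data/project1 in your HPC home directory, which the symbolic link maps onto /nobackup.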

(The prompt asks for your password; enter it.)

See below for an explanation of all the options.

BE CAREFUL: What we are asking rsync to do is copy the directory project1 itself. Don’t add a trailing / after project1 , or rsync will copy the contents of the directory into ~/data on the cluster (ARC2 in this case) without creating the folder project1 .

rsync works over ssh , so if you can connect using ssh you should not have problems. It is secure and has several advantages: in particular, before starting, it checks which files have changed or are missing and only transfers the data required to re-sync the two folders. It also means that if the connection fails, you don’t have to start from scratch.

After you run your analysis you can copy data back in the same way, either “pushed” from the HPC service or “pulled” from your computer.
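Pulling results back from your own computer uses the same kind of command with origin and destination swapped. This is only a sketch: arc2.example.com, REMOTE and pull_from_hpc are hypothetical names.

```shell
# Run on your local system; REMOTE is a hypothetical login address.
REMOTE="jsmith@arc2.example.com"

# Pull one folder from ~/data on the cluster into the local ~/data.
pull_from_hpc() {
    rsync --times --perms --recursive --progress --stats \
          "$REMOTE:data/$1" "$HOME/data/"
}

# Example: pull_from_hpc project1
```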

One advantage of having the same filesystem structure on your local and remote computer is that you only need to “find” the folder/file you want to sync on your system.

Rsync Options

The general form of the command is rsync [options] <origin> <destination> . Either <origin> or <destination> can be on a remote location, in which case it should look like user@server.example.com:/path/to/destination

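--rsh="/usr/bin/ssh -c arcfour" Transmits over ssh using a very fast cipher (arcfour) to reduce CPU usage at both ends (sometimes CPU is the bottleneck!). Recent OpenSSH releases no longer support arcfour, so this option may need to be dropped on up-to-date systems.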

--bwlimit=40000 Limits I/O bandwidth, in KBytes per second. This caps the transfer at 40MB/s (the incoming link is approx. 100MB/s) to avoid saturating the link.

--times Preserves modification times. By default, rsync checks file size and modification time to decide whether a file has changed. If they differ, it performs a checksum of both files, which might be slower than the transfer itself! To always force a checksum use -c ; to compare file size only use --size-only . Be careful with this last option, since a file that has changed without changing size will be skipped.

--perms keep file permissions. Destination files will be set with same permissions as origin.

--recursive recurse into directories.

--progress and --stats give you something to look at if you are bored (or want to monitor the connection), and a final report to be impressed by.

Automatic Authentication

Once you are familiar with rsync , you will notice that having to enter the password every time becomes the annoying bit. Fortunately there is a solution for this too! It is a bit lengthy, but worth it.

For a full explanation of how ssh works, see http://www.ibm.com/developerworks/library/l-keyc/index.html

To set up the automatic authentication (sometimes known as passwordless login), follow these steps:

  • generate a private and public ssh key pair on your local system
  • upload the public key to the remote system, e.g. ARC2
  • generate an ssh key pair on the remote system, ARC2
  • send the public key from ARC2 to your local system.

Generate ssh Keypair on Your Local System

To generate the key pair, run ssh-keygen on your computer, for example as ssh-keygen -t rsa .

Accept the default key location when prompted, typically ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub for the private and public key respectively, and provide ssh-keygen with a secure passphrase. Once ssh-keygen completes, you’ll have a public key as well as a passphrase-encrypted private key. The passphrase should not be the same as the password you use to log in.

Upload Public Key to The HPC Cluster

Now we need to upload the public key to the HPC cluster. Because we want automated login both to and from the HPC machine, we will add the public key to the “authorized_keys” file and transfer that file over, instead of copying the public key file id_rsa.pub directly.

Don’t copy the private key: that should be kept securely on your local system.

On your local machine issue the following commands:
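A sketch of what those commands might look like, wrapped in a helper. The host name arc2.example.com and the names REMOTE, SSH_DIR and upload_authorized_keys are assumptions; the file is uploaded to the top of the HPC home directory, to be moved into place later.

```shell
# Run on your local system. REMOTE is a hypothetical login address.
REMOTE="jsmith@arc2.example.com"
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"

upload_authorized_keys() {
    # Add your own public key to the local authorized_keys file...
    cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
    chmod 600 "$SSH_DIR/authorized_keys"
    # ...then copy that file (never id_rsa, the private key)
    # to the HPC home directory.
    scp "$SSH_DIR/authorized_keys" "$REMOTE:"
}

# Example: upload_authorized_keys
```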

You will still be prompted for your password at this point.

IMPORTANT: the private key should be, well, private. It is encrypted, but you should still be the only one with read access to that file. Never share it. It is possible to leave the passphrase blank, but this will substantially weaken the security of your keys: if it is left blank, anyone who gets hold of your private key will be able to log in as you.

Now log on to the HPC machine; it will still require a password at this point.

Generate ssh Keypair on Arc2

Log in to Arc2 using your password, generate a private and public key pair with ssh-keygen in the same way as on your local system, and accept the default location.

Upload Public Key To Your Local System

Now move the uploaded authorized_keys file to the correct location ( ~/.ssh/authorized_keys ), append the Arc2 public key to it, and copy the result back to your local machine.
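The three steps can be sketched as a single helper. Everything named here is an assumption: WORKSTATION is a hypothetical address for your local system (which must accept ssh connections for the last step to work), UPLOADED is where the uploaded file is assumed to have landed, and return_authorized_keys is a made-up function name.

```shell
# Run on Arc2. WORKSTATION is a hypothetical address for your local system.
WORKSTATION="jsmith@your_system"
SSH_DIR="${SSH_DIR:-$HOME/.ssh}"
UPLOADED="${UPLOADED:-$HOME/authorized_keys}"   # assumed upload location

return_authorized_keys() {
    mkdir -p "$SSH_DIR" && chmod 700 "$SSH_DIR"
    # Move the uploaded file into place (this replaces any existing one).
    mv "$UPLOADED" "$SSH_DIR/authorized_keys"
    # Append the Arc2 public key.
    cat "$SSH_DIR/id_rsa.pub" >> "$SSH_DIR/authorized_keys"
    chmod 600 "$SSH_DIR/authorized_keys"
    # Copy the combined file back to the local system.
    scp "$SSH_DIR/authorized_keys" "$WORKSTATION:.ssh/authorized_keys"
}

# Example: return_authorized_keys
```

After this, both machines hold the same authorized_keys file containing both public keys.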

Using keychain for Automatic Authentication

If everything is in place, you should now be able to log in using the keys. However, because they are encrypted, you will still be prompted for the passphrase. So “what’s the point?” you might ask. Well, on Mac OS X you will have the option to keep the passphrase in Keychain. If you do, you will not be prompted for the passphrase anymore.

On a Linux box you can achieve the same using the keychain application.

Download it using:

OR

extract it:

and now run it, providing your private key:

You will be prompted for the passphrase. And finally

Now you should be able to ssh in (or rsync) as many times as you want without entering the passphrase!

If you exit your session, you’ll need to repeat the last two steps. You can put them in your .bash_profile , though, so that they are executed automatically each time, or make a bash script with those two commands.
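A sketch of what such a .bash_profile fragment might look like. It assumes keychain is installed somewhere on your PATH, that your key is ~/.ssh/id_rsa , and that keychain writes its environment file to ~/.keychain/<hostname>-sh ; the function name load_keychain is made up for illustration.

```shell
# Fragment for ~/.bash_profile on a Linux workstation.
load_keychain() {
    # Do nothing if keychain is not installed.
    command -v keychain >/dev/null 2>&1 || return 0
    # Start (or reuse) an ssh-agent and add the key; the passphrase is
    # requested only if the key is not already loaded.
    keychain "$HOME/.ssh/id_rsa"
    # Load the agent settings that keychain wrote into this shell.
    . "$HOME/.keychain/${HOSTNAME:-$(uname -n)}-sh"
}
load_keychain
```

Wrapping the two commands in a function that checks for keychain first keeps the profile harmless on machines where keychain is missing.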