Data management

Transferring files

There are several ways to copy files from/to the Cluster:

  • Direct scp, rsync, sftp... to the login nodes
  • Using a Data transfer Machine which shares all the GPFS filesystem for transferring large files
  • Mounting GPFS in your local machine

Direct copy to the login nodes

As said before, no connections are allowed from inside the cluster to the outside world, so all scp and sftp commands have to be executed from your local machine, never from the cluster. Usage examples are given in the next section.

On a Windows system, most secure shell clients come with a tool for secure copies or secure FTP transfers. Several tools meet these requirements; please refer to the Appendices, where you will find the most common ones and examples of their use.

Data Transfer Machine

We provide special machines for file transfer (required for large amounts of data). These machines are dedicated to Data Transfer and are accessible through ssh with the same account credentials as the cluster. They are:

  • dt01.bsc.es
  • dt02.bsc.es

These machines share the GPFS filesystem with all other BSC HPC machines. Besides scp and sftp, they allow some other useful transfer protocols:

  • scp
localsystem$ scp localfile username@dt01.bsc.es:
username's password:

localsystem$ scp username@dt01.bsc.es:remotefile localdir
username's password:
  • rsync
localsystem$ rsync -avzP localfile_or_localdir username@dt01.bsc.es:
username's password:

localsystem$ rsync -avzP username@dt01.bsc.es:remotefile_or_remotedir localdir
username's password:
  • sftp
localsystem$ sftp username@dt01.bsc.es
username's password:
sftp> get remotefile

localsystem$ sftp username@dt01.bsc.es
username's password:
sftp> put localfile
  • BBCP
bbcp -V -z <USER>@dt01.bsc.es:<FILE> <DEST>
bbcp -V <ORIG> <USER>@dt01.bsc.es:<DEST>
  • GRIDFTP (only accessible from dt02.bsc.es)
globus-url-copy -help
globus-url-copy -tcp-bs 16M -bs 16M -v -vb your_file sshftp://your_user@dt02.bsc.es/~/
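Before transferring a large directory tree with rsync, it can be useful to preview what would be copied without actually transferring anything. The `--dry-run` flag is a standard rsync option; the hostname and placeholder paths follow the examples above:

```shell
# Show which files rsync would transfer, without copying anything.
# localdir/ and remotedir/ are placeholder paths; adjust to your case.
rsync -avzP --dry-run localdir/ username@dt01.bsc.es:remotedir/
```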

Setting up sshfs

  • Create a directory inside your local machine that will be used as a mount point.

  • Run the command below, where the local directory is the directory you created earlier. Note that this command mounts your GPFS home directory by default.

sshfs -o workaround=rename <yourHPCUser>@dt01.bsc.es: <localDirectory>
  • From now on, you can access that directory. If you access it, you should see your home directory on the GPFS filesystem. Any modifications you make inside that directory will be replicated to the GPFS filesystem on the HPC machines.

  • Inside that directory, you can call "git clone", "git pull" or "git push" as you please.
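When you are done, remember to unmount the directory. A sketch for Linux, where sshfs is FUSE-based (on macOS you would typically use umount instead):

```shell
# Unmount the sshfs mount point when finished
# (<localDirectory> is the mount point you created earlier).
fusermount -u <localDirectory>
```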

Active Archive Management

To move or copy files from/to the Active Archive (AA) you have to use our special commands, available on dt01.bsc.es and dt02.bsc.es, or on any other machine by loading the "transfer" module:

  • dtcp, dtmv, dtrsync, dttar

These commands submit a job into a special class performing the selected command. Their syntax is the same as that of the corresponding shell command without the 'dt' prefix (cp, mv, rsync, tar).
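For example, on a machine where the dt commands are not available by default, you would first load the module (a sketch, assuming the standard Environment Modules `module` command mentioned above):

```shell
# Load the "transfer" module to get access to the dt* commands
# (not needed on dt01.bsc.es / dt02.bsc.es, where they are already available).
module load transfer

# Then use them like their shell counterparts:
dtcp -r ~/OUTPUTS /gpfs/archive/hpc/group01/
```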

  • dtq, dtcancel
dtq

dtq shows all the transfer jobs that belong to you; it works like squeue in SLURM.

dtcancel <job_id>

dtcancel cancels the transfer job whose job id is given as a parameter; it works like scancel in SLURM.
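Putting these together, a typical session might look as follows (the job id 123456 is a placeholder; use the id reported by dtq):

```shell
dtcp -r ~/OUTPUTS /gpfs/archive/hpc/group01/   # submit a transfer job to the queue
dtq                                            # list your pending transfer jobs
dtcancel 123456                                # cancel one, if needed (placeholder id)
```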

  • dttar: submits a tar command to the queues. Example: tarring data from /gpfs/ to /gpfs/archive/hpc
dttar -cvf /gpfs/archive/hpc/group01/outputs.tar ~/OUTPUTS
  • dtcp: submits a cp command to the queues. Remember to delete the data in the source filesystem once copied to AA to avoid duplicate data.
# Example: Copying data from /gpfs to /gpfs/archive/hpc    
dtcp -r ~/OUTPUTS /gpfs/archive/hpc/group01/
# Example: Copying data from /gpfs/archive/hpc to /gpfs
dtcp -r /gpfs/archive/hpc/group01/OUTPUTS ~/
  • dtrsync: submits an rsync command to the queues. Remember to delete the data in the source filesystem once copied to AA to avoid duplicate data.
# Example: Copying data from /gpfs to /gpfs/archive/hpc    
dtrsync -avP ~/OUTPUTS /gpfs/archive/hpc/group01/
# Example: Copying data from /gpfs/archive/hpc to /gpfs
dtrsync -avP /gpfs/archive/hpc/group01/OUTPUTS ~/
  • dtsgrsync: submits an rsync command to the queues, switching to the group specified as its first parameter. If you are not a member of the requested group, the command will fail. Remember to delete the data in the source filesystem once copied to the other group to avoid duplicate data.
# Example: Copying data from group01 to group02
dtsgrsync group02 /gpfs/projects/group01/OUTPUTS /gpfs/projects/group02/
  • dtmv: submits a mv command to queues.
# Example: Moving data from /gpfs to /gpfs/archive/hpc    
dtmv ~/OUTPUTS /gpfs/archive/hpc/group01/
# Example: Moving data from /gpfs/archive/hpc to /gpfs
dtmv /gpfs/archive/hpc/group01/OUTPUTS ~/

Additionally, these commands accept the following options:

--blocking: Block any process from reading the file at the final destination until the transfer is completed.

--time: Set a new maximum transfer time (the default is 18 h).

It is important to note that these kinds of jobs can be submitted from both the login nodes (automatic file management within a production job) and the dt01.bsc.es machine. AA is only mounted on the Data Transfer Machines, so if you wish to navigate the AA directory tree you have to log in to dt01.bsc.es.

Repository management (GIT/SVN)

There is no outgoing internet connection from the cluster, which prevents the use of external repositories directly from our machines. To circumvent that, you can use the sshfs command on your local machine, as explained in the Setting up sshfs section above, on both Linux and Windows.

Doing that, you can mount a desired directory from our GPFS filesystem on your local machine. That way, you can operate on your GPFS files as if they were stored on your local computer. That includes the use of git, so you can clone, push, or pull any desired repositories inside that mount point, and the changes will be transferred over to GPFS.
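For instance, after mounting your GPFS home directory as described in Setting up sshfs, a typical workflow could look like this (the repository URL is a hypothetical example):

```shell
cd <localDirectory>                             # the sshfs mount point created earlier
git clone https://github.com/example/repo.git   # clone into GPFS via the mount
cd repo
git pull                                        # later: update the working copy on GPFS
```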