Frequently Asked Questions

General

Backup Policy

The backup policy at BSC for the HPC machines is to back up the contents of $HOME daily. For each existing file, a few versions may be stored. For deleted files, only the last version is kept after a few weeks. If your files are stored elsewhere in the filesystem, there is no backup available (unless you specifically ask the Support Team to back up those files).

Which filesystems do I have available? What is the intended usage?

Most of our machines share a single filesystem (GPFS), so all your data is accessible from your account on any of them. Quotas are enforced to limit the amount of user data on the filesystem. These machines present the following structure:

/gpfs/home
/apps
/gpfs/projects
/gpfs/scratch
/gpfs/tapes/hpc

There is also a local hard disk in each node. It is not shared and is accessible only from that node:

/scratch/tmp

Filesystems overview

Filesystem	Intended use	Permissions	Backup	Extra comment
/gpfs/home	Personal files	User	Yes (daily)
/apps	Applications	Only BSC Support	Yes	Contact BSC Support to request access to licensed software
/gpfs/projects	Inputs and results or share data with your group	Group	Yes
/gpfs/scratch	Temporary or intermediate files used	User	No
/gpfs/tapes/hpc	Long-term data storage	Group	No	Only browsable from Data Transfer and accessed through dtcommands
/scratch/tmp	Local disk. Temporary files that don't need to be recovered	User	No

/gpfs/home

This is your home filesystem, where you should store your personal and configuration files. It is private, even from other members of your team.

This filesystem is backed up once a day.

/apps

This is where applications are installed on the machine. Users are not allowed to install software or modify existing installations. Some applications require a license, and access to them may be restricted. If your group holds a valid license for such software and you can't access it, do not hesitate to contact our Support Team.

/gpfs/projects

This filesystem is intended to store the inputs and results of your executions and to share data with other group members.

This filesystem is continuously backed up.

/gpfs/scratch

This filesystem offers the highest performance for distributed reads and writes and should be used to store the temporary or intermediate files used by your executions. You may copy your inputs to this filesystem for the execution and retrieve the results to /gpfs/projects for analysis.

This filesystem is not backed up and data older than 15 days may be deleted without warning if space is needed.
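The copy-in, run, copy-out pattern described above could look like the following minimal sketch. The helper name `run_staged` and all paths passed to it are hypothetical placeholders, not part of any BSC tooling; adapt them to your own group and user directories.

```shell
# Hypothetical staging helper: every path passed in is an illustrative placeholder.
run_staged() {
    input_dir="$1"      # inputs, e.g. a directory under /gpfs/projects
    scratch_dir="$2"    # working area, e.g. a directory under /gpfs/scratch
    results_dir="$3"    # destination, e.g. a directory under /gpfs/projects
    shift 3             # remaining arguments: the command to run

    mkdir -p "$scratch_dir" "$results_dir" || return 1
    cp -r "$input_dir"/. "$scratch_dir"/ || return 1    # stage inputs into scratch
    ( cd "$scratch_dir" && "$@" ) || return 1           # run inside scratch
    cp -r "$scratch_dir"/. "$results_dir"/              # retrieve everything for analysis
}
```

Inside a job script it would be called as `run_staged <inputs> <scratch dir> <results dir> <command...>`. Note that this sketch copies the whole scratch directory back, inputs included; a real workflow would probably copy only the result files.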

/gpfs/tapes/hpc

This filesystem is intended for long-term data storage and is only browsable from Data Transfer. It may be accessed through the dtcommands.

There is no backup of this data. A project's data may remain there for up to a year after the end of the project, to account for project renewals in non-contiguous periods.

/scratch/tmp

This disk is intended for temporary files that don't need to be recovered, such as mesh partitions local to the node or temporary files for MPI communication between nodes. The data is not recoverable after the job ends. Do not confuse it with /gpfs/scratch.

How can I check how much free disk space I have available? What is a quota?

All the filesystems you have available have a quota set. These quotas limit the amount of space usable by a user or group of users. We provide some convenience tools to check quota usage and limits, depending on the machine you are connected to. Bear in mind that the filesystem is shared, so the limits and usage are the same across all machines.

You can use the command bsc_quota from the bsc_commands bundle which aggregates information about all filesystems in a comprehensive way.

To check the /gpfs/tapes/hpc quota, run the dtquota command.

Why can't I log in to HPC|Portal?

Your HPC Portal account is separate from the account you use to access the machines: it only works for HPC Portal and does not interfere in any way with your machine account. If it is your first time logging in, please follow the steps explained in the HPC Portal section.

Connections

Why can't I connect/download/update... from logins to ...?

For security reasons, our clusters are unable to open outgoing connections to other machines, either internal (other BSC facilities) or external, but they will accept incoming connections. You must upload all needed data to the cluster by yourself and download it the same way.

Why is my SSH Key not working?

If you have installed an SSH key on your account as an authorized key, but you are still asked for a password when you attempt to log in, you should check the following:

If you have changed the default name of the file in which the key is stored, you should specify the full path to the private key when you log in with SSH. For example:
$ ssh -i /home/user/.ssh/custom_key username@machine.bsc.es
If you have changed the permissions of your $HOME, $HOME/.ssh or $HOME/.ssh/authorized_keys, that may be the problem.
- The $HOME directory must have "rwx" permissions for the "user", and it must not have "w" permissions for group or others.
- The $HOME/.ssh must have "rwx" permissions for the "user", and it must not have any permissions for group or others.
- The $HOME/.ssh/authorized_keys must have "rw" permissions for the "user", and it must not have any permissions for group or others.
If you have set a passphrase on your SSH key, you will need to enter the passphrase every time you log in.
If none of the previous points solves the problem, please contact Support.
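
The permission requirements listed above can be restored with standard `chmod` calls. The following is a minimal sketch; the helper name `fix_ssh_perms` is hypothetical and not a BSC command. It is meant to be run on the cluster, in the account whose key is not being accepted.

```shell
# Hypothetical helper applying the permissions described above.
fix_ssh_perms() {
    home_dir="${1:-$HOME}"                      # defaults to your own $HOME
    chmod u+rwx,go-w "$home_dir"                # user rwx, no write for group/others
    chmod 700 "$home_dir/.ssh"                  # user rwx, nothing for group/others
    chmod 600 "$home_dir/.ssh/authorized_keys"  # user rw, nothing for group/others
}
```

Calling `fix_ssh_perms` without arguments acts on your own $HOME. You can check the result with `ls -ld "$HOME" "$HOME/.ssh" "$HOME/.ssh/authorized_keys"`.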

How can I open a GUI in the logins?

To open GUIs, you need to add the -X parameter to your ssh connection (on Linux/OSX) or use some kind of X11 forwarding (on Windows). Examples:

Linux/OSX

   ssh -X -l username <login.bsc.es>

ERROR: "/usr/bin/manpath: can't set the locale; make sure $LC_* and $LANG are correct" or "cannot set LC_CTYPE locale"

This error is related to the locale (the language dependent character encoding) of your system being different/incompatible with MareNostrum's. If you find yourself in this situation, please try the following:

All

    LANG=es_ES.UTF-8 ssh -l username <login.bsc.es>

MacOsX

Some Mac versions have a bug in the terminal that ignores the previous setting and causes this error. You should be able to disable this behaviour by unchecking "Set locale environment variables on startup" in Terminal Settings -> Advanced.

What is this "connect to host XXX.bsc.es port 22: Connection refused" error?

You are probably banned from the login node you are trying to access after some failed password attempts. To check whether this is the problem, you can try to access another login node of the same machine.

To have the ban lifted, please send a mail to Support specifying your public IP address and the login node that is refusing the connection. You can find your public IP address on websites such as: https://whatismyipaddress.com/

Data sharing and deletion

If you want to give another user access to a folder, our policy for each case is as follows:

If you want to share a folder with a user of your same group

You can do it yourself by changing the permissions of the folder. The command is:

chmod <-R> g+<r/w/x> path/to/folder

The option "-R" makes the change recursive, and "r/w/x" stand for read, write and execute permissions respectively. You can specify more than one at the same time. If the folder belongs to someone who is not available at the moment (on vacation, off work, etc.), you can contact Support to be granted permissions, but the Responsible's authorization will be needed.
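
As a concrete instance of the command above, here is a minimal sketch that grants group permissions recursively and prints the result. The helper name `share_with_group` is hypothetical, not a BSC command, and the folder you pass to it is your own choice.

```shell
# Hypothetical helper wrapping the chmod call described above.
share_with_group() {
    folder="$1"          # folder to share with your group
    perms="${2:-rx}"     # group permissions to add, e.g. r, rx or rwx
    chmod -R "g+$perms" "$folder"
    ls -ld "$folder"     # show the resulting permissions of the top folder
}
```

For example, `share_with_group path/to/folder rx` lets group members list and read the folder's contents. Passing `rX` instead adds execute permission only to directories and to files that are already executable.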

If you want to share a folder with a user of a different group

You will have to contact our Support Team, and you must meet all of these requirements:

Data is stored under /gpfs/projects/<unixgroup>
The responsible of the unixgroup approves (put them in CC in the mail so we can check their "OK")

Sharing data under /gpfs/scratch or /gpfs/home is not allowed. After the Responsible has given their permission, we will grant you access to the data.

For more complex operations (sharing data across multiple people from different groups, for example), please contact Support.

If you want to share a folder with a different group but it is composed of the same people

Please contact our Support Team with the PI of the other group in copy (so they can give us their "OK"), and we will grant you the desired secondary group.

You need the data from someone who no longer works at BSC, or from yourself with another account

Please contact our Support Team, specifying the files that you need and your username, so that we can change the owner of the data to you. If you have several accounts, we can grant you the requested secondary group so it is easier for you to navigate through the filesystem.

If you need to delete that data, unfortunately we cannot do that for you. After the owner of the data has been changed, you will have to delete it yourself.

Executions

I am preparing a pipeline and I want to quickly test job submission/job environment.

All accounts have access to a high-priority qos named "debug". Jobs submitted to this qos start running sooner, but they are subject to limits on time (2h) and size. The specific limits can be checked with the "bsc_queues" command. To submit a job to the debug qos, add one of the following lines to the jobscript (SLURM), depending on your machine/partition:

#SBATCH --qos=debug
#SBATCH --qos=gp_debug
#SBATCH --qos=acc_debug
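
For context, a complete test jobscript using one of these directives might look like the sketch below. The file name `debug_test.sh`, the job name and the 5-minute time limit are illustrative assumptions. Only the qos names come from this FAQ; pick the one that matches your machine/partition.

```shell
# Write a minimal, hypothetical test jobscript to debug_test.sh.
cat > debug_test.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=debug_test
#SBATCH --qos=debug
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --output=debug_test_%j.out

# Print where the job runs and the SLURM variables it sees.
echo "Job ${SLURM_JOB_ID} running on $(hostname)"
env | grep '^SLURM_' | sort
EOF
```

The script would then be submitted with `sbatch debug_test.sh`. Its output file shows the environment a real job would inherit.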

Additionally, on clusters managed by SLURM, users can request one or more compute nodes on which to try commands interactively, in the same environment as a job. These resources can be requested from the terminal of any login node by running the following (1 task example):

$ salloc -n 1 --qos=debug

The salloc command can be combined with any of the other SLURM sbatch options in order to request a different number of tasks, nodes, etc.

I was copying data/compiling/executing something in the login and the process was killed. Why?

The login nodes in our facilities have a CPU time limit of 5 minutes. This means that any execution requiring more than that is automatically killed.

You can avoid this restriction by using the queues (interactive queue or standard job) to compile or execute, depending on your interactive needs. To transfer files, you may use the Data Transfer commands inside our clusters' filesystems or connect to our Data Transfer Machines, where uploads and downloads are not time limited.

My job failed and I see a message like "OOM Killed..." in the logs. What is this?

This is a message from the OS kernel stating that your process was consuming too much memory, exceeding the node's limits and was thus killed.

If you encounter such a problem, you should try changing how many processes are executed on the same node. We recommend contacting our Support Team if it's the first time you try to tweak these settings.
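
To see why running fewer processes per node helps, the rough arithmetic is sketched below. The node memory of 96 GB is a hypothetical figure, not the specification of any BSC machine, and the helper name `mem_per_task_gb` is made up for this illustration.

```shell
# Rough per-task memory budget in GB (integer division).
# Both arguments are hypothetical inputs; check your machine's real node memory.
mem_per_task_gb() {
    node_mem_gb="$1"
    tasks_per_node="$2"
    echo $(( node_mem_gb / tasks_per_node ))
}

mem_per_task_gb 96 48   # prints 2
mem_per_task_gb 96 24   # prints 4
```

In a SLURM jobscript, the number of processes per node can be lowered with the `--ntasks-per-node` option of sbatch. Halving it roughly doubles the memory available to each process, at the cost of using more nodes for the same number of tasks.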

My job has been waiting for a long time. How can I check when it will be executed?

There is no reliable way to predict when a given job will start, as priorities in the queue are determined by several factors and may change as other jobs arrive. The system is designed so that all jobs are executed eventually, but some may take longer than others.

If you are working on a cluster managed by SLURM, you can check the expected start time and the resources to be allocated for pending jobs by running:

$ squeue --start

You should treat this as an estimate that can change over time. The job might start sooner or later than the time expected by the scheduler.
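
To narrow the output to your own jobs, `squeue` accepts `-u <user>` (and `-j <jobid>` for a single job) together with `--start`. A minimal sketch follows; the wrapper name `my_start_times` is hypothetical and is neither a SLURM nor a BSC command.

```shell
# Hypothetical wrapper around squeue.
my_start_times() {
    # $1: user name (defaults to the current user)
    squeue --start -u "${1:-$USER}"
}
```

Running `my_start_times` with no arguments lists the estimated start times of your own pending jobs.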

I need to use Nextflow, but I don't have an internet connection on the clusters. What do I do?

First of all, we don't directly support the use of Nextflow on our HPC clusters, since it is a workflow manager that is prone to issues when there is no internet connection. However, if you really need to use it, you have some options.

If you are a MN4 user from the BSC, you can use MN4's login0. This is the only login node with an internet connection, so you can run a very barebones version of your Nextflow pipeline there in order to download all your dependencies. Once that is done, you should be able to run the pipeline on compute nodes using the batch system.
If you use any other machine or don't have access to MN4's login0, you can try a workaround: run the pipeline on your local computer and then copy the generated Nextflow environment to your desired cluster. With a bit of luck, this will be enough to have all the necessary dependencies in place to run on the compute nodes. This environment should be composed of:
- The “.nextflow” directory in your home directory
- All directories generated by your pipeline
- The pipeline itself
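
One way to gather these pieces on your local computer before uploading them is to pack them into a single archive. The sketch below assumes that the directories generated by your pipeline live inside the pipeline directory itself, which may not match your setup; the helper name `bundle_nextflow_env` is hypothetical.

```shell
# Hypothetical helper: packs ~/.nextflow and a pipeline directory into one archive.
bundle_nextflow_env() {
    home_dir=$(cd "$1" && pwd) || return 1            # directory containing .nextflow
    pipeline_parent=$(cd "$2/.." && pwd) || return 1  # parent of the pipeline directory
    pipeline_name=$(basename "$2")
    archive="$3"                                      # use an absolute path
    tar -czf "$archive" \
        -C "$home_dir" .nextflow \
        -C "$pipeline_parent" "$pipeline_name"
}
```

For example, `bundle_nextflow_env "$HOME" /path/to/pipeline /tmp/nextflow_env.tar.gz` produces an archive that you can upload to the cluster and unpack there, placing `.nextflow` in your home directory.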

Performance Tuning

At [...] we saw that using the mpi option [...] gives up to [...] speedup. Have you seen a similar behavior at [...]?

We have seen that the tuning options of our different MPI implementations have very different outcomes depending on the application, the number of nodes used and sometimes even the input. We recommend that you benchmark different options to see which is best for your application and, if you have any doubts, contact the Support Team to look into the issue together. They will be happy to help you find the best possible environment and software stack for your execution.

Our general recommendations are:

- Prefer Intel compilers in most cases (x86 nodes only; on MN5 ACC, we usually use the NVIDIA compilers, which perform better, or GNU, which also behaves well).
- When using the Intel MKL math libraries, prefer Intel compilers and the IntelMPI implementation over OpenMPI (to compile with MKL, you can check the required linking flags and libraries here: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html).
- For big executions (more than 250 nodes), it is better to contact Support to check scalability and performance before launching production jobs, to make sure the execution will perform properly and the hours will be used effectively.

I have found performance problems, what should I do?

First, you should find a way to reproduce the problem and confirm with some tests that it is reproducible (a node may sometimes have memory/network problems, so if the performance problem only happened once, it was most likely caused by a temporary problem).

After that, you should provide all relevant information to the Support Team by e-mail, alongside instructions on how to reproduce your tests. The Support Team will investigate the issue and contact you as soon as possible with our recommendations.

Support Team

How/when may I contact Support Team?

You may send an e-mail at any time, and it will be answered on the next working day during office hours (9:00 - 17:00 CET). Bank holidays are those of Barcelona.

support AT bsc DOT es

caution

Please remember to include all the relevant information in your mail when contacting Support:

Job Ids
Software and version used
Environment
Machine
Username
Exact steps that lead to the issue
Error messages

Who may access the resources of BSC?

Scientific access to BSC's HPC resources is granted through national and international research projects, via the following channels:

Barcelona Supercomputing Center personnel or associated institutions
Spanish Supercomputing Network (RES) granted projects

How can I get access to BSC for computing?

There are periodic proposal submission deadlines for new and continuing projects, 3 times a year. Visit each Access Project's website to check when the next submission deadline is.
