Procedures Overview

Zorro, the AU High Performance Computing (HPC) System, is managed by the HPC Committee; the procedures on this page were developed by the committee. The user procedures below aim to

  • help make the system effective for users
  • ensure proper maintenance of the system
  • document use of this federally-funded facility
  • enable researchers to contribute to the system

This webpage always shows the procedures currently in effect. Procedures are continually reviewed and updated by the HPC Committee. Questions or concerns should be e-mailed to hpc@american.edu.

It is key that all users have access to the HPC system and that the integrity of all users' code is maintained. Following some of these procedures is likely to require that you understand the nature of HPC computing and the configuration of the hardware and software on the system. Please contact hpc@american.edu to ask questions or to report problems.

If a user does not abide by the procedures, a representative of the HPC Committee has the right to terminate the user's jobs and/or to suspend the user's account.

Terms

Jobs: In the context of HPC, jobs are workloads that users submit to the cluster workload manager to be run either in batch or interactively on the cluster compute nodes.

Project: A collaboration that requires shared storage on the HPC between one AU faculty member and any number of other faculty or students, from AU or other institutions.

User Obligations

Users Must:

  • Use the batch submission system of the cluster workload scheduler (LSF or similar); do not use programs like VS Code to run jobs. (A minimal example job script appears after this list.)
  • Do not run applications interactively from the login node.
  • Do not log into the compute nodes to run a job directly without permission from the HPC Committee.
    • To request access other than batch submission, send a request to hpc@american.edu explaining why special access is necessary for your research.
  • Note that unauthorized interactive jobs or applications found running on the login node(s), or on the compute nodes outside of LSF, are subject to immediate termination.
  • Respond in a timely way to correspondence from the HPC Committee.
    • All correspondence will be sent to the e-mail address on file with the HPC Committee.
  • Note that users’ home and shared directories are backed up to tape and can be accessed for up to two years before the backups are destroyed. Retrieving data requires the combined efforts of multiple offices, so requests require advance notice. Scratch space is not backed up.
    • If data are removed from the HPC, whether due to procedure violations or an account/status lapse, they may still be retrieved from tape storage.
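As a minimal, hedged illustration (assuming the scheduler is LSF and using hypothetical script, job, and program names), a batch job is wrapped in a small script containing #BSUB directives and submitted from the login node rather than run interactively:

    #!/bin/bash
    #BSUB -J my_analysis             # job name (hypothetical)
    #BSUB -q normal                  # queue; "normal" is the default described below
    #BSUB -n 4                       # number of CPU job slots requested
    #BSUB -W 02:00                   # runtime limit (hours:minutes)
    #BSUB -o my_analysis.%J.out      # standard output file (%J = job ID)
    #BSUB -e my_analysis.%J.err      # standard error file

    # Replace with the actual program and input used in your research.
    ./my_program input.dat

Saved as, for example, my_job.sh, the script is submitted with "bsub < my_job.sh"; bjobs reports its status and bkill cancels it. If the cluster uses a different workload manager, the equivalent batch commands apply.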

Costs of Use

There is no up-front cost of use for AU faculty Principal Investigators (PIs) or students. Externally funded users are encouraged to contribute funding to the facility at a rate of $1,000 per node-year.

Who May Have An Account?

  • Any AU faculty PI whose project has computational needs beyond the capacity of their current workstation may apply for an account. The account can be accessed by all members of the research team on the project.
  • Students (undergraduate or graduate) who are sponsored by an AU faculty PI to do independent research may apply for an account.
  • Accounts expire automatically after one year unless renewed. Once an account expires, all associated data are removed from the HPC.

  • Users from outside the AU community may also apply for an account for academic research upon invitation by an AU collaborator.

  • Once a user is no longer affiliated with AU, access to the cluster will be terminated and the account and associated data will be removed from the HPC.

  • HPC-wide software updates/installations need to be approved by the Office of Research or designee.

  • New commercial software installations need to be approved by the Office of Research or designee and purchased by either the faculty PI or the faculty PI’s department. This is contingent on the software being compatible with the cluster.

Storage/Disk Quotas

  • Users’ home directories (/home/{username}) have a soft limit of 400 GB and a hard limit of 500 GB. Once the 500 GB hard limit is reached, users will not be able to save data until the directory has been reduced to below the quota. (A sketch of how to check current usage appears after this list.)
  • Shared project storage hard limits will generally be 1 TB unless a larger allocation is approved by the Office of Research. If you exceed the approved project storage quota, you will not be able to save data until the directory has been reduced below the quota.
  • Projects expire automatically after one year, or as specified in an Office of Research project application, unless renewed.
  • Projects that expire will have all associated data removed from the HPC.
  • Neither Zorro nor the new HPC is configured to offer encrypted data storage or otherwise able to protect confidential or personally identifiable information. It remains the responsibility of project principal investigators to ensure that their data management practices conform to the expectations of the data provider and/or funding agency.
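As a hedged sketch (assuming quotas are enforced with standard Linux filesystem quotas; the exact tooling on Zorro may differ, and the project path below is hypothetical), current usage can be checked from the login node before the hard limit is reached:

    # Report the total size of your home directory.
    du -sh ~

    # Report the size of a shared project directory (hypothetical path).
    du -sh /path/to/project

    # If filesystem quotas are enabled for your account, show current
    # usage against the soft (400 GB) and hard (500 GB) limits.
    quota -s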

Jobs

User access to the compute nodes is managed by software called a scheduler. The scheduler reserves available nodes and other resources according to the queues that a user can access. The queues are managed so that AU's research capacity is maximized.

There are six queues on the cluster. Five are open to all users; one is restricted. Each queue has a different runtime limit and different resources (CPUs/GPUs) associated with it. As a rule, compute-intensive jobs (such as computing certain statistical estimators) are allowed to use more resources over a short period of time. Jobs that require long run times (such as simulations or real-time modeling) are allowed fewer resources so that they do not create bottlenecks for other users.
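As a hedged example (assuming an LSF-style scheduler, as noted under User Obligations), the queues and their configured limits can be inspected from the login node:

    # List all queues visible to your account, with their status and
    # current pending/running job counts.
    bqueues

    # Show the full configuration of a single queue, including its
    # runtime limit and job slot limit.
    bqueues -l normal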

Queues Open to All Users

Normal: This is the default queue. If you do not specify another queue when you submit a job, your job will run in the normal queue. The normal queue has a runtime limit of 48 hours. Users can request up to 72 CPUs / job slots when they submit a job.

Long: The long queue is for jobs that are expected to run longer than 48 hours. The runtime limit in this queue is 240 hours. Users can request up to 48 CPUs / job slots. If a user finds that jobs submitted to the normal queue are terminated before completion because of the runtime limit, the job should be resubmitted to the long queue.

Short: This queue is for compute-intensive jobs that are not expected to run longer than two hours. The runtime limit in this queue is 2 hours. Users can request up to 120 CPUs / job slots. As with the long queue, users must specify the short queue when the job is submitted.
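As a hedged illustration (assuming LSF-style submission and the hypothetical script my_job.sh from the earlier example), the queue is selected on the command line or with a #BSUB -q directive in the script; in LSF, command-line options generally override the corresponding directives embedded in the script:

    # Submit to the long queue for a job expected to exceed the normal
    # queue's 48-hour runtime limit.
    bsub -q long -n 16 -W 240:00 < my_job.sh

    # Submit a compute-intensive job to the short queue instead.
    bsub -q short -n 120 -W 02:00 < my_job.sh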

H100: This queue provides access to dual Nvidia H100 GPUs, each with 80 GB of memory.

Bigmem: This queue provides nodes with 2 TB of RAM and no GPUs.

Please note that on the short, normal, long, and H100 queues, GPUs must be explicitly requested when the job is submitted; otherwise the job will run as a CPU-only job.
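As a hedged sketch (assuming a recent LSF release that supports the -gpu option; the exact resource string, the spelling of the queue name, and the script name below are assumptions and may differ on Zorro), a GPU is requested explicitly at submission time:

    # Request one GPU on the H100 queue; without the -gpu option the job
    # would run as a CPU-only job.
    bsub -q h100 -gpu "num=1" -n 4 < my_gpu_job.sh

The same option applies on the short, normal, and long queues when GPU resources are available there.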

Special Purpose & Restricted Queues

Priority: The priority queue is reserved for users who provide financial support for the AU HPC System, including its initial funding. Jobs submitted by priority users go to the front of the queue they requested.

Note: System administration and testing of the machine may interrupt jobs. System administrators will notify Users if maintenance or testing is expected to impact job run-time.

Users must:

  • Acknowledge the use of the HPC system in papers and presentations.
  • Confirm that user accounts are still needed when requested by OIT personnel in charge of resource allocation.
  • Update Titles and Abstracts for each research project being conducted on the HPC system when requested. (Requests are planned to coincide with Elements due dates.)
  • Submit a new Office of Research project request form (available on the HPC home page) every time a new project is started.

If a user does not abide by these procedures, the following measures will be taken:

  • First procedure violation: The user (and the faculty PI, if the user is a sponsored student) will receive a warning of the violation, and/or the user’s jobs will be terminated.
  • Second procedure violation: The user will have their account suspended for 30 days; any current jobs will be terminated.
  • Third procedure violation: The user will have their account terminated; any current jobs will be terminated. Any associated data in the user’s account will be removed from the HPC.


The newly reorganized HPC Committee manages the AU HPC System. More information will be posted soon.

These procedures were last updated by members of the Faculty Senate Committee on Information Services and Office of Information Technology in May 2024.


User Support Contact