MGIS is seeking a System Administrator, Level 2, to manage High Performance Computing (HPC) clusters and support the scientists who rely on them. This role blends HPC system administration with hands-on user support — helping researchers install, run, and debug applications on HPC infrastructure so they can focus on their science instead of IT issues.
HPC environments in scope include clustered CPU/GPU systems with job schedulers and attached parallel storage (e.g., Lustre, GPFS).
What you'll be doing
HPC Administrator duties
- Maintain the HPC cluster — hardware, image management, local networking, scheduler, and backups
- Troubleshoot environment incidents to ensure a quick return to normal operations
HPC Analyst duties
- Meet with scientists to evaluate their HPC support requirements
- Develop task plans to meet researchers' needs, consulting the Technical Authority for approval
- Support application builds, installs, and runtime troubleshooting across GNU, Intel, and NVIDIA compiler toolchains, including Fortran codes
- Support open-source and commercial software, including Python/Anaconda installs, Bash scripting, build/make tools, EasyBuild, Spack, and MPI implementations (MPICH, Open MPI, Intel MPI, HP-MPI)
- Assist with compilation and runtime of in-house developed applications
General systems management
- Manage Linux OS patching schedules and maintain system reliability
- Manage user accounts (creation, deletion) and environment modules
- Manage configuration via Git, MS DevOps, and Ansible Playbooks
- Manage RPM/DEB packages and troubleshoot ThinLinc
Troubleshooting & hardware
- Troubleshoot jobs on schedulers (PBS Pro/TORQUE, Slurm, SGE)
- Ensure reliable CUDA installs; troubleshoot GPU failures and CUDA software/driver issues
- Provide hardware support (memory upgrades, storage arrays, power/network cabling, iLO)
Documentation
- Document every process and task to support enterprise knowledge continuity
- Submit weekly progress reports to the Technical Authority
What we're looking for
- Solid experience administering Linux-based HPC clusters (CPU/GPU nodes, schedulers, parallel storage)
- Hands-on experience with job schedulers such as PBS Pro/TORQUE, Slurm, or SGE
- Experience troubleshooting CUDA installations, GPU failures, and driver issues
- Familiarity with scientific computing toolchains — compilers (GNU, Intel), MPI implementations, EasyBuild, and Spack
- Experience supporting researchers or end-users with application builds and runtime issues
- Working knowledge of version control and configuration management tools (Git, Ansible, MS DevOps)
- Comfortable working independently and producing clear technical documentation
- Eligible to obtain and maintain a Secret-level security clearance
Additional details
- Language of work: English
- May involve work on infrastructure containing Controlled Goods datasets
- MGIS will supply a laptop, workstation/cubicle, software, and system access; an ID badge is issued for facility access
- All intellectual property created under this contract remains the property of the client
- Loaned equipment may be recalled at any time per operational need
Pay: From $130,000.00 per year
Work Location: Hybrid remote in Dartmouth, NS