System Administrator

MGIS
Dartmouth, NS

Job details

Contract
From $130,000 a year
7 days ago

Qualifications

Ansible
DevOps
Git
English
Fortran
Bash (Unix shell)
Linux
Python

Full job description

MGIS is seeking a System Administrator, Level 2, to manage High Performance Computing (HPC) clusters and support the scientists who rely on them. This role blends HPC system administration with hands-on user support — helping researchers install, run, and debug applications on HPC infrastructure so they can focus on their science instead of IT issues.

HPC environments in scope include clustered CPU/GPU systems with job schedulers and attached parallel storage (e.g., Lustre, GPFS).

What you'll be doing

HPC Administrator duties

Maintain the HPC cluster — hardware, image management, local networking, scheduler, and backups
Troubleshoot environment incidents to ensure a quick return to normal operations

HPC Analyst duties

Meet with scientists to evaluate their HPC support requirements
Develop task plans to meet researchers' needs, consulting the technical authority for approval
Support application builds, installs, and runtime troubleshooting (GNU, Intel, and NVIDIA toolchains; Fortran)
Support open-source and commercial software, including Python/Anaconda installs, Bash scripting, build/make tools, EasyBuild, Spack, and MPI implementations (MPICH, Open MPI, Intel MPI, HP-MPI)
Assist with compilation and runtime of in-house developed applications

General systems management

Manage Linux OS patching schedules and reliability
Manage user accounts (creation, deletion) and environment modules
Manage configuration via Git, MS DevOps, and Ansible playbooks
Manage RPM/DEB packages and troubleshoot ThinLinc

Troubleshooting & hardware

Troubleshoot jobs on schedulers (PBS Pro/Torque, Slurm, SGE)
Ensure reliable CUDA installs; troubleshoot GPU failures and CUDA software/driver issues
Provide hardware support: memory upgrades, storage arrays, power/network cabling, iLO

Documentation

Document every process and task to support enterprise knowledge continuity
Submit weekly progress reports to the Technical Authority

What we're looking for

Solid experience administering Linux-based HPC clusters (CPU/GPU nodes, schedulers, parallel storage)
Hands-on experience with job schedulers such as PBS Pro/Torque, Slurm, or SGE
Experience troubleshooting CUDA installations, GPU failures, and driver issues
Familiarity with scientific computing toolchains — compilers (GNU, Intel), MPI implementations, EasyBuild, and Spack
Experience supporting researchers or end-users with application builds and runtime issues
Working knowledge of version control and configuration management tools (Git, Ansible, MS DevOps)
Comfortable working independently and producing clear technical documentation
Eligible to obtain and maintain a Secret-level security clearance

Additional details

Language of work: English
May involve work on infrastructure containing Controlled Goods datasets
MGIS will supply a laptop, workstation/cubicle, software, and system access; an ID badge is issued for facility access
All intellectual property created under this contract remains the property of the client
Loaned equipment may be recalled at any time per operational need

Pay: From $130,000.00 per year

Work Location: Hybrid remote in Dartmouth, NS
