UCSF Wynton HPC Status #

Queue Metrics #

[Graphs: queue and GPU-queue usage during the last day, week, month, and year]

Compute Nodes #

The status of the compute nodes is unknown, which happens when, for instance, the job scheduler is down.

Wynton HPC Grafana Dashboard #

Detailed statistics on the file-system load and other cluster metrics can be found on the Wynton HPC Grafana Dashboard. To access it, make sure you are on the UCSF network, and use your Wynton HPC credentials to log in.

Current Incidents #

May 16-June 1 (estimated), 2023 #

Full downtime followed by network and file-system recovery #

Update: The job scheduler is now available. Access to /wynton/group, /wynton/protected/group, and /wynton/protected/project has been restored. If you encounter a “Communication error on send” error, please do not delete or move the affected file.
June 1, 16:00 PDT

Update: Wynton will be fully available later today, meaning the job scheduler and access to /wynton/group, /wynton/protected/group, and /wynton/protected/project will be re-enabled. Note that two ZFS storage targets are still faulty and offline; work to recover them will continue while we go live. This means that any files on the above re-opened /wynton subfolders that are stored, in part or in full, on those two offline storage targets will be inaccessible. Any attempt to read such files will result in a “Communication error on send” error and stall. To exit, press Ctrl-C. Importantly, do not attempt to remove, move, or update such files! That would make it impossible to recover them!
June 1, 12:15 PDT

Update: In total, 22 (92%) out of 24 failed storage targets have been recovered. The consultant hopes to recover the bulk of the data from one of the two remaining damaged targets. The final target is heavily damaged; work on it will continue for a few more days, but it is likely it cannot be recovered. The plan is to open up /wynton/group tomorrow, Thursday, with instructions on what to expect for files on the damaged targets. The compute nodes and the job scheduler will also be enabled during the day tomorrow.
May 31, 22:45 PDT

Update: In total, 22 (92%) out of 24 failed storage targets have been recovered. The remaining two targets are unlikely to be fully recovered. We’re hoping to restore the bulk of the files from them, but there is a risk that we will get none back. The plan is to bring back /wynton/group, /wynton/protected/group, and /wynton/protected/project, and re-enable the job queue, on Thursday.
May 31, 01:00 PDT

Update: The login, data-transfer, and development nodes (except gpudev1) are now online and available for use. The job scheduler and compute nodes are kept offline to allow for continued recovery of the failed ZFS storage pools. For the same reason, folders under /wynton/group, /wynton/protected/group, and /wynton/protected/project are locked down, except for groups who have mirrored storage. /wynton/home and /wynton/scratch are fully available. We have suspended the automatic cleanup of old files under /wynton/scratch and /wynton/protected/scratch. The ZFS consultant recovered 3 of the 6 remaining storage targets. We have now recovered in total 21 (88%) out of 24 failed targets. The recovery work will continue on Monday (sic!).
May 26, 17:00 PDT

Update: All 12 ZFS storage targets on one server pair have been recovered and are undergoing final verification, after which that server pair will be back in production. On the remaining server pair, which also has 12 failed ZFS storage targets, 4 targets have been recovered, 4 possibly have been, and 4 are holding out. We are continuing our work with the consultant on those targets. These storage servers were installed on 2023-03-28, so only files written after that date may be affected. We are tentatively planning on bringing up the login, data-transfer, and development nodes tomorrow, Friday, prior to the long weekend, but access to directories in /wynton/group, /wynton/protected/group, or /wynton/protected/project will be blocked, with the exception of a few groups with mirrored storage. /wynton/home and /wynton/scratch will be fully accessible.
May 25, 17:00 PDT

Update: 8 more ZFS storage targets were recovered today. We have now recovered in total 17 (71%) out of 24 failed targets. The content of the recovered targets is now being verified. We will continue working with the consultant tomorrow on the remaining 7 storage targets.
May 24, 17:00 PDT

Update: The maintenance and upgrade of the Wynton network switch was successful and is now completed. We also made progress on recovering the failed ZFS storage targets - 9 (38%) out of 24 failed targets have been recovered. To maximize our chances at a full recovery, Wynton will be kept down until the consultant completes their initial assessment. Details: The contracted ZFS consultant started to work on recovering the failed ZFS storage targets that we have on four servers. During the two hours of work, they quickly recovered another three targets on the first server, leaving us with only one failed target on that server. Attempts to apply the same recovery method to the second and third servers were not successful. There was no time today to work on the fourth server. The work to recover the remaining targets will resume tomorrow. After the initial recovery method has been attempted on all targets, the consultant, who is one of the lead ZFS developers, plans to load a development version of ZFS on the servers in order to perform more thorough and deep-reaching recovery attempts.
May 23, 17:00 PDT

Update: Wynton will be kept down until the ZFS-recovery consultant has completed their initial assessment. If they get everything back quickly, Wynton will come back up swiftly. If recovery takes longer, or is less certain, we will look at coming back up without the problematic storage targets. As the purchase is being finalized, we hope that the consultant can start their work either on Tuesday or Wednesday. The UCSF Networking Team is performing more maintenance on the switch tonight.
May 22, 23:30 PDT

Update: The cluster will be kept offline until at least Tuesday May 23. The BeeGFS file-system failure occurred because 24 out of 144 ZFS storage targets got corrupted. These 24 storage targets served our “group” storage, which means only files written to /wynton/group, /wynton/protected/group, and /wynton/protected/project within the past couple of months are affected. Files under /wynton/home and /wynton/scratch are not affected. We are scanning the BeeGFS file system to identify exactly which files are affected. Thus far, we have managed to recover 6 (25%) out of the 24 failed targets. The remaining 18 targets are more complicated, and we are working with a vendor who will start helping us recover them next week.
May 19, 10:15 PDT

Update: Automatic cleanup of /wynton/scratch has been disabled.
May 18, 23:00 PDT

Update: Several ZFS storage targets that are used by BeeGFS experienced failures during the scheduled maintenance window. There is a very high risk of partial data loss, but we will do everything possible to minimize the loss. In addition, the Wynton core network switch failed and needs to be replaced. The UCSF IT Infrastructure Network Services Team is working with the vendor to get a rapid replacement.
May 17, 16:30 PDT

Update: The cluster is down and unavailable because of maintenance.
May 16, 21:00 PDT

Update: There will be a one-day downtime starting at 21:00 on Tuesday May 16 and ending at 17:00 on Wednesday May 17. This is aligned with a planned PG&E power-outage maintenance on May 17. Starting May 2, the maximum job runtime will be decreased on a daily basis from the maximum of 14 days so that jobs finish in time. Jobs with runtimes going into the maintenance window will only be started after the downtime. The default runtime is 14 days, so make sure to specify qsub -l h_rt=<run-time> ... if you want something shorter.
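For example, to request a 12-hour runtime for a hypothetical job script my_job.sh:

    qsub -l h_rt=12:00:00 my_job.sh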
May 3, 10:00 PDT

Update: The updated plan is to only have a 24-hour downtime starting the evening of Tuesday May 16 and ending by the end of Wednesday May 17. This is aligned with a planned PG&E power-outage maintenance on May 17.
April 24, 11:00 PDT

Update: The updated plan is to have the downtime during the week of May 15, 2023 (2023W20). This is aligned with a planned PG&E power-outage maintenance during the same week.
March 27, 11:00 PDT

Notice: We will be performing a full-week major update to the cluster during late Spring 2023. The current plan is to do this during either the week of May 8, 2023 (2023W19) or the week of May 15, 2023 (2023W20).
February 27, 11:00 PST

Upcoming Incidents #

None.

Past Incidents #

Operational Summary for 2023 (so far) #

  • Full downtime:
    • Scheduled: 17.0 hours (= 0.7 days)
    • Unscheduled: 0.0 hours (= 0.0 days)
    • Total: 17.0 hours (= 0.7 days)
    • External factors: 0% of the above downtime, corresponding to 0.0 hours (= 0.0 days), was due to external factors

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    • 2023-02-22 (17.0 hours)
  • Total downtime: 17.0 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    • N/A

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    • N/A
  • Total downtime: 0.0 hours of which 0.0 hours were due to external factors

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    • N/A
  • Total downtime: 0.0 hours of which 0.0 hours were due to external factors

Unscheduled downtimes due to other reasons #

  • Impact: Fewer compute resources
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    • N/A
  • Total downtime: 0.0 hours of which 0.0 hours were due to external factors

February 22-23, 2023 #

Full downtime #

Resolved: The cluster maintenance has completed and the cluster is now fully operational again.
February 23, 14:00 PST

Update: The cluster has been shut down for maintenance.
February 22, 21:00 PST

Notice: The cluster will be shut down for maintenance from 9 pm on Wednesday February 22 until 5:00 pm on Thursday February 23, 2023. This is done to avoid possible file-system and hardware failures while UCSF Facilities performs power-system maintenance. During this downtime, we will perform cluster maintenance. Starting February 8, the maximum job runtime will be decreased on a daily basis from the current 14 days so that jobs finish in time. Jobs with runtimes going into the maintenance window will be started after the downtime.
February 9, 09:00 PST

January 24, 2023 #

No access to login and data-transfer hosts #

Resolved: The network issues have been resolved and access to all login and data-transfer hosts has been re-established. The problem was physical (a cable was disconnected).
January 24, 16:00 PST

Notice: There is no access to non-PHI login and data-transfer hosts (log[1-2], dt[1-2]). We suspect a physical issue (e.g. somebody kicked a cable), which means we need to send someone onsite to fix the problem.
January 24, 14:45 PST

January 11, 2023 #

No internet access on development nodes #

Resolved: The network issue for the proxy servers has been fixed. All development nodes now have working internet access.
January 11, 16:00 PST

Workarounds: Until this issue has been resolved, and depending on your needs, you might try to use a data-transfer node. Some of the software tools on the development nodes are also available on the data-transfer nodes, e.g. curl, wget, and git.
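For example, to log in to a data-transfer node (the username here is illustrative, and dt2's full hostname is assumed to follow the same pattern as dt1.wynton.ucsf.edu):

    ssh alice@dt2.wynton.ucsf.edu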
January 11, 09:50 PST

Notice: The development nodes have no internet access, because the network used by our proxy servers is down for unknown reasons. The problem most likely started on January 10 around 15:45.
January 11, 09:00 PST

Operational Summary for 2022 #

  • Full downtime:

    • Scheduled: 94.0 hours = 3.9 days = 1.1%
    • Unscheduled: 220.0 hours = 9.2 days = 2.5%
    • Total: 314.0 hours = 13.1 days = 3.6%
    • External factors: 36% of the above downtime, corresponding to 114 hours (= 4.8 days), was due to external factors

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    • 2022-02-08 (53.5 hours)
    • 2022-09-27 (40.5 hours)
  • Total downtime: 94.0 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    1. 2022-08-05 (up to 14 days)

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    • 2022-09-06 (66 hours)
  • Total downtime: 66 hours of which 66 hours were due to external factors

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    • 2022-03-28 (1 hour): Major BeeGFS issues
    • 2022-03-26 (5 hours): Major BeeGFS issues
    • 2022-03-18 (100 hours): Major BeeGFS issues
  • Total downtime: 106.0 hours of which 0 hours were due to external factors

Unscheduled downtimes due to other reasons #

  • Impact: Fewer compute resources
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    • 2022-03-26 (48 hours): Data-center cooling issues
  • Total downtime: 48 hours of which 48 hours were due to external factors

Accounts #

  • Number of user accounts: 1,643 (change: +369 during the year)

November 2, 2022 #

Major BeeGFS issues #

Resolved: The BeeGFS issues have been resolved. At 05:29 this morning, a local file system hosting one of our 12 BeeGFS meta daemons crashed. Normally, BeeGFS detects this and redirects processing to a secondary, backup daemon. In this incident, this failover did not get activated and manual intervention was needed.
November 2, 09:30 PDT

Notice: The BeeGFS file system started to experience issues early in the morning on Wednesday 2022-11-02. The symptoms are missing files and folders.
November 2, 08:15 PDT

November 1, 2022 #

Scheduler not available #

Resolved: The job scheduler is responsive again, but we are not certain what caused the problem. We will keep monitoring the issue.
November 1, 16:30 PDT

Notice: The job scheduler, SGE, does not respond to user requests, e.g. qstat and qsub. No new jobs can be submitted at this time. The first reports on problems came in around 09:00 this morning. We are troubleshooting the problem.
November 1, 10:25 PDT

September 27-29, 2022 #

Full downtime #

Resolved: The cluster maintenance has completed and the cluster is now fully operational again.
September 29, 13:30 PDT

Update: The cluster has been shut down for maintenance.
September 27, 21:00 PDT

Notice: Wynton will be shut down on Tuesday September 27, 2022 at 21:00. We expect the cluster to be back up by the end of the workday on Thursday September 29. This is done to avoid file-system and hardware failures that otherwise may occur when UCSF Facilities performs maintenance on the power system in Byers Hall. We will take the opportunity to perform cluster maintenance after the completion of the power-system maintenance.
September 14, 17:00 PDT

September 6-9, 2022 #

Outage following campus power glitch #

Resolved: As of 09:20 on 2022-09-09, the cluster is back in full operation. The queues are enabled, jobs are running, and the development nodes are accepting logins.
September 9, 09:35 PDT

Update: Login and data-transfer nodes are disabled to minimize the risk of file corruption.
September 7, 12:45 PDT

Notice: The Wynton system is experiencing system-wide issues, including with the file system, due to a campus power glitch. To minimize the risk of corrupting the file system, it was decided to shut down the job scheduler and terminate all running jobs. The power outage at the Mission Bay campus happened at 15:13. Although diesel-generated backup power started up momentarily, the glitch was enough to affect some of our servers. The job scheduler will be offline until the impact on Wynton is fully investigated.
September 6, 16:20 PDT

August 5-9, 2022 #

Kernel maintenance #

Resolved: All compute nodes have been rebooted.
Aug 9, 12:00 PDT

Notice: New operating-system kernels are deployed. Login, data-transfer, and development nodes will be rebooted on Monday August 8 at 14:00. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer than usual slots available on the queues. To follow the progress, see the green ‘Available CPU cores’ curve (target ~14,500 cores) in the graph above.
Aug 5, 10:30 PDT

August 4, 2022 #

Software repository maintenance #

Resolved: The Sali lab software module repository is back.
Aug 4, 12:00 PDT

Notice: The Sali lab software module repository will be unavailable from around 10:30-11:30 today August 4 for maintenance.
Aug 4, 03:30 PDT

March 28-April 6, 2022 #

Major BeeGFS issues #

Resolved: The patch to the BeeGFS servers was successfully deployed by 14:30 and went without disruptions. As a side effect, rudimentary benchmarking shows that this patch also improves the overall performance. Since the troubleshooting, bug fixing, and testing started on 2022-03-28, we managed to keep the impact of the bugs to a minimum, resulting in only one hour of BeeGFS stall.
April 6, 17:00 PDT

Update: The BeeGFS servers will be updated tomorrow April 6 at 14:00. The cluster should work as usual during the update.
April 5, 17:00 PDT

Update: Our load tests over the weekend went well. Next, we will do discrepancy validation tests between our current version and the patch versions. When those pass, we will do a final confirmation with the BeeGFS vendor. We hope to deploy the patch to Wynton in a few days.
April 4, 10:30 PDT

Update: After a few rounds, we now have a patch that we have confirmed works on our test BeeGFS system. The plan is to do additional high-load testing today and over the weekend.
April 1, 10:30 PDT

Update: The BeeGFS vendors will send us a patch by tomorrow, Tuesday, which we will test on our separate BeeGFS test system. After it has been validated there, we will deploy it to the main system. We hope to have the patch deployed by the end of the week.
March 28, 11:30 PDT

Update: We have re-enabled the job scheduler after manually resolving the BeeGFS meta-server issues. We will keep monitoring the problem and send more debug data to the BeeGFS vendors.
March 28, 11:00 PDT

Notice: On Monday morning 2022-03-28 at 10:30, BeeGFS hung again. We have put a hold on the job scheduler for now.
March 28, 10:30 PDT

March 26, 2022 #

Job scheduler is disabled due to cooling issues #

Resolved: The compute nodes and the job scheduler are up and running again.
March 26, 11:00 PDT

Notice: The job scheduler was disabled and running jobs were terminated on Saturday 2022-03-26 around 09:00. This was an emergency shutdown: the ambient temperature in the data center started to rise around 08:00, and at 09:00 it hit the critical level at which our monitoring system automatically shuts down compute nodes to prevent further damage. This brought the room temperature back down to normal levels. We are waiting on UCSF Facilities to restore cooling in the data center.
March 26, 10:30 PDT

March 26, 2022 #

Major BeeGFS issues #

Resolved: Just after 03:00 on Saturday morning 2022-03-26, BeeGFS hung. Recovery actions were taken at 07:30 and the problem was resolved before 08:00. We have tracked the problem down to when a user runs more than one rm -r /wynton/path/to/folder concurrently on the same folder. This is a bug in BeeGFS that the vendor is aware of.
March 26, 10:30 PDT

March 18-22, 2022 #

Job scheduler is disabled because of BeeGFS issues #

Resolved: We have re-enabled the job scheduler, which now processes all queued jobs. We will keep working with the BeeGFS vendor to find a solution to avoid this issue from happening again.
March 22, 16:30 PDT

Update: The BeeGFS issue has been identified. We identified a job that appears to trigger a bug in BeeGFS, which we can reproduce. The BeeGFS vendor will work on a bug fix. The good news is that the job script that triggers the problem can be tweaked to avoid hitting the bug. This means we can enable the job scheduler as soon as all BeeGFS metadata servers have synchronized, which we expect to take a few hours.
March 22, 12:00 PDT

Update: The BeeGFS file system troubleshooting continues. The job queue is still disabled. You might experience login and non-responsive prompt issues while we troubleshoot this. We have met with the BeeGFS vendors this morning and we are collecting debug information to allow them to troubleshoot the problem on their end. At the same time, we hope to narrow in on the problem further on our end by trying to identify whether there is a particular job or software running on the queue that might cause this. Currently, we have no estimate when this problem will be fixed. We have another call scheduled with the vendor tomorrow morning.
March 21, 11:45 PDT

Update: The BeeGFS file system is back online and the cluster can be accessed again. However, we had to put SGE in maintenance mode, which means no jobs will be started until the underlying problem, which is still unknown, has been identified and resolved. The plan is to talk to the BeeGFS vendor as soon as possible after the weekend. Unfortunately, in order to stabilize BeeGFS, we had to kill, at 16:30 today, all running jobs and requeue them on the SGE job scheduler. They are now listed as status ‘Rq’. For troubleshooting purposes, please do not delete any of your ‘Rq’ jobs.
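To check the state of your jobs, you can use the standard SGE query command, e.g.:

    qstat -u "$USER"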
March 18, 17:05 PDT

Notification: The Wynton environment cannot be accessed at the moment. This is because the global file system, BeeGFS, is experiencing issues. The problem, which started around 11:45 today, is being investigated.
March 18, 11:55 PDT

March 14-15, 2022 #

Brief network outage #

Notice: UCSF Network IT will be performing maintenance on several network switches in the evening and overnight on Monday March 14. This will not affect jobs running on the cluster. One of the switches is the one that provides Wynton with external network access. When that switch is rebooted, Wynton will be inaccessible for about 15 minutes. This is likely to happen somewhere between 22:00 and 23:00 that evening, but the outage window extends from 21:00 to 05:00 the following morning, so it could take place anywhere in that window.
March 11, 10:15 PST

February 28-March 2, 2022 #

Full downtime #

Resolved: Wynton is available again.
March 2, 15:30 PST

Update: The Wynton environment is now offline for maintenance work.
February 28, 10:00 PST

Clarification: The shutdown will take place early Monday morning February 28, 2022. Also, this is on a Monday and not on a Tuesday (as previously written below).
February 22, 11:45 PST

Update: We confirm that this downtime will take place as scheduled.
February 14, 15:45 PST

Notice: We are planning a full file-system maintenance starting on Monday February 28, 2022. As this requires a full shutdown of the cluster environment, we will start draining the job queue on February 14, two weeks prior to the shutdown. On February 14, jobs that require 14 days or less to run will be launched. On February 15, only jobs that require 13 days or less will be launched, and so on until the day of the downtime. Submitted jobs that would go into the downtime window if launched will only be launched after the downtime window.
November 22, 11:45 PST

Operational Summary for 2021 #

  • Full downtime:

    • Scheduled: 64 hours = 2.7 days = 0.73%
    • Unscheduled: 58 hours = 2.4 days = 0.66%
    • Total: 122 hours = 5.1 days = 1.4%
    • External factors: 39% of the above downtime, corresponding to 47 hours (= 2.0 days), was due to external factors

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    1. 2021-05-25 (64 hours)
  • Total downtime: 64 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    1. 2021-01-29 (up to 14 days)
    2. 2021-07-23 (up to 14 days)
    3. 2021-12-08 (up to 14 days)

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    • 2021-08-26 (28 hours) - Planned Byers Hall power shutdown failed
    • 2021-11-09 (10 hours) - Unplanned PG&E power outage
  • Total downtime: 38 hours of which 38 hours were due to external factors

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    1. 2021-03-26 (9 hours) - Campus networks issues causing significant BeeGFS slowness
    2. 2021-07-23 (8 hours) - BeeGFS silently failed disks
    3. 2021-11-05 (3 hours) - BeeGFS non-responsive
  • Total downtime: 20 hours of which 9 hours were due to external factors

Unscheduled downtimes due to other reasons #

  • Impact: Fewer compute resources
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    1. 2021-04-28 (210 hours) - GPU nodes taken down due to server-room cooling issues
  • Total downtime: 0 hours

Accounts #

  • Number of user accounts: 1,274 (change: +410 during the year)

December 8-December 23, 2021 #

Kernel maintenance #

Resolved: All compute nodes have been rebooted.
Dec 23, 12:00 PST

Notice: New operating-system kernels are deployed. Login, data-transfer, and development nodes will be rebooted tomorrow Thursday December 9 at 11:00. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer than usual slots available on the queues. To follow the progress, see the green ‘Available CPU cores’ curve (target ~12,500 cores) in the graph above.
Dec 8, 16:30 PST

December 19-21, 2021 #

Globus and data-transfer node issue #

Resolved: Data-transfer node dt1 and Globus file transfers are working again.
Dec 21, 13:20 PST

Update: Globus file transfers to and from Wynton are not working. This is because Globus relies on the data-transfer node dt1, which is currently down.
Dec 20, 15:30 PST

Notice: Data-transfer node dt1 has issues. Please use dt2 until resolved. The first report on this problem came yesterday at 21:30.
Dec 20, 09:30 PST

November 9, 2021 #

Partial outage due to campus power glitch #

Resolved: All hosts have been rebooted and are now up and running.
November 9, 11:00 PST

Notice: There was a brief PG&E power outage early Tuesday November 9 around 01:20. This affected the power on the Mission Bay campus, including the data center housing Wynton. The parts of our system with redundant power were fine, but many of the compute nodes are on PG&E power only and, therefore, went down. As a result, lots of jobs crashed. We will manually restart the crashed nodes during the day today.
November 9, 09:10 PST

October 25-26, 2021 #

File-system maintenance #

Resolved: Resynchronization of all file-system meta servers is complete, which concludes the maintenance.
October 26, 09:45 PDT

Update: The maintenance work has started.
October 25, 14:00 PDT

Notice: We will perform BeeGFS maintenance work starting Monday October 25 at 2:00 pm. During this work, the filesystem might be less performant. We don’t anticipate any downtime.
October 21, 12:10 PDT

August 26-September 10, 2021 #

Byers Hall power outage & file-system corruption #

Resolved: The corrupted filesystem has been recovered.
September 10, 17:20 PDT

Partially resolved: Wynton is back online but the problematic BeeGFS filesystem is kept offline, which affects access to some of the folders and files hosted on /wynton/group/. The file recovery tools are still running.
August 27, 13:05 PDT

Update: The BeeGFS filesystem recovery attempt keeps running. The current plan is to bring Wynton back online while keeping the problematic BeeGFS filesystem offline.
August 26, 23:05 PDT

Update: All of the BeeGFS servers are up and running, but one of the 108 filesystems that make up BeeGFS was corrupted by the sudden power outage. The bad filesystem is part of /wynton/group/. We estimate that 70 TB of data is affected. We are making every possible effort to restore this filesystem, which will take time. While we do so, Wynton will remain down.
August 26, 14:05 PDT

Notice: The cluster is down after an unplanned power outage in the main data center. The power is back online but several of our systems, including BeeGFS servers, did not come back up automatically and will require on-site, manual actions.
August 26, 09:15 PDT

July 23-July 28, 2021 #

Kernel maintenance #

Resolved: The majority of the compute nodes have been rebooted after only four days, which was quicker than the maximum of 14 days.
July 28, 08:00 PDT

Notice: New operating-system kernels are deployed. Login, data-transfer, and development nodes will be rebooted at 13:00 (1:00 pm) on Friday July 23. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer than usual slots available on the queues. To follow the progress, see the green ‘Available CPU cores’ curve (target ~10,400 cores) in the graph above.
July 23, 07:40 PDT

June 24, 2021 #

Cluster not accessible (due to BeeGFS issues) #

Resolved: Wynton and BeeGFS are back online. The problem was due to failed disks. Unfortunately, about 10% of the space in /wynton/scratch/ went bad, meaning some files are missing or corrupted. It is neither possible to recover them nor to identify which files or folders are affected. In other words, expect some oddness if you had data under /wynton/scratch/. There will also be some hiccups over the next several days as we get everything in ZFS and BeeGFS back into as stable a state as possible.
June 24, 14:55 PDT

Update: We’re working hard on getting BeeGFS back up. We were not able to recover the bad storage target, so it looks like there will be some data loss on /wynton/scratch/. More updates soon.
June 24, 13:45 PDT

Notification: The Wynton environment cannot be accessed at the moment. This is because the global file system, BeeGFS, has been experiencing issues since early this morning. The problem is being investigated.
June 24, 07:00 PDT

May 25-June 7, 2021 #

Full downtime (major maintenance) #

Resolved: All remaining issues from the downtime have been resolved.
June 7, 17:00 PDT

Update: Login node log2 can now be reached from the UCSF Housing WiFi network.
June 7, 17:00 PDT

Update: dt2 can now be reached from outside the Wynton cluster.
June 7, 13:15 PDT

Update: Login node log2 cannot be reached from the UCSF Housing WiFi network. If you are on that network, use log1 until this has been resolved.
June 2, 07:00 PDT

Update: Both data-transfer nodes have been back online for a while, but dt2 can only be reached from within the Wynton cluster.
June 1, 13:45 PDT

Update: A large number of the remaining compute nodes have been booted up. There are now ~8,600 cores serving jobs.
June 1, 10:15 PDT

Update: The development nodes are now back too. For the PHI pilot project, development node pgpudev1 is back up, but pdev1 is still down.
May 28, 10:00 PDT

Update: Wynton is partially back up and running. Both login hosts are up (log1 and log2). The job scheduler, SGE, accepts new jobs and launches queued jobs. Two thirds of the compute-node slots are back up serving jobs. Work is underway to bring up the development nodes and the data-transfer hosts (dt1 and dt2).
May 27, 10:30 PDT

Update: We hit more than a few snags today. Our filesystem, BeeGFS, is up and running, but it still needs some work. The login hosts are up, but SGE is not and neither are the dev nodes. We will continue the work early tomorrow Thursday.
May 26, 21:40 PDT

Notice: The Wynton HPC environment will be shut down late afternoon on Tuesday May 25, 2021, for maintenance. We expect the cluster to be back online late Wednesday May 26. To allow for an orderly shutdown of Wynton, the queues have been disabled starting at 3:30 pm on May 25. Between now and then, only jobs whose runtimes end before that time will be able to start. Jobs whose runtimes would run into the maintenance window will remain in the queue.
May 10, 16:40 PDT

Preliminary notice: The Wynton HPC cluster will be undergoing a major upgrade on Wednesday May 26, 2021. As usual, starting 15 days prior to this day, on May 11, the maximum job runtime will be decreased on a daily basis so that all jobs finish in time; e.g. if you submit a job on May 16 with a runtime longer than nine days, it will not be able to be scheduled and will be queued until after the downtime.
May 3, 11:00 PDT

June 1-2, 2021 #

Password management outage #

Resolved: Password updates work again.
June 2, 10:30 PDT

Notice: Due to technical issues, it is currently not possible to change your Wynton password. If attempted from the web interface, you will get the error “Password change not successful! (kadmin: Communication failure with server while initializing kadmin interface)”. If attempted using ‘passwd’, you will get “passwd: Authentication token manipulation error”.
June 1, 10:30 PDT

April 28 - May 7, 2021 #

Many GPU nodes down (due to cooling issues) #

Resolved: Cooling has been restored and all GPU nodes are back online again.
May 7, 11:10 PDT

Update: Half of the GPU nodes that were taken down are back online. Hopefully, the remaining ones can be brought back up tomorrow, when the cooling in the server room should be fully functioning again.
May 6, 14:30 PDT

Notification: One of Wynton’s ancillary server rooms is having cooling issues. To reduce the heat load in the room, we had to turn off all the Wynton nodes in the room around 09:45 this morning. This affects GPU nodes named msg*gpu* and a few other regular nodes. We expect UCSF Facilities to fix the cooling problem by early next week.
April 28, 16:30 PDT

March 26, 2021 #

Cluster not accessible (due to network outage) #

Resolved: The malfunctioning network link between two of Wynton’s data centers, which affected our BeeGFS file system and Wynton HPC as a whole, has been restored.
March 26, 21:30 PDT

Notification: Campus network issues are causing major Wynton HPC issues, including extremely slow access to our BeeGFS file system. This was first reported around 11:30 today. A ticket has been filed with the UCSF Network team. ETA is unknown.
March 26, 12:30 PDT

January 29-February 12, 2021 #

Kernel maintenance #

Resolved: Almost all compute nodes have been rebooted. A few compute nodes that have to be rebooted manually remain offline; this will be done as opportunity allows.
February 13, 09:00 PST

Notice: New operating-system kernels are deployed. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer than usual slots available on the queues. To follow the progress, see the green ‘Available CPU cores’ curve (target ~10,400 cores) in the graph above. Login, data-transfer, and development nodes will be rebooted at 13:00 on Monday February 1.
January 31, 17:00 PST

February 1-3, 2021 #

Development node not available #

Resolved: Development node dev2 is available again.
February 3, 15:00 PST

Notice: Development node dev2 is down. It failed to come back up after the kernel upgrade on 2021-02-01. An on-site reboot is planned for Wednesday February 3.
February 2, 11:45 PST

January 28, 2021 #

Server room maintenance #

Notice: The air conditioning system in one of our server rooms will be upgraded on January 28. The compute nodes in this room will be powered down during the upgrade resulting in fewer compute nodes being available on the cluster. Starting 14 days prior to this date, compute nodes in this room will only accept jobs that will finish in time.
January 13, 10:00 PST

Operational Summary for 2020 #

  • Full downtime:

    • Scheduled: 123 hours = 5.1 days = 1.4%
    • Unscheduled: 91.5 hours = 3.8 days = 1.0%
    • Total: 214.5 hours = 8.9 days = 2.4%
    • External factors: 12% of the above downtime, corresponding to 26.5 hours (= 1.1 days), was due to external factors

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    1. 2020-08-10 (93 hours)
    2. 2020-12-07 (30 hours)
  • Total downtime: 123 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    1. 2020-06-11 (up to 14 days)
    2. 2020-12-11 (up to 14 days)

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    • None
  • Total downtime: 0 hours

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    1. 2020-01-22 (2.5 hours) - BeeGFS failure due to failed upgrade
    2. 2020-01-29 (1.0 hours) - BeeGFS non-responsive
    3. 2020-02-05 (51.5 hours) - Legacy NetApp file system failed
    4. 2020-05-22 (0.5 hours) - BeeGFS non-responsive due to failed upgrade
    5. 2020-08-19 (1.5 hours) - BeeGFS non-responsive
    6. 2020-10-21 (3 hours) - BeeGFS non-responsive
    7. 2020-11-05 (3 hours) - BeeGFS non-responsive
  • Total downtime: 63.0 hours

Unscheduled downtimes due to other reasons #

  • Impact: Fewer compute resources
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    1. 2020-05-28 (26.5 hours) - MSG Data Center outage affecting many GPU compute nodes
    2. 2020-07-04 (2 hours) - SGE scheduler failed
    3. 2020-11-04 (288 hours) - ~80 compute nodes lost due to network switch failure
  • Total downtime: 28.5 hours of which 26.5 hours were due to external factors

Accounts #

  • Number of user accounts: 864 (change: +386 during the year)

December 8-17, 2020 #

Limited accessibility of Login node log1 #

Resolved: Login node ‘log1.wynton.ucsf.edu’ can again be accessed from outside of the UCSF network.
December 17, 14:20 PST

Notice: Login node ‘log1.wynton.ucsf.edu’ is only accessible from within the UCSF network. This is a side effect of the recent network upgrades. We are waiting for the UCSF IT Network team to resolve this for us. Until resolved, please use the alternative ‘log2.wynton.ucsf.edu’ login node when connecting from outside of the UCSF network.
December 8, 23:00 PST

December 11-16, 2020 #

Rebooting compute nodes #

Resolved: All compute nodes have been rebooted.
December 16, 05:00 PST

Notice: The new BeeGFS settings introduced during the upgrades earlier this week caused problems throughout the system and we need to roll them back. The compute nodes will no longer take on new jobs until they have been rebooted. A compute node will be automatically rebooted as soon as all of its running jobs have completed. Unfortunately, we have to kill jobs that run on compute nodes that are stalled and suffer from the BeeGFS issues.
December 11, 13:50 PST

December 11, 2020 #

Rebooting login and development nodes #

Resolved: All login and development nodes have been rebooted.
December 12, 17:00 PST

Notice: Login node ‘log1.wynton.ucsf.edu’ and all the development nodes will be rebooted at 4:30 PM today, Friday. This is needed in order to roll back the new BeeGFS settings introduced during the upgrades earlier this week.
December 11, 13:50 PST

December 7-8, 2020 #

Major upgrades (full downtime) #

Resolved: The upgrade has been completed. The cluster is back online, including all of the login, data-transfer, and development nodes, as well as the majority of the compute nodes. The scheduler is processing jobs again. All hosts now run CentOS 7.9.
December 8, 16:30 PST

Update: The upgrade is paused and will resume tomorrow. We hope to bring all of the cluster back online by the end of tomorrow. For now, login node ‘log2’ (but not ‘log1’) and data-transfer nodes ‘dt1’ and ‘dt2’ are back online and can be used for accessing files. Development nodes ‘dev1’ and ‘dev3’ are also available (please make sure to leave room for others). The scheduler remains down, i.e. it is not possible to submit jobs.
December 7, 17:00 PST

Update: The upgrades have started. Access to Wynton HPC has been disabled as of 10:30 this morning. The scheduler stopped launching queued jobs as of 23:30 last night.
December 7, 10:30 PST

Revised notice: We have decided to hold back on upgrading BeeGFS during the downtime and only focus on the remaining parts, including operating-system and network upgrades. The scope of the work is still non-trivial. There is a risk that the downtime will extend into Thursday December 10. However, if everything goes smoothly, we hope that Wynton HPC will be back up by the end of Monday or during Tuesday. There will only be one continuous downtime; that is, when the cluster comes back up, it will stay up.
December 3, 09:00 PST

Notice: Starting early Monday December 7, the cluster will be powered down entirely for maintenance and upgrades, which include upgrading the operating system, the network, and the BeeGFS file system. We anticipate that the cluster will be available again by the end of Tuesday December 8, when load testing of the upgraded BeeGFS file system will start. If these tests fail, we will have to roll back the BeeGFS upgrade, in which case we anticipate that the cluster will be back online by the end of Wednesday December 9.
November 23, 16:50 PST

November 4-16, 2020 #

Compute nodes not serving jobs (due to network switch failure) #

Resolved: All 74 compute nodes that were taken off the job scheduler on 2020-11-04 are back up and running.
November 16, 12:00 PST

Notice: 74 compute nodes, including several GPU nodes, were taken off the job scheduler around 14:00 on 2020-11-04 due to a faulty network switch. The network switch needs to be replaced in order to resolve this.
November 4, 16:10 PST

November 5, 2020 #

Cluster inaccessible (due to BeeGFS issues) #

Resolved: Our BeeGFS file system was non-responsive during 01:20-04:00 on 2020-11-05 because one of the meta servers hung.
November 5, 08:55 PST

October 21, 2020 #

Cluster inaccessible (due to BeeGFS issues) #

Resolved: Our BeeGFS file system was non-responsive because one of its meta servers hung, which now has been restarted.
October 21, 11:15 PDT

Notice: The cluster is currently inaccessible for unknown reasons. The problem was first reported around 09:30 today.
October 21, 10:45 PDT

August 19, 2020 #

Cluster inaccessible (due to BeeGFS issues) #

Resolved: Our BeeGFS file system was non-responsive between 17:22 and 18:52 today because one of its meta servers hung while the other attempted to synchronize to it.
August 19, 19:00 PDT

Notice: The cluster is currently inaccessible for unknown reasons. The problem was first reported around 17:30 today.
August 19, 18:15 PDT

August 10-13, 2020 #

Network and hardware upgrades (full downtime) #

Resolved: The cluster is fully back up and running. Several compute nodes still need to be rebooted, but we consider this upgrade cycle completed. The network upgrade took longer than expected, which delayed the process. We hope to bring the new lab storage online during the next week.
August 13, 21:00 PDT

Update: All login, data-transfer, and development nodes are online. Additional compute nodes are being upgraded and will soon be part of the pool serving jobs.
August 13, 14:50 PDT

Update: Login node log1, data-transfer node dt2, and the development nodes are available again. Compute nodes are going through an upgrade cycle and will soon start serving jobs again. The upgrade work is taking longer than expected and will continue tomorrow Thursday August 13.
August 12, 16:10 PDT

Notice: All of the Wynton HPC environment is down for maintenance and upgrades.
August 10, 00:00 PDT

Notice: Starting early Monday August 10, the cluster will be powered down entirely for maintenance and upgrades, which includes upgrading the network and adding lab storage purchased by several groups. We anticipate that the cluster will be available again by the end of Wednesday August 12.
July 24, 15:45 PDT

July 6, 2020 #

Development node failures #

Resolved: All three development nodes have been rebooted.
July 6, 15:20 PDT

Notice: The three regular development nodes have all gotten themselves hung up on one particular process. This affects basic system operations and prevents such basic commands as ps and w. To clear this state, we’ll be doing an emergency reboot of the dev nodes at about 15:15.
July 6, 15:05 PDT

July 5, 2020 #

Job scheduler non-working #

Resolved: The SGE scheduler produced errors when queried or when jobs were submitted or launched. The problem started at 00:30 and lasted until 02:45 early Sunday 2020-07-05.
July 6, 22:00 PDT

June 11-26, 2020 #

Kernel maintenance #

Resolved: All compute nodes have been rebooted.
June 26, 10:45 PDT

Update: Development node dev3 is back online.
June 15, 11:15 PDT

Update: Development node dev3 is not available. It failed to reboot and requires on-site attention, which might not be possible for several days. All other log-in, data-transfer, and development nodes were rebooted successfully.
June 11, 15:45 PDT

Notice: New operating-system kernels are deployed. Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer than usual slots available on the queues. To follow the progress, see the green ‘Available CPU cores’ curve (target ~10,400 cores) in the graph above. Log-in, data-transfer, and development nodes will be rebooted at 15:30 on Thursday June 11.
June 11, 10:45 PDT

June 5-9, 2020 #

No internet access on development nodes #

Resolved: Internet access from the development nodes is available again. A new web-proxy server had to be built and deployed.
June 9, 09:15 PDT

Notice: Internet access from the development nodes is not available. This is because the proxy server providing them with internet access had a critical hardware failure around 08:00-09:00 this morning. At the moment, we cannot provide an estimate for when we will be able to restore this server.
June 5, 16:45 PDT

May 18-22, 2020 #

File-system maintenance #

Update: The upgrade of the BeeGFS filesystem introduced new issues. We decided to roll back the upgrade and we are working with the vendor. There is no upgrade planned for the near term.
June 8, 09:00 PDT

Update: The BeeGFS filesystem has been upgraded using a patch from the vendor. The patch was designed to lower the amount of resynchronization needed between the two metadata servers. Unfortunately, after the upgrade we observed an increase in resynchronization. We will keep monitoring the status. If the problem remains, we will consider a rollback to the BeeGFS version used prior to May 18.
May 22, 01:25 PDT

Update: For a short moment around 01:00 early Friday, both of our BeeGFS metadata servers were down. This may have led to some applications experiencing I/O errors around this time.
May 22, 01:25 PDT

Notice: Work to improve the stability of the BeeGFS filesystem (/wynton) will be conducted during the week of May 18-22. This involves restarting the eight pairs of metadata server processes, which may result in several brief stalls of the file system. Each should last less than 5 minutes and operations will continue normally after each one.
May 6, 15:10 PDT

May 28-29, 2020 #

GPU compute nodes outage #

Resolved: The GPU compute nodes are now fully available to serve jobs.
May 29, 12:00 PDT

Update: The GPU compute nodes that went down yesterday have been rebooted.
May 29, 11:10 PDT

Investigating: A large number of GPU compute nodes in the MSG data center are currently down for unknown reasons. We are investigating the cause.
May 28, 09:35 PDT

February 5-7, 2020 #

Major outage due to NetApp file-system failure #

Resolved: The Wynton HPC system is considered fully functional again. The legacy, deprecated NetApp storage was lost.
February 10, 10:55 PST

Update: The majority of the compute nodes have been rebooted and are now online and running jobs. We will actively monitor the system and assess how everything works before we consider this incident resolved.
February 7, 13:40 PST

Update: The login, development and data transfer nodes will be rebooted at 01:00 today Friday February 7.
February 7, 12:00 PST

Update: The failed legacy NetApp server is the cause of the problems, e.g. compute nodes becoming non-responsive and causing problems for SGE. Because of this, all of the cluster - login, development, transfer, and compute nodes - will be rebooted tomorrow, Friday 2020-02-07.
February 6, 10:00 PST

Notice: Wynton HPC is experiencing major issues due to a NetApp file-system failure, despite this system being deprecated and not used much these days. The first user report on this came in around 09:00 and the job-queue logs suggest the problem began around 02:00. It will take a while for everything to come back up, and there will be a brief BeeGFS outage while we reboot the BeeGFS management node.
February 5, 10:15 PST

January 29, 2020 #

BeeGFS failure #

Resolved: The BeeGFS file-system issue has been resolved by rebooting two meta servers.
January 29, 17:00 PST

Notice: There is currently an issue with the BeeGFS file system. Users are reporting that they cannot log in.
January 29, 16:00 PST

January 22, 2020 #

File-system maintenance #

Resolved: The BeeGFS upgrade issue has been resolved.
Jan 22, 14:30 PST

Update: The planned upgrade caused unexpected problems to the BeeGFS file system resulting in /wynton/group becoming unstable.
Jan 22, 13:35 PST

Notice: One of the BeeGFS servers, which serve our cluster-wide file system, will be swapped out starting at noon (11:59am) on Wednesday January 22, 2020 and the work is expected to last one hour. We don’t anticipate any downtime because the BeeGFS servers are mirrored for availability.
Jan 16, 14:40 PST

December 20, 2019 - January 4, 2020 #

Kernel maintenance #

Resolved: All compute nodes have been updated and rebooted.
Jan 4, 11:00 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~7,500 cores) in the graph above. Log-in, data-transfer, and development nodes will be rebooted at 15:30 on Friday December 20. GPU nodes already run the new kernel and are not affected.
December 20, 10:20 PST

Operational Summary for 2019 #

  • Full downtime:

    • Scheduled: 96 hours = 4.0 days = 1.1%
    • Unscheduled: 83.5 hours = 3.5 days = 1.0%
    • Total: 179.5 hours = 7.5 days = 2.0%
    • External factors: 15% of the above downtime, corresponding to 26 hours (= 1.1 days), was due to external factors

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    1. 2019-01-09 (1.0 hours) - job scheduler updates
    2. 2019-07-08 (95 hours)
  • Total downtime: 96.0 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    1. 2019-01-22 (up to 14 days)
    2. 2019-03-21 (up to 14 days)
    3. 2019-10-29 (up to 14 days)
    4. 2019-12-22 (up to 14 days)

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    1. 2019-07-30 (6.5 hours) - Byers Hall power outage
    2. 2019-08-15 (5.5 hours) - Diller power outage
    3. 2019-10-25 (1.0 hour) - Byers Hall power outage
    4. 2019-10-22 (13.0 hours) - Diller power backup failed during power maintenance
  • Total downtime: 26.0 hours of which 26.0 hours were due to external factors

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    1. 2019-01-08 (2.0 hours) - BeeGFS server non-responsive
    2. 2019-01-14 (1.5 hours) - BeeGFS non-responsive
    3. 2019-05-15 (24.5 hours) - BeeGFS non-responsive
    4. 2019-05-17 (5.0 hours) - BeeGFS slowdown
    5. 2019-06-17 (10.5 hours) - BeeGFS non-responsive
    6. 2019-08-23 (4.0 hours) - BeeGFS server non-responsive
    7. 2019-09-24 (3.0 hours) - BeeGFS server non-responsive
    8. 2019-12-18 (3.5 hours) - Network switch upgrade
    9. 2019-12-22 (5.5 hours) - BeeGFS server non-responsive
  • Total downtime: 58.5 hours of which 0 hours were due to external factors

Accounts #

  • Number of user accounts: 478 (change: +280 during the year)

December 20, 2019 - January 4, 2020 #

Kernel maintenance #

Resolved: All compute nodes have been updated and rebooted.
Jan 4, 11:00 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~7,500 cores) in the graph above. Log-in, data-transfer, and development nodes will be rebooted at 15:30 on Friday December 20. GPU nodes already run the new kernel and are not affected.
December 20, 10:20 PST

December 22, 2019 #

BeeGFS failure #

Resolved: No further hiccups were needed during the BeeGFS resynchronization. Everything is working as expected.
December 23, 10:00 PST

Update: The login issues occurred because one of the BeeGFS file servers became unreliable around 04:20. Rebooting that server resolved the problem. The cluster is fully functional again, although slower than usual until the file system has been resynced. After this, there might be a need for one more, brief, reboot.
December 22, 14:40 PST

Notice: It is not possible to log in to the Wynton HPC environment. The reason is currently not known.
December 22, 09:15 PST

December 18, 2019 #

Network/login issues #

Resolved: The Wynton HPC environment is fully functional again. The BeeGFS filesystem was not working properly during 18:30-22:10 on December 18 resulting in no login access to the cluster and job file I/O being backed up.
December 19, 08:50 PST

Update: The BeeGFS filesystem is non-responsive, which we believe is due to the network switch upgrade.
December 18, 21:00 PST

Notice: One of two network switches will be upgraded on Wednesday December 18 starting at 18:00 and lasting a few hours. We do not expect this to impact the Wynton HPC environment other than slowing down the network performance to 50%.
December 17, 10:00 PST

October 29-November 11, 2019 #

Kernel maintenance #

Resolved: All compute nodes have been updated and rebooted.
Nov 11, 01:00 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). GPU nodes will be rebooted as soon as all GPU jobs complete. During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target ~7,000 cores) in the graph above.
Oct 29, 16:30 PDT

October 25, 2019 #

Byers Hall power outage glitch #

Resolved: Development node qb3-dev2 was rebooted. Data-transfer node dt1.wynton.ucsf.edu is kept offline because it is scheduled to be upgraded next week.
October 28, 15:00 PDT

Update: Most compute nodes that went down due to the power glitch have been rebooted. Data-transfer node dt1.wynton.ucsf.edu and development node qb3-dev2 are still down; they will be brought back online on Monday October 28.
October 25, 14:00 PDT

Notice: A very brief power outage in the Byers Hall building caused several compute nodes in its Data Center to go down. Jobs that were running on those compute nodes at the time of the power failure unfortunately failed. Log-in, data-transfer, and development nodes were also affected. All these hosts are currently being rebooted.
October 25, 13:00 PDT

October 24, 2019 #

Login non-functional #

Resolved: Log in works again.
October 24, 09:45 PDT

Notice: It is not possible to log in to the Wynton HPC environment. This is due to a recent misconfiguration of the LDAP server.
October 24, 09:30 PDT

October 22-23, 2019 #

BeeGFS failure #

Resolved: The Wynton HPC BeeGFS file system is fully functional again. During the outage, /wynton/group and /wynton/scratch were not working properly, whereas /wynton/home was unaffected.
October 23, 10:35 PDT

Notice: The Wynton HPC BeeGFS file system is non-functional. It is expected to be resolved by noon on October 23. The underlying problem is that the power backup at the Diller data center did not work as expected during a planned power maintenance.
October 22, 21:45 PDT

September 24, 2019 #

BeeGFS failure #

Resolved: The Wynton HPC environment is up and running again.
September 24, 20:25 PDT

Notice: The Wynton HPC environment is unresponsive. The problem is being investigated.
September 24, 17:30 PDT

August 23, 2019 #

BeeGFS failure #

Resolved: The Wynton HPC environment is up and running again. The reason for this downtime was that a BeeGFS file server became unresponsive.
August 23, 20:45 PDT

Notice: The Wynton HPC environment is unresponsive.
August 23, 16:45 PDT

August 15, 2019 #

Power outage #

Resolved: The Wynton HPC environment is up and running again.
August 15, 21:00 PDT

Notice: The Wynton HPC environment is down due to an unplanned power outage at the Diller data center. Jobs running on compute nodes located in that data center were terminated. Jobs running elsewhere may also have been affected because /wynton/home went down as well (despite it being mirrored).
August 15, 15:45 PDT

July 30, 2019 #

Power outage #

Resolved: The Wynton HPC environment is up and running again.
July 30, 14:40 PDT

Notice: The Wynton HPC environment is down due to an unplanned power outage at the main data center.
July 30, 08:20 PDT

July 8-12, 2019 #

Full system downtime #

Resolved: The Wynton HPC environment and the BeeGFS file system are fully functional after updates and upgrades.
July 12, 11:15 PDT

Notice: The Wynton HPC environment is down for maintenance.
July 8, 12:00 PDT

Notice: Updates to the BeeGFS file system and the operating system that require bringing down all of Wynton HPC will start on the morning of Monday July 8. Please make sure to log out before then. The downtime might last the full week.
July 1, 14:15 PDT

June 17-18, 2019 #

Significant file-system outage #

Resolved: The BeeGFS file system is fully functional again.
June 18, 01:30 PDT

Investigating: Parts of /wynton/scratch and /wynton/group are currently unavailable. The /wynton/home space should be unaffected.
June 17, 15:05 PDT

May 17, 2019 #

Major outage due to file-system issues #

Resolved: The BeeGFS file system and the cluster are functional again.
May 17, 16:00 PDT

Investigating: There is a major slowdown of the BeeGFS file system (/wynton), which in turn causes significant problems throughout the Wynton HPC environment.
May 17, 10:45 PDT

May 15-16, 2019 #

Major outage due to file-system issues #

Resolved: The BeeGFS file system, and thereby also the cluster itself, is functional again.
May 16, 10:30 PDT

Investigating: The BeeGFS file system (/wynton) is experiencing major issues. This has caused all of Wynton HPC to become non-functional.
May 15, 10:00 PDT

May 15, 2019 #

Network/login issues #

Resolved: The UCSF-wide network issue that affected access to Wynton HPC has been resolved.
May 15, 15:30 PDT

Update: The login issue is related to UCSF-wide network issues.
May 15, 13:30 PDT

Investigating: There are issues logging in to Wynton HPC.
May 15, 10:15 PDT

March 21-April 5, 2019 #

Kernel maintenance #

Resolved: All compute nodes have been rebooted.
April 5, 12:00 PDT

Update: Nearly all compute nodes have been rebooted (~5,200 cores are now available).
Mar 29, 12:00 PDT

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 5,424 cores) in the graph above.
Mar 21, 15:30 PDT

March 22, 2019 #

Kernel maintenance #

Resolved: The login, development and transfer hosts have been rebooted.
March 22, 10:35 PDT

Notice: On Friday March 22 at 10:30am, all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. These hosts should be offline for less than 5 minutes.
Mar 21, 15:30 PDT

January 22-February 5, 2019 #

Kernel maintenance #

Resolved: All compute nodes have been rebooted.
Feb 5, 11:30 PST

Notice: Compute nodes will no longer accept new jobs until they have been rebooted. A node will be rebooted as soon as any existing jobs have completed, which may take up to two weeks (maximum runtime). During this update period, there will be fewer available slots on the queues than usual. To follow the progress, see the green ‘Available CPU cores’ curve (target 1,944 cores) in the graph above.
Jan 22, 16:45 PST

January 23, 2019 #

Kernel maintenance #

Resolved: The login, development and transfer hosts have been rebooted.
Jan 23, 13:00 PST

Notice: On Wednesday January 23 at 12:00 (noon), all of the login, development, and data transfer hosts will be rebooted. Please be logged out before then. The hosts should be offline for less than 5 minutes.
Jan 22, 16:45 PST

January 14, 2019 #

Blocking file-system issues #

Resolved: The file system under /wynton/ is back up again. We are looking into the cause and taking steps to prevent this from happening again.
Jan 14, 13:00 PST

Investigating: The file system under /wynton/ went down around 11:30, resulting in several critical failures, including the scheduler failing.
Jan 14, 11:55 PST

January 9, 2019 #

Job scheduler maintenance downtime #

Resolved: The SGE job scheduler is now back online and accepts new job submissions again.
Jan 9, 12:45 PST

Update: The downtime of the job scheduler will begin on Wednesday January 9 @ noon and is expected to be completed by 1:00pm.
Jan 8, 16:00 PST

Notice: There will be a short job-scheduler downtime on Wednesday January 9 due to SGE maintenance. During this downtime, already running jobs will keep running and queued jobs will remain in the queue, but no new jobs can be submitted.
Dec 20, 12:00 PST

January 8, 2019 #

File-system server crash #

Investigating: One of the parallel file-system servers (BeeGFS) appears to have crashed on Monday January 7 at 07:30 and was recovered at 9:20pm. Right now we are monitoring its stability and investigating the cause and what impact it might have had. Currently, we believe users might have experienced I/O errors on /wynton/scratch/, whereas /wynton/home/ was not affected.
Jan 8, 10:15 PST

Operational Summary for 2018 Q3-Q4 #

  • Full downtime:

    • Scheduled: 0 hours = 0.0%
    • Unscheduled: 84 hours = 3.5 days = 1.9% (see the worked calculation after this list)
    • Total: 84 hours = 3.5 days = 1.9%
    • External factors: 100% of the above downtime, corresponding to 84 hours (=3.5 days), were due to external factors
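
The 1.9% figure is consistent with the Q3-Q4 reporting period spanning July 1 through December 31, 2018 (184 days), which is an assumption on our part; the worked arithmetic:

```latex
% Unscheduled downtime as a fraction of the assumed 184-day period:
\frac{84\,\text{h}}{184 \times 24\,\text{h}} = \frac{84}{4416} \approx 0.019 = 1.9\%
```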

Scheduled maintenance downtimes #

  • Impact: No file access, no compute resources available
  • Damage: None
  • Occurrences:
    • None
  • Total downtime: 0.0 hours

Scheduled kernel maintenance #

  • Impact: Fewer compute nodes than usual until rebooted
  • Damage: None
  • Occurrences:
    1. 2018-09-28 (up to 14 days)

Unscheduled downtimes due to power outage #

  • Impact: No file access, no compute resources available
  • Damage: Running jobs (<= 14 days) failed, file-transfers failed, possible file corruptions
  • Occurrences:
    1. 2018-06-17 (23 hours) - Campus power outage
    2. 2018-11-08 (19 hours) - Byers Hall power maintenance without notice
    3. 2018-12-14 (42 hours) - Sandler Building power outage
  • Total downtime: 84 hours of which 84 hours were due to external factors

Unscheduled downtimes due to file-system failures #

  • Impact: No file access
  • Damage: Running jobs (<= 14 days) may have failed, file-transfers may have failed, cluster not accessible
  • Occurrences:
    • None.
  • Total downtime: 0.0 hours

Accounts #

  • Number of user accounts: 198 (change: +103 during the year)

December 21, 2018 #

Partial file system failure #

Resolved: Parts of the new BeeGFS file system were non-functional for approx. 1.5 hours on Friday December 21 when a brief maintenance task failed.
Dec 21, 20:50 PST

December 12-20, 2018 #

Nodes down #

Resolved: All but one of the msg-* compute nodes are operational.
Dec 20, 16:40 PST

Notice: Starting Wednesday December 12 around 11:00, several msg-* compute nodes went down (~200 cores in total). The cause of this is unknown. Because it might be related to the BeeGFS migration project, the troubleshooting of this incident will most likely not start until the BeeGFS project is completed, which is projected to be done on Wednesday December 19.
Dec 17, 17:00 PST

December 18, 2018 #

Development node does not respond #

Resolved: Development node qb3-dev1 is functional.
Dec 18, 20:50 PST

Investigating: Development node qb3-dev1 does not respond to SSH. This will be investigated the first thing tomorrow morning (Wednesday December 19). In the meanwhile, development node qb3-gpudev1, which is “under construction”, may be used.
Dec 18, 16:30 PST

November 28-December 19, 2018 #

Installation of new, larger, and faster storage space #

Resolved: /wynton/scratch is now back online and ready to be used.
Dec 19, 14:20 PST

Update: The plan is to bring /wynton/scratch back online before the end of the day tomorrow (Wednesday December 19). The planned SGE downtime has been rescheduled to Wednesday January 9. Moreover, we will start providing the new 500-GiB /wynton/home/ storage to users who explicitly request it (before Friday December 21) and who also promise to move the content under their current /netapp/home/ to the new location. Sorry, users with accounts on both QB3 and Wynton HPC will not be able to migrate until the QB3 cluster has been incorporated into Wynton HPC (see Roadmap), or until they give up their QB3 account.
Dec 18, 16:45 PST

Update: The installation and migration to the new BeeGFS parallel file servers is on track, and we expect to go live as planned on Wednesday December 19. We are fine-tuning the configuration and running performance and resilience tests.
Dec 17, 10:15 PST

Update: /wynton/scratch has been taken offline.
Dec 12, 10:20 PST

Reminder: All of /wynton/scratch will be taken offline and completely wiped starting Wednesday December 12 at 8:00am.
Dec 11, 14:45 PST

Notice: On Wednesday December 12, 2018, the global scratch space /wynton/scratch will be taken offline and completely erased. Over the following week, we will be adding to and reconfiguring the storage system in order to provide all users with new, larger, and faster (home) storage space. The new storage will be served using BeeGFS, a new, much faster file system that we have been prototyping and testing via /wynton/scratch. Once migrated to the new storage, a user’s home directory quota will be increased from 200 GiB to 500 GiB. In order to do this, the following upgrade schedule is planned:

  • Wednesday November 28-December 19 (21 days): To all users, please refrain from using /wynton/scratch - use local, node-specific /scratch if possible (see below). The sooner we can take it down, the higher the chance is that we can get everything in place before December 19.

  • Wednesday December 12-19 (8 days): /wynton/scratch will be unavailable and completely wiped. For computational scratch space, please use the local /scratch that is unique to each compute node (see the staging sketch after this notice). For global scratch needs, the old and much slower /scrapp and /scrapp2 may also be used.

  • Wednesday December 19, 2018 (1/2 day): The Wynton HPC scheduler (SGE) will be taken offline. No jobs will be able to be submitted until it is restarted.

  • Wednesday December 19, 2018: The upgraded Wynton HPC with the new storage will be available including /wynton/scratch.

  • Wednesday January 9, 2019 (1/2 day): The Wynton HPC scheduler (SGE) will be taken offline temporarily. No jobs will be able to be submitted until it is restarted.

It is our hope to be able to keep users’ home accounts and the login, transfer, and development nodes available throughout this upgrade period.

NOTE: If our new setup proves more challenging than anticipated, then we will postpone the SGE downtime to after the holidays, on Wednesday January 9, 2019. Wynton HPC will remain operational over the holidays, though without /wynton/scratch.
Dec 6, 14:30 PST [edited Dec 18, 17:15 PST]
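
To make the “use local /scratch” recommendation above concrete, here is a minimal sketch of the usual stage-in/compute/stage-out pattern. The paths and the my_analysis command are hypothetical placeholders, not Wynton-specific tooling, and on SGE systems a per-job $TMPDIR (when configured) works equally well.

```python
#!/usr/bin/env python3
"""Minimal sketch of stage-in/compute/stage-out using node-local /scratch.

The input/output paths and the `my_analysis` command are hypothetical
placeholders; substitute your own data and tools.
"""
import os
import shutil
import subprocess
import tempfile

INPUT = "/wynton/home/example/user/input.dat"    # hypothetical input file
RESULTS = "/wynton/home/example/user/results/"   # hypothetical output dir

# Create a private working directory on the node-local /scratch disk.
workdir = tempfile.mkdtemp(
    prefix=f"job-{os.getenv('JOB_ID', 'local')}-", dir="/scratch"
)
try:
    shutil.copy(INPUT, workdir)                   # stage in
    subprocess.run(["my_analysis", "input.dat"],  # hypothetical compute step
                   cwd=workdir, check=True)
    os.makedirs(RESULTS, exist_ok=True)
    shutil.copy(os.path.join(workdir, "output.dat"), RESULTS)  # stage out
finally:
    shutil.rmtree(workdir)                        # always clean up /scratch
```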

December 12-14, 2018 #

Power failure #

Resolved: All mac-* compute nodes are up and functional.
Dec 14, 12:00 PST

Investigating: The compute nodes named mac-* (in the Sandler building) went down due to power failure on Wednesday December 12 starting around 05:50. Nodes are being rebooted.
Dec 12, 09:05 PST

November 8, 2018 #

Partial shutdown due to planned power outage #

Resolved: The cluster is fully functional. It turns out that none of the compute nodes, and therefore none of the running jobs, were affected by the power outage.
Nov 8, 11:00 PST

Update: The queue-metric graphs are being updated again.
Nov 8, 11:00 PST

Update: The login nodes, the development nodes and the data transfer node are now functional.
Nov 8, 10:10 PST

Update: Login node wynlog1 is also affected by the power outage. Use wynlog2 instead.
Nov 8, 09:10 PST

Notice: Parts of the Wynton HPC cluster will be shut down on November 8 at 4:00am. This shutdown takes place due to the UCSF Facilities shutting down power in the Byers Hall. Jobs running on affected compute nodes will be terminated abruptly. Compute nodes with battery backup or in other buildings will not be affected. Nodes will be rebooted as soon as the power comes back. To follow the reboot progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above. Unfortunately, the above queue-metric graphs cannot be updated during the power outage.
Nov 7, 15:45 PST

September 28 - October 11, 2018 #

Kernel maintenance #

Resolved: The compute nodes have been rebooted and are accepting new jobs. For the record: approx. 300 cores were back online on day 5, approx. 600 on day 7, approx. 1,500 on day 8, and the majority of the 1,832 cores on day 9.
Oct 11, 09:00 PDT

Notice: On September 28, a kernel update was applied to all compute nodes. To begin running the new kernel, each node must be rebooted. To achieve this as quickly as possible and without any loss of running jobs, the queues on the nodes were all disabled (i.e., they stopped accepting new jobs). Each node will reboot itself and re-enable its own queues as soon as all of its running jobs have completed. Since the maximum allowed run time for a job is two weeks, it may take until October 11 before all nodes have been rebooted and accepting new jobs. In the meanwhile, there will be fewer available slots on the queue than usual. To follow the progress, see the ‘Available CPU cores’ curve (target 1,832 cores) in the graph above.
Sept 28, 16:30 PDT
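
The drain-and-reboot mechanism described in the notice can be sketched as follows. This is a rough illustration under stated assumptions (SGE’s qmod/qhost commands, root privileges, queues re-enabled automatically at boot), not the actual Wynton tooling.

```python
#!/usr/bin/env python3
"""Rough sketch of a per-node drain-and-reboot loop (not the Wynton tooling).

Assumptions: SGE's `qmod -d '*@host'` disables all queue instances on a
host, `qhost -j -h host` lists jobs still running there as indented lines,
and queues are re-enabled automatically at boot.
"""
import socket
import subprocess
import time

host = socket.gethostname()

# Stop accepting new jobs on every queue instance of this node.
subprocess.run(["qmod", "-d", f"*@{host}"], check=True)

# Wait until no running jobs remain on this node.
while True:
    out = subprocess.run(
        ["qhost", "-j", "-h", host], capture_output=True, text=True, check=True
    ).stdout
    # Job entries appear as indented lines below the host's own line
    # (an assumption about this SGE version's output layout).
    if not any(row.startswith(" ") for row in out.splitlines()[3:]):
        break
    time.sleep(300)  # poll every 5 minutes

# All jobs have completed: reboot so the node picks up the new kernel.
subprocess.run(["systemctl", "reboot"], check=True)
```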

October 1, 2018 #

Kernel maintenance #

Resolved: The login, development, and data transfer hosts have been rebooted.
Oct 1, 13:30 PDT

Notice: On Monday October 1 at 01:00, all of the login, development, and data transfer hosts will be rebooted.
Sept 28, 16:30 PDT

September 13, 2018 #

Scheduler unreachable #

Resolved: Around 11:00 on Wednesday September 12, the SGE scheduler (“qmaster”) became unreachable, meaning it could not be queried and no new jobs could be submitted. Jobs that relied on run-time access to the scheduler may have failed. The problem, which was due to a misconfiguration being introduced, was resolved early in the morning on Thursday September 13.
Sept 13, 09:50 PDT

August 1, 2018 #

Partial shutdown #

Resolved: Nodes were rebooted on August 1 shortly after the power came back.
Aug 2, 08:15 PDT

Notice: On Wednesday August 1 at 6:45am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton HPC’s server rooms.
Jul 30, 20:45 PDT

July 30, 2018 #

Partial shutdown #

Resolved: The nodes brought down during the July 30 partial shutdown have been rebooted. Unfortunately, the same partial shutdown has to be repeated within a few days because the work in the server room was not completed. The exact date for the next shutdown is not known at this point.
Jul 30, 09:55 PDT

Notice: On Monday July 30 at 7:00am, parts of the compute nodes (msg-io{1-10} + msg-*gpu) will be powered down. They will be brought back online within 1-2 hours. The reason is a planned power shutdown affecting one of Wynton HPC’s server rooms.
Jul 29, 21:20 PDT

June 16-26, 2018 #

Power outage #

Resolved: The Nvidia-driver issue occurring on some of the GPU compute nodes has been fixed.
Jun 26, 11:55 PDT

Update: Some of the compute nodes with GPUs are still down due to issues with the Nvidia drivers.
Jun 19, 13:50 PDT

Update: The login nodes and the development nodes are functional. Some compute nodes that went down are back up, but not all.
Jun 18, 10:45 PDT

Investigating: The UCSF Mission Bay Campus experienced a power outage on Saturday June 16 causing parts of Wynton HPC to go down. One of the login nodes (wynlog1), the development node (qb3-dev1), and parts of the compute nodes are currently non-functional.
Jun 17, 15:00 PDT