Site Reliability Engineer

Vanguard Raleigh, NC Closed
Vanguard is looking for a Site Reliability Engineer in Raleigh, NC. This local job opportunity with ID 3438314 has been live since 02/14/2018.
As a Site Reliability Engineer on Employer's Runtime Engineering team, you'll have the opportunity to put your operational savvy and engineering skills to work! On the job you'll be ensuring the "-ilities" (Availability, Reliability, Scalability, Usability, etc.) of our private and public cloud platforms in both test and production environments. You'll respond to incidents, apply upgrades to the platform, and leverage a strategic-thinking mindset to "automate all the things" (repetitive manual work is the worst!).

Additionally, you can anticipate working with real-time monitoring and diagnostic data, analyzing trends, and planning for future infrastructure needs. As a caretaker of these platforms, you'll be collaborating and planning activities with our internal development teams to ensure that application service level objectives are met. As the name might suggest, a passion for reliability is a must!

On the job you'll be...
•Maintaining, upgrading, and patching our private and public cloud platforms in test and production environments.
•Managing communications and coordinating change events with development and support teams.
•Identifying and resolving reliability issues and implementing long-term mitigation strategies - ideally through automation.
•Responding to production incidents and availability needs.
•Facilitating and documenting platform post-mortems.
•Training and mentoring junior staff members on reliability practices, processes and technologies.
•Participating in an off-hours on-call rotation.
•Defining future state implementations for Microservice runtime platforms.

Duties & Responsibilities:

  • Ensures reliable operation of production and test environments.
  • Diagnoses and troubleshoots availability interruptions and other production issues.
  • Plans and coordinates enterprise-wide infrastructure projects with other IT and client teams.
  • Communicates with teams to keep them apprised of status and issues. Contacts vendors to resolve technical issues.
  • Tests, installs, and migrates all software, patches, upgrades, applications, and/or hardware.
  • Develops technical standards. Tests and evaluates IT vendor products.
  • Writes documentation, including project plans, installation procedures, and troubleshooting tips. Creates diagrams, including technical topology.
  • Maintains, monitors, and tunes production system and application performance. Debugs source code and performance problems and/or provides debugging assistance to developers.
  • Identifies opportunities to improve system and applications performance (e.g., automating manual system tasks). 
  • Trains and mentors staff. Resolves complex issues elevated from staff with less experience. 
  • Adds, updates, and closes IT Problem Management database records. Researches and resolves complex issues, and reviews related technology records to mitigate impact on assigned system.
  • Reviews numerous IT knowledge repositories to update technical knowledge.
  • Learns and understands client area business functions and requirements. Has the ability to determine the appropriate technical tool to address the client's business needs.
  • Thoroughly understands and complies with IT policies and procedures, especially those for quality and productivity standards that enable the team to meet established client service levels.
  • Thoroughly understands and complies with Information Security policies and procedures, and verifies deliverables meet Information Security and VSA requirements.
  • Participates in special projects and performs other duties as assigned.

Education & Experience:

  • Minimum of 3 years of overall technical engineering experience
  • Bachelor’s Degree preferred or equivalent technical experience
  • A deep understanding of, and practical experience with, managing at least one container-based service platform (e.g., Docker, Kubernetes, AWS ECS, Cloud Foundry, OpenShift).
  • Experience maintaining and monitoring distributed systems.
  • Understanding and application of monitoring, alerting, and visualization solutions (e.g., Splunk, ELK, Nagios, Grafana).
  • Deep knowledge of Linux systems and cloud platforms/providers
  • Strong oral and written communication skills
  • Passion for problem solving and strategic thinking and a desire to own and execute
  • Advanced understanding and application of at least one scripting language (Shell, PHP, Python, etc.)
  • Basic development experience in a language such as Java is considered a plus
  • A flexible schedule - some activities you'll be performing may require off-hours or weekend support

Special Factors

Remote and off hours support:  See additional information for the specific requirements for this posting.
On call:  See additional information for the specific requirements for this posting.

***Employer is not offering visa sponsorship for this position***
