Site Reliability Engineer II
Company: Cox Automotive
Location: Austin
Posted on: April 2, 2026
Job Description:
The Site Reliability Engineer II will be part of
the Site Reliability Engineering (SRE) team. The SRE team drives
reliability, observability, and engineering practice maturity
across more than 150 teams, made up of over a thousand engineers, in our
part of Cox Automotive. We build processes, documentation, and
tools that scale: deep observability to detect and diagnose issues
faster, engineering maturity assessments that drive measurable
improvement, reusable golden paths that accelerate delivery, and
trusted advisory relationships that align reliability with business
priorities. Much of our work focuses on eliminating toil through
automation (increasingly leveraging AI and agentic solutions) and
establishing self-service capabilities that multiply our impact. If
you love building monitoring systems that reveal truth, evaluating
engineering practices to raise the bar organization-wide, exploring
cutting-edge AI technologies to solve operational challenges, and
acting as a trusted advisor to engineers and leadership, we want to
talk to you.
What You'll Do:
- Define and drive adoption of SLIs, SLOs, error budgets, and
  high-quality alerting standards across the organization
- Architect end-to-end observability strategies (metrics, logs,
  traces, business signals) with consistent taxonomy and
  discoverability
- Build centralized dashboards, reliability scorecards, and runbooks
  used by engineering teams and leadership
- Establish engineering practice maturity baselines and partner with
  teams on measurable improvement plans
- Create golden paths (standardized pipelines, infrastructure
  modules, and service templates) that enable rapid, consistent
  delivery
- Pioneer the use of AI and agentic solutions to automate toil,
  accelerate incident response, and enhance operational workflows
- Lead internal workshops, game days, and learning programs to
  spread operational excellence
- Act as a trusted advisor to product and engineering leadership,
  providing data-driven insights on reliability risk and trade-offs
- Guide post-incident reviews toward systemic remediation
  (guardrails, automation, design changes) rather than superficial
  fixes
- Design and extend self-service platforms for deployment,
  progressive delivery, and automated recovery
- Reduce MTTR through better telemetry, automation, AI-assisted
  diagnostics, and resilience patterns
- Mentor engineers across teams to become local reliability
  champions, scaling SRE impact without adding headcount
Who You Are:
- Experience programming in at least one of
  the following languages: Python, TypeScript, or Java.
- Bachelor's degree in a related discipline and 4 years' experience
  in a related field. The right candidate could also have a
  different combination, such as a master's degree and 2 years'
  experience; a Ph.D. and up to 1 year of experience; or 16 years'
  experience in a related field.
- Applicants must currently be authorized to work in the United
  States for any employer without current or future sponsorship. No
  OPT, CPT, STEM/OPT, or visa sponsorship now or in the future.
- Expertise in designing, analyzing, and troubleshooting large-scale
  distributed systems.
- Deep hands-on experience with modern observability tools
  (CloudWatch and New Relic).
- Proven ability to assess engineering practices and drive
  measurable improvements across multiple teams.
- Experience establishing SLIs/SLOs, managing error budgets, and
  improving alert signal-to-noise ratios.
- Strong background in release engineering, CI/CD, and progressive
  deployment strategies.
- Deep expertise in AWS, Terraform, AWS CDK, and GitHub/GitHub
  Actions.
- Enthusiasm for applying AI, LLMs, and agentic automation to
  operational and reliability challenges.
- Track record of reducing MTTR and improving availability through
  automation and architectural improvements.
- Excellent written and verbal communication skills tailored to both
  engineers and executives.
- Systematic problem-solving approach with a sense of drive and
  ownership.
- Understanding of Linux operating systems, networking, and
  performance fundamentals.
- Ability to build trust and influence decisions through data-driven
  insights.
- Experience facilitating effective post-incident analysis and
  driving systemic remediation.
- Desire to work in a
  fast-paced, evolving, growing, dynamic environment.
USD 89,400.00 - 134,000.00 per year
Compensation:
Compensation includes a base salary in the range of $89,400.00 -
$134,000.00. The base salary may vary within the anticipated base
pay range based on factors such as the ultimate location of the
position and the selected candidate's knowledge, skills, and
abilities. Position may be eligible for additional compensation that
may include an incentive program.
Benefits:
The Company offers eligible employees the flexibility to take as
much vacation with pay as they deem consistent with their duties,
the company's needs, and its obligations; seven paid holidays
throughout the calendar year; and up to 160 hours of paid wellness
annually for their own wellness or that of family members. Employees
are also eligible for additional paid time off in the form of
bereavement leave, time off to vote, jury duty leave, volunteer time
off, military leave, and parental leave.
Keywords: Cox Automotive, Waco, Site Reliability Engineer II, Engineering, Austin, Texas