We invite you to join our team as Lead MLOps
Responsibilities:
Turn data science developments into reliable services with predictable performance, transparent monitoring and secure releases (at both central and edge levels).
Areas of responsibility:
- ML-platform: registry of artifacts/models, version/access policies, service templates;
- CI/CD for models: data/model tests, canary/blue-green/shadow, rollback, feature-flags;
- Observability: quality/drift/stability, p95 latency/resources, incidents, feedback loops into training;
- Security: secrets/IAM/RBAC, inference audit, config management, network policies/ingress;
- Edge scenarios: synchronization of models/caches, resilience to failures/outages, telemetry;
- CPU/GPU capacity planning, error budgets for peak windows (evenings/Fridays);
- Automated replay of inference logs for audit/retraining; zero-downtime updates.
OKR examples:
- Uptime of ML-services 99.5%; p95 latency <150 ms on critical paths;
- Model time-to-market from approval to production <30 min; 95% of releases with no downtime;
- Automatic detection of data/model drift every 24 hours.
Requirements (must-have):
- 5+ years in MLOps/SRE/DevOps; experience operating ML services in production on-prem;
- Deep understanding of the model life cycle, risks and observability;
- Solid knowledge of Kubernetes/OpenShift, Helm, Argo CD/Workflows, Terraform/Ansible, GitLab CI;
- Production experience with MLflow Registry/Serving, NVIDIA Triton, ONNX Runtime, FastAPI/gRPC, KServe or Seldon Core;
- Monitoring/logging: Prometheus/Grafana/Loki, Alertmanager, Evidently/whylogs, OpenTelemetry;
- Security/configs: Vault/Sealed Secrets, Keycloak (IAM), CNI policies, ingress (Traefik/Kong/Nginx);
- Automation of data/model tests, incident management, runbooks.
Will be a plus:
- Edge inference in retail (POS/SCO/video/planograms); GPU-profiling, TensorRT/quantization/batch-policy;
- Multi-version models with fast roll-forward/rollback;
- Experience with cost-aware resource planning and energy efficiency.
The company offers:
- remote or hybrid format of work;
- employment under a gig contract or as a staff employee (reservation is possible);
- paid annual leave of 24 calendar days, paid sick leave;
- salary paid on time and in full, with regular salary reviews;
- opportunity for professional and career growth;
- training courses.
Contact person: Kateryna, tel. 0984567857 (t.me/KaterynaB_HR)