Lead MLOps Specialist at ATB-market

Posted more than 30 days ago

ATB-market

Without experience
Kharkiv
Full-time work

Translated by Google

We invite a Lead MLOps specialist to join our team.

Responsibilities:

Turn data science (DS) developments into reliable services with predictable performance, transparent monitoring, and secure releases (at central and edge levels).

Areas of responsibility: 

  • ML-platform: registry of artifacts/models, version/access policies, service templates; 
  • CI/CD for models: data/model tests, canary/blue-green/shadow, rollback, feature-flags; 
  • Observability: quality/drift/stability, p95 latency/resources, incidents, feedback loops into training;
  • Security: secrets/IAM/RBAC, inference audit, config management, network policies/ingress; 
  • Edge scenarios: synchronization of models/caches, resilience to failures/outages, telemetry;
  • CPU/GPU capacity planning, error budgets for peak windows (evenings/Fridays); 
  • Automated replay of inference logs for audit/retraining; zero-downtime updates; 

OKR examples:

  1. Uptime of ML-services 99.5%; p95 latency <150 ms on critical paths; 
  2. Model time-to-market (TtM) from approval to production <30 min; 95% of releases without downtime;
  3. Automatic detection of data/model drift every 24 hours;

Requirements (must-have): 

  • 5+ years in MLOps/SRE/DevOps; production operation of ML services on-prem;
  • Deep understanding of the model life cycle, risks, and observability;
  • Solid knowledge of Kubernetes/OpenShift, Helm, Argo CD/Workflows, Terraform/Ansible, GitLab CI;
  • Production experience with MLflow Registry/Serving, NVIDIA Triton, ONNX Runtime, FastAPI/gRPC, KServe or Seldon Core; 
  • Monitoring/logging: Prometheus/Grafana/Loki, Alertmanager, Evidently/whylogs, OpenTelemetry; 
  • Security/configs: Vault/Sealed Secrets, Keycloak (IAM), CNI policies, ingress (Traefik/Kong/Nginx);
  • Automation of data/model tests, incident management, runbooks.

Will be a plus:

  • Edge inference in retail (POS/SCO/video/planograms); GPU-profiling, TensorRT/quantization/batch-policy;
  • Multi-version models with fast roll-forward/rollback;
  • Hands-on experience with cost-aware resource planning and energy efficiency;
  • Monitoring/logging: Prometheus, Grafana, Loki, Alertmanager, OpenTelemetry; ML quality: Evidently/whylogs;
  • Security: HashiCorp Vault/Sealed Secrets, Keycloak (IAM), CNI policies, ingress controllers.

The company offers:

    • remote or hybrid format of work;
    • employment under a gig contract or as a staff employee (reservation is possible);
    • paid annual leave of 24 calendar days, paid sick leave;
    • regular payment of wages in full and without delays, regular salary review;
    • opportunity for professional and career growth;
    • training courses.


    Contact person: Kateryna, tel. 0984567857 (t.me/KaterynaB_HR)
