This role is for you if you are comfortable with Big Data and cloud infrastructure (we use Google Cloud), are familiar with high-performance databases, and are keen to ensure high reliability and efficiency in large-scale systems.
What you’ll do / Responsibilities:
- Design and build our Machine Learning Platform to help data scientists productionize their models and features faster.
- Automate all parts of the data science lifecycle: feature engineering, model training, testing, and deployment.
- Deploy, operate, and grow some of the largest ML systems in the region.
- Collaborate with product teams to understand operational requirements, and translate those requirements into observable architecture and SRE processes.
What you’ll need / Requirements:
- At least 5 years of experience as an infrastructure or software engineer.
- Experience with Go, Python, and shell scripting; Java is optional.
- Experience with cloud environments; Google Cloud preferred.
- Experience with modern cloud deployment technologies such as Terraform, Kubernetes, and Helm, and an understanding of Infrastructure as Code (IaC) concepts.
- Experience deploying, operating, and debugging Big Data frameworks such as Spark, Flink, Kafka, and Airflow. Experience with ML frameworks such as TFX, Kubeflow, and MLflow is a plus.
- Experience with relational and non-relational databases, including clustering and high-availability configurations.
- Proven track record of building and operating large-scale, high-throughput, low-latency production systems. Experience with microservice architectures and technologies (Docker, Istio, nginx) is a huge plus.
- Strong understanding of DevOps and Site Reliability Engineering (SRE) principles.