Description:
We are seeking a skilled and motivated MLOps Engineer to help design, build, and maintain a centralized ML operations pipeline. The ideal candidate will work with a variety of technologies, including MLflow, Kubernetes, Pachyderm, and DVC, to create a robust and scalable system for managing model deployments, training, and experiments. This role will focus on version control for applications, models, and datasets, while improving visibility and control over the entire MLOps lifecycle.
Key Responsibilities:
- Develop and maintain a solution to track and manage live applications, including version control for models and datasets in production.
- Integrate Pachyderm to organize and track versioned layers (Application - Model - Data) for a unified view of deployment components.
- Implement version lineage tracking from application code through models and datasets to improve troubleshooting and facilitate system enhancements.
- Training Metrics Management:
  - Capture and store real-time training metrics to optimize model performance and enable dynamic adjustments during training.
  - Use MLflow to log training metrics, facilitating early stopping and performance-based adjustments.
  - Implement version control for datasets used in training to ensure reproducibility and consistency across different runs.
- Design and implement systems to collect and log the results of inference experiments, including per-data-unit metrics and aggregated performance scores.
- Use MLflow to store, tag, and retrieve experiment results for further analysis and decision-making.
- Manage model and dataset versions across experiments using Pachyderm, ensuring consistency and enabling easy backtracking through version histories.
- Versioning and Integration:
  - Integrate DVC with S3 for model and dataset versioning, so that versioned data is properly tracked and stored.
  - Track versions of applications, models, and datasets, ensuring each version is linked to the versions it depends on and kept up to date.
Qualifications:
- Proven experience in MLOps, machine learning pipelines, and model deployment.
- Strong knowledge of MLflow for experiment and metrics tracking, and of Pachyderm for model and dataset version control.
- Hands-on experience with Kubernetes for managing deployments and scaling machine learning models.
- Familiarity with DVC (Data Version Control) and cloud storage services such as AWS S3.
- Proficiency in version control practices for machine learning applications, models, and datasets.
- Ability to work with both machine learning practitioners and software engineers to ensure a smooth workflow.
- Strong troubleshooting and debugging skills, particularly for distributed systems.
- Excellent communication skills and the ability to work in a collaborative team environment.