
Senior/Middle Data Scientist (Data Preparation & Pre-training)
- Kyiv
- Permanent position
- Full-time
- Design, prototype, and validate data preparation and transformation steps for LLM training datasets, including cleaning and normalization of text, filtering of toxic content, de-duplication, de-noising, detection and deletion of personal data, etc.
- Build task-specific SFT/RLHF datasets from existing data, including data augmentation and labeling with an LLM as the teacher.
- Analyze large-scale raw text, code, and multimodal data sources for quality, coverage, and relevance.
- Develop heuristics, filtering rules, and cleaning techniques to maximize training data effectiveness.
- Collaborate with data engineers to hand over prototypes for automation and scaling.
- Research and develop best practices and novel techniques in LLM training pipelines.
- Monitor and evaluate data quality impact on model performance through experiments and benchmarks.
- Research and implement best practices in large-scale dataset creation for AI/ML models.
- Document methodologies and share insights with internal teams.
- 3+ years of experience in Data Science or Machine Learning, preferably with a focus on NLP.
- Proven experience in data preprocessing, cleaning, and feature engineering for large-scale datasets of unstructured data (text, code, documents, etc.).
- Advanced degree (Master's or PhD) in Computer Science, Computational Linguistics, Machine Learning, or a related field is highly preferred.
- Good knowledge of natural language processing techniques and algorithms.
- Hands-on experience with modern NLP approaches, including embedding models, semantic search, text classification, sequence tagging (NER), transformers/LLMs, and RAG.
- Familiarity with LLM training and fine-tuning techniques, and data requirements.
- Proficiency in Python and common data science and NLP libraries (pandas, NumPy, scikit-learn, spaCy, NLTK, langdetect, fasttext).
- Strong experience with deep learning frameworks such as PyTorch or TensorFlow for building NLP models.
- Ability to write efficient, clean code and debug complex model issues.
- Solid understanding of data analytics and statistics.
- Experience in experimental design, A/B testing, and statistical hypothesis testing to evaluate model performance.
- Comfortable working with large datasets, writing complex SQL queries, and using data visualization to inform decisions.
- Experience deploying machine learning models in production (e.g., using REST APIs or batch pipelines) and integrating with real-world applications.
- Familiarity with MLOps concepts and tools (version control for models/data, CI/CD for ML).
- Experience with cloud platforms (AWS, GCP, or Azure) and big data technologies (Spark, Hadoop, Ray, Dask) for scaling data processing or model training is a plus.
- Experience working in a collaborative, cross-functional environment.
- Strong communication skills to convey complex ML results to non-technical stakeholders and to document methodologies.
- Ability to rapidly prototype and iterate on ideas.
- Familiarity with evaluation metrics for language models (perplexity, BLEU, ROUGE, etc.) and with techniques for model optimization (quantization, knowledge distillation) to improve efficiency.
- Understanding of the FineWeb2 processing pipeline or a similar approach.
- Publications in NLP/ML conferences or contributions to open-source NLP projects.
- Active participation in the AI community or demonstrated continuous learning (e.g., Kaggle competitions, research collaborations).
- Familiarity with the Ukrainian language and context.
- Understanding of cultural and linguistic nuances that could inform model training and evaluation in a Ukrainian context.
- Knowledge of Ukrainian text sources and datasets, or experience with multilingual data processing, is an advantage given our project's focus.
- Hands-on experience with containerization (Docker) and orchestration (Kubernetes) for ML, as well as ML workflow tools (MLflow, Airflow).
- Experience working alongside MLOps engineers to streamline the deployment and monitoring of NLP models.
- Innovative mindset with the ability to approach open-ended AI problems creatively.
- Comfort in a fast-paced R&D environment where you can adapt to new challenges, propose solutions, and drive them to implementation.
- Office or remote: it's up to you. You can work from anywhere, and we will set up your workplace.
- Remote onboarding.
- Performance bonuses.
- Training and development through the company's library, internal resources, and partner programs.
- Health and life insurance.
- Wellbeing program and corporate psychologist.
- Reimbursement of expenses for Kyivstar mobile communication.