Snaption
My first deep dive into multi-modal ML! I built an end-to-end image captioning system from scratch in PyTorch, combining computer vision (EfficientNet) and NLP (Transformers). The model was trained on 8K images, and along the way I tackled hyperparameter tuning, overfitting, and data augmentation. The repo includes pre-trained models, a clean package structure, and detailed documentation.