Dataset deduplication

We built a tool that efficiently removes duplicate images from large datasets, which reduces dataset size, speeds up model training, and lowers storage requirements. Applied to LAION-2B, it shrank the dataset by a factor of 10, a figure that includes the removal of garbage images, and the result is a higher-quality dataset that is faster to process.
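To make the idea of near-duplicate removal concrete, here is a minimal Python sketch. It is not the tool described above, whose internals are not given here. It assumes a simple average hash over grayscale pixel grids and a Hamming-distance threshold. The names `average_hash`, `hamming`, `deduplicate`, and the `max_distance` parameter are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence

# Assumption: an image is a grayscale grid of 0-255 values with equal-length rows.
Image = Sequence[Sequence[int]]


def average_hash(image: Image, hash_size: int = 8) -> int:
    """Block-average the image down to hash_size x hash_size cells,
    then set one bit per cell: 1 if the cell is brighter than the mean."""
    height, width = len(image), len(image[0])
    cells = []
    for i in range(hash_size):
        r0 = i * height // hash_size
        r1 = max((i + 1) * height // hash_size, r0 + 1)
        for j in range(hash_size):
            c0 = j * width // hash_size
            c1 = max((j + 1) * width // hash_size, c0 + 1)
            block = [image[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def deduplicate(images: Dict[str, Image], max_distance: int = 4) -> List[str]:
    """Return the names of images to keep. An image is dropped when its hash
    is within max_distance bits of an image that was already kept."""
    kept: List[str] = []
    kept_hashes: List[int] = []
    for name, image in images.items():
        h = average_hash(image)
        if all(hamming(h, other) > max_distance for other in kept_hashes):
            kept.append(name)
            kept_hashes.append(h)
    return kept
```

Calling `deduplicate` with a dictionary that maps file names to pixel grids returns the names that survive, in insertion order. The pairwise comparison here is quadratic in the number of kept images, so at the scale of a dataset like LAION-2B a real pipeline would presumably need an approximate nearest-neighbor index or hash bucketing rather than this loop.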
