Our model requires absolutely no modality-specific processing at inference time, and uses an order of magnitude fewer parameters at equivalent accuracy on ImageNet. We demonstrate that the same ByteFormer architecture can perform audio classification without modifications or modality-specific preprocessing.

Source: Bytes Are All You Need: Transformers Operating Directly On File Bytes – Apple Machine Learning Research
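To make the core idea concrete, here is a minimal sketch of a byte-level Transformer classifier in PyTorch. It is illustrative only: the class name `ByteClassifier`, the hyperparameters, and the mean-pooling head are assumptions for this sketch, not the published ByteFormer architecture. What it shows is the property the excerpt describes: the model consumes raw file bytes directly, so the same module can classify images or audio with no modality-specific front end.

```python
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    """Illustrative byte-level Transformer classifier (not the paper's exact ByteFormer)."""

    def __init__(self, num_classes: int, dim: int = 192, depth: int = 6,
                 heads: int = 4, max_len: int = 4096):
        super().__init__()
        # One embedding per possible byte value (0..255); no image decoder,
        # spectrogram extractor, or other modality-specific preprocessing.
        self.byte_embed = nn.Embedding(256, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, file_bytes: torch.Tensor) -> torch.Tensor:
        # file_bytes: (batch, seq_len) integers in [0, 255] -- raw bytes of a
        # file read straight from disk (e.g. a TIFF image or a WAV recording).
        x = self.byte_embed(file_bytes) + self.pos_embed[:, : file_bytes.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))  # mean-pool token features, then classify

# The same module handles either modality; only the label set changes.
model = ByteClassifier(num_classes=1000)        # e.g. ImageNet-1k classes
raw = torch.randint(0, 256, (2, 1024))          # stand-in for real file bytes
logits = model(raw)                             # shape: (2, 1000)
```

Because the input is just an integer sequence, switching from image to audio classification in this sketch means changing the training data and `num_classes`, nothing in the model itself, which mirrors the paper's claim that ByteFormer needs no modifications across modalities.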