

Generating Pokémon Sprites from Text

My project PikaPikaGen focuses on text-to-image generation for Pokémon sprites. Inspired by models like DALL·E and Stable Diffusion, it aims to generate 215×215 pixel Pokémon sprites directly from natural language descriptions.


The architecture combines a BERT-mini text encoder with a Transformer encoder stack, a custom Attention Block for global context, and a CNN-based image decoder with cross-attention at each stage. This design allows the system to align words like “blue scales” or “fire tail” with corresponding sprite regions.
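As a rough illustration of the decoder-with-cross-attention design, here is a minimal PyTorch sketch. It is not the actual PikaPikaGen code: class names, channel widths, and the number of upsampling stages are assumptions made for the example; the only detail carried over from above is a 256-dimensional text embedding, matching BERT-mini's hidden size.

```python
# Minimal sketch of a CNN decoder with cross-attention to text tokens at
# each upsampling stage. All names and sizes here are hypothetical.
import torch
import torch.nn as nn

class CrossAttnUpBlock(nn.Module):
    """Upsample the image features, then let each spatial position attend
    to the encoded text tokens (cross-attention) with a residual connection."""
    def __init__(self, channels, text_dim, num_heads=4):
        super().__init__()
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads,
                                          kdim=text_dim, vdim=text_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x, text_tokens):
        x = self.up(x)                                  # (B, C, H, W)
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)                # (B, H*W, C)
        attn_out, _ = self.attn(q, text_tokens, text_tokens)
        q = self.norm(q + attn_out)                     # residual + norm
        return q.transpose(1, 2).reshape(b, c, h, w)

class SpriteDecoder(nn.Module):
    """Maps a pooled text embedding to an RGB sprite, attending to the
    per-token text features at every upsampling stage."""
    def __init__(self, text_dim=256, base_channels=128, num_stages=4):
        super().__init__()
        self.seed = nn.Linear(text_dim, base_channels * 7 * 7)
        self.stages = nn.ModuleList(
            [CrossAttnUpBlock(base_channels, text_dim) for _ in range(num_stages)]
        )
        self.to_rgb = nn.Conv2d(base_channels, 3, 3, padding=1)

    def forward(self, pooled_text, text_tokens):
        b = pooled_text.size(0)
        x = self.seed(pooled_text).view(b, -1, 7, 7)
        for stage in self.stages:
            x = stage(x, text_tokens)
        return torch.tanh(self.to_rgb(x))

# Four 2x stages take 7x7 to 112x112 in this toy example; the real model
# would need a different stage count or a final resize to reach 215x215.
decoder = SpriteDecoder()
img = decoder(torch.randn(2, 256), torch.randn(2, 32, 256))
print(img.shape)  # torch.Size([2, 3, 112, 112])
```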


Dataset preprocessing included splitting a Pokémon sprite dataset into train/validation/test sets, deterministic augmentation (rotating each sprite at a fixed set of angles), and enriching textual prompts with type and classification details. Training was performed on an Apple M2 GPU, with reproducibility ensured via fixed seeds and parameterized scripts.
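The reproducibility and augmentation steps can be pictured with a short sketch like the one below; the seed value, rotation angles, split ratios, and prompt template are illustrative assumptions rather than the project's exact settings.

```python
# Sketch of seeding, deterministic rotation augmentation, seeded splitting,
# and prompt enrichment. Values and helper names are hypothetical.
import random
import numpy as np
import torch
from torchvision.transforms import functional as TF

def seed_everything(seed: int = 42):
    """Fix all RNGs so splits and training runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def deterministic_rotations(image, angles=(0, 90, 180, 270)):
    """Deterministic augmentation: one rotated copy per fixed angle,
    instead of sampling a random angle at train time."""
    return [TF.rotate(image, angle) for angle in angles]

def split_indices(n, train=0.8, val=0.1, seed=42):
    """Seeded train/validation/test split over sprite indices."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(n, generator=g).tolist()
    n_train, n_val = int(n * train), int(n * val)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]

def enrich_prompt(description, ptype, classification):
    """Append type/classification metadata to the base description."""
    return f"{description} A {ptype}-type Pokémon, the {classification}."
```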


Experiments: I explored dropout, a CLIP loss term, data augmentation, and enriched prompts. Despite improvements in training loss, the model consistently overfitted and struggled to generalize on the validation and test sets. The final recommendation is to adopt a latent diffusion model (Stable Diffusion-based) for substantial performance gains.
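For reference, the CLIP-loss experiment amounts to adding an image-text alignment term on top of a pixel reconstruction term. The sketch below assumes L1 reconstruction, precomputed CLIP embeddings for the generated image and its prompt, and an arbitrary weighting; none of these specifics are confirmed details of the project.

```python
# Hedged sketch of a combined reconstruction + CLIP-style alignment loss.
import torch
import torch.nn.functional as F

def combined_loss(pred_img, target_img, img_emb, txt_emb, clip_weight=0.1):
    """L1 pixel reconstruction plus (1 - cosine similarity) between the
    CLIP embeddings of the generated image and its text prompt."""
    recon = F.l1_loss(pred_img, target_img)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    clip_align = 1.0 - (img_emb * txt_emb).sum(dim=-1).mean()
    return recon + clip_weight * clip_align
```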


This project demonstrates the challenges and opportunities of applying modern text-to-image architectures to niche domains like Pokémon sprite generation. GitHub: Repository