ML Figure Template

Interactive architecture diagrams for academic project pages

A lightweight, dependency-free library for building responsive, interactive ML architecture diagrams that look great on a project page and export to a self-contained SVG for your paper.


Getting started

The library and its brief documentation are available in the GitHub repository. Three minimal walk-throughs are provided as standalone HTML files (quickstart 1, quickstart 2, and quickstart 3); they are the best place to start.
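To make the idea of a "self-contained SVG" concrete, here is a minimal sketch in Python. It is not the library's API: the function name `stacked_blocks_svg`, its parameters, and the colors are all hypothetical and chosen only for illustration. It shows how a column of labeled blocks can be written as a single SVG string with no external stylesheets, fonts, or scripts.

```python
from xml.sax.saxutils import escape


def stacked_blocks_svg(labels, width=220, block_h=36, gap=18):
    """Return an SVG string with one rounded box per label, stacked top to bottom.

    Hypothetical helper for illustration only; not part of ML Figure Template.
    """
    height = len(labels) * block_h + (len(labels) + 1) * gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}" '
        f'font-family="sans-serif" font-size="13">'
    ]
    for i, label in enumerate(labels):
        y = gap + i * (block_h + gap)
        if i > 0:
            # Connector from the bottom of the previous block to the top of this one.
            parts.append(
                f'<line x1="{width / 2}" y1="{y - gap}" '
                f'x2="{width / 2}" y2="{y}" stroke="#555"/>'
            )
        parts.append(
            f'<rect x="10" y="{y}" width="{width - 20}" height="{block_h}" '
            f'rx="6" fill="#eef3fb" stroke="#555"/>'
        )
        # Escape the label so characters such as "&" keep the SVG well-formed.
        parts.append(
            f'<text x="{width / 2}" y="{y + block_h / 2}" text-anchor="middle" '
            f'dominant-baseline="middle">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = stacked_blocks_svg(
    ["Token Embedding", "LayerNorm", "Masked Multi-Head Attention", "FFN (MLP)"]
)
```

Because all styling lives in attributes on the elements themselves, the resulting string can be saved to a `.svg` file and included in a paper without any accompanying assets.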

Below are more advanced examples.

GPT (default style)

Original GPT architecture: Radford et al., Improving Language Understanding by Generative Pre-Training

[Interactive figure. Blocks, from input to output: Input tokens (sequence length); Positional Emb.; Token Embedding; LayerNorm; Masked Multi-Head Attention; LayerNorm; FFN (MLP); LayerNorm (final); Linear Head; Softmax; Next-token probabilities (sequence length × vocabulary size).]
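The "Masked" in Masked Multi-Head Attention refers to causal masking: each position may attend only to itself and to earlier positions. The figure names the block without showing its internals, so the following is a minimal pure-Python sketch of that one step, with a hypothetical function name `causal_softmax` and a single attention head assumed for simplicity.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax in which position i only sees positions 0..i.

    `scores` is a square list of lists of raw attention scores
    (one row per query position). Hypothetical helper for illustration.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # drop scores for future positions
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero attention weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


weights = causal_softmax(
    [
        [0.0, 1.0, 2.0],
        [0.0, 0.0, 5.0],
        [1.0, 1.0, 1.0],
    ]
)
```

Here the first row attends only to the first token, and the large score of 5.0 in the second row has no effect because it belongs to a future position.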

GPT (stylized)

[Interactive figure. Blocks, from input to output: Input tokens (sequence length); Positional Emb.; Token Embedding; LayerNorm; Masked Multi-Head Attention; LayerNorm; FFN (MLP); LayerNorm (final); Linear Head; Softmax; Next-token probabilities (sequence length × vocabulary size).]

Transformer (stylized)

Original Transformer architecture: Vaswani et al., Attention Is All You Need
Layout and colors inspired by dair-ai/ml-visuals/2.png

[Interactive figure. Block labels (repeated labels listed once): Softmax; Linear; Add & Norm; Feed Forward; Multi-Head Attention; Masked Multi-Head Attention; Positional Encoding; Input Embedding; Output Embedding; Inputs; Outputs (shifted right).]

Textual Inversion (default style)

Original Textual Inversion paper: Gal et al., An Image is Worth One Word

[Interactive figure. Labels: prompt "A photo of S_{*}"; CLIP Text Encoder; Embedding Layer (❄ Existing tokens, S_{*}); ❄ Transformer Layers; Image x; Add Noise; ❄ U-Net; predicted noise \hat{ε}; noise ε; timestep t.]
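The figure corresponds to the training objective of Textual Inversion, which can be sketched in the figure's notation as follows. The symbols $v_*$ (the embedding of the new token $S_{*}$), $c$ (the text-conditioning function), and $x_t$ (the noised image at timestep $t$) are notation assumed here, and the ❄ marker is read as "frozen". The original paper applies the noise in the latent space of a latent diffusion model rather than directly to the image.

```latex
v_* \;=\; \arg\min_{v}\;
\mathbb{E}_{x,\ \epsilon \sim \mathcal{N}(0, I),\ t}
\Big[\, \big\lVert \epsilon - \hat{\epsilon}\big(x_t,\, t,\, c(\text{``A photo of } S_{*}\text{''};\, v)\big) \big\rVert_2^2 \,\Big]
```

Only $v_*$ is optimized; the existing token embeddings, the Transformer layers of the text encoder, and the U-Net keep their pretrained weights.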

VETIM (VETIM paper style)

Original VETIM architecture: Everaert et al., VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

[Interactive figure. Labels: T[S_{*}] = "A rendering of S_{*} on a black background."; Text encoder E; S_{*}; E(T[S_{*}]); Image generation module (e.g. diffusion model); Generated image; Sample images; T[t] = "A rendering of an object on a black background. The object is a twisted, abstract sculpture made of delicate, interlocking tendrils of glass."; Text encoder E; E(T[t]).]

Textual Inversion (VETIM paper style)

Original Textual Inversion paper: Gal et al., An Image is Worth One Word
Layout and colors after Everaert et al., VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

[Interactive figure. Labels: T[S_{*}] = "An illustration of a S_{*}"; Text encoder E; S_{*}; E(T[S_{*}]); Image generation module (e.g. diffusion model); Generated image; Sample images.]

BibTeX

@misc{everaert2026mlfigtemplate,
  author       = {{EPFL-IVRL} and Everaert, Martin Nicolas},
  title        = {{ML} {F}igure {T}emplate: {I}nteractive architecture diagrams for {ML} project pages},
  year         = {2026},
  howpublished = {\url{https://github.com/IVRL/ml-figure-template}}
}