ML Figure Template

Interactive architecture diagrams for academic project pages

A lightweight, dependency-free library for building responsive, interactive ML architecture diagrams that look great on a project page and export to a self-contained SVG for your paper.
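To make "self-contained SVG" concrete, here is a minimal Python sketch. It is not the library's API: the function name `stacked_blocks_svg` and its parameters are hypothetical. It assumes "self-contained" means that all styling is inline and that the file references no external fonts, stylesheets, images, or scripts.

```python
from xml.sax.saxutils import escape


def stacked_blocks_svg(labels, width=220, box_h=36, gap=14):
    """Return a standalone SVG string: one labeled box per entry, top to bottom.

    Hypothetical helper for illustration only; not part of ML Figure Template.
    """
    height = len(labels) * (box_h + gap) + gap
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 {width} {height}" '
        f'width="{width}" height="{height}" font-family="sans-serif" font-size="13">'
    ]
    for i, label in enumerate(labels):
        y = gap + i * (box_h + gap)
        # Box with inline presentation attributes (no external stylesheet).
        parts.append(
            f'<rect x="10" y="{y}" width="{width - 20}" height="{box_h}" '
            f'rx="6" fill="#e8f0fe" stroke="#3b5bdb"/>'
        )
        # Centered label; escape() keeps text such as "Add & Norm" valid XML.
        parts.append(
            f'<text x="{width / 2}" y="{y + box_h / 2}" text-anchor="middle" '
            f'dominant-baseline="middle">{escape(label)}</text>'
        )
        # Connector to the next box, if there is one.
        if i + 1 < len(labels):
            parts.append(
                f'<line x1="{width / 2}" y1="{y + box_h}" '
                f'x2="{width / 2}" y2="{y + box_h + gap}" stroke="#333"/>'
            )
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(stacked_blocks_svg(["Softmax", "Linear", "Add & Norm"]))
```

Because the output carries its own styling and references nothing external, it can be saved as a `.svg` file and included in a paper as is.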

Getting started

The library and brief documentation are available in the GitHub repository. Three minimal walk-throughs ship as standalone HTML files (quickstart 1, quickstart 2, and quickstart 3) and are the best place to start.

Below are more advanced examples.

GPT (default style)

Original GPT architecture: Radford et al., Improving Language Understanding by Generative Pre-Training

[Figure: GPT architecture, default style. Blocks from input to output: Input tokens (sequence length) → Token Embedding + Positional Emb. → LayerNorm → Masked Multi-Head Attention → + → LayerNorm → FFN (MLP) → + → LayerNorm (final) → Linear Head → Softmax → Next-token probabilities (sequence length × vocabulary size).]
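The data flow in the GPT diagram above can be sketched in plain Python. This is a toy illustration under stated assumptions, not code from the library or from the GPT paper. It reads the block order off the diagram labels (LayerNorm before each sub-block, with the "+" nodes as residual sums), and it uses a single attention head without an output projection, one block, ReLU in the FFN, LayerNorm without learned scale or shift, and random weights. All function names and sizes are made up for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices (the '+' nodes)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    """Numerically stable softmax over one row."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(x, wq, wk, wv):
    """Single-head causal self-attention: position i sees positions 0..i only."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(d)])
    return out


def init_params(vocab=10, max_len=8, d=4, hidden=16, seed=0):
    """Small random weights; every size here is an arbitrary illustration value."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "tok_emb": rand(vocab, d), "pos_emb": rand(max_len, d),
        "wq": rand(d, d), "wk": rand(d, d), "wv": rand(d, d),
        "w1": rand(d, hidden), "w2": rand(hidden, d),
        "head": rand(d, vocab),
    }


def gpt_forward(tokens, p):
    """Map token ids (sequence length) to probabilities (sequence length x vocabulary size)."""
    # Token Embedding + Positional Emb.
    x = add([p["tok_emb"][t] for t in tokens], [p["pos_emb"][i] for i in range(len(tokens))])
    # LayerNorm -> Masked attention -> + (residual)
    x = add(x, masked_attention(layer_norm(x), p["wq"], p["wk"], p["wv"]))
    # LayerNorm -> FFN (MLP) -> + (residual); ReLU is a stand-in activation
    hidden = [[max(0.0, v) for v in row] for row in matmul(layer_norm(x), p["w1"])]
    x = add(x, matmul(hidden, p["w2"]))
    # LayerNorm (final) -> Linear Head -> Softmax
    logits = matmul(layer_norm(x), p["head"])
    return [softmax(row) for row in logits]


params = init_params()
probs = gpt_forward([1, 5, 2], params)
print(len(probs), "x", len(probs[0]))  # rows = sequence length, columns = vocabulary size
```

Each output row is a probability distribution over the vocabulary. Because of the causal mask, changing a later token leaves the rows for earlier positions unchanged.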

GPT (stylized)

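[Figure: GPT architecture, stylized. Same blocks and labels as the default-style GPT figure: Input tokens (sequence length) → Token Embedding + Positional Emb. → LayerNorm → Masked Multi-Head Attention → + → LayerNorm → FFN (MLP) → + → LayerNorm (final) → Linear Head → Softmax → Next-token probabilities (sequence length × vocabulary size).]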

Transformer (stylized)

Original Transformer architecture: Vaswani et al., Attention Is All You Need
Layout and colors inspired by dair-ai/ml-visuals/2.png

[Figure: Transformer encoder-decoder, stylized. Encoder: Inputs → Input Embedding + Positional Encoding → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm. Decoder: Outputs (shifted right) → Output Embedding + Positional Encoding → Masked Multi-Head Attention → Add & Norm → Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm → Linear → Softmax.]

Textual Inversion (default style)

Original Textual Inversion paper: Gal et al., An Image is Worth One Word

[Figure: Textual Inversion, default style. Labels: prompt "A photo of S*"; CLIP Text Encoder (Embedding Layer with existing tokens and S*, followed by Transformer Layers); Image x; Add Noise; noise ε; timestep t; U-Net; predicted ε̂.]
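The training signal in the diagram above (noise ε added to image x at timestep t, with the U-Net predicting ε̂) corresponds to the denoising objective used by Gal et al., minimized over the embedding of S* only. Here is a sketch in the diagram's symbols. The names v (the embedding of S*), x_t (the noised image), y (the prompt), and c (the text encoder output) are notation introduced here, and the paper applies the objective in the latent space of a latent diffusion model rather than directly on pixels.

```latex
\[
v_{*} \;=\; \arg\min_{v}\;
\mathbb{E}_{x,\; y,\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[\, \bigl\lVert \epsilon - \hat{\epsilon}\bigl(x_t,\, t,\, c(y; v)\bigr) \bigr\rVert_2^2 \,\right]
\]
```

The text encoder and the U-Net stay frozen. Only v, the new row of the embedding layer, receives gradient updates.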

VETIM (VETIM paper style)

Original VETIM architecture: Everaert et al., VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

[Figure: VETIM, VETIM paper style. Labels: prompt T[S*] ("A rendering of S* on a black background.") → Text encoder E, with token S* → E(T[S*]) → Image generation module (e.g. diffusion model) → Generated image; Sample images; prompt T[t] ("A rendering of an object on a black background. The object is a twisted, abstract sculpture made of delicate, interlocking tendrils of glass.") → Text encoder E → E(T[t]).]

Textual Inversion (VETIM paper style)

Original Textual Inversion paper: Gal et al., An Image is Worth One Word
Layout and colors after Everaert et al., VETIM: Expanding the Vocabulary of Text-to-Image Models only with Text

[Figure: Textual Inversion, VETIM paper style. Labels: prompt T[S*] ("An illustration of a S*") → Text encoder E, with token S* → E(T[S*]) → Image generation module (e.g. diffusion model) → Generated image; Sample images.]

BibTeX

@misc{everaert2026mlfigtemplate,
  author       = {{EPFL-IVRL} and Everaert, Martin Nicolas},
  title        = {{ML} {F}igure {T}emplate: {I}nteractive architecture diagrams for {ML} project pages},
  year         = {2026},
  howpublished = {\url{https://github.com/IVRL/ml-figure-template}}
}