Masked attention matrix showing that each token can only attend to itself and previous tokens, not future tokens in the sequence.

This image demonstrates the concept of masked attention used in Transformer decoder models. The matrix shows token-to-token visibility, where green cells indicate allowed attention and red cells indicate blocked attention. The key rule illustrated is that each token can only "see" itself and earlier tokens in the sequence, never future ones. This masking preserves the autoregressive property of sequence generation by preventing information leakage from future tokens, which is essential during language generation tasks such as translation or text completion.
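As a minimal sketch of how this mask is typically applied in practice (not the exact implementation behind the figure), the snippet below builds a lower-triangular mask and sets blocked positions to negative infinity before the softmax, so each row of attention weights only covers the current and earlier tokens. The function name and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d) tensors for a single attention head (illustrative shapes)
    seq_len, d = q.shape

    # Raw attention scores, scaled by sqrt(d)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (seq_len, seq_len)

    # Lower-triangular mask: True on/below the diagonal (allowed), False above (blocked)
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

    # Blocked (future) positions become -inf, so softmax assigns them zero weight
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # each row sums to 1 over visible tokens only

    return weights @ v, weights

# Example: 4 tokens with 8-dimensional embeddings
q = k = v = torch.randn(4, 8)
out, weights = causal_attention(q, k, v)
print(weights)  # the upper triangle is exactly 0: no attention to future tokens
```

Printing the weight matrix reproduces the pattern in the figure: non-zero entries appear only on and below the diagonal, matching the green "allowed" cells.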
