Also known as: multidimensional array, torch.Tensor, ndarray
TL;DR
A tensor is a multidimensional array — the rank-N generalization of scalars (rank 0), vectors (rank 1), and matrices (rank 2). In ML, 'tensor' means an n-dimensional array with a shape, dtype, and device.
A tensor, in machine learning, is a multidimensional array of numbers with three pieces of metadata: a shape (a tuple of axis lengths), a dtype (the numeric type of each element), and a device (CPU, GPU, TPU). It’s the universal container for everything that flows through a deep-learning system — inputs, weights, activations, gradients.
import torch

x = torch.zeros(32, 128, 768, dtype=torch.bfloat16, device="cuda")
# shape:  (32, 128, 768) -> a batch of 32 sequences of 128 tokens, each a 768-dim vector
# dtype:  bfloat16
# device: cuda:0
That’s the entire mental model. Everything beyond it is operations on the same object.
Beyond vectors and matrices
Lower-rank tensors have specific names:
Rank 0: scalar. A single number. torch.tensor(3.14).
Rank 1: vector. A 1D array of numbers.
Rank 2: matrix. A 2D grid of numbers.
Rank 3+: tensor. torch.zeros(B, S, D) is the canonical shape for a batch of B token sequences of length S with embedding dimension D.
The “rank” or “order” of a tensor is just the length of its shape tuple. An embedding matrix is rank 2; a batch of embeddings is rank 3; a batch of attention scores is rank 4 (batch, head, query, key).
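As a quick check of the "rank is the length of the shape tuple" statement, here is a minimal sketch. The sizes are arbitrary small values chosen for illustration, not shapes taken from any particular model.

```python
import torch

scalar = torch.tensor(3.14)        # rank 0, shape ()
vector = torch.zeros(8)            # rank 1, shape (8,)
matrix = torch.zeros(4, 8)         # rank 2, e.g. an embedding matrix
batch = torch.zeros(2, 4, 8)       # rank 3, (batch, sequence, embedding)
scores = torch.zeros(2, 3, 4, 4)   # rank 4, (batch, head, query, key)

tensors = (scalar, vector, matrix, batch, scores)

# .ndim is the rank; it always equals the length of the shape tuple.
ranks = [t.ndim for t in tensors]
print(ranks)  # [0, 1, 2, 3, 4]
```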
Tensor = array with shape, dtype, and device. The math is just NumPy with a GPU underneath.
Shape, dtype, device
These three attributes are the entire surface area of tensor metadata. Get them right and most ML code works; get them wrong and nothing does.
Shape is a tuple of axis lengths. By convention in transformers, the leftmost axis is the batch and the rightmost is the feature dimension. Attention tensors typically follow (batch, head, sequence, head_dim). The conventions are tribal — read the codebase you’re working in.
Dtype controls precision and memory. Common choices in modern ML include float32, float16, bfloat16, and int8.
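To make the memory side of dtype concrete, the sketch below computes the footprint of one (32, 128, 768) activation tensor under a few dtypes. The dtypes listed are an illustrative selection, not an exhaustive or authoritative list.

```python
import torch

shape = (32, 128, 768)

# Bytes needed to hold one tensor of this shape under each dtype.
footprint = {}
for dtype in (torch.float32, torch.float16, torch.bfloat16, torch.int8):
    x = torch.zeros(shape, dtype=dtype)
    footprint[dtype] = x.numel() * x.element_size()
    print(dtype, x.element_size(), "bytes/element,", footprint[dtype], "bytes total")
```

Halving the element size halves the memory for the same shape, which is a large part of why 16-bit dtypes are attractive for big models.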
Device is where the data lives. Operations between tensors on different devices error out — you can’t .matmul a CPU tensor against a CUDA tensor. The cost of moving tensors between devices is real (PCIe bandwidth), so the rule is load once, compute many.
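A minimal sketch of the "load once, compute many" pattern, assuming a machine that may or may not have a GPU (it falls back to CPU so the snippet runs anywhere). The tensor sizes and loop count are arbitrary.

```python
import torch

# Pick the GPU when one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move the weights to the target device once, outside the loop.
weights = torch.randn(768, 768).to(device)

outputs = []
for _ in range(3):
    # Create each batch directly on the target device to avoid an extra transfer.
    batch = torch.randn(32, 768, device=device)
    outputs.append(batch @ weights)  # both operands live on the same device

print(outputs[0].shape, outputs[0].device)
```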
Broadcasting
Broadcasting is the implicit-shape-alignment rule that lets you write (batch, 1, dim) * (1, seq, dim) and get (batch, seq, dim) without an explicit tile or repeat. The rules:
Align shapes from the right.
Where one tensor has size 1, treat it as if it had the other tensor’s size on that axis.
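Where one tensor has fewer axes, prepend size-1 axes until they match.
Where the sizes on an axis differ and neither is 1, the shapes are incompatible and the operation raises an error.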
So (3, 1, 5) and (7, 5) both expand to (3, 7, 5) for the operation. The expansion is virtual: broadcasting never copies the inputs. It gives each expanded axis a stride of 0, so the same elements are reused along that axis.
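The (3, 1, 5) and (7, 5) example can be checked directly. This sketch uses torch.broadcast_shapes to compute the result shape and expand to make the virtual expansion visible; the variable names are illustrative.

```python
import torch

a = torch.ones(3, 1, 5)
b = torch.ones(7, 5)

out = a + b
print(out.shape)  # torch.Size([3, 7, 5])

# The result shape can be computed without touching any data.
result_shape = torch.broadcast_shapes(a.shape, b.shape)
print(result_shape)  # torch.Size([3, 7, 5])

# expand performs the same virtual expansion explicitly:
# the broadcast axis gets stride 0 and no memory is copied.
a_expanded = a.expand(3, 7, 5)
print(a_expanded.stride())                    # (5, 0, 1)
print(a_expanded.data_ptr() == a.data_ptr())  # True
```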
Strided memory and contiguity
A tensor’s data lives in a flat 1D buffer. Its shape plus its strides (how many elements to step in memory to advance one position along each axis) define how multidimensional indexing maps onto that buffer.
x = torch.zeros(3, 4)
x.stride()  # (4, 1) -> step 4 to move down a row, step 1 to move along
y = x.t()   # transpose
y.stride()  # (1, 4) -> same memory, different strides; non-contiguous
A tensor is contiguous when its memory layout matches its shape in row-major order. permute, transpose, and strided slicing typically produce non-contiguous tensors: same data, rearranged strides, no copy.
This matters because:
view requires strides compatible with the new shape (in practice, usually contiguous memory) and raises a RuntimeError otherwise. reshape returns a view when it can and makes a contiguous copy when it must.
Many CUDA kernels are optimized for contiguous inputs and either fall back to slow paths or error on non-contiguous ones.
Calling .contiguous() is a memory copy. Doing it in a hot loop is a real cost.
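The points above can be seen on a small transposed tensor. This is a sketch on a 3x4 example; it compares data pointers to show when a copy happens.

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x.t()  # shape (4, 3), strides (1, 4): same memory, non-contiguous

# view cannot flatten the transposed layout without moving data.
try:
    y.view(12)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape succeeds by copying into a fresh contiguous buffer.
flat = y.reshape(12)
reshape_copied = flat.data_ptr() != x.data_ptr()

# .contiguous() makes the same kind of copy explicitly.
z = y.contiguous()
print(view_failed, reshape_copied, z.stride())  # True True (3, 1)
```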
Most of the time, PyTorch’s reshape/view machinery handles contiguity transparently. The places it surfaces: (a) right after a permute or transpose when the next op needs view; (b) before passing into a custom CUDA kernel that doesn’t handle strided inputs; (c) in torch.compile’d code where the compiler can’t prove a tensor is contiguous; (d) when measuring memory carefully — non-contiguous tensors can pin more memory than their shape suggests because they reference the parent’s full buffer.
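Point (d) is easy to miss, so here is a small sketch of it. A narrow column slice of a large tensor shares the parent's buffer, and therefore keeps the whole buffer alive, until it is copied. The sizes are arbitrary.

```python
import torch

parent = torch.zeros(1000, 1000)  # about 4 MB of float32 data
cols = parent[:, :2]              # shape (1000, 2), strides (1000, 1)

# The slice is a non-contiguous view into the parent's buffer.
shares_buffer = cols.data_ptr() == parent.data_ptr()
print(cols.is_contiguous(), shares_buffer)  # False True

# Copying gives the slice its own small buffer, so the parent
# can be freed once nothing else references it.
independent = cols.contiguous()
print(independent.data_ptr() != parent.data_ptr())  # True
```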
ML-tensors vs math-tensors (the rant)
In physics and differential geometry, a “tensor” is a multilinear map — an object that transforms in a specific way under coordinate changes. The covariant/contravariant index gymnastics, the metric tensor, the stress tensor — these are tensors in the strict mathematical sense.
In machine learning, “tensor” just means n-dimensional array. There is no transformation rule. torch.Tensor and numpy.ndarray are interchangeable concepts; the only reason we call them tensors is that early deep-learning frameworks (Theano, Torch, then TensorFlow) chose the name and it stuck.
This drives mathematicians up a wall, and they’re not wrong — the terminology is sloppy. But it’s also entrenched beyond rescue. When an ML person says “tensor” they mean array; when a physicist says “tensor” they mean multilinear map. Same word, almost-disjoint meanings.
An ML “tensor” is an n-dimensional array with a GPU pointer. The transformation rules are not part of the deal.
In practice, the only time the distinction matters is when reading older mathematical literature — at which point you’ll need to read carefully and figure out which sense the author meant. In modern ML papers, “tensor” always means the array.
Go further
Why does 'shape' matter so much in ML code?
Almost every bug in deep-learning code is a shape bug. The model expects (batch, seq, hidden) and you handed it (seq, batch, hidden); the loss expects logits of shape (batch, classes) and got (batch, classes, 1). Most production codebases annotate shapes in comments, in docstrings, or via jaxtyping/torchtyping to make these mismatches surface at the type level.
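Short of adopting a typing library, a plain runtime check already catches the (seq, batch, hidden) mix-up described above. The check_shape helper below is hypothetical, written for this example, and not part of PyTorch, jaxtyping, or torchtyping.

```python
import torch

def check_shape(t, expected, name="tensor"):
    """Hypothetical helper: raise if t's shape differs from expected.

    Use None in expected to accept any size on that axis.
    """
    if t.ndim != len(expected) or any(
        e is not None and e != s for e, s in zip(expected, t.shape)
    ):
        raise ValueError(f"{name}: expected shape {expected}, got {tuple(t.shape)}")
    return t

batch, seq, hidden = 4, 16, 32
good = torch.zeros(batch, seq, hidden)
bad = good.transpose(0, 1)  # (seq, batch, hidden): the classic mix-up

check_shape(good, (batch, seq, hidden), "inputs")  # passes silently

try:
    check_shape(bad, (batch, seq, hidden), "inputs")
    caught = False
except ValueError as err:
    caught = True
    message = str(err)
    print(message)
```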
What's the difference between view, reshape, and permute?
view is a zero-copy reinterpretation of the same memory under a new shape — only valid on contiguous tensors. reshape is view if it can be, a copy otherwise. permute reorders axes — it produces a non-contiguous tensor with the same data but different strides. After permute, a view will fail until you call .contiguous().
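A common place all three show up together is splitting a hidden dimension into attention heads. The sketch below assumes illustrative sizes B, S, H, D (batch, sequence, heads, head dimension); the layout is one common convention, not the only one.

```python
import torch

B, S, H, D = 2, 5, 4, 8
x = torch.randn(B, S, H * D)

# view: zero-copy split of the last axis into (H, D).
# permute: reorder to (B, H, S, D), which leaves the tensor non-contiguous.
heads = x.view(B, S, H, D).permute(0, 2, 1, 3)

# Merging the batch and head axes needs a layout view cannot express.
try:
    heads.view(B * H, S, D)
    view_failed = False
except RuntimeError:
    view_failed = True

# Either make the tensor contiguous first, or let reshape copy for you.
flat_a = heads.contiguous().view(B * H, S, D)
flat_b = heads.reshape(B * H, S, D)
print(view_failed, flat_a.shape)  # True torch.Size([8, 5, 8])
```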
How does broadcasting work?
NumPy/PyTorch broadcasting aligns shapes from the right and treats size-1 axes as expandable. So (B, 1, D) broadcasts against (1, S, D) to give (B, S, D) without copying the inputs; only the strides change. The rules are mechanical but the failure mode is bad: when a transpose or reshape is missing, the operation can still broadcast and silently produce an output of the wrong shape instead of raising an error. Always print .shape.
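One well-known instance of that silent failure is subtracting a flat target vector from a column of predictions. The names pred and target are illustrative.

```python
import torch

pred = torch.zeros(8, 1)  # model output with a trailing size-1 axis
target = torch.ones(8)    # labels as a flat vector

# (8, 1) against (8,) broadcasts to (8, 8): every prediction is compared
# with every target, and nothing raises.
diff = pred - target
print(diff.shape)   # torch.Size([8, 8])

# Dropping the trailing axis first gives the intended elementwise result.
fixed = pred.squeeze(-1) - target
print(fixed.shape)  # torch.Size([8])
```

A loss computed from diff would still be a scalar, which is why the bug tends to go unnoticed until someone prints the intermediate shape.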