In this blog post, Mastering Common Tensor Operations for AI and Data Workloads, we break down the everyday moves you need to work with tensors, the data structure behind modern AI.
Tensors are to machine learning what spreadsheets are to finance: a compact, structured way to hold numbers and transform them fast. Whether you are building a model, cleaning data, or optimizing inference on GPUs, knowing the common tensor operations saves time and unlocks performance. Below, we start with the concepts, then walk through practical steps you can apply immediately.
What is a tensor, really?
A tensor is a multi-dimensional array. The key points are:
- Rank: number of dimensions (scalars 0D, vectors 1D, matrices 2D, etc.).
- Shape: size along each dimension, e.g., (batch, channels, height, width).
- Dtype: numeric type like float32, float16, int64.
- Device: where it lives (CPU or GPU).
Most ML libraries (NumPy, PyTorch, TensorFlow) expose similar operations: creation, indexing, reshaping, broadcasting, elementwise math, reductions, and linear algebra. The technology behind their speed includes contiguous memory layouts, vectorized CPU instructions, GPU kernels, and just-in-time operator fusion. Understanding these helps you write code that is both clear and fast.
Quick mental model
Think in batches and axes. A 4D image batch might be (N, C, H, W). Most ops either:
- Preserve shape (elementwise add, multiply).
- Reduce dimensions (sum/mean over an axis).
- Rearrange dimensions (reshape, transpose/permute).
- Combine tensors (concatenate/stack, matmul).
Broadcasting lets you operate on different shapes by virtually expanding dimensions of size 1 without copying data, which is both elegant and efficient.
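As a quick illustration of that mental model, here is a minimal sketch (the tensor names and scale values are made up for the example) that applies a per-channel scale to an (N, C, H, W) batch with broadcasting:
import torch
images = torch.randn(8, 3, 32, 32)        # (N, C, H, W) batch
scale = torch.tensor([0.5, 1.0, 2.0])     # one factor per channel, shape (3,)
scaled = images * scale.view(1, 3, 1, 1)  # broadcasts to (8, 3, 32, 32) without copying
print(scaled.shape)                       # torch.Size([8, 3, 32, 32])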
Essential operations with PyTorch examples
The NumPy equivalents are almost identical. Swap torch for numpy and you are 90% there.
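For instance, the same reduction reads almost identically in both libraries; here is a minimal sketch (the values are arbitrary), the main difference being NumPy's axis= versus PyTorch's dim=:
import numpy as np
import torch
a_np = np.arange(6).reshape(2, 3)    # NumPy array
a_pt = torch.arange(6).reshape(2, 3) # PyTorch tensor
print(a_np.sum(axis=0))              # [3 5 7]
print(a_pt.sum(dim=0))               # tensor([3, 5, 7])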
Creation and dtype/device
import torch
a = torch.zeros(2, 3) # 2x3 of zeros
b = torch.ones((3, 4), dtype=torch.float32)
c = torch.arange(0, 10) # 0..9
d = torch.linspace(0, 1, steps=5) # 0.00..1.00 (5 points)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
e = torch.randn(8, 8, device=device) # on GPU if available
Inspecting shape and rank
x = torch.randn(32, 64, 128)
print(x.shape) # torch.Size([32, 64, 128])
print(x.ndim) # 3
print(x.dtype) # torch.float32 by default
Indexing and slicing
x = torch.arange(12).reshape(3, 4)
row0 = x[0] # first row, shape (4,)
col1 = x[:, 1] # second column, shape (3,)
block = x[0:2, 2:] # rows [0,1], cols [2,3]
mask = x % 2 == 0
only_even = x[mask]
Reshape, view, transpose
Use reshape when you do not care if the result is a view or a copy. Use view only when your tensor is contiguous in memory.
y = torch.arange(24)
Y = y.reshape(4, 6) # shape (4,6)
Yt = Y.transpose(0, 1) # swap axes -> (6,4)
Yp = Y.permute(1, 0) # same as transpose in 2D
Yc = Yt.contiguous() # make memory contiguous
flat = Y.view(-1) # works only if contiguous
Broadcasting
Dimensions match from the right; a dimension of size 1 can expand. This avoids explicit loops.
a = torch.randn(3, 1)
b = torch.randn(1, 4)
C = a + b # result shape (3, 4)
Elementwise math and reductions
z = torch.randn(5, 5)
expz = torch.exp(z)
logz = torch.log(torch.clamp_min(expz, 1e-8))
row_sum = z.sum(dim=1) # sum over dim 1 (columns) -> one value per row, shape (5,)
col_mean = z.mean(dim=0) # mean over dim 0 (rows) -> one value per column, shape (5,)
max_vals, argmax_idx = z.max(dim=1)
Linear algebra
A = torch.randn(16, 32)
B = torch.randn(32, 64)
C = A @ B # matrix multiply -> (16, 64)
# Batch matmul: (N, I, J) @ (N, J, K) -> (N, I, K)
X = torch.randn(10, 32, 128)
W = torch.randn(10, 128, 64)
Y = torch.bmm(X, W)
# Einsum is concise and powerful
Y2 = torch.einsum('bij,bjk->bik', X, W)
Concatenate and stack
t1 = torch.randn(2, 3)
t2 = torch.randn(4, 3)
cat0 = torch.cat([t1, t2], dim=0) # (6, 3)
s1 = torch.randn(3)
s2 = torch.randn(3)
stacked = torch.stack([s1, s2], dim=0) # (2, 3)
Type casting and normalization
x = torch.randn(1024, 1024)
x16 = x.to(torch.float16)
# Z-score normalization across last dim
eps = 1e-6
norm = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + eps)
Autograd essentials
Tensors track gradients when requires_grad=True. Watch out: some in-place ops can break gradient history.
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
print(w.grad) # 2 * w
# In-place caution: w.add_(1) errors on a leaf that requires grad; other in-place ops can invalidate the graph
Performance tips that matter
- Prefer vectorization over Python loops. Let the library dispatch optimized kernels.
- Use broadcasting instead of manual expand/tiling to save memory.
- Mind contiguity. After permute/transpose, call contiguous() before view; or use reshape, which falls back to a copy if needed.
- Choose dtypes wisely. float32 for training, float16/bfloat16 for inference when possible.
- Use GPU where it counts. Move data and models once: tensor = tensor.to('cuda'). Avoid ping-ponging between CPU and GPU.
- Batch your work. GPUs love large, regular batches; too small and kernel launch overhead dominates.
- Avoid unnecessary .item() or Python-side loops that break parallelism.
- Profile early. torch.autograd.profiler or the PyTorch Profiler will show hot ops; see the sketch after this list.
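A minimal profiling sketch, assuming you already have a model and an input batch defined (both names are placeholders):
import torch
from torch.profiler import profile, ProfilerActivity
# model and batch stand in for your own module and input tensor
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    out = model(batch)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))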
Mixed precision inference
model.eval()
if torch.cuda.is_available():
    with torch.autocast(device_type='cuda', dtype=torch.float16):
        out = model(x.to('cuda'))
Mixed precision reduces memory bandwidth and can double throughput on modern GPUs, with minimal accuracy loss for many models.
Common patterns worth mastering
Channel/feature last vs first
Know your layout. Vision models often use (N, C, H, W). Some preprocessors use (N, H, W, C). Use permute to align:
images_nhwc = torch.randn(32, 224, 224, 3)
images_nchw = images_nhwc.permute(0, 3, 1, 2)
Masking for conditional updates
x = torch.randn(100)
mask = x > 0
x = torch.where(mask, x, torch.zeros_like(x)) # zero negatives
Safe numerical practices
- Use eps when dividing by a std or norm.
- Clamp probabilities to [1e-6, 1 - 1e-6] before log.
- Prefer stable formulations (e.g., logsumexp) for softmax/log-likelihood; see the sketch after this list.
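A minimal sketch of that last point, computing log-softmax via logsumexp rather than log(softmax(...)):
import torch
logits = torch.randn(4, 10)                        # e.g., 4 samples, 10 classes
naive = torch.log(torch.softmax(logits, dim=-1))   # can underflow for very negative logits
stable = logits - torch.logsumexp(logits, dim=-1, keepdim=True)  # numerically stable
print(torch.allclose(naive, stable, atol=1e-6))    # True for well-behaved inputs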
How this maps to cloud workloads
On cloud infrastructure, tensor operations dominate compute time. A few practical steps:
- Right-size the GPU. If your workload is memory-bound (lots of large elementwise ops), higher memory bandwidth may matter more than raw FLOPs.
- Pin data loading. Use pinned memory for CPU→GPU transfers to reduce stalls; see the sketch after this list.
- Minimize host-device transfers. Stage tensors on GPU and keep them there for the full pipeline.
- Exploit batch inference. Aggregate requests to form larger tensors for better GPU utilization.
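A minimal sketch of pinned-memory loading with non-blocking transfers, assuming you already have a torch Dataset (the dataset name is a placeholder):
import torch
from torch.utils.data import DataLoader
# dataset stands in for your own torch.utils.data.Dataset
loader = DataLoader(dataset, batch_size=256, pin_memory=True, num_workers=4)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
for batch in loader:
    batch = batch.to(device, non_blocking=True)  # overlaps the copy with compute when pinned
    # ... run the model on batch ...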
Cheat sheet of go-to ops
- Creation: zeros, ones, arange, linspace, randn
- Layout: reshape, view, transpose, permute, contiguous
- Selection: indexing, slicing, boolean masks, where, gather
- Math: add, mul, exp, log, clamp, normalize
- Reduction: sum, mean, max/min, argmax/argmin
- Combine: cat, stack, matmul/@, bmm, einsum
- Types/devices: to(dtype), to(device), float16/bfloat16
Wrapping up
Tensors are the language of modern AI. If you internalize shapes, broadcasting, and a handful of layout and math routines, most problems get simpler and faster. Start by replacing loops with vectorized tensor code, keep an eye on device placement, and profile the hotspots. The payoff is cleaner code and real speed on CPUs and GPUs.
If you are running these workloads in the cloud, the same principles scale: batch well, minimize transfers, and pick the right instance class for your tensor mix. When you are ready to operationalize models, CloudProinc.com.au can help you tune infrastructure for both cost and performance.