*This post is based on [this](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html#nn-module) pyTorch tutorial.*
We'll construct a simple network to predict y from x by minimizing the squared Euclidean distance. In detail, it'll be a fully connected network with one hidden layer (at first) and a sigmoid activation function. The point of this post is not to build a cutting-edge regression method, but to demonstrate how easy it is to go about it with pyTorch. Let's start by defining a function we'd like to learn: for something simple but non-linear we'll combine harmonic functions, but throw in a complication, namely that the sampling is dense only in the center. For good measure, we'll also add some noise because *nothing is noise-free*. To keep it easy to visualize, the feature space is two-dimensional and the output variable is scalar.
```python
import torch
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 1000, 2, 50, 1
# Create the training tensors: random inputs and the corresponding (noisy) outputs
x = torch.randn(N, D_in) * 3.1415   # dense near the origin, sparse further out
y = (x[:,0].sin() + x[:,1].cos()).unsqueeze(1)
noise = torch.randn(N, D_out) * 1e-1
y += noise
```
Let's first have a look at our data:
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('website')
plt.scatter(x.numpy()[:, 0], x.numpy()[:, 1], c=y.numpy()[:, 0])
plt.xlim(-10,10)
plt.ylim(-10,10)
```

It's a clearly visible deterministic signal with sparse sampling in the outskirts. Now let's design a network to learn the mapping $X\rightarrow Y$.
This implementation uses the `nn` package from PyTorch to build the network. PyTorch autograd makes it easy to define computational graphs and take gradients, but raw autograd can be a bit too low-level for defining complex neural networks; this is where the nn package can help. The nn package defines a set of Modules, which you can think of as neural network layers that produce output from input and may have some trainable parameters.
```python
# Use the nn package to define a simple model as a sequence of layers.
# Here: Linear / Sigmoid / Linear,
# i.e. a multi-layer perceptron with one hidden layer and a linear mapping at the end
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign
        them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we
        must return a Tensor of output data. We can use Modules defined
        in the constructor as well as arbitrary operators on Tensors.
        """
        h = self.linear1(x).sigmoid()
        y_pred = self.linear2(h)
        return y_pred
```
The elegant part of pyTorch (and tensorflow) is that the class simply declares how the data `x` flows through the network: a linear mapping (with a matrix and a bias vector) going from 2 dimensions to 50; a sigmoid transformation, applied element-wise; another linear mapping going from 50 dimensions to 1. Done.
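Written out, the model above computes

$$\hat{y} = W_2\,\sigma(W_1 x + b_1) + b_2, \qquad W_1 \in \mathbb{R}^{50\times 2},\; b_1 \in \mathbb{R}^{50},\; W_2 \in \mathbb{R}^{1\times 50},\; b_2 \in \mathbb{R},$$

where $\sigma$ is the element-wise sigmoid.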
Because that was such a lot of code to write (:sarcasm:), pyTorch makes it even easier:
```python
# nn.Sequential is a Module which contains other Modules,
# and applies them in sequence to produce its output.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Sigmoid(),
    torch.nn.Linear(H, D_out),
)
```
In essence, it provides the building blocks for constructing complex mappings with linear and non-linear operations.
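For example (a hypothetical variant, not used in the rest of this post), swapping in a different non-linearity or adding another layer only changes the list of modules:
```python
# Hypothetical variant of the model: same building blocks, different
# non-linearity and one more hidden layer. Not used below.
alt_model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Tanh(),
    torch.nn.Linear(H, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
```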
To optimize the parameters of this model (here: the matrices and bias vectors of `linear1` and `linear2`), we have to define a loss function. For regression that's the usual Mean Squared Error (MSE) of the predicted value $\hat{y}$: $\lVert y-\hat{y}\rVert_2^2$ (here summed over the samples rather than averaged, hence `reduction='sum'` below). It's already coded up in pyTorch, together with several other common loss functions, e.g. for classification tasks.
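As a quick sketch (the tensor `y_hat` is just a random placeholder for predictions, introduced only for this check), `torch.nn.MSELoss` with `reduction='sum'` is exactly that summed squared error:
```python
# Sketch: MSELoss(reduction='sum') equals the summed squared error computed by hand.
loss_fn = torch.nn.MSELoss(reduction='sum')
y_hat = torch.randn(N, D_out)  # placeholder predictions, just for this check
assert torch.allclose(loss_fn(y_hat, y), ((y - y_hat)**2).sum())
```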
And then we do what one always does in gradient-based optimization, namely take steps in the negative gradient direction. This is where pyTorch's automatic differentiation is so useful: because all transformations are declared in the class `TwoLayerNet`, the code figures out how to calculate the gradients on its own.
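As a minimal sketch of what autograd does (the tensor `w` is hypothetical and unrelated to our model):
```python
# Minimal autograd example: gradients appear in .grad after calling .backward().
w = torch.tensor([2.0, -1.0], requires_grad=True)
f = (w**2).sum()   # f(w) = w_0^2 + w_1^2
f.backward()       # fills w.grad with df/dw = 2*w
print(w.grad)      # tensor([ 4., -2.])
```
Back to our model, the training loop then only has to perform the following steps: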
```python
# initialize the model
model = TwoLayerNet(D_in, H, D_out)

# The nn package also contains definitions of popular loss functions;
# in this case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(1001):
    # Forward pass: compute predicted y by passing x to the model.
    # Module objects override the __call__ operator so you can call them
    # like functions. When doing so you pass a Tensor of input data to the
    # Module and it produces a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and
    # true values of y, and the loss function returns a Tensor containing
    # the loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the
    # learnable parameters of the model. Internally, the parameters of
    # each Module are stored in Tensors with requires_grad=True, so this
    # call computes gradients for all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a
    # Tensor, so we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
```
0 933.7546997070312
100 898.05712890625
200 827.8410034179688
300 703.25537109375
400 613.7958984375
500 529.8275756835938
600 433.88458251953125
700 351.5511169433594
800 291.5534973144531
900 254.2714385986328
1000 231.09042358398438
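Each pass through the loop above performs one plain (full-batch) gradient-descent update,

$$\theta \leftarrow \theta - \eta\,\nabla_\theta L(\theta),$$

where $\theta$ stands for all weights and biases, $L$ is the summed squared error, and $\eta$ is the `learning_rate`.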
Keeping the steps small (the factor `learning_rate`) is supposed to ensure that we can converge: taking simultaneous downhill gradient steps for many parameters at once is not guaranteed to be a stable (or beneficial) thing to do otherwise. Let's look at the prediction on a finer grid:
```python
# Evaluate the model on a regular 100x100 grid covering [-10, 10] x [-10, 10]
xt = torch.linspace(-10, 10, 100)
xt = torch.stack(torch.meshgrid((xt, xt)), dim=2)  # shape (100, 100, 2)
yt = model(xt).detach()       # need to detach from the graph for further use
yt = torch.t(yt.squeeze(2))   # drop the trailing dimension, transpose for plotting
plt.imshow(yt, extent=(-10, 10, -10, 10))
```

The prediction is quite decent where the sampling of the data is dense, but it gets very shaky in the corners. That's common:
> Don't use networks where they have to extrapolate!
Also, depending on the data and the width of the network, this direct optimization can work or fail quite miserably. This has to do with the loss function not being convex in the parameters, so a plain gradient approach may not be ideal. Instead we'll use another optimization method, `Adam`, which adapts the step size of each parameter from running averages of past gradients and can cope with a loss that occasionally increases along the way.
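For reference, this is the standard Adam update (not spelled out in the original tutorial); with gradient $g_t = \nabla_\theta L(\theta_{t-1})$ and the default constants $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2,$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}.$$

The running averages $m_t$ and $v_t$ give each parameter its own effective step size.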
Again, it's already coded up in pyTorch:
```python
# initialize the model
model = TwoLayerNet(D_in, H, D_out)

# Use the optim package to define an Optimizer that will update the
# weights of the model for us. Here we will use Adam; the optim package
# contains many other optimization algorithms. The first argument to the
# Adam constructor tells the optimizer which Tensors it should update.
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(1000):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e. not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters.
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters.
    optimizer.step()
```
0 1078.5792236328125
100 899.9740600585938
200 864.6139526367188
300 800.7428588867188
400 707.2975463867188
500 603.0485229492188
600 510.47747802734375
700 437.02008056640625
800 377.6645812988281
900 329.94488525390625
```python
yt = model(xt).detach()      # evaluate the re-trained model on the same grid
yt = torch.t(yt.squeeze(2))
plt.imshow(yt, extent=(-10, 10, -10, 10))
```

This still may not be overly great, so let's try to make our model more complex by adding another fully connected layer (which adds $50\cdot50+50$ parameters, as we'll verify below) and keep going for 2000 steps:
```python
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.Sigmoid(),
    torch.nn.Linear(H, H),
    torch.nn.Sigmoid(),
    torch.nn.Linear(H, D_out),
)

learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(2000):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    if t % 100 == 0:
        print(t, loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
0 1446.3250732421875
100 914.260009765625
200 898.2183837890625
300 862.9479370117188
400 778.392333984375
500 592.9867553710938
600 402.9549255371094
700 315.80804443359375
800 248.01148986816406
900 208.93670654296875
1000 183.0381317138672
1100 165.38995361328125
1200 151.40463256835938
1300 138.21383666992188
1400 127.8183822631836
1500 118.86930084228516
1600 110.86335754394531
1700 103.80579376220703
1800 97.83943939208984
1900 92.97871398925781
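As a quick sanity check of the parameter count claimed above (a small aside, not part of the original tutorial), we can ask the model directly:
```python
# Sanity check: count the trainable parameters of the deeper model.
# Linear(2, 50):  2*50 + 50  = 150
# Linear(50, 50): 50*50 + 50 = 2550
# Linear(50, 1):  50*1 + 1   = 51
print(sum(p.numel() for p in model.parameters()))  # 2751
```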
```python
yt = model(xt).detach()
yt = torch.t(yt.squeeze(2))
plt.imshow(yt, extent=(-10,10,-10,10))
```

Not too bad. And that was done with only a few lines of code, which allows for quick exploration of what works and what doesn't.
My summary:
> pyTorch hides the complexity of tensorflow and thereby emphasizes the structure of the network