13. Boltzmann Machines#
Why use a generative model rather than the more well known discriminative deep neural networks (DNN)?
Discriminitave methods have several limitations: They are mainly supervised learning methods, thus requiring labeled data. And there are tasks they cannot accomplish, like drawing new examples from an unknown probability distribution.
A generative model can learn to represent and sample from a probability distribution. The core idea is to learn a parametric model of the probability distribution from which the training data was drawn. As an example
a. A model for images could learn to draw new examples of cats and dogs, given a training dataset of images of cats and dogs.
b. Generate a sample of an ordered or disordered Ising model phase, having been given samples of such phases.
c. Model the trial function for Monte Carlo calculations
Both use gradient-descent based learning procedures for minimizing cost functions
Energy based models don’t use backpropagation and automatic differentiation for computing gradients, instead turning to Markov Chain Monte Carlo methods.
DNNs often have several hidden layers. A restricted Boltzmann machine has only one hidden layer, however several RBMs can be stacked to make up Deep Belief Networks, of which they constitute the building blocks.
History: The RBM was developed by amongst others Geoffrey Hinton, called by some the “Godfather of Deep Learning”, working with the University of Toronto and Google.
A BM is what we would call an undirected probabilistic graphical model with stochastic continuous or discrete units.
It is interpreted as a stochastic recurrent neural network where the state of each unit(neurons/nodes) depends on the units it is connected to. The weights in the network represent thus the strength of the interaction between various units/nodes.
It turns into a Hopfield network if we choose deterministic rather than stochastic units. In contrast to a Hopfield network, a BM is a so-called generative model. It allows us to generate new samples from the learned distribution.
A standard BM network is divided into a set of observable and visible units \(\hat{x}\) and a set of unknown hidden units/nodes \(\hat{h}\).
Additionally there can be bias nodes for the hidden and visible layers. These biases are normally set to \(1\).
BMs are stackable, meaning they cwe can train a BM which serves as input to another BM. We can construct deep networks for learning complex PDFs. The layers can be trained one after another, a feature which makes them popular in deep learning
However, they are often hard to train. This leads to the introduction of so-called restricted BMs, or RBMS. Here we take away all lateral connections between nodes in the visible layer as well as connections between nodes in the hidden layer. The network is illustrated in the figure below.
Figure 1:
13.1. The network#
The network layers:
A function \(\mathbf{x}\) that represents the visible layer, a vector of \(M\) elements (nodes). This layer represents both what the RBM might be given as training input, and what we want it to be able to reconstruct. This might for example be the pixels of an image, the spin values of the Ising model, or coefficients representing speech.
The function \(\mathbf{h}\) represents the hidden, or latent, layer. A vector of \(N\) elements (nodes). Also called “feature detectors”.
The goal of the hidden layer is to increase the model’s expressive power. We encode complex interactions between visible variables by introducing additional, hidden variables that interact with visible degrees of freedom in a simple manner, yet still reproduce the complex correlations between visible degrees in the data once marginalized over (integrated out).
Examples of this trick being employed in physics:
The Hubbard-Stratonovich transformation
The introduction of ghost fields in gauge theory
Shadow wave functions in Quantum Monte Carlo simulations
The network parameters, to be optimized/learned:
\(\mathbf{a}\) represents the visible bias, a vector of same length as \(\mathbf{x}\).
\(\mathbf{b}\) represents the hidden bias, a vector of same lenght as \(\mathbf{h}\).
\(W\) represents the interaction weights, a matrix of size \(M\times N\).
13.1.1. Joint distribution#
The restricted Boltzmann machine is described by a Bolztmann distribution
where \(Z\) is the normalization constant or partition function, defined as
It is common to ignore \(T_0\) by setting it to one.
13.1.2. Network Elements, the energy function#
The function \(E(\mathbf{x},\mathbf{h})\) gives the energy of a configuration (pair of vectors) \((\mathbf{x}, \mathbf{h})\). The lower the energy of a configuration, the higher the probability of it. This function also depends on the parameters \(\mathbf{a}\), \(\mathbf{b}\) and \(W\). Thus, when we adjust them during the learning procedure, we are adjusting the energy function to best fit our problem.
13.1.3. Defining different types of RBMs#
There are different variants of RBMs, and the differences lie in the types of visible and hidden units we choose as well as in the implementation of the energy function \(E(\mathbf{x},\mathbf{h})\). The connection between the nodes in the two layers is given by the weights \(w_{ij}\).
Binary-Binary RBM:
RBMs were first developed using binary units in both the visible and hidden layer. The corresponding energy function is defined as follows:
where the binary values taken on by the nodes are most commonly 0 and 1.
Gaussian-Binary RBM:
Another varient is the RBM where the visible units are Gaussian while the hidden units remain binary:
RBMs are Useful when we model continuous data (i.e., we wish \(\mathbf{x}\) to be continuous)
Requires a smaller learning rate, since there’s no upper bound to the value a component might take in the reconstruction
Other types of units include:
Softmax and multinomial units
Gaussian visible and hidden units
Binomial units
Rectified linear units
13.1.4. Cost function#
When working with a training dataset, the most common training approach is maximizing the log-likelihood of the training data. The log likelihood characterizes the log-probability of generating the observed data using our generative model. Using this method our cost function is chosen as the negative log-likelihood. The learning then consists of trying to find parameters that maximize the probability of the dataset, and is known as Maximum Likelihood Estimation (MLE). Denoting the parameters as \(\boldsymbol{\theta} = a_1,...,a_M,b_1,...,b_N,w_{11},...,w_{MN}\), the log-likelihood is given by
where we used that the normalization constant does not depend on the data, \(\langle \text{log} Z(\{ \theta_i\}) \rangle = \text{log} Z(\{ \theta_i\})\) Our cost function is the negative log-likelihood, \(\mathcal{C}(\{ \theta_i \}) = - \mathcal{L}(\{ \theta_i \})\)
13.1.5. Optimization / Training#
The training procedure of choice often is Stochastic Gradient Descent (SGD). It consists of a series of iterations where we update the parameters according to the equation
at each \(k\)-th iteration. There are a range of variants of the algorithm which aim at making the learning rate \(\eta\) more adaptive so the method might be more efficient while remaining stable.
We now need the gradient of the cost function in order to minimize it. We find that
where in order to simplify notation we defined the “operator”
and used the statistical mechanics relationship between expectation values and the log-partition function:
The data-dependent term in the gradient is known as the positive phase of the gradient, while the model-dependent term is known as the negative phase of the gradient. The aim of the training is to lower the energy of configurations that are near observed data points (increasing their probability), and raising the energy of configurations that are far from observed data points (decreasing their probability).
The gradient of the negative log-likelihood cost function of a Binary-Binary RBM is then
To get the expectation values with respect to the data, we set the visible units to each of the observed samples in the training data, then update the hidden units according to the conditional probability found before. We then average over all samples in the training data to calculate expectation values with respect to the data.
13.1.6. Kullback-Leibler relative entropy#
When the goal of the training is to approximate a probability distribution, as it is in generative modeling, another relevant measure is the Kullback-Leibler divergence, also known as the relative entropy or Shannon entropy. It is a non-symmetric measure of the dissimilarity between two probability density functions \(p\) and \(q\). If \(p\) is the unkown probability which we approximate with \(q\), we can measure the difference by
Thus, the Kullback-Leibler divergence between the distribution of the training data \(f(\boldsymbol{x})\) and the model distribution \(p(\boldsymbol{x}| \boldsymbol{\theta})\) is
The first term is constant with respect to \(\boldsymbol{\theta}\) since \(f(\boldsymbol{x})\) is independent of \(\boldsymbol{\theta}\). Thus the Kullback-Leibler Divergence is minimal when the second term is minimal. The second term is the log-likelihood cost function, hence minimizing the Kullback-Leibler divergence is equivalent to maximizing the log-likelihood.
To further understand generative models it is useful to study the gradient of the cost function which is needed in order to minimize it using methods like stochastic gradient descent.
The partition function is the generating function of expectation values, in particular there are mathematical relationships between expectation values and the log-partition function. In this case we have
Here \(\langle \cdot \rangle_{model}\) is the expectation value over the model probability distribution \(p(\boldsymbol{x}| \boldsymbol{\theta})\).
13.2. Setting up for gradient descent calculations#
Using the previous relationship we can express the gradient of the cost function as
This expression shows that the gradient of the log-likelihood cost function is a difference of moments, with one calculated from the data and one calculated from the model. The data-dependent term is called the positive phase and the model-dependent term is called the negative phase of the gradient. We see now that minimizing the cost function results in lowering the energy of configurations \(\boldsymbol{x}\) near points in the training data and increasing the energy of configurations not observed in the training data. That means we increase the model’s probability of configurations similar to those in the training data.
The gradient of the cost function also demonstrates why gradients of unsupervised, generative models must be computed differently from for those of for example FNNs. While the data-dependent expectation value is easily calculated based on the samples \(\boldsymbol{x}_i\) in the training data, we must sample from the model in order to generate samples from which to caclulate the model-dependent term. We sample from the model by using MCMC-based methods. We can not sample from the model directly because the partition function \(Z\) is generally intractable.
As in supervised machine learning problems, the goal is also here to perform well on unseen data, that is to have good generalization from the training data. The distribution \(f(x)\) we approximate is not the true distribution we wish to estimate, it is limited to the training data. Hence, in unsupervised training as well it is important to prevent overfitting to the training data. Thus it is common to add regularizers to the cost function in the same manner as we discussed for say linear regression.
13.3. RBMs for the quantum many body problem#
The idea of applying RBMs to quantum many body problems was presented by G. Carleo and M. Troyer, working with ETH Zurich and Microsoft Research.
Some of their motivation included
The wave function \(\Psi\) is a monolithic mathematical quantity that contains all the information on a quantum state, be it a single particle or a complex molecule. In principle, an exponential amount of information is needed to fully encode a generic many-body quantum state.
There are still interesting open problems, including fundamental questions ranging from the dynamical properties of high-dimensional systems to the exact ground-state properties of strongly interacting fermions.
The difficulty lies in finding a general strategy to reduce the exponential complexity of the full many-body wave function down to its most essential features. That is
a. Dimensional reduction
b. Feature extraction
Among the most successful techniques to attack these challenges, artifical neural networks play a prominent role.
Want to understand whether an artifical neural network may adapt to describe a quantum system.
Carleo and Troyer applied the RBM to the quantum mechanical spin lattice systems of the Ising model and Heisenberg model, with encouraging results. Our goal is to test the method on systems of moving particles. For the spin lattice systems it was natural to use a binary-binary RBM, with the nodes taking values of 1 and -1. For moving particles, on the other hand, we want the visible nodes to be continuous, representing position coordinates. Thus, we start by choosing a Gaussian-binary RBM, where the visible nodes are continuous and hidden nodes take on values of 0 or 1. If eventually we would like the hidden nodes to be continuous as well the rectified linear units seem like the most relevant choice.
13.4. Representing the wave function#
The wavefunction should be a probability amplitude depending on \(\boldsymbol{x}\). The RBM model is given by the joint distribution of \(\boldsymbol{x}\) and \(\boldsymbol{h}\)
To find the marginal distribution of \(\boldsymbol{x}\) we set:
Now this is what we use to represent the wave function, calling it a neural-network quantum state (NQS)
13.5. Choose the cost function#
Now we don’t necessarily have training data (unless we generate it by using some other method). However, what we do have is the variational principle which allows us to obtain the ground state wave function by minimizing the expectation value of the energy of a trial wavefunction (corresponding to the untrained NQS). Similarly to the traditional variational Monte Carlo method then, it is the local energy we wish to minimize. The gradient to use for the stochastic gradient descent procedure is
where the local energy is given by
13.5.1. Mathematical details#
Because we are restricted to potential functions which are positive it is convenient to express them as exponentials, so that
where \(E(\boldsymbol{x}_C)\) is called an energy function, and the exponential representation is the Boltzmann distribution. The joint distribution is defined as the product of potentials.
The joint distribution of the random variables is then
3 9
< < < ! ! M A T H _ B L O C K
with the partition function
\(T\) is a physics-inspired parameter named temperature and will be assumed to be 1 unless otherwise stated. The energy function of the Boltzmann machine determines the interactions between the nodes and is defined
Here \(\alpha_i^k (x_i)\) and \(\beta_j^l (h_j)\) are one-dimensional transfer functions or mappings from the given input value to the desired feature value. They can be arbitrary functions of the input variables and are independent of the parameterization (parameters referring to weight and biases), meaning they are not affected by training of the model. The indices \(k\) and \(l\) indicate that there can be multiple transfer functions per variable. Furthermore, \(a_i^k\) and \(b_j^l\) are the visible and hidden bias. \(w_{ij}^{kl}\) are weights of the \textbf{inter-layer} connection terms which connect visible and hidden units. \( v_{im}^k\) and \(u_{jn}^l\) are weights of the \textbf{intra-layer} connection terms which connect the visible units to each other and the hidden units to each other, respectively.
We remove the intra-layer connections by setting \(v_{im}\) and \(u_{jn}\) to zero. The expression for the energy of the RBM is then
resulting in
Similarly
Using Bayes theorem
Similarly
The original RBM had binary visible and hidden nodes. They were showned to be universal approximators of discrete distributions. It was also shown that adding hidden units yields strictly improved modelling power. The common choice of binary values are 0 and 1. However, in some physics applications, -1 and 1 might be a more natural choice. We will here use 0 and 1.
with the partition function
13.5.2. Marginal Probability Density Functions#
In order to find the probability of any configuration of the visible units we derive the marginal probability density function.
A similar derivation yields the marginal probability of the hidden units
13.5.3. Conditional Probability Density Functions#
We derive the probability of the hidden units given the visible units using Bayes’ rule
From this we find the probability of a hidden unit being “on” or “off”:
and
Similarly we have that the conditional probability of the visible units given the hidden are
13.5.4. Gaussian-Binary Restricted Boltzmann Machines#
Inserting into the expression for \(E_{RBM}(\boldsymbol{x},\boldsymbol{h})\) in equation results in the energy
13.5.5. Joint Probability Density Function#
with the partition function given by
13.5.6. Marginal Probability Density Functions#
We proceed to find the marginal probability densitites of the Gaussian-binary RBM. We first marginalize over the binary hidden units to find \(p_{GB} (\boldsymbol{x})\)
We next marginalize over the visible units. This is the first time we marginalize over continuous values. We rewrite the exponential factor dependent on \(\boldsymbol{x}\) as a Gaussian function before we integrate in the last step.
13.5.7. Conditional Probability Density Functions#
We finish by deriving the conditional probabilities.
The conditional probability of a binary hidden unit \(h_j\) being on or off again takes the form of a sigmoid function
The conditional probability of the continuous \(\boldsymbol{x}\) now has another form, however.
The form of these conditional probabilities explains the name “Gaussian” and the form of the Gaussian-binary energy function. We see that the conditional probability of \(x_i\) given \(\boldsymbol{h}\) is a normal distribution with mean \(b_i + \boldsymbol{w}_{i\ast}^T \boldsymbol{h}\) and variance \(\sigma_i^2\).
13.6. Neural Quantum States#
The wavefunction should be a probability amplitude depending on \(\boldsymbol{x}\). The RBM model is given by the joint distribution of \(\boldsymbol{x}\) and \(\boldsymbol{h}\)
To find the marginal distribution of \(\boldsymbol{x}\) we set:
Now this is what we use to represent the wave function, calling it a neural-network quantum state (NQS)
The above wavefunction is the most general one because it allows for complex valued wavefunctions. However it fundamentally changes the probabilistic foundation of the RBM, because what is usually a probability in the RBM framework is now a an amplitude. This means that a lot of the theoretical framework usually used to interpret the model, i.e. graphical models, conditional probabilities, and Markov random fields, breaks down. If we assume the wavefunction to be postive definite, however, we can use the RBM to represent the squared wavefunction, and thereby a probability. This also makes it possible to sample from the model using Gibbs sampling, because we can obtain the conditional probabilities.
13.6.1. Cost function#
This is where we deviate from what is common in machine learning. Rather than defining a cost function based on some dataset, our cost function is the energy of the quantum mechanical system. From the variational principle we know that minizing this energy should lead to the ground state wavefunction. As stated previously the local energy is given by
and the gradient is
where \(\alpha_i = a_1,...,a_M,b_1,...,b_N,w_{11},...,w_{MN}\).
We use that \(\frac{1}{\Psi}\frac{\partial \Psi}{\partial \alpha_i} = \frac{\partial \ln{\Psi}}{\partial \alpha_i}\), and find
This gives
If \(\Psi = \sqrt{F_{rbm}}\) we have
which results in
Let us assume again that our Hamiltonian is
where the first summation term represents the standard harmonic oscillator part and the latter the repulsive interaction between two electrons. Natural units (\(\hbar=c=e=m_e=1\)) are used, and \(P\) is the number of particles. This gives us the following expression for the local energy (\(D\) being the number of dimensions)
Letting each visible node in the Boltzmann machine represent one coordinate of one particle, we obtain
where we have that
We now have all the expressions neeeded to calculate the gradient of the expected local energy with respect to the RBM parameters \(\frac{\partial \langle E_L \rangle}{\partial \alpha_i}\).
If we use \(\Psi = \sqrt{F_{rbm}}\) we obtain
The difference between this equation and the previous one is that we multiply by a factor \(1/2\).
13.7. Python version for the two non-interacting particles#
%matplotlib inline
# 2-electron VMC code for 2dim quantum dot with importance sampling
# Using gaussian rng for new positions and Metropolis- Hastings
# Added restricted boltzmann machine method for dealing with the wavefunction
# RBM code based heavily off of:
# https://github.com/CompPhysics/ComputationalPhysics2/tree/gh-pages/doc/Programs/BoltzmannMachines/MLcpp/src/CppCode/ob
from math import exp, sqrt
from random import random, seed, normalvariate
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import sys
# Trial wave function for the 2-electron quantum dot in two dims
def WaveFunction(r,a,b,w):
sigma=1.0
sig2 = sigma**2
Psi1 = 0.0
Psi2 = 1.0
Q = Qfac(r,b,w)
for iq in range(NumberParticles):
for ix in range(Dimension):
Psi1 += (r[iq,ix]-a[iq,ix])**2
for ih in range(NumberHidden):
Psi2 *= (1.0 + np.exp(Q[ih]))
Psi1 = np.exp(-Psi1/(2*sig2))
return Psi1*Psi2
# Local energy for the 2-electron quantum dot in two dims, using analytical local energy
def LocalEnergy(r,a,b,w):
sigma=1.0
sig2 = sigma**2
locenergy = 0.0
Q = Qfac(r,b,w)
for iq in range(NumberParticles):
for ix in range(Dimension):
sum1 = 0.0
sum2 = 0.0
for ih in range(NumberHidden):
sum1 += w[iq,ix,ih]/(1+np.exp(-Q[ih]))
sum2 += w[iq,ix,ih]**2 * np.exp(Q[ih]) / (1.0 + np.exp(Q[ih]))**2
dlnpsi1 = -(r[iq,ix] - a[iq,ix]) /sig2 + sum1/sig2
dlnpsi2 = -1/sig2 + sum2/sig2**2
locenergy += 0.5*(-dlnpsi1*dlnpsi1 - dlnpsi2 + r[iq,ix]**2)
if(interaction==True):
for iq1 in range(NumberParticles):
for iq2 in range(iq1):
distance = 0.0
for ix in range(Dimension):
distance += (r[iq1,ix] - r[iq2,ix])**2
locenergy += 1/sqrt(distance)
return locenergy
# Derivate of wave function ansatz as function of variational parameters
def DerivativeWFansatz(r,a,b,w):
sigma=1.0
sig2 = sigma**2
Q = Qfac(r,b,w)
WfDer = np.empty((3,),dtype=object)
WfDer = [np.copy(a),np.copy(b),np.copy(w)]
WfDer[0] = (r-a)/sig2
WfDer[1] = 1 / (1 + np.exp(-Q))
for ih in range(NumberHidden):
WfDer[2][:,:,ih] = w[:,:,ih] / (sig2*(1+np.exp(-Q[ih])))
return WfDer
# Setting up the quantum force for the two-electron quantum dot, recall that it is a vector
def QuantumForce(r,a,b,w):
sigma=1.0
sig2 = sigma**2
qforce = np.zeros((NumberParticles,Dimension), np.double)
sum1 = np.zeros((NumberParticles,Dimension), np.double)
Q = Qfac(r,b,w)
for ih in range(NumberHidden):
sum1 += w[:,:,ih]/(1+np.exp(-Q[ih]))
qforce = 2*(-(r-a)/sig2 + sum1/sig2)
return qforce
def Qfac(r,b,w):
Q = np.zeros((NumberHidden), np.double)
temp = np.zeros((NumberHidden), np.double)
for ih in range(NumberHidden):
temp[ih] = (r*w[:,:,ih]).sum()
Q = b + temp
return Q
# Computing the derivative of the energy and the energy
def EnergyMinimization(a,b,w):
NumberMCcycles= 10000
# Parameters in the Fokker-Planck simulation of the quantum force
D = 0.5
TimeStep = 0.05
# positions
PositionOld = np.zeros((NumberParticles,Dimension), np.double)
PositionNew = np.zeros((NumberParticles,Dimension), np.double)
# Quantum force
QuantumForceOld = np.zeros((NumberParticles,Dimension), np.double)
QuantumForceNew = np.zeros((NumberParticles,Dimension), np.double)
# seed for rng generator
seed()
energy = 0.0
DeltaE = 0.0
EnergyDer = np.empty((3,),dtype=object)
DeltaPsi = np.empty((3,),dtype=object)
DerivativePsiE = np.empty((3,),dtype=object)
EnergyDer = [np.copy(a),np.copy(b),np.copy(w)]
DeltaPsi = [np.copy(a),np.copy(b),np.copy(w)]
DerivativePsiE = [np.copy(a),np.copy(b),np.copy(w)]
for i in range(3): EnergyDer[i].fill(0.0)
for i in range(3): DeltaPsi[i].fill(0.0)
for i in range(3): DerivativePsiE[i].fill(0.0)
#Initial position
for i in range(NumberParticles):
for j in range(Dimension):
PositionOld[i,j] = normalvariate(0.0,1.0)*sqrt(TimeStep)
wfold = WaveFunction(PositionOld,a,b,w)
QuantumForceOld = QuantumForce(PositionOld,a,b,w)
#Loop over MC MCcycles
for MCcycle in range(NumberMCcycles):
#Trial position moving one particle at the time
for i in range(NumberParticles):
for j in range(Dimension):
PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
QuantumForceOld[i,j]*TimeStep*D
wfnew = WaveFunction(PositionNew,a,b,w)
QuantumForceNew = QuantumForce(PositionNew,a,b,w)
GreensFunction = 0.0
for j in range(Dimension):
GreensFunction += 0.5*(QuantumForceOld[i,j]+QuantumForceNew[i,j])*\
(D*TimeStep*0.5*(QuantumForceOld[i,j]-QuantumForceNew[i,j])-\
PositionNew[i,j]+PositionOld[i,j])
GreensFunction = exp(GreensFunction)
ProbabilityRatio = GreensFunction*wfnew**2/wfold**2
#Metropolis-Hastings test to see whether we accept the move
if random() <= ProbabilityRatio:
for j in range(Dimension):
PositionOld[i,j] = PositionNew[i,j]
QuantumForceOld[i,j] = QuantumForceNew[i,j]
wfold = wfnew
#print("wf new: ", wfnew)
#print("force on 1 new:", QuantumForceNew[0,:])
#print("pos of 1 new: ", PositionNew[0,:])
#print("force on 2 new:", QuantumForceNew[1,:])
#print("pos of 2 new: ", PositionNew[1,:])
DeltaE = LocalEnergy(PositionOld,a,b,w)
DerPsi = DerivativeWFansatz(PositionOld,a,b,w)
DeltaPsi[0] += DerPsi[0]
DeltaPsi[1] += DerPsi[1]
DeltaPsi[2] += DerPsi[2]
energy += DeltaE
DerivativePsiE[0] += DerPsi[0]*DeltaE
DerivativePsiE[1] += DerPsi[1]*DeltaE
DerivativePsiE[2] += DerPsi[2]*DeltaE
# We calculate mean values
energy /= NumberMCcycles
DerivativePsiE[0] /= NumberMCcycles
DerivativePsiE[1] /= NumberMCcycles
DerivativePsiE[2] /= NumberMCcycles
DeltaPsi[0] /= NumberMCcycles
DeltaPsi[1] /= NumberMCcycles
DeltaPsi[2] /= NumberMCcycles
EnergyDer[0] = 2*(DerivativePsiE[0]-DeltaPsi[0]*energy)
EnergyDer[1] = 2*(DerivativePsiE[1]-DeltaPsi[1]*energy)
EnergyDer[2] = 2*(DerivativePsiE[2]-DeltaPsi[2]*energy)
return energy, EnergyDer
#Here starts the main program with variable declarations
NumberParticles = 2
Dimension = 2
NumberHidden = 2
interaction=True
# guess for parameters
a=np.random.normal(loc=0.0, scale=0.001, size=(NumberParticles,Dimension))
b=np.random.normal(loc=0.0, scale=0.001, size=(NumberHidden))
w=np.random.normal(loc=0.0, scale=0.001, size=(NumberParticles,Dimension,NumberHidden))
# Set up iteration using stochastic gradient method
Energy = 0
EDerivative = np.empty((3,),dtype=object)
EDerivative = [np.copy(a),np.copy(b),np.copy(w)]
# Learning rate eta, max iterations, need to change to adaptive learning rate
eta = 0.001
MaxIterations = 50
iter = 0
np.seterr(invalid='raise')
Energies = np.zeros(MaxIterations)
EnergyDerivatives1 = np.zeros(MaxIterations)
EnergyDerivatives2 = np.zeros(MaxIterations)
while iter < MaxIterations:
Energy, EDerivative = EnergyMinimization(a,b,w)
agradient = EDerivative[0]
bgradient = EDerivative[1]
wgradient = EDerivative[2]
a -= eta*agradient
b -= eta*bgradient
w -= eta*wgradient
Energies[iter] = Energy
print("Energy:",Energy)
#EnergyDerivatives1[iter] = EDerivative[0]
#EnergyDerivatives2[iter] = EDerivative[1]
#EnergyDerivatives3[iter] = EDerivative[2]
iter += 1
#nice printout with Pandas
import pandas as pd
from pandas import DataFrame
pd.set_option('max_columns', 6)
data ={'Energy':Energies}#,'A Derivative':EnergyDerivatives1,'B Derivative':EnergyDerivatives2,'Weights Derivative':EnergyDerivatives3}
frame = pd.DataFrame(data)
print(frame)
Energy: 3.2684046836825242
Energy: 3.290566840394456
Energy: 3.248670631398425
Energy: 3.196461454216011
Energy: 3.2144867257969905
Energy: 3.219716062647666
Energy: 3.220800801551698
Energy: 3.274908669413067
Energy: 3.258790506080658
Energy: 3.1961692523819534
Energy: 3.286682467638795
Energy: 3.2571553776923565
Energy: 3.2548093205611743
Energy: 3.2560001325210695
Energy: 3.2415868055768087
Energy: 3.264652037548338
Energy: 3.2763272475308987
Energy: 3.2567484982406483
Energy: 3.307708960581488
Energy: 3.2951147311898676
Energy: 3.2686010237525798
Energy: 3.235155211309001
Energy: 3.2142196314696623
Energy: 3.260299595919089
Energy: 3.360050958708752
Energy: 3.2521497865034728
Energy: 3.2656755355394473
Energy: 3.2706811269166742
Energy: 3.211389934537995
Energy: 3.1746774580237913
Energy: 3.2169641741874773
Energy: 3.2779574007924777
Energy: 3.215204292549719
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
Cell In[1], line 235
232 EnergyDerivatives2 = np.zeros(MaxIterations)
234 while iter < MaxIterations:
--> 235 Energy, EDerivative = EnergyMinimization(a,b,w)
236 agradient = EDerivative[0]
237 bgradient = EDerivative[1]
Cell In[1], line 161, in EnergyMinimization(a, b, w)
158 for j in range(Dimension):
159 PositionNew[i,j] = PositionOld[i,j]+normalvariate(0.0,1.0)*sqrt(TimeStep)+\
160 QuantumForceOld[i,j]*TimeStep*D
--> 161 wfnew = WaveFunction(PositionNew,a,b,w)
162 QuantumForceNew = QuantumForce(PositionNew,a,b,w)
164 GreensFunction = 0.0
Cell In[1], line 25, in WaveFunction(r, a, b, w)
23 Psi1 = 0.0
24 Psi2 = 1.0
---> 25 Q = Qfac(r,b,w)
27 for iq in range(NumberParticles):
28 for ix in range(Dimension):
Cell In[1], line 111, in Qfac(r, b, w)
108 temp = np.zeros((NumberHidden), np.double)
110 for ih in range(NumberHidden):
--> 111 temp[ih] = (r*w[:,:,ih]).sum()
113 Q = b + temp
115 return Q
KeyboardInterrupt: