Physics Informed Deep Learning (Part I): Data-driven Solutions of Nonlinear Partial Differential Equations

Maziar Raissi¹, Paris Perdikaris², and George Em Karniadakis¹

¹ Division of Applied Mathematics, Brown University, Providence, RI, 02912, USA
² Department of Mechanical Engineering and Applied Mechanics, University of Pennsylvania, Philadelphia, PA, 19104, USA
Abstract

We introduce physics informed neural networks: neural networks that are trained to solve supervised learning tasks while respecting any given law of physics described by general nonlinear partial differential equations. In this two part treatise, we present our developments in the context of solving two main classes of problems: data-driven solution and data-driven discovery of partial differential equations. Depending on the nature and arrangement of the available data, we devise two distinct classes of algorithms, namely continuous time and discrete time models. The resulting neural networks form a new class of data-efficient universal function approximators that naturally encode any underlying physical laws as prior information. In this first part, we demonstrate how these networks can be used to infer solutions to partial differential equations, and obtain physics-informed surrogate models that are fully differentiable with respect to all input coordinates and free parameters.

Keywords: Data-driven scientific computing, Machine learning, Predictive modeling, Runge-Kutta methods, Nonlinear dynamics
1. Introduction
With the explosive growth of available data and computing resources, recent advances in machine learning and data analytics have yielded transformative results across diverse scientific disciplines, including image recognition [1], natural language processing [2], cognitive science [3], and genomics [4].
However, more often than not, in the course of analyzing complex physical, biological or engineering systems, the cost of data acquisition is prohibitive, and we are inevitably faced with the challenge of drawing conclusions and making decisions under partial information. In this small data regime, the vast majority of state-of-the-art machine learning techniques (e.g., deep/convolutional/recurrent neural networks) lack robustness and fail to provide any guarantees of convergence.
At first sight, the task of training a deep learning algorithm to accurately identify a nonlinear map from a few, potentially very high-dimensional, input and output data pairs seems at best naive. Coming to our rescue, for many cases pertaining to the modeling of physical and biological systems, there exists a vast amount of prior knowledge that is currently not being utilized in modern machine learning practice. Be it principled physical laws that govern the time-dependent dynamics of a system, or some empirically validated rules or other domain expertise, this prior information can act as a regularization agent that constrains the space of admissible solutions to a manageable size (e.g., in incompressible fluid dynamics problems, by discarding any non-realistic flow solutions that violate the conservation of mass principle). In return, encoding such structured information into a learning algorithm results in amplifying the information content of the data that the algorithm sees, enabling it to quickly steer itself towards the right solution and generalize well even when only a few training examples are available.
The first glimpses of promise for exploiting structured prior information to construct data-efficient and physics-informed learning machines have already been showcased in the recent studies of [5, 6, 7]. There, the authors employed Gaussian process regression [8] to devise functional representations that are tailored to a given linear operator, and were able to accurately infer solutions and provide uncertainty estimates for several prototype problems in mathematical physics. Extensions to nonlinear problems were proposed in subsequent studies by Raissi et al. [9, 10] in the context of both inference and systems identification. Despite the flexibility and mathematical elegance of Gaussian processes in encoding prior information, the treatment of nonlinear problems introduces two important limitations. First, in [9, 10] the authors had to locally linearize any nonlinear terms in time, thus limiting the applicability of the proposed methods to discrete-time domains and compromising the accuracy of their predictions in strongly nonlinear regimes.
Secondly, the Bayesian nature of Gaussian process regression requires certain prior assumptions that may limit the representation capacity of the model and give rise to robustness/brittleness issues, especially for nonlinear problems [11].
1.1. Problem setup and summary of contributions
In this work we take a different approach by employing deep neural networks and leveraging their well known capability as universal function approximators [12]. In this setting, we can directly tackle nonlinear problems without the need for committing to any prior assumptions, linearization, or local time-stepping. We exploit recent developments in automatic differentiation [13], one of the most useful but perhaps underused techniques in scientific computing, to differentiate neural networks with respect to their input coordinates and model parameters and thus obtain physics informed neural networks. Such neural networks are constrained to respect any symmetry, invariance, or conservation principles originating from the physical laws that govern the observed data, as modeled by general time-dependent and nonlinear partial differential equations. This simple yet powerful construction allows us to tackle a wide range of problems in computational science and introduces a potentially disruptive technology leading to the development of new data-efficient and physics-informed learning machines, new classes of numerical solvers for partial differential equations, as well as new data-driven approaches for model inversion and systems identification.
The general aim of this work is to set the foundations for a new paradigm in modeling and computation that enriches deep learning with the long-standing developments in mathematical physics. These developments are presented in the context of two main problem classes: data-driven solution and data-driven discovery of partial differential equations. To this end, let us consider parametrized and nonlinear partial differential equations of the general form

$$ u_t + \mathcal{N}[u; \lambda] = 0, $$

where $u(t, x)$ denotes the latent (hidden) solution and $\mathcal{N}[\cdot; \lambda]$ is a nonlinear operator parametrized by $\lambda$. This setup encapsulates a wide range of problems in mathematical physics including conservation laws, diffusion processes, advection-diffusion-reaction systems, and kinetic equations. As a motivating example, the one dimensional Burgers' equation [14] corresponds to the case where $\mathcal{N}[u; \lambda] = \lambda_1 u u_x - \lambda_2 u_{xx}$ and $\lambda = (\lambda_1, \lambda_2)$. Here, the subscripts denote partial differentiation in either time or space. Given noisy measurements of the system, we are interested in the solution of two distinct problems. The first problem is that of predictive inference, filtering and smoothing, or data-driven solutions of partial differential equations [9, 5] which states: given fixed model parameters $\lambda$, what can be said about the unknown hidden state $u(t, x)$ of the system? The second problem is that of learning, system identification, or data-driven discovery of partial differential equations [10, 6, 15] stating: what are the parameters $\lambda$ that best describe the observed data?
In this first part of our two-part treatise, we focus on computing data-driven solutions to partial differential equations of the general form

$$ u_t + \mathcal{N}[u] = 0, \quad x \in \Omega, \quad t \in [0, T], \qquad (1) $$

where $u(t, x)$ denotes the latent (hidden) solution, $\mathcal{N}[\cdot]$ is a nonlinear differential operator, and $\Omega$ is a subset of $\mathbb{R}^D$. In what follows, we put forth two distinct classes of algorithms, namely continuous and discrete time models, and highlight their properties and performance through the lens of different benchmark problems. All code and data-sets accompanying this manuscript are available at https://github.com/maziarraissi/PINNs.
2. Continuous Time Models
We define $f(t, x)$ to be given by the left-hand-side of equation (1); i.e.,

$$ f := u_t + \mathcal{N}[u], \qquad (2) $$

and proceed by approximating $u(t, x)$ by a deep neural network. This assumption along with equation (2) result in a physics informed neural network $f(t, x)$. This network can be derived by applying the chain rule for differentiating compositions of functions using automatic differentiation [13].
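As a small aside (an illustrative sketch, not part of the paper), the snippet below shows what differentiating with respect to input coordinates looks like with TensorFlow's reverse-mode automatic differentiation, using the 1.x-style graph API that was current when this work appeared (available as `tf.compat.v1` in TensorFlow 2.x). Here `tf.sin(t * x)` simply stands in for a network output $u(t, x)$.

```python
# Illustrative sketch: automatic differentiation with respect to input
# coordinates. tf.sin(t * x) stands in for a network output u(t, x).
# Assumes the TensorFlow 1.x graph API (tf.compat.v1 in TF 2.x).
import numpy as np
import tensorflow as tf

t = tf.placeholder(tf.float32, shape=[None, 1])
x = tf.placeholder(tf.float32, shape=[None, 1])

u = tf.sin(t * x)
u_t = tf.gradients(u, t)[0]     # should equal x * cos(t * x)
u_x = tf.gradients(u, x)[0]     # should equal t * cos(t * x)
u_xx = tf.gradients(u_x, x)[0]  # should equal -t**2 * sin(t * x)

with tf.Session() as sess:
    tv = np.random.rand(4, 1).astype(np.float32)
    xv = np.random.rand(4, 1).astype(np.float32)
    print(sess.run(u_x, feed_dict={t: tv, x: xv}))  # matches tv * np.cos(tv * xv)
```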
2.1. Example (Burgers’ Equation)
As an example, let us consider the Burgers' equation. This equation arises in various areas of applied mathematics, including fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow [14]. It is a fundamental partial differential equation and can be derived from the Navier-Stokes equations for the velocity field by dropping the pressure gradient term. For small values of the viscosity parameter, Burgers' equation can lead to shock formation that is notoriously hard to resolve by classical numerical methods. In one space dimension, the Burgers' equation along with Dirichlet boundary conditions reads as

$$ u_t + u u_x - (0.01/\pi) u_{xx} = 0, \quad x \in [-1, 1], \quad t \in [0, 1], \qquad (3) $$
$$ u(0, x) = -\sin(\pi x), $$
$$ u(t, -1) = u(t, 1) = 0. $$
Let us define $f(t, x)$ to be given by

$$ f := u_t + u u_x - (0.01/\pi) u_{xx}, $$

and proceed by approximating $u(t, x)$ by a deep neural network. To highlight the simplicity in implementing this idea we have included a Python code snippet using TensorFlow [16], currently one of the most popular and well documented open source libraries for machine learning computations. To this end, $u(t, x)$ can be simply defined as
def u(t, x):
    # neural_net, weights and biases are assumed to be defined elsewhere;
    # see the sketch of one possible implementation below.
    u = neural_net(tf.concat([t, x], 1), weights, biases)
    return u
Correspondingly, the physics informed neural network f(t, x) takes the form

def f(t, x):
    u_ = u(t, x)                  # renamed locally to avoid shadowing the function u above
    u_t = tf.gradients(u_, t)[0]
    u_x = tf.gradients(u_, x)[0]
    u_xx = tf.gradients(u_x, x)[0]
    # np.pi is used since TensorFlow has no built-in pi constant (assumes `import numpy as np`)
    f = u_t + u_*u_x - (0.01/np.pi)*u_xx
    return f
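The two snippets above assume a helper `neural_net(X, weights, biases)` together with pre-initialized `weights` and `biases`, none of which are listed in the paper. A plausible sketch, consistent with the tanh architectures described below, might look as follows (the helper names and the initialization scheme are assumptions, not the authors' code):

```python
# Hypothetical implementation of the `neural_net` helper assumed above:
# a plain feed-forward network with tanh activations on all hidden layers.
import tensorflow as tf

def initialize_weights(layers):
    # Xavier-style initialization for a list of layer widths, e.g. [2, 20, ..., 1].
    weights, biases = [], []
    for l in range(len(layers) - 1):
        std = (2.0 / (layers[l] + layers[l + 1])) ** 0.5
        weights.append(tf.Variable(std * tf.random_normal([layers[l], layers[l + 1]])))
        biases.append(tf.Variable(tf.zeros([1, layers[l + 1]])))
    return weights, biases

def neural_net(X, weights, biases):
    H = X
    for W, b in zip(weights[:-1], biases[:-1]):
        H = tf.tanh(tf.matmul(H, W) + b)              # hidden layers
    return tf.matmul(H, weights[-1]) + biases[-1]     # linear output layer

# Example: 8 hidden layers of 20 neurons, which is consistent with the
# 3021 trainable parameters quoted for the Burgers' benchmark below.
weights, biases = initialize_weights([2] + 8 * [20] + [1])
```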
The shared parameters between the neural networks $u(t, x)$ and $f(t, x)$ can be learned by minimizing the mean squared error loss

$$ MSE = MSE_u + MSE_f, \qquad (4) $$

where

$$ MSE_u = \frac{1}{N_u} \sum_{i=1}^{N_u} \left| u(t_u^i, x_u^i) - u^i \right|^2, $$

and

$$ MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} \left| f(t_f^i, x_f^i) \right|^2. $$

Here, $\{t_u^i, x_u^i, u^i\}_{i=1}^{N_u}$ denote the initial and boundary training data on $u(t, x)$ and $\{t_f^i, x_f^i\}_{i=1}^{N_f}$ specify the collocation points for $f(t, x)$. The loss $MSE_u$ corresponds to the initial and boundary data while $MSE_f$ enforces the structure imposed by equation (3) at a finite set of collocation points.
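As an illustrative sketch (placeholder names such as `t_u`, `x_u`, `u_data` are ours, not the paper's), the loss of equation (4) can be assembled directly from the `u` and `f` graphs defined above:

```python
# Sketch: assembling the loss of equation (4) from the u(t, x) and f(t, x)
# graphs defined above. Placeholder names are illustrative.
import tensorflow as tf

t_u = tf.placeholder(tf.float32, shape=[None, 1])     # initial/boundary times
x_u = tf.placeholder(tf.float32, shape=[None, 1])     # initial/boundary locations
u_data = tf.placeholder(tf.float32, shape=[None, 1])  # measured u at those points
t_f = tf.placeholder(tf.float32, shape=[None, 1])     # collocation times
x_f = tf.placeholder(tf.float32, shape=[None, 1])     # collocation locations

mse_u = tf.reduce_mean(tf.square(u(t_u, x_u) - u_data))  # data mismatch, MSE_u
mse_f = tf.reduce_mean(tf.square(f(t_f, x_f)))           # PDE residual, MSE_f
loss = mse_u + mse_f                                     # equation (4)
```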
In all benchmarks considered in this work, the total number of training data $N_u$ is relatively small (a few hundred up to a few thousand points), and we chose to optimize all loss functions using L-BFGS, a quasi-Newton, full-batch gradient-based optimization algorithm [17]. For larger data-sets a more computationally efficient mini-batch setting can be readily employed using stochastic gradient descent and its modern variants [18, 19]. Despite the fact that there is no theoretical guarantee that this procedure converges to a global minimum, our empirical evidence indicates that, if the given partial differential equation is well-posed and its solution is unique, our method is capable of achieving good prediction accuracy given a sufficiently expressive neural network architecture and a sufficient number of collocation points $N_f$. This general observation deeply relates to the resulting optimization landscape induced by the mean squared error loss of equation (4), and defines an open question for research that is in sync with recent theoretical developments in deep learning [20, 21]. Here, we will test the robustness of the proposed methodology using a series of systematic sensitivity studies that accompany the numerical results presented in the following.
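For reference, the full-batch L-BFGS call pattern can be reproduced with `scipy.optimize.minimize`; the sketch below uses a stand-in least-squares objective, with the understanding that in a physics informed neural network the objective and gradient would instead come from the loss (4) and its derivatives with respect to a flattened parameter vector. This is an illustration of the optimizer choice, not the authors' implementation.

```python
# Sketch of the full-batch L-BFGS optimization pattern on a stand-in objective.
# In a PINN, loss_and_grads would evaluate MSE_u + MSE_f and its gradient with
# respect to a flat vector of network parameters.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)

def loss_and_grads(theta):
    r = A @ theta - b              # stand-in residual playing the role of the PINN loss
    return 0.5 * r @ r, A.T @ r    # objective value and gradient

result = minimize(loss_and_grads, np.zeros(20), jac=True, method='L-BFGS-B',
                  options={'maxiter': 50000, 'ftol': 1e-12})
print(result.fun, result.nit)
```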
Figure 1 summarizes our results for the data-driven solution of the Burgers' equation. Specifically, given a set of $N_u = 100$ randomly distributed initial and boundary data, we learn the latent solution $u(t, x)$ by training all 3021 parameters of a 9-layer deep neural network using the mean squared error loss of (4). Each hidden layer contained 20 neurons and a hyperbolic tangent activation function. In general, the neural network should be given sufficient approximation capacity in order to accommodate the anticipated complexity of $u(t, x)$. However, in this example, our choice aims to highlight the robustness of the proposed method with respect to the well known issue of over-fitting. Specifically, the term $MSE_f$ in equation (4) acts as a regularization mechanism that penalizes solutions that do not satisfy equation (3). Therefore, a key property of physics informed neural networks is that they can be effectively trained using small data sets; a setting often encountered in the study of physical systems for which the cost of data acquisition may be prohibitive.

The top panel of Figure 1 shows the predicted spatio-temporal solution $u(t, x)$, along with the locations of the initial and boundary training data. We must underline that, unlike any classical numerical method for solving partial differential equations, this prediction is obtained without any sort of discretization of the spatio-temporal domain. The exact solution for this problem is analytically available [14], and the resulting prediction error is measured at $6.7 \times 10^{-4}$ in the relative $L^2$-norm. Note that this error is about two orders of magnitude lower than the one reported in our previous work on data-driven solution of partial differential equations using Gaussian processes [9]. A more detailed assessment of the predicted solution is presented in the bottom panel of Figure 1. In particular, we present a comparison between the exact and the predicted solutions at different time instants $t = 0.25, 0.50, 0.75$. Using only a handful of initial and boundary data, the physics informed neural network can accurately capture the intricate nonlinear behavior of the Burgers' equation that leads to the development of a sharp internal layer around $t = 0.4$. The latter is notoriously hard to accurately resolve with classical numerical methods and requires a laborious spatio-temporal discretization of equation (3).
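For clarity, the relative $L^2$ error quoted here and throughout is assumed to be the standard ratio $\|u_{pred} - u_{exact}\|_2 / \|u_{exact}\|_2$ evaluated on a dense set of test points:

```python
# Relative L2 error as assumed here: ||u_pred - u_exact||_2 / ||u_exact||_2,
# evaluated over a dense set of test points (arrays flattened).
import numpy as np

def relative_l2_error(u_pred, u_exact):
    u_pred, u_exact = np.ravel(u_pred), np.ravel(u_exact)
    return np.linalg.norm(u_pred - u_exact) / np.linalg.norm(u_exact)
```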
Figure 1: Burgers' equation: Top: Predicted solution $u(t, x)$ along with the initial and boundary training data. In addition we are using 10,000 collocation points generated using a Latin Hypercube Sampling strategy. Bottom: Comparison of the predicted and exact solutions corresponding to the three temporal snapshots depicted by the white vertical lines in the top panel. The relative $L^2$ error for this case is $6.7 \times 10^{-4}$. Model training took approximately 60 seconds on a single NVIDIA Titan X GPU card.

To further analyze the performance of our method, we have performed the following systematic studies to quantify its predictive accuracy for different number of training and collocation points, as well as for different neural network architectures. In Table 1 we report the resulting relative $L^2$ error for different number of initial and boundary training data $N_u$ and different number of collocation points $N_f$, while keeping the 9-layer network architecture fixed. The general trend shows increased prediction accuracy as the total number of training data $N_u$ is increased, given a sufficient number of collocation points $N_f$. This observation highlights a key strength of physics informed neural networks: by encoding the structure of the underlying physical law through the collocation points $N_f$, one can obtain a more accurate and data-efficient learning algorithm.¹

¹ Note that the case $N_f = 0$ corresponds to a standard neural network model, i.e., a neural network that does not take into account the underlying governing equation.

Finally, Table 2 shows the resulting relative $L^2$ error for different number of hidden layers, and different number of neurons per layer, while the total number of training and collocation points is kept fixed to $N_u = 100$ and $N_f = 10{,}000$, respectively. As expected, we observe that as the number of layers and neurons is increased (hence the capacity of the neural network to approximate more complex functions), the predictive accuracy is increased.

    N_u \ N_f    2000      4000      6000      7000      8000      10000
    20           2.9e-01   4.4e-01   8.9e-01   1.2e+00   9.9e-02   4.2e-02
    40           6.5e-02   1.1e-02   5.0e-01   9.6e-03   4.6e-01   7.5e-02
    60           3.6e-01   1.2e-02   1.7e-01   5.9e-03   1.9e-03   8.2e-03
    80           5.5e-03   1.0e-03   3.2e-03   7.8e-03   4.9e-02   4.5e-03
    100          6.6e-02   2.7e-01   7.2e-03   6.8e-04   2.2e-03   6.7e-04
    200          1.5e-01   2.3e-03   8.2e-04   8.9e-04   6.1e-04   4.9e-04

Table 1: Burgers' equation: Relative $L^2$ error between the predicted and the exact solution $u(t, x)$ for different number of initial and boundary training data $N_u$, and different number of collocation points $N_f$. Here, the network architecture is fixed to 9 layers with 20 neurons per hidden layer.

    Layers \ Neurons    10        20        40
    2                   7.4e-02   5.3e-02   1.0e-01
    4                   3.0e-03   9.4e-04   6.4e-04
    6                   9.6e-03   1.3e-03   6.1e-04
    8                   2.5e-03   9.6e-04   5.6e-04

Table 2: Burgers' equation: Relative $L^2$ error between the predicted and the exact solution $u(t, x)$ for different number of hidden layers and different number of neurons per layer. Here, the total number of training and collocation points is fixed to $N_u = 100$ and $N_f = 10{,}000$, respectively.
2.2. Example (Schrödinger Equation)

This example aims to highlight the ability of our method to handle periodic boundary conditions, complex-valued solutions, as well as different types of nonlinearities in the governing partial differential equations. The one-dimensional nonlinear Schrödinger equation is a classical field equation that is used to study quantum mechanical systems, including nonlinear wave propagation in optical fibers and/or waveguides, Bose-Einstein condensates, and plasma waves. In optics, the nonlinear term arises from the intensity dependent index of refraction of a given material. Similarly, the nonlinear term for Bose-Einstein condensates is a result of the mean-field interactions of an interacting, N-body system. The nonlinear Schrödinger equation along with periodic boundary conditions is given by

$$ i h_t + 0.5 h_{xx} + |h|^2 h = 0, \quad x \in [-5, 5], \quad t \in [0, \pi/2], \qquad (5) $$
$$ h(0, x) = 2\,\mathrm{sech}(x), $$
$$ h(t, -5) = h(t, 5), $$
$$ h_x(t, -5) = h_x(t, 5), $$

where $h(t, x)$ is the complex-valued solution. Let us define $f(t, x)$ to be given by

$$ f := i h_t + 0.5 h_{xx} + |h|^2 h, $$
and proceed by placing a complex-valued neural network prior on $h(t, x)$. In fact, if $u$ denotes the real part of $h$ and $v$ the imaginary part, we are placing a multi-output neural network prior on $h(t, x) = [u(t, x), v(t, x)]$. This will result in the complex-valued (multi-output) physics informed neural network $f(t, x)$. The shared parameters of the neural networks $h(t, x)$ and $f(t, x)$ can be learned by minimizing the mean squared error loss

$$ MSE = MSE_0 + MSE_b + MSE_f, \qquad (6) $$

where

$$ MSE_0 = \frac{1}{N_0} \sum_{i=1}^{N_0} \left| h(0, x_0^i) - h_0^i \right|^2, $$

$$ MSE_b = \frac{1}{N_b} \sum_{i=1}^{N_b} \left( \left| h^i(t_b^i, -5) - h^i(t_b^i, 5) \right|^2 + \left| h_x^i(t_b^i, -5) - h_x^i(t_b^i, 5) \right|^2 \right), $$

and

$$ MSE_f = \frac{1}{N_f} \sum_{i=1}^{N_f} \left| f(t_f^i, x_f^i) \right|^2. $$

Here, $\{x_0^i, h_0^i\}_{i=1}^{N_0}$ denotes the initial data, $\{t_b^i\}_{i=1}^{N_b}$ corresponds to the collocation points on the boundary, and $\{t_f^i, x_f^i\}_{i=1}^{N_f}$ represents the collocation points on $f(t, x)$. Consequently, $MSE_0$ corresponds to the loss on the initial data, $MSE_b$ enforces the periodic boundary conditions, and $MSE_f$ penalizes the Schrödinger equation not being satisfied on the collocation points.
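Since the network outputs are the real quantities $u$ and $v$, the complex residual $f = i h_t + 0.5 h_{xx} + |h|^2 h$ is conveniently split into its real and imaginary parts. The sketch below is our illustration, not the authors' code; it assumes a hypothetical two-output network `h_net`, e.g. the `neural_net` helper sketched earlier with an output width of 2.

```python
# Sketch (not the authors' code): real and imaginary parts of the residual
# f = i*h_t + 0.5*h_xx + |h|^2 * h, with h = u + i*v given by a hypothetical
# two-output network h_net (e.g. neural_net with layer widths [2, ..., 2]).
import tensorflow as tf

def f_uv(t, x):
    h = h_net(tf.concat([t, x], 1))   # shape [N, 2]: columns are u and v
    u, v = h[:, 0:1], h[:, 1:2]

    u_t = tf.gradients(u, t)[0]
    v_t = tf.gradients(v, t)[0]
    u_xx = tf.gradients(tf.gradients(u, x)[0], x)[0]
    v_xx = tf.gradients(tf.gradients(v, x)[0], x)[0]

    sq = u**2 + v**2                  # |h|^2
    f_u = -v_t + 0.5*u_xx + sq*u      # real part of the residual
    f_v =  u_t + 0.5*v_xx + sq*v      # imaginary part of the residual
    return f_u, f_v
```

The term $MSE_f$ is then the mean of $f_u^2 + f_v^2$ over the collocation points.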
In order to assess the accuracy of our method, we have simulated equation (5) using conventional spectral methods to create a high-resolution data set. Specifically, starting from an initial state $h(0, x) = 2\,\mathrm{sech}(x)$ and assuming periodic boundary conditions $h(t, -5) = h(t, 5)$ and $h_x(t, -5) = h_x(t, 5)$, we have integrated equation (5) up to a final time $t = \pi/2$ using the Chebfun package [22] with a spectral Fourier discretization with 256 modes and a fourth-order explicit Runge-Kutta temporal integrator with time-step $\Delta t = \pi/2 \times 10^{-6}$. Under our data-driven setting, all we observe are measurements $\{x_0^i, h_0^i\}_{i=1}^{N_0}$ of the latent function $h(t, x)$ at time $t = 0$. In particular, the training set consists of a total of $N_0 = 50$ data points on $h(0, x)$ randomly parsed from the full high-resolution data-set, as well as $N_b = 50$ randomly sampled collocation points $\{t_b^i\}_{i=1}^{N_b}$ for enforcing the periodic boundaries. Moreover, we have assumed $N_f = 20{,}000$ randomly sampled collocation points used to enforce equation (5) inside the solution domain. All randomly sampled point locations were generated using a space filling Latin Hypercube Sampling strategy [23].
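A Latin Hypercube design can be generated with any standard implementation; as one hedged example (the specific library is our choice, not the paper's), `scipy.stats.qmc` scaled to the domain $(t, x) \in [0, \pi/2] \times [-5, 5]$:

```python
# Example Latin Hypercube collocation points for the Schrodinger problem,
# scaled to (t, x) in [0, pi/2] x [-5, 5]. The choice of scipy.stats.qmc is
# illustrative; any Latin Hypercube implementation (e.g. pyDOE's lhs) works.
import numpy as np
from scipy.stats import qmc

sampler = qmc.LatinHypercube(d=2)
unit = sampler.random(n=20000)                                   # N_f points in [0, 1]^2
colloc = qmc.scale(unit, l_bounds=[0.0, -5.0], u_bounds=[np.pi / 2, 5.0])
t_f, x_f = colloc[:, 0:1], colloc[:, 1:2]
```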
Here our goal is to infer the entire spatio-temporal solution $h(t, x)$ of the Schrödinger equation (5). We chose to jointly represent the latent function $h(t, x) = [u(t, x), v(t, x)]$ using a 5-layer deep neural network with 100 neurons per layer and a hyperbolic tangent activation function. Figure 2 summarizes the results of our experiment. Specifically, the top panel of Figure 2 shows the magnitude of the predicted spatio-temporal solution $|h(t, x)| = \sqrt{u^2(t, x) + v^2(t, x)}$, along with the locations of the initial and boundary training data. The resulting prediction error is validated against the test data for this problem, and is measured at $1.97 \times 10^{-3}$ in the relative $L^2$-norm. A more detailed assessment of the predicted solution is presented in the bottom panel of Figure 2. In particular, we present a comparison between the exact and the predicted solutions at different time instants $t = 0.59, 0.79, 0.98$. Using only a handful of initial data, the physics informed neural network can accurately capture the intricate nonlinear behavior of the Schrödinger equation.

Figure 2: Schrödinger equation: Top: Predicted solution $|h(t, x)|$ along with the initial and boundary training data. In addition we are using 20,000 collocation points generated using a Latin Hypercube Sampling strategy. Bottom: Comparison of the predicted and exact solutions corresponding to the three temporal snapshots depicted by the dashed vertical lines in the top panel. The relative $L^2$ error for this case is $1.97 \times 10^{-3}$.

One potential limitation of the continuous time neural network models considered so far stems from the need to use a large number of collocation points $N_f$ in order to enforce physics informed constraints in the entire spatio-temporal domain. Although this poses no significant issues for problems in one or two spatial dimensions, it may introduce a severe bottleneck in higher dimensional problems, as the total number of collocation points needed to globally enforce a physics informed constraint (i.e., in our case a partial differential equation) will increase exponentially. In the next section, we put forth a different approach that circumvents the need for collocation points by introducing a more structured neural network representation leveraging the classical Runge-Kutta time-stepping schemes [24].
3. Discrete Time Models
Let us apply the general form of Runge-Kutta methods with $q$ stages [24] to equation (1) and obtain

$$ u^{n+c_i} = u^n - \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}], \quad i = 1, \ldots, q, $$
$$ u^{n+1} = u^n - \Delta t \sum_{j=1}^{q} b_j \mathcal{N}[u^{n+c_j}]. \qquad (7) $$

Here, $u^{n+c_j}(x) = u(t^n + c_j \Delta t, x)$ for $j = 1, \ldots, q$. This general form encapsulates both implicit and explicit time-stepping schemes, depending on the choice of the parameters $\{a_{ij}, b_j, c_j\}$. Equations (7) can be equivalently expressed as

$$ u^n = u^n_i, \quad i = 1, \ldots, q, $$
$$ u^n = u^n_{q+1}, \qquad (8) $$

where

$$ u^n_i := u^{n+c_i} + \Delta t \sum_{j=1}^{q} a_{ij} \mathcal{N}[u^{n+c_j}], \quad i = 1, \ldots, q, $$
$$ u^n_{q+1} := u^{n+1} + \Delta t \sum_{j=1}^{q} b_j \mathcal{N}[u^{n+c_j}]. \qquad (9) $$

We proceed by placing a multi-output neural network prior on

$$ \left[ u^{n+c_1}(x), \ldots, u^{n+c_q}(x), u^{n+1}(x) \right]. \qquad (10) $$

This prior assumption along with equations (9) result in a physics informed neural network that takes $x$ as an input and outputs

$$ \left[ u^n_1(x), \ldots, u^n_q(x), u^n_{q+1}(x) \right]. \qquad (11) $$
3.1. Example (Burgers’ Equation)
To highlight the key features of the discrete time representation we revisit the problem of data-driven solution of the Burgers' equation. For this case, the nonlinear operator in equation (9) is given by

$$ \mathcal{N}[u^{n+c_j}] = u^{n+c_j} u^{n+c_j}_x - (0.01/\pi) u^{n+c_j}_{xx}, $$

and the shared parameters of the neural networks (10) and (11) can be learned by minimizing the sum of squared errors

$$ SSE = SSE_n + SSE_b, \qquad (12) $$

where

$$ SSE_n = \sum_{j=1}^{q+1} \sum_{i=1}^{N_n} \left| u^n_j(x^{n,i}) - u^{n,i} \right|^2, $$

and

$$ SSE_b = \sum_{i=1}^{q} \left( \left| u^{n+c_i}(-1) \right|^2 + \left| u^{n+c_i}(1) \right|^2 \right) + \left| u^{n+1}(-1) \right|^2 + \left| u^{n+1}(1) \right|^2. $$

Here, $\{x^{n,i}, u^{n,i}\}_{i=1}^{N_n}$ corresponds to the data at time $t^n$. The Runge-Kutta scheme now allows us to infer the latent solution $u(t, x)$ in a sequential fashion. Starting from initial data $\{x^{n,i}, u^{n,i}\}_{i=1}^{N_n}$ at time $t^n$ and data at the domain boundaries $x = -1$ and $x = 1$, we can use the aforementioned loss function (12) to train the networks of (10), (11), and predict the solution at time $t^{n+1}$. A Runge-Kutta time-stepping scheme would then use this prediction as initial data for the next step and proceed to train again and predict $u(t^{n+2}, x)$, $u(t^{n+3}, x)$, etc., one step at a time.
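To make the construction of (9)-(11) concrete, the following sketch (our illustration, not the authors' implementation) maps the $q+1$ outputs of a hypothetical network `stages_net` to the physics informed outputs $u^n_1(x), \ldots, u^n_{q+1}(x)$ for the Burgers' operator, given Butcher tableau coefficients `a` (a $q \times q$ array) and `b` (length $q$) of the chosen implicit Runge-Kutta scheme:

```python
# Sketch: mapping the multi-output prior (10) to the physics informed outputs
# (11) via equation (9), for the Burgers' operator. stages_net is a hypothetical
# network returning q+1 columns [u^{n+c_1}, ..., u^{n+c_q}, u^{n+1}] at x;
# a (q x q) and b (q,) are assumed Butcher tableau coefficients.
import numpy as np
import tensorflow as tf

def discrete_pinn_outputs(x, dt, a, b, q):
    U = stages_net(x)                                  # shape [N, q+1]
    cols = [U[:, j:j + 1] for j in range(q + 1)]

    N_ops = []
    for j in range(q):                                 # N[u] = u u_x - (0.01/pi) u_xx per stage
        u_j = cols[j]
        u_jx = tf.gradients(u_j, x)[0]
        u_jxx = tf.gradients(u_jx, x)[0]
        N_ops.append(u_j * u_jx - (0.01 / np.pi) * u_jxx)

    # Equation (9): u^n_i = u^{n+c_i} + dt * sum_j a_ij N[u^{n+c_j}], i = 1..q,
    # and u^n_{q+1} = u^{n+1} + dt * sum_j b_j N[u^{n+c_j}].
    u_n = [cols[i] + dt * sum(a[i, j] * N_ops[j] for j in range(q)) for i in range(q)]
    u_n.append(cols[q] + dt * sum(b[j] * N_ops[j] for j in range(q)))
    return tf.concat(u_n, 1)   # per equation (8), every column should match the data at t^n
```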
In classical numerical analysis, these steps are usually confined to be small due to stability constraints for explicit schemes or computational complexity constraints for implicit formulations [24]. These constraints become more severe as the total number of Runge-Kutta stages $q$ is increased, and, for most problems of practical interest, one needs to take thousands to millions of such steps until the solution is resolved up to a desired final time. In sharp contrast to classical methods, here we can employ implicit Runge-Kutta schemes with an arbitrarily large number of stages at effectively no extra cost.² This enables us to take very large time steps while retaining stability and high predictive accuracy, therefore allowing us to resolve the entire spatio-temporal solution in a single step.

² To be precise, it is only the number of parameters in the last layer of the neural network that increases linearly with the total number of stages.

The result of applying this process to the Burgers' equation is presented in Figure 3. For illustration purposes, we start with a set of $N_n = 250$ initial data at $t = 0.1$, and employ a physics informed neural network induced by an implicit Runge-Kutta scheme with 500 stages to predict the solution at time $t = 0.9$ in a single step. The theoretical error estimates for this scheme predict a temporal error accumulation of $\mathcal{O}(\Delta t^{2q})$ [24], which in our case translates into an error way below machine precision, i.e., $\Delta t^{2q} = 0.8^{1000} \approx 10^{-97}$. To our knowledge, this is the first time that an implicit Runge-Kutta scheme of such high order has ever been used. Remarkably, starting from smooth initial data at $t = 0.1$ we can predict the nearly discontinuous solution at $t = 0.9$ in a single time-step with a relative $L^2$ error of $8.2 \times 10^{-4}$. This error is two orders of magnitude lower than the one reported in [9], and it is entirely attributed to the neural network's capacity to approximate $u(t, x)$, as well as to the degree that the sum of squared errors loss allows interpolation of the training data. The network architecture used here consists of 4 layers with 50 neurons in each hidden layer.

Figure 3: Burgers' equation: Top: Solution $u(t, x)$ along with the location of the initial training snapshot at $t = 0.1$ and the final prediction snapshot at $t = 0.9$. Bottom: Initial training data and final prediction at the snapshots depicted by the white vertical lines in the top panel. The relative $L^2$ error for this case is $8.2 \times 10^{-4}$.
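The quoted bound is easy to sanity check in log space, since $0.8^{1000}$ underflows double precision:

```python
# Sanity check of the quoted error-accumulation bound dt**(2q) with dt = 0.8, q = 500.
import math

dt, q = 0.8, 500
print(2 * q * math.log10(dt))   # about -96.9, i.e. 0.8**1000 is roughly 1e-97
```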
    Layers \ Neurons    10        25        50
    1                   4.1e-02   4.1e-02   1.5e-01
    2                   2.7e-03   5.0e-03   2.4e-03
    3                   3.6e-03   1.9e-03   9.5e-04

Table 3: Burgers' equation: Relative final prediction error measured in the $L^2$ norm for different number of hidden layers and neurons in each layer. Here, the number of Runge-Kutta stages is fixed to 500 and the time-step size to $\Delta t = 0.8$.
A detailed systematic study to quantify the effect of different network architectures is presented in Table 3. By keeping the number of Runge-Kutta stages fixed to $q = 500$ and the time-step size to $\Delta t = 0.8$, we have varied the number of hidden layers and the number of neurons per layer, and monitored the resulting relative $L^2$ error for the predicted solution at time $t = 0.9$. Evidently, as the neural network capacity is increased the predictive accuracy is enhanced.

The key parameters controlling the performance of our discrete time algorithm are the total number of Runge-Kutta stages $q$ and the time-step size $\Delta t$. In Table 4 we summarize the results of an extensive systematic study where we fix the network architecture to 4 hidden layers with 50 neurons per layer, and vary the number of Runge-Kutta stages $q$ and the time-step size $\Delta t$. Specifically, we see how cases with low numbers of stages fail to yield accurate results when the time-step size is large. For instance, the case $q = 1$ corresponding to the classical trapezoidal rule, and the case $q = 2$ corresponding to the 4th-order Gauss-Legendre method, cannot retain their predictive accuracy for time-steps larger than 0.2, thus mandating a solution strategy with multiple time-steps of small size. On the other hand, the ability to push the number of Runge-Kutta stages to 32 and even higher allows us to take very large time steps, and effectively resolve the solution in a single step without sacrificing the accuracy of our predictions. Moreover, numerical stability is not sacrificed either, as implicit Runge-Kutta is the only family of time-stepping schemes that remain A-stable regardless of their order, thus making them ideal for stiff problems [24]. These properties are unprecedented for an algorithm of such implementation simplicity, and illustrate one of the key highlights of our discrete time approach.
    q \ Δt   0.2       0.4       0.6       0.8
    1        3.5e-02   1.1e-01   2.3e-01   3.8e-01
    2        5.4e-03   5.1e-02   9.3e-02   2.2e-01
    4        1.2e-03   1.5e-02   3.6e-02   5.4e-02
    8        6.7e-04   1.8e-03   8.7e-03   5.8e-02
    16       5.1e-04   7.6e-02   8.4e-04   1.1e-03
    32       7.4e-04   5.2e-04   4.2e-04   7.0e-04
    64       4.5e-04   4.8e-04   1.2e-03   7.8e-04
    100      5.1e-04   5.7e-04   1.8e-02   1.2e-03
    500      4.1e-04   3.8e-04   4.2e-04   8.2e-04

Table 4: Burgers' equation: Relative final prediction error measured in the $L^2$ norm for different number of Runge-Kutta stages $q$ and time-step sizes $\Delta t$. Here, the network architecture is fixed to 4 hidden layers with 50 neurons in each layer.
3.1.1. Example (Allen-Cahn Equation)
This example aims to highlight the ability of the proposed discrete time models to handle different types of nonlinearity in the governing partial differential equation. To this end, let us consider the Allen-Cahn equation along with periodic boundary conditions

$$ u_t - 0.0001 u_{xx} + 5 u^3 - 5 u = 0, \quad x \in [-1, 1], \quad t \in [0, 1], \qquad (13) $$
$$ u(0, x) = x^2 \cos(\pi x), $$
$$ u(t, -1) = u(t, 1), $$
$$ u_x(t, -1) = u_x(t, 1). $$

The Allen-Cahn equation is a well-known equation from the area of reaction-diffusion systems. It describes the process of phase separation in multi-component alloy systems, including order-disorder transitions. For the Allen-Cahn equation, the nonlinear operator in equation (9) is given by

$$ \mathcal{N}[u^{n+c_j}] = -0.0001 u^{n+c_j}_{xx} + 5 \left( u^{n+c_j} \right)^3 - 5 u^{n+c_j}, $$

and the shared parameters of the neural networks (10) and (11) can be learned by minimizing the sum of squared errors

$$ SSE = SSE_n + SSE_b, \qquad (14) $$
where

$$ SSE_n = \sum_{j=1}^{q+1} \sum_{i=1}^{N_n} \left| u^n_j(x^{n,i}) - u^{n,i} \right|^2, $$

and

$$ SSE_b = \sum_{i=1}^{q} \left| u^{n+c_i}(-1) - u^{n+c_i}(1) \right|^2 + \left| u^{n+1}(-1) - u^{n+1}(1) \right|^2 + \sum_{i=1}^{q} \left| u^{n+c_i}_x(-1) - u^{n+c_i}_x(1) \right|^2 + \left| u^{n+1}_x(-1) - u^{n+1}_x(1) \right|^2. $$

Here, $\{x^{n,i}, u^{n,i}\}_{i=1}^{N_n}$ corresponds to the data at time $t^n$. We have generated a training and test data-set by simulating the Allen-Cahn equation (13) using conventional spectral methods. Specifically, starting from an initial condition $u(0, x) = x^2 \cos(\pi x)$ and assuming periodic boundary conditions $u(t, -1) = u(t, 1)$ and $u_x(t, -1) = u_x(t, 1)$, we have integrated equation (13) up to a final time $t = 1.0$ using the Chebfun package [22] with a spectral Fourier discretization with 512 modes and a fourth-order explicit Runge-Kutta temporal integrator with time-step $\Delta t = 10^{-5}$.
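Relative to the discrete-time Burgers' sketch shown earlier, only the per-stage nonlinear operator changes; a hypothetical drop-in replacement would be:

```python
# Sketch: Allen-Cahn nonlinear operator, N[u] = -0.0001*u_xx + 5*u^3 - 5*u,
# applied to a single stage column u_j of the multi-output network.
import tensorflow as tf

def allen_cahn_operator(u_j, x):
    u_jx = tf.gradients(u_j, x)[0]
    u_jxx = tf.gradients(u_jx, x)[0]
    return -0.0001 * u_jxx + 5.0 * u_j**3 - 5.0 * u_j
```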
In this example, we assume $N_n = 200$ initial data points that are randomly sub-sampled from the exact solution at time $t = 0.1$, and our goal is to predict the solution at time $t = 0.9$ using a single time-step with size $\Delta t = 0.8$. To this end, we employ a discrete time physics informed neural network with 4 hidden layers and 200 neurons per layer, while the output layer predicts 101 quantities of interest corresponding to the $q = 100$ Runge-Kutta stages $u^{n+c_i}(x)$, $i = 1, \ldots, q$, and the solution at final time $u^{n+1}(x)$. Figure 4 summarizes our predictions after the network has been trained using the loss function of equation (14). Evidently, despite the complex dynamics leading to a solution with two sharp internal layers, we are able to obtain an accurate prediction of the solution at $t = 0.9$ using only a small number of scattered measurements at $t = 0.1$.

Figure 4: Allen-Cahn equation: Top: Solution $u(t, x)$ along with the location of the initial training snapshot at $t = 0.1$ and the final prediction snapshot at $t = 0.9$. Bottom: Initial training data and final prediction at the snapshots depicted by the white vertical lines in the top panel. The relative $L^2$ error for this case is $6.99 \times 10^{-3}$.
4. Summary and Discussion
We have introduced physics informed neural networks, a new class of universal function approximators that is capable of encoding any underlying physical laws that govern a given data-set and that can be described by partial differential equations. In this work, we design data-driven algorithms for inferring solutions to general nonlinear partial differential equations, and constructing computationally efficient physics-informed surrogate models. The resulting methods showcase a series of promising results for a diverse collection of problems in computational science, and open the path for endowing deep learning with the powerful capacity of mathematical physics to model the world around us. As deep learning technology continues to grow rapidly both in terms of methodological and algorithmic developments, we believe that this is a timely contribution that can benefit practitioners across a wide range of scientific domains. Specific applications that can readily enjoy these benefits include, but are not limited to, data-driven forecasting of physical processes, model predictive control, multi-physics/multi-scale modeling and simulation.
We must note, however, that the proposed methods should not be viewed as replacements of classical numerical methods for solving partial differential equations (e.g., finite elements, spectral methods, etc.). Such methods have matured over the last 50 years and, in many cases, meet the robustness and computational efficiency standards required in practice. Our message here, as advocated in Section 3, is that classical methods such as the Runge-Kutta time-stepping schemes can coexist in harmony with deep neural networks, and offer invaluable intuition in constructing structured predictive algorithms. Moreover, the implementation simplicity of the latter greatly favors rapid development and testing of new ideas, potentially opening the path for a new era in data-driven scientific computing. This will be further highlighted in the second part of this paper in which physics informed neural networks are put to the test of data-driven discovery of partial differential equations.

Finally, in terms of future work, one pressing question involves addressing the problem of quantifying the uncertainty associated with the neural network predictions. Although this important element was naturally addressed in previous work employing Gaussian processes [9], it is not captured by the proposed methodology in its present form and requires further investigation.
Acknowledgements

This work received support by the DARPA EQUiPS grant N66001-15-2-4055, the MURI/ARO grant W911NF-15-1-0562, and the AFOSR grant FA9550-17-1-0013. All data and codes used in this manuscript are publicly available on GitHub at https://github.com/maziarraissi/PINNs.
References
[1] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with
deep convolutional neural networks, in: Advances in neural information
processing systems, pp. 1097–1105.
[2] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015)
436–444.
[3] B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept
learning through probabilistic program induction, Science 350 (2015)
1332–1338.
[4] B. Alipanahi, A. Delong, M. T. Weirauch, B. J. Frey, Predicting the se-
quence specificities of DNA-and RNA-binding proteins by deep learning,
Nature biotechnology 33 (2015) 831–838.
[5] M. Raissi, P. Perdikaris, G. E. Karniadakis, Inferring solutions of dif-
ferential equations using noisy multi-fidelity data, Journal of Computa-
tional Physics 335 (2017) 736–746.
[6] M. Raissi, P. Perdikaris, G. E. Karniadakis, Machine learning of linear differential equations using Gaussian processes, Journal of Computational Physics 348 (2017) 683–693.
[7] H. Owhadi, Bayesian numerical homogenization, Multiscale Modeling
& Simulation 13 (2015) 812–828.
[8] C. E. Rasmussen, C. K. Williams, Gaussian processes for machine learn-
ing, volume 1, MIT press Cambridge, 2006.
[9] M. Raissi, P. Perdikaris, G. E. Karniadakis, Numerical Gaussian pro-
cesses for time-dependent and non-linear partial differential equations,
arXiv preprint arXiv:1703.10230 (2017).
[10] M. Raissi, G. E. Karniadakis, Hidden physics models: Machine
learning of nonlinear partial differential equations, arXiv preprint
arXiv:1708.00588 (2017).
[11] H. Owhadi, C. Scovel, T. Sullivan, et al., Brittleness of Bayesian infer-
ence under finite information in a continuous world, Electronic Journal
of Statistics 9 (2015) 1–79.
[12] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks
are universal approximators, Neural networks 2 (1989) 359–366.
[13] A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Au-
tomatic differentiation in machine learning: a survey, arXiv preprint
arXiv:1502.05767 (2015).
[14] C. Basdevant, M. Deville, P. Haldenwang, J. Lacroix, J. Ouazzani,
R. Peyret, P. Orlandi, A. Patera, Spectral and finite difference solu-
tions of the Burgers equation, Computers & fluids 14 (1986) 23–41.
[15] S. H. Rudy, S. L. Brunton, J. L. Proctor, J. N. Kutz, Data-driven
discovery of partial differential equations, Science Advances 3 (2017).
[16] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin, et al., Tensorflow: Large-scale
machine learning on heterogeneous distributed systems, arXiv preprint
arXiv:1603.04467 (2016).
[17] D. C. Liu, J. Nocedal, On the limited memory BFGS method for large
scale optimization, Mathematical programming 45 (1989) 503–528.
[18] I. Goodfellow, Y. Bengio, A. Courville, Deep learning, MIT press, 2016.
[19] D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv
preprint arXiv:1412.6980 (2014).
[20] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, Y. LeCun, The
loss surfaces of multilayer networks, in: Artificial Intelligence and Statis-
tics, pp. 192–204.
[21] R. Shwartz-Ziv, N. Tishby, Opening the black box of deep neural net-
works via information, arXiv preprint arXiv:1703.00810 (2017).
[22] T. A. Driscoll, N. Hale, L. N. Trefethen, Chebfun guide, 2014.
[23] M. Stein, Large sample properties of simulations using latin hypercube
sampling, Technometrics 29 (1987) 143–151.
[24] A. Iserles, A first course in the numerical analysis of differential equa-
tions, 44, Cambridge University Press, 2009.

Discussion

The elegant thing about this approach is that automatic differentiation is already part of the training procedure of neural networks (back-propagation), so the required derivatives come essentially built in. The key point made here is that neural networks that can leverage prior information about the underlying system (i.e. the partial differential equation governing the system, and its boundary conditions) can regularize the search space of possible solutions, and make a very high-dimensional input/output search space navigable. One very attractive feature of this method is that the training data for the neural network comes simply from sampling the domains of x and t.

Burgers' equation was first introduced in 1915 by Harry Bateman, and is a fundamental PDE that occurs in many different areas of applied mathematics. Source: https://en.wikipedia.org/wiki/Burgers%27_equation

The universal function approximation theorem of neural networks - that a sufficiently wide neural network can approximate any function - is key to PINNs actually working, because the NN has to approximate the underlying function being solved for. Background on the different universal approximation theorems for NNs: https://en.wikipedia.org/wiki/Universal_approximation_theorem

"This simple yet powerful construction allows us to tackle a wide range of problems in computational science and introduces a potentially disruptive technology leading to the development of new data-efficient and physics-informed learning machines, new classes of numerical solvers for partial differential equations, as well as new data-driven approaches for model inversion and systems identification. The general aim of this work is to set the foundations for a new paradigm in modeling and computation that enriches deep learning with the longstanding developments in mathematical physics. These developments are presented in the context of two main problem classes: data-driven solution and data-driven discovery of partial differential equations."

Besides specifying the models u(x, t) and f(x, t), learning the parameters of the networks requires introducing a cost function, presented here as the MSE. As the MSE gets minimized, the NN converges on solutions for u(x, t) for any values of x and t.

Partial differential equations (PDEs) arise in many natural phenomena, and are equations that impose relations between the partial derivatives of a multivariable function. Famous PDEs include Maxwell's equations for electrodynamics, the heat flow equation, the wave equation, Schrödinger's equation and the Einstein field equations. These equations can form systems of equations (multiple coupled equations), involve nonlinear operators, depend on time or not, and come in many varieties. More background: https://en.wikipedia.org/wiki/Partial_differential_equation

Clearly, in this simple example of Burgers' equation, PINNs converge on the solutions to u(x, t) with very good accuracy.

Researchers have discussed physics informed neural networks for a long time. In 1997, Lagaris et al., in their seminal "Artificial Neural Networks for Solving Ordinary and Partial Differential Equations", presented a method to solve partial differential equations using NNs. In their work, NNs were used to approximate functions, and the cost term that gets minimized combines the initial/boundary conditions of the PDE with the residual of the differential equation that the feedforward NN is trained to satisfy. Lagaris et al. employed a multilayer perceptron with one hidden layer to approximate functions.

Fast forward a few decades: with faster computers and GPU parallel processing, the idea of using machine learning and NNs to solve PDEs came back in vogue. Physics informed neural networks (PINNs), presented here, can take advantage of advances in computing and converge on a number of problems. One of the biggest advantages of PINNs is that they are simple to code up: specifying the neural networks u(t, x) and f(t, x) only took a few lines of code, and the code is explainable and flows naturally from the PDE. In the small data regime, it is very easy for neural networks or other large-parameter models to overfit data.

![Imgur](https://imgur.com/Mo4h3rr.png)

Source: https://en.wikipedia.org/wiki/Overfitting