
On the Multi-Dimensional Chain-Rule

tags: others maths
date: 2023-09-10
update: 2024-01-18

Note: This article has been generated from a \(\LaTeX\) file by Pandoc; some things are/may be broken. The .pdf is available.

Lake Baikal in winter, Russia by Sergey Pesterev, via wikimedia.org (CC-BY-SA-4.0)


The goal here is to provide an in-depth version of a standard proof of the multi-dimensional differentiation chain-rule. A (very) generous amount of preliminaries is provided, partly as an exercise, but also to better understand how the various results and ideas fit together. You may want to skip to the last section for the chain-rule itself.
A good textbook reference is (Rudin 1976).


Vector spaces, Linear maps

I’ll assume familiarity with the definitions of a field, of a vector space, of a basis and of the dimension of a vector space. But let’s review how linear maps are defined, given their relevance in the context of multi-dimensional differentiation: the derivative is the best linear map approximating a function in the vicinity of a point.

Definition 1 (linear transformation). A linear transformation between two vector spaces over the same field is a vector space homomorphism: it’s a transformation preserving the vector space structure.
More precisely, let \(\phi : V \rightarrow W\) be a linear map between two vector spaces \((V, +_V, ._V)\) and \((W, +_W, ._W)\) over the same field \(\mathbb{K}\). Then, \((\forall (u, v)\in V^2)\) and \((\forall \alpha \in \mathbb{K})\):


addition is preserved: \[\boxed{ \phi(u+_Vv) = \phi(u)+_W\phi(v) }\]


Scalar multiplication is preserved: \[\boxed{ \phi(\alpha._Vu) = \alpha._W\phi(u) }\]

Remark 1. To denote a linear function \(\phi\) between two vector spaces \(V\) and \(W\) over the same field, we’ll use the following notation: \[\boxed{ \phi : V \xrightarrow{\sim} W }\]

Remark 2. We also talk of linear application, linear map and linear operator. While in general those terms can have different meanings, I’ll use them interchangeably here.

We’ll make frequent use of the following elementary result later on; I’ll omit the proof to save space:

Lemma 1. Let \(\phi : U \xrightarrow{\sim}V\) be a linear map between two vector spaces \((U, +_U, ._U)\) and \((V, +_V, ._V)\) over the same field \((\mathbb{K}, +, \times)\). Then, \(\phi\) must preserve the additive neutral elements: \[\boxed{ \phi(0_U) = 0_V }\]
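To make this concrete: over \(\mathbb{R}\), any matrix induces a linear map between coordinate spaces, and the two preservation properties, as well as Lemma 1, can be checked numerically. A minimal sketch with NumPy, using an arbitrarily chosen matrix and vectors:

```python
import numpy as np

# A linear map phi : R^3 -> R^2, induced by an (arbitrary) 2x3 matrix.
A = np.array([[1., 2., 0.],
              [0., -1., 3.]])
phi = lambda u: A @ u

u = np.array([1., 2., 3.])
v = np.array([-1., 0., 5.])
alpha = 2.5

# Addition is preserved: phi(u + v) == phi(u) + phi(v)
assert np.allclose(phi(u + v), phi(u) + phi(v))
# Scalar multiplication is preserved: phi(alpha.u) == alpha.phi(u)
assert np.allclose(phi(alpha * u), alpha * phi(u))
# Lemma 1: the additive neutral element is preserved: phi(0) == 0
assert np.allclose(phi(np.zeros(3)), np.zeros(2))
```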


While the one-dimensional derivative can escape explicit mention of norms, they become mandatory in higher dimensions. A norm is defined on a vector space, and there are subtleties tied to certain properties of the underlying field which can affect the definition of a norm.
For simplicity, we’ll only take into consideration a non-problematic case that will be of interest for the multi-dimensional chain-rule, namely \(\mathbb{R}\) or \(\mathbb{C}\).

Definition 2 (norm). Let \(V\) be a vector space over a field \(\mathbb{K}\) being either \(\mathbb{R}\) or \(\mathbb{C}\). A norm on \(V\) is a map \(\left\lVert . \right\rVert : V \rightarrow \mathbb{R}\) satisfying:

(i) The triangle inequality / sub-additivity

\[\boxed{ (\forall (u, v)\in V^2),\ \left\lVert u+v \right\rVert \leq \left\lVert u \right\rVert+\left\lVert v \right\rVert }\]

(ii) Absolute homogeneity

\[\boxed{ (\forall (u, k)\in V\times\mathbb{K}),\ \left\lVert ku \right\rVert = \left\lvert k \right\rvert\left\lVert u \right\rVert }\]

(iii) Positive-definiteness

\[\boxed{ (\forall u\in V),\ \left\lVert u \right\rVert = 0 \Rightarrow u = 0_V }\]

Definition 3 (Normed vector space). A vector space \((V, +, .)\) equipped with a norm \(\left\lVert . \right\rVert\) makes a normed vector space. It’s customary to denote it as a couple \((V, \left\lVert . \right\rVert)\).

Remark 3. It’s customary, and I’ll often be guilty of it, to use the same \(\left\lVert . \right\rVert\) symbol for different normed spaces, and thus in general different norms. You can always infer the correct norm by looking at which set its argument is from.
For example, if we say that \((U, \left\lVert . \right\rVert)\) and \((V, \left\lVert . \right\rVert)\) are two normed vector spaces, what we usually mean is that we have \((U, \left\lVert . \right\rVert_U)\) and \((V, \left\lVert . \right\rVert_V)\) where \(\left\lVert . \right\rVert_U\) and \(\left\lVert . \right\rVert_V\) are distinct functions.

Lemma 2. Let \((V, \left\lVert . \right\rVert)\) be a normed vector space. Then \[\boxed{ (\forall u\in V),\ \left\lVert u \right\rVert \geq 0 }\]

Proof. Indeed, let \(u\in V\). Then: \[\begin{aligned} 0 &=&& \left\lVert 0_V \right\rVert &\text{ (absolute homogeneity, with $k=0$)} \\ ~ &=&& \left\lVert u+(-u) \right\rVert &\text{ (additive inverses)} \\ ~ &\leq&& \left\lVert u \right\rVert + \left\lVert -u \right\rVert &\text{ (triangle inequality)} \\ ~ & = && \left\lVert u \right\rVert + \left\lvert -1 \right\rvert\left\lVert u \right\rVert &\text{ (absolute homogeneity)} \\ ~ & = && 2\left\lVert u \right\rVert \\ \end{aligned}\] \[\Leftrightarrow \left\lVert u \right\rVert \geq 0\] ◻

The following result will also be of use.

Theorem 1 (Inverse triangular inequality). Let \((V, \left\lVert . \right\rVert)\) be a normed space. Then: \[\boxed{ (\forall (u,v)\in V^2),\ \left\lvert \left\lVert u \right\rVert-\left\lVert v \right\rVert \right\rvert \leq \left\lVert u-v \right\rVert }\]

Proof. Indeed, let \((u, v)\in V^2\). Then, using the triangular inequality and some clever "\(+0\)" trick: \[\begin{aligned} \left\lVert u \right\rVert &=& \left\lVert u-v+v \right\rVert &\leq& \left\lVert u-v \right\rVert + \left\lVert v \right\rVert &&\Leftrightarrow&& \left\lVert u \right\rVert-\left\lVert v \right\rVert &\leq& \left\lVert u-v \right\rVert\\ \left\lVert v \right\rVert &=& \left\lVert v-u+u \right\rVert &\leq& \left\lVert v-u \right\rVert + \left\lVert u \right\rVert &&\Leftrightarrow&& \left\lVert v \right\rVert-\left\lVert u \right\rVert &\leq& \left\lVert v-u \right\rVert\\ \end{aligned}\] But, by absolute homogeneity: \[\left\lVert v-u \right\rVert = \left\lVert (-1)(u-v) \right\rVert = \left\lvert -1 \right\rvert\left\lVert u-v \right\rVert = \left\lVert u-v \right\rVert\] And so we really have the following system of inequalities: \[\begin{cases} \left\lVert u \right\rVert-\left\lVert v \right\rVert &\leq \left\lVert u-v \right\rVert \\ -(\left\lVert u \right\rVert-\left\lVert v \right\rVert) &\leq \left\lVert u-v \right\rVert \\ \end{cases}\] But this is exactly what we wanted to prove, considering the definition of the absolute value: \[\left\lvert \left\lVert u \right\rVert-\left\lVert v \right\rVert \right\rvert = \begin{cases} \left\lVert u \right\rVert-\left\lVert v \right\rVert & \text{ when } \left\lVert u \right\rVert-\left\lVert v \right\rVert \geq 0 \\ -(\left\lVert u \right\rVert-\left\lVert v \right\rVert) = \left\lVert v \right\rVert-\left\lVert u \right\rVert & \text{ otherwise} \\ \end{cases}\] ◻
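Both the triangle inequality and this inverse form can be sanity-checked numerically; a minimal sketch with NumPy’s Euclidean norm, on randomly drawn vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    u = rng.normal(size=4)
    v = rng.normal(size=4)
    # Triangle inequality: ||u+v|| <= ||u|| + ||v||
    assert np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v) + 1e-12
    # Inverse triangle inequality: | ||u|| - ||v|| | <= ||u-v||
    assert abs(np.linalg.norm(u) - np.linalg.norm(v)) <= np.linalg.norm(u - v) + 1e-12
```

(The small \(10^{-12}\) slack only accounts for floating-point rounding.)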

Continuity, limits

We now want to establish an important result, allowing us to "swap" limits and function calls, as long as the function is continuous, and as long as the limit exists. As we’ll see later, norms and linear functions in particular are continuous, and so we can swap limits with those.
For clarity, let’s first recall how (multi-dimensional) limits are defined, in terms of \(\epsilon-\delta\), and how multi-dimensional continuity is defined in terms of limit.

Definition 4. Let \(f : U \rightarrow V\) where \(U\) and \(V\) are two normed vector spaces. We say that \(f\) admits \(L\in V\) as a limit at \(a\in U\) when:

\[(\forall \epsilon > 0),\ (\exists \delta > 0),\ (\forall u \in U),\quad\left\lVert u-a \right\rVert < \delta \quad\Rightarrow\quad\left\lVert f(u) - L \right\rVert < \epsilon\]

Definition 5. Let \(f : U \rightarrow V\) be a function between two normed vector spaces \(U\) and \(V\).
1) We say that \(f\) is continuous at \(a\in U\) if: \[\boxed{ \lim_{x\rightarrow a}f(x) = f(a) }\] i.e.: \[(\forall \epsilon > 0),\ (\exists \delta > 0),\ (\forall u \in U),\quad\left\lVert u-a \right\rVert < \delta \quad\Rightarrow\quad\left\lVert f(u) - f(a) \right\rVert < \epsilon\]

2) If \(f\) is continuous at every point of \(X\subseteq U\), we say that \(f\) is continuous on \(X\).

Theorem 2. Let \(f : V \rightarrow W\) and \(g : U \rightarrow V\), where \(U, V, W\) are three normed vector spaces. Furthermore, let \(L\in V\) be the limit of \(g\) as its input goes to \(a\), and let \(f\) be continuous at least at \(L\); then \[\boxed{ \lim_{u\rightarrow a}\Bigl(f(g(u))\Bigr) = f\left( \lim_{u\rightarrow a}g(u) \right) }\]

Proof. We can rephrase the limit condition on \(g\) as: \[(\forall \epsilon' > 0),\ (\exists \delta_{\epsilon'} > 0),\ (\forall u \in U),\quad\left\lVert u-a \right\rVert < \delta_{\epsilon'} \quad\Rightarrow\quad\left\lVert g(u) - L \right\rVert < \epsilon'\] And the continuity of \(f\) at \(L\) as: \[(\forall \epsilon > 0),\ (\exists \delta' > 0),\ (\forall v \in V),\quad\left\lVert v-L \right\rVert < \delta' \quad\Rightarrow\quad\left\lVert f(v) - f(L) \right\rVert < \epsilon\]

So, if we’re given an \(\epsilon > 0\), the continuity condition of \(f\) at \(L\) yields: \[(\exists \delta' > 0),\ (\forall v\in V),\ \left\lVert v-L \right\rVert<\delta' \quad\Rightarrow\quad \left\lVert f(v)-f(L) \right\rVert < \epsilon\] As this is true \((\forall v\in V)\), it’s in particular true for any \(g(u)\), where \(u\in U\): \[(\exists \delta' > 0),\ (\forall u\in U),\ \left\lVert g(u)-L \right\rVert<\delta' \quad\Rightarrow\quad \left\lVert f(g(u))-f(L) \right\rVert < \epsilon\]

Now observe that the limit condition on \(g\) is true for any \(\epsilon'\): in particular, it must be true for \(\epsilon' = \delta'\): \[(\exists \delta_{\delta'} > 0),\ (\forall u\in U),\ \left\lVert u-a \right\rVert<\delta_{\delta'} \quad\Rightarrow\quad \left\lVert g(u)-L \right\rVert < \delta'\]

What we have so far can be rewritten as: \[(\forall \epsilon > 0),\ (\exists \delta' > 0),\ (\exists \delta_{\delta'} > 0),\ (\forall u\in U),\ \begin{cases} \left\lVert g(u)-L \right\rVert<\delta' \quad\Rightarrow\quad \left\lVert f(g(u))-f(L) \right\rVert < \epsilon \\ \left\lVert u-a \right\rVert<\delta_{\delta'} \quad\Rightarrow\quad \left\lVert g(u)-L \right\rVert < \delta' \\ \end{cases}\] Or combining the two implications: \[(\forall \epsilon > 0),\ (\exists \delta' > 0),\ (\exists \delta_{\delta'} > 0),\ (\forall u\in U),\ \left\lVert u-a \right\rVert<\delta_{\delta'} \Rightarrow \left\lVert g(u)-L \right\rVert < \delta' \Rightarrow \left\lVert f(g(u))-f(L) \right\rVert < \epsilon\] Which can be simplified to: \[(\forall \epsilon > 0),\ (\exists \delta > 0),\ (\forall u \in U),\quad\left\lVert u-a \right\rVert < \delta \quad\Rightarrow\quad\left\lVert f(g(u)) - f(L) \right\rVert < \epsilon\]

Which is exactly the \(\epsilon-\delta\) formulation of our result. ◻

We now prove our two important special cases.

Theorem 3. Let \((V, \left\lVert . \right\rVert)\) be a normed vector space. Then \(\left\lVert . \right\rVert : V \rightarrow \mathbb{R}\) is continuous, w.r.t. the topology induced by itself on \(V\), and w.r.t. the standard topology on \(\mathbb{R}\).

Remark 4. Observe that our present definition of continuity depends on two norms, one on the domain, one on the codomain of the functions being considered. A more general way to talk about continuity is to equip both sets with a different kind of mathematical structure: topologies. There are standard constructions producing topologies from metrics, which themselves can easily be obtained from norms.
The theorem is simply being precise in this regard.

Proof. Let’s rephrase what we want to prove in terms of \(\epsilon-\delta\) (note that the norm on \(\mathbb{R}\) is the absolute value): \[(\forall a\in V),\ (\forall \epsilon > 0),\ (\exists \delta > 0),\ (\forall v \in V),\quad\left\lVert v-a \right\rVert < \delta \quad\Rightarrow\quad\left\lvert \left\lVert v \right\rVert - \left\lVert a \right\rVert \right\rvert < \epsilon\] But recall the inverse triangular inequality: \[(\forall (u,v)\in V^2),\ \left\lvert \left\lVert u \right\rVert-\left\lVert v \right\rVert \right\rvert \leq \left\lVert u-v \right\rVert\] So in particular, if we pick an \(a\in V\), then an arbitrary \(\epsilon>0\), then take \(\delta=\epsilon\), we have: \[(\forall v\in V),\ \left\lvert \left\lVert v \right\rVert-\left\lVert a \right\rVert \right\rvert \leq \left\lVert v-a \right\rVert < \delta=\epsilon\] Which concludes the proof: as soon as we have \(\left\lVert v-a \right\rVert < \epsilon\), \(\left\lvert \left\lVert v \right\rVert-\left\lVert a \right\rVert \right\rvert < \epsilon\) naturally follows. ◻

We’re now ready to start tackling the case of the continuity of linear maps. Let’s first recall the notion of the supremum of a subset of the extended real line.

Definition 6 (supremum). Let \(E\) be a subset of the extended real line: \[\overline{\mathbb{R}} := [-\infty,+\infty] = \mathbb{R}\cup\{-\infty,+\infty\}\] Then the supremum of \(E\) is the least upper bound of \(E\) in \(\overline{\mathbb{R}}\). It is denoted \(\sup E\).

Remark 5. The notion of supremum can be defined in a more general context, in which case it might not exist. But it’s guaranteed to exist here.

This allows us to define something commonly referred to as the operator norm, which, as the name indicates, is a norm: a norm on the vector space obtained by equipping the set of all linear applications between two given vector spaces with the usual pointwise operations. However, we won’t need this vector space structure, nor to establish that this is actually a norm, so I won’t go into further details.
Saying it otherwise, for the purpose of those notes, we’re just defining a symbol associated to a linear application.

Definition 7 (operator norm). Let \((U, \left\lVert . \right\rVert_U)\) and \((V, \left\lVert . \right\rVert_V)\) be two normed vector spaces. Let \(\phi : U \xrightarrow{\sim}V\) be a linear map. The following quantity is called the operator norm of \(\phi\): \[\boxed{ \overline{\mathbb{R}}\ni\left\lVert \phi \right\rVert := \begin{cases} \sup \left\{ \frac{\left\lVert \phi(u) \right\rVert_V}{\left\lVert u \right\rVert_U} \quad\middle|\quad u \in U\backslash\{0_U\} \right\} & \text{ if } U \neq \{0_U\}; \\ 0 & \text{ otherwise.} \\ \end{cases} }\]
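For the Euclidean norms on \(\mathbb{R}^m\) and \(\mathbb{R}^n\), the operator norm of (the linear map induced by) a matrix is its largest singular value, which NumPy computes as `np.linalg.norm(A, 2)`. A sketch comparing it against the supremum definition, approximated by sampling random directions; the matrix is arbitrarily chosen:

```python
import numpy as np

A = np.array([[3., 0.],
              [4., 5.]])
op_norm = np.linalg.norm(A, 2)   # largest singular value of A

# Approximate sup{ ||A u|| / ||u|| : u != 0 } by sampling directions.
rng = np.random.default_rng(1)
us = rng.normal(size=(100_000, 2))
ratios = np.linalg.norm(us @ A.T, axis=1) / np.linalg.norm(us, axis=1)

# Every ratio stays below the operator norm, and the sup is approached.
assert ratios.max() <= op_norm + 1e-9
assert ratios.max() > 0.999 * op_norm
```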

From there, we can derive the notion of bounded operator:

Definition 8 (bounded operator). Let \((U, \left\lVert . \right\rVert_U)\) and \((V, \left\lVert . \right\rVert_V)\) be two normed vector spaces. Let \(\phi : U \xrightarrow{\sim}V\) be a linear map (also called a linear operator). We say that: \[\boxed{ \phi \text{ is bounded } \quad:\Leftrightarrow\quad \left\lVert \phi \right\rVert<+\infty }\]

One last step before moving on to the continuity of linear maps:

Theorem 4. Let \((U, \left\lVert . \right\rVert_U)\) and \((V, \left\lVert . \right\rVert_V)\) be two normed vector spaces over a field \(\mathbb{K}\), and where \(U\) is finite-dimensional. Let \(\phi : U \xrightarrow{\sim}V\) be a linear map. Then \(\boxed{\phi\text{ is bounded}}\).

Proof. First, if \(U = \{0_U\}\), then \(\left\lVert \phi \right\rVert=0<+\infty\) and we’re done.
So, consider the case where \(U \ne \{0_U\}\). Let \(n = \dim U\), take a basis \(\{e_i\}_{i\in\{1,2,\ldots,n\}}\) for \(U\). Let then \(u\in U\backslash\{0_U\}\), and expand it in our basis: \[u = \sum_{i=1}^nu^ie_i;\quad\text{ where } (\forall i\in \{1,2,\ldots,n\}),\ u^i\in\mathbb{K}\] Then: \[\begin{aligned} \frac{\left\lVert \phi(u) \right\rVert}{\left\lVert u \right\rVert} &=&& \frac{1}{\left\lVert u \right\rVert}\left\lVert \phi(\sum_{i=1}^nu^ie_i) \right\rVert &&\text{ (expanding $u$)} \\ ~ &=&& \frac{1}{\left\lVert u \right\rVert}\left\lVert \sum_{i=1}^nu^i\phi(e_i) \right\rVert &&\text{ (linearity of $\phi$)} \\ ~ &\leq&& \underbrace{ \frac{1}{\left\lVert u \right\rVert}\sum_{i=1}^n\left\lVert u^i\phi(e_i) \right\rVert }_{=:\alpha} &&\text{ (triangular inequality)} \\ \end{aligned}\] But \(u\neq 0\) implies \(\left\lVert u \right\rVert\neq 0\), and the numerator is a finite sum of real numbers, so \(\alpha\in\mathbb{R}\). Furthermore, as \(u\) was arbitrary, this implies by definition of the operator norm that \(\left\lVert \phi \right\rVert\leq\alpha\in\mathbb{R}\), i.e. \(\left\lVert \phi \right\rVert<+\infty\). ◻


Theorem 5. Let \((U, \left\lVert . \right\rVert_U)\) and \((V, \left\lVert . \right\rVert_V)\) be two normed vector spaces. Then any bounded linear map \(\phi : U \xrightarrow{\sim}V\) is continuous, w.r.t. the topologies induced by the norms on each space.

Proof. For clarity, let’s rephrase what we want to prove in terms of \(\epsilon-\delta\): \[(\forall a\in U),\ (\forall \epsilon > 0),\ (\exists \delta > 0),\ (\forall u \in U),\quad\left\lVert u-a \right\rVert < \delta \quad\Rightarrow\quad\left\lVert \phi(u) - \phi(a) \right\rVert < \epsilon\]

Let then \(a\in U\) and \(\epsilon > 0\).
First, if \(U=\{0_U\}\), then pick say \(\delta:=\epsilon\).
Then we can only pick \(u = a = 0_U\), as there are no other vectors in the space. By positive-definiteness of the norm on \(U\), we always have \(\left\lVert u-a \right\rVert = \left\lVert 0_U \right\rVert = 0 < \delta\). Furthermore, \(\phi(u)=\phi(a)=0_V\), as \(\phi\) is linear. And by positive-definiteness of the norm on \(V\), \(\left\lVert \phi(u)-\phi(a) \right\rVert = \left\lVert 0_V \right\rVert = 0 < \epsilon\)
Second, consider the case \(U\neq\{0_U\}\). If \(\left\lVert \phi \right\rVert = 0\), then \(\phi\) must be the zero map: the supremum of the non-negative ratios \(\left\lVert \phi(u) \right\rVert/\left\lVert u \right\rVert\) being zero, each must be zero, so by positive-definiteness \(\phi(u)=0_V\) for \(u\neq 0_U\), while \(\phi(0_U)=0_V\) by Lemma 1. Any \(\delta\), say \(\delta:=\epsilon\), then works, as \(\left\lVert \phi(u)-\phi(a) \right\rVert=\left\lVert 0_V \right\rVert=0<\epsilon\) always. Otherwise, \(\left\lVert \phi \right\rVert\neq0\); but \(\phi\) is bounded, so we have \(0<\left\lVert \phi \right\rVert<+\infty\), and thus we can define: \[\delta := \frac{\epsilon}{\left\lVert \phi \right\rVert}\]

Then, if \(u=a\), we’re actually in the same case as before: \(\left\lVert u-a \right\rVert=\left\lVert 0_U \right\rVert=0 < \delta\) always, by positive-definiteness of the norm on \(U\). And \(\left\lVert \phi(u)-\phi(a) \right\rVert = \left\lVert \phi(u-a) \right\rVert = \left\lVert \phi(0_U) \right\rVert = \left\lVert 0_V \right\rVert = 0 < \epsilon\), by linearity of \(\phi\), and because linear applications "send zero to zero", and by positive-definiteness of the norm on \(V\).
Otherwise, \((\forall u\in U\backslash\{a\})\)

\[\begin{aligned} \left\lVert u-a \right\rVert < \delta & \Rightarrow && \left\lVert u-a \right\rVert < \frac{\epsilon}{\left\lVert \phi \right\rVert} & \text{ (definition of $\delta$)} \\ ~ & \Rightarrow && \left\lVert \phi \right\rVert < \frac{\epsilon}{\left\lVert u-a \right\rVert} & \text{ ($u\neq a \Rightarrow \left\lVert u-a \right\rVert\neq 0$, positive-definiteness)} \\ ~ & \Rightarrow && \frac{\left\lVert \phi(u-a) \right\rVert}{\left\lVert u-a \right\rVert} \leq \left\lVert \phi \right\rVert < \frac{\epsilon}{\left\lVert u-a \right\rVert} & \text{ (definition of $\left\lVert \phi \right\rVert$)} \\ ~ & \Rightarrow && \left\lVert \phi(u-a) \right\rVert< \epsilon & \text{ ($\left\lVert u-a \right\rVert \geq 0$)} \\ ~ & \Rightarrow && \left\lVert \phi(u)-\phi(a) \right\rVert < \epsilon & \text{ ($\phi$ is linear)} \\ \end{aligned}\]

Which is what we wanted to prove. ◻
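The key inequality behind this proof, \(\left\lVert \phi(u)-\phi(a) \right\rVert = \left\lVert \phi(u-a) \right\rVert \leq \left\lVert \phi \right\rVert\left\lVert u-a \right\rVert\), can also be sanity-checked numerically; a sketch with NumPy and an arbitrarily chosen matrix (for Euclidean norms, `np.linalg.norm(A, 2)` is the operator norm):

```python
import numpy as np

A = np.array([[1., -2., 0.],
              [3., 1., 4.]])
phi = lambda u: A @ u
op_norm = np.linalg.norm(A, 2)   # operator norm w.r.t. Euclidean norms

rng = np.random.default_rng(3)
for _ in range(1000):
    u, a = rng.normal(size=3), rng.normal(size=3)
    # Lipschitz continuity: ||phi(u) - phi(a)|| <= ||phi||.||u - a||
    assert np.linalg.norm(phi(u) - phi(a)) <= op_norm * np.linalg.norm(u - a) + 1e-9
```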

Remark 6. We proved a slightly more general result than what is needed in the context of those notes. But it’s not much more complicated, perhaps even simpler than having to rigorously deal with explicit matrices as is often the case, and it will generalize better.
Hopefully, I didn’t miss anything.

Multi-dimensional differentiation

Let’s now recall how the derivative is defined in a multi-dimensional setting. This can be presented in different ways, for example via the Jacobian. But we can stick to the "best local linear approximation" type of definition, as it feels more explicit, avoids the need for partial and directional derivatives, and generalizes better.
I’ll be considering the real case here, but this should work equally well on the complex field.

Definition 9 (Total derivative, vector-valued real function). Take a non-empty open \(U\subseteq\mathbb{R}^m\), and let \(f:U\rightarrow\mathbb{R}^n\) be a vector-valued function, where \(\mathbb{R}^m\) and \(\mathbb{R}^n\) are considered as normed vector spaces over \(\mathbb{R}\).
1) If, for a point \(p\in U\), there exists a linear function \(\lambda_{f,p} : \mathbb{R}^m \xrightarrow{\sim}\mathbb{R}^n\) such that: \[\boxed{ \lim_{\epsilon\rightarrow0}\frac{ \left\lVert f(p+\epsilon) - f(p) - \lambda_{f,p}(\epsilon) \right\rVert_{\mathbb{R}^n} }{\left\lVert \epsilon \right\rVert_{\mathbb{R}^m}} = 0 }\] Then we say that \(f\) is differentiable at \(p\); \(\lambda_{f,p}\) is the total derivative of \(f\) at \(p\): it’s the best linear approximation of \(f\) in a neighborhood of \(p\). (Note that \(U\) being open, \(p+\epsilon\in U\) for \(\epsilon\) small enough, so the limit makes sense.)
2) If \(f\) is differentiable at every point of \(U\), we say that \(f\) is differentiable in \(U\); the following \(df\) function is then called the total derivative of \(f\) in \(U\): \[\boxed { df : \begin{pmatrix} U & \rightarrow & (\mathbb{R}^m & \xrightarrow{\sim}& \mathbb{R}^n) \\ p & \mapsto & (q&\mapsto&\lambda_{f,p}(q)) \\ \end{pmatrix} }\]

Remark 7. The total derivative is also called the total differential, or sometimes the differential, of \(f\).

Remark 8. Let me emphasize that by contrast with the scalar-valued partial/directional derivatives, or what is commonly refered to as "derivative" in the one-dimensional case, the total derivative \(df\) is function-valued: for a point in the domain, it returns a linear function, i.e. the best linear approximation of \(f\) at that point: \[\boxed{ df : (U\subseteq \mathbb{R}^m) \rightarrow (\mathbb{R}^m \xrightarrow{\sim}\mathbb{R}^n) }\]
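To make the definition concrete, here is a numerical sketch for an arbitrarily chosen function \(f(x, y) = (x^2y,\ x+y)\): its total derivative at \(p\) is the linear map represented by the Jacobian matrix, and the ratio in the definition indeed shrinks with \(\left\lVert \epsilon \right\rVert\):

```python
import numpy as np

def f(v):
    x, y = v
    return np.array([x**2 * y, x + y])

def df(p):
    # Matrix representing lambda_{f,p}: the Jacobian of f at p.
    x, y = p
    return np.array([[2*x*y, x**2],
                     [1.,    1.  ]])

p = np.array([1., 2.])
direction = np.array([0.6, 0.8])   # an arbitrarily chosen unit vector

ratios = []
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    eps = t * direction
    num = np.linalg.norm(f(p + eps) - f(p) - df(p) @ eps)
    ratios.append(num / np.linalg.norm(eps))

# The ratio in the definition decreases as ||eps|| -> 0...
assert all(r2 < r1 for r1, r2 in zip(ratios, ratios[1:]))
# ...and is already tiny for ||eps|| = 1e-4.
assert ratios[-1] < 1e-3
```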

Theorem 6 (Unicity of the derivative). Let \(f : (U\subseteq\mathbb{R}^m) \rightarrow \mathbb{R}^n\) be a function, differentiable at \(p\in U\), for \(U\) open. Then the derivative of \(f\) at \(p\), \(\lambda_{f,p}\), is unique.

Proof. Indeed, suppose we have two such derivatives, say, \(\lambda_{f,p}\) and \(\sigma_{f,p}\), that we’ll rename \(\lambda\) and \(\sigma\). If we want to prove that those two are equal, we need to prove that they are point-wise equal, i.e.: \[\lambda=\sigma\quad:\Leftrightarrow\quad (\forall q\in\mathbb{R}^m),\ \lambda(q)=\sigma(q)\] First, observe that because both are linear functions by definition, we must have: \[\lambda(0) = \sigma(0) = 0\] Then, pick \(q\in\mathbb{R}^m\backslash\{0\}\): \[\begin{aligned} \left\lVert \lambda(q)-\sigma(q) \right\rVert &=&& \left\lVert -\Bigl( f(p+q)-f(p)-\lambda({q}) \Bigr) + \Bigl( f(p+q)-f(p)-\sigma({q}) \Bigr) \right\rVert & \text{ (clever $+0$ trick)}\\ ~ &\leq && \left\lVert -\Bigl( f(p+q)-f(p)-\lambda({q}) \Bigr) \right\rVert + \left\lVert f(p+q)-f(p)-\sigma({q}) \right\rVert & \text{ (triangular inequality)}\\ ~ &\leq && \left\lvert -1 \right\rvert\left\lVert f(p+q)-f(p)-\lambda({q}) \right\rVert + \left\lVert f(p+q)-f(p)-\sigma({q}) \right\rVert & \text{ (absolute homogeneity)}\\ ~ &\leq && \left\lVert f(p+q)-f(p)-\lambda({q}) \right\rVert + \left\lVert f(p+q)-f(p)-\sigma({q}) \right\rVert \\ \end{aligned}\]

And thus in particular, by definition of \(\lambda\), \(\sigma\) as differentials of \(f\) at \(p\), and because \(q\neq0 \Rightarrow \left\lVert q \right\rVert\neq0\) (by positive-definiteness of the norm):

\[\begin{aligned} \lim_{q\rightarrow0}\left( \frac{\left\lVert \lambda(q)-\sigma(q) \right\rVert} {\left\lVert q \right\rVert} \right) &\leq&& \underbrace{ \lim_{q\rightarrow0}\left( \frac{ \left\lVert f(p+q) -f(p)-\lambda({q}) \right\rVert } {\left\lVert q \right\rVert} \right) }_{=0} + \underbrace{ \lim_{q\rightarrow0}\left( \frac{ \left\lVert f(p+q) -f(p)-\sigma({q}) \right\rVert } {\left\lVert q \right\rVert} \right) }_{=0} \\ ~ &\leq&& 0 \\ \end{aligned}\] But then consider \(q = kq'\) for \(k\in\mathbb{R}\backslash\{0\}\) and \(q'\in\mathbb{R}^m\backslash\{0\}\). Then:

\[\begin{aligned} \lim_{q\rightarrow0}\left( \frac{\left\lVert \lambda(q)-\sigma(q) \right\rVert} {\left\lVert q \right\rVert} \right) &=&& \lim_{k\rightarrow 0}\left( \frac{\left\lVert \lambda(kq')-\sigma(kq') \right\rVert} {\left\lVert kq' \right\rVert} \right) \\ ~ &=&& \lim_{k\rightarrow 0}\left( \frac{\left\lVert k\lambda(q')-k\sigma(q') \right\rVert} {\left\lVert kq' \right\rVert} \right) & \text{ (linearity)} \\ ~ &=&& \lim_{k\rightarrow 0}\left( \frac{\left\lvert k \right\rvert\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lvert k \right\rvert\left\lVert q' \right\rVert} \right) & \text{ (absolute homogeneity)} \\ ~ &=&& \lim_{k\rightarrow 0}\left( \frac{\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lVert q' \right\rVert} \right) \\ ~ &=&& \frac{\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lVert q' \right\rVert} \\ \end{aligned}\]

So by combining our two previous results, we’ve proved that for \(q'\in\mathbb{R}^m\backslash\{0\}\), \[\frac{\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lVert q' \right\rVert} \leq 0\] But the norm is positive, so: \[0 \leq \frac{\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lVert q' \right\rVert} \leq 0 \quad\Leftrightarrow\quad \frac{\left\lVert \lambda(q')-\sigma(q') \right\rVert} {\left\lVert q' \right\rVert} = 0\] Now, \(\left\lVert q' \right\rVert\) is finite and nonzero, and thus we have: \[\left\lVert \lambda(q')-\sigma(q') \right\rVert = 0\] Finally, by positive definiteness, this is equivalent to saying: \[\lambda(q') - \sigma(q') = 0 \quad\Leftrightarrow\quad \lambda(q') = \sigma(q')\] And so the two differentials are indeed (pointwise) equal on all of \(\mathbb{R}^m\). ◻

The following little lemma offers a convenient alternative characterization of the total differential.

Lemma 3 (Alternative characterization for the total derivative). Take a non-empty open \(U\subseteq\mathbb{R}^m\), and let \(f:U\rightarrow\mathbb{R}^n\) be a vector-valued function, where \(\mathbb{R}^m\) and \(\mathbb{R}^n\) are considered as normed vector spaces over \(\mathbb{R}\).
Then \(f\) is totally differentiable at \(p\in U\) iff: \[\boxed{ \left(\exists \mu \in \mathcal{C}^0(U \rightarrow \mathbb{R}^n), \lim_{\epsilon\rightarrow0} \frac{\left\lVert \mu(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\right),\ f(p+\epsilon) -f(p) = \lambda_{f,p}(\epsilon) + \mu(\epsilon) }\]

Proof. We have: \[\begin{aligned} ~ && \left(\exists \mu \in \mathcal{C}^0(U \rightarrow \mathbb{R}^n), \lim_{\epsilon\rightarrow0} \frac{\left\lVert \mu(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\right),\ f(p+\epsilon) -f(p) = \lambda_{f,p}(\epsilon) + \mu(\epsilon) \\ \Leftrightarrow && \left(\exists \mu \in \mathcal{C}^0(U \rightarrow \mathbb{R}^n), \lim_{\epsilon\rightarrow0} \frac{\left\lVert \mu(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\right),\ \mu(\epsilon) = f(p+\epsilon) -f(p) - \lambda_{f,p}(\epsilon) \\ \Leftrightarrow && \left( \exists \mu \in \mathcal{C}^0(U \rightarrow \mathbb{R}^n) \right), 0 = \lim_{\epsilon\rightarrow0} \frac{\left\lVert \mu(\epsilon) \right\rVert} {\left\lVert \epsilon \right\rVert} = \lim_{\epsilon\rightarrow0} \frac{\left\lVert f(p+\epsilon) -f(p) -\lambda_{f,p}(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} \\ \Leftrightarrow && \text{ $f$ differentiable at $p$ } \\ \end{aligned}\] ◻


Theorem 7 (Multi-dimensional chain rule). Let \((U\subseteq\mathbb{R}^m) \xrightarrow{g}(V\subseteq\mathbb{R}^d) \xrightarrow{f}\mathbb{R}^n\).
1) Let \(g\) be differentiable at \(p\in U\), and \(f\) be differentiable at \(g(p)\in V\). Then \(f\circ g\) is differentiable at \(p\) and: \[\boxed{ \lambda_{f\circ g,p} = \lambda_{f,g(p)} \circ \lambda_{g,p} }\]

2) Of course, this naturally generalizes to open subsets: if \(g\) is differentiable on an open \(U\subseteq\mathbb{R}^m\) and \(f\) is differentiable on an open \(V\subseteq\mathbb{R}^d\), where \(g(U)\subseteq V\), then, \((\forall p\in U)\): \[\boxed{ d(f\circ g)(p) = df(g(p)) \circ dg(p) }\]

Proof. The goal is to prove that \(f\circ g\) is differentiable at \(p\), and that the best linear approximation at \(p\) is given by the composition of two linear functions: \(df(g(p)) \circ dg(p)\).
By the definition of the differentiation at a point, this means we need to prove: \[\lim_{\epsilon\rightarrow0}\frac{ \left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - \lambda_{(f\circ g),p}(\epsilon) \right\rVert }{\left\lVert \epsilon \right\rVert} = 0\]

Because of the norm’s positivity, this means we only need to prove that the limit is bounded above by zero, i.e. \[\lim_{\epsilon\rightarrow0}\frac{ \left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - \lambda_{(f\circ g),p}(\epsilon) \right\rVert }{\left\lVert \epsilon \right\rVert} \leq 0\]

Or, by replacing \(\lambda_{(f\circ g),p}\) with the expected result: \[\lim_{\epsilon\rightarrow0}\frac{ \left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - (\lambda_{f,g(p)} \circ \lambda_{g,p})(\epsilon) \right\rVert }{\left\lVert \epsilon \right\rVert} \leq 0\]

Now, by our previous characterization, the differentiability conditions on \(f\) and \(g\) at respectively \(g(p)\) and \(p\) can be expressed as: \[\left(\exists \mu \in \mathcal{C}^0(V \rightarrow \mathbb{R}^n), \lim_{\epsilon'\rightarrow0} \frac{\left\lVert \mu(\epsilon') \right\rVert}{\left\lVert \epsilon' \right\rVert} = 0\right),\ f(g(p)+\epsilon') -f(g(p)) = \lambda_{f,g(p)}(\epsilon') + \mu(\epsilon')\] \[\left(\exists \eta \in \mathcal{C}^0(U \rightarrow \mathbb{R}^d), \lim_{\epsilon\rightarrow0} \frac{\left\lVert \eta(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\right),\ g(p+\epsilon) -g(p) = \lambda_{g,p}(\epsilon) + \eta(\epsilon)\]

Pick \(\epsilon\in\mathbb{R}^m\backslash\{0\}\) small enough that \(p+\epsilon\in U\) (possible, as \(U\) is open), and let \(\epsilon' := g(p+\epsilon) - g(p)\). Observe that as \(\epsilon\) goes to zero, so does \(\epsilon'\): \(g\), being differentiable at \(p\), is in particular continuous there. We can now rewrite the numerator of the limit involved in the previously mentioned expected result as, modulo the norm: \[\begin{aligned} (f\circ g)(p+\epsilon) - (f\circ g)(p) - (\lambda_{f,g(p)} \circ \lambda_{g,p})(\epsilon) &=&& f(g(p+\epsilon)) - f(g(p)) - \lambda_{f,g(p)} ( \lambda_{g,p} (\epsilon)) \\ ~ &=&& f(g(p)+\epsilon') - f(g(p)) - \lambda_{f,g(p)} ( \lambda_{g,p} (\epsilon)) \\ ~ &=&& \lambda_{f,g(p)}(\epsilon') + \mu(\epsilon') - \lambda_{f,g(p)} ( \lambda_{g,p} (\epsilon)) \\ ~ &=&& \lambda_{f,g(p)}\Bigl( \epsilon' - \lambda_{g,p}(\epsilon) \Bigr) + \mu(\epsilon') \text{ (linearity)} \\ ~ &=&& \lambda_{f,g(p)}( \eta(\epsilon)) + \mu(\epsilon') \\ \end{aligned}\]

And so: \[\frac{\left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - (\lambda_{f,g(p)} \circ \lambda_{g,p})(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = \frac{\left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) + \mu(\epsilon') \right\rVert}{\left\lVert \epsilon \right\rVert}\] By the triangular inequality: \[\begin{aligned} \frac{\left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - (\lambda_{f,g(p)} \circ \lambda_{g,p})(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} &\leq&& \frac{\left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert + \left\lVert \mu(\epsilon') \right\rVert }{\left\lVert \epsilon \right\rVert} \\ ~ &\leq&& \frac{\left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert }{\left\lVert \epsilon \right\rVert} + \frac{\left\lVert \mu(\epsilon') \right\rVert}{\left\lVert \epsilon \right\rVert} \end{aligned}\]

To establish our expected result, we need to study this as \(\epsilon\) goes to \(0\): if we can prove that the limit of each term of the right-hand side of the previous inequality is zero, then we can apply the sum rule for limits, and conclude.
Consider first the second term of the right-hand side of the previous inequality. Note that we cannot directly replace \(\left\lVert \epsilon \right\rVert\) by \(\left\lVert \epsilon' \right\rVert\) in the denominator; but for \(\epsilon'\neq0\), we can write: \[\frac{\left\lVert \mu(\epsilon') \right\rVert}{\left\lVert \epsilon \right\rVert} = \frac{\left\lVert \mu(\epsilon') \right\rVert}{\left\lVert \epsilon' \right\rVert} \cdot\frac{\left\lVert \epsilon' \right\rVert}{\left\lVert \epsilon \right\rVert}\] By the differentiability of \(f\) at \(g(p)\), and as \(\epsilon'\) goes to zero with \(\epsilon\), the first factor goes to zero. The second factor is bounded for \(\epsilon\) small enough: \[\frac{\left\lVert \epsilon' \right\rVert}{\left\lVert \epsilon \right\rVert} = \frac{\left\lVert \lambda_{g,p}(\epsilon)+\eta(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} \leq \frac{\left\lVert \lambda_{g,p}(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} + \frac{\left\lVert \eta(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} \leq \left\lVert \lambda_{g,p} \right\rVert + 1\] The first bound comes from the definition of the operator norm (\(\lambda_{g,p}\) is bounded, being a linear map on a finite-dimensional space); the second from \(\left\lVert \eta(\epsilon) \right\rVert/\left\lVert \epsilon \right\rVert\) going to zero, hence being eventually smaller than \(1\). Finally, if \(\epsilon'=0\), then \(\mu(\epsilon')=\mu(0)=0\), and the term vanishes anyway. In all cases: \[\lim_{\epsilon\rightarrow0} \frac{\left\lVert \mu(\epsilon') \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\]

Consider now the first term. We’re interested in studying: \[\lim_{\epsilon\rightarrow0} \frac{ \left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert }{\left\lVert \epsilon \right\rVert}\]

By absolute homogeneity of the norm, and the fact that \(\left\lvert \left\lVert \epsilon \right\rVert \right\rvert = \left\lVert \epsilon \right\rVert\) as the norm is positive: \[\frac{ \left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert }{\left\lVert \epsilon \right\rVert} = \left\lvert \frac{1}{\left\lVert \epsilon \right\rVert} \right\rvert \left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert = \left\lVert \frac{1}{\left\lVert \epsilon \right\rVert} \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert = \left\lVert \lambda_{f,g(p)}\left( \frac{1}{\left\lVert \epsilon \right\rVert}\eta(\epsilon)\right) \right\rVert\] Where the last step relies on the linearity of the \(\lambda\) functions. Now, we know that: \[\lim_{\epsilon\rightarrow0} \frac{\left\lVert \eta(\epsilon) \right\rVert}{\left\lVert \epsilon \right\rVert} = 0\] That is, \(\left\lVert \eta(\epsilon)/\left\lVert \epsilon \right\rVert \right\rVert\) converges to zero as \(\epsilon\) goes to zero. But in a normed space, this is exactly saying that the vector itself converges to the zero vector: \[\lim_{\epsilon\rightarrow0}\left( \frac{1}{\left\lVert \epsilon \right\rVert}\eta(\epsilon) \right) = 0\]

But then, because both norms and linear functions are continuous, we can "swap the limits": \[\begin{aligned} \lim_{\epsilon\rightarrow0} \frac{ \left\lVert \lambda_{f,g(p)}( \eta(\epsilon)) \right\rVert }{\left\lVert \epsilon \right\rVert} &=&& \lim_{\epsilon\rightarrow0} \left\lVert \lambda_{f,g(p)}\left( \frac{1}{\left\lVert \epsilon \right\rVert}\eta(\epsilon)\right) \right\rVert \\ ~ &=&& \left\lVert \lim_{\epsilon\rightarrow0}\left( \lambda_{f,g(p)}\left( \frac{1}{\left\lVert \epsilon \right\rVert}\eta(\epsilon)\right) \right) \right\rVert \\ ~ &=&& \left\lVert \lambda_{f,g(p)}\left( \lim_{\epsilon\rightarrow0}\left( \frac{1}{\left\lVert \epsilon \right\rVert}\eta(\epsilon) \right) \right) \right\rVert \\ ~ &=&& \left\lVert \lambda_{f,g(p)}(0) \right\rVert \\ ~ &=&& \left\lVert 0 \right\rVert \text{ (linear maps send $0$ to $0$) } \\ ~ &=&& 0 \text{ ($\left\lVert 0_V \right\rVert = 0$) } \end{aligned}\]

And so indeed, our expected inequality holds: \[\lim_{\epsilon\rightarrow0}\frac{ \left\lVert (f\circ g)(p+\epsilon) - (f\circ g)(p) - (\lambda_{f,g(p)} \circ \lambda_{g,p})(\epsilon) \right\rVert }{\left\lVert \epsilon \right\rVert} \leq 0\]

By a previous result, we know that such local linear approximations are unique: we’ve found one, and it’s thus the one. This concludes the proof. ◻
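The result can be checked numerically on a small, arbitrarily chosen example: in coordinates, the composition \(\lambda_{f,g(p)} \circ \lambda_{g,p}\) is represented by the product of the Jacobian matrices, which should match a finite-difference estimate of the Jacobian of \(f\circ g\) at \(p\). A sketch with NumPy:

```python
import numpy as np

# g : R^2 -> R^2 and f : R^2 -> R^2, arbitrary smooth functions.
def g(v):
    x, y = v
    return np.array([x * y, x + y**2])

def f(v):
    u, w = v
    return np.array([np.sin(u) + w, u * w])

def Jg(v):
    x, y = v
    return np.array([[y, x],
                     [1., 2*y]])

def Jf(v):
    u, w = v
    return np.array([[np.cos(u), 1.],
                     [w, u]])

def J_numeric(h, p, delta=1e-6):
    # Central finite-difference estimate of the Jacobian of h at p.
    n = len(p)
    cols = []
    for i in range(n):
        e = np.zeros(n); e[i] = delta
        cols.append((h(p + e) - h(p - e)) / (2 * delta))
    return np.stack(cols, axis=1)

p = np.array([0.5, -1.2])
chain = Jf(g(p)) @ Jg(p)                     # lambda_{f,g(p)} o lambda_{g,p}
estimate = J_numeric(lambda v: f(g(v)), p)   # direct estimate for f o g

assert np.allclose(chain, estimate, atol=1e-5)
```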

Rudin, W. 1976. Principles of Mathematical Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill.


By email, at mathieu.bivert chez: