Assume that a data frame is a vector \(f=(v_0, v_1, \dots, v_{n-1})\). After collecting a finite number of data frames, we want to be able to predict the missing values of a new data frame in which not all values are known. Let \(U \subseteq \{x \mid x \in \mathbb{N}_0 \wedge x < n\}\); then the new data frame \(f'=(w_0,\dots, w_{n-1})\) has the values \(w_j\) for \(j \in U\) missing (undefined).
The objective is to develop a mechanism to predict the undefined values.
Unless the underlying data corresponds to phenomena for which known functions are reasonable models, a data set rarely follows a linear or non-linear closed-form function.
A data set is a set of data frames, and each data frame is a vector of values. A data frame \(f\) is a vector \((v_0,\dots, v_{n-1})\), where \(n\) is the number of data items in a data frame.
For convenience, we define \(D=\{x \mid x \in \mathbb{N}_0 \wedge x < n\}\) as the entire set of dimension indices; each element \(e \in D\) is an index that identifies a specific value in a data frame, or equivalently a specific dimension of the data space.
For each \(i \in D\), \(X_i\) denotes the set of all possible values of \(f[i]\) in a data frame \(f\), where \(f[i]\) is shorthand for the value at position \(i\) of the data frame \(f\), and \(i\) is zero-based.
Also for convenience, define \(o(T)\) as the sorted (ordered) tuple of the elements of \(T\). For example, \(o(\{4,2,5,0\})=(0,2,4,5)\). Formally, \(o(T)\) is the tuple of length \(|T|\) such that \(\forall i \in \{0,\dots,|T|-1\}\,(o(T)[i] \in T)\), \(\forall e \in T\,(\exists i\,(e = o(T)[i]))\), and \(o(T)[i] < o(T)[i+1]\) for all \(0 \le i < |T|-1\).
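As a concrete aside, \(o(T)\) corresponds directly to sorting in most programming languages; a minimal Python sketch, assuming \(T\) is a finite set of comparable values:

```python
def o(T):
    """Return the elements of the set T as a sorted tuple."""
    return tuple(sorted(T))

assert o({4, 2, 5, 0}) == (0, 2, 4, 5)
```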
The entire data space \(S_{o(D)}\) is, therefore, the Cartesian product of the \(X_i\) for \(i\in D\): \(S_{o(D)} = X_0 \times X_1 \times \dots \times X_{n-1} = \prod_{i=0}^{n-1} X_i\).
We can take \(E \subset D\) as a subset of the entire set of dimension indices. Then the data subspace corresponding to \(E\) is \(S_{o(E)}=\prod_{i=0}^{|E|-1} X_{o(E)[i]}\).
A model function maps a data subspace to a single data dimension. Assume the domain subspace index set is \(E\) such that \(E \subset D\); then the data subspace serving as the domain of the function is \(S_{o(E)}\). We designate \(c \in D-E\) as the index of the codomain dimension. The model function is, therefore, \(m : S_{o(E)} \rightarrow X_c\) (written \(m\) rather than \(f\) to avoid clashing with the notation for data frames).
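To make the notation concrete, the following sketch (hypothetical names, assuming numeric values) shows the projection of a data frame onto \(o(E)\) and the shape a model function \(m : S_{o(E)} \rightarrow X_c\) would take:

```python
from typing import Callable, Sequence, Tuple

def project(frame: Sequence[float], E: set) -> Tuple[float, ...]:
    """Project a data frame onto the dimensions in E, ordered by o(E)."""
    return tuple(frame[i] for i in sorted(E))

# A model function m maps a point in the subspace S_o(E) to a value in X_c.
ModelFunction = Callable[[Tuple[float, ...]], float]

frame = (3.0, 1.5, 7.2, 0.4)   # a data frame with n = 4
E = {0, 2}                     # domain dimension indices, E ⊂ D
print(project(frame, E))       # (3.0, 7.2), a point in S_o(E)
```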
The objective is to derive useful model functions based on a finite number of data frames. A useful model function is one that uses the known values of a future data frame to predict the unknown values of that same frame.
When the values at indices \(i \in E\) of data frames are used to determine the value at index \(c\), the data frames, projected onto only the indices \(i \in E \cup \{c\}\), can be seen as points in \(T = S_{o(E\cup \{c\})}\).
These points can then be used to form a mesh surface in the space \(T\). When a new data frame becomes available, its point (missing the value at index \(c\)) is placed on the mesh based on the values at indices \(i \in E\), and the mesh surface is used to project the value of the missing dimension corresponding to index \(c\).
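One way to realize such a mesh, assuming SciPy is available and \(|E| = 2\), is Delaunay-based piecewise-linear interpolation; this is a sketch of the idea rather than a prescribed implementation:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Collected data frames, projected onto E ∪ {c} with E = {0, 1} and c = 2:
# the first two columns are the domain values, `values` holds the codomain.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([0.0, 1.0, 1.0, 2.0])

# A Delaunay triangulation of the points forms the mesh surface in T;
# interpolation on the containing triangle projects the missing value.
mesh = LinearNDInterpolator(points, values)
print(mesh(0.25, 0.5))   # predicted value at index c for the new frame
```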
This approach is prone to overfitting the raw data: any noise in the data set will affect the projection.
The problem is that any “smooth” function embodies an assumption about what the surface should look like. Without such an assumption, error cannot be calculated; and if error cannot be calculated, then there can be no method to minimize the error due to noise.
This method can be summarized as follows:
The approach extends the search radius up to a certain point, then computes a weighted sum of the projected codomain values as the predicted codomain value of the new data frame. The weighting function probably needs to rely on the actual distances between the new data frame and the data frames that surround it.
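A minimal sketch of this weighted-sum step, assuming Euclidean distance, a fixed search radius for simplicity, and an inverse-distance weighting function (one plausible choice; the exact weighting function is left open above):

```python
import math

def predict(known_frames, new_point, radius):
    """Inverse-distance-weighted prediction from frames within the search radius.

    known_frames: list of (domain_point, codomain_value) pairs, where
    domain_point is the frame projected onto o(E).
    """
    weighted_sum, weight_total = 0.0, 0.0
    for point, value in known_frames:
        d = math.dist(point, new_point)
        if d > radius:
            continue                  # outside the current search radius
        if d == 0.0:
            return value              # exact match; no interpolation needed
        w = 1.0 / d                   # weight depends on actual distance
        weighted_sum += w * value
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no data frames within the search radius")
    return weighted_sum / weight_total

# Example: halfway between two frames, the prediction is their average.
print(predict([((0.0, 0.0), 0.0), ((1.0, 0.0), 1.0)], (0.5, 0.0), 2.0))
```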
Technically, the three closest data frames do not need to enclose the new data frame: once a plane is formed, linear extrapolation will find the codomain value of the new data frame.
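For \(|E| = 2\), the plane through the three closest points can be found by solving a small linear system; a sketch with NumPy, assuming the three points are not collinear:

```python
import numpy as np

def plane_predict(p1, p2, p3, new_xy):
    """Fit the plane z = a*x + b*y + d through three (x, y, z) points and
    evaluate it at new_xy, which may lie outside the triangle (extrapolation)."""
    A = np.array([[p[0], p[1], 1.0] for p in (p1, p2, p3)])
    z = np.array([p[2] for p in (p1, p2, p3)])
    a, b, d = np.linalg.solve(A, z)   # fails if the points are collinear
    return a * new_xy[0] + b * new_xy[1] + d

# The new point need not be enclosed by the triangle:
print(plane_predict((0, 0, 0.0), (1, 0, 1.0), (0, 1, 1.0), (2.0, 2.0)))  # 4.0
```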
It is possible that some values in a data frame do not influence the codomain value, so it is important to identify these data frame dimensions. One method to identify the relevance of a dimension is to remove it and see whether the data set still retains its ability to accurately compute codomain values.
In the worst case, the power set of all attributes can be used to explore which attributes are useful.
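A hedged sketch of this search, where the hypothetical `score` stands in for any accuracy measure (such as cross-validated prediction error of the mesh model above); removing a single dimension \(i\) corresponds to comparing `score(E)` with `score(E - {i})`:

```python
from itertools import combinations

def subset_search(D, c, score):
    """Exhaustively score every non-empty subset of D - {c} as a candidate
    domain index set E.  score(E) should measure how accurately the
    codomain value at index c is predicted using only the dimensions in E;
    higher is better."""
    candidates = sorted(D - {c})
    best_E, best_score = None, float("-inf")
    for r in range(1, len(candidates) + 1):
        for E in combinations(candidates, r):
            s = score(set(E))
            if s > best_score:
                best_E, best_score = set(E), s
    return best_E, best_score
```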
Correlation amounts to looking at a subset of the attributes as the domain and trying to find the function/model that maps points from this domain to another attribute.
In a way, the codomain attribute can be seen as just one of the attributes.