This whole note is under review. The following text is only a draft.
Machine learning helps us identify patterns in static data, but it is not directly suited to patterns in data that varies over time.
Dynamic time warping (DTW) is a technique to measure the similarity between two time series. A time series is a sequence of data points ordered by time, e.g. the values of a stock in the stock exchange during some period. We could measure the difference between two time series by simply summing up the differences between the series at each time unit, but that would not account for differences in phase.
DTW is a smarter approach that tries to account for phase. Instead of summing the differences between the points at the same time unit, it finds the alignment between the two series that minimizes the total difference, allowing a point in one series to be matched with a nearby (not necessarily simultaneous) point in the other, and sums the differences along that alignment.
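As a rough illustration, here is a minimal DTW sketch in pure Python, assuming 1-D numeric series and absolute difference as the point-wise cost (production code would use an optimized library):

```python
def dtw_distance(a, b):
    """Cost of the best alignment between series a and b (classic DP)."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # A point may be matched to several points of the other
            # series, which is what absorbs differences in phase.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m]

# Two series with the same shape, shifted in phase:
x = [0, 1, 2, 1, 0]
y = [0, 0, 1, 2, 1]
naive = sum(abs(u - v) for u, v in zip(x, y))
print(naive, dtw_distance(x, y))  # naive sum is 4; DTW cost is 1.0
```

The naive per-time-unit sum penalizes the shift heavily, while DTW matches each peak to its shifted counterpart and reports a much smaller cost.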
A Markov process is a statistical model that can be applied to the problem of finding patterns in temporal data. It's modeled as follows. First, we assume time is discrete and represented by the set of positive integers (t = 1, 2, 3, ...). Second, we assume data at time t comes from a discrete random variable Xt with a domain S = {s1, ..., sn}, and the probability of each outcome depends on the past outcomes. Then, we make the Markov assumption: the number of previous outcomes that influence the probability of the current outcome is a fixed positive integer k. If k = 1, we have a first-order Markov process that can be represented as a Bayesian network.
We assume the process is stationary, i.e. there is a single conditional probability table that holds the values of P(Xt | Xt-1) and is used for every random variable in the model. The process is called stationary because the conditional probabilities are the same regardless of the time t.
Finally, we define the initial probabilities of each element of S, i.e. P(X1). These will be used to calculate the probability of an element being in the first position of the sequence.
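Putting the three ingredients together (a domain S, a shared conditional probability table, and initial probabilities), a first-order Markov process can be sketched with plain dictionaries. The weather states and numbers below are hypothetical:

```python
# Hypothetical first-order Markov process over S = {"sun", "rain"}.
states = ["sun", "rain"]

# Initial probabilities P(X1).
initial = {"sun": 0.6, "rain": 0.4}

# A single conditional probability table P(Xt | Xt-1), shared by all t
# (the stationarity assumption).
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

# Sanity check: each row of the table is a full distribution, summing to 1.
for prev in states:
    assert abs(sum(transition[prev].values()) - 1.0) < 1e-9
assert abs(sum(initial.values()) - 1.0) < 1e-9
```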
Markovian nomenclature
A more precise term for what we call here a Markov process would be a stationary discrete-time first-order Markov chain. The "chain" part comes from the assumption of discrete random variables, which makes it possible to represent the problem as a Bayesian network (the "chain").
There are extensions of the Markov process that relax each of the assumptions made here (stationarity, discrete time and discrete random variables). So "Markov process" can also be used as a broader term that encompasses all those variants. Unfortunately, the nomenclature is not standardized in the literature, so you may find the term "Markov process" referring to one specific subtype, as we have done here for simplicity.
Once we define a Markov process, we can calculate the probability of the sequence O = o1, ..., om, where oi ∈ S, being generated by the model. It's given by:

P(O) = P(X1 = o1) × P(X2 = o2 | X1 = o1) × ... × P(Xm = om | Xm-1 = om-1)
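In code, P(O) is just the product of the initial probability of o1 and the transition probability of each consecutive pair. A sketch, using a hypothetical two-state process:

```python
# Hypothetical two-state process, for illustration only.
initial = {"sun": 0.6, "rain": 0.4}
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}

def sequence_probability(seq, initial, transition):
    """P(O) = P(X1 = o1) * P(X2 = o2 | X1 = o1) * ..."""
    p = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= transition[prev][cur]
    return p

print(sequence_probability(["sun", "sun", "rain"], initial, transition))
# 0.6 * 0.8 * 0.2 = 0.096
```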
A classical application of Markov processes is in natural language processing (NLP). For example, we can imagine that words in a sentence are a sequence generated by a Markov process. The conditional probability table of that process could be built based on a training corpus of text, by estimating the conditional probabilities from the frequencies of the words that precede a given word. The number of preceding words we choose to consider defines the order of the Markov process: one preceding word gives a first-order process, two give a second-order process, and so on. In NLP jargon these are called bigram, trigram and, in general, n-gram models (an n-gram model corresponds to a Markov process of order n - 1).
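Estimating such a table from a corpus amounts to counting bigrams and normalizing the counts. A toy sketch follows; the two-sentence corpus is hypothetical, and a real model would need far more data plus smoothing for unseen word pairs:

```python
from collections import Counter, defaultdict

# Hypothetical tiny training corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Count word bigrams, then normalize into P(current | previous).
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, cur in zip(words, words[1:]):
        follow_counts[prev][cur] += 1

cpt = {
    prev: {cur: c / sum(counter.values()) for cur, c in counter.items()}
    for prev, counter in follow_counts.items()
}

print(cpt["sat"])  # {'on': 1.0}: "sat" is always followed by "on" here
print(cpt["the"])  # four successors, each with probability 0.25
```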
In such a model, we can suggest the next word in a text (a common feature of text processors) by answering: "Given the sentence W = w1, w2, ..., wn, which word wn+1 produces the sentence W* = w1, w2, ..., wn, wn+1 with the highest P(W*)?"
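For a first-order model, maximizing P(W*) over wn+1 reduces to maximizing P(wn+1 | wn), since every other factor of the product is fixed by the sentence we already have. A sketch, using a hypothetical hand-built table:

```python
# Hypothetical bigram table P(next word | last word), for illustration.
cpt = {
    "the": {"cat": 0.5, "dog": 0.3, "mat": 0.2},
    "cat": {"sat": 0.7, "ran": 0.3},
}

def suggest_next(sentence, cpt):
    """Return the word w maximizing P(w | last word of sentence)."""
    last = sentence.split()[-1]
    candidates = cpt.get(last)
    if not candidates:
        return None  # the last word was never seen as a predecessor
    # Maximizing P(W*) over the next word is the same as maximizing
    # the single conditional P(w_{n+1} | w_n).
    return max(candidates, key=candidates.get)

print(suggest_next("the cat", cpt))  # "sat"
```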
Hidden Markov Models
Markov processes are only useful when the pattern in the sequence can be directly observed. In our NLP example, we can observe the words of a sentence. But there are interesting problems in which we can't observe the sequence directly, only a related random variable. For example, in sign language recognition, we are usually interested in the words behind the gestures, but all we can see are the gestures themselves. The Hidden Markov Model is an extension of the Markov process that deals with those cases.
We assume the process is stationary, i.e. there is a fixed n × n transition matrix A that holds the conditional probabilities of every random variable Xt in the model, where n is the size of the random variables' domain (denoted as S = {S1, ..., Sn}). The process is called stationary because A is the same regardless of the time t. In mathematical notation:

Aij = P(Xt = Sj | Xt-1 = Si), for every t
And since each row of A holds the full conditional probability distribution of a random variable, its entries must sum to 1:

Ai1 + Ai2 + ... + Ain = 1, for every i
Finally, we define the initial probabilities of each element of S, i.e. P(X1 = Si). These will be used to calculate the probability of an element being in the first position of the sequence.
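In code, A and the initial probabilities are just a row-stochastic matrix and a vector. The numbers below are hypothetical, and note that a complete HMM would also relate each hidden state to the observable variable:

```python
# Hypothetical transition matrix for n = 2 hidden states:
# A[i][j] = P(Xt = S_{j+1} | Xt-1 = S_{i+1})
A = [
    [0.7, 0.3],
    [0.4, 0.6],
]

# Initial probabilities P(X1 = Si).
initial = [0.5, 0.5]

# Each row of A is a full conditional distribution, so it sums to 1;
# the initial probabilities must sum to 1 as well.
for row in A:
    assert abs(sum(row) - 1.0) < 1e-9
assert abs(sum(initial) - 1.0) < 1e-9
print("valid model")
```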