First, since we don't know any of the values for the model parameters,
we initialize all of them randomly, so that no particular sequence of
states or transitions is favored from the start. Next, we calculate
$\alpha$, $\beta$, and $\gamma$, and determine some new values. The
first value we compute is the expected number of times we'll be in a
given state $S_i$ at time $t = 1$. This is the value of $\gamma_1(i)$,
and will be the new value of $\pi_i$. Similarly, we can find new values
for $a_{ij}$ and $b_j(k)$. The proof of why we can do this
relies on some heavy-duty statistics, specifically gradient descent
and EM procedures, and is therefore beyond the scope of this
paper. The interested reader should refer to [2]. What's
important about the Baum-Welch Training Method is that we can feed in
observation sequences, and as long as we present enough data, we will
get an HMM that is pretty close to optimal for the given training
sequences. Thus, if the training sequences are characteristic of the
actual distribution of observation sequences in the system we're
modelling, our HMM will be able to tell us what state the system is in
with a high degree of certainty. It must be noted, however, that as
with any gradient descent algorithm, Baum-Welch is not guaranteed to
find the globally optimal parameters, i.e. those of globally minimal
error. Instead, it settles into a local minimum, which hopefully is
not too far from the global minimum.
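
The training loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it handles a single discrete observation sequence, omits the scaling needed for long sequences, and uses the conventional names pi, A, B, alpha, beta, gamma and xi. All function names (forward, backward, baum_welch_step, random_row, train) are hypothetical.

```python
import random

def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state at time t is i)."""
    n = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(n)) * B[j][o]
                      for j in range(n)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state at time t is i)."""
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One re-estimation pass; returns the new parameters and the
    likelihood of obs under the OLD parameters."""
    n, m, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of being in state i at time t, given obs.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of moving i -> j between t and t+1, given obs.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood for j in range(n)] for i in range(n)]
          for t in range(T - 1)]
    # New initial distribution: expected occupancy of each state at t = 1.
    new_pi = gamma[0][:]
    # New transitions: expected i -> j transitions / expected exits from i.
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    # New emissions: expected time in i while seeing k / expected time in i.
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(m)] for i in range(n)]
    return new_pi, new_A, new_B, likelihood

def random_row(size, rng):
    """A random probability vector with strictly positive entries."""
    row = [rng.random() + 0.1 for _ in range(size)]
    total = sum(row)
    return [x / total for x in row]

def train(obs, n_states, n_symbols, iterations=20, seed=0):
    """Start from random parameters and re-estimate repeatedly."""
    rng = random.Random(seed)
    pi = random_row(n_states, rng)
    A = [random_row(n_states, rng) for _ in range(n_states)]
    B = [random_row(n_symbols, rng) for _ in range(n_states)]
    history = []
    for _ in range(iterations):
        pi, A, B, likelihood = baum_welch_step(obs, pi, A, B)
        history.append(likelihood)
    return pi, A, B, history

if __name__ == "__main__":
    observations = [0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1]
    pi, A, B, history = train(observations, n_states=2, n_symbols=2)
    print("likelihood before first and last update:", history[0], history[-1])
```

Each pass can only keep or raise the likelihood of the training sequence, which is why the procedure converges, but different random seeds may settle into different local optima. Running train with several seeds and keeping the result with the highest final likelihood is one common way to reduce that risk.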