BNfinder: Exact and efficient method for learning Bayesian networks 
In the present section we give a brief exposition of the algorithm implemented in BNFinder and its computational cost for two generally used scoring criteria: Minimal Description Length and BayesianDirichlet equivalence.
For a fuller treatment, including detailed proofs, we refer the reader to [2].
1 Polynomialtime exact algorithm
A Bayesian network (BN) N is a representation of a joint distribution of a set of discrete random variables X={X_{1},…,X_{n}}. The representation consists of two components: a directed acyclic graph G=(X,E) encoding conditional (in)dependencies
 a family θ of conditional distributions P(X_{i}Pa_{i}), where
Pa_{i}={Y∈X(Y,X_{i})∈E}
The problem of learning a BN is understood as follows: given a multiset of Xinstances D={x_{1},…,x_{N}} find a network graph G that best matches D. The notion of a good match is formalized by means of a scoring function S(G:D) having positive values and minimized for the best matching network. Thus the point is to find a directed acyclic graph G with the set of vertices X minimizing S(G:D).
The BNFinder program is devoted to the case when there is no need to examine the acyclicity of the graph, for example:
 When dealing with dynamic Bayesian networks. A dynamic BN describes stochastic evolution of a set of random variables over discretized time. Therefore conditional distributions refer to random variables in neighboring time points. The acyclicity constraint is relaxed, because the ”unrolled” graph (with a copy of each variable in each time point) is always acyclic (see [3] for more details). The following considerations apply to dynamic BNs as well.
 In case of static Bayesian Networks, the user has to supply the algorithm with a partial ordering of the vertices, restricting the set of possible edges only to the ones consistent with the ordering. BNFinder lets the user to divide the set of variables into an ordered set of disjoint subsets of variables, where edges can only exist between variables from different subsets and they have to be consistent with the ordering. If such ordering is not known beforehand, one can try to run BNFinder with different orderings and choose a network with the best overall score.
Assumption 1
S(G:D) = Σ_{i=1}^{n} s(X_{i},Pa_{i}:D_{{Xi}∪Pai}), where D_{Y} denotes the restriction of D to the values of the members of Y⊆X.
When there is no need to examine the acyclicity of the graph, this assumption allows to compute the parents set of each variable independently. Thus the point is to find Pa_{i} minimizing s(X_{i},Pa_{i}:D_{{Xi}∪Pai}) for each i.
Let us fix a dataset D and a random variable X. We denote by X' the set of potential parents of X (possibly smaller than X due to given constraints on the structure of the network). To simplify the notation we continue to write s(Pa) for s(X,Pa:D_{{X}∪Pa}).
The following assumption expresses the fact that scoring functions decompose into 2 components: g penalizing the complexity of a network and d evaluating the possibility of explaining data by a network.
This assumption is used in the following algorithm to avoid considering networks with inadequately large component g.
Algorithm 1

In the above algorithm choosing according to g(P) means choosing increasingly with respect to the value of the component g of the local score.
Theorem 1
Suppose that the scoring function satisfies Assumptions 12. Then Algorithm 1 applied to each random variable finds an optimal network.
A disadvantage of the above algorithm is that finding a proper subset P⊆X' involves computing g(P') for all ⊆successors P' of previously chosen subsets. It may be avoided when a further assumption is imposed.
The above assumption suggests the notation g(Pa)=g(Pa). The following algorithm uses the uniformity of g to reduce the number of computations of the component g.
Algorithm 2

Theorem 2
Suppose that the scoring function satisfies Assumptions 13. Then Algorithm 2 applied to each random variable finds an optimal network.
2 Minimal Description Length
The Minimal Description Length (MDL) scoring criterion originates from information theory [5]. A network N is viewed here as a model of compression of a dataset D. The optimal model minimizes the total length of the description, i.e. the sum of the description length of the model and of the compressed data. MDL is effectively equivalent to Bayesian Information Criterion (BIC) (see [6]), which approximates Bayesian scores (see next section) and is also applicable to continuous data.Let us fix a dataset D={x_{1},…,x_{N}} and a random variable X. Recall the decomposition s(Pa)=g(Pa)+d(Pa) of the local score for X. In the MDL score g(Pa) stands for the length of the description of the local part of the network (i.e. the edges ingoing to X and the conditional distribution P(XPa)) and d(Pa) is the length of the compressed version of Xvalues in D.
Let k_{Y} denote the cardinality of the set V_{Y} of possible values of the random variable Y∈X. Thus we have
g(Pa)=Palogn+ 

(k_{X}−1) 

k_{Y} 
However, for many applications we may assume that all the random variables attain values from the same set V of cardinality k. In this case we obtain the formula
g(Pa)=Palogn+ 

(k−1)k^{Pa} 
Compression with respect to the network model is understood as follows: when encoding the Xvalues, the values of Painstances are assumed to be known. Thus the optimal encoding length is given by
d(Pa)=N⋅ H(XPa)
where H(XPa)=−Σ_{v∈V}Σ_{v∈VPa}P(v,v)logP(vv) is the conditional entropy of X given Pa (the distributions are estimated from D).Since all the assumptions from the previous section are satisfied, Algorithm 2 may be applied to learn the optimal network. Let us turn to the analysis of its complexity.
Theorem 3
The worstcase time complexity of Algorithm 2 for the MDL score is O(n^{logk N}Nlog_{k} N).
3 BayesianDirichlet equivalence
The BayesianDirichlet equivalence (BDe) scoring criterion originates from Bayesian statistics [1]. Given a dataset D the optimal network structure G maximizes the posterior conditional probability P(GD). We haveP(GD)∝ P(G)P(DG)=P(G  )  ∫P(DG,θ)P(θG)dθ 
Heckerman et al. [4], following Cooper and Herskovits [1], identified a set of independence assumptions making possible decomposition of the integral in the above formula into a product over X. Under this condition, together with a similar one regarding decomposition of P(G), the scoring criterion
S(G:D)=−logP(G)−logP(DG)
obtained by taking −log of the above term satisfies Assumption 1.
Moreover, the decomposition s(Pa)=g(Pa)+d(Pa) of the local scores appears as well, with the components g and d derived from −logP(G) and −logP(DG), respectively.The distribution P((X,E))∝α^{E} with a penalty parameter 0<α<1 in general is used as a prior over the network structures. This choice results in the function
g(Pa)=Palogα^{−1}
satisfying Assumptions 2 and 3.However, it should be noticed that there are also used priors which satisfy neither Assumption 2 nor 3, e.g. P(G)∝α^{Δ(G,G0)}, where Δ(G,G_{0}) is the cardinality of the symmetric difference between the sets of edges in G and in the prior network G_{0}.
The Dirichlet distribution is generally used as a prior over the conditional distributions' parameters. It yields
d(Pa)=log  ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ 




⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ 
Setting all the hyperparameters to 1 yields
d(Pa)=log  ⎛ ⎜ ⎜ ⎝ 




⎞ ⎟ ⎟ ⎠ 
= 
= 

⎛ ⎜ ⎜ ⎝ 
log( k−1+v∈VΣN_{v,v})!−log  (k−1)! − 

logN_{v,v}!  ⎞ ⎟ ⎟ ⎠ 
The following result allows to refine the decomposition of the local score into the sum of the components g and d.
Proposition 1
Define d_{min}=Σ_{v∈V}(log(k−1+N_{v})!−log(k−1)!−logN_{v}!), where N_{v} denotes the number of samples in D with X=v.
Then d(Pa)≥ d_{min} for each Pa∈X.
By the above proposition, the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) with the components g'(Pa)=g(Pa)+d_{min} and d'(Pa)=d(Pa)−d_{min} satisfies all the assumptions required by Algorithm 2. Let us turn to the analysis of its complexity.
Theorem 4
The worstcase time complexity of Algorithm 2 for the BDe score with the decomposition of the local score given by s(Pa)=g'(Pa)+d'(Pa) is O(n^{Nlogα−1k}N^{2}log_{α−1}k).
4 Continuous variables
Both MDL and BDe scores were originally designed for discrete variables. In order to avoid arbitrary discretization of continuous data we adapted them to deal with continuous variables directly. In case of MDL our approach is essentially congruent to BIC, but because we choose specific mixture of Gaussian distributions, the method is also applicable to BDe. Moreover, it can be applied to heterogenous data sets joining together discrete and continuous variables.The distribution of each continuous variable X is assumed to be a mixture of two normal distributions. The mixture components are considered to be the two possible values (low and high) of a related hidden discrete variable X' and X is viewed as its observable reflection. Therefore regulatory relationships are learned for discrete variables rather than continuous ones.
This approach yields the following form of conditional distributions on continuous variables:
P(XPa  )= 


P(XX'=v)P(X'=vPa'=v)P(Pa'=vPa)= 
= 

⎛ ⎜ ⎜ ⎝ 

P(X'=vPa'=v)P(Pa'=vPa)  ⎞ ⎟ ⎟ ⎠ 
P(XX'=v) 
In a preprocessing step parameters of P(XX') are estimated separately for each variable X. Estimation is based on data clustering with the kmeans algorithm (k=2, cutting the set of variable values in the median yields initial clusters). The parameters enable us to calculate P(XX') as well as P(X'X), and consequently P(Pa'Pa). Thus the space of possible conditional distributions on continuous variables forms a family of Gaussian mixtures, parameterized by P(X'=vPa'=v), the only free parameters in P(XPa) and at the same time the parameters of conditional distributions on corresponding discrete variables.
From a technical point of view, BNFinder learns optimal network structures for these discrete variables, using scoring functions adapted to handle distributions on variable values instead of their determined values (expected values of original scores are computed). For continuous variables it gives optimal Bayesian networks from among all networks with conditional probability distributions belonging to the family of Gaussian mixtures defined above.
References
 [1]

Gregory F. Cooper and Edward Herskovits.
A Bayesian method for the induction of probabilistic networks from
data.
Machine Learning, 9:309–347, 1992.
 [2]

Norbert Dojer.
Learning Bayesian Networks Does Not Have to Be NPHard.
In Rastislav Královic and PaweUrzyczyn, editors, Proceedings of Mathematical Foundations of Computer Science 2006, pages
305–314. SpringerVerlag, 2006.
LNCS 4162.
 [3]

N Friedman, K Murphy, and S Russell.
Learning the structure of dynamic probabilistic networks.
In G F Cooper and S Moral, editors, Proceedings of the
Fourteenth Conference on Uncertainty in Artificial Inteligence, pages
139–147, 1998.
 [4]

David Heckerman, Dan Geiger, and David Maxwell Chickering.
Learning Bayesian networks: The combination of knowledge and
statistical data.
Machine Learning, 20(3):197–243, Sep. 1995.
 [5]

W. Lam and F. Bacchus.
Learning Bayesian belief networks: An approach based on the MDL
principle.
Computational Intelligence, 10(3):269–293, 1994.
 [6]
 Richard E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2003.