Class CaseStatisticsAnalyzer


  • public class CaseStatisticsAnalyzer
    extends java.lang.Object
    Provides statistical outlier analysis based on non-parametric density estimation, building on:

    Dit-Yan Yeung and C. Chow. Parzen-Window Network Intrusion Detectors. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 4, pages 385–388 vol.4, 2002.

    Extensions are made to exploit dependencies in business processes.
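For orientation, the non-parametric density estimation mentioned above can be illustrated with a plain Parzen-window (kernel) estimate. This is only a minimal sketch under stated assumptions: the class name `ParzenWindowSketch`, the Gaussian kernel, the bandwidth value, and the sample durations are all hypothetical and are not taken from this class's implementation.

```java
/** Minimal Parzen-window (kernel) density estimate; illustrative only. */
final class ParzenWindowSketch {

    /** Standard normal density, used here as the (assumed) kernel. */
    static double gaussianKernel(double u) {
        return Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
    }

    /**
     * Density estimate at x: the average of kernels centred on each sample,
     * scaled by the bandwidth h.
     */
    static double density(double[] samples, double h, double x) {
        if (samples.length == 0 || h <= 0.0) {
            throw new IllegalArgumentException("need samples and a positive bandwidth");
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += gaussianKernel((x - s) / h);
        }
        return sum / (samples.length * h);
    }

    public static void main(String[] args) {
        // Hypothetical observed durations of one activity.
        double[] durations = {4.8, 5.1, 5.0, 5.3, 4.9};
        System.out.println("density at 5.0:  " + density(durations, 0.5, 5.0));
        System.out.println("density at 12.0: " + density(durations, 0.5, 12.0));
    }
}
```

A duration far from all observed samples receives a density close to zero, which is what makes it a candidate outlier.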

    Author:
    Andreas Rogge-Solti
    • Constructor Detail

      • CaseStatisticsAnalyzer

        public CaseStatisticsAnalyzer()
      • CaseStatisticsAnalyzer

        public CaseStatisticsAnalyzer​(StochasticNet stochasticNet,
                                      org.processmining.models.semantics.petrinet.Marking initialMarking,
                                      CaseStatisticsList statistics)
    • Method Detail

      • getOutlierRate

        public double getOutlierRate()
      • setOutlierRate

        public void setOutlierRate​(double outlierRate)
      • setCaseStatistics

        public void setCaseStatistics​(CaseStatisticsList caseStatistics)
      • getMaxActivityCount

        public int getMaxActivityCount()
      • getModelDensities

        public double[] getModelDensities​(ReplayStep x,
                                          org.apache.commons.math3.distribution.RealDistribution assumedErrorDistribution,
                                          double assumedErrorRate)
        Returns the likelihood ratio of the ReplayStep x stemming from an error distribution versus stemming from the original distribution. Assumes that x has exactly one child (for multiple children, a weighted average of the scores could be used).

        We assume an error distribution that can shift the duration of this step and also affects the duration of the next step. We compare the joint probability of the two durations x and y (y is the activity that follows x) in the original model, learned from historical observations, with the distribution that results when an error is added along the line y = -x. The latter is appropriate because, if x contains a measurement error, the error affects the duration of the child in the opposite direction. For example, when the end of x is mistakenly measured too late, the duration of y is affected as well (it is shorter than expected).

        Parameters:
        x - ReplayStep to compute the error score for
        assumedErrorDistribution - the RealDistribution that is assumed as noise in the data for measurement errors
        assumedErrorRate - the rate of error occurrence (must be between 0 inclusive and 1 exclusive)
        Returns:
        densities of the models:
        • index 0 contains the density of p(x,y) in the original model
        • index 1 contains the density of p(x,y) in the error model
        • index 2 contains the weighted ratio for x,y according to the assumed error rate
        • index 3 contains the density of p(x) in the original model
        • index 4 contains the density of p(x) in the error model
        • index 5 contains the weighted ratio for x according to the assumed error rate
        • index 6 contains the density of p(y) in the original model
        • index 7 contains the density of p(y) in the error model
        • index 8 contains the weighted ratio for y according to the assumed error rate
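
The comparison described for getModelDensities can be sketched numerically. The following is a hedged illustration, not the actual implementation: `ErrorModelSketch`, the independent normal durations, the midpoint integration, and in particular the weighted-ratio formula (errorRate * pError) / ((1 - errorRate) * pOriginal) are assumptions made for this example. The error model shifts probability mass along the line y = -x by integrating over a measurement error e.

```java
/** Illustrative comparison of an original joint density with an error model. */
final class ErrorModelSketch {

    static double normalPdf(double v, double mean, double sd) {
        double z = (v - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Hypothetical original model: independent normal durations for x and y. */
    static double originalJoint(double x, double y) {
        return normalPdf(x, 5.0, 1.0) * normalPdf(y, 3.0, 1.0);
    }

    /**
     * Error model: an error e ~ N(0, errorSd) lengthens x by e and shortens y
     * by e, so the observed pair (x, y) stems from the true pair (x - e, y + e).
     * The integral over e is approximated with the midpoint rule.
     */
    static double errorJoint(double x, double y, double errorSd) {
        int n = 2000;
        double lo = -6.0 * errorSd;
        double width = 12.0 * errorSd / n;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double e = lo + (i + 0.5) * width;
            sum += originalJoint(x - e, y + e) * normalPdf(e, 0.0, errorSd);
        }
        return sum * width;
    }

    /** Assumed form of the weighted ratio; values above 1 favour the error model. */
    static double weightedRatio(double pOriginal, double pError, double errorRate) {
        if (errorRate < 0.0 || errorRate >= 1.0) {
            throw new IllegalArgumentException("errorRate must be in [0, 1)");
        }
        return (errorRate * pError) / ((1.0 - errorRate) * pOriginal);
    }

    public static void main(String[] args) {
        // x is 3 units too long and y is 3 units too short: typical of a shifted timestamp.
        double pOrig = originalJoint(8.0, 0.0);
        double pErr = errorJoint(8.0, 0.0, 2.0);
        System.out.println("weighted ratio: " + weightedRatio(pOrig, pErr, 0.01));
    }
}
```

In this toy setting, a pair of durations that deviates in opposite directions yields a ratio above 1 even for a small assumed error rate, whereas a pair at the mean of both durations yields a ratio well below 1.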
        • isOutlierLikelyToBeAnError

          public boolean isOutlierLikelyToBeAnError​(ReplayStep step)
          Let X be this node's random duration variable, with observed value x. Assume that step is an outlier by itself (i.e., p(X = x) is very low compared to the usual values). Let parents be a function that assigns the parents to a random duration variable.

          We compare the conditional probability P(children | X = x) with the marginal probability P(children | parents(X)), in which X is integrated out. If the marginal probability is higher than the probability given X = x, we assume a single (measurement) error in the log. Otherwise, we assume that X fits the subsequent events and is just a regular outlier.

          Parameters:
          step - ReplayStep

          Example:

          U   V    <- parents (if there are more than one, it was a parallel split)
           \ /
            X      <- variable
           / \
          Y   Z    <- children (if there are more than one, the process forked into multiple parallel branches)

          Here, we compute P(Y=y, Z=z | X=x) and compare it with the integral over X of P(Y=y, Z=z, X | U=u, V=v). That is, we compare

                          u   v
                           \ /
            x     with      X      <- and integrate over all the values of X
           / \             / \
          y   z           y   z

          Returns:
          boolean indicating whether the outlier is likely to be an error.
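
The decision rule described for isOutlierLikelyToBeAnError can be illustrated with a toy chain U -> X -> Y (one parent, one child). This is a sketch under assumptions, not the class's implementation: `OutlierErrorSketch`, the linear normal conditional models, and the midpoint integration are hypothetical choices made for the example.

```java
/** Illustrative check: is an outlying X better explained as a measurement error? */
final class OutlierErrorSketch {

    static double normalPdf(double v, double mean, double sd) {
        double z = (v - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }

    /** Hypothetical model P(X = x | U = u): X depends linearly on its parent. */
    static double xGivenParent(double x, double u) {
        return normalPdf(x, 3.0 + 0.5 * u, 1.0);
    }

    /** Hypothetical model P(Y = y | X = x): a longer X tends to precede a longer Y. */
    static double childGivenX(double y, double x) {
        return normalPdf(y, 0.5 * x, 1.0);
    }

    /** Marginal P(Y = y | U = u): integrates X out with the midpoint rule. */
    static double childGivenParent(double y, double u) {
        int n = 2000;
        double mean = 3.0 + 0.5 * u;
        double lo = mean - 8.0;
        double width = 16.0 / n;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double x = lo + (i + 0.5) * width;
            sum += childGivenX(y, x) * xGivenParent(x, u);
        }
        return sum * width;
    }

    /** True if the child is better explained when the observed x is ignored. */
    static boolean likelyError(double x, double y, double u) {
        return childGivenParent(y, u) > childGivenX(y, x);
    }

    public static void main(String[] args) {
        double u = 4.0;
        double x = 9.0; // outlying duration; the model expects about 5 given u = 4
        System.out.println("y = 2.5: likely error? " + likelyError(x, 2.5, u));
        System.out.println("y = 4.5: likely error? " + likelyError(x, 4.5, u));
    }
}
```

When the child's duration looks as if X had taken its usual value, the marginal wins and the outlier is flagged as a probable error; when the child's duration matches the unusual x, the outlier is treated as a regular one.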
        • getPValueOfStepIntegral

          public double getPValueOfStepIntegral​(ReplayStep step)
        • computePValueByApproximateIntegration

          public double computePValueByApproximateIntegration​(ReplayStep step)
        • computePValueByApproximateIntegration

          public double computePValueByApproximateIntegration​(org.apache.commons.math3.distribution.RealDistribution dist,
                                                              double x)
        • getIndividualOutlierSteps

          public java.util.List<ReplayStep> getIndividualOutlierSteps​(CaseStatistics selectedCaseStatistics)
        • getInitialMarking

          public org.processmining.models.semantics.petrinet.Marking getInitialMarking()
        • getLogLikelihoodCutoff

          public java.lang.Double getLogLikelihoodCutoff​(TimedTransition tt)
        • getLogLikelihoodDistribution

          public org.apache.commons.math3.distribution.RealDistribution getLogLikelihoodDistribution​(TimedTransition transition)
        • updateStatistics

          public void updateStatistics​(double outlierRate)
        • updateLikelihoodCutoffs

          public void updateLikelihoodCutoffs()