
Relate

The Relate API reveals statistical relations within the data.

Description

The relate API provides statistical information about the relationship between a pair of features (columns in tables). It can be used as an analysis tool, for example to explore correlation and possible causation.

API endpoint

/api/v1/_relate

Format:

    {
      "from" : From, 
      "where" : null | Proposition, 
      "relate" : PropositionSet, 
      "orderBy" : null | RelateOrder, 
      "select" : null | Selection, 
      "offset" : null | long, 
      "limit" : null | long
    }
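As a rough illustration of the format above, the following Python sketch assembles a request body and prepares the POST request without sending it. The helper names, the base URL, and the `x-api-key` header are assumptions made for this example, not something this page specifies. Unset optional clauses are left out, as in the example queries on this page.

```python
import json
import urllib.request

# Clause names taken from the Format section above.
RELATE_KEYS = ("from", "where", "relate", "orderBy", "select", "offset", "limit")


def build_relate_body(from_, relate, **optional):
    """Assemble a _relate request body, leaving out unset optional clauses."""
    unknown = set(optional) - set(RELATE_KEYS)
    if unknown:
        raise ValueError(f"unsupported clause(s): {sorted(unknown)}")
    body = {"from": from_, "relate": relate}
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


def build_relate_request(base_url, api_key, body):
    """Create (but do not send) the POST request for the _relate endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/_relate",
        data=json.dumps(body).encode("utf-8"),
        # The authentication header name is an assumption for this sketch.
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )


body = build_relate_body(
    "impressions",
    "product.title",
    where={"click": True},
    select=["related", "lift", "condition", "fs"],
    limit=3,
)
# Placeholder host and key; replace with real values before sending.
request = build_relate_request("https://example.invalid", "YOUR_API_KEY", body)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Keeping body construction separate from the HTTP call makes the query easy to inspect or log before it is sent.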


Example Query

The following query relates the event "click is true" to the features of the product's title field.

POST /api/v1/_relate

    {
       "from" : "impressions",
       "where": {
         "click" : true
       },
       "relate":"product.title",
       "select": ["related", "lift", "condition", "fs"],
       "limit" : 3
    }

Result

    {
      "offset" : 0,
      "total" : 9,
      "hits" : [ {
        "related" : "product.title:acm",
        "lift" : 0.3867066188035049,
        "condition" : "click:true",
        "fs" : {
          "f" : 56,
          "fOnCondition" : 0,
          "fOnNotCondition" : 56,
          "fCondition" : 29,
          "n" : 320
        }
      }, {
        "related" : "product.title:gener",
        "lift" : 0.3867066188035049,
        "condition" : "click:true",
        "fs" : {
          "f" : 56,
          "fOnCondition" : 0,
          "fOnNotCondition" : 56,
          "fCondition" : 29,
          "n" : 320
        }
      }, {
        "related" : "product.title:product",
        "lift" : 0.3867066188035049,
        "condition" : "click:true",
        "fs" : {
          "f" : 56,
          "fOnCondition" : 0,
          "fOnNotCondition" : 56,
          "fCondition" : 29,
          "n" : 320
        }
      } ]
    }

Example Query with field-to-field examination

The following query highlights what kinds of customers click what kinds of products.

POST /api/v1/_relate

    {
       "from" : "impressions",
       "where": {
         "$on" : [
            {"$exists" : ["query", "customer.tags"] },
            {"click":true}
         ]
       },
       "relate":["product.title", "product.tags"],
       "select": ["related", "lift", "condition"],
       "limit" : 5
    }

Result

    {
      "offset" : 0,
      "total" : 576,
      "hits" : [ {
        "related" : "product.tags:laptop",
        "lift" : 2.1500247979453686,
        "condition" : "on(query:laptop, click:true)"
      }, {
        "related" : "product.tags:afford",
        "lift" : 4.568762985853046,
        "condition" : "on(query:cheap, click:true)"
      }, {
        "related" : "product.tags:cover",
        "lift" : 2.960113802323945,
        "condition" : "on(query:cover, click:true)"
      }, {
        "related" : "product.tags:afford",
        "lift" : 3.5630394952311293,
        "condition" : "on(customer.tags:20s, click:true)"
      }, {
        "related" : "product.tags:laptop",
        "lift" : 0.34682686446250044,
        "condition" : "on(query:phone, click:true)"
      } ]
    }
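Since lift is the ratio P(related|condition)/P(related), values above 1 mean the related feature is more likely under the condition, and values below 1 mean it is less likely. The Python sketch below splits the sample hits above on that threshold. The function name `split_by_lift` is made up for this example.

```python
# Hits copied from the sample response above.
hits = [
    {"related": "product.tags:laptop", "lift": 2.1500247979453686,
     "condition": "on(query:laptop, click:true)"},
    {"related": "product.tags:afford", "lift": 4.568762985853046,
     "condition": "on(query:cheap, click:true)"},
    {"related": "product.tags:cover", "lift": 2.960113802323945,
     "condition": "on(query:cover, click:true)"},
    {"related": "product.tags:afford", "lift": 3.5630394952311293,
     "condition": "on(customer.tags:20s, click:true)"},
    {"related": "product.tags:laptop", "lift": 0.34682686446250044,
     "condition": "on(query:phone, click:true)"},
]


def split_by_lift(hits):
    """Separate hits into those made more likely and less likely by the condition."""
    more_likely = [h for h in hits if h["lift"] > 1.0]
    less_likely = [h for h in hits if h["lift"] < 1.0]
    return more_likely, less_likely


more_likely, less_likely = split_by_lift(hits)

# Print the positive associations, strongest first.
for h in sorted(more_likely, key=lambda h: h["lift"], reverse=True):
    print(f'{h["condition"]} -> {h["related"]} (x{h["lift"]:.2f})')
```

In this sample, the only hit with lift below 1 pairs the query "phone" with the tag "laptop": people searching for phones click laptop products less often than average.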

Example Full Response

The following example doesn't have a select clause in order to reveal all available fields.

POST /api/v1/_relate

    {
       "from" : "impressions",
       "where": {
         "click" : true
       },
       "relate":"product.title",
       "limit" : 1
    }

Result

The above query returns its result as a list of statistical information in the following format. In addition to naming the related proposition, the condition, and the probability lift of the related proposition, that is P(related|condition)/P(related), there are 4 additional groups of statistical information:

  1. The 'fs' group contains information about frequencies:
    • 'f' is the frequency of the related proposition
    • 'fOnCondition' is the frequency of the related proposition, when the condition proposition is true
    • 'fOnNotCondition' is the frequency of the related proposition, when the condition proposition is not true
    • 'fCondition' is the frequency of the condition proposition
    • 'n' is the number of samples in the relation.
  2. The 'ps' group contains probability estimates for the two variables
    • 'p' is the probability of the related proposition
    • 'pOnCondition' is the probability of the related proposition, when the condition proposition is true
    • 'pOnNotCondition' is the probability of the related proposition, when the condition proposition is not true
    • 'pCondition' is the probability of the condition proposition
  3. The 'info' group contains information theoretic metrics for the relations
    • 'h' is the information entropy of the related proposition
    • 'mi' is the mutual information between the related proposition and condition proposition. Mutual information describes the strength of the statistical dependency between the related proposition and condition proposition.
    • 'miTrue' is the 'positive' component of the mutual information, of the form p(X&C) log(p(X&C) / (p(X)p(C))), where X is the related proposition and C is the condition proposition. It reflects both a) the strength of the positive statistical relationship between the propositions, and b) how frequent the pattern is. miTrue can be used in anomaly detection to find the strongest and most common positive drivers of an outcome, for example: why do people click?
    • 'miFalse' is the 'negative' component of the mutual information, of the form p(X&!C) log(p(X&!C) / (p(X)p(!C))). In principle, this component can be used to detect negative drivers of an outcome, for example: why don't people click? NOTE: Aito is not great at detecting negative relations. Prefer 'miTrue' with a negated condition, either by using {"click":false} instead of {"click":true}, or by using the not statement: {"$not":{"query":"laptop"}}
  4. The 'relation' group contains the raw statistics of the relation:
    • 'n' is the number of valid samples, that is, samples in which both the related proposition and the condition proposition are defined.
    • 'varFs' has the proposition frequencies. The first value is for condition proposition and second value is for the related proposition
    • 'stateFs' are the relation state frequencies in the following form for condition C and related X:
      • [f(!C & !X), f(C & !X), f(!C & X), f(C & X)]
    • 'mi' is the relation's mutual information I(C;X). In theory, this value should equal the variable mutual information I(X;C) = H(X) - H(X|C), but in practice it can differ slightly, because the implementation approximates it somewhat differently.

The results are:

    {
      "offset" : 0,
      "total" : 9,
      "hits" : [ {
        "related" : "product.title:acm",
        "lift" : 0.3867066188035049,
        "condition" : "click:true",
        "fs" : {
          "f" : 56,
          "fOnCondition" : 0,
          "fOnNotCondition" : 56,
          "fCondition" : 29,
          "n" : 320
        },
        "ps" : {
          "p" : 0.17482176004277328,
          "pOnCondition" : 0.06760473171941853,
          "pOnNotCondition" : 0.1857536976992485,
          "pCondition" : 0.09159164615609011
        },
        "info" : {
          "h" : 0.6686169465118937,
          "mi" : 0.07165648669883304,
          "miTrue" : -0.09266503759801603,
          "miFalse" : 0.16432152429684907
        },
        "relation" : {
          "n" : 320,
          "varFs" : [ 29, 56 ],
          "stateFs" : [ 235, 29, 56, 0 ],
          "mi" : 0.007118664006911302
        }
      } ]
    }
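The field descriptions above can be cross-checked against this sample response. The Python sketch below is a consistency check on these particular numbers, not a statement of how the values are computed internally. It confirms that lift matches pOnCondition / p, that 'stateFs' can be rebuilt from the 'fs' frequencies, and that 'info.mi' is the sum of its two components.

```python
import math

# The single hit from the sample response above.
hit = {
    "related": "product.title:acm",
    "lift": 0.3867066188035049,
    "condition": "click:true",
    "fs": {"f": 56, "fOnCondition": 0, "fOnNotCondition": 56,
           "fCondition": 29, "n": 320},
    "ps": {"p": 0.17482176004277328, "pOnCondition": 0.06760473171941853,
           "pOnNotCondition": 0.1857536976992485,
           "pCondition": 0.09159164615609011},
    "info": {"h": 0.6686169465118937, "mi": 0.07165648669883304,
             "miTrue": -0.09266503759801603, "miFalse": 0.16432152429684907},
    "relation": {"n": 320, "varFs": [29, 56],
                 "stateFs": [235, 29, 56, 0], "mi": 0.007118664006911302},
}
fs, ps, info, relation = hit["fs"], hit["ps"], hit["info"], hit["relation"]

# Lift is P(related|condition) / P(related).
lift_from_ps = ps["pOnCondition"] / ps["p"]

# stateFs is [f(!C & !X), f(C & !X), f(!C & X), f(C & X)], rebuilt from 'fs'.
state_fs_from_fs = [
    fs["n"] - fs["f"] - fs["fCondition"] + fs["fOnCondition"],
    fs["fCondition"] - fs["fOnCondition"],
    fs["fOnNotCondition"],
    fs["fOnCondition"],
]

# The mutual information is the sum of its 'positive' and 'negative' components.
mi_from_components = info["miTrue"] + info["miFalse"]

print(math.isclose(lift_from_ps, hit["lift"], rel_tol=1e-6))
print(state_fs_from_fs == relation["stateFs"])
print(math.isclose(mi_from_components, info["mi"], abs_tol=1e-9))
```

Note that the 'ps' values are estimates rather than raw frequency ratios: 'fOnCondition' is 0 here, yet 'pOnCondition' is above zero.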

Example Anomaly Detection

In the following example, the aim is to find product features that young people prefer. The results are ordered by a mutual information component of the form p(X&C) log(p(X&C)/(p(X)p(C))). This reveals patterns that are both

  1. common, and
  2. strong, as p(X&C)/(p(X)p(C)) is essentially the probability lift.
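To see why such a component favors patterns that are both common and strong, the Python sketch below evaluates the standard pointwise form p(X&C) log2(p(X&C)/(p(X)p(C))) on three made-up patterns. The probabilities are invented for illustration, and the base-2 logarithm and the function name are assumptions of this example.

```python
import math


def mi_true_component(p_xc, p_x, p_c):
    """Joint probability weighted by the log of the probability lift."""
    lift = p_xc / (p_x * p_c)
    return p_xc * math.log2(lift)


# Illustrative (p(X&C), p(X), p(C)) triples, not taken from any dataset.
patterns = {
    "common and strong": (0.20, 0.25, 0.40),  # lift 2
    "rare but strong": (0.01, 0.02, 0.05),    # lift 10
    "common but weak": (0.20, 0.45, 0.40),    # lift about 1.1
}

scores = {name: mi_true_component(*p) for name, p in patterns.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: {scores[name]:.4f}")
```

The rare pattern has the highest lift, yet its low joint probability keeps its score well below that of the common and strong pattern.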

POST /api/v1/_relate

    {
       "from" : "impressions",
       "where": {
         "$on": [
           {"customer.tags" : "20s"},
           {"click":true}
         ]
       },
       "relate":"product",
       "orderBy": "info.miTrue",
       "select": ["related", "lift", "condition", "fs", "info.miTrue"],
       "limit" : 3
    }

Result

In this example dataset, young people prefer affordable products, as the results show:

    {
      "offset" : 0,
      "total" : 118,
      "hits" : [ {
        "related" : "product.tags:afford",
        "lift" : 3.5630394952311293,
        "condition" : "on(customer.tags:20s, click:true)",
        "fs" : {
          "f" : 4,
          "fOnCondition" : 3,
          "fOnNotCondition" : 1,
          "fCondition" : 4,
          "n" : 29
        },
        "info.miTrue" : 0.9332714310396236
      }, {
        "related" : "product.description:afford",
        "lift" : 3.5414575235569234,
        "condition" : "on(customer.tags:20s, click:true)",
        "fs" : {
          "f" : 4,
          "fOnCondition" : 3,
          "fOnNotCondition" : 1,
          "fCondition" : 4,
          "n" : 29
        },
        "info.miTrue" : 0.9308833210143048
      }, {
        "related" : "product:2",
        "lift" : 3.1647547496563564,
        "condition" : "on(customer.tags:20s, click:true)",
        "fs" : {
          "f" : 3,
          "fOnCondition" : 2,
          "fOnNotCondition" : 1,
          "fCondition" : 4,
          "n" : 29
        },
        "info.miTrue" : 0.5399737420457876
      } ]
    }
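In the response above, the selected path "info.miTrue" comes back as a single flat key, while "fs" is returned as a nested object. A small Python helper can read both shapes with the same dotted path. The name `get_field` is hypothetical, and the sketch only relies on the response shape shown in this example.

```python
def get_field(hit, path):
    """Read a selected field from a hit, whether it is returned flat or nested."""
    if path in hit:                   # e.g. "info.miTrue" returned as one key
        return hit[path]
    value = hit
    for part in path.split("."):      # e.g. "fs.fOnCondition" -> hit["fs"]["fOnCondition"]
        value = value[part]
    return value


# First and last hits copied from the sample response above.
hits = [
    {"related": "product.tags:afford", "lift": 3.5630394952311293,
     "condition": "on(customer.tags:20s, click:true)",
     "fs": {"f": 4, "fOnCondition": 3, "fOnNotCondition": 1,
            "fCondition": 4, "n": 29},
     "info.miTrue": 0.9332714310396236},
    {"related": "product:2", "lift": 3.1647547496563564,
     "condition": "on(customer.tags:20s, click:true)",
     "fs": {"f": 3, "fOnCondition": 2, "fOnNotCondition": 1,
            "fCondition": 4, "n": 29},
     "info.miTrue": 0.5399737420457876},
]

for hit in hits:
    print(get_field(hit, "related"),
          get_field(hit, "info.miTrue"),
          get_field(hit, "fs.fOnCondition"))
```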