Select Cheap & Good at the same time- A Probabilistic Approach

Niteshsukhwani
TheCyPhy
Published in
5 min readJun 30, 2020

--

Have you ever been to beaches if no you must go and if yes then stay will be the first thing that came to your mind? We all want a stay which is cheap and near to the beach so we can spend more on time with beer and beach. Both goals of ours are complementary to each other as the hotel near to the beach won’t be cheap and the one which is cheap won’t be near to the beach. So we need to select that hotel from where we can easily reach the beach as well which doesn’t make a big hole in our pocket.

In this article, we will discuss an interesting problem of data science known as the skyline probability with uncertain preferences. In continuation with the above example all such hotels which are not worse then any of the other hotels in both the dimensions (price & distance from the beach) will be considered as the skyline object.

Each point the graph denotes a hotel and x-axis is denotes the price and the y-axis is its distance from the beach so we can see that all the points which are connected by the line are dominating all other points.

A skyline object is that instance in the dataset which is not dominated by any other instance of the dataset and an instance is said to be dominated by any other instance if it is better in at least one of the dimension and equivalent or not worse in any other dimension. It seems to be an easy problem but here we are sure about the nature of the attributes i.e attributes are certain. In skyline computation, we deal with two types of attributes which are.

  • Certain Attributes: Attributes of data that are certain for example in the above case rent and distance both are certain as any rational person will prefer less rent and less distance from the beach.
  • Uncertain Attributes: those attributes in which everyone will have their preference of choice will be considered as uncertain attributes. for example in a movie recommendation scenario, one may prefer comedy movies to action movies while the preference of others may differ. Individual preference may also change from context to context, for example, someone may prefer cartoons when she is with children and may prefer drama when she is with adults. Compilation of all these varying preferences would lead to uncertain preferences, i.e., one value is preferred over the other with a certain probability

In case of uncertain attribute preferences, we can’t say any object is skyline or not but we can compute the skyline probability of each of the instances. so before finding the skyline probability of we make some assumptions which are

  1. All the attributes are independent of each other
  2. Consider an attribute col1 with two distinct values x & y.

Let sky(O) denotes the skyline probability of an instance of the data.

for a data Θ = {Q1, Q2, … , Qn} data set of n, d-dimensional objects, and O be another d-dimensional object such that O does not belong to Θ. For any instance Q(i) in Θ, Q(i, j) represents the jth attribute of the Q(i). instance. Set S be the power set of (2^θ), distinct(S, j), 1 ≤ j ≤ d, represents the set of distinct attribute values for the jth attributes for all the objects in S. We have used the notation distinct(O; S, j) to denote the distinct attribute value in the jth attribute for all the objects in S excluding Oj.

distinct(O; S) = ∪j distinct(O; S, j). For any two distinct value pair (α, β) of an attribute, we use α ⪯ β to denote that α is preferred over β. The probability that an object O is skyline object is equivalent to the probability that the object O is not being dominated by any other object Qi ∈ Θ and is defined as follows.

Let us understand it more clearly with an example

Table: A 2d dataset example

for the sake of simplicity, we assume that

Pr(A1 ≺ A2) = Pr(A2 ≺ A1) = Pr(A1 ≺ A3) = Pr(A3 ≺ A1) = 0.5

Pr(B1 ≺ B2) = Pr(B2 ≺ B1) = Pr(B1 ≺ B3) = Pr(B3 ≺ B1) = 0.5

So now we have to find P(Union(e)) which is defined in the above formula for data with three instance i.e.

Here P(ei) is the prob that Qi dominates O.

It can be seen that the formulation given is exponential as the second term requires to find the union of the events, and the cost requires for the computation of skyline probability is O(d.2^n ).

Conclusion

The skyline probability can have various applications in the area of recommender system and opinion poll mining, but because of its computational limitation we haven’t seen any practical application which is using it currently but the topic is well studied in research. To improve the performance of this algorithm Pujari et al. proposed an algorithm named Usky-base which reduces the complexity until 99 percent. I would suggest reading the paper mentioned in the reference for a better understanding of the problem and its other optimized solutions.

Reference

  • Arun K Pujari, Venkateswara Rao Kagita, Anubhuti Garg, and Vineet Padmanabhan. 2015. Efficient computation for probabilistic skyline
  • Qing Zhang, Pengjie Ye, Xuemin Lin, and Ying Zhang. 2013. Skyline probability over uncertain preferences. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 395–405.
  • Nitesh Sukhwani, Venkateswara Rao Kagita, Vikas Kumar. Efficient computation of top-k skyline objects in data set with uncertain preferences(ICMLDS 2020)

--

--