Different parameters involved in Bloom Classifier

  1. Classifier
    1. Logistic regression
      1. Equal weight to both classes
    2. Decision Tree
      1. max. depth of the tree
      2. Equal weight to both classes
  2. No. of elements used for training the classifiers
  3. Size of the bit array
  4. Method of Indexing from the probability given by the classifier
  5. Dimension of the underlying space

Output parameters

  1. Size of the bit array + Size of the classifier
  2. False positive rate/No. of false positive elements

Experimentation

We have done the following experimentation by varying different parameters:

  1. Varying the no. of elements used to train the classifiers, with same Size of the bit array (No. of elements used to train = No. of elements inserted)
  2. Different dimensional space : 1D and 2D
  3. Varying the size of the bit array with constant no. of elements used to train the classifiers
  4. Different Indexing strategies: I1 = floor(p*S) , I2 = floor((p*C)mod S) , p = prob. of element given by classifier, S = Size of bitarray, C = large constant no.

Some results obtained

Avg. Probability distribution for Logistic Regression : 1D

No. of elements used to train =400, No. of iterations = 150

LR

Avg. Probability distribution for Decision Tree : 1D

No. of elements,m =400, No. of iterations = 150, max depth of decision tree = 10

DT

Observations

  1. Probabilities of elements either present in the set or not are very close in logistic regression(i.e, around 0.5) because of which only 2 and 3 indices are set in bitarray,but for false positive elements also it gives same indices. This results in high false positive rate. So, though the size of the classifier + size of the bit array can be very small but false positive rate is very high.
  2. In decision tree, as for all elements from 0-1000 and 2000-3000 since probability is zero , there is no false positive in 0-1000 and 2000-3000

Results with new indexing scheme for logistic regression

New Index = floor( (p*C)mod S ) Old Index = floor( P*S ) , p = prob. of element, C= large constant no. (here = 100000), S = size of bitarray

No. of false positive elements

New Indexing

DT

Old Indexing

DT

False positive rate

New Indexing

DT

Old Indexing

DT

No. of elements/size of bit array

New Indexing

DT

Old Indexing

DT

False positive rate v/s size of bit array

New Indexing

DT

Old Indexing

DT

False positive rate v/s size of bit array

New Indexing

DT

Old Indexing

DT

False positive rate v/s size of bit array

New Indexing

DT

New Indexing

DT