Problem Description and Dataset :
https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
In [1]:
import pandas as pd
import numpy as np
In [37]:
df_f = pd.read_csv("/home/gogol/mypy/competition/hapt/Train/X_train.txt", sep = "\s+", header = None)
df_y = pd.read_csv("/home/gogol/mypy/competition/hapt/Train/y_train.txt", sep = "\s+", header = None)
c = pd.read_csv("/home/gogol/mypy/competition/hapt/features.txt", header = None)
num_rows = df_f.shape[0]
num_columns = df_f.shape[1]
In [68]:
c = list(c[0])
df_f.columns = c
df_f.shape
Out[68]:
(7767, 561)
This dataset has 561 columns, hence dimensionality reduction is necessary.
The problem is a multiclass classification problem. If we use tree- or ensemble-based models on the raw features, they might perform poorly,
because the data density is diluted when the points are spread across a high-dimensional space.
In addition, such a model requires considerable time, memory and hardware resources.
Before assuming the distribution of data, let us use Singular Value Decomposition on the dataset.
Properties of Singular Value decomposition:
- It decomposes an $[m \times n]$ matrix into three matrices ($U$, $S$, $V$), each with its own properties, based on two equations:
(a) $A^T A = V \, S^T S \, V^T$
(b) $A \, V = U \, S$
Where, $A$ = input DataSet, $[m \times n]$;
$U$ = Left Singular Matrix, i.e. the row-to-concept matrix; it is an Orthogonal Matrix, $[m \times r]$;
$S$ = singular values for each column in the DataSet; it is a Diagonal Matrix with non-negative entries, $[r \times r]$;
$V$ = Right Singular Vectors, i.e. the column-to-concept matrix; it is an Orthogonal Matrix, $[n \times r]$.
- $U$ maps each row to the latent (concept) values; $S$ denotes the strength of each latent concept; $V$ maps each column to the latent values.
- $U$ and $V$ are unique for every DataSet (up to sign) and are orthonormal,
hence $U^T = U^{-1}$
and $U^T U = I$.
Their columns are orthogonal unit vectors with Euclidean norm $1$.
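A quick numerical check of these properties on a small random matrix (a minimal sketch; the toy matrix M below is illustrative and is not the HAPT data):
In [ ]:
import numpy as np
from numpy.linalg import svd

M = np.random.rand(6, 4)                 # small toy matrix
U_t, S_t, Vt = svd(M, full_matrices=False)  # numpy returns the right singular vectors transposed (Vt)

# U^T U = I and Vt Vt^T = I (columns of U and rows of Vt are orthonormal)
print np.allclose(U_t.T.dot(U_t), np.eye(U_t.shape[1]))
print np.allclose(Vt.dot(Vt.T), np.eye(Vt.shape[0]))

# singular values are non-negative and sorted in descending order
print np.all(S_t >= 0) and np.all(np.diff(S_t) <= 0)

# the factorization reconstructs M: M = U . diag(S) . Vt
print np.allclose(M, U_t.dot(np.diag(S_t)).dot(Vt))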
In [5]:
from numpy.linalg import svd
U, S, V = svd(df_f)  # note: numpy's svd returns the right singular vectors transposed (V here is V^T)
$S$ contains the so-called singular values of the features, sorted in descending order:
In [30]:
x = list()
threshold = 10.0  # change the threshold for singular values here
for i in range(len(S)):
    if S[i] > threshold:
        x.append(S[i])
print S.shape
print 'maximum singular value in S = ', S[0]
print 'minimum singular value in S = ', S[560]
print 'number of columns whose corresponding singular value is greater than the threshold: ', len(x)
print 'singular value of the last retained column = ', S[len(x)-1]
Here, we can choose only those columns whose singular values are greater than the threshold. This will reduce the dimensionality of the DataSet to a certain extent.
In [28]:
df_curated_svd = df_f.iloc[:,:len(x)]
In [29]:
df_curated_svd.shape
Out[29]:
(7767, 141)
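Note that keeping the first len(x) raw columns is not the same as a truncated-SVD projection: the standard reduction projects the data onto the top-k right singular vectors. A minimal sketch of that alternative, reusing U, S, V from the cell above (the name df_curated_svd_proj is illustrative):
In [ ]:
k = len(x)
# numpy's svd returns the right singular vectors already transposed,
# so the first k rows of V are the top-k right singular vectors.
df_curated_svd_proj = pd.DataFrame(df_f.values.dot(V[:k, :].T))
print df_curated_svd_proj.shape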
Now, we check whether the data points are normally distributed. If yes, it makes our life a lot easier: many assumptions about the data can then be made.
We randomly sample one column out of all the columns and apply the Shapiro-Wilk test for normality along with scipy's normaltest.
Then, we find out the percentage of columns which are normally distributed.
--> How the Shapiro-Wilk test works: -
A Q-Q plot is made. The Y-axis contains the sorted data points from the data array; the X-axis contains the corresponding ideal values expected if the dataset were normally distributed with the same mean and standard deviation. If the plot is a straight line through the origin, then the dataset is normally distributed.
The Shapiro-Wilk test for normality is a formal test with null hypothesis H: the sample is drawn from a normally distributed population.
Type of data: univariate, continuous.
Now, the statistic W is:
numerator = the squared slope of the data points versus the ideal data points, which is an estimate of the variance.
denominator = the sum of squared, normalized deviations of the data points (with the same mean and standard deviation as the dataset), which is also an estimate of the variance.
So, if H is true, W $\approx$ 1.
If W is noticeably smaller than 1, then H is rejected; the data may be significantly different from normal. In that case, we consider the p-values for further analysis.
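A small illustration of this behaviour on synthetic samples (a sketch; the normal and uniform samples below are made up purely for demonstration):
In [ ]:
import numpy as np
from scipy.stats import shapiro

np.random.seed(0)
normal_sample = np.random.normal(0.0, 1.0, 500)    # drawn from a normal distribution
uniform_sample = np.random.uniform(0.0, 1.0, 500)  # clearly non-normal

# W close to 1 and a large p-value for the normal sample;
# a smaller W and a tiny p-value for the uniform sample.
print shapiro(normal_sample)
print shapiro(uniform_sample)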
In [146]:
import scipy.stats as sc
from random import randint
feature_normalcy_test = randint(0, num_columns - 1)  # randint is inclusive on both ends, so the last valid index is num_columns - 1
shapiro_results = sc.shapiro(df_f[df_f.columns[feature_normalcy_test]])
normal_test_results = sc.mstats.normaltest(df_f[df_f.columns[feature_normalcy_test]])
print 'for column ', df_f.columns[feature_normalcy_test], 'the test statistics are listed below : '
print 'shapiro_results : ', shapiro_results
print normal_test_results
pos = 0
neg = 0
for i in range(num_columns):
    swr = sc.shapiro(df_f[df_f.columns[i]])
    #normal_test_results = sc.mstats.normaltest(df_f[df_f.columns[i]])
    swr = swr[0]  # keep only the W statistic
    if swr > 0.7 and swr < 1.01:
        pos = pos + 1
    else:
        neg = neg + 1
print (float(pos)/561.0)*100, 'percent of columns are normally distributed'
print (float(neg)/561.0)*100, 'percent of columns are NOT normally distributed'
# A not-so-strict threshold is used for the normality test,
# because for Principal Component Analysis the data being normally distributed is not a very critical assumption.
So, we can consider df_f to be approximately normally distributed.
Now, we can apply Principal Component Analysis to it in order to reduce the dimensionality.
--> How PCA works: -
PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components. It finds a coordinate system (an eigenvector basis) that maximizes the variance explained in the data.
It uses the theme of rotating a vector by multiplying it with a transformation matrix and, ultimately, finding its Eigen Vector(s).
For an Eigen Vector $v$ and transformation matrix $A$ the following holds true:
$A v = \lambda v$
Where $\lambda$ is the corresponding Eigen Value, a scalar.
In other words, for every linear transformation $T(v)$ there exist Eigen Vectors $v$ which, even upon multiplication by the transformation matrix $A$, do not change direction but only get scaled by the factor $\lambda$.
Steps:
(1) Center the data: $x - \mathrm{mean}(x)$
(2) Compute the covariance matrix.
The covariance matrix determines the following factors:
(a) do features x1 and x2 increase together? (b) does x2 decrease while x1 increases, or vice versa?
(3) We transform a vector towards the direction of maximum variance of the data. Iteratively, we multiply a column vector by the covariance matrix to slowly turn it towards the spread of the data. With every iteration the turn gets smaller and smaller, until the vector converges to the direction of greatest variance and stops turning; from then on each transformation only scales it. This is how we find the Eigen Vector and its Eigen Value (see the sketch after this list).
(4) The Principal Component is the Eigen Vector with the highest Eigen Value.
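A minimal sketch of the iterative idea in step (3), i.e. power iteration on the covariance matrix (the toy data X below is illustrative):
In [ ]:
import numpy as np

np.random.seed(1)
X = np.random.randn(200, 5)              # toy data: 200 rows, 5 features
Xc = X - X.mean(axis=0)                  # step (1): center the data
C = np.cov(Xc, rowvar=False)             # step (2): covariance matrix

v = np.random.randn(C.shape[0])          # start from a random direction
for _ in range(100):                     # step (3): repeatedly apply C and renormalize
    v = C.dot(v)
    v = v / np.linalg.norm(v)

lam = v.dot(C).dot(v)                    # Rayleigh quotient gives the corresponding Eigen Value
print lam                                # step (4): v is (approximately) the first principal component
print v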
Hence, we check the Eigen Values and Eigen Vectors of the covariance matrix for a preliminary analysis.
All values are standardized: (datum - mean) / std.
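The standardization step is not shown explicitly in the notebook; a minimal column-wise sketch would be (assuming df_f is the raw feature DataFrame):
In [ ]:
# standardize each column: (datum - column mean) / column std
df_standardized = (df_f - df_f.mean()) / df_f.std()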
In [152]:
from scipy.linalg import eig, eigvals
E = eig(df_f.cov())
EV = E[0]  # keep only the Eigen Values
EV = list(EV)
EV.sort(key=lambda v: v.real, reverse=True)  # eig does not guarantee ordering; sort descending
In [153]:
print 'Highest Eigen Value = ', EV[0].real
print 'Lowest Eigen Value = ', EV[-1].real
In order to choose the number of components for PCA, we take the cumulative sum of all Eigen Values:
In [178]:
ev_sum = 0
ev_rel = list()
ev_rel_percentage = list()
for i in EV:
    ev_sum = ev_sum + i
for i in range(len(EV)):
    ev_rel.append(EV[i].real/ev_sum.real)
    ev_rel_percentage.append(ev_rel[i]*100)
su = 0.0
for i in range(len(ev_rel_percentage)):
    su = su + ev_rel_percentage[i]
    print su, i
    if su > 96.0:
        break
We can see that about $95$% of the variance is explained by the first 80 components.
In [181]:
from sklearn.decomposition import PCA
my_pca = PCA(n_components = 80)
df_curated_pca = my_pca.fit_transform(df_f)
df_curated_pca = pd.DataFrame(df_curated_pca)
print df_curated_pca.shape
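As a cross-check on the eigenvalue-based choice of 80 components, the fitted PCA object exposes the per-component explained variance directly:
In [ ]:
# fraction of the total variance explained by the 80 retained components
print my_pca.explained_variance_ratio_.sum()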
In [208]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_curated_svd, df_y, test_size = 0.1)
In [ ]:
from sklearn.metrics import accuracy_score
Now, we will use 3 models to comparatively predict human activity, and we have two curated datasets to study.
ML models:
(1) k-Nearest-Neighbor Classifier (Basis Model)
(2) RandomForestClassifier
(3) Extreme Gradient Boosting
Datasets:
(1) df_curated_svd (7767, 141)
(2) df_curated_pca (7767, 80)
--> How k-Nearest-Neighbor Classifier works :
Given a set of categories {c1,c2,...cn}, also called classes, e.g. {"male", "female"}. There is also a learnset LS consisting of labelled instances.
The task of classification consists in assigning a category or class to an arbitrary instance. If the instance o is an element of LS, the label of the instance will be used.
Now, we will look at the case where o is not in LS:
o is compared with all instances of LS. A distance metric is used for comparison. We determine the k closest neighbors of o, i.e. the items with the smallest distances. k is a user defined constant and a positive integer, which is usually small.
The most common class among these k neighbors will be assigned to the instance o. If k = 1, then the object is simply assigned to the class of its single nearest neighbor.
The algorithm for the k-nearest neighbor classifier is among the simplest of all machine learning algorithms. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all the computations are performed when we do the actual classification.
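A from-scratch sketch of this decision rule (illustrative only; the function name knn_predict is made up, and the sklearn implementation is what is actually used in the next cell):
In [ ]:
import numpy as np
from collections import Counter

def knn_predict(o, LS_X, LS_y, k=5):
    # Euclidean distance from o to every labelled instance in the learnset
    distances = np.linalg.norm(LS_X - o, axis=1)
    # indices of the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k neighbours
    return Counter(LS_y[nearest]).most_common(1)[0][0]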
In [186]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='auto',
                           leaf_size=30,
                           metric='minkowski',
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=5,
                           p=2,
                           weights='uniform')
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score
--> How RandomForest works:
There are two levels of randomness in this algorithm:
(1) At the row level: each of the decision trees gets a random sample of the training data (say 10% of the rows), i.e. each tree is trained independently on its own randomly chosen subset of the rows. Because the trees see different rows, they differ from each other in terms of their predictions.
(2) At the column level: the second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of the columns to be sent to each tree; then a randomly selected subset of the columns is sent to each one. So the first decision tree may be trained on columns C1, C2 and C4, the next on C4, C5 and C10, and so on (see the parameter sketch below).
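In sklearn, these two levels of randomness map roughly onto constructor parameters (a sketch; note that sklearn applies the column subsampling per split rather than per tree, and parameter defaults vary across versions):
In [ ]:
from sklearn.ensemble import RandomForestClassifier

rfc_demo = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    bootstrap=True,        # row-level randomness: each tree sees a bootstrap sample of the rows
    max_features='sqrt')   # column-level randomness: a random subset of columns considered at each split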
In [191]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
predictions = rfc.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score
In [209]:
import xgboost as xgb
X_train.columns = range(X_train.shape[1])
X_test.columns = range(X_test.shape[1])
xgc = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
xgc.fit(X_train, y_train)
predictions = xgc.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score