Problem Description and Dataset :
https://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions
In [1]:
import pandas as pd
import numpy as np
In [37]:
df_f = pd.read_csv("/home/gogol/mypy/competition/hapt/Train/X_train.txt", sep = "\s+", header = None)
df_y = pd.read_csv("/home/gogol/mypy/competition/hapt/Train/y_train.txt", sep = "\s+", header = None)
c = pd.read_csv("/home/gogol/mypy/competition/hapt/features.txt", header = None)
num_rows = df_f.shape[0]
num_columns = df_f.shape[1]
In [68]:
c = list(c[0])
df_f.columns = c
df_f.shape
Out[68]:
(7767, 561)
This dataset has 561 columns, hence dimensionality reduction is necessary.
The problem is a multiclass classification problem. If we use tree- or ensemble-based models on the raw features, they might perform poorly,
because the data density is diluted when the points are spread across a high-dimensional space.
In addition, such a model requires considerable time, memory and hardware resources.
Before assuming the distribution of data, let us use Singular Value Decomposition on the dataset.
Properties of Singular Value decomposition:
- It decomposes an $[m \times n]$ matrix into three matrices ($U$, $S$, $V$), each with its own properties, based on two equations:
(a) $A^T A = V \, S^T S \, V^T$
(b) $A \, V = U \, S$
Where, $A$ = input DataSet, $[m \times n]$;
$U$ = Left Singular Matrix, i.e. the row-to-concept matrix; it is an Orthogonal Matrix, $[m \times r]$;
$S$ = singular values for each column in the DataSet; it is a Diagonal Matrix with non-negative entries, $[r \times r]$;
$V$ = Right Singular Vectors, i.e. the column-to-concept matrix; it is an Orthogonal Matrix, $[n \times r]$.
- $U$ maps each row to the latent (concept) values; $S$ denotes the strength of each latent concept; $V$ maps each column to the latent values.
- $U$ and $V$ are unique for every DataSet (up to sign) and are orthonormal,
hence $U^T = U^{-1}$
and $U^T U = I$.
Their columns are orthogonal unit vectors with Euclidean norm $1$.
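A quick numerical check of these properties on a small random matrix (a minimal sketch; the toy matrix M below is illustrative and is not the HAPT data):
In [ ]:
import numpy as np
from numpy.linalg import svd

M = np.random.rand(6, 4)                 # small toy matrix
U_t, S_t, Vt = svd(M, full_matrices=False)  # numpy returns the right singular vectors transposed (Vt)

# U^T U = I and Vt Vt^T = I (columns of U and rows of Vt are orthonormal)
print np.allclose(U_t.T.dot(U_t), np.eye(U_t.shape[1]))
print np.allclose(Vt.dot(Vt.T), np.eye(Vt.shape[0]))

# singular values are non-negative and sorted in descending order
print np.all(S_t >= 0) and np.all(np.diff(S_t) <= 0)

# the factorization reconstructs M: M = U . diag(S) . Vt
print np.allclose(M, U_t.dot(np.diag(S_t)).dot(Vt))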
In [5]:
from numpy.linalg import svd
U, S, V = svd(df_f)  # note: numpy's svd returns the right singular vectors transposed (V here is V^T)
$S$ contains the so-called singular values of the features, sorted in descending order:
In [30]:
x = list()
threshold = 10.0  # change the threshold for singular values here
for i in range(len(S)):
    if S[i] > threshold:
        x.append(S[i])
print S.shape
print 'maximum singular value in S = ', S[0]
print 'minimum singular value in S = ', S[560]
print 'number of columns whose corresponding singular value is greater than the threshold: ', len(x)
print 'singular value of the last retained column = ', S[len(x)-1]
Here, we can choose only those columns whose singular values are greater than the threshold. This will reduce the dimensionality of the DataSet to a certain extent.
In [28]:
df_curated_svd = df_f.iloc[:,:len(x)]
In [29]:
df_curated_svd.shape
Out[29]:
(7767, 141)
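Note that keeping the first len(x) raw columns is not the same as a truncated-SVD projection: the standard reduction projects the data onto the top-k right singular vectors. A minimal sketch of that alternative, reusing U, S, V from the cell above (the name df_curated_svd_proj is illustrative):
In [ ]:
k = len(x)
# numpy's svd returns the right singular vectors already transposed,
# so the first k rows of V are the top-k right singular vectors.
df_curated_svd_proj = pd.DataFrame(df_f.values.dot(V[:k, :].T))
print df_curated_svd_proj.shape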
Now, we check whether the data points are normally distributed. If yes, it makes our life a lot easier: many assumptions about the data can then be made.
We randomly sample one column out of all the columns and apply the Shapiro-Wilk test for normality along with scipy's normaltest.
Then, we find out the percentage of columns which are normally distributed.
--> How the Shapiro-Wilk test works: -
A Q-Q plot is made. The Y-axis contains the sorted data points from the data array; the X-axis contains the corresponding ideal values expected if the dataset were normally distributed with the same mean and standard deviation. If the plot is a straight line through the origin, then the dataset is normally distributed.
The Shapiro-Wilk test for normality is a formal test with null hypothesis H: the sample is drawn from a normally distributed population.
Type of data: univariate, continuous.
Now, the statistic W is:
numerator = the squared slope of the data points versus the ideal data points, which is an estimate of the variance.
denominator = the sum of squared, normalized deviations of the data points (with the same mean and standard deviation as the dataset), which is also an estimate of the variance.
So, if H is true, W $\approx$ 1.
If W is noticeably smaller than 1, then H is rejected; the data may be significantly different from normal. In that case, we consider the p-values for further analysis.
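A small illustration of this behaviour on synthetic samples (a sketch; the normal and uniform samples below are made up purely for demonstration):
In [ ]:
import numpy as np
from scipy.stats import shapiro

np.random.seed(0)
normal_sample = np.random.normal(0.0, 1.0, 500)    # drawn from a normal distribution
uniform_sample = np.random.uniform(0.0, 1.0, 500)  # clearly non-normal

# W close to 1 and a large p-value for the normal sample;
# a smaller W and a tiny p-value for the uniform sample.
print shapiro(normal_sample)
print shapiro(uniform_sample)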
In [146]:
import scipy.stats as sc
from random import randint
feature_normalcy_test = randint(0, num_columns - 1)  # randint is inclusive on both ends, so the last valid index is num_columns - 1
shapiro_results = sc.shapiro(df_f[df_f.columns[feature_normalcy_test]])
normal_test_results = sc.mstats.normaltest(df_f[df_f.columns[feature_normalcy_test]])
print 'for column ', df_f.columns[feature_normalcy_test], 'the test statistics are listed below : '
print 'shapiro_results : ', shapiro_results
print normal_test_results
pos = 0
neg = 0
for i in range(num_columns):
    swr = sc.shapiro(df_f[df_f.columns[i]])
    #normal_test_results = sc.mstats.normaltest(df_f[df_f.columns[i]])
    swr = swr[0]  # keep only the W statistic
    if swr > 0.7 and swr < 1.01:
        pos = pos + 1
    else:
        neg = neg + 1
print (float(pos)/561.0)*100, 'percent of columns are normally distributed'
print (float(neg)/561.0)*100, 'percent of columns are NOT normally distributed'
# A not-so-strict threshold is used for the normality test,
# because for Principal Component Analysis the data being normally distributed is not a very critical assumption.
So, we can consider df_f to be approximately normally distributed.
Now, we can apply Principal Component Analysis to it in order to reduce the dimensionality.
--> How PCA works: -
PCA uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components. It finds a coordinate system (an eigenvector basis) that maximizes the variance explained in the data.
It uses the theme of rotating a vector by multiplying it with a transformation matrix and, ultimately, finding its Eigen Vector(s).
For an Eigen Vector $v$ and transformation matrix $A$ the following holds true:
$A v = \lambda v$
Where $\lambda$ is the corresponding Eigen Value, a scalar.
In other words, for every linear transformation $T(v)$ there exist Eigen Vectors $v$ which, even upon multiplication by the transformation matrix $A$, do not change direction but only get scaled by the factor $\lambda$.
Steps:
(1) Center the data: $x - \mathrm{mean}(x)$
(2) Compute the covariance matrix.
The covariance matrix determines the following factors:
(a) do features x1 and x2 increase together? (b) does x2 decrease while x1 increases, or vice versa?
(3) We transform a vector towards the direction of maximum variance of the data. Iteratively, we multiply a column vector by the covariance matrix to slowly turn it towards the spread of the data. With every iteration the turn gets smaller and smaller, until the vector converges to the direction of greatest variance and stops turning; from then on each transformation only scales it. This is how we find the Eigen Vector and its Eigen Value (see the sketch after this list).
(4) The Principal Component is the Eigen Vector with the highest Eigen Value.
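A minimal sketch of the iterative idea in step (3), i.e. power iteration on the covariance matrix (the toy data X below is illustrative):
In [ ]:
import numpy as np

np.random.seed(1)
X = np.random.randn(200, 5)              # toy data: 200 rows, 5 features
Xc = X - X.mean(axis=0)                  # step (1): center the data
C = np.cov(Xc, rowvar=False)             # step (2): covariance matrix

v = np.random.randn(C.shape[0])          # start from a random direction
for _ in range(100):                     # step (3): repeatedly apply C and renormalize
    v = C.dot(v)
    v = v / np.linalg.norm(v)

lam = v.dot(C).dot(v)                    # Rayleigh quotient gives the corresponding Eigen Value
print lam                                # step (4): v is (approximately) the first principal component
print v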
Hence, we check the Eigen Values and Eigen Vectors of the covariance matrix for a preliminary analysis.
All values are standardized: (datum - mean) / std.
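The standardization step is not shown explicitly in the notebook; a minimal column-wise sketch would be (assuming df_f is the raw feature DataFrame):
In [ ]:
# standardize each column: (datum - column mean) / column std
df_standardized = (df_f - df_f.mean()) / df_f.std()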
In [152]:
from scipy.linalg import eig, eigvals
E = eig(df_f.cov())
EV = E[0]  # keep only the Eigen Values
EV = list(EV)
EV.sort(key=lambda v: v.real, reverse=True)  # eig does not guarantee ordering; sort descending
In [153]:
print 'Highest Eigen Value = ', EV[0].real
print 'Lowest Eigen Value = ', EV[-1].real
In order to choose the number of components for PCA, we take the cumulative sum of all Eigen Values:
In [178]:
ev_sum = 0
ev_rel = list()
ev_rel_percentage = list()
for i in EV:
    ev_sum = ev_sum + i
for i in range(len(EV)):
    ev_rel.append(EV[i].real/ev_sum.real)
    ev_rel_percentage.append(ev_rel[i]*100)
su = 0.0
for i in range(len(ev_rel_percentage)):
    su = su + ev_rel_percentage[i]
    print su, i
    if su > 96.0:
        break
We can see that about $95$% of the variance is explained by the first 80 components.
In [181]:
from sklearn.decomposition import PCA
my_pca = PCA(n_components = 80)
df_curated_pca = my_pca.fit_transform(df_f)
df_curated_pca = pd.DataFrame(df_curated_pca)
print df_curated_pca.shape
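As a cross-check on the eigenvalue-based choice of 80 components, the fitted PCA object exposes the per-component explained variance directly:
In [ ]:
# fraction of the total variance explained by the 80 retained components
print my_pca.explained_variance_ratio_.sum()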
In [208]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_curated_svd, df_y, test_size = 0.1)
In [ ]:
from sklearn.metrics import accuracy_score
Now, we will use 3 models to comparatively predict human activity, and we have two curated datasets to study.
ML models:
(1) k-Nearest-Neighbor Classifier (Basis Model)
(2) RandomForestClassifier
(3) Extreme Gradient Boosting
Datasets:
(1) df_curated_svd (7767, 141)
(2) df_curated_pca (7767, 80)
--> How k-Nearest-Neighbor Classifier works :
Given a set of categories {c1,c2,...cn}, also called classes, e.g. {"male", "female"}. There is also a learnset LS consisting of labelled instances.
The task of classification consists in assigning a category or class to an arbitrary instance. If the instance o is an element of LS, the label of the instance will be used.
Now, we will look at the case where o is not in LS:
o is compared with all instances of LS. A distance metric is used for comparison. We determine the k closest neighbors of o, i.e. the items with the smallest distances. k is a user defined constant and a positive integer, which is usually small.
The most common class among these k neighbors will be assigned to the instance o. If k = 1, then the object is simply assigned to the class of its single nearest neighbor.
The algorithm for the k-nearest neighbor classifier is among the simplest of all machine learning algorithms. k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all the computations are performed when we do the actual classification.
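A from-scratch sketch of this decision rule (illustrative only; the function name knn_predict is made up, and the sklearn implementation is what is actually used in the next cell):
In [ ]:
import numpy as np
from collections import Counter

def knn_predict(o, LS_X, LS_y, k=5):
    # Euclidean distance from o to every labelled instance in the learnset
    distances = np.linalg.norm(LS_X - o, axis=1)
    # indices of the k closest neighbours
    nearest = np.argsort(distances)[:k]
    # majority vote among the labels of those k neighbours
    return Counter(LS_y[nearest]).most_common(1)[0][0]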
In [186]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='auto',
                           leaf_size=30,
                           metric='minkowski',
                           metric_params=None,
                           n_jobs=1,
                           n_neighbors=5,
                           p=2,
                           weights='uniform')
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score
--> How RandomForest works:
There are two levels of randomness in this algorithm:
(1) At the row level: each of the decision trees gets a random sample of the training data (say 10% of the rows), i.e. each tree is trained independently on its own randomly chosen subset of the rows. Because the trees see different rows, they differ from each other in terms of their predictions.
(2) At the column level: the second level of randomness is introduced at the column level. Not all the columns are passed into training each of the decision trees. Say we want only 10% of the columns to be sent to each tree; then a randomly selected subset of the columns is sent to each one. So the first decision tree may be trained on columns C1, C2 and C4, the next on C4, C5 and C10, and so on (see the parameter sketch below).
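In sklearn, these two levels of randomness map roughly onto constructor parameters (a sketch; note that sklearn applies the column subsampling per split rather than per tree, and parameter defaults vary across versions):
In [ ]:
from sklearn.ensemble import RandomForestClassifier

rfc_demo = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    bootstrap=True,        # row-level randomness: each tree sees a bootstrap sample of the rows
    max_features='sqrt')   # column-level randomness: a random subset of columns considered at each split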
In [191]:
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train,y_train)
predictions = rfc.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score
In [209]:
import xgboost as xgb
X_train.columns = range(X_train.shape[1])
X_test.columns = range(X_test.shape[1])
xgc = xgb.XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)
xgc.fit(X_train, y_train)
predictions = xgc.predict(X_test)
Score = accuracy_score(predictions, y_test)
print Score