After the Machine Learning class concluded last month, I walked around with an air of muhaha-i-know-ml, only to watch it soon develop into a big “now-what?:-o”. As it happens, I wasn’t surrounded with readily available problems statements packaged with trustworthy data and a glowing smiley imprinted. Alas. In a perfect world…
Then, came codesprint. One of the best programming competitions I’ve taken part in, with an interesting facet- a number of Silicon-valley companies hiring right off the bat based on your performance. One of those 16 problems, was based on ML. With a well-defined real-world problem statement. And accompanied by trustworthy processed data. I went like
So it goes.
The problem statement Quora Classifier went like so:
Quora moderates the answers provided by its users for a question by giving it a rating of +1/-1. The problem was to predict the rating for a given answer based on the given features of that particular answer.
The following were the inputs/outputs that was need to be read/written to STDIN/STDOUT.
Input:
2 23
2LuzC +1 1:2101216030446 2:1.807711 3:1 4:4.262680 5:4.488636 6:87.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.891820 12:0 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:2 21:2.197225 22:0.000000 23:0.000000
LmnUc +1 1:99548723068 2:3.032810 3:1 4:2.772589 5:2.708050 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:4.727388 12:5 13:1 14:0 15:0 16:1 17:1 18:0 19:0 20:9 21:2.833213 22:0.000000 23:0.000000
2
PdxMK 1:340674897225 2:1.744152 3:1 4:5.023881 5:7.042286 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.367296 12:0 13:1 14:0 15:0 16:0 17:0 18:0 19:0 20:12 21:4.499810 22:0.000000 23:0.000000
ehZ0a 1:2090062840058 2:1.939101 3:1 4:3.258097 5:2.995732 6:75.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.433987 12:0 13:1 14:0 15:0 16:1 17:0 18:0 19:0 20:3 21:2.639057 22:0.000000 23:0.000000
(i.e)
[No. of training examples] [No. of parameters]
( (Answer-ID) (+1/-1) (<param-id>:<param-value>)* )*
[No. of queries]
( (Answer-ID) (<param-id>:<param-value>)* ) *
Output:
PdxMK +1
ehZ0a -1
(i.e)
( [Answer-ID] [Predicted-rating] )*
My code:
import random
import math
from pylab import *
alpha = 0.3
steps = 12
jth = []
def cost(x, y, theta, n, m):
sum = 0
# Iterate over all training examples
for i in range(n):
# Calculate cost of each training example against
# the hypothesis prediction
sum += ( y[i] * math.log(hypo(x[i], theta)) ) +
( (1-y[i]) * math.log((1 - hypo(x[i], theta))) )
return -sum/n
def hypo(x, theta):
"""
Calculate tx = theta' * x, and return k = sigmoid(tx)
"""
tx = 0
for i in range(len(x)):
tx += (x[i] * theta[i])
try:
k = 1 / (1 + math.exp(-tx))
# Overflow error being thrown for extreme values.
# Approximating depending on domain.
except OverflowError:
if tx > 10 ** 5:
k = 1 / (1 + math.exp(-100))
elif tx < 10 ** 5:
k = 1 / (1 + math.exp(100))
#print tx, k
return k
def gradient_descent(x, y, theta, n, m):
for youcanthazthisvar in range(steps):
theta2 = m * [0]
# Iterate over all parameters
for j in range(m):
k = 0
for i in range(n):
k += (hypo(x[i], theta) - y[i]) * x[i][j]
theta2[j] = theta[j] - (alpha / n) * k
# Simultaneous update
theta = theta2[:]
jth.append(cost(x, y, theta, n, m))
return theta
def get_feature_params(x, n, m):
max_x = x[0][:]
sum_x = [0 for i in range(m)]
for j in range(m):
for i in range(n):
max_x[j] = max( max_x[j], x[i][j] )
sum_x[j] += x[i][j]
mean_x = [sum_x[j]/n for j in range(m)]
return max_x, mean_x
def fscale(x, n, m, max_x, mean_x):
for i in range(n):
for j in range(m):
try:
x[i][j] = (x[i][j] - mean_x[j]) / max_x[j];
except ZeroDivisionError:
pass
return x
def print_mat(x):
for i in x:
print ("%.2f " * len(i)) % tuple(i)
def main():
"""
x = feature matrix
y = pre determined classes for above-
1 for a '+1' and 0 for '-1'
theta = parameter vector
n = No. of training examples
m = No. of features
"""
n, m = [int(x) for x in raw_input().split()]
# Initialize X-matrix for features and y-vector
# for answers
y = n * [0]
x = [[0 for i in range(m + 1)] for j in range(n)]
# Prepend 1 to theta-vector, for theta-0, the
# bias parameter
theta = [1] + [random() for i in range(m)]
# Store training data in x and y
for i in range(n):
str = raw_input().split()
y[i] = int(str[1] == '+1')
for param in str[2:]:
p, q = param.split(':')
x[i][int(p)] = float(q)
# Increase m by 1 to account for theta-0
m = m+1
#----- Feature Scaling -----
print "Before feature scaling: "
print_mat(x)
max_x, mean_x = get_feature_params(x, n, m)
x = fscale(x, n, m, max_x, mean_x)
print "After feature scaling: "
print_mat(x)
# ----- Optimize Theta -----
print "Cost before training: ", cost(x, y, theta, n, m)
theta = gradient_descent(x, y, theta, n, m)
print "Cost after training: ", cost(x, y, theta, n, m)
# ----- Query -----
q = int(raw_input())
qnames = []
x_query = []
for i in range(q):
str = raw_input().split()
x_query = [1] + [ float(param.split(':')[1])
for param in str[1:] ]
x_query = fscale([x_query,], 1, m, max_x, mean_x)[0]
print str[0], ('-1', '+1')[
hypo(x_query, theta) >= 0.5 ]
# ----- Plot Cost vs. Iterations -----
iters = range(steps)
plot(jth, 'r-')
show()
if __name__ == '__main__':
main()
P.S: This prints some extra information- Cost before/after training, Features of X before/after feature-scaling, a graph of cost vs. gradient-descent iterations; to STDOUT, in addition to what’s required.