#if DEBUG, python flavoured

I rely a lot on preprocessor directives for debugging in C++, which usually looks like this:

#include <iostream>
using namespace std;

#define DEBUG 1

int main()
{
    // ....
    #if DEBUG
        cout << "In debug mode" << endl;
    #else
        cout << "Not in debug mode" << endl;
    #endif
    // ....
}

One simple method of handling such stuff in Python involves the use of __debug__. So, the equivalent of the above would be:

if __debug__:
    print "In debug mode"
else:
    print "Not in debug mode"

To run a program with the __debug__ flag turned off, you execute
> python -O <filename.py>
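
A handy bonus: the -O switch also strips assert statements along with anything under an `if __debug__:` block. A minimal sketch (Python 2, to match the above; the file name is just for illustration):

# debug_demo.py
if __debug__:
    print "Running with sanity checks"

x = 42
assert x > 0, "x must be positive"   # compiled away under -O
print "x =", x

Running `python debug_demo.py` prints both messages; `python -O debug_demo.py` skips the __debug__ block and the assert entirely.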


Logistic Regression, in python

After the Machine Learning class concluded last month, I walked around with an air of muhaha-i-know-ml, only to watch it soon develop into a big “now-what?:-o”. As it happens, I wasn’t surrounded by readily available problem statements packaged with trustworthy data and a glowing smiley imprinted. Alas. In a perfect world…

Then came codesprint. One of the best programming competitions I’ve taken part in, with an interesting facet - a number of Silicon Valley companies hiring right off the bat based on your performance. One of those 16 problems was based on ML. With a well-defined real-world problem statement. And accompanied by trustworthy processed data. I went like

^o^

So it goes.

The problem statement, Quora Classifier, went like so:
Quora moderates the answers provided by its users for a question by giving each a rating of +1/-1. The problem was to predict the rating for a given answer based on the given features of that particular answer.
The following were the inputs/outputs that needed to be read from/written to STDIN/STDOUT.

Input:

2 23
2LuzC +1 1:2101216030446 2:1.807711 3:1 4:4.262680 5:4.488636 6:87.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.891820 12:0 13:1 14:0 15:0 16:0 17:1 18:1 19:0 20:2 21:2.197225 22:0.000000 23:0.000000
LmnUc +1 1:99548723068 2:3.032810 3:1 4:2.772589 5:2.708050 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:4.727388 12:5 13:1 14:0 15:0 16:1 17:1 18:0 19:0 20:9 21:2.833213 22:0.000000 23:0.000000
2
PdxMK 1:340674897225 2:1.744152 3:1 4:5.023881 5:7.042286 6:0.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.367296 12:0 13:1 14:0 15:0 16:0 17:0 18:0 19:0 20:12 21:4.499810 22:0.000000 23:0.000000
ehZ0a 1:2090062840058 2:1.939101 3:1 4:3.258097 5:2.995732 6:75.000000 7:0.000000 8:0.000000 9:0 10:0 11:3.433987 12:0 13:1 14:0 15:0 16:1 17:0 18:0 19:0 20:3 21:2.639057 22:0.000000 23:0.000000

(i.e.)
[No. of training examples] [No. of parameters]
( (Answer-ID) (+1/-1) (<param-id>:<param-value>)* )*
[No. of queries]
( (Answer-ID) (<param-id>:<param-value>)* )*

Output:

PdxMK +1
ehZ0a -1

(i.e.)
( [Answer-ID] [Predicted-rating] )*

My code:

import math
import random
# Import only what we need from pylab; `from pylab import *`
# shadows the stdlib random module with numpy's
from pylab import plot, show

alpha = 0.3   # learning rate
steps = 12    # no. of gradient-descent iterations
jth = []      # cost recorded after each iteration

def cost(x, y, theta, n, m):
    total = 0
    # Iterate over all training examples
    for i in range(n):
        # Add the cross-entropy cost of each training example
        # against the hypothesis prediction
        total += ( y[i] * math.log(hypo(x[i], theta)) ) + \
                 ( (1 - y[i]) * math.log(1 - hypo(x[i], theta)) )
    return -total / n

def hypo(x, theta):
    """
    Calculate tx = theta' * x, and return k = sigmoid(tx)
    """
    tx = 0
    for i in range(len(x)):
        tx += x[i] * theta[i]
    try:
        k = 1 / (1 + math.exp(-tx))
    # math.exp throws OverflowError for extreme arguments,
    # i.e. when tx is hugely negative. Clamp the sigmoid
    # near its asymptotes instead.
    except OverflowError:
        if tx > 0:
            k = 1 / (1 + math.exp(-100))
        else:
            k = 1 / (1 + math.exp(100))
    #print tx, k
    return k

def gradient_descent(x, y, theta, n, m):
    for youcanthazthisvar in range(steps):
        theta2 = m * [0]
        # Iterate over all parameters
        for j in range(m):
            k = 0
            for i in range(n):
                k += (hypo(x[i], theta) - y[i]) * x[i][j]
            theta2[j] = theta[j] - (alpha / n) * k
        # Simultaneous update
        theta = theta2[:]
        jth.append(cost(x, y, theta, n, m))
    return theta

def get_feature_params(x, n, m):
    # Per-feature maximum and mean, used for feature scaling
    max_x = x[0][:]
    sum_x = [0 for i in range(m)]
    for j in range(m):
        for i in range(n):
            max_x[j] = max( max_x[j], x[i][j] )
            sum_x[j] += x[i][j]
    mean_x = [sum_x[j] / n for j in range(m)]
    return max_x, mean_x

def fscale(x, n, m, max_x, mean_x):
    # Mean-normalize each feature; skip column 0, which is
    # the bias feature and must stay fixed at 1
    for i in range(n):
        for j in range(1, m):
            try:
                x[i][j] = (x[i][j] - mean_x[j]) / max_x[j]
            except ZeroDivisionError:
                pass
    return x

def print_mat(x):
    for i in x:
        print ("%.2f   " * len(i)) % tuple(i)

def main():
    """
    x = feature matrix
    y = pre-determined classes for the above -
        1 for a '+1' and 0 for a '-1'
    theta = parameter vector
    n = No. of training examples
    m = No. of features
    """
    n, m = [int(tok) for tok in raw_input().split()]

    # Initialize X-matrix for features and y-vector for
    # answers; column 0 of x is the bias feature, fixed at 1
    y = n * [0]
    x = [[1] + [0 for i in range(m)] for j in range(n)]

    # Prepend 1 to theta-vector, for theta-0, the
    # bias parameter
    theta = [1] + [random.random() for i in range(m)]

    # Store training data in x and y
    for i in range(n):
        tokens = raw_input().split()
        y[i] = int(tokens[1] == '+1')
        for param in tokens[2:]:
            p, q = param.split(':')
            x[i][int(p)] = float(q)

    # Increase m by 1 to account for theta-0
    m = m + 1

    # ----- Feature Scaling -----
    print "Before feature scaling: "
    print_mat(x)
    max_x, mean_x = get_feature_params(x, n, m)
    x = fscale(x, n, m, max_x, mean_x)
    print "After feature scaling: "
    print_mat(x)

    # ----- Optimize Theta -----
    print "Cost before training: ", cost(x, y, theta, n, m)
    theta = gradient_descent(x, y, theta, n, m)
    print "Cost after training: ", cost(x, y, theta, n, m)

    # ----- Query -----
    q = int(raw_input())
    for i in range(q):
        tokens = raw_input().split()
        x_query = [1] + [ float(param.split(':')[1])
                          for param in tokens[1:] ]
        x_query = fscale([x_query], 1, m, max_x, mean_x)[0]
        print tokens[0], ('-1', '+1')[hypo(x_query, theta) >= 0.5]

    # ----- Plot Cost vs. Iterations -----
    plot(range(steps), jth, 'r-')
    show()

if __name__ == '__main__':
    main()

P.S.: In addition to what’s required, this prints some extra information to STDOUT - cost before/after training and the features of X before/after feature scaling - and pops up a graph of cost vs. gradient-descent iterations.
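
Incidentally, those diagnostics could be gated behind __debug__ from the first post. A sketch of how the training section of main() might be wrapped, assuming the judge only wants the answer lines:

# a sketch: with the prints gated like this, running the
# script under `python -O` emits only the required
# "<Answer-ID> <rating>" lines
if __debug__:
    print "Cost before training: ", cost(x, y, theta, n, m)
theta = gradient_descent(x, y, theta, n, m)
if __debug__:
    print "Cost after training: ", cost(x, y, theta, n, m)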


“Among other things, you’ll find that you’re not the first person who was ever confused and frightened and even sickened by human behavior. You’re by no means alone on that score, you’ll be excited and stimulated to know. Many, many men have been just as troubled morally and spiritually as you are right now. Happily, some of them kept records of their troubles. You’ll learn from them — if you want to. Just as someday, if you have something to offer, someone will learn something from you. It’s a beautiful reciprocal arrangement. And it isn’t education. It’s history. It’s poetry.”


and this little piggy wanted to showboat with Django…

I have been cursed with blog infidelity. While this blog, the second one, tests my faithfulness, I spent last week’s free time reading “Practical Django Projects” by James Bennett (Apress) and following one of its projects. Specifically, a project that involved creating a blog :-/, using Django. I had initially expected more commitment toward a product devised by my own blood, sweat and tears; the exercise turned out to be more enlightening than quixotic.

Django’s a refreshing change
The last web-site I designed spanned, collectively, over a thousand-odd lines after intense refactoring. I walked around with a sense of pride at having compacted the code to what I then considered critical mass, completely oblivious to the world of frameworks. Recently, at my internship at Mahiti, we were introduced to a project using Django. At the same time, I started spending my evenings working on a blog, following the aforementioned book. These two experiences amalgamated to hit me with something equivalent to the smoker’s first drag early in the morning, or so I presume.
I love Python. It’s the mechanic’s most coveted all-in-one tool-kit. But I am not ready to say the same of Django. Not just yet. Working on something like Django requires versatility. Versatility which makes a noob like myself program, design, manage and breathe heavily, all at the same time. While designing the blog, I missed foreseeing a few intricate details which the author effortlessly managed to consider; the kind that I assume requires experience and forethought, the kind that makes you go “Oh no, I know what’s gonna happen if I stick with this”.
Django’s a refreshing change. It makes you factor your code. It disciplines you. It teaches you how to cut in between the lines and be appreciated for it. Take, for example, “tags” for any type of content. All that needed to be done was installing django-tags, mentioning it in your code, and using it right away. So is the case with comments, RSS and delicious feeds and whatnot.
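
To give a flavour of that, here is roughly what the wiring amounts to - a sketch in the Django 1.x settings conventions of the book’s era, assuming the tagging app registers itself as 'tagging' (as django-tagging, the app the book uses, does, if I remember right; 'coltrane' is the book’s example blog app):

# settings.py (sketch; Django 1.x style)
INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sites',
    'django.contrib.admin',
    'tagging',     # install it, list it here, use it right away
    'coltrane',    # the blog app built in the book
)

After that, a model can declare a TagField and templates can render tags right away, with no extra plumbing.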

Why I refrain from using the Django blog
When I kicked off with the first chapter of the book, I had a clear picture of what I wanted. It was realistic. Maybe because it was minimal. As soon as I was done coding the Django blog, I got to writing this post. Even finding a free Django-friendly hosting service (https://www.alwaysdata.com/) hasn’t pushed me into using that blog. Maybe because realising the subtle complexity of how a minor change can sometimes affect various parts of the project inside-out sent a chill down my spine. A change that makes one think in terms of the tool rather than the stark logic. Maybe I haven’t matured enough to assimilate the inner workings of Django that’d make the process fun. Maybe my inexperience with web programming is being highlighted. All in all, I refrain from moving over to the Django blog…until I gain the ability to visualize the changes that need to be carried out in the code when a need surfaces.

To learn, check out:
http://www.djangobook.com/en/2.0/
https://docs.djangoproject.com/en/1.3/intro/tutorial01/


Hello, world!

“The Hitchhiker’s Guide to the Galaxy” quotes:

“The Hitchhiker’s Guide to the Galaxy” offers this definition of the word “infinite”: “Bigger than the biggest thing ever and then some. Much bigger than that in fact, really amazingly immense, a totally stunning size, real ‘wow, that’s big’ time. Infinity is just so big that by comparison, bigness itself looks real titchy. Gigantic multiplied by colossal multiplied by staggeringly huge is the sort of concept we’re trying to get across here.”

Although the context of “infinite” above differs from what’s used in the blog’s title, I find peace and joy in a sardonic perspective on matters over which an untold but statistically probable number of great minds loom.

But wait, that wasn’t the main intention behind this blog being allocated some space on some server halfway around the world. I claim to have been pressured by a reason which can be concisely but precisely labelled “Technical propaganda” (Deja Vu, http://automatocalypse.blogspot.com).
