Noisebridge - User contributions [en]

Machine Learning/VirtualBox

2011-01-27T04:25:40Z

Jjhale:

To avoid spending time installing stuff we have an image of Ubuntu with most of the software we use in the ML group. The image can be run on VirtualBox.

It's about 2GB and available on USB at meetings.

* username: nbml
* password: nbmlsvm

Machine Learning/VirtualBox

2011-01-27T04:25:26Z

Jjhale: Created page with 'To avoid spending time installing stuff we have an image of Ubuntu with most of the software we use in the ML group. The image can be run on VirtualBox. It's about 2GB and avail…'

Machine Learning/Kaggle Social Network Contest/Features

2010-11-25T23:50:55Z

Jjhale: /* Joe's attempt */

== TODO ==
* Precisely define the listed features

== Possible Features ==
*Node Features
**nodeid
**outdegree
**indegree
**local clustering coefficient
**reciprocation of inbound probability (num of edges returned / num of inbound edges)
**reciprocation of outbound probability (num of edges returned / num of outbound edges)

*Edge Features
**nodetofollowid
**shortest distance nodeid to nodetofollowid
**density? (<strike>median path length</strike>)
**does reverse edge exist? (aka is nodetofollowid following nodeid?)
**number of common friends
**indegrees & outdegrees of nodetofollowid

* Network features
** unweighted random walk score
** global clustering coefficient
** Adamic-Adar score
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]

* Clustering
** membership of the same strongly connected cluster
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
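
A few of the listed node and edge features can be sketched in plain Python on a toy edge list. This is only an illustration: the helper names (<code>build_adjacency</code>, <code>reciprocation_outbound</code>, <code>reverse_edge_exists</code>, <code>common_friends</code>) are made up here, and "common friends" is assumed to mean nodes adjacent to both endpoints in either direction.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return (out_adj, in_adj): dicts mapping node -> set of neighbours."""
    out_adj, in_adj = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_adj[src].add(dst)
        in_adj[dst].add(src)
    return out_adj, in_adj


def reciprocation_outbound(node, out_adj):
    """Edges returned / outbound edges (0.0 when the node follows nobody)."""
    out = out_adj.get(node, set())
    if not out:
        return 0.0
    returned = sum(1 for nbr in out if node in out_adj.get(nbr, set()))
    return returned / len(out)


def reverse_edge_exists(s, t, out_adj):
    """Is t already following s?"""
    return s in out_adj.get(t, set())


def common_friends(s, t, out_adj, in_adj):
    """Count nodes adjacent to both s and t (assumption: direction ignored)."""
    def nbrs(n):
        return out_adj.get(n, set()) | in_adj.get(n, set())
    return len((nbrs(s) & nbrs(t)) - {s, t})


# Toy graph (hypothetical data, not taken from the contest files)
edges = [(1, 2), (2, 1), (1, 3), (3, 2), (4, 1)]
out_adj, in_adj = build_adjacency(edges)
outdegree_1 = len(out_adj[1])
indegree_1 = len(in_adj[1])
```

On the real data the same dictionaries could be filled from the adjacency-list dumps rather than from an in-memory edge list.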

== Joe's attempt ==
I'm planning to collect features for each edge, then sample those features over existing and randomly created edges and fit a logistic regression model to the sample.
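
A minimal sketch of the sampling step, assuming the random edges are drawn uniformly from node pairs that are not in the edge list (the text does not say how they are chosen). The function name <code>sample_training_pairs</code> and the toy graph are hypothetical.

```python
import random


def sample_training_pairs(edges, nodes, n_samples, seed=0):
    """Return [((s, t), label)] with n_samples existing edges (label 1)
    and n_samples randomly created non-edges (label 0)."""
    rng = random.Random(seed)
    edge_set = set(edges)
    positives = rng.sample(sorted(edge_set), n_samples)
    negatives = set()
    # Assumes the graph is sparse enough that non-edges are easy to find.
    while len(negatives) < n_samples:
        s, t = rng.choice(nodes), rng.choice(nodes)
        if s != t and (s, t) not in edge_set:
            negatives.add((s, t))
    return [(e, 1) for e in positives] + [(e, 0) for e in sorted(negatives)]


# Toy graph (hypothetical data)
edges = [(1, 2), (2, 3), (3, 1), (4, 5), (5, 1)]
nodes = [1, 2, 3, 4, 5]
pairs = sample_training_pairs(edges, nodes, n_samples=3)
```

Each sampled pair would then be turned into a feature row before fitting the model.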

For an edge from node s to node t I will calculate:
# is there a directed edge from t to s?
# the in-degree of s
# the out-degree of s
# the in-degree of t
# the out-degree of t
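# RLD<sub>-1</sub>(s)
# RLD<sub>1</sub>(s)
# RLD<sub>0</sub>(s)
# RLD<sub>-1</sub>(t)
# RLD<sub>1</sub>(t)
# RLD<sub>0</sub>(t)
# AA<sub>0</sub><sup>1</sup>(s,t)
# AA<sub>0</sub><sup>1.5</sup>(s,t)
# AA<sub>0</sub><sup>2</sup>(s,t)
# AA<sub>-1</sub><sup>1</sup>(s,t)
# AA<sub>-1</sub><sup>1.5</sup>(s,t)
# AA<sub>-1</sub><sup>2</sup>(s,t)
# AA<sub>1</sub><sup>1</sup>(s,t)
# AA<sub>1</sub><sup>1.5</sup>(s,t)
# AA<sub>1</sub><sup>2</sup>(s,t)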

where
* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where x = -1 means in, 1 means out and 0 means any. (RLD = reciprocal log of degree)
** note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than 1/log(1), which is a divide by zero
** logs are taken to base e

I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along any edge (x = 0), along inbound edges (x = -1) or along outbound edges (x = 1).

I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood, i.e.
* <math>C_x^h(s,t) = \left(N_x^h(s) \cap N_{-x}^h(t)\right) \setminus \bigcup_{h' < h} \left(N_x^{h'}(s) \cap N_{-x}^{h'}(t)\right)</math>
** h = 1.5 corresponds to nodes which are one hop from one of s or t and two hops from the other
* The sets C<sub>x</sub><sup>h</sup>(s,t) are disjoint for different h.
* It is directional, i.e. sometimes C<sub>x</sub><sup>h</sup>(s,t) ≠ C<sub>x</sub><sup>h</sup>(t,s)
* AA is the Adamic-Adar score calculated over different common neighbourhoods.
** the subscript 0, -1 or 1 refers to neighbours reachable by following any, inbound or outbound edges respectively
** the superscript 1, 1.5 or 2 refers to the number of hops from a focal node to the neighbour.
* <math>AA_x^h(s,t) = \sum_{n \in C_x^h(s,t)} RLD_0(n)</math>
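
The definitions above can be sketched for the simplest case, h = 1. This is a hedged illustration rather than the code actually used: the function names are invented, the "any" degree is assumed to be in-degree plus out-degree, and the h = 1.5 and h = 2 neighbourhoods are left out.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Return (out_adj, in_adj): dicts mapping node -> set of neighbours."""
    out_adj, in_adj = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        out_adj[src].add(dst)
        in_adj[dst].add(src)
    return out_adj, in_adj


def neighbours(n, x, out_adj, in_adj):
    """One-hop neighbours: x = 1 outbound, x = -1 inbound, x = 0 any."""
    out_n, in_n = out_adj.get(n, set()), in_adj.get(n, set())
    if x == 1:
        return set(out_n)
    if x == -1:
        return set(in_n)
    return out_n | in_n


def degree(n, x, out_adj, in_adj):
    """x-degree of n; 'any' is assumed to be in-degree + out-degree."""
    out_d, in_d = len(out_adj.get(n, ())), len(in_adj.get(n, ()))
    return {1: out_d, -1: in_d, 0: out_d + in_d}[x]


def rld(n, x, out_adj, in_adj):
    """Reciprocal log of degree, with the 0.1 offset described above."""
    return 1.0 / math.log(0.1 + degree(n, x, out_adj, in_adj))


def adamic_adar_1hop(s, t, x, out_adj, in_adj):
    """AA for h = 1: sum of RLD_0 over N_x(s) intersected with N_-x(t)."""
    common = neighbours(s, x, out_adj, in_adj) & neighbours(t, -x, out_adj, in_adj)
    return sum(rld(n, 0, out_adj, in_adj) for n in common)


# Toy graph (hypothetical data)
edges = [(1, 3), (3, 2), (1, 4), (2, 4)]
out_adj, in_adj = build_adjacency(edges)
aa_out = adamic_adar_1hop(1, 2, 1, out_adj, in_adj)   # paths 1 -> n -> 2
aa_any = adamic_adar_1hop(1, 2, 0, out_adj, in_adj)   # any direction
```

Note that a node whose x-degree is 0 gets a negative score under this formula, since log(0.1) is negative.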

Machine Learning/Kaggle Social Network Contest/Features

2010-11-25T23:44:58Z

Jjhale: /* Joe's attempt */

== TODO ==
* Precisely define the listed features

== Possible Features ==
*Node Features
**nodeid
**outdegree
**indegree
**local clustering coefficient
**reciprocation of inbound probability (num of edges returned / num of inbound edges)
**reciprocation of outbound probability (num of edges returned / num of outbound edges)

*Edge Features
**nodetofollowid
**shortest distance nodeid to nodetofollowid
**density? (<strike>median path length</strike>)
**does reverse edge exist? (aka is nodetofollowid following nodeid?)
**number of common friends
**indegrees & outdegrees of nodetofollowid

* Network features
** unweighted random walk score
** global clustering coefficient
** Adamic-Adar score
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]

* Clustering
** membership of the same strongly connected cluster
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

== Joe's attempt ==
I'm planning to collect features for each edge, then sample those features over existing and randomly created edges and fit a logistic regression model to the sample.

For an edge from node s to node t I will calculate:
# the in-degree of s
# the out-degree of s
# the in-degree of t
# the out-degree of t
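# RLD<sub>-1</sub>(s)
# RLD<sub>1</sub>(s)
# RLD<sub>0</sub>(s)
# RLD<sub>-1</sub>(t)
# RLD<sub>1</sub>(t)
# RLD<sub>0</sub>(t)
# AA<sub>0</sub><sup>1</sup>(s,t)
# AA<sub>0</sub><sup>1.5</sup>(s,t)
# AA<sub>0</sub><sup>2</sup>(s,t)
# AA<sub>-1</sub><sup>1</sup>(s,t)
# AA<sub>-1</sub><sup>1.5</sup>(s,t)
# AA<sub>-1</sub><sup>2</sup>(s,t)
# AA<sub>1</sub><sup>1</sup>(s,t)
# AA<sub>1</sub><sup>1.5</sup>(s,t)
# AA<sub>1</sub><sup>2</sup>(s,t)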

where
* RLD<sub>x</sub>(n) is 1 / log(0.1 + the x-degree of node n), where x = -1 means in, 1 means out and 0 means any. (RLD = reciprocal log of degree)
** note that I add 0.1 so that nodes with degree 1 have a score of 1/log(1.1) = 10.49 rather than 1/log(1), which is a divide by zero
** logs are taken to base e

I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along any edge (x = 0), along inbound edges (x = -1) or along outbound edges (x = 1).

I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood, i.e.
* <math>C_x^h(s,t) = \left(N_x^h(s) \cap N_{-x}^h(t)\right) \setminus \bigcup_{h' < h} \left(N_x^{h'}(s) \cap N_{-x}^{h'}(t)\right)</math>
** h = 1.5 corresponds to nodes which are one hop from one of s or t and two hops from the other
* The sets C<sub>x</sub><sup>h</sup>(s,t) are disjoint for different h.
* It is directional, i.e. sometimes C<sub>x</sub><sup>h</sup>(s,t) ≠ C<sub>x</sub><sup>h</sup>(t,s)
* AA is the Adamic-Adar score calculated over different common neighbourhoods.
** the subscript 0, -1 or 1 refers to neighbours reachable by following any, inbound or outbound edges respectively
** the superscript 1, 1.5 or 2 refers to the number of hops from a focal node to the neighbour.
* <math>AA_x^h(s,t) = \sum_{n \in C_x^h(s,t)} RLD_0(n)</math>

Machine Learning/Kaggle Social Network Contest/Features

2010-11-25T23:28:55Z

Jjhale: Adding my approach

== TODO ==
* Precisely define the listed features

== Possible Features ==
*Node Features
**nodeid
**outdegree
**indegree
**local clustering coefficient
**reciprocation of inbound probability (num of edges returned / num of inbound edges)
**reciprocation of outbound probability (num of edges returned / num of outbound edges)

*Edge Features
**nodetofollowid
**shortest distance nodeid to nodetofollowid
**density? (<strike>median path length</strike>)
**does reverse edge exist? (aka is nodetofollowid following nodeid?)
**number of common friends
**indegrees & outdegrees of nodetofollowid

* Network features
** unweighted random walk score
** global clustering coefficient
** Adamic-Adar score
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
*** R igraph: [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/similarity.html similarity.invlogweighted]

* Clustering
** membership of the same strongly connected cluster
*** using [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clusters]

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

== Joe's attempt ==
I'm planning to collect features for each edge, then sample those features over existing and randomly created edges and fit a logistic regression model to the sample.

For an edge from node s to node t I will calculate:
# the in-degree of s
# the out-degree of s
# the in-degree of t
# the out-degree of t
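# RLD<sub>-1</sub>(s)
# RLD<sub>1</sub>(s)
# RLD<sub>0</sub>(s)
# RLD<sub>-1</sub>(t)
# RLD<sub>1</sub>(t)
# RLD<sub>0</sub>(t)
# AA<sub>0</sub><sup>1</sup>(s,t)
# AA<sub>0</sub><sup>1.5</sup>(s,t)
# AA<sub>0</sub><sup>2</sup>(s,t)
# AA<sub>-1</sub><sup>1</sup>(s,t)
# AA<sub>-1</sub><sup>1.5</sup>(s,t)
# AA<sub>-1</sub><sup>2</sup>(s,t)
# AA<sub>1</sub><sup>1</sup>(s,t)
# AA<sub>1</sub><sup>1.5</sup>(s,t)
# AA<sub>1</sub><sup>2</sup>(s,t)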

where
* RLD<sub>x</sub>(n) is 1 / log(the x-degree of node n), where x = -1 means in, 1 means out and 0 means any. (RLD = reciprocal log of degree)

I define N<sub>x</sub><sup>h</sup>(n) to be the nodes reachable from ''n'' in ''h'' hops along any edge (x = 0), along inbound edges (x = -1) or along outbound edges (x = 1).

I define C<sub>x</sub><sup>h</sup>(s,t) as the set of common neighbours of s and t a distance of h hops from s and t, excluding nodes in a closer common neighbourhood, i.e.
* <math>C_x^h(s,t) = \left(N_x^h(s) \cap N_{-x}^h(t)\right) \setminus \bigcup_{h' < h} \left(N_x^{h'}(s) \cap N_{-x}^{h'}(t)\right)</math>
** h = 1.5 corresponds to nodes which are one hop from one of s or t and two hops from the other
* The sets C<sub>x</sub><sup>h</sup>(s,t) are disjoint for different h.
* It is directional, i.e. sometimes C<sub>x</sub><sup>h</sup>(s,t) ≠ C<sub>x</sub><sup>h</sup>(t,s)
* AA is the Adamic-Adar score calculated over different common neighbourhoods.
** the subscript 0, -1 or 1 refers to neighbours reachable by following any, inbound or outbound edges respectively
** the superscript 1, 1.5 or 2 refers to the number of hops from a focal node to the neighbour.
* <math>AA_x^h(s,t) = \sum_{n \in C_x^h(s,t)} RLD_0(n)</math>

Machine Learning/Kaggle Social Network Contest/Network Description

2010-11-24T08:07:05Z

Jjhale: /* Connectivity */

Here we can put the descriptive statistics of the network:

* Number of fully sampled nodes: 37,689
** ie the unique "outnodes" in the edge list
* Total number of nodes: 1,133,547
* number of edges: 7,237,983

== Connectivity ==
"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia]

* The Training Graph is '''not''' weakly connected
* It contains 27 weakly connected clusters, which means that it can be broken down into discrete subgraphs.
** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering]
** There is one very large cluster containing all but 154 vertices, then 4 of size 10 - 37, 8 of size 3 - 7, 13 of size 2 and 1 of size 1
** note that igraph seems to create a vertex labelled 0 but the labels in the traindata file range from 1 to 1133547
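
For anyone without igraph to hand, the cluster-size histogram for weak connectivity can be approximated with a small union-find in plain Python. This is a sketch under assumptions: the function name is made up, and only vertices that appear in at least one edge are counted, so an isolated vertex such as the extra vertex 0 would not show up.

```python
from collections import Counter


def weak_component_sizes(edges):
    """Histogram {cluster size: number of clusters}, ignoring edge direction."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    sizes = Counter(find(v) for v in list(parent))
    return Counter(sizes.values())


# Toy graph (hypothetical data): clusters {1,2,3}, {4,5} and {6,7}
edges = [(1, 2), (3, 2), (4, 5), (6, 7), (7, 6)]
histogram = weak_component_sizes(edges)
```

Running it once on the training edges and once on training plus test edges would give the two rows of a train versus train + test comparison.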

* I also grabbed the number of strongly connected subgraphs
{| border="1"
|-
!| Cluster Size
| 1
| 2
| 3
| 4
| 5
| 9
| 10
| 32464
|-
!| freq
|1100647
| 162
| 18
|5
|4
| 1
| 1
| 1
|}

When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.
{| border="1"
|-
!| Cluster Size
| 1
| 2
| 3
| 4
| 5
| 7
| 10
| 23
| 37
| 1133394
| 1133466
|-
!| Train
| 1
| 13
| 3
| 2
| 2
| 1
| 1
| 2
| 1
| 1
| 0
|-
!| Train + Test
| 1
| 13
| 2
| 1
| 1
| 1
| 1
| 1
| 0
| 0
| 1
|}

Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?
* TODO: figure out probs of adding and removing nodes under different sampling hypotheses.
* TODO: identify the edges which are merging the clusters
* I'm guessing that the chance of a randomly generated edge joining the small clusters is very low.

* Diameter of the directed graph is 14
** This is the longest of the shortest directed paths between two nodes
** R igraph
*** diameter (dg, directed = TRUE, unconnected = TRUE)
*** Was taking forever so I aborted (after 34 minutes...)
* Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
** For each of our 38k I calculated the number of outbound neighbours and summed it
** R igraph:
*** sum([http://cneurocvs.rmki.kfki.hu/igraph/doc/R/neighborhood.html neighborhood.size](dg, 1, nodes=myGuys, mode="out"))
*** mode = "in", "out" or "all"
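
The "out" total above equals 7,237,983 + 37,689, which is what one would get if each order-1 neighbourhood counts its own vertex as well as its direct neighbours. A minimal Python sketch under that assumption (the helper name and toy data are hypothetical):

```python
def neighbourhood_size(node, adj):
    """Order-1 neighbourhood size, counting the node itself (assumed convention)."""
    return len(adj.get(node, set()) | {node})


# Toy outbound adjacency (hypothetical data); nodes 1 and 2 play the sampled nodes
out_adj = {1: {2, 3}, 2: {1}, 3: set()}
sampled = [1, 2]
total_out = sum(neighbourhood_size(n, out_adj) for n in sampled)

# Same identity on the contest numbers: edges + sampled nodes
contest_total_out = 7_237_983 + 37_689
```

Subtracting the number of sampled nodes from the total gives the plain count of outbound neighbours.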

Machine Learning/Kaggle Social Network Contest/Network Description

2010-11-24T07:53:54Z

Jjhale: /* Connectivity */ effect of adding test data on clusters

Here we can put the descriptive statistics of the network:

* Number of fully sampled nodes: 37,689
** ie the unique "outnodes" in the edge list
* Total number of nodes: 1,133,547
* number of edges: 7,237,983

== Connectivity ==
"A digraph is strongly connected if every vertex is reachable from every other following the directions of the arcs. On the contrary, a digraph is weakly connected if its underlying undirected graph is connected. A weakly connected graph can be thought of as a digraph in which every vertex is "reachable" from every other but not necessarily following the directions of the arcs. A strong orientation is an orientation that produces a strongly connected digraph." [http://en.wikipedia.org/wiki/Glossary_of_graph_theory wikipedia]

* The Training Graph is '''not''' weakly connected
* It contains 27 weakly connected clusters, which means that it can be broken down into discrete subgraphs.
** c.f. [http://cneurocvs.rmki.kfki.hu/igraph/doc/R/clusters.html igraph clustering]
** There is one very large cluster containing all but 154 vertices, then 4 of size 10 - 37, 8 of size 3 - 7, 13 of size 2 and 1 of size 1
** note that igraph seems to create a vertex labelled 0 but the labels in the traindata file range from 1 to 1133547

* I also grabbed the number of strongly connected subgraphs
{| border="1"
|-
!| Cluster Size
| 1
| 2
| 3
| 4
| 5
| 9
| 10
| 32464
|-
!| freq
|1100647
| 162
| 18
|5
|4
| 1
| 1
| 1
|}

When I added all of the test data to the graph and then re-ran the cluster analysis it found 22 clusters instead of 27. The largest cluster grew by 72 vertices.
{| border="1"
|-
!| Cluster Size
| 1
| 2
| 3
| 4
| 5
| 7
| 10
| 23
| 37
| 1133394
| 1133466
|-
!| Train
| 1
| 13
| 3
| 2
| 2
| 1
| 1
| 2
| 1
| 1
| 0
|-
!| Train + Test
| 1
| 13
| 2
| 1
| 1
| 1
| 1
| 1
| 0
| 0
| 1
|}

Is it more likely that clusters were created by removing nodes or that they merged due to randomly adding nodes?
* TODO figure out probs of adding and removing nodes under different sampling hypotheses.
* I'm guessing that the chance of a randomly generated edge joining the small clusters is very low.

* Diameter of the directed graph is 14
** This is the longest of the shortest directed paths between two nodes
** R igraph
*** diameter (dg, directed = TRUE, unconnected = TRUE)
*** Was taking forever so I aborted (after 34 minutes...)
* Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
** For each of our 38k I calculated the number of outbound neighbours and summed it
** R igraph:
*** sum([http://cneurocvs.rmki.kfki.hu/igraph/doc/R/neighborhood.html neighborhood.size](dg, 1, nodes=myGuys, mode="out"))
*** mode = "in", "out" or "all"

Machine Learning/Kaggle Social Network Contest/Features

2010-11-23T04:05:53Z

Jjhale: /* Possible Features */

Machine Learning/Kaggle Social Network Contest/Network Description

2010-11-23T03:59:59Z

Jjhale:

Machine Learning/Kaggle Social Network Contest/Network Description

2010-11-23T03:29:11Z

Jjhale:

Here we can put the descriptive statistics of the network:

* Number of fully sampled nodes: 37,689
** ie the unique "outnodes" in the edge list
* Total number of nodes: 1,133,547
* number of edges: 7,237,983

* The Graph is '''not''' weakly connected! This means that it can be broken down into at least two discrete subgraphs.

* Diameter of the directed graph
** This is the longest of the shortest directed paths between two nodes
** R igraph
*** diameter (dg, directed = TRUE, unconnected = TRUE)
*** Was taking forever so I aborted (after 34 minutes...)
* Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
** For each of our 38k I calculated the number of outbound neighbours and summed it
** R igraph:
*** sum([http://cneurocvs.rmki.kfki.hu/igraph/doc/R/neighborhood.html neighborhood.size](dg, 1, nodes=myGuys, mode="out"))
*** mode = "in", "out" or "all"

Machine Learning/Kaggle Social Network Contest/Network Description

2010-11-23T03:26:56Z

Jjhale: Created page with 'Here we can put the descriptive statistics of the network: * Number of fully sampled nodes: 37,689 ** ie the unique "outnodes" in the edge list * Total number of nodes: 1,133,5…'

Here we can put the descriptive statistics of the network:

* Number of fully sampled nodes: 37,689
** ie the unique "outnodes" in the edge list
* Total number of nodes: 1,133,547
* number of edges: 7,237,983

* Diameter of the directed graph
** This is the longest of the shortest directed paths between two nodes
** R igraph
*** diameter (dg, directed = TRUE, unconnected = TRUE)
*** Was taking forever so I aborted (after 34 minutes...)
* Total number of direct neighbours out: 7 275 672, in: 508 688, all: 7 473 273
** For each of our 38k I calculated the number of outbound neighbours and summed it
** R igraph:
*** sum([http://cneurocvs.rmki.kfki.hu/igraph/doc/R/neighborhood.html neighborhood.size](dg, 1, nodes=myGuys, mode="out"))
*** mode = "in", "out" or "all"

Machine Learning/Kaggle Social Network Contest

2010-11-23T02:55:05Z

Jjhale: /* Status */

== Status ==
{| border="1" cellspacing="0" cellpadding="2"
|-
! |Tasks
!| Status
!| target date
!|subpage
|-
!|Lit review
| started
| -
| [[/lit review| Lit Review]]
|-
!|Load data
| started
| 11/24
| [[/load data| Load data]]
|-
!|Describe network
| started
| 11/24
| [[/Network Description| Network Description]]
|-
!|Choose problem representation
| started
| 11/24
| [[/Problem Representation| Problem Representation]]
|-
!|Generate candidate features
| 0%
| 11/24
| [[/Features | Features]]
|-
!|fit to model
| 0%
|
| [[/model | Model]]
|-
!|Win competition
| 0%
|
| [[/what to do with all the money | Prize Plan]]
|}

== Official Contest Links ==
* Overview: http://kaggle.com/socialNetwork
* Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)

== Official Data Downloads ==
* Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
* Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
* Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv

== Key Contest Info ==
The data has been downloaded using the API of a social network. There are 7.2m contacts/edges of 38k users/nodes. These have been drawn randomly ensuring a certain level of closedness.

You are given 7,237,983 contacts/edges from a social network (social_train.zip). The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids reach from 1 to 1,133,547.

There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.

The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.

The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those 4,480 are true and 4,480 are false edges. You are tasked to predict which are true (1) and which are false (0). You need to supply back a file with outbound node id,inbound node id,[0,1] in each row. This means you can assign a probability of being true to an edge. You are being scored on the AUC. A random model will have an AUC of 0.5, so you need to try to do better than that (ie have a higher AUC). Your entry should conform to the format in sample_submission.csv.
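
To make the scoring concrete, here is a small sketch of AUC computed directly from its pairwise definition: the fraction of (true edge, false edge) pairs in which the true edge gets the higher score, with ties counting as half. The function name is hypothetical and this is not the contest's own scoring code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions (hypothetical data)
perfect = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
uninformative = auc([1, 0], [0.5, 0.5])
partial = auc([1, 1, 0, 0], [0.9, 0.2, 0.6, 0.1])
```

The pairwise loop is quadratic, which is acceptable for a quick check but slow compared with a rank-based implementation.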

You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain his approach/method to other users.

Don’t despair if your first couple of solutions score low, this is an explorative process.

== Our Working Data Dumps ==
* Adjacency list built from the training data: http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv
** First column: outbound vertex
** Remaining columns: list of vertices to which it points
** Note: Useful when loaded as a hashtable keyed on outbound vertex, returning the list.
* Adjacency list of the reversed graph: http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
** First column: inbound vertex
** Remaining columns: list of vertices which point to it
** Note: Useful for following the edges backwards quickly, for example loaded as a hashtable keyed on inbound vertex, returning the list.
* Degree features for all nodes: http://dl.dropbox.com/u/14895843/social-network-kaggle/node_degree_features.csv
** First column: Node Id
** Second column: Outbound Degree (count of the number of outbound edges from node)
** Third column: Inbound Degree (count of the number of inbound edges to node)
** Note: You can think of these as number of followees and followers (respectively). Additionally, note that only the first 32.7k rows have 'followees'
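
A hedged Python sketch of loading one of the adjacency-list dumps into a hashtable, assuming the layout described above (no header row, a variable number of columns per row). The function name <code>load_adj_list</code> is made up, and an in-memory sample stands in for the real file.

```python
import csv
import io


def load_adj_list(lines):
    """Map the first column of each row to the list of remaining columns."""
    adj = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        adj[int(row[0])] = [int(v) for v in row[1:] if v != ""]
    return adj


# Tiny stand-in for adj_list.out.csv (hypothetical contents)
sample = io.StringIO("1,2,3\n2,1\n")
adj_list = load_adj_list(sample)
```

For the real dump, pass an open file handle instead, e.g. <code>open('adj_list.out.csv', newline='')</code>.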

== Useful Links ==
* [http://arxiv.org/abs/1011.4071 Supervised Random Walks: Predicting and Recommending Links in Social Networks]
* Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
* "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm
* http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
* http://en.wikipedia.org/wiki/Glossary_of_graph_theory
* [http://books.google.com/books?id=Ww3_bKcz6kgC&lpg=PA67&ots=aFSGYEjA_g&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PP7#v=onepage&q&f=false Another Google Book on Social Network Analysis]

Machine Learning/Kaggle Social Network Contest/Features

2010-11-23T02:32:47Z

Jjhale: /* Possible Features */

Machine Learning/Kaggle Social Network Contest/load data

2010-11-23T02:24:35Z

Jjhale: /* R */

=Python=

== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges.

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.

'''Method 1'''
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

'''Method 2'''
<pre>
import csv
import time

import networkx as nx

# Written for Python 3: time.clock() was removed in Python 3.8
t0 = time.perf_counter()
DG = nx.DiGraph()

# Each row is "outbound node id,inbound node id"
with open('social_train.csv', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        DG.add_edge(int(row[0]), int(row[1]))

print("Loaded in", time.perf_counter() - t0, "s")
</pre>

Below are the times to load different numbers of rows using the two methods on a 2.8GHz quad-core machine with 3GB RAM. The second method seems quicker. Note that these are based on single loads and are intended as a guide rather than a rigorous analysis of the methods!
{| border="1"
|-
!|Rows
!| 1M
!| 2M
!| 3M
|-
!|Method 1
| 20s
| 53s
| 103s
|-
!|Method 2
| 15s
| 41s
| 86s
|}

= Ruby =

== Note on CSV Libraries ==
If you happen to be using Ruby (like Jared) for loading data in and out of CSV files, you should definitely try [http://fastercsv.rubyforge.org/ FasterCSV] (require 'faster_csv') instead of the stock CSV (require 'csv'). For example, loading the adjacency list was ten times faster with FasterCSV than with the normal CSV.

== Loading Adjacency Lists ==
<pre>
require 'rubygems'
require 'faster_csv'

# Load an adjacency-list CSV into a hash: node id => array of adjacent node ids.
# Ids are kept as strings, exactly as read from the file.
def load_adj_list_faster(filename)
  adj_list_hash = {}
  FasterCSV.foreach(filename, :quote_char => '"', :col_sep => ',', :row_sep => :auto) do |row|
    node_id = row.shift           # first column is the keyed vertex
    adj_list_hash[node_id] = row  # remaining columns are its neighbours
  end
  adj_list_hash
end

adj_list_lookup = load_adj_list_faster('adj_list.out.csv')
rev_adj_list_lookup = load_adj_list_faster('reverse_adj_list.out.csv')
</pre>

= R =
== igraph ==
The full dataset loaded pretty fast using the R package igraph. With the full data set loaded R is using less than 900MB of RAM.

Grab the package with:
<pre>
install.packages("igraph")
</pre>

Load the data using:
<pre>
data <- as.matrix(read.csv("social_train.csv", header = FALSE))
dg <- graph.edgelist(data, directed=TRUE)
</pre>

Machine Learning/Kaggle Social Network Contest/Features

2010-11-20T05:30:14Z

Jjhale: /* Possible Features */

== TODO ==
* Precisely define the listed features

== Possible Features ==
*nodeid
*nodetofollowid
*median path length
*shortest distance from nodeid to nodetofollowid
*inbound edges
*outbound edges
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

From the Backstrom and Leskovec paper, for a node s and a potential target c
* Network features
** unweighted random walk score
** Adamic-Adar score
*** see [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.108.1370&rep=rep1&type=pdf original paper]
** number of common friends
** indegrees and outdegrees of s
*** the indegree is the number of edges coming into node s
*** the outdegree is the number of edges leaving node s
** indegrees and outdegrees of c

Machine Learning/Kaggle Social Network Contest/Features

2010-11-20T05:17:08Z

Jjhale:

== TODO ==
* Precisely define the listed features

== Possible Features ==
*nodeid
*nodetofollowid
*median path length
*shortest distance from nodeid to nodetofollowid
*inbound edges
*outbound edges
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

From the Backstrom and Leskovec paper, for a node s and a potential target c
* Network features
** unweighted random walk score
** Adamic-Adar score
** number of common friends
** indegrees and outdegrees of s
*** the indegree is the number of edges coming into node s
*** the outdegree is the number of edges leaving node s
** indegrees and outdegrees of c

Machine Learning/Kaggle Social Network Contest/Problem Representation

2010-11-20T05:02:08Z

Jjhale:

== TODO ==
* come up with a plan of attack.

== Idea C - two hop neighbourhood ==
For each of the 38k sampled individuals, calculate features based on a two-hop neighbourhood, i.e. friends of friends.
* these neighbourhoods should make the problem a little more tractable.
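
A minimal sketch of collecting a two-hop neighbourhood, assuming "friends" means nodes reached by following outbound edges (following edges in either direction would be an equally plausible reading). The function name and toy adjacency are hypothetical.

```python
def two_hop_neighbourhood(node, out_adj):
    """Nodes reachable from `node` in one or two outbound hops, excluding itself."""
    one_hop = set(out_adj.get(node, ()))
    two_hop = set()
    for friend in one_hop:
        two_hop |= set(out_adj.get(friend, ()))
    return (one_hop | two_hop) - {node}


# Toy outbound adjacency (hypothetical data)
out_adj = {1: {2, 3}, 2: {4}, 3: {1, 5}, 4: {6}}
hood = two_hop_neighbourhood(1, out_adj)
```

Candidate edges for a sampled node could then be restricted to pairs inside this set.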

== Idea A - huge CSV ==
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

* The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
* The node_j's would come from the union of outbound and inbound nodes (1,133,547 of them)

The length of this would be huge. Excluding the 37,689 self-pairs, the file would need about (37689 * 1133547) - 37689 = 42 722 215 194 rows.

Say each column took up 7 characters and there were 12 columns (i.e. 10 features); we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes.

If we just consider the 38k outbound nodes we'd still be dealing with a file of about 111 GB.
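
The size estimate can be reproduced with a few lines of Python. The sketch assumes 84 bytes per row as above, excludes self-pairs, and reports sizes in units of 2<sup>30</sup> bytes; the constant and function names are made up for illustration.

```python
SAMPLED_NODES = 37_689     # outbound (fully sampled) nodes
ALL_NODES = 1_133_547      # all unique nodes
BYTES_PER_ROW = 7 * 12     # 12 columns of 7 characters each


def csv_size_gib(rows):
    """Approximate file size in GiB (2**30 bytes) for the given row count."""
    return rows * BYTES_PER_ROW / 2**30


# Every (sampled node, any node) pair except a node paired with itself
rows_all = SAMPLED_NODES * ALL_NODES - SAMPLED_NODES
# Only pairs where both ends are sampled nodes
rows_sampled = SAMPLED_NODES * SAMPLED_NODES - SAMPLED_NODES

size_all = csv_size_gib(rows_all)
size_sampled = csv_size_gib(rows_sampled)
```

Either way the full table is measured in terabytes, which is the motivation for restricting to neighbourhoods or splitting node and edge features into separate files.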

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.

Suggestion on how to tame the size: Perhaps breaking into two csvs with one for data columns about nodes and another for columns about edges. For example:
* node_features.csv: node_id, inbound_degree, outbound_degree, clustering coefficient, etc, etc.
* edge_features.csv: node_i, node_j, shortest_distance, density, etc, etc
If both types of data are in one data file, we'd probably be duplicating the single-node-centric data points for every single edge row. I understand we might ultimately need to create such a single file, but I feel like two files will help keep it manageable as we identify and calculate feature data in the short term.

== Idea B - online learning ==
We could perform some kind of online learning on the network, where we compute features for a pair of nodes and then update the parameters. This would take 42 billion steps - which sounds like a lot.
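
One possible reading of this idea, sketched in Python: stream node pairs one at a time, compute their features, and apply a single stochastic-gradient update of a logistic regression model per pair, so the full table is never materialised. The model choice, learning rate, feature layout and names here are assumptions, not something the proposal specifies.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(weights, features, label, lr=0.1):
    """One online update of logistic regression on a single (features, label) pair."""
    z = sum(w * x for w, x in zip(weights, features))
    error = sigmoid(z) - label
    return [w - lr * error * x for w, x in zip(weights, features)]


def predict(weights, features):
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


# Toy stream (hypothetical data): features are [bias, reverse edge exists]
stream = [([1.0, 1.0], 1), ([1.0, 0.0], 0)] * 200
weights = [0.0, 0.0]
for features, label in stream:
    weights = sgd_step(weights, features, label)
```

Memory use stays constant in the number of pairs, although visiting all 42 billion pairs would still be slow without sampling.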

Can whoever added/proposed this please flesh this out more? Curious to explore alternative approaches for sure, even if they seem computationally difficult.

Machine Learning/Kaggle Social Network Contest/model

2010-11-20T04:59:29Z

Jjhale: /* Candidate models */

== Candidate models ==
* Logistic regression
* Supervised Random walk

Machine Learning/Kaggle Social Network Contest/lit review

2010-11-20T04:30:45Z

Jjhale: Created page with 'This page contains links to relevant articles and summaries of the papers. == Papers == === Supervised Random Walks === * title: "Supervised Random Walks: Predicting and Recom…'

This page contains links to relevant articles and summaries of the papers.

== Papers ==

=== Supervised Random Walks ===
* title: "Supervised Random Walks: Predicting and Recommending Links in Social Networks"
* authors: Lars Backstrom and Jure Leskovec
* [http://arxiv.org/abs/1011.4071 paper]
* '''Summary'''
** develop an algorithm based on Supervised Random Walks
** uses network structure info combined with node and edge level attributes to guide the walk
** learn a function to weight edges s.t. random walker more likely to visit nodes to which new links will be created (equivalent to missing nodes for our application)
** they develop a good training algorithm.
** test it on a Facebook network and on a co-author network
** compare to decision trees, logistic regression and unsupervised techniques.

Machine Learning/Kaggle Social Network Contest

2010-11-19T07:46:10Z

Jjhale: /* Status */

== Official Contest Links ==
* Overview: http://kaggle.com/socialNetwork
* Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)

== Official Data Downloads ==
* Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
* Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
* Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv

== Status ==
{| border="1" cellspacing="0" cellpadding="2"
|-
! |Tasks
!| Status
!| target date
!|subpage
|-
!|Load data
| started
| 11/24
| [[/load data| Load data]]
|-
!|Choose problem representation
| started
| 11/24
| [[/Problem Representation| Problem Representation]]
|-
!|Generate candidate features
| 0%
| 11/24
| [[/Features | Features]]
|-
!|fit to model
| 0%
|
| [[/model | Model]]
|-
!|Win competition
| 0%
|
| [[/what to do with all the money | Prize Plan]]
|}

== Key Contest Info ==
The data has been downloaded using the API of a social network. There are 7.2m contacts/edges of 38k users/nodes. These have been drawn randomly ensuring a certain level of closedness.

You are given 7,237,983 contacts/edges from a social network (social_train.zip). The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids reach from 1 to 1,133,547.

There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.

The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.

The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those 4,480 are true and 4,480 are false edges. You are tasked to predict which are true (1) and which are false (0). You need to supply back a file with outbound node id,inbound node id,[0,1] in each row. This means you can assign a probability of being true to an edge. You are being scored on the AUC. A random model will have an AUC of 0.5, so you need to try to do better than that (ie have a higher AUC). Your entry should conform to the format in sample_submission.csv.
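Since entries are scored on AUC, it may help to score candidate models locally on held-out edges before submitting. The following is a minimal sketch, not the contest's official scorer: the helper name <code>auc</code> is our own, and it uses the rank-sum (Mann-Whitney) formulation, counting tied scores as half.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formula.

    labels: sequence of 0/1 ints; scores: predicted probabilities.
    Tied scores receive their average rank, so ties count as half.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # find the run of equal scores starting at i
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A model that gives every edge the same score gets 0.5 from this function, matching the "random model" baseline in the contest text.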

You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain his approach/method to other users.

Don’t despair if your first couple of solutions score low, this is an explorative process.

== Brainstorming on Process ==
* We shouldn't have a single approach to solving the problem. If people have ideas they should run with them and report back their success/failure to the group. The collaboration between our diverse ideas/approaches/experiences will be our strength in working together.
* Since this is throw away code for this competition only, we need not get hung up on efficiency or elegant implementations. That said, if we hit a point where our code is not able to perform fast enough then we can address it at that point, instead of overengineering from the get-go.
* Theo suggested that we start by using things like python/ruby scripts to massage the starting data set into something more useful (with more features), then analyse and visualize that using things like R.
* Jared was wondering if people think it's legit to use the mailing list for discussion, or if we should create a discussion list for the competition to avoid spamming the main list with competition collaboration. (Update: Maybe we can use the wiki instead?)
* Also, as we transform the dataset into different views, we are going to end up with some large files that we will be passing around to each other. Any suggestions on how to best do that? Jared has been using Dropbox (see dumps below).

== Brainstorming on Strategy ==
* The dataset forms a graph of directed edges between vertices. At the core of this problem will be performing analysis on that graph. The first intuitive approach that came to mind was that the shorter the distance between two vertices using existing edges, the more likely it is that an edge could/should exist between those vertices.
* After the talk, Erin, Theo, and Jared stumbled on the idea that some vertices might be uber-followers (meaning more outbound edges than the average vertex) and that some vertices might be uber-followees (meaning more inbound edges than average). This reminded us of PageRank for link graphs, so perhaps we can draw from techniques in that vein. The application of this to our problem might be in weighting. For example, people who follow lots of people might be more likely to follow someone further out in their "network", whereas someone who doesn't follow many people might be less likely to follow someone outside their "network".
* Since the edges are directional, we know that it's possible for people to "follow" someone without that person "following back". At first glance it might make sense that the reverse edges would be likely in cases like this. However, consider a "hub" user with lots of followers who doesn't reciprocate with edges back to his followers: the information about who follows him is less important in determining whom he would follow. Conversely, for a user who commonly reciprocates with followbacks, the information on who follows her might be useful in suggesting whom she might follow.
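The uber-follower, uber-followee and follow-back ideas above can be turned into per-node numbers. The following is a rough sketch on a toy edge list; the function name <code>degree_and_reciprocity</code> and the choice of "fraction of outbound edges that are returned" as the reciprocation measure are assumptions for illustration, not something the group has settled on.

```python
from collections import defaultdict

def degree_and_reciprocity(edges):
    """Return {node: (outdegree, indegree, reciprocated_fraction)}.

    edges: iterable of (source, target) pairs, read as "source follows target".
    reciprocated_fraction is the share of a node's outbound edges that are
    returned by the other side (0.0 for nodes with no outbound edges).
    """
    out_nbrs = defaultdict(set)
    in_nbrs = defaultdict(set)
    for src, dst in edges:
        out_nbrs[src].add(dst)
        in_nbrs[dst].add(src)
    stats = {}
    for node in set(out_nbrs) | set(in_nbrs):
        followed = out_nbrs.get(node, set())
        followers = in_nbrs.get(node, set())
        mutual = len(followed & followers)
        frac = mutual / len(followed) if followed else 0.0
        stats[node] = (len(followed), len(followers), frac)
    return stats

# toy graph: 1 and 2 follow each other, 1 follows 3, 4 follows 1
toy_edges = [(1, 2), (2, 1), (1, 3), (4, 1)]
stats = degree_and_reciprocity(toy_edges)
```

Holding both neighbour sets for the full training file takes a fair amount of memory, so this may need to be run on a subset first.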

== Useful Links ==
* http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
* http://en.wikipedia.org/wiki/Glossary_of_graph_theory
* On calculating in & outdegrees: [http://books.google.com/books?id=CAm2DpIqRUIC&lpg=PA163&ots=HtNuxg3DOf&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PA163#v=onepage&q=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&f=false Google Book on Social Network Analysis]
* [http://books.google.com/books?id=Ww3_bKcz6kgC&lpg=PA67&ots=aFSGYEjA_g&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PP7#v=onepage&q&f=false Another Google Book on Social Network Analysis]
* Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
* "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm

== Working Data Dumps ==
* Adjacency list built from the training data: http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv
** First column: outbound vertex
** Remaining columns: list of vertices to which it points
** Note: Useful when loaded as a hashtable keyed on outbound vertex, returning the list.
* Adjacency list of the reversed graph: http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
** First column: inbound vertex
** Remaining columns: list of vertices which point to it
** Note: Useful for following the edges backwards quickly when loaded as a hashtable keyed on inbound vertex, returning the list.
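Loading one of these dumps as a hashtable might look like the sketch below. It assumes the dump is comma-delimited with a variable number of columns per row, which has not been checked against the actual files; <code>load_adjacency</code> is a hypothetical helper name, and the toy input stands in for the real file.

```python
import csv
import io

def load_adjacency(lines):
    """Parse rows of 'vertex,neighbour1,neighbour2,...' into a dict of lists.

    lines: any iterable of text lines, e.g. an open file object.
    """
    adjacency = {}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        vertex = int(row[0])
        # ignore empty trailing fields in case rows are padded
        adjacency[vertex] = [int(v) for v in row[1:] if v != ""]
    return adjacency

# toy stand-in for adj_list.out.csv
sample = io.StringIO("1,2,3\n2,1\n4,1\n")
adj = load_adjacency(sample)
```

For the real dump, pass an open file instead of the toy input, e.g. <code>load_adjacency(open('adj_list.out.csv', newline=''))</code>.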

== Possible Features ==
*nodeid
*nodetofollowid
*median path length
*shortest distance from nodeid to nodetofollowid
*inbound edges
*outbound edges
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future
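As an illustration of one of the features listed above, the shortest distance from nodeid to nodetofollowid can be found with a breadth-first search over an adjacency dict. This is a sketch only: the <code>max_depth</code> cutoff is an assumption added to keep the search cheap on a graph this size, and the function and variable names are hypothetical.

```python
from collections import deque

def shortest_distance(adjacency, source, target, max_depth=4):
    """Directed hop count from source to target using breadth-first search.

    adjacency: dict mapping a vertex to the list of vertices it points to.
    Returns None if target is not reachable within max_depth hops.
    """
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_depth:
            continue  # do not expand beyond the cutoff
        for nbr in adjacency.get(node, ()):
            if nbr == target:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return None

# toy chain 1 -> 2 -> 3 -> 4
toy = {1: [2], 2: [3], 3: [4], 4: []}
```

Returning <code>None</code> for "too far or unreachable" means the model needs some encoding for missing distances, for example a separate indicator column.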

Machine Learning/Kaggle Social Network Contest/Problem Representation

2010-11-19T07:41:12Z

Jjhale: /* Idea A */

== TODO ==
* come up with a plan of attack.

== Idea A ==
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

* The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
* The node_j's would come from the union of outbound and inbound nodes (1,133,547 of them)

The length of this would be huge. Excluding the 37,689 self-pairs, the file would need about (37689 * 1133547) - 37689 = 42 722 215 194 rows.

Say each column took up 7 characters and there were 12 columns (ie 10 features); we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes.

If we just consider the 38k outbound nodes we'd still be dealing with a 112 GB file.
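The size estimates above can be reproduced with a few lines of arithmetic. This sketch takes the node counts from the contest description and the 7-characters-by-12-columns row size from the text; it subtracts one self-pair per source node and reports sizes in units of 2^30 bytes, both of which are choices made here rather than taken from the page.

```python
OUTBOUND_NODES = 37_689   # sampled users, from the contest description
ALL_NODES = 1_133_547     # unique node ids, from the contest description
BYTES_PER_ROW = 7 * 12    # 12 columns of 7 characters, as assumed in the text

def candidate_rows(n_sources, n_targets):
    """Ordered (node_i, node_j) pairs, excluding the self-pairs i == j.

    Assumes every source node also appears among the targets.
    """
    return n_sources * n_targets - n_sources

def size_gib(rows, bytes_per_row=BYTES_PER_ROW):
    """File size in units of 2**30 bytes."""
    return rows * bytes_per_row / 2**30

full_rows = candidate_rows(OUTBOUND_NODES, ALL_NODES)
small_rows = candidate_rows(OUTBOUND_NODES, OUTBOUND_NODES)
print(full_rows, round(size_gib(full_rows)))
print(small_rows, round(size_gib(small_rows)))
```

Either way the full table is roughly 42.7 billion rows and a few thousand gigabytes, and restricting both ends to the outbound nodes still leaves a file on the order of 110 gigabytes.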

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.

== Idea B ==
We could perform some kind of online learning on the network, where we compute features based on a pair of nodes and then update the parameters. This would take 42 billion steps - which sounds like a lot.
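One way the online idea could look is a single logistic-regression update per node pair, so that no giant feature file is ever written. The sketch below is purely illustrative: the feature vector is a toy (a bias term plus one made-up feature), the learning rate is arbitrary, and the names <code>sigmoid</code> and <code>sgd_step</code> are hypothetical.

```python
import math
import random

def sigmoid(z):
    # clamp to avoid overflow in exp for extreme inputs
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, features, label, lr=0.1):
    """One online logistic-regression update for a single (node_i, node_j) pair."""
    pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
    error = label - pred
    return [w + lr * error * x for w, x in zip(weights, features)]

# Toy stream: features are [bias, hypothetical_feature] and the label is 1
# when the feature is positive. Real features would come from the graph.
rng = random.Random(0)
weights = [0.0, 0.0]
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    weights = sgd_step(weights, [1.0, x], 1 if x > 0 else 0)
```

Training on a sample of existing edges plus randomly drawn non-edges, rather than on every possible pair, would cut the number of steps well below 42 billion, at the cost of having to choose how to sample.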

Machine Learning/Kaggle Social Network Contest/Problem Representation

2010-11-19T07:40:05Z

Jjhale:

== TODO ==
* come up with a plan of attack.

== Idea A ==
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

* The node_i's would come from the set of sampled users (ie the 38k outbound nodes).
* The node_j's would come from the union of outbound and inbound nodes (1,133,547 of them)

The length of this would be huge. Excluding the 37,689 self-pairs, the file would need about (37689 * 1133547) - 37689 = 42 722 215 194 rows.

Say each column took up 7 characters and there were 12 columns (ie 10 features); we'd have a row of size 84 bytes. This makes it about 3,342 gigabytes.

(Note: if I have miscounted the number of unique nodes and there really are only 38k, we'd still be dealing with a 112 GB file.)

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.

== Idea B ==
We could perform some kind of online learning on the network, where we compute features based on a pair of nodes and then update the parameters. This would take 42 billion steps - which sounds like a lot.

Machine Learning/Kaggle Social Network Contest/model

2010-11-19T07:17:15Z

Jjhale: Created page with '== Candidate models == * Logistic regression * ??'

== Candidate models ==
* Logistic regression
* ??

Machine Learning/Kaggle Social Network Contest/what to do with all the money

2010-11-19T07:15:04Z

Jjhale:

In the event of winning the $950 the consensus on 17 November was that it should be awarded to Noisebridge (maybe in the form of a new projector for the back classroom?).

This is a space for discussing alternatives (but it is largely academic :)

Machine Learning/Kaggle Social Network Contest/what to do with all the money

2010-11-19T07:14:43Z

Jjhale: Created page with 'In the event of winning the $950 the consensus on 17 November was that it should be awarded to Nosiebridge (maybe in the form of a new projector for the back classroom? This is …'

In the event of winning the $950 the consensus on 17 November was that it should be awarded to Noisebridge (maybe in the form of a new projector for the back classroom?).

This is a space for discussing alternatives (but it is largely academic :)

Machine Learning/Kaggle Social Network Contest/Features

2010-11-19T07:12:13Z

Jjhale: Created page with '== Possible Features == *nodeid *nodetofollowid *median path length *shortest distance from nodeid to nodetofollowid *inbound edges *outbound edges *clustering coefficient *recip…'

== Possible Features ==
*nodeid
*nodetofollowid
*median path length
*shortest distance from nodeid to nodetofollowid
*inbound edges
*outbound edges
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Machine Learning/Kaggle Social Network Contest/Problem Representation

2010-11-19T07:11:03Z

Jjhale: Created page with '== TODO == * someone with large memory (>5.5GB) double check the number of unique nodes by loading it in networkx * come up with a plan of attack. == Idea A == Construct a huge…'

== TODO ==
* someone with large memory (>5.5GB) double check the number of unique nodes by loading it in networkx
* come up with a plan of attack.

== Idea A ==
Construct a huge csv file containing each possible directed link and a bunch of features associated with it, then do some supervised learning on it.

It would have the following format

node_i, node_j, feature_ij_1, feature_ij_2, ...

This file would be very long. Loading the first 3M rows of the edge list file gives 732,166 nodes, which means that the file would need (732 166^2) - 732 166 = 536 066 319 390 rows.

Say each column took up 7 characters and there were 12 columns (ie 10 features); we'd have a row of size 84 bytes. This makes it about 4.5 x10^13 bytes = 41 937 gigabytes.

This is just if we use the first 3 million rows.

(Note if I have miscounted the number of unique nodes and there really are only 38k we'd still be dealing with a 112 GB file.)

This number could be culled by considering just the nodes in some neighbourhood - but I figure that would only provide us with information about nodes which are connected.

== Idea B ==
We could perform some kind of online learning on the network, where we compute features based on a pair of nodes and then update the parameters. This would take 500 billion steps - which sounds like a lot (again just based on the first 3M rows from the edge file).

Machine Learning/Kaggle Social Network Contest/load data

2010-11-19T06:24:32Z

Jjhale: /* How to load the network into networkx */

== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges.

NOTE: John found that it took up about 5.5GB of memory to load the entire network. We may need to process it in chunks - or maybe decompose it into smaller sub networks.

Method 1
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

Method 2
<pre>
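import csv
import time

import networkx as nx

# Python 3 port of the original script: time.perf_counter() replaces
# time.clock() (removed in Python 3.8) and print is a function.
t0 = time.perf_counter()
DG = nx.DiGraph()

# Stream the edge list and add one directed edge per row.
with open('social_train.csv', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        DG.add_edge(int(row[0]), int(row[1]))

print("Loaded in", time.perf_counter() - t0, "s")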
</pre>

{| border="1"
|-
!|Rows
!| 1M
!| 2M
!| 3M
|-
!|Method 1
| 20s
| 53s
| 103s
|-
!|Method 2
| 15s
| 41s
| 86s
|}
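Given the memory note above, the node counts in the contest description can be checked without building a graph at all, by streaming the edge list and keeping only sets of ids. This is a sketch under the assumption that the file is a plain two-column comma-separated list with no header row; <code>count_nodes</code> is a hypothetical helper and the toy input stands in for <code>social_train.csv</code>.

```python
import csv
import io

def count_nodes(lines):
    """Stream 'outbound,inbound' rows and return the two sets of node ids."""
    outbound, inbound = set(), set()
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        outbound.add(int(row[0]))
        inbound.add(int(row[1]))
    return outbound, inbound

# toy stand-in for social_train.csv
sample = io.StringIO("1,2\n1,3\n2,1\n4,1\n")
outbound, inbound = count_nodes(sample)
print(len(outbound), len(inbound), len(outbound | inbound))
```

On the real file the three numbers should be comparable to the 37,689 outbound, 1,133,518 inbound and 1,133,547 unique nodes quoted in the contest info.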

Machine Learning/Kaggle Social Network Contest/load data

2010-11-19T06:19:28Z

Jjhale:

== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx or by manually adding edges.

Method 1
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

Method 2
<pre>
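import csv
import time

import networkx as nx

# Python 3 port of the original script: time.perf_counter() replaces
# time.clock() (removed in Python 3.8) and print is a function.
t0 = time.perf_counter()
DG = nx.DiGraph()

# Stream the edge list and add one directed edge per row.
with open('social_train.csv', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        DG.add_edge(int(row[0]), int(row[1]))

print("Loaded in", time.perf_counter() - t0, "s")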
</pre>

{| border="1"
|-
!|Rows
!| 1M
!| 2M
!| 3M
|-
!|Method 1
| 20s
| 53s
| 103s
|-
!|Method 2
| 15s
| 41s
| 86s
|}

Machine Learning/Kaggle Social Network Contest

2010-11-19T05:41:02Z

Jjhale: /* Status */

== Official Contest Links ==
* Overview: http://kaggle.com/socialNetwork
* Data Details: http://kaggle.com/socialNetwork?viewtype=data (login required)

== Official Data Downloads ==
* Official Training Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_train.csv
* Official Test Data File: http://dl.dropbox.com/u/14895843/social-network-kaggle/social_test.txt
* Official Sample Submission File: http://dl.dropbox.com/u/14895843/social-network-kaggle/sample_submission.csv

== Status ==
{| border="1" cellspacing="0" cellpadding="2"
|-
! |Tasks
!| Status
!| target date
!|subpage
|-
!|Load data
| started
| -
| [[/load data| Load data]]
|-
!|Choose problem representation
| started
| -
| [[/Problem Representation| Problem Representation]]
|-
!|Generate candidate features
| 0%
| 11/24
| [[/Features | Features]]
|-
!|fit to model
| 0%
| 11/24
| [[/model | Model]]
|-
!|Win competition
| 0%
| 11/24
| [[/what to do with all the money | Prize Plan]]
|}

== Key Contest Info ==
The data has been downloaded using the API of a social network. There are 7.2m contacts/edges of 38k users/nodes. These have been drawn randomly ensuring a certain level of closedness.

You are given 7,237,983 contacts/edges from a social network (social_train.zip). The first column is the outbound node and the second column is the inbound node. The ids have been encoded so that the users are anonymous. Ids reach from 1 to 1,133,547.

There are 37,689 outbound nodes and 1,133,518 inbound nodes. Most outbound nodes are also inbound nodes so that the total number of unique nodes is 1,133,547.

The way the contacts were sampled makes sure that the universe is roughly closed. Note that not every relationship is mutual.

The test dataset contains 8,960 edges from 8,960 unique outbound nodes (social_test.csv). Of those 4,480 are true and 4,480 are false edges. You are tasked to predict which are true (1) and which are false (0). You need to supply back a file with outbound node id,inbound node id,[0,1] in each row. This means you can assign a probability of being true to an edge. You are being scored on the AUC. A random model will have an AUC of 0.5, so you need to try to do better than that (ie have a higher AUC). Your entry should conform to the format in sample_submission.csv.

You are encouraged to explore techniques which explain the social network/graph. The best entrant should try to explain his approach/method to other users.

Don’t despair if your first couple of solutions score low, this is an explorative process.

== Brainstorming on Process ==
* We shouldn't have a single approach to solving the problem. If people have ideas they should run with them and report back their success/failure to the group. The collaboration between our diverse ideas/approaches/experiences will be our strength in working together.
* Since this is throw away code for this competition only, we need not get hung up on efficiency or elegant implementations. That said, if we hit a point where our code is not able to perform fast enough then we can address it at that point, instead of overengineering from the get-go.
* Theo suggested that we start by using things like python/ruby scripts to massage the starting data set into something more useful (with more features), then analyse and visualize that using things like R.
* Jared was wondering if people think it's legit to use the mailing list for discussion, or if we should create a discussion list for the competition to avoid spamming the main list with competition collaboration. (Update: Maybe we can use the wiki instead?)
* Also, as we transform the dataset into different views, we are going to end up with some large files that we will be passing around to each other. Any suggestions on how to best do that? Jared has been using Dropbox (see dumps below).

== Brainstorming on Strategy ==
* The dataset forms a graph of directed edges between vertices. At the core of this problem will be performing analysis on that graph. The first intuitive approach that came to mind was that the shorter the distance between two vertices using existing edges, the more likely it is that an edge could/should exist between those vertices.
* After the talk, Erin, Theo, and Jared stumbled on the idea that some vertices might be uber-followers (meaning more outbound edges than the average vertex) and that some vertices might be uber-followees (meaning more inbound edges than average). This reminded us of PageRank for link graphs, so perhaps we can draw from techniques in that vein. The application of this to our problem might be in weighting. For example, people who follow lots of people might be more likely to follow someone further out in their "network", whereas someone who doesn't follow many people might be less likely to follow someone outside their "network".
* Since the edges are directional, we know that it's possible for people to "follow" someone without that person "following back". At first glance it might make sense that the reverse edges would be likely in cases like this. However, consider a "hub" user with lots of followers who doesn't reciprocate with edges back to his followers: the information about who follows him is less important in determining whom he would follow. Conversely, for a user who commonly reciprocates with followbacks, the information on who follows her might be useful in suggesting whom she might follow.

== Useful Links ==
* http://en.wikipedia.org/wiki/Graph_theory#Graph-theoretic_data_structures
* http://en.wikipedia.org/wiki/Glossary_of_graph_theory
* On calculating in & outdegrees: [http://books.google.com/books?id=CAm2DpIqRUIC&lpg=PA163&ots=HtNuxg3DOf&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PA163#v=onepage&q=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&f=false Google Book on Social Network Analysis]
* [http://books.google.com/books?id=Ww3_bKcz6kgC&lpg=PA67&ots=aFSGYEjA_g&dq=calculate%20degree%20in%20%22directed%20graph%22%20OR%20digraph&pg=PP7#v=onepage&q&f=false Another Google Book on Social Network Analysis]
* Matrix Digraph Algs: http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/graphIntro.htm
* "Strongly Connected Components": http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/GraphAlgor/strongComponent.htm

== Working Data Dumps ==
* Adjacency list built from the training data: http://dl.dropbox.com/u/14895843/social-network-kaggle/adj_list.out.csv
** First column: outbound vertex
** Remaining columns: list of vertices to which it points
** Note: Useful when loaded as a hashtable keyed on outbound vertex, returning the list.
* Adjacency list of the reversed graph: http://dl.dropbox.com/u/14895843/social-network-kaggle/reverse_adj_list.out.csv
** First column: inbound vertex
** Remaining columns: list of vertices which point to it
** Note: Useful for following the edges backwards quickly when loaded as a hashtable keyed on inbound vertex, returning the list.

== Possible Features ==
*nodeid
*nodetofollowid
*median path length
*shortest distance from nodeid to nodetofollowid
*inbound edges
*outbound edges
*clustering coefficient
*reciprocation probability (num of edges returned / num of outbound edges)

The response variable is the probability that the nodeid to nodetofollowid edge will be created in the future

Machine Learning/Kaggle Social Network Contest/load data

2010-11-19T05:40:31Z

Jjhale: Created page with '== How to load the network into networkx == There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_i…'

== How to load the network into networkx ==
There is a network analysis package for Python called [http://networkx.lanl.gov/ networkx]. This package can be installed using easy_install.

The network can be loaded using the [http://networkx.lanl.gov/reference/generated/networkx.read_edgelist.html read_edgelist] function in networkx
eg:
<pre>
import networkx as nx
DG = nx.read_edgelist('social_train.csv', create_using=nx.DiGraph(), nodetype=int, delimiter=',')
</pre>

Loading 1M rows of the edge list this way took 21s on a Mac Pro with 3GB of memory and a 2.8GHz quad-core processor.

An alternative method, which runs quicker for me (Joe) at 15s for the same 1M rows, is to add the edges manually:

<pre>
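import csv
import time

import networkx as nx

# Python 3 port of the original script: time.perf_counter() replaces
# time.clock() (removed in Python 3.8) and print is a function.
t0 = time.perf_counter()
DG = nx.DiGraph()

# Stream the edge list and add one directed edge per row.
with open('social_train.csv', newline='') as f:
    for row in csv.reader(f, delimiter=','):
        DG.add_edge(int(row[0]), int(row[1]))

print("Loaded in", time.perf_counter() - t0, "s")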
</pre>

CS229

2010-11-08T21:29:55Z

Jjhale: /* Progress: Watching Lectures */

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunesU] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 Youtube]. We are going to be working through the course at one lecture a week starting 1 September 2010 and finishing in January 2011. There are four problem sets which we'll be doing one every 5 weeks.

The plan is to '''watch the lectures in your own time'''. We'll be discussing our solutions to problem sets every 5 weeks. Bring any questions about the course you have along to a meeting and there might be someone there who can help you out.

Please note:
* there is no instructor at Noisebridge - this is just a study group.
* We are taking the course at a slower rate than the actual course (which is currently in session at the farm).
* Not everyone is at the same point in the course - it's OK if you want to start today; there are others who have recently started too.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Kai
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
| Q1-4
|
|
|
|-
| Glen
|
|
|
|
|-
| Kai
| 1a,2a
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-10-21T02:13:52Z

Jjhale: /* Progress: Watching Lectures */ updating progress

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunesU] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 Youtube]. We are going to be working through the course at one lecture a week starting 1 September 2010 and finishing in January 2011. There are four problem sets which we'll be doing one every 5 weeks.

The plan is to '''watch the lectures in your own time'''. We'll be discussing our solutions to problem sets every 5 weeks. Bring any questions about the course you have along to a meeting and there might be someone there who can help you out.

Please note:
* there is no instructor at Noisebridge - this is just a study group.
* We are taking the course at a slower rate than the actual course (which is currently in session at the farm).
* Not everyone is at the same point in the course - it's OK if you want to start today; there are others who have recently started too.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Kai
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
| Q1-4
|
|
|
|-
| Glen
|
|
|
|
|-
| Kai
| 1a,2a
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-10-14T00:06:49Z

Jjhale: /* Progress: Watching Lectures */ Joe Lecture 6

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunesU] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 Youtube]. We are going to be working through the course at one lecture a week starting 1 September 2010 and finishing in January 2011. There are four problem sets which we'll be doing one every 5 weeks.

The plan is to '''watch the lectures in your own time'''. We'll discuss our solutions to the problem sets every 5 weeks. Bring any questions you have about the course to a meeting; someone there may be able to help you out.

Please note:
* There is no instructor at Noisebridge - this is just a study group.
* We are taking the course at a slower rate than the actual course (which is currently in session at the Farm).
* Not everyone is at the same point in the course - it's OK if you want to start today; there are others who have recently started too.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Kai
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
| Q1-4
|
|
|
|-
| Glen
|
|
|
|
|-
| Kai
| 1a,2a
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-10-05T03:11:07Z

Jjhale: /* Overview */ Clarified how we are running the study group

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

The plan is to '''watch the lectures in your own time'''. We'll discuss our solutions to the problem sets every 5 weeks. Bring any questions you have about the course to a meeting; someone there may be able to help you out.

Please note:
* There is no instructor at Noisebridge - this is just a study group.
* We are taking the course at a slower rate than the actual course (which is currently in session at the Farm).
* Not everyone is at the same point in the course - it's OK if you want to start today; there are others who have recently started too.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
| Q1-4
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

User:Jjhale

2010-09-23T05:54:09Z

Jjhale: Added pic

[[Image:20100421 tiny.jpg|frame|Joe Hale (with shorter hair)]]
Hi,

I'm [http://jjhale.com Joe Hale]. I'm interested in [[Machine Learning]] at Noisebridge and am working my way through the Stanford Machine Learning course [[CS229]].

File:20100421 tiny.jpg

2010-09-23T05:47:37Z

Jjhale: Photo of JJHale for user page

Photo of JJHale for user page

CS229

2010-09-23T05:39:36Z

Jjhale: /* Problem Sets from 2009 */

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229 Problem Set 1 q2y dat

2010-09-23T05:25:01Z

Jjhale: Created page with '<pre> 1.1717629e+00 1.8823554e+00 3.4282705e-01 2.1056512e+00 1.6476588e+00 2.3623765e+00 2.1211766e+00 -7.9712021e-01 2.0310951e+00 1.9795313e+00 …'

CS229 Problem Set 1 q2x dat

2010-09-23T05:24:00Z

Jjhale: Created page with '<pre> 1.2421431e+00 2.3348046e+00 1.3264331e-01 2.3469988e+00 6.7389056e+00 3.7088873e+00 1.1853350e+01 -1.8707854e+00 4.5024590e+00 3.2798363e+00 …'

CS229 Problem Set 1 q1y dat

2010-09-23T02:55:33Z

Jjhale: Created page with '<pre> 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 0.0000000e+00 …'

CS229 Problem Set 1 q1x dat

2010-09-23T02:52:49Z

Jjhale: Created page with '<pre> 1.3432504e+00 -1.3311479e+00 1.8205529e+00 -6.3466810e-01 9.8632067e-01 -1.8885762e+00 1.9443734e+00 -1.6354520e+00 9.7673352e-01 -1.3533151e+00 1.94…'

CS229

2010-09-23T02:50:30Z

Jjhale: /* Problem Sets from 2009 */

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-09-23T02:49:50Z

Jjhale: /* Problem Sets from 2009 */

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
* Problem set 1: [[File:CS229 ps1.pdf]]
** [[CS229 Problem Set 1 q1x dat]]
** [[CS229 Problem Set 1 q1y dat]]
** [[CS229 Problem Set 1 q2x dat]]
** [[CS229 Problem Set 1 q2y dat]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-09-23T02:42:34Z

Jjhale:

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====
[[File:CS229 ps1.pdf]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

File:CS229 ps1.pdf

2010-09-23T02:42:05Z

Jjhale: Problem set 1 from the CS229 Machine Learning course 2009

Problem set 1 from the CS229 Machine Learning course 2009

CS229

2010-09-23T02:40:26Z

Jjhale:

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==== Problem Sets from 2009 ====

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

CS229

2010-09-22T20:03:39Z

Jjhale:

== Overview ==
CS229 is the undergraduate machine learning course at Stanford. You can watch the lectures on [http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048#ls=1 iTunes U] and [http://www.youtube.com/results?search_query=stanford%20cs%20229&search=Search&sa=X&oi=spell&resnum=0&spell=1 YouTube]. We are working through the course at one lecture a week, starting 1 September 2010 and finishing in January 2011. There are four problem sets, which we'll do at a rate of one every 5 weeks.

[http://www.stanford.edu/class/cs229/ http://www.stanford.edu/class/cs229/]

=== Course Description ===

This course provides a broad introduction to machine learning and
statistical pattern recognition. Topics include: supervised learning
(generative/discriminative learning, parametric/non-parametric
learning, neural networks, support vector machines); unsupervised
learning (clustering, dimensionality reduction, kernel methods);
learning theory (bias/variance tradeoffs; VC theory; large margins);
reinforcement learning and adaptive control. The course will also
discuss recent applications of machine learning, such as to robotic
control, data mining, autonomous navigation, bioinformatics, speech
recognition, and text and web data processing.

== Schedule ==
* one lecture a week
* one problem set every five weeks

[http://www.google.com/calendar/embed?src=cWE3bGFpNnZxazdpamNjbmc4bXJsY2hyNGdAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ Google Calendar of schedule]

=== Supplemental Materials ===

[[File: CS229_sample_data.xls]]

==Progress: Watching Lectures ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Lecture 1
| Lecture 2
| Lecture 3
| Lecture 4
| Lecture 5 9/29
| Lecture 6
| Lecture 7
| Lecture 8
| Lecture 9
| Lecture 10 11/3
| Lecture 11
| Lecture 12
| Lecture 13
| Lecture 14
| Lecture 15 12/8
| Lecture 16
| Lecture 17
| Lecture 18
| Lecture 19
| Lecture 20 1/12
|-
| Thomas
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
| [[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Joe
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Glen
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jared
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Dave
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| Jason
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|[[Image:Gold-star.jpg|center|30px|Gold-Star]]
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|-
| You!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|}

==Progress: Assignments ==
{| border="1" cellspacing="0" cellpadding="5" align="center"
| Name
| Problem set 1 due 9/29
| Problem set 2 due 11/3
| Problem set 3 due 12/8
| Problem set 4 due 1/20
|-
| Thomas
|
|
|
|
|-
| Joe
|
|
|
|
|-
| Glen
|
|
|
|
|-
| Jared
|
|
|
|
|-
| You!
|
|
|
|
|-
|}

User:Jjhale

2010-09-22T20:01:17Z

Jjhale:

Interested in [[Machine Learning]] at Noisebridge.

User:Jjhale

2010-09-22T20:00:15Z

Jjhale: Created page with 'Interested in Machine Learning at Noisebridge.'

Interested in Machine Learning at Noisebridge.