KDD Cup Research Paper

Our group ranked 108th in KDD Cup 2014; this paper describes the approach we used.

CS4642 - Data Mining & Information Retrieval
KDD Cup 2014 Submission Paper

100112V - E.A.S.D.Edirisinghe
100132G - W.V.D.Fernando
100440A - R.H.T.D.Ranasinghe
100444N - M.C.S.Ranathunghe
1. Introduction

KDD Cup is the annual Data Mining and Knowledge Discovery competition organized by the ACM Special Interest Group on Knowledge Discovery and Data Mining, the leading professional organization of data miners [1]. Examples from past years include KDD Cup 2010 (student performance evaluation) and KDD Cup 2009 (customer relationship prediction). The organizers of KDD Cup 2014 used the Kaggle platform, which hosts one of the world's largest communities of data scientists who compete with each other to solve complex data science problems.

The task for KDD Cup 2014 is to predict exciting projects on DonorsChoose.org, a United States–based nonprofit organization that allows individuals to donate directly to public school classroom projects. Public school teachers post their project requests on the DonorsChoose.org site, and any interested party can donate any amount of money (minimum $1) to a project. When a project reaches its funding goal, DonorsChoose.org ships the materials to the school. If a partially funded project expires (at most 4 months from the date posted, unless the teacher has set an earlier deadline), donors get their donations returned as account credits. The 2014 KDD Cup asks participants to help DonorsChoose.org identify, at the time of posting, projects that are exceptionally exciting to people. While all projects on the site fulfill some kind of need, certain projects have a quality above and beyond what is typical. By identifying and recommending such projects early, the organization can improve funding outcomes, improve the user experience, and help more students receive the materials they need to learn.

2. Description of available data

● donations.csv - Information about the donations to each project, provided only for projects in the training set. It also includes details about the donor and the payment.
● essays.csv - Project text posted by the teachers: a title, a need statement, a short description, and an essay for each project. Provided for both the training and test sets.
● projects.csv - Information about each project: school information, teacher information, the focus areas/subjects of the project, etc. Provided for both the training and test sets.
● resources.csv - Information about the resources requested for each project: how many items are requested, the unit price of an item, the vendor name, the resource type, etc. Provided for both the training and test sets.
● outcomes.csv - Information about the outcomes of projects in the training set. The field is_exciting is given here for each project in the training set.
● sampleSubmission.csv - The project ids of the test set, showing the submission format for the competition.
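Our modelling work was split between R and Python; purely as an illustration of how these files fit together, a minimal pandas loading sketch (file names are as listed above, and column names such as projectid are assumptions):

    import pandas as pd

    # Load the competition files (assumed to be in the working directory).
    projects = pd.read_csv("projects.csv")
    essays = pd.read_csv("essays.csv")
    resources = pd.read_csv("resources.csv")
    outcomes = pd.read_csv("outcomes.csv")     # training set only
    donations = pd.read_csv("donations.csv")   # training set only

    # Projects with a known outcome form the training set; the rest are the test set.
    train = projects.merge(outcomes[["projectid", "is_exciting"]], on="projectid", how="inner")
    test = projects[~projects["projectid"].isin(outcomes["projectid"])]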
2. Data Analysis and Pre-processing

First we collected all the data and looked at the basic statistics of each of the features in the dataset. Through this analysis we found that the data demanded some pre-processing before it could be mined. One of the first issues we identified was the incompleteness of the data: some features in the data set had missing values. The most significant were the following features:
1. fulfilment_labor_materials - data not available for 35082 projects
2. students_reached - data not available for 146 projects
3. secondary_focus_subject - data not available for 207893 projects
4. secondary_focus_area - data not available for 207893 projects
We used several methods to estimate the missing data. For the secondary_focus_subject and secondary_focus_area features, the missing values were filled with the values of the primary_focus_subject and primary_focus_area features of the same project. For the other two features we used multiple imputation [2] to generate the missing values, using the 'mice' [3] package available in R. After the imputation we were able to use the completed data set for analysis.
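The imputation itself was done with R's mice package; as an illustration only, a MICE-style chained-equations imputation can be sketched with scikit-learn's IterativeImputer (the column names follow the list above, and treating both columns as numeric is an assumption):

    import pandas as pd
    # IterativeImputer is experimental and must be enabled explicitly.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer

    projects = pd.read_csv("projects.csv")

    # Fill the missing secondary focus fields from the primary ones, as described above.
    projects["secondary_focus_subject"] = projects["secondary_focus_subject"].fillna(
        projects["primary_focus_subject"])
    projects["secondary_focus_area"] = projects["secondary_focus_area"].fillna(
        projects["primary_focus_area"])

    # MICE-style imputation for the remaining numeric gaps (analogous to R's mice).
    num_cols = ["fulfilment_labor_materials", "students_reached"]
    projects[num_cols] = IterativeImputer(random_state=0).fit_transform(projects[num_cols])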
2.1 Analysis of data

Upon closer inspection of the data we found some interesting patterns.

2.1.1 Relationship between exciting projects and the project's accepted date

The figure showing the number of 'exciting' projects for each year makes it clear that there are no exciting projects before the year 2010. According to the discussions in the competition forums, the consensus is that this is because DonorsChoose.org did not keep track of several of the factors that determine the "excitingness" of a project before 2010. So we decided to consider only the projects submitted after 2010 as our training data.

We can also see that projects submitted at the end of summer and the beginning of fall have a higher chance of becoming exciting projects. We attribute this to people donating more towards the start of the new school session in the fall. The overall trend in the number of exciting projects in each year can be seen in the graph below.
2.1.2 Exciting projects by the state where the school is situated

Projects proposed in some states have a higher chance of becoming exciting projects than projects from others. In particular, a project proposed in California has the highest probability of becoming an exciting project, so there appears to be a significant relationship between the state where the school is situated and a project becoming exciting.

2.1.3 Exciting projects based on the required resource type

Some resource requirements generate more exciting projects than others. Projects with technology and supply requirements have a higher probability of being exciting projects.
2.1.4 Exciting projects based on the poverty level of the school

Projects that contribute to schools with a higher poverty level have a higher chance of becoming exciting projects. As the chart shows, more than half of the exciting projects come from schools at the highest poverty level.

2.1.5 Relationship between exciting projects and the target grade level of the project

Projects that cater to younger students generate more exciting projects. Projects aimed at PreK-2 students and Grades 3-9 have a higher percentage of exciting projects.

2.1.6 Relationship between exciting projects and the essay word count

We needed a simple way of using the essay data, which is the largest data set among the available files, so we checked for a relationship between essay length and a project becoming exciting. Our analysis, shown in the graph below, suggests that such a relationship exists.
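As an illustration of how this word-count feature can be computed from essays.csv (the essay column name and the 't'/'f' coding of is_exciting are assumptions), a short pandas sketch:

    import pandas as pd

    essays = pd.read_csv("essays.csv")
    outcomes = pd.read_csv("outcomes.csv")

    # Word count of each project essay as a simple numeric feature.
    essays["essay_word_count"] = essays["essay"].fillna("").str.split().str.len()

    # Compare essay lengths for exciting vs. non-exciting training projects.
    merged = essays.merge(outcomes[["projectid", "is_exciting"]], on="projectid")
    print(merged.groupby("is_exciting")["essay_word_count"].describe())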
Based on the relationships above and further analysis of the data, we decided on a set of features to be used in our training models; these are listed in Appendix I. Some categorical features that contained a large number of categories could not be used directly with the libraries we used for data mining, so those features were encoded using one-hot encoding, which represents a categorical feature as a set of dummy numerical features (one per category).

Another step we took before applying the dataset to the models was to break the available data into a training set and a testing set for cross-validation purposes. We used the 'split' functionality available in R so that both the training set and the testing set have a similar distribution of exciting projects.
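The encoding and split were done in R; for illustration, an equivalent one-hot encoding and stratified split in Python (the feature columns chosen here are examples drawn from the analysis above, not the full Appendix I list):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    projects = pd.read_csv("projects.csv")
    outcomes = pd.read_csv("outcomes.csv")
    train = projects.merge(outcomes[["projectid", "is_exciting"]], on="projectid")

    # Simple fill for the few missing students_reached values (see the imputation sketch above).
    train["students_reached"] = train["students_reached"].fillna(train["students_reached"].median())

    # One-hot encode categorical features (example columns only).
    categorical = ["school_state", "resource_type", "poverty_level", "grade_level"]
    X = pd.get_dummies(train[categorical + ["students_reached"]], columns=categorical)
    y = (train["is_exciting"] == "t").astype(int)

    # Hold out a validation set with the same proportion of exciting projects.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)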
3. Main Methods used

3.1 Random Forests

Random Forests [4] is an ensemble learning method widely used in data mining tasks. It uses a set of decision trees, generated during training, and lets them vote on the classification of new data. In the Random Forest method we generate a set of classification trees using the following steps:
1. For each tree, a set of cases is sampled from the original training set with replacement, and this sample is used to grow the tree.
2. For each tree, a random subset of the available features is selected, and training is done using only these features.
3. Each classification tree is grown to its full extent; no pruning is used.
The generated trees are then used to classify new items: each tree in the forest votes on the classification of the new item, and the majority decision is used. If we are estimating the probability of belonging to a class, the mean of the probabilities output by the trees is used.

Our decision to use Random Forest models was based on the following advantages of the method, among others:
1. High accuracy - Random Forests are known to be highly accurate.
2. Ease of use - Random Forests are easy to generate even with very large datasets.

In our project we used the 'randomForest' [5] package available in R to generate a Random Forest model. To overcome the class imbalance caused by exciting projects making up only about 5% of the data set, we used stratified sampling so that exciting projects make up around 25% of each sample drawn from the training set. We were able to generate good results with this model: it achieved a score of 0.61986 on the public leaderboard on its own. The result could have been improved if we had used a proper procedure to tune the model parameters; we attempted this with the caret package in R but failed due to performance issues.
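Our Random Forest was built with R's randomForest and its stratified sampling options; as a rough Python illustration of the same idea, scikit-learn's class re-weighting can stand in for that stratified sampling (variables continue from the split sketch above, and the parameter choices are placeholders):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # class_weight="balanced_subsample" re-weights classes within each bootstrap
    # sample, a rough stand-in for the stratified sampling used with R's randomForest.
    rf = RandomForestClassifier(n_estimators=500,
                                class_weight="balanced_subsample",
                                n_jobs=-1,
                                random_state=42)
    rf.fit(X_train, y_train)

    # Score predicted probabilities on the held-out split (AUC shown as an example metric).
    print(roc_auc_score(y_valid, rf.predict_proba(X_valid)[:, 1]))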
3.2 Gradient Boosting Regression Trees

Gradient Tree Boosting [6], or Gradient Boosted Regression Trees (GBRT), is a generalization of boosting to arbitrary differentiable loss functions. It uses decision trees as the weak model and a loss function to calculate the residual. GBRT is an accurate and effective procedure that can be used for both regression and classification problems, and it is applied in a variety of areas including web search ranking and ecology.

The advantages of GBRT are:
● Natural handling of data of mixed type (heterogeneous features)
● Predictive power
● Robustness to outliers in output space (via robust loss functions)
The disadvantages of GBRT are:
● Scalability: due to the sequential nature of boosting, it can hardly be parallelized.

We used the gradient boosting regression trees algorithm from the Scikit-learn [7] Python library to build our model. In this method there are a few hyper-parameters which should be specified carefully in order to yield the optimum result:
● number of regression trees (n_estimators) - Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
● depth of each individual tree (max_depth) - The maximum depth limits the number of nodes in the tree.
● loss function (loss) - The loss function to be optimized. Four functions can be used off the shelf: 'ls', 'lad', 'huber' and 'quantile'. 'ls' refers to least squares regression; 'lad' (least absolute deviation) is a highly robust loss function based solely on order information of the input variables; 'huber' is a combination of 'ls' and 'lad'; 'quantile' allows quantile regression.
● learning rate (learning_rate) - The learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators.

3.2.1 Hyper-parameter tuning

One of the pitfalls of machine learning in general is over-fitting the model to the training data, and GBRT is highly vulnerable to over-fitting if the hyper-parameters are not tuned properly. To tune the hyper-parameters we followed these steps:
1. Chose 'lad' (least absolute deviation) as the loss function, since it is highly robust and the running time of the search was limited.
2. Picked n_estimators as large as computationally possible (1000 estimators).
3. Tuned max_depth, learning_rate, min_samples_leaf, and max_features via grid search [8].
4. Increased n_estimators further and tuned learning_rate again, holding the other parameters fixed.
The result of the hyper-parameter tuning was max_features=1.0, learning_rate=0.1, max_depth=6, min_samples_leaf=9. We were able to achieve a score of 0.61875 on the public leaderboard using this model alone.
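A condensed sketch of this scikit-learn model and grid search, continuing from the earlier split sketch (the grid values are illustrative, not the exact grid we searched; loss="absolute_error" is the current name for the 'lad' loss used here):

    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import make_scorer, roc_auc_score
    from sklearn.model_selection import GridSearchCV

    # The 0/1 target is treated as a regression target so the model outputs a
    # continuous, probability-like score.
    gbrt = GradientBoostingRegressor(loss="absolute_error", n_estimators=1000, random_state=42)

    # Grid over the hyper-parameters discussed above (values are illustrative).
    param_grid = {
        "max_depth": [4, 6, 8],
        "learning_rate": [0.05, 0.1],
        "min_samples_leaf": [3, 9, 17],
        "max_features": [1.0, 0.5],
    }
    auc = make_scorer(roc_auc_score)  # rank-based scoring of the regression output
    search = GridSearchCV(gbrt, param_grid, scoring=auc, cv=3, n_jobs=-1)
    search.fit(X_train, y_train)
    print(search.best_params_)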
3.3 Logistic Regression

Logistic regression is used to predict a binary response, i.e. the outcome of a categorical dependent variable (a class label), based on one or more predictor variables (features). For example, in this contest we can use logistic regression to predict whether a project is exciting or not based on the features identified in the data pre-processing section. Logistic regression uses the function below, with features x1 to xk, each feature having its own weight:

    P(is_exciting = 1 | x) = 1 / (1 + e^-(w0 + w1*x1 + w2*x2 + ... + wk*xk))

The output of this function always lies between 0 and 1 regardless of the weights or features, so we can use it to predict a value for is_exciting that lies between 0 and 1.

3.3.1 Implementation

There are two packages we considered in R when implementing logistic regression: glm and glmnet. Both fit generalized linear models, which generalize linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value. Comparing the two, we found that the glm package fails to handle large numbers of columns because of memory constraints, whereas glmnet handles big data efficiently and can work with R factor values without modification. So we chose the glmnet package to create our model. The glmnet package can deal with all shapes of data, including very large sparse data matrices, and it fits linear, logistic, multinomial, Poisson, and Cox regression models [9]. We considered the following parameters when using the glmnet package:
● x - input matrix, of dimension nobs x nvars; each row is an observation vector, and it can be in sparse matrix format. We used the important columns identified in the data pre-processing and analysis section.
● y - response variable; in this case, whether the project was exciting or not.
● alpha - the elastic-net mixing parameter, with 0 ≤ α ≤ 1. alpha=1 is the lasso penalty and alpha=0 the ridge penalty.
● lambda - a user-supplied lambda sequence. Typical usage is to have the program compute its own lambda sequence based on nlambda and lambda.min.ratio; supplying a value of lambda overrides this.
In addition, the glmnet package requires the 'Matrix' package in order to create a sparse matrix from the important columns we identified.

3.3.2 Hyper-parameter tuning

Using the test set we created by splitting the training data in the pre-processing section, we tuned the values of alpha and lambda to give the best results. We observed that when alpha is very close to the ridge penalty the model tends to give good results on the test set, so we used alpha = 0.001. We also used lambda = 0.2, which gave good results on the test set we prepared.

3.3.3 Results

The model gave fairly good results in the competition, scoring 0.61 on the public leaderboard and 0.60 on the private leaderboard.
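The model was fitted with R's glmnet; as a rough Python illustration of the same elastic-net logistic regression (scikit-learn's l1_ratio plays the role of glmnet's alpha, while C is an inverse regularization strength rather than glmnet's lambda, so the values below are only indicative; data variables continue from the split sketch above):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # penalty="elasticnet" with a very small l1_ratio approximates the near-ridge
    # setting (alpha = 0.001) chosen for glmnet; C stands in (inversely) for lambda.
    logreg = LogisticRegression(penalty="elasticnet", solver="saga",
                                l1_ratio=0.001, C=1.0, max_iter=2000)
    logreg.fit(X_train, y_train)
    print(roc_auc_score(y_valid, logreg.predict_proba(X_valid)[:, 1]))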
3.4 Ensembling

Our final submission was an ensemble of the three methods described above. We used a simple weighted ensemble: each method was assigned a weight and the final prediction was computed by combining the outputs of the three methods using those weights.
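The weights themselves are not given here; a minimal sketch of such a weighted combination (the weights and the test feature matrix X_test are placeholders, and the model objects come from the sketches above):

    import numpy as np

    # Predicted scores from the three models on the test features (X_test is assumed).
    p_rf = rf.predict_proba(X_test)[:, 1]
    p_gbrt = search.best_estimator_.predict(X_test)
    p_logreg = logreg.predict_proba(X_test)[:, 1]

    # Simple weighted average; these weights are placeholders, not the ones we used.
    w = np.array([0.4, 0.4, 0.2])
    is_exciting_pred = w[0] * p_rf + w[1] * p_gbrt + w[2] * p_logreg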
4. Other methods tested

4.1 Neural Networks

Artificial neural networks provide a general, practical method for learning real-valued, discrete-valued, and vector-valued functions from examples. Algorithms such as backpropagation use gradient descent to tune network parameters to best fit a training set of input-output pairs. Neural network learning is robust to errors in the training data and has been successfully applied to problems such as interpreting visual scenes, speech recognition, and learning robot control strategies.

We used the PyBrain [10] Python library to build a neural network trained with the backpropagation algorithm. While training the neural network, we faced a number of problems:

1. Number of hidden layers to be used

Number of Hidden Layers | Representational capability
none | Only capable of representing linearly separable functions or decisions.
1 | Can approximate any function that contains a continuous mapping from one finite space to another.
2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions, and can approximate any smooth mapping to any accuracy.

The table above summarizes the knowledge we acquired by going through various research papers. Unfortunately we were unable to find a specific method for determining the number of hidden layers, so we tested various numbers of hidden layers ranging from 2 to 50. We were unable to increase the number of hidden layers further because of the huge amount of time taken by the network training phase.

2. Number of neurons in each hidden layer

We were unable to find any specific formula to calculate the number of neurons in a particular hidden layer, although we found many rule-of-thumb methods, such as the following:
● The number of hidden neurons should be between the size of the input layer and the size of the output layer.
● The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
● The number of hidden neurons should be less than twice the size of the input layer.
We applied the above rules, and in addition tried to choose the number of neurons in a hidden layer based on the Fibonacci number series.

3. Neural network training time

Training the neural network took a lot of time, so we did not have time to investigate pruning algorithms [11] or to tune the network using genetic algorithms [12].

However, all the prediction results obtained through the neural network model performed poorly compared to the other models during cross-validation.
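A minimal PyBrain sketch of this kind of network (the hidden layer sizes and number of epochs are placeholders, and the feature matrices continue from the split sketch above):

    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer
    from pybrain.tools.shortcuts import buildNetwork

    n_features = X_train.shape[1]

    # One output unit for the is_exciting score; two hidden layers with placeholder sizes.
    net = buildNetwork(n_features, 64, 32, 1)

    # Copy the training examples into a PyBrain dataset.
    ds = SupervisedDataSet(n_features, 1)
    for features, label in zip(X_train.values, y_train.values):
        ds.addSample(features, [label])

    # Backpropagation training; each call to train() runs one epoch and returns the error.
    trainer = BackpropTrainer(net, ds)
    for epoch in range(10):
        print(trainer.train())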
4.2 Classification and Regression Trees (CART) and Conditional Inference Trees

The idea behind these models is to build a tree by splitting on variables; to predict the outcome for an observation, we follow the splits and, at the leaves, predict the most frequent outcome. The motivation for this approach is the interpretability of how a decision was reached, which is not possible with logistic regression methods. We used the 'rpart' and 'party' packages to implement these trees in R (rpart for CART and ctree from party for conditional inference trees). The problem with these methods was that they did not provide good results, because the data set is heavily imbalanced in favour of "non-excitingness".

4.2.1 CART

CART tries to split the data into subsets so that each subset is as homogeneous as possible, and the tree is generated based on how these splits are made. The number of splits can be controlled by setting a lower bound on the split size with the 'minbucket' parameter. If the splits are too small, over-fitting will occur; if they are too large, the tree becomes too coarse and under-fits the data.
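We built these trees with R's rpart and party packages; as an illustration only, rpart's minbucket corresponds roughly to min_samples_leaf in scikit-learn's decision tree (the values below are placeholders, and the data variables continue from the split sketch above):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    # min_samples_leaf plays a role similar to rpart's minbucket: a lower bound
    # on the number of training cases allowed in each leaf of the tree.
    cart = DecisionTreeClassifier(min_samples_leaf=100,
                                  class_weight="balanced",  # mitigate the imbalance noted above
                                  random_state=42)
    cart.fit(X_train, y_train)
    print(roc_auc_score(y_valid, cart.predict_proba(X_valid)[:, 1]))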