GoFundMe Predictive Analysis

Derek Leung
Data Science Student Society @ UC San Diego
13 min read · Apr 1, 2021


Written By: Derek Leung, Emily Chen, Gauri Samith, and Thy Nguyen

Introduction

GoFundMe is an American for-profit crowdfunding platform. Any user can easily set up an account and begin a fundraising campaign. Most fundraisers on the website focus on raising money for education or tuition costs, volunteer programs, festivals, funerals, and, quite commonly, medical expenses for severe injuries and illnesses. A campaign begins with a preset fundraising goal that it hopes to achieve. Unlike websites such as Kickstarter that work on an all-or-nothing basis, any amount donated to the cause goes to the campaign owner, regardless of whether the goal is reached.

However, some fundraisers clearly do much better than others. The main aim of our project is to identify which factors associated with a fundraiser predict the amount of money it raises, and to use them to accurately predict the final amount raised by any given fundraiser on the platform. Additionally, we aim to use text descriptions, weighted by their success and amount raised, to auto-generate text for an arbitrary fundraiser. Achieving these goals could make designing fundraising campaigns on such platforms more efficient and effective, as well as further our knowledge about campaigns and marketing in general.

Data

We obtained the data by web-scraping the GoFundMe website. We first visited the site to understand how the data is organized. Each fundraiser page includes the campaign’s name/title, location, description, goal, and amount raised. Clicking a campaign’s link also surfaces the contact information of the organizers and beneficiaries, which helps us verify the accuracy of the campaign. Beyond monetary donations, GoFundMe allows users to leave comments or messages of support for the people involved. This design also lets GoFundMe collect user data with every interaction: for example, the site can track how much users donate over time through the “Donate Now” button, which campaigns they are interested in, and whether they share a campaign outside GoFundMe. Users who wish to remain anonymous still have these activities recorded, but without any personal identifiers attached. Finally, the website publishes “Fundraiser Stats” summarizing the key information for each campaign, such as the total number of donors, shares, and followers.

After understanding the website’s structure, we began web scraping. From the main page, we scraped data by fundraising category: we varied the category slug at the end of the https://www.gofundme.com/discover/ link to generate campaign URLs for each category. To keep computing time reasonable, we limited the number of campaigns collected per category by clicking “Show More” at most five times. We then used a headless browser from the selenium package to load each page live at the time of scraping, and finally used regular expressions to extract specific fields from each campaign page.
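As a rough illustration, a scraping loop along these lines could be written with selenium and re; the category slugs, the “Show more” button text, and the campaign URL pattern below are assumptions that would need to match the live page structure, not our exact code.

```python
# Minimal sketch of the category-by-category scrape with a headless browser.
import re
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")            # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

campaign_urls = []
for category in ["medical", "education", "emergency"]:     # example category slugs
    driver.get(f"https://www.gofundme.com/discover/{category}-fundraiser")
    for _ in range(5):                        # click "Show More" up to five times
        try:
            driver.find_element(By.XPATH, "//button[contains(., 'Show more')]").click()
            time.sleep(2)                     # give the new campaign cards time to load
        except Exception:
            break
    # pull campaign links out of the rendered HTML with a regular expression
    campaign_urls += re.findall(r"https://www\.gofundme\.com/f/[\w-]+", driver.page_source)

driver.quit()
```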

Consequently, the obtained data frame consists of 13 columns. After data cleaning, we begin our analysis: 7 of the 13 columns are used for the Exploratory Data Analysis, 3 columns are used to predict the amount raised by fundraisers, and 1 column is used for auto-text generation. Most columns are categorical variables, except for those that pertain to donation statistics; for example, “Amount_Raised”, “Goal”, “Number_of_Donations”, “FB_Shares”, “Number_of_Donors”, and “Followers” are continuous numerical variables.

Background

The goal of our project was to investigate what makes a campaign successful. We decided to look at which specific factors heavily influence how much a campaign will raise, or how successful it will be relative to its goal. Additionally, we aimed to develop a model that could accurately predict how much money a campaign would raise based on these factors and the text of the campaign’s description. We found two studies that took a similar approach to determining predictors of campaign success based on factors such as category and text.

The study “What Contributes to a Crowdfunding Campaign’s Success? Evidence and Analyses from GoFundMe Data,” by Xupin Zhang, Hanjia Lyu, and Jiebo Luo of the University of Rochester, examined the performance of GoFundMe crowdfunding campaigns across a wide variety of funding categories and identified factors that were important for fundraising. The project used language topic modeling and computer vision methods to extract features from the image and text descriptions, and used these features to see which ones mattered for specific categories. The authors then developed a fusion analytic framework that combined both textual and visual descriptions of the campaigns to predict a campaign’s success and outcomes. Within the text descriptions, they found specific words associated with success in particular categories. For instance, words like “bio” and “health” were positively correlated with the chance of success in the “Sports, Teams & Clubs” category, possibly because such campaigns relate to health issues or are raising money for medical treatment. However, they noted that they did not find sufficient evidence to conclude that there is a relationship between the description and a campaign’s success overall. For the images, they discovered that campaigns with higher-quality cover images were more successful, and that cover images such as happy family pictures or the faces of elders led donors to respond more positively, likely out of sympathy. They noted that this is mainly noticeable for medical categories, as people are more willing to donate to life-or-death urgencies than to less urgent fundraisers like weddings or competitions. The relationships between categories, text descriptions, and images are a meaningful area for our project to explore as well.

Similarly, the study “The Language that Gets People to Give: Phrases that Predict Success on Kickstarter” by Tanushree Mitra and Eric Gilbert of the Georgia Institute of Technology looked into the phrases that get people to donate and predict success. The authors scraped the textual content from each campaign’s homepage and identified a set of control variables. They developed three models, Null, Controls-Only, and Phrases + Controls, and reported the percentage cross-validation and prediction errors for ridge and lasso regression. Using the Phrases + Controls model, they found that phrases appealing to social identity, liking, scarcity, social proof, and authority were particularly effective. The authors acknowledged that there is certainly more predictive information beyond the text and their list of control variables. Our project takes a similar approach in its Natural Language Processing and auto-text generation, and understanding why specific words and their implications stood out in certain categories versus others could be a future consideration for us.

Investigation (Ridge Regression)

To answer the main questions of our analysis, we implemented a ridge regression model, primarily using the scikit-learn package. We started by standardizing the numerical data, in addition to the typical cleaning, casting, and encoding of categorical variables.

After that, we split the data into training and testing sets, using a 70-30 split for this analysis. Next, we fit the model and took a look at its accuracy on the training data.
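A minimal sketch of these steps with scikit-learn might look like the following; the file name, the feature list, and the “Category” column are illustrative assumptions rather than our exact code.

```python
# Sketch of preprocessing, the 70-30 split, and the initial ridge fit.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

df = pd.read_csv("gofundme_campaigns.csv")            # hypothetical file name

numeric_cols = ["Goal", "Number_of_Donations", "FB_Shares",
                "Number_of_Donors", "Followers"]
X = pd.get_dummies(df[numeric_cols + ["Category"]], columns=["Category"])  # encode categoricals
y = df["Amount_Raised"]

X[numeric_cols] = StandardScaler().fit_transform(X[numeric_cols])  # standardize numeric features

# 70-30 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = Ridge()                                        # alpha tuned later with RidgeCV
model.fit(X_train, y_train)
print("Training R^2:", model.score(X_train, y_train))
```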

We found high accuracy with the training data, as we might expect. Following this, we went on to try out the model on the test data.

With an accuracy of about 76% on the test data, we consider this a decent model. However, our small sample size led to high variance in the r squared value depending on the random seed, so we implemented a bootstrapping method to estimate what the r squared value should be.
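One plausible reading of this bootstrapping step is to repeat the fit over many random 70-30 splits and summarize the resulting r squared values; the sketch below follows that assumption and reuses X and y from the earlier sketch, with the number of iterations chosen arbitrarily.

```python
# Estimate a stable test R^2 by refitting over many random splits and taking the median.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

test_r2 = []
for seed in range(1000):                               # number of resamples is an assumption
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    test_r2.append(Ridge().fit(X_tr, y_tr).score(X_te, y_te))

print("Median test R^2:", np.median(test_r2))
```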

From this, we find a median r squared value of 0.7849, a good indication that the model predicts accurately.

Theory (Ridge Regression)

In our project, we wanted to examine which factors influence the amount raised by a GoFundMe campaign, and to build a model that could accurately predict that amount based on those factors. One of the first steps in regression analysis is to check for multicollinearity, where the independent variables have strong linear relationships with each other. This is a problem because it can produce inaccurate estimates of the regression coefficients and degrade the predictive power of the model.

In ordinary least squares (OLS) regression, the formula below is used to estimate the coefficients:

$\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$

Here, XᵀX is the correlation matrix of all the predictors, and Y is the vector of the dependent variable. OLS assumes that (XᵀX)⁻¹ exists. However, under multicollinearity the determinant of XᵀX is zero or nearly zero, so the matrix is singular (or nearly so) and its inverse is undefined or unstable. This can cause the parameter estimates to have extremely high variance.

[Figure: matrix of the independent variables and their correlation coefficients]

Ridge regression modifies XᵀX so that its determinant is no longer zero, which mitigates multicollinearity and yields more stable estimates. XᵀX is modified by introducing a ridge parameter, λ (also called the alpha value). The alpha value controls how far the ridge estimates move away from the OLS estimates. The parameter is incorporated into the formula below:

$\hat{\beta}_{ridge} = (X^{T}X + \lambda I)^{-1}X^{T}Y$

Here, I is the identity matrix and λ is the ridge parameter, alpha. Under multicollinearity, the columns of the correlation matrix are not independent, making the matrix singular. Adding the ridge term λI breaks up these dependent relationships between columns and makes the matrix invertible.

The ridge parameter, alpha, is not learned by the model, so after standardizing the dataset we selected the optimal alpha for ridge regression using RidgeCV. The best alpha value turned out to be zero.
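A sketch of that search with RidgeCV is shown below; the candidate alpha grid is an assumption, and X_train and y_train come from the earlier sketch.

```python
# Select the ridge alpha by cross-validation over a grid of candidates.
import numpy as np
from sklearn.linear_model import RidgeCV

alphas = np.logspace(-4, 4, 50)                    # candidate alpha values (assumed grid)
ridge_cv = RidgeCV(alphas=alphas).fit(X_train, y_train)
print("Selected alpha:", ridge_cv.alpha_)          # the write-up above reports a best alpha of zero
```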

Using the Ridge function with an alpha value of zero, we fit the best model on the training data and evaluated it on both the training and testing sets. With this best-fit model, we plotted the distribution of r squared scores for both datasets to gauge the model’s overall accuracy in predicting the amount raised.

We also tested the prediction accuracy of our model without the number of donations, number of donors, and followers, because these features were all highly correlated with each other. The means and medians of the r squared values for the training and testing datasets were much lower without them, indicating that these factors are strong predictors of the amount raised.
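A brief sketch of that comparison, reusing the feature matrix and split from the earlier sketch:

```python
# Refit after dropping the highly correlated donation statistics and compare the test R^2.
from sklearn.linear_model import Ridge

reduced_cols = [c for c in X_train.columns
                if c not in ("Number_of_Donations", "Number_of_Donors", "Followers")]

reduced_model = Ridge().fit(X_train[reduced_cols], y_train)
print("Test R^2 without donation stats:", reduced_model.score(X_test[reduced_cols], y_test))
```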

NLP

We wanted to look into whether the text descriptions of the fundraisers could be used to create good predictive features. We decided to use NLP to create these features from the text descriptions and use them to predict the category of a particular fundraiser.

Our first step was to preprocess the text using the ‘re’ (regular expressions) and ‘nltk’ (Natural Language Toolkit) packages. We extracted keywords from the text descriptions, removing non-alphanumeric characters (punctuation), phone numbers, URLs, IP addresses, etc. We also identified named entities in each description; instead of replacing them with a space, we replaced each with the indicator token ‘NLP’, which lets us account for how often named entities show up in the descriptions.
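An illustrative version of this preprocessing step is sketched below; the regular expressions are simplified, and the named-entity handling uses nltk’s ne_chunk as one possible implementation rather than our exact pipeline.

```python
# Clean a description: strip URLs/phone numbers, replace named entities with 'NLP',
# drop punctuation, and lowercase the remaining keywords.
import re
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("maxent_ne_chunker", quiet=True)
nltk.download("words", quiet=True)

def preprocess(description: str) -> str:
    # remove URLs and simple phone-number patterns
    text = re.sub(r"https?://\S+|\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", " ", description)
    tokens = nltk.word_tokenize(text)
    # replace named-entity chunks (people, organizations, places) with the indicator 'NLP'
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    words = ["NLP" if isinstance(node, nltk.Tree) else node[0] for node in tree]
    # strip non-alphanumeric characters and lowercase
    words = [re.sub(r"[^a-z0-9]", "", w.lower()) for w in words]
    return " ".join(w for w in words if w)
```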

This left us with a set of lowercase keywords for each fundraiser description. The preprocessed data was then split 3:1 into training and test sets using the scikit-learn package. We used two main representations for the text: a bag-of-words model and a TF-IDF vector. Again using scikit-learn, we transformed our data with each representation and used a classifier to generate predicted categories for the test set.
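A sketch of this comparison with scikit-learn follows; the choice of classifier (MultinomialNB) and the “Category” label column are assumptions, and preprocess and df come from the earlier sketches.

```python
# Compare bag-of-words and TF-IDF representations with a simple text classifier.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

texts = df["text"].astype(str).apply(preprocess)     # cleaned descriptions
labels = df["Category"]                              # hypothetical label column
X_train_txt, X_test_txt, y_train_cat, y_test_cat = train_test_split(
    texts, labels, test_size=0.25, random_state=0)   # the 3:1 split described above

for name, vec in [("Bag-of-words", CountVectorizer()), ("TF-IDF", TfidfVectorizer())]:
    Xtr = vec.fit_transform(X_train_txt)
    Xte = vec.transform(X_test_txt)
    preds = MultinomialNB().fit(Xtr, y_train_cat).predict(Xte)
    print(name,
          "accuracy:", accuracy_score(y_test_cat, preds),
          "weighted F1:", f1_score(y_test_cat, preds, average="weighted"))
```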

The last step was to evaluate which text representation led to more accurate predictions, using the accuracy score and the weighted F1 score as metrics. Comparing the predicted and actual test-set labels for each text representation gave the following scores:

Bag-of-words: accuracy 100, weighted F1 score 0.4632

TF-IDF: accuracy 112, weighted F1 score 0.5119

Thus, for our data, the TF-IDF representation led to more accurate category predictions. We also looked into the most popular words across the fundraisers:

As expected, words commonly associated with fundraisers, such as ‘help’, ‘support’, and ‘community’, came up in this list. Notably, ‘nlp’ and ‘nlps’ (our named-entity indicator) also made the list, which suggests that identifying people and organizations within a text description is relevant.
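A simple way to produce such a list, reusing the cleaned texts from the sketch above:

```python
# Tally the most frequent keywords across all cleaned descriptions.
from collections import Counter

word_counts = Counter(word for doc in texts for word in doc.split())
print(word_counts.most_common(20))
```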

Text Generation

Text generation uses computers to automatically produce natural language; notable applications include the automatic generation of reports, news, and documentation. We perform text generation at two levels: word level and character level. In general, the model is trained on a large enough amount of input data to predict the next word or character. To implement this, we use TensorFlow, Keras, and Long Short-Term Memory networks (LSTMs). In brief, TensorFlow is an open-source machine learning platform, and Keras is a deep learning API written in Python that runs on top of TensorFlow.

Before working on text generation, we clean the data by removing special characters, namely / ( ) { } [ ] | @ , and ;, and by excluding stop words with the help of the nltk package. We also extract each token (either a word or a character) from the sentences in the ‘text’ column of the data frame.

For word-level generation, we use each set of fifty consecutive words to predict the next word in the sentence. Splitting our data into windows of 51 words, we use the first 50 words of each window as the input (X) and the last word as the target (y). To keep training efficient, we cap the total number of words at 200,000. We store all of the training sequences in a list called lines, and we build an LSTM model using TensorFlow. Tokenizer, from tensorflow.keras.preprocessing.text, is used to map each word to an integer, and to_categorical converts those integer labels into a binary class matrix. More detail about the imported libraries will be provided shortly.

Since Tokenizer vectorizes the text corpus, we pass oov_token = “<OOV>” to the Tokenizer object so that words outside the training vocabulary are mapped to a dedicated out-of-vocabulary token rather than being dropped. fit_on_texts() builds the internal vocabulary, while texts_to_sequences() turns the texts into integer sequences.

Now, we convert the sequences list into a numpy array to make training more efficient.

Here, one-hot encoding is used to turn the integer labels in the ‘y’ array into vectors of 1.0s and 0.0s, with a single 1.0 marking the correct next word.
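Putting these steps together, a sketch of the sequence preparation might look like this; the corpus handling is simplified and assumes the descriptions in the ‘text’ column are already cleaned, so every window keeps its 51 tokens.

```python
# Build 51-word windows, tokenize them, and one-hot encode the target word.
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

corpus = " ".join(df["text"].astype(str)).split()[:200_000]   # cap the corpus at 200,000 words
lines = [" ".join(corpus[i - 50: i + 1]) for i in range(50, len(corpus))]  # 51-word windows

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(lines)                                  # build the word-to-integer vocabulary
sequences = np.array(tokenizer.texts_to_sequences(lines))
vocab_size = len(tokenizer.word_index) + 1

X_words, y_words = sequences[:, :-1], sequences[:, -1]         # first 50 words -> next word
y_words = to_categorical(y_words, num_classes=vocab_size)      # one-hot encode the targets
```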

Finally, we are ready to train the model. Our LSTM model is a Sequential() model that embeds input sequences of 50 words, passes them through two stacked LSTM layers of 100 units each, and then through a Dense layer with 100 units. We add an activation function so that the network can fit a curve rather than just a straight line; here we use relu (rectified linear unit), commonly defined as max(0, x).

A second activation function, softmax, is applied at the output layer to turn the network’s raw scores into a vector of probabilities over the vocabulary.

To better understand the model, we can print its summary via model.summary(). Finally, the model is fit with a batch size of 256 for 100 epochs. This concludes text generation at the word level.
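A sketch of this word-level model is shown below; the layer sizes follow the description above, while the embedding dimension and optimizer are assumptions. It reuses vocab_size, X_words, and y_words from the sequence-preparation sketch.

```python
# Word-level LSTM: embedding -> two 100-unit LSTM layers -> Dense(100, relu) -> softmax output.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(vocab_size, 50, input_length=50),   # 50-word input sequences; dimension assumed
    LSTM(100, return_sequences=True),             # two stacked 100-unit LSTM layers
    LSTM(100),
    Dense(100, activation="relu"),
    Dense(vocab_size, activation="softmax"),      # probabilities over the vocabulary
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()
model.fit(X_words, y_words, batch_size=256, epochs=100)
```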

For character-level generation, the first steps are similar to those of word-level generation: we perform the same data preprocessing, tokenization, and vectorization as described above. The sequence length in this case is 40 characters, meaning the model learns from the first 40 characters to predict the 41st.

Now we build another LSTM model using Sequential() from TensorFlow. The model consists of an LSTM layer with 128 hidden units followed by a 40-unit Dense layer with a softmax activation. We set the learning rate to 0.01 and use categorical_crossentropy as the loss function.

We fit the model with a batch size of 128 for 60 epochs, and use a callback to print sample output after each epoch.
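A sketch of the character-level pipeline, following the classic Keras character-RNN setup, is below; the window step size is an assumption, the output layer has one unit per distinct character (the 40 units mentioned above), and the callback here just prints the loss where the project’s callback printed generated text.

```python
# Character-level LSTM: one-hot 40-character windows -> 128-unit LSTM -> softmax over characters.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import LambdaCallback

text = " ".join(df["text"].astype(str)).lower()
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}
n_chars = len(chars)

# one-hot encode 40-character windows (X_chars) and the following character (y_chars)
windows = [text[i: i + 40] for i in range(0, len(text) - 40, 3)]   # step of 3 is assumed
next_chars = [text[i + 40] for i in range(0, len(text) - 40, 3)]
X_chars = np.zeros((len(windows), 40, n_chars), dtype=bool)
y_chars = np.zeros((len(windows), n_chars), dtype=bool)
for i, window in enumerate(windows):
    for t, ch in enumerate(window):
        X_chars[i, t, char_to_idx[ch]] = 1
    y_chars[i, char_to_idx[next_chars[i]]] = 1

char_model = Sequential([
    LSTM(128, input_shape=(40, n_chars)),          # 128-unit LSTM over 40-character windows
    Dense(n_chars, activation="softmax"),          # one unit per distinct character
])
char_model.compile(loss="categorical_crossentropy", optimizer=RMSprop(learning_rate=0.01))

# print progress after each epoch; the project's callback printed generated text here
epoch_callback = LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(f"epoch {epoch}: loss {logs['loss']:.3f}"))
char_model.fit(X_chars, y_chars, batch_size=128, epochs=60, callbacks=[epoch_callback])
```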

Overall, word-level text generation improves by the end of the 100 epochs: more meaningful words appear, following the patterns in the training text. Character-level generation, however, becomes less meaningful by the end; in particular, when we increase the sampling diversity, allowing the generated characters to deviate more from what the training data suggests, we obtain almost no real words. More training would therefore be needed to improve the model’s performance.
