7 Data Set-up: Tidying and Feature Selection

Before you can begin working with your data you must make sure that each row is a single observation, and each column is a single variable/predictor. This type of data set-up or “wrangling” is known as “tidy data”. There are many readily available tutorials and textbooks that help you understand tidy data and how to clean and wrangle data to make it tidy. I recommend the Tidyverse chapter in the R data science textbook to start. Thankfully, the dataset from ManyDogs is already tidy so for this tutorial we can skip this step.

After your data is tidy, the next step before is to complete feature selection. Feature selection is a fancy term for removing variables you aren’t going to analyze and creating new ones by computing any necessary variables. Many of the changes you complete in this step are decided on through domain knowledge: applying what you know about the field of research to make judgement calls on what is and isn’t important to the model/data. The rest of our decisions depend on our specific research hypotheses. Anything in the data set that does not specifically pertain to our research hypotheses need to be eliminated to increase statistical power and to compute the model faster to save computing resources. Commonly deleted variables at this stage might include meta-data such as time of day when a survey was completed, or individual scale items when we have calculated the total scores. When domain knowledge doesn’t suffice because you are working in a relatively new field, or past studies have conflicting information, you will want to let machine learning algorithms help choose what to keep and eliminate. See the regularization section for more information.

In our ManyDogs example, our feature selection will include transforming all the levels of scales from words to values, add 0s to denote no household dogs other than the one in the study, computing average scores for each section of the CBARQ, and computing our binary outcome variables. We will then cut any remaining columns that will not be needed for analysis.

Reminder: The following code should be copied and pasted to get you to the needed dataset that we will be using in this tutorial.

#The following code creates a new dataframe where all of the scales are changed from words to numbers so they can be used as discrete categorical data in the models we will create in this tutorial
manydogs_data_transformed <- manydogs_data |>
  mutate(across(c(cbarq_train_1:cbarq_train_8), ~case_when(   #Change all scales from words to numbers
    . == "Never" ~0,
    . == "Seldom" ~1,
    . == "Sometimes" ~2,
    . == "Usually" ~3,
    . == "Always" ~4
  ))) |> 
  mutate(across(c(cbarq_aggression_1:cbarq_aggression_27), ~case_when(
    . == "No aggression" ~ 0, 
    . == "Mild aggression" ~ 1,
    . == "Moderate aggression" ~2,
    . == "High aggression" ~3, 
    . == "Serious aggression" ~4
  ))) |> 
  mutate(across(c(cbarq_fear_1:cbarq_fear_18), ~case_when(
    . == "No fear" ~ 0, 
    . == "Mild fear" ~ 1,
    . == "Moderate fear" ~2,
    . == "High fear" ~3, 
    . == "Extreme fear" ~4
  ))) |> 
  mutate(across(c(cbarq_separation_1:cbarq_separation_8), ~case_when(
    . == "Never" ~0,
    . == "Seldom" ~1,
    . == "Sometimes" ~2,
    . == "Usually" ~3,
    . == "Always" ~4
  ))) |> 
  mutate(across(c(cbarq_excitability_1:cbarq_excitability_6), ~case_when(
    . == "No excitability" ~ 0, 
    . == "Mild excitability" ~ 1,
    . == "Moderate excitability" ~2,
    . == "High excitability" ~3, 
    . == "Extreme excitability" ~4
  ))) |> 
  mutate(across(c(cbarq_attachment_1:cbarq_attachment_6), ~case_when(
    . == "Never" ~0,
    . == "Seldom" ~1,
    . == "Sometimes" ~2,
    . == "Usually" ~3,
    . == "Always" ~4
  ))) |> 
  mutate(across(c(cbarq_miscellaneous_1:cbarq_miscellaneous_27), ~case_when(
    . == "Never" ~0,
    . == "Seldom" ~1,
    . == "Sometimes" ~2,
    . == "Usually" ~3,
    . == "Always" ~4
  ))) |> 
  mutate(sex = case_when(sex == "Female" ~1, sex == "Male" ~2),
         desexed = case_when(desexed == "Yes" ~1, desexed == "No" ~2),
         purebred = case_when(purebred == "Yes"~1, purebred == "No" ~2),
         gaze_follow = case_when(gaze_follow == "Never" ~ 1, 
                                 gaze_follow == "Seldom" ~2,
                                 gaze_follow == "Sometimes"~3,
                                 gaze_follow == "Usually" ~4,
                                 gaze_follow == "Always" ~5),
         num_household_dogs = ifelse(is.na(num_household_dogs), 0, num_household_dogs)) |>  #add 0s to indicate when no other dogs were in the household
  mutate(across(c(cbarq_train_1:cbarq_train_8,cbarq_aggression_1:cbarq_miscellaneous_27), as.numeric))  #change character columns to numeric columns


#Create a data frame that calculates composite scores for each subset of the scale and computes our three binary outcome variables
manydogs_feature_selection <- manydogs_data_transformed |> 
  mutate(training_score = rowMeans(select(manydogs_data_transformed,   #compute average scores of training
                             starts_with("cbarq_train_")), na.rm = TRUE),
         aggression_score= rowMeans(select(manydogs_data_transformed,   #compute average scores of aggression
                             starts_with("cbarq_aggression_")), na.rm = TRUE),
         fear_score= rowMeans(select(manydogs_data_transformed,    #compute average scores of fear
                         starts_with("cbarq_fear_")), na.rm = TRUE),
         separation_score= rowMeans(select(manydogs_data_transformed,    ##compute average scores of separation issues
                              starts_with("cbarq_separation_")), na.rm = TRUE),
         excitability_score = rowMeans(select(manydogs_data_transformed,    #compute average scores of excitability
                                 starts_with("cbarq_excitability_")), na.rm = TRUE),
         attachment_score= rowMeans(select(manydogs_data_transformed,     #compute average scores of attachment
                              starts_with("cbarq_attachment_")), na.rm = TRUE),
         miscellaneous_score= rowMeans(select(manydogs_data_transformed,    #compute average scores of miscellaneous behavior issues
                                 starts_with("cbarq_miscellaneous_")), na.rm = TRUE),
         ostensive = rowMeans(select(manydogs_data_transformed, 
                                     starts_with("ostensive_")), na.rm = TRUE), #create proportion correct on ostensive task
         ostensive_binary = ifelse(ostensive <= 0.5, 0, 1), #create column that notes if a dog performed over or under chance at the ostensive task
         nonostensive = rowMeans(select(manydogs_data_transformed,
                                        starts_with("nonostensive_")), na.rm = TRUE), #create proportion correct on nonostensive task
         nonostensive_binary = ifelse(nonostensive <= 0.5, 0, 1),   #create column that notes if a dog performed over or under chance at the nonostensive task
         nonos_best = ifelse(nonostensive > ostensive, 1, 0)) |>   #create column that notes if the dog was better at the nonostensive task
  select(c(sex:purebred,gaze_follow,training_score:nonos_best))#grab the columns we will be using for the analysis in this tutorial