Made with ML

Stratify

For our multi-class task (where each project has exactly one tag), we want to ensure that the data splits have similar class distributions. We can achieve this by passing the labels to the stratify keyword argument of sklearn’s train_test_split() function.

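from sklearn.model_selection import train_test_split

# Stratify on the tag column so both splits keep similar class proportions
train_df, val_df = train_test_split(df, stratify=df.tag, test_size=test_size, random_state=1234)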
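To see what stratification does, here is a minimal sketch on toy data. The label names, class counts and the 0.2 test size are assumptions made for illustration and are not taken from the project dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced toy labels: 80 "nlp" samples and 20 "cv" samples.
tags = ["nlp"] * 80 + ["cv"] * 20
samples = list(range(len(tags)))

# Passing the labels to `stratify` keeps class proportions similar in both splits.
train_x, val_x, train_y, val_y = train_test_split(
    samples, tags, stratify=tags, test_size=0.2, random_state=1234
)

train_counts = Counter(train_y)
val_counts = Counter(val_y)
print("train:", dict(train_counts))
print("val:  ", dict(val_counts))
```

With these toy counts, both splits should keep the original 80/20 ratio between the two classes. Without stratify, the minority class could end up over- or under-represented in the validation split purely by chance.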

Iterate on data

Instead of using a fixed dataset and iterating on the models, we could keep the model constant and iterate on the dataset. This is useful for improving the quality of our datasets. Ways to do this include:

  • remove or fix data samples (false positives & negatives)
  • prepare and transform features
  • expand or consolidate classes
  • incorporate auxiliary datasets
  • identify unique slices to boost
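As one illustration of the "expand or consolidate classes" step, the sketch below folds rare tags into a single catch-all class. The helper name consolidate_rare_tags, the min_freq threshold and the "other" label are hypothetical choices for this example, not part of the original project.

```python
from collections import Counter


def consolidate_rare_tags(tags, min_freq, other="other"):
    """Replace tags that appear fewer than `min_freq` times with `other`."""
    counts = Counter(tags)
    return [tag if counts[tag] >= min_freq else other for tag in tags]


# Hypothetical tag list with two rare classes ("graphs" and "audio").
tags = ["nlp"] * 5 + ["cv"] * 4 + ["graphs"] + ["audio"]
new_tags = consolidate_rare_tags(tags, min_freq=2)
print(Counter(new_tags))
```

Consolidating rare classes is a trade-off: it can leave the model enough samples per class to learn from, but it discards the distinction between the merged tags.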

Embeddings

The main idea of embeddings is to have fixed-length representations for the tokens in a text, regardless of the number of tokens in the vocabulary. With one-hot encoding, each token is represented by an array of size vocab_size, but with embeddings, each token is represented by a vector of size embed_dim. The values in the representation are not fixed binary values but learned floating-point values, which allows for fine-grained representations.
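The difference in shapes can be sketched with NumPy. The sizes, the token ids and the randomly initialized matrix below are assumptions for illustration; in a real model the embedding matrix would be learned during training.

```python
import numpy as np

vocab_size, embed_dim = 10, 4  # hypothetical sizes
rng = np.random.default_rng(0)

# One row per token in the vocabulary (random here, learned in practice).
embedding_matrix = rng.normal(size=(vocab_size, embed_dim))

token_ids = np.array([2, 7, 7, 1])  # a toy sequence of 4 token ids

# One-hot: each token is a binary vector of size vocab_size.
one_hot = np.eye(vocab_size)[token_ids]

# Embedding: each token is a dense float vector of size embed_dim.
embedded = embedding_matrix[token_ids]

print(one_hot.shape)   # (4, 10)
print(embedded.shape)  # (4, 4)

# An embedding lookup is equivalent to multiplying the one-hot vectors
# by the embedding matrix, just without materializing the one-hot array.
assert np.allclose(one_hot @ embedding_matrix, embedded)
```

The one-hot representation grows with the vocabulary, while the embedded representation stays at embed_dim values per token no matter how large the vocabulary becomes.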