- Calculate the percentage of missing values in each column and sort them in descending order.
- Missing values and outliers are not problems to be fixed! They are facts.
- During EDA, do not “fix” them: the goal is to understand your data and your problem as they are.
- If you see missing values, just report them.
- Identify and understand your target variable.
- Understand the type of the target variable: binary, categorical, or numeric.
- Examine the distribution of the target variable.
- For a binary variable (convert it to 0s and 1s first if it is stored as strings), use the mean, which equals the proportion of 1s.
- For a categorical variable, use value counts.
- For a numeric variable, use a histogram or pandas' describe() table.
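The checks listed above can be sketched in pandas roughly as follows. This is a minimal illustration, not a prescribed workflow: the helper names `missing_percentages` and `summarize_target`, the toy DataFrame, and its column names are all hypothetical and chosen only for the example.

```python
import pandas as pd


def missing_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def summarize_target(target: pd.Series, kind: str):
    """Summarize a target column; kind is 'binary', 'categorical', or 'numeric'."""
    if kind == "binary":
        # Assumes the column is already encoded as 0/1, so the mean is the share of 1s.
        return target.mean()
    if kind == "categorical":
        return target.value_counts()
    if kind == "numeric":
        return target.describe()
    raise ValueError(f"unknown target kind: {kind}")


# Hypothetical toy data, for illustration only.
df = pd.DataFrame(
    {
        "age": [25, None, 40, 31],
        "city": ["A", None, None, "B"],
        "churned": ["yes", "no", "no", "yes"],
    }
)

# Report missing values; nothing is imputed or dropped here.
missing = missing_percentages(df)

# Convert a string-valued binary target to 0/1 before taking the mean.
churned01 = df["churned"].map({"no": 0, "yes": 1})
churn_rate = summarize_target(churned01, "binary")
city_counts = summarize_target(df["city"], "categorical")
age_summary = summarize_target(df["age"], "numeric")
```

For a numeric target, a histogram (for example `df["age"].plot.hist()`, which requires matplotlib) complements the `describe()` table by showing the shape of the distribution.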
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer


def train():
    train_dataset = load_dataset("tatsu-lab/alpaca", split="train")
    tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", trust_remote_code=True)