r4dm

Calculate the percentage of missing values in each column and sort them in descending order.
1. Missing values and outliers are not problems to be fixed! They are facts.
2. During EDA you must not “fix” them, because you have to deal with your data and your problem as they are.
3. If you see missing values, just report them.
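
The reporting step above can be sketched with pandas. This is a minimal illustration, not a prescribed method: the `missing_percentage` helper and the toy DataFrame are hypothetical names and data introduced only for this example.

```python
import numpy as np
import pandas as pd


def missing_percentage(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    # isna() marks missing cells as True; the column mean of booleans is the
    # fraction of missing cells, which is scaled to a percentage.
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Hypothetical toy data, used only to show the shape of the report.
toy = pd.DataFrame(
    {
        "age": [25, np.nan, 31, np.nan],
        "city": ["Oslo", "Lima", None, "Rome"],
        "income": [50_000, 62_000, 58_000, 71_000],
    }
)

report = missing_percentage(toy)
print(report)
```

The function only reports; it does not drop or impute anything, which matches the advice to leave the data as it is during EDA.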
Identify and understand your target variable.
1. Understand the type of the target variable: binary, categorical, or numeric.
2. Examine the distribution of the target variable.
  1. For a binary variable, use the mean, which is the proportion of 1s. If the variable is stored as strings, convert it to 0s and 1s first.
  2. For a categorical variable, use value counts.
  3. For a numeric variable, use a histogram or the pandas `describe` table.
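
The three cases above could look like this in pandas. It is a sketch under stated assumptions: the `summarize_target` helper, its `kind` argument, and the small example series are all hypothetical and exist only to illustrate the idea.

```python
import pandas as pd


def summarize_target(target: pd.Series, kind: str):
    """Summarize a target variable according to its type.

    `kind` is a hypothetical argument: "binary", "categorical", or "numeric".
    """
    if kind == "binary":
        # Expects 0/1 values; the mean is the proportion of 1s.
        return float(target.mean())
    if kind == "categorical":
        # Number of rows per category, most frequent first.
        return target.value_counts()
    if kind == "numeric":
        # Count, mean, std, min, quartiles, and max.
        return target.describe()
    raise ValueError(f"unknown target kind: {kind!r}")


# Binary target stored as strings: map it to 0s and 1s before taking the mean.
churn = pd.Series(["yes", "no", "no", "yes", "no"])
churn01 = churn.map({"no": 0, "yes": 1})
positive_rate = summarize_target(churn01, "binary")

# Categorical target: value counts.
plan_counts = summarize_target(pd.Series(["a", "b", "a", "c", "a"]), "categorical")

# Numeric target: describe table.
price_summary = summarize_target(pd.Series([1.0, 2.0, 3.0, 4.0]), "numeric")

print(positive_rate)
print(plan_counts)
print(price_summary)
```

For the numeric case, `target.hist()` gives the histogram instead of the table; it needs matplotlib installed, so it is left out of the sketch.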

	import torch
	from datasets import load_dataset
	from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training
	from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
	from trl import SFTTrainer


	def train():
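		# Load the training split of the Alpaca instruction dataset.
		train_dataset = load_dataset("tatsu-lab/alpaca", split="train")
		# trust_remote_code=True allows the custom tokenizer code in the model repo to run.
		tokenizer = AutoTokenizer.from_pretrained("Salesforce/xgen-7b-8k-base", trust_remote_code=True)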