To split a dataset into training and testing sets with scikit-learn, use train_test_split(). The function holds out part of the data so that a model can be evaluated on samples it was not trained on.
Here’s a basic guide:
1. Import Necessary Libraries
Import train_test_split from sklearn.model_selection.
from sklearn.model_selection import train_test_split
2. Prepare Your Data
Assume you have a dataset with features X and target labels y. Typically, X is a 2D array of shape (n_samples, n_features), and y is a 1D array with one label per sample.
Example:
import numpy as np
# Features (X) - Example 2D array
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
# Labels (y) - Example 1D array
y = np.array([0, 1, 0, 1, 0])
3. Split the Dataset
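You can now split the data into training and test sets. By default, train_test_split() shuffles the data and, when neither test_size nor train_size is given, holds out 25% of the samples for testing and uses the remaining 75% for training. You can adjust the ratio with the test_size parameter. A fractional test size is rounded up to a whole number of samples, so with only five samples test_size=0.25 yields 2 test samples and 3 training samples.
# Split the data, requesting 25% of the samples for the test set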
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
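To make the default ratio and the rounding concrete, here is a minimal sketch on a made-up 10-sample array (the data and the variable names n_test and n_train are illustrative assumptions, not part of the guide's dataset). It calls train_test_split() without test_size and compares the resulting shapes with the counts worked out by hand.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, alternating 0/1 labels.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

# Neither test_size nor train_size is given, so the test fraction defaults to 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The fractional test size is rounded up; the training set gets the rest.
n_test = math.ceil(0.25 * len(X))  # 3
n_train = len(X) - n_test          # 7

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

On small datasets the realized ratio can therefore differ noticeably from the requested one (here 70/30 rather than 75/25).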
Parameters of train_test_split():
X: Feature data.
y: Target labels.
test_size: Size of the test split, either a float between 0.0 and 1.0 (a proportion of the dataset) or an integer (an absolute number of test samples).
train_size: Size of the training split, in the same float-or-integer form. Optional; if omitted, the rest of the data is used for training.
random_state: Seeds the shuffling applied before the split. Setting it to a fixed integer (e.g., 42) makes the split reproducible.
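The parameters above can be tried out with a short sketch. The 20-sample array and the names int_sizes, frac_sizes, and same_split are assumptions made up for illustration; the calls themselves use only the documented test_size, train_size, and random_state arguments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Integer test_size: exactly 5 samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5, random_state=42
)
int_sizes = (len(X_train), len(X_test))  # (15, 5)

# train_size and test_size together may cover less than the full dataset;
# the leftover samples are simply not used.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.5, test_size=0.25, random_state=42
)
frac_sizes = (len(X_tr), len(X_te))  # (10, 5)

# Repeating a call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=5, random_state=42)
same_split = np.array_equal(X_test, X_test_again)  # True

print(int_sizes, frac_sizes, same_split)
```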
4. Use the Split Data
Now you can use X_train, X_test, y_train, and y_test for model training and evaluation.
# Example usage
print("Training features:\n", X_train)
print("Test features:\n", X_test)
print("Training labels:\n", y_train)
print("Test labels:\n", y_test)
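Example Output:
Training features:
 [[5 6]
 [1 2]
 [7 8]]
Test features:
 [[ 3  4]
 [ 9 10]]
Training labels:
 [0 0 1]
Test labels:
 [1 0]
The test set holds two rows rather than one because 0.25 * 5 = 1.25 is rounded up to 2 test samples, leaving 3 for training.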
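Printing the arrays only confirms the split; the usual next step is to fit a model on the training part and score it on the held-out part. The following is a minimal sketch under assumptions the guide does not make: it swaps in scikit-learn's bundled iris dataset (the five-sample toy array is too small to evaluate anything) and picks LogisticRegression purely as an example estimator.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris dataset (150 samples, 3 classes) stands in for real data.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training split only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on samples the model has not seen; score() returns mean accuracy.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Keeping the test split out of fit() is what makes the reported accuracy an estimate of performance on unseen data.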
Additional Options:
shuffle=True: Shuffles the data before splitting (default behavior).
stratify=y: Keeps the proportion of each label the same in the training and testing sets, which is especially useful for imbalanced datasets. Stratifying requires shuffle=True.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
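The effect of stratify and shuffle is easier to see on an imbalanced dataset. The sketch below uses a made-up 100-sample array with a 90/10 label ratio (the data and the names train_counts, test_counts, and so on are illustrative assumptions) and counts the labels in each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 samples of class 0 followed by 10 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 90 + [1] * 10)

# stratify=y preserves the 90/10 label ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_counts = np.bincount(y_train)  # [72  8]
test_counts = np.bincount(y_test)    # [18  2]

# shuffle=False cuts the data in its original order: the first 80 rows
# become the training set and the last 20 the test set.
_, _, y_train_ordered, y_test_ordered = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
ordered_train_counts = np.bincount(y_train_ordered)  # [80], no class 1 at all
ordered_test_counts = np.bincount(y_test_ordered)    # [10 10]

print(train_counts, test_counts)
print(ordered_train_counts, ordered_test_counts)
```

With sorted labels like these, the unshuffled split leaves the training set without any class 1 samples, which is why shuffle=False is generally reserved for data whose order matters, such as time series.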
That is how you split a dataset into training and testing sets using train_test_split() in scikit-learn.