Wednesday, January 22, 2025
How to Split a Dataset With scikit-learn’s train_test_split()

To split a dataset into training and testing sets with train_test_split() from scikit-learn, follow the steps below. Holding out a test set lets you evaluate a model on data it did not see during training.

Here’s a basic guide:

1. Import Necessary Libraries

You need to import train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split

2. Prepare Your Data

Let’s assume you have a dataset (features X and target labels y). Typically, X is a 2D array of feature data, and y is the 1D array of labels.

Example:

import numpy as np

# Features (X) - Example 2D array
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# Labels (y) - Example 1D array
y = np.array([0, 1, 0, 1, 0])

3. Split the Dataset

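You can now split the data into training and test sets. By default, train_test_split() shuffles the data before splitting and, when neither test_size nor train_size is given, holds out 25% of the samples for testing and uses the remaining 75% for training. You can adjust the ratio with the test_size parameter. When the requested fraction does not correspond to a whole number of samples, the test count is rounded up.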

# Split the data into training and testing sets (requesting 25% for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
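To make the split sizes concrete, here is a minimal sketch that checks them on toy data. The arrays X_demo, y_demo, and X_small are hypothetical and exist only for this illustration; they are not part of the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Neither test_size nor train_size is given, so 25% is held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# With only 5 samples, 25% is 1.25 samples; the test count is rounded up to 2.
X_small = np.arange(10).reshape(5, 2)
small_train, small_test = train_test_split(X_small, test_size=0.25, random_state=0)
print(len(small_train), len(small_test))  # 3 2
```

On very small datasets the realized ratio can therefore differ noticeably from the requested one.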

Parameters of train_test_split():

  • X: Features data.
  • y: Target labels.
  • test_size: Size of the test split, given either as a float between 0.0 and 1.0 (a proportion of the dataset) or as an integer (the absolute number of test samples). If neither test_size nor train_size is set, it defaults to 0.25.
  • train_size: Size of the training split, given as a float proportion or an integer number of samples (optional; if not provided, the rest of the data is used for training).
  • random_state: Controls the shuffling of data. Setting it to a fixed value (e.g., 42) ensures reproducibility.
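The parameters above can be tried out with a short sketch. The data and variable names here (X_demo, a_train, b_train, and so on) are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# test_size as an integer: exactly 4 test samples, the other 16 for training.
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=4, random_state=42
)
print(len(a_train), len(a_test))  # 16 4

# train_size and test_size together may cover less than the whole dataset:
# here 10 samples go to training, 5 to testing, and 5 are left unused.
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.5, test_size=0.25, random_state=42
)
print(len(b_train), len(b_test))  # 10 5

# Repeating a call with the same random_state reproduces the same split.
c_train, c_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=4, random_state=42
)
same_split = np.array_equal(a_test, c_test)
print(same_split)  # True
```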

4. Use the Split Data

Now you can use X_train, X_test, y_train, and y_test for model training and evaluation.

# Example usage
print("Training features:\n", X_train)
print("Test features:\n", X_test)
print("Training labels:\n", y_train)
print("Test labels:\n", y_test)

Example Output (with five samples, test_size=0.25 rounds the test set up to two samples, leaving three for training):

Training features:
 [[5 6]
 [1 2]
 [7 8]]
Test features:
 [[ 3  4]
 [ 9 10]]
Training labels:
 [0 0 1]
Test labels:
 [1 0]
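Beyond printing the arrays, the usual next step is to fit a model on the training portion and score it on the held-out portion. The sketch below is one possible way to do that, assuming a synthetic dataset from make_classification and a LogisticRegression classifier; neither choice comes from the example above, and any scikit-learn estimator could take their place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): 200 samples, 4 features.
X_syn, y_syn = make_classification(n_samples=200, n_features=4, random_state=0)

# Hold out 25% of the samples for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Fit on the training split only, then evaluate on the unseen test split.
model = LogisticRegression()
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)  # mean accuracy on the test set
print(f"Test accuracy: {accuracy:.2f}")
```

Keeping the test split out of fit() is what makes the reported score an estimate of performance on new data.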

Additional Options:

  • shuffle=True: Shuffles the data before splitting (default behavior). With shuffle=False the data is split in its original order, and stratify must be left as None.
  • stratify=y: Keeps the label distribution approximately the same in the training and testing sets, which is especially useful for imbalanced datasets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
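The effect of both options is easier to see on a larger, imbalanced dataset. The following sketch uses hypothetical data (X_imb, y_imb) with a 90/10 class ratio, invented for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(100, 1)

# stratify=y_imb keeps the 90/10 class ratio in both splits.
_, _, y_tr_s, y_te_s = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)
train_counts = np.bincount(y_tr_s)  # 72 of class 0, 8 of class 1
test_counts = np.bincount(y_te_s)   # 18 of class 0, 2 of class 1
print(train_counts, test_counts)

# shuffle=False keeps the original order: the first 80 rows become the
# training set and the last 20 rows become the test set.
X_ord_train, X_ord_test = train_test_split(X_imb, test_size=0.2, shuffle=False)
print(X_ord_train[0, 0], X_ord_test[0, 0])  # 0 80
```

An ordered split of this kind can be useful for time-ordered data, where shuffling would let later observations leak into the training set.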

This is how you can split your dataset into training and testing sets using train_test_split() in scikit-learn.
