How to Split a Dataset With scikit-learn’s train_test_split()

January 22, 2025

The train_test_split() function from scikit-learn splits a dataset into training and testing sets, so that a model can be evaluated on data it did not see during training.

Here’s a basic guide:

1. Import Necessary Libraries

You need to import train_test_split from sklearn.model_selection.

from sklearn.model_selection import train_test_split

2. Prepare Your Data

Let’s assume you have a dataset with features X and target labels y. Typically, X is a 2D array with one row per sample and one column per feature, and y is a 1D array with one label per sample.

Example:

import numpy as np

# Features (X) - Example 2D array
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])

# Labels (y) - Example 1D array
y = np.array([0, 1, 0, 1, 0])

3. Split the Dataset

You can now split the data into training and test sets. By default, train_test_split() shuffles the data before splitting and, when neither test_size nor train_size is given, holds out 25% of the samples for testing and uses the remaining 75% for training. You can adjust the split ratio with the test_size parameter.

# Split the data into training and testing sets, requesting 25% for testing.
# With only 5 samples, the test count is rounded up to 2, leaving 3 for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
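To see how the requested fraction translates into actual sample counts, it can help to check the lengths of the returned arrays. The sketch below uses a hypothetical 100-sample array (the names X_demo and y_demo are made up for illustration and are not part of the five-row example above) so that the 75/25 split comes out exactly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration: 100 samples, 2 features each
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2  # alternating 0/1 labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# The sizes follow the requested ratio
print(len(X_tr), len(X_te))  # 75 25
print(len(y_tr), len(y_te))  # 75 25
```

Features and labels are shuffled together, so each row of X_tr still lines up with the corresponding entry of y_tr.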

Parameters of `train_test_split()`:

X: Feature data.
y: Target labels.
test_size: Size of the test split, given either as a float between 0.0 and 1.0 (the proportion of the dataset) or as an integer (the absolute number of test samples). If both test_size and train_size are omitted, it defaults to 0.25.
train_size: Size of the training split, given as a float proportion or an integer number of samples (optional; if not provided, the rest of the data is used for training).
random_state: Controls the shuffling applied before the split. Setting it to a fixed integer (e.g., 42) makes the split reproducible across runs.
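The parameters above can be tried out on a small array. This is a minimal sketch on hypothetical data (X_demo and y_demo are illustrative names, not from the example above) showing an integer test_size, explicit integer train_size and test_size together, and the effect of a fixed random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data for illustration: 10 samples, 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# An integer test_size asks for an absolute number of test samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=3, random_state=0
)
print(X_tr.shape, X_te.shape)  # (7, 2) (3, 2)

# train_size and test_size can both be set; leftover samples are simply unused
X_a, X_b = train_test_split(X_demo, train_size=5, test_size=2, random_state=0)
print(X_a.shape, X_b.shape)  # (5, 2) (2, 2)

# Reusing the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=3, random_state=0
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)  # True
```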

4. Use the Split Data

Now you can use X_train, X_test, y_train, and y_test for model training and evaluation.

# Example usage
print("Training features:\n", X_train)
print("Test features:\n", X_test)
print("Training labels:\n", y_train)
print("Test labels:\n", y_test)

Example Output (with 5 samples and test_size=0.25, the test set is rounded up to 2 samples):

Training features:
 [[5 6]
 [1 2]
 [7 8]]
Test features:
 [[ 3  4]
 [ 9 10]]
Training labels:
 [0 0 1]
Test labels:
 [1 0]
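Printing the arrays only confirms the split; in practice the training set is used to fit a model and the test set to score it. The following is a minimal sketch of that workflow. The choice of the bundled iris dataset, LogisticRegression, and accuracy as the metric are assumptions made for illustration, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset that ships with scikit-learn (150 samples, 3 classes)
X_iris, y_iris = load_iris(return_X_y=True)

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_iris, y_iris, test_size=0.25, random_state=42
)

# Fit on the training split only
model = LogisticRegression(max_iter=1000)
model.fit(X_train_i, y_train_i)

# Evaluate on the held-out test split
y_pred = model.predict(X_test_i)
accuracy = accuracy_score(y_test_i, y_pred)
print(f"Test accuracy: {accuracy:.2f}")
```

Keeping the test split out of fit() is the point of the exercise: the score then reflects how the model handles samples it has not seen.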

Additional Options:

shuffle=True: Shuffles the data before splitting (the default). Set shuffle=False to keep the original order, for example with time-ordered data; in that case stratify must be left as None.
stratify=y: Keeps the proportion of each label approximately the same in the training and testing sets, which is especially useful for imbalanced datasets.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
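The effect of stratify is easier to see on a larger, imbalanced label array. The sketch below uses hypothetical data (90 samples of class 0 and 10 of class 1; X_imb and y_imb are illustrative names) and counts the labels on each side of the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.arange(100).reshape(100, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Both splits keep the original 90/10 class ratio
train_counts = np.bincount(y_tr).tolist()
test_counts = np.bincount(y_te).tolist()
print(train_counts)  # [72, 8]
print(test_counts)   # [18, 2]
```

Without stratify, a random 20-sample test set drawn from this data could contain very few or even no samples of the minority class.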

This is how you can split your dataset into training and testing sets using train_test_split() in scikit-learn.
