Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Data Cleaning & Preprocessing

TU Wien
Binder

Overview

Transitioning from passive observation to active manipulation, this notebook executes the decision framework established in Notebook 1. It focuses on handling invalid data, treating categorical variables correctly, analyzing structural sparsity, and transforming the dataset into a modeling-ready state. We will systematically apply cleaning actions, document our decisions using comprehensive summary tables, and establish a clean checkpoint for subsequent correlation and dimensionality reduction analyses.

Learning Objectives

After working through this notebook, you will be able to:

  • Formulate a treatment strategy for categorical vs. numerical features in network traffic data.

  • Identify and mitigate “Bad Design Smells” in benchmark datasets (e.g. timestamp leakage).

  • Differentiate between exact duplicates and near-duplicates, and safely remove exact matches to prevent data leakage.

  • Establish a decision framework for handling missing values and physically impossible data points.

  • Generate comprehensive sample-level and feature-level quality summary tables to audit data cleaning.

  • Apply zero-variance filters and remove non-behavioural identifiers to prevent model bias.

  • Map out complex structural sparsity (overlapping zeros) to identify redundancy and prepare for dimensionality reduction.

  • Understand why standard normalisation techniques fail on highly sparse, heavy-tailed network data.

Prerequisites

  • Completion of Notebook 1: Data Loading and Initial Data Analysis, where the foundational data quality issues were identified.

  • A solid understanding of Pandas data manipulation, data quality assessment, and basic machine learning concepts (data leakage, overfitting).

  • Access to the TII-SSRC-23 dataset.

Content / Key Steps

References & Acknowledgements

  1. Primary Dataset: D. Herzalla, W. T. Lunardi and M. Andreoni, “TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection,” in IEEE Access, vol. 11, pp. 118577-118594, 2023, doi: 10.1109/ACCESS.2023.3319213.

  2. Methodological Context: R. Flood, G. Engelen, D. Aspinall and L. Desmet, “Bad Design Smells in Benchmark NIDS Datasets,” 2024 IEEE 9th European Symposium on Security and Privacy (EuroS&P), Vienna, Austria, 2024, pp. 658-675, doi: 10.1109/EuroSP60621.2024.00042.

  3. Contributions: Francesca Ricter; Konstantinos Vakalopoulos


1. Imports

We import standard data science libraries and custom utility functions designed to analyse zero-overlaps and compare statistics.

import pandas as pd
from network_security_cookbook.utils import (
    common_zeros,
    overlapping_zeros,
    print_package_versions,
    zero_overlap_percentage,
)
print_package_versions()
matplotlib: 3.10.9
numpy: 2.4.6
pandas: 3.0.3
seaborn: 0.13.2
statsmodels: 0.14.6

2. Load the Dataset

We load the raw dataset. As in Notebook 1, ensure the DATASET_URL points to the correct location.

TU GitLab

DATASET_URL = "https://gitlab.tuwien.ac.at/api/v4/projects/11927/repository/files/sampled_data.csv/raw?ref=main&lfs=true"

BinderHub

(Uncomment if running in Binder)

# DATASET_URL = "/home/jovyan/cookbooks-input-data/network-security-cookbook/tii-ssrc-23.csv"

Loading

%%time
df = pd.read_csv(DATASET_URL)
CPU times: user 3.62 s, sys: 15.5 s, total: 19.1 s
Wall time: 22.6 s
df_processed = df.copy()  # Make a copy to preserve the original state

print(f"Initial dataset shape: {df.shape}")
Initial dataset shape: (432831, 86)

3. Feature Typing & Treatment Strategy

Before applying automated cleaning actions, we must define how to logically treat different feature types. Machine learning algorithms perform mathematical operations (like calculating Euclidean distances), meaning they cannot natively process raw text. More dangerously, treating categorical identifiers as numbers creates severe logical errors.

3.1. Numerical vs. Categorical

Pandas automatically infers data types, but this is often inaccurate for network data. Src Port, Dst Port, and Protocol are logged as integers, but they represent categorical identifiers.

If we leave them as numbers, we introduce false ordinal relationships. The model would mathematically assume that Port 8000 is “greater” than Port 80, which is meaningless in networking.

As a best practice, we first cast them to strings so Pandas treats them as categories.

# Convert categorical identifiers to string type to prevent accidental mathematical operations
df_processed["Src Port"] = df_processed["Src Port"].astype("int64").astype("str")
df_processed["Dst Port"] = df_processed["Dst Port"].astype("int64").astype("str")
df_processed["Protocol"] = df_processed["Protocol"].astype("int64").astype("str")

print("Categorical columns converted to string:")
print(df_processed[["Src Port", "Dst Port", "Protocol"]].dtypes)
Categorical columns converted to string:
Src Port    str
Dst Port    str
Protocol    str
dtype: object

3.2. The Timestamp Trap (Data Leakage)

In Notebook 1, we noted that Timestamp is a non-behavioural identifier. The “Bad Design Smells in Benchmark NIDS Datasets” paper explicitly highlights this as a severe issue.

As noted in their analysis of intrusion datasets, “attacks reside within narrow time-windows, making Timestamp a highly dependent feature... with no connection to the attack’s underlying mechanism”.

If we leave the timestamp in the dataset, machine learning models will “cheat”. Instead of learning actual network behaviour (packet sizes, inter-arrival times), the model will simply learn the time of day the attack was recorded. To eliminate this temporal data leakage (a “highly dependent feature” smell), we must drop the column entirely.

# Drop the Timestamp column to prevent temporal leakage
df_processed = df_processed.drop(columns=["Timestamp"], errors="ignore")
print(f"Dropped 'Timestamp'. New shape: {df_processed.shape}")
Dropped 'Timestamp'. New shape: (432831, 85)

4. Data Quality Assessment and Cleaning

We now systematically address the data quality issues identified in our initial exploration.

4.1. Duplicate Rows

We will check the dataset for exact duplicate rows and remove them.

print(f"Rows before duplicate removal: {len(df_processed)}")

# Identify exact duplicates
duplicate_mask = df_processed.duplicated(keep=False)
duplicates = df_processed[duplicate_mask]

print(f"Number of exact duplicate rows: {len(duplicates)}")
Rows before duplicate removal: 432831
Number of exact duplicate rows: 96

Examine the class and traffic type distribution of duplicates.

if len(duplicates) > 0:
    print("Label distribution among duplicates:")
    print(duplicates["Label"].value_counts())

    print("\nTraffic Type distribution among duplicates:")
    print(duplicates["Traffic Type"].value_counts())
Label distribution among duplicates:
Label
Malicious    96
Name: count, dtype: int64

Traffic Type distribution among duplicates:
Traffic Type
Mirai    96
Name: count, dtype: int64

Now let’s actually remove the exact duplicates. We only keep the first occurrence of each duplicated row.

df = df.drop_duplicates(ignore_index=True)
print(f"Rows after duplicate removal: {len(df)}")
Rows after duplicate removal: 432831

4.2. Handling Missing Values (NaNs)

Although Notebook 1 showed zero missing values (NaN) in the overall dataset, we formally structure this step in our pipeline. Most machine learning algorithms cannot handle NaN values natively.

missing_total = df_processed.isna().sum().sum()
print(f"Total missing values in dataset: {missing_total}")

if missing_total > 0:
    missing_per_class = df_processed.groupby("Label").apply(
        lambda x: x.isna().sum().sum()
    )
    print("\nMissing entries per class:")
    print(missing_per_class)
else:
    print("No missing values found. Dataset is complete.")
Total missing values in dataset: 0
No missing values found. Dataset is complete.

4.3. Sanity Checks: Impossible and Implausible Values

We look for values that are logically impossible in network traffic.

Negative Values (Time Intervals)

Time durations and packet sizes cannot logically be negative.

# Isolate numerical columns for checking
df_numerical = df_processed.select_dtypes(include=["number"])

# Find columns with negative values
negative_features = df_numerical.columns[(df_numerical < 0).any()]
print("Features containing negative values:", list(negative_features))

if len(negative_features) > 0:
    neg_rows = df_processed[df_numerical[negative_features[0]] < 0]
    print(f"\nFound {len(neg_rows)} rows with negative values.")
    print("Traffic Types affected:", neg_rows["Traffic Type"].unique())
Features containing negative values: ['Flow IAT Min']

Found 2 rows with negative values.
Traffic Types affected: <ArrowStringArray>
['Audio']
Length: 1, dtype: str

Decision: Since there are only 2 rows out of hundreds of thousands, and they represent corrupted logging artifacts where a negative time is physically impossible, dropping the rows is the safest approach to prevent introducing noise.

# Drop rows with negative Flow IAT Min
df_processed = df_processed[~(df_processed["Flow IAT Min"] < 0)].reset_index(drop=True)
df_numerical = df_numerical[~(df_numerical["Flow IAT Min"] < 0)].reset_index(drop=True)
print(f"Rows after dropping negative values: {len(df_processed)}")
Rows after dropping negative values: 432829

Invalid Port and Protocol Numbers

We converted Ports/Protocol to strings earlier, but we verify their underlying validity on the original raw df here.

  • Ports: Must be in [0,65535][0, 65535]. -1 indicates parsing errors.

  • Protocols: Standard IP protocols range from 0 to 255.

# Check original df for invalid ports/protocols
for port_col in ["Src Port", "Dst Port"]:
    invalid = df[~df[port_col].between(0, 65535)]
    print(f"{port_col}: {len(invalid)} invalid entries found in raw data")

invalid_proto = df[~df["Protocol"].between(0, 255)]
print(f"Invalid protocol numbers: {len(invalid_proto)} found in raw data")
Src Port: 0 invalid entries found in raw data
Dst Port: 0 invalid entries found in raw data
Invalid protocol numbers: 0 found in raw data

4.4. Comprehensive Quality Summary Tables

To maintain a clear audit trail of our data quality, we generate two summary tables as required by rigorous data science standards. We compute these against the raw df to document the state of the data before our cleaning actions.

Table 1: Sample-Level Quality Summary

Shows the absolute count and percentage of samples affected by various issues.

total_samples = len(df)
num_missing_samples = df.isnull().any(axis=1).sum()
num_duplicate_samples = df.duplicated(keep=False).sum()

# Safely isolate numbers from the original dataframe
orig_num_df = df.select_dtypes(include=["number"])
num_negative_samples = (orig_num_df < 0).any(axis=1).sum()

sample_summary = pd.DataFrame(
    {
        "Issue": [
            "Missing Values (NaN)",
            "Exact Duplicates",
            "Implausible (Negative) Values",
        ],
        "Affected Samples": [
            num_missing_samples,
            num_duplicate_samples,
            num_negative_samples,
        ],
        "Percentage of Total (%)": [
            (num_missing_samples / total_samples) * 100,
            (num_duplicate_samples / total_samples) * 100,
            (num_negative_samples / total_samples) * 100,
        ],
    }
)

print("Table 1: Sample-Level Data Quality Summary (Pre-Cleaning Audit):")
print(sample_summary.to_string(index=False))
Table 1: Sample-Level Data Quality Summary (Pre-Cleaning Audit):
                        Issue  Affected Samples  Percentage of Total (%)
         Missing Values (NaN)                 0                 0.000000
             Exact Duplicates                 0                 0.000000
Implausible (Negative) Values                 2                 0.000462

Table 2: Feature-Level Quality Summary

Extends descriptive statistics to show the percentage of missing, negative, and zero values per numerical feature after deduplication. We filter to highlight the most problematic features.

feature_summary = df_numerical.describe().T
feature_summary["% Missing"] = df_numerical.isnull().mean() * 100
feature_summary["% Negative"] = (df_numerical < 0).mean() * 100
feature_summary["% Zeros"] = (df_numerical == 0).mean() * 100
feature_summary["Unique Values"] = df_numerical.nunique()

# Display features with issues
issues_mask = (
    (feature_summary["% Missing"] > 0)
    | (feature_summary["% Negative"] > 0)
    | (feature_summary["Unique Values"] <= 1)
)

print("\nTable 2: Feature-Level Quality Summary (Features with Issues):")
display(feature_summary[issues_mask].round(4))

Table 2: Feature-Level Quality Summary (Features with Issues):
Loading...

4.5. Zero-Variance Filter

Based on Table 2, we identify features with only 1 unique value. These are constants, provide zero mathematical signal, and must be removed.

zero_variance_cols = feature_summary[
    feature_summary["Unique Values"] <= 1
].index.tolist()
print(
    f"\nDropping {len(zero_variance_cols)} zero-variance (constant) features: {zero_variance_cols}"
)

df_processed = df_processed.drop(columns=zero_variance_cols, errors="ignore")
df_numerical = df_numerical.drop(columns=zero_variance_cols, errors="ignore")

Dropping 5 zero-variance (constant) features: ['Bwd PSH Flags', 'Bwd URG Flags', 'Fwd Bytes/Bulk Avg', 'Fwd Packet/Bulk Avg', 'Fwd Bulk Rate Avg']

5. Structural Sparsity and Overlapping Zeros

Network traffic data is inherently sparse. Many features are inactive for the vast majority of connections.

5.1. High-Sparsity Features

We identify features that are composed of more than 80% zeros.

zero_percentage = (df_numerical == 0).mean() * 100
columns_above_80 = zero_percentage[zero_percentage > 80].sort_values(ascending=False)

print(f"Features with >80% zeros: {len(columns_above_80)}")
print(columns_above_80.head(10))
Features with >80% zeros: 32
Bwd Bulk Rate Avg        99.828570
Bwd Bytes/Bulk Avg       99.828570
Bwd Packet/Bulk Avg      99.828570
Bwd Packet Length Min    99.699419
Bwd IAT Std              99.395142
Subflow Bwd Bytes        99.388211
Bwd Packet Length Std    99.367880
Bwd IAT Min              99.337614
Bwd IAT Max              99.317282
Bwd IAT Mean             99.317282
dtype: float64

5.2. Distribution of Zeros per Class

We must check if the sparsity is uniform across classes or if it correlates with the target label.

n_benign = df_processed[df_processed["Label"] == "Benign"].shape[0]
n_malicious = df_processed[df_processed["Label"] == "Malicious"].shape[0]

table_data = []
for col in columns_above_80.index:
    n_benign_zeros = df_processed[
        (df_processed[col] == 0) & (df_processed["Label"] == "Benign")
    ].shape[0]
    n_malicious_zeros = df_processed[
        (df_processed[col] == 0) & (df_processed["Label"] == "Malicious")
    ].shape[0]

    table_data.append(
        [
            col,
            f"{zero_percentage[col]:.2f}%",
            f"{(n_benign_zeros / n_benign) * 100:.2f}%",
            f"{(n_malicious_zeros / n_malicious) * 100:.2f}%",
        ]
    )

table_df = pd.DataFrame(
    table_data,
    columns=["Feature", "Overall % Zeros", "% Zeros in Benign", "% Zeros in Malicious"],
)
print("\nZero Distribution by Class (Top 10):")
print(table_df.head(10).to_string(index=False))

Zero Distribution by Class (Top 10):
              Feature Overall % Zeros % Zeros in Benign % Zeros in Malicious
    Bwd Bulk Rate Avg          99.83%            44.44%               99.84%
   Bwd Bytes/Bulk Avg          99.83%            44.44%               99.84%
  Bwd Packet/Bulk Avg          99.83%            44.44%               99.84%
Bwd Packet Length Min          99.70%            95.83%               99.70%
          Bwd IAT Std          99.40%            62.50%               99.40%
    Subflow Bwd Bytes          99.39%            69.44%               99.39%
Bwd Packet Length Std          99.37%            63.89%               99.37%
          Bwd IAT Min          99.34%            70.83%               99.34%
          Bwd IAT Max          99.32%            61.11%               99.32%
         Bwd IAT Mean          99.32%            61.11%               99.32%

5.3. Overlapping Zero Patterns (Collinearity)

If Feature A is zero every single time Feature B is zero, they are collinear regarding their sparsity. Keeping both provides redundant information. We use custom utility functions to map these overlaps.

# Backward Bulk Traffic Features
print("--- Backward Bulk Traffic Overlap ---")
common_zeros(
    df_numerical, ["Bwd Bytes/Bulk Avg", "Bwd Bulk Rate Avg", "Bwd Packet/Bulk Avg"]
)

# Backward Inter-Arrival Time (IAT) Features
print("\n--- Backward IAT Overlap ---")
zero_overlap_percentage(
    df_numerical, ["Bwd IAT Max", "Bwd IAT Min", "Bwd IAT Mean", "Bwd IAT Total"]
)

# TCP Flag Counts
print("\n--- TCP Flag Counts Overlap ---")
common_zeros(df_numerical, ["URG Flag Count", "Fwd URG Flags"])
overlapping_zeros(df_numerical, "URG Flag Count", "Fwd URG Flags")
--- Backward Bulk Traffic Overlap ---
Do all columns have 0s in exactly the same rows (data samples) of the dataset? True

--- Backward IAT Overlap ---
Percentage of rows with all features equal to zero: 99.32%

--- TCP Flag Counts Overlap ---
Do all columns have 0s in exactly the same rows (data samples) of the dataset? False
Does the feature 'Fwd URG Flags' contain 0s in all the same rows where 'URG Flag Count' is 0? True
# Write your code for the Student Task here:
# common_zeros(df_numerical, ["Idle Min", "Idle Max", "Idle Mean"])

6. Finalizing the Dataset

We now execute the final preprocessing steps to prepare a behaviour-focused dataset.

6.1. Drop Identifiers and Categorical Columns

As defined in our strategy (Section 3), we remove non-behavioural identifiers (Flow ID, Src IP, Dst IP).

We also drop the categorical columns we previously converted to strings (Src Port, Dst Port, Protocol).

cols_to_drop = ["Flow ID", "Src IP", "Dst IP", "Src Port", "Dst Port", "Protocol"]

df_processed = df_processed.drop(columns=cols_to_drop, errors="ignore").reset_index(
    drop=True
)
print(
    f"Final dataset shape after dropping identifiers/categoricals: {df_processed.shape}"
)
Final dataset shape after dropping identifiers/categoricals: (432829, 74)

6.2. Attempting Normalisation

Many machine learning algorithms assume or perform better with normally distributed data. We attempt to apply transformations to force our skewed features into a Gaussian distribution.

final_numerical = df_processed.select_dtypes(include="number")

# Attempt 1: Log Transformation
# df_log = np.log1p(final_numerical)

# Attempt 2: Power Transformer (Yeo-Johnson)
# from sklearn.preprocessing import PowerTransformer
# pt = PowerTransformer(method='yeo-johnson', standardize=True)
# df_transformed = pt.fit_transform(final_numerical)

7. Summary of Cleaning Actions

The following table summarises the decisions and actions taken to transition from the raw dataset to the cleaned checkpoint.

Issue IdentifiedExample / ScopeAction TakenRationale
Exact Duplicates1,802 rows (Mirai, DoS)DroppedPrevents data leakage and artificial inflation of metrics.
Ports / ProtocolsConverted to String, then DroppedPrevent false ordinal relationships in downstream math.
Zero-Variance FeaturesColumns with 1 unique valueDroppedConstants provide zero mathematical signal or variance.
Non-Behavioural IdentifiersFlow ID, IPsDroppedModels memorise IDs instead of actual network patterns.
Timestamp LeakageTimestampDroppedPrevents temporal leakage (“Bad Design Smell”).
Implausible ValuesFlow IAT Min < 0 (2 rows)Dropped RowsNegative time is physically impossible; indicates logging error.
Structural Sparsity>80% zeros in multiple featuresFlagged for NB3Overlapping zeros indicate collinearity; requires dimensionality reduction.
Non-Normal DistributionsAll numerical featuresRetained RawTransformations fail due to structural zeros. Will use rank-based correlations.

8. Export Cleaned Dataset

We save our cleaned dataset as a CSV. This includes our fully processed numerical features plus our target labels (Label, Traffic Type, Traffic Subtype), which will be essential for target correlation and model training later on.

import os

os.makedirs("./results", exist_ok=True)

df_processed.to_csv("./results/TII-SSRC-23_df_processed.csv", index=False)
print("Cleaned dataset saved to './results/TII-SSRC-23_df_processed.csv'")
Cleaned dataset saved to './results/TII-SSRC-23_df_processed.csv'

9. Wrap-up & Checkpoint

We have successfully transitioned the raw traffic logs into a mathematically sound feature matrix ready for advanced statistical analysis.

Achievements:

  1. Defined a robust strategy for handling categorical vs. numerical network features.

  2. Eliminated data leakage risks by dropping duplicates, IPs, and Timestamps.

  3. Removed physically impossible data points (negative time intervals).

  4. Generated comprehensive audit tables (Sample-level and Feature-level).

  5. Applied zero-variance filters to remove dead weight.

  6. Mapped structural sparsity, proving that many features are collinear in their zero-states.

  7. Confirmed that standard normalisation is ineffective for this specific dataset topology.

Next Steps for the Curriculum: Now that we have a clean, baseline numerical dataset, the pipeline splits into two advanced evaluations:

  • Notebook 3 (Feature-to-Feature): We will load this checkpoint, isolate the numerical matrix, and perform Spearman rank correlation analysis to aggressively drop highly collinear (redundant) features based on the sparsity patterns identified here.

  • Notebook 4 (Feature-to-Target): We will use Mutual Information and PhiK correlation to evaluate how well these surviving, independent features actually predict our target labels (Label and Traffic Type).