
This curated collection of interactive Jupyter Notebooks teaches students how to evaluate and prepare network traffic data for machine learning. Through four hands-on modules, learners tackle real-world challenges like extreme sparsity, anomalies and class imbalance. Students master exploratory data analysis, data cleaning, dimensionality reduction and target correlation to isolate predictive features.
The Cookbook Collection¶
Our curriculum is structured into progressive series, guiding students from evaluating raw data to advanced threat detection.
A. Network Traffic Dataset Checksuite (Current Series): Focuses on data quality, exploratory data analysis and feature selection before modeling.
B. Anomaly Detection in Network Traffic Datasets (Upcoming Series): Applies machine learning techniques to identify deviations and unusual behaviour in cleaned network data.
C. Attack Detection (Upcoming Series): Focuses on identifying and classifying specific, targeted network attacks.
A. Network Traffic Dataset Checksuite¶
Alternative Title: Evaluating and Optimising Network Traffic Datasets
This Checksuite bridges the gap between raw data collection and analytical modelling. It teaches students how to perform rigorous data preprocessing and feature selection on network intrusion detection datasets.
Through hands-on modules covering data cleaning, sparsity mapping and advanced correlation analysis, students learn to address real-world data quality issues to build a robust, model-ready dataset.
Before analysing network traffic data (such as the TII-SSRC-23 dataset), it is essential to ensure data integrity.
Feeding redundant, noisy or irrelevant features into a learning algorithm can lead to overfitting, slow processing times and
poor generalisation. Therefore, this Checksuite performs a rigorous evaluation of dataset features and their
statistical correlations with target classes (specifically Label and Traffic Type).
Motivation¶
In real-world scenarios, especially in network intrusion detection, data is rarely clean, normally distributed or evenly balanced. Failing to address issues such as extreme scale differences, class imbalance or hidden feature redundancies can lead to severe analytical biases.
Rather than merely focusing on how to write the code, this course emphasises methodological rigour. It teaches students to think critically about data quality, structural sparsity and collinearity before they even start to build a predictive model.
Learning Objectives¶
By the end of this cookbook, you will be able to:
Conduct Exploratory Data Analysis (EDA): Assess dataset dimensionality and class balance, and distinguish between categorical and numerical features.
Ensure Data Quality: Identify and mitigate anomalies, duplicates, missing values and physically impossible records.
Handle Data Sparsity & Distributions: Recognise non-normal distributions and structural sparsity, and apply appropriate non-parametric methods.
Perform Feature Selection: Use Spearman’s rank correlation, Mutual Information and Phi-k ($\phi_K$) correlation to eliminate redundant features and isolate the most predictive signals.
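The data-quality objective above can be illustrated with a small, dependency-free Python sketch. The records, the column names (`duration`, `packets`, `flag`) and the rule that a negative duration is physically impossible are assumptions made purely for illustration; they are not taken from TII-SSRC-23 or from the notebooks themselves.

```python
from collections import Counter

# Hypothetical flow records; names and values are illustrative only.
records = [
    {"duration": 1.2, "packets": 10, "flag": 0, "Label": "Benign"},
    {"duration": 1.2, "packets": 10, "flag": 0, "Label": "Benign"},       # exact duplicate
    {"duration": -0.5, "packets": 3, "flag": 0, "Label": "Malicious"},    # negative duration
    {"duration": 0.8, "packets": None, "flag": 0, "Label": "Malicious"},  # missing value
    {"duration": 2.4, "packets": 7, "flag": 0, "Label": "Malicious"},
    {"duration": 0.3, "packets": 4, "flag": 0, "Label": "Malicious"},
]


def clean(rows):
    """Drop rows with missing values, implausible durations and exact duplicates."""
    seen, kept = set(), []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # missing value
        if row["duration"] < 0:
            continue  # assumed rule: a duration cannot be negative
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        kept.append(row)
    return kept


def zero_variance_features(rows, target="Label"):
    """Return the features that take a single value across all rows."""
    return [
        name
        for name in rows[0]
        if name != target and len({row[name] for row in rows}) == 1
    ]


cleaned = clean(records)
constant = zero_variance_features(cleaned)
class_counts = Counter(row["Label"] for row in cleaned)

print(len(cleaned), "rows kept; constant features:", constant)
print("class balance:", dict(class_counts))
```

Counting the classes after cleaning, rather than before, shows how removing bad records can itself shift the class balance.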
Structure¶
This suite is structured into four progressive analytical modules that utilise the TII-SSRC-23 dataset as a real-world case study.
Data Loading and Initial Data Analysis: Focuses on assessing dataset dimensionality and class balance, separating categorical and numerical features, and conducting rigorous sanity checks for anomalies. It also visualises feature distributions using objective mathematical binning.
Data Cleaning & Preprocessing: Focuses on actively resolving identified data quality issues. Students learn to safely drop duplicates and implausible values, apply zero-variance filters and map complex structural sparsity (overlapping zero patterns) to prepare the data for dimensionality reduction.
Feature-to-Feature Collinearity Analysis: Focuses on dimensionality reduction. Students evaluate collinearity using Spearman’s rank correlation and programmatically dismantle a large correlation matrix by analysing features in logical groups to remove redundancy.
Target Correlation & Feature Selection: Focuses on evaluating predictive power. Students investigate which features carry a true signal versus noise by using Mutual Information for the binary label and Phi-k correlation for the multi-class traffic type.
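The ideas behind the last two modules can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the notebooks' implementation: the feature names, the values, the 0.9 threshold and the greedy pruning rule are all hypothetical, and Phi-k is not covered because it relies on a dedicated library. Mutual information is computed here only for discrete features; continuous features would first need binning or a dedicated estimator.

```python
import math
from collections import Counter


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    # Assumes neither input is constant (zero-variance features removed earlier).
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


def drop_redundant(features, threshold=0.9):
    """Greedily keep a feature only if |rho| with every kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def mutual_information(x, y):
    """Mutual information in bits between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items()
    )


# Hypothetical numeric features: the second is a monotone transform of the first.
fwd_bytes = [100, 200, 300, 400, 500, 600, 700, 800]
numeric = {
    "fwd_bytes": fwd_bytes,
    "fwd_bytes_sq": [b * b for b in fwd_bytes],
    "duration": [3, 7, 1, 5, 8, 2, 6, 4],
}
kept = drop_redundant(numeric)

# Hypothetical binary label and two discrete candidate features.
label = [0, 0, 0, 0, 1, 1, 1, 1]
mi_syn = mutual_information([0, 0, 0, 0, 1, 1, 1, 1], label)    # mirrors the label
mi_noise = mutual_information([0, 1, 0, 1, 0, 1, 0, 1], label)  # independent of it

print("kept after collinearity pruning:", kept)
print("MI(syn_flag; label) =", mi_syn, "| MI(noise_flag; label) =", mi_noise)
```

Because Spearman's rho works on ranks, the squared feature is flagged as fully redundant even though its relationship to `fwd_bytes` is non-linear, which is the reason a rank-based measure suits non-normal traffic data.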
Authors¶
Network Security Cookbook Development Team, Research Unit of Networks, Institute of Telecommunications
Faculty of Electrical Engineering and Information Technology, TU Wien
Running the Notebooks¶
Running on TU Cookbooks Binder¶
Go to TU Cookbooks Binder: https://
Running on Your Own Machine¶
If you are interested in running this material locally on your computer, you will need to follow this workflow:
Clone the https://gitlab.tuwien.ac.at/cookbooks/public/network_security_cookbook repository:
git clone https://gitlab.tuwien.ac.at/cookbooks/network_security_cookbook
Move into the network_security_cookbook directory:
cd network_security_cookbook
Create a virtual environment with all the required libraries and dependencies. Several options are available:
uv
conda
Docker
uv¶
Create and activate your virtual environment
uv sync --all-extras
Start up the JupyterLab server in the notebooks directory
uv run jupyter lab notebooks/
Conda¶
Create and activate your conda environment from the environment.yml file
conda env create -f environment.yml
conda activate tucookbooks
Move into the notebooks directory and start up JupyterLab
jupyter lab notebooks/
Docker¶
When available, simply run docker compose to start a Jupyter Lab instance.
Start Jupyter Lab from a Docker container:
docker compose up
Copy Jupyter Lab URL to web browser:
Stop and remove the container
Stop Docker container with
CTRL + C
Remove container:
docker compose down