Network Anomaly Detection

Link to Code

Aim: Given netflow network records, detect anomalous behavior (e.g., port scanning, DoS).

Source:

  • analyzer_clean.py: batches flows every 10s and sends each batch for outlier detection. Issues alerts for the outlier IP addresses (src and dst combined).
  • pandas_analysis.py: extracts the relevant continuous feature set and implements IQR outlier detection for a batch of flows. Returns the src and dst addresses that were outliers based on number of connections and destination ports in the batch.
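
The batching step described above can be sketched in plain Python. This is a minimal illustration, not the code in analyzer_clean.py: the function names `batch_flows` and `per_ip_counts` are hypothetical, flows are assumed to be dicts sorted by an epoch-seconds `ts` field, and the field names follow the raw feature list used in the notebooks.

```python
from collections import defaultdict

BATCH_SECONDS = 10  # window length taken from the description above


def batch_flows(flows, batch_seconds=BATCH_SECONDS):
    """Yield (window_start, flows_in_window) for consecutive time windows.

    Assumes `flows` is sorted by 'ts'. Empty windows are skipped.
    """
    batch, window_start = [], None
    for flow in flows:
        ts = flow["ts"]
        if window_start is None:
            window_start = ts - (ts % batch_seconds)
        # Close every window that ended before this flow arrived.
        while ts >= window_start + batch_seconds:
            if batch:
                yield window_start, batch
                batch = []
            window_start += batch_seconds
        batch.append(flow)
    if batch:
        yield window_start, batch


def per_ip_counts(batch):
    """Count connections and distinct destination ports per IP (src and dst combined)."""
    conns = defaultdict(int)
    ports = defaultdict(set)
    for flow in batch:
        for ip in (flow["src_ip"], flow["dst_ip"]):
            conns[ip] += 1
            ports[ip].add(flow["dst_port"])
    return {
        ip: {"num_conns": conns[ip], "num_dst_ports": len(ports[ip])}
        for ip in conns
    }
```

Each yielded batch can then be passed to an outlier detector, which only needs the per-IP counts for that window.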

Additional Code:

  • analyzer.py: Original (unclean) code for outlier detection in multiple steps. Partially implemented and untested.
    1. Check basics: packet lengths, local IPs, connection state, protocols, etc.
    2. Check src ip and dst ip against a blacklist set in memory
    3. Check dst_port and index first use for IP address + aggregate bytes
    4. Check IP address and number of ports, protocols, bytes
    5. Aggregate flows for (src_ip, dst_ip) pair every T sec and detect outliers statistically
    6. TODO: cluster input in batches using ‘rbf’
    7. TODO: train RNN based LSTM with good data, predict output
    8. TODO: aggregate output of all detectors for each flow and produce trustworthy probability
  • blacklist_update.py: code to update a static blacklist of malicious IP addresses
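
Step 2 above (checking src and dst IPs against an in-memory blacklist set) might look like the sketch below. The CSV layout is an assumption: one IP address in the first column of each row, no header. The helper names `load_blacklist` and `blacklist_hits` are hypothetical and not taken from analyzer.py.

```python
import csv
import io


def load_blacklist(fileobj):
    """Read the first column of each non-empty CSV row into a set of IP strings."""
    return {
        row[0].strip()
        for row in csv.reader(fileobj)
        if row and row[0].strip()
    }


def blacklist_hits(flow, blacklist):
    """Return the flow's addresses (src first, then dst) that are blacklisted."""
    return [ip for ip in (flow["src_ip"], flow["dst_ip"]) if ip in blacklist]


# Usage with an in-memory stand-in for blacklist_ips.csv
# (documentation-range addresses, not real entries).
sample = io.StringIO("203.0.113.5\n198.51.100.7\n")
blacklist = load_blacklist(sample)
hits = blacklist_hits({"src_ip": "10.0.0.1", "dst_ip": "203.0.113.5"}, blacklist)
```

A set keeps each lookup constant-time, which matters when every flow is checked twice.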

Data Files:

  • data.csv: test file netflow data
  • blacklist_ips.csv: offline csv of bad IP addresses generated by blacklist_update.py
  • df_flow_features.pkl: pandas object with features extracted from the whole dataset grouped by flow tuple for ML
  • df_src_dst_sampled_10s.pkl: pandas object with complete dataset sampled to 10s intervals for testing ML

Jupyter Notebooks:

netflow-data-scratch-file.ipynb

  • explore netflow data
  • group by given features and explore distribution and statistics
  • loaded full chunk into memory for exploration
  • histograms of destination port usage
  • grouping by srcip-dstip
  • grouping by flow (srcip, sport, dstip, dport, proto)

anomaly-detectors.ipynb

  • feature extraction after grouping by flow
  • input raw features: [‘ts’,‘ip_protocol’,‘state’,‘src_ip’,‘src_port’,‘dst_ip’,‘dst_port’,‘src_tx’,‘dst_tx’]
  • output features, with (srcip, sport, dstip, dport, proto) as groupby key:
    • number of entries/connections: count(entries) grouped
    • time_first_seen = ts1
    • time_last_seen = ts2
    • total time for flow (first_seen - current/end)
    • bytes_up (sum src_tx)
    • bytes_dw
    • throughput_up (total bytes/total time)
    • throughput_dw
    • first_state
    • last_state = (state)
    • state_(state): number of connections per state
    • cidr_src_ip: get supernet string using ipaddr
    • cidr_dst_ip
    • pvt_srcip: private IPs should be declared as private (True/False)
    • pvt_dstip
    • dport_80, dport_8080, dport_443, dport_22 (orthogonal)
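
A rough sketch of this per-flow feature extraction, written without pandas so it runs on its own. Several details are assumptions rather than the notebook's code: the standard-library `ipaddress` module stands in for `ipaddr`, the supernet is taken as a /24, `bytes_dw` is assumed to be the sum of `dst_tx`, flow duration is last seen minus first seen, and the output key names are illustrative.

```python
import ipaddress
from collections import Counter, defaultdict

FLOW_KEY = ("src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol")
WATCHED_PORTS = (80, 8080, 443, 22)


def supernet(ip, prefix_len=24):
    """Return the enclosing network as a string (the /24 prefix is an assumption)."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def flow_features(records):
    """Group records by flow 5-tuple and compute one feature dict per flow."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[k] for k in FLOW_KEY)].append(rec)

    features = {}
    for key, rows in groups.items():
        rows.sort(key=lambda r: r["ts"])
        src_ip, _, dst_ip, dst_port, _ = key
        first, last = rows[0]["ts"], rows[-1]["ts"]
        duration = last - first
        bytes_up = sum(r["src_tx"] for r in rows)
        bytes_dw = sum(r["dst_tx"] for r in rows)  # assumed counterpart of bytes_up
        feat = {
            "num_conns": len(rows),
            "time_first_seen": first,
            "time_last_seen": last,
            "duration": duration,
            "bytes_up": bytes_up,
            "bytes_dw": bytes_dw,
            # Guard against single-record flows with zero duration.
            "throughput_up": bytes_up / duration if duration > 0 else 0.0,
            "throughput_dw": bytes_dw / duration if duration > 0 else 0.0,
            "first_state": rows[0]["state"],
            "last_state": rows[-1]["state"],
            "state_counts": Counter(r["state"] for r in rows),
            "cidr_src_ip": supernet(src_ip),
            "cidr_dst_ip": supernet(dst_ip),
            "pvt_src_ip": ipaddress.ip_address(src_ip).is_private,
            "pvt_dst_ip": ipaddress.ip_address(dst_ip).is_private,
        }
        for port in WATCHED_PORTS:
            feat[f"dport_{port}"] = dst_port == port  # one flag per watched port
        features[key] = feat
    return features
```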

    Clustering

  • 2 components of PCA basically covered all the variance completely

  • k-means successfully found clusters for k=5, but the features are not well suited to k-means

  • DBSCAN performed best and separated outliers clearly when plotted. Without true labels, this could not be confirmed for all of them.

anomalous-ip-detector.ipynb

  • group by src_ip, dst_ip on whole data set
  • sample the data every 10s to get 420 * (number of ip pairs) samples ~ 342093 samples
  • extract continuous and categorical features for ML: [‘bytes_up’, ‘bytes_dw’, ‘num_conns’, ‘num_flows’, ‘num_dst_port’, ‘num_src_port’, ‘tcp_conns’, ‘udp_conns’, ‘cidr_src_ip’, ‘cidr_dst_ip’, ‘pvt_src_ip’, ‘pvt_dst_ip’]

    Clustering

  • PCA again showed very good results, but required 4 components to cover 94% of the variance.

  • k-means over multiple values of k showed low error for k=4 and the lowest for k=8. Based on the data shape, rbf was applied.

  • DBSCAN and spectral clustering could not be run properly due to memory issues.

flow-analyzer.ipynb

  • for implementing and testing pandas_analysis.py
  • grouped batches of flows every 10s and extracted multiple numerical features
  • 4 main features were used in testing for the initial design: [num_dst_ports, num_conns, bytes_up, bytes_dw]
  • analyzed each flow as a list and tested IQR against histograms of extracted features
  • group by flow (srcip, sport, dstip, dport, proto)
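
The IQR test on per-batch features can be sketched as follows. This is an illustration under stated assumptions, not the code in pandas_analysis.py: it uses the conventional 1.5 multiplier, flags only values above the upper fence (unusually high activity), and takes the union of outliers across features. The names `iqr_outliers` and `outlier_ips` are hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the keys whose value exceeds Q3 + k * IQR.

    `values` maps a key (e.g. an IP address) to a numeric feature value.
    Batches with fewer than 4 points are treated as too small to test.
    """
    data = sorted(values.values())
    if len(data) < 4:
        return set()
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    upper_fence = q3 + k * (q3 - q1)
    return {key for key, value in values.items() if value > upper_fence}


def outlier_ips(features, names=("num_dst_ports", "num_conns")):
    """Union of IQR outliers over the selected features (the union is an assumption)."""
    flagged = set()
    for name in names:
        flagged |= iqr_outliers({ip: feats[name] for ip, feats in features.items()})
    return flagged
```

When most values in a batch are identical, the IQR collapses to zero and any larger value is flagged, so very small or very uniform batches deserve extra care.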

Results:

  • Almost certain that 192.168.100.96 had the worst behavior.
  • Multiple IPs had too much activity on non-mainstream ports
  • 15 dst_ips used more than 100 unique ports each.
  • Final simple implementation based on statistical outlier detection:
  • only flow agg IQR with 2 features [num_dst_ports, num_conns]: Total Number of Alerts: 65752
  • only flow agg IQR with num_dst_ports: Total Number of Alerts: 14366
  • only flow agg IQR with num_conns: Total Number of Alerts: 60617
  • only flow agg IQR with all features (including total bytes_up and total bytes_dw size): Total Number of Alerts: 6492
  • All detectors in analyzer.py (basic, blacklist, port agg, ip_addr agg, flow agg): Total Number of Alerts: 92656
  • PCA showed good results in covering the variance of extracted features
  • Clustering worked better for features aggregated by flow tuples, not those aggregated by dst_ip, src_ip
  • Not yet tried machine learning while batch processing
  • Not yet tried unsupervised neural networks for this dataset due to lack of labels and guaranteed good data
  • Last step should be ensemble based trust - instead of adding alerts at each detector, calculate the probability from all filters
  • The IQR test is statistically sound, but there are much better algorithms (e.g., hosp) that could replace it for histogram-based outlier detection.
  • Batch processing of flows was essentially stateless, but the number of new_ports, an important feature for detecting port scans, requires previous state. This is currently missing and should be added.
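
One possible shape for the missing new_ports state is sketched below. It is not part of the current code: the class name `NewPortTracker` is hypothetical, and it assumes the feature is the number of destination ports a source IP contacts for the first time in a given batch.

```python
from collections import defaultdict


class NewPortTracker:
    """Remember the destination ports each source IP has used across batches."""

    def __init__(self):
        self._seen = defaultdict(set)  # src_ip -> set of dst ports seen so far

    def update(self, batch):
        """Record the batch and return {src_ip: number of ports first seen in it}."""
        new_ports = defaultdict(int)
        for flow in batch:
            seen = self._seen[flow["src_ip"]]
            port = flow["dst_port"]
            if port not in seen:
                seen.add(port)
                new_ports[flow["src_ip"]] += 1
        return dict(new_ports)
```

The per-batch counts could be fed to the same outlier test as the other features. The sets grow without bound here, so a long-running analyzer would need some form of expiry.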
Sarthak Grover
Computer Networks Researcher

Network analysis researcher and security expert