Network Anomaly Detection

Last updated on Oct 13, 2019


Aim: Given NetFlow network records, detect anomalous behavior (e.g., port scanning, DoS).

Source:

analyzer_clean.py: batches flows every 10 s and sends each batch for outlier detection. Checks the outlier IP addresses (src and dst combined) to issue alerts.
pandas_analysis.py: extracts the relevant continuous features and implements IQR outlier detection for a batch of flows. Returns the src and dst addresses that were outliers based on the number of connections and destination ports in the batch.
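
The 10-second batching described above can be sketched with pandas as follows. This is a minimal illustration, not the code in analyzer_clean.py: the function names batch_flows and per_source_counts and the sample rows are hypothetical, and ts is assumed to be a Unix timestamp in seconds.

```python
import pandas as pd

def batch_flows(flows: pd.DataFrame, window_s: int = 10):
    """Yield (window_start, batch) pairs, one per non-empty time window."""
    # Integer division labels every flow with the start of its window.
    window_start = (flows["ts"] // window_s) * window_s
    for start, batch in flows.groupby(window_start):
        yield start, batch

def per_source_counts(batch: pd.DataFrame) -> pd.DataFrame:
    """Connections and distinct destination ports per source IP in a batch."""
    return batch.groupby("src_ip").agg(
        num_conns=("dst_port", "size"),
        num_dst_ports=("dst_port", "nunique"),
    )

# Hypothetical sample flows (only the columns needed for this sketch).
flows = pd.DataFrame({
    "ts": [0, 3, 9, 10, 12, 25],
    "src_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2",
               "10.0.0.1", "10.0.0.1", "10.0.0.3"],
    "dst_port": [80, 443, 80, 22, 22, 8080],
})
batches = {start: per_source_counts(batch) for start, batch in batch_flows(flows)}
```

Each entry of batches holds the per-source counts for one window, which is the kind of table an outlier test can then be run on.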

analyzer.py: Original (unclean) code for outlier detection in multiple steps. Partially implemented and untested.
1. Check basics: packet lengths, local IPs, connection state, protocols, etc.
2. Check src ip and dst ip against a blacklist set in memory
3. Check dst_port and index first use for IP address + aggregate bytes
4. Check IP address and number of ports, protocols, bytes
5. Aggregate flows for each (src_ip, dst_ip) pair every T sec and detect outliers statistically
6. TODO: cluster input in batches using ‘rbf’
7. TODO: train RNN based LSTM with good data, predict output
8. TODO: aggregate output of all detectors for each flow and produce trustworthy probability
blacklist_update.py: code to update a static blacklist of malicious IP addresses
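
The in-memory blacklist check from step 2 could look roughly like the sketch below. The layout of blacklist_ips.csv is not described here, so one IP address per row in the first column is assumed; load_blacklist and blacklisted_endpoints are hypothetical names, and the addresses are documentation-range examples.

```python
import csv
import io

def load_blacklist(fileobj) -> set:
    """Read the first column of each non-empty row into a set of IP strings."""
    return {row[0].strip() for row in csv.reader(fileobj) if row and row[0].strip()}

def blacklisted_endpoints(flow: dict, blacklist: set) -> list:
    """Return the flow's endpoints (src_ip, dst_ip) that are on the blacklist."""
    return [flow[key] for key in ("src_ip", "dst_ip") if flow[key] in blacklist]

# Stand-in for the CSV file; the real file's layout is an assumption here.
blacklist = load_blacklist(io.StringIO("203.0.113.7\n198.51.100.23\n"))
flow = {"src_ip": "192.168.1.5", "dst_ip": "203.0.113.7"}
hits = blacklisted_endpoints(flow, blacklist)
```

Keeping the blacklist in a set makes each lookup constant time on average, which matters when every flow is checked twice.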

data.csv: test file netflow data
blacklist_ips.csv: offline csv of bad IP addresses generated by blacklist_update.py
df_flow_features.pkl: pandas object with features extracted from the whole dataset grouped by flow tuple for ML
df_src_dst_sampled_10s.pkl: pandas object with complete dataset sampled to 10s intervals for testing ML

feature extraction after grouping by flow
input raw features: [‘ts’,‘ip_protocol’,‘state’,‘src_ip’,‘src_port’,‘dst_ip’,‘dst_port’,‘src_tx’,‘dst_tx’]
output features, with (srcip, sport, dstip, dport, proto) as groupby key:
number of entries/connections: count(entries) grouped
time_first_seen = ts1
time_last_seen = ts2
total time for flow (first_seen - current/end)
bytes_up (sum src_tx)
bytes_dw
throughput_up (total bytes/total time)
throughput_dw
first_state
laststate = (state)
state(state): number of connections per state
cidr_src_ip: get supernet string using ipaddr
cidr_dst_ip
pvt_srcip: private IPs should be declared as private (True/False)
pvt_dstip
dport_80, dport_8080, dport_443, dport_22 (orthogonal)
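
A rough pandas sketch of this per-flow feature extraction is given below. It is an illustration under assumptions rather than the original code: it uses the standard-library ipaddress module in place of ipaddr, assumes a /24 supernet (the prefix length is not stated above), treats ts as seconds, and omits the per-state connection counts. The function names and sample rows are hypothetical.

```python
import ipaddress
import pandas as pd

FLOW_KEY = ["src_ip", "src_port", "dst_ip", "dst_port", "ip_protocol"]

def is_private(ip: str) -> bool:
    return ipaddress.ip_address(ip).is_private

def supernet(ip: str, prefix: int = 24) -> str:
    # strict=False lets a host address stand in for its enclosing network.
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))

def flow_features(flows: pd.DataFrame) -> pd.DataFrame:
    # Sort by time so that "first" and "last" refer to the earliest/latest state.
    feats = flows.sort_values("ts").groupby(FLOW_KEY).agg(
        num_conns=("ts", "size"),
        time_first_seen=("ts", "min"),
        time_last_seen=("ts", "max"),
        bytes_up=("src_tx", "sum"),
        bytes_dw=("dst_tx", "sum"),
        first_state=("state", "first"),
        last_state=("state", "last"),
    ).reset_index()
    feats["total_time"] = feats["time_last_seen"] - feats["time_first_seen"]
    # Single-record flows have zero duration; their throughput is left as NaN.
    duration = feats["total_time"].where(feats["total_time"] > 0)
    feats["throughput_up"] = feats["bytes_up"] / duration
    feats["throughput_dw"] = feats["bytes_dw"] / duration
    feats["cidr_src_ip"] = feats["src_ip"].map(supernet)
    feats["cidr_dst_ip"] = feats["dst_ip"].map(supernet)
    feats["pvt_src_ip"] = feats["src_ip"].map(is_private)
    feats["pvt_dst_ip"] = feats["dst_ip"].map(is_private)
    for port in (80, 8080, 443, 22):
        feats[f"dport_{port}"] = feats["dst_port"] == port
    return feats

# Hypothetical sample: one three-record TCP flow and one single-record UDP flow.
flows = pd.DataFrame({
    "ts": [0, 4, 10, 2],
    "ip_protocol": ["tcp", "tcp", "tcp", "udp"],
    "state": ["SYN", "EST", "FIN", "NEW"],
    "src_ip": ["192.168.1.10"] * 3 + ["8.8.8.8"],
    "src_port": [50000, 50000, 50000, 53],
    "dst_ip": ["93.184.216.34"] * 3 + ["192.168.1.10"],
    "dst_port": [443, 443, 443, 40000],
    "src_tx": [100, 200, 300, 50],
    "dst_tx": [1000, 2000, 3000, 10],
})
features = flow_features(flows)
```

The result has one row per flow tuple, with the boolean dport_* columns acting as the orthogonal port indicators.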

Clustering
Two PCA components covered essentially all of the variance.
k-means found clusters for k=5, but the features are not well suited to k-means.
DBSCAN performed best and separated outliers clearly when plotted. Without true labels, this could not be confirmed for every point.
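
As an illustration of this PCA-then-DBSCAN approach, the sketch below uses scikit-learn on synthetic data. The data, the scaling step, and the eps and min_samples values are assumptions chosen for the example, not the settings used in this analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the per-flow feature matrix: one dense blob plus
# three far-away points that play the role of anomalous flows.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
odd = np.array([
    [25.0, 25.0, 25.0, 25.0, 25.0],
    [-25.0, -25.0, -25.0, -25.0, -25.0],
    [25.0, -25.0, 25.0, -25.0, 25.0],
])
X = np.vstack([normal, odd])

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

# eps and min_samples are illustrative guesses, not tuned values.
labels = DBSCAN(eps=1.5, min_samples=5).fit_predict(X_2d)
outlier_idx = np.flatnonzero(labels == -1)  # DBSCAN labels noise points as -1
```

DBSCAN needs no cluster count up front and reports isolated points as noise, which is why it maps naturally onto outlier detection.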

group by src_ip, dst_ip on whole data set
sample the data every 10s to get 420 * (number of ip pairs) samples ~ 342093 samples
extract continuous and categorical features for ML: [‘bytes_up’, ‘bytes_dw’, ‘num_conns’, ‘num_flows’, ‘num_dst_port’, ‘num_src_port’, ‘tcp_conns’, ‘udp_conns’, ‘cidr_src_ip’, ‘cidr_dst_ip’, ‘pvt_src_ip’, ‘pvt_dst_ip’]

Clustering
PCA again showed very good results, but required 4 components to cover 94% of the variance.
k-means was run with multiple values of k; the error was low for k=4 and lowest for k=8. Based on the shape of the data, an RBF kernel was applied.
DBSCAN and spectral clustering could not be run properly because of memory issues.
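
How many components are needed to reach a given share of the variance can be computed directly from the singular values of the centered data. The NumPy sketch below shows the calculation on a tiny hand-made matrix; components_for_variance is a hypothetical helper, and real features would normally be standardized first when their scales differ.

```python
import numpy as np

def components_for_variance(X: np.ndarray, target: float) -> int:
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target` (0 < target <= 1)."""
    centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the component variances.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    cumulative = np.cumsum(variances) / variances.sum()
    # First position where the cumulative ratio is >= target, as a count.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))

# Toy data with per-axis sums of squares 18, 8 and 2
# (cumulative ratios 18/28, 26/28, 28/28).
X = np.array([
    [3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
    [0.0, 2.0, 0.0], [0.0, -2.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
])
n_for_90 = components_for_variance(X, 0.90)
n_for_94 = components_for_variance(X, 0.94)
```

On this toy matrix two components reach 90% (26/28 is about 0.929), while 94% needs all three.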

for implementing and testing pandas_analysis.py
grouped batches of flows every 10s and extracted multiple numerical features
Four main features were used in testing for the initial design: [num_dst_ports, num_conns, bytes_up, bytes_dw]
analyzed each flow as a list and tested IQR against histograms of extracted features
group by flow (srcip, sport, dstip, dport, proto)
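
The IQR test on batch features can be sketched as below. This is a minimal version under stated assumptions: it uses the conventional multiplier k = 1.5 and checks only the upper fence (Q3 + k * IQR), since the concern is excessive activity; the exact rule in pandas_analysis.py may differ. The helper names and sample values are hypothetical.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values above the upper fence Q3 + k * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return values > q3 + k * (q3 - q1)

def outlier_ips(batch_features: pd.DataFrame, columns) -> set:
    """IPs (taken from the index) flagged on at least one feature column."""
    flagged = pd.Series(False, index=batch_features.index)
    for col in columns:
        flagged |= iqr_outliers(batch_features[col])
    return set(batch_features.index[flagged])

# Hypothetical per-IP features for one 10 s batch.
batch_features = pd.DataFrame(
    {
        "num_conns": [3, 4, 5, 4, 3, 5, 4, 250],
        "num_dst_ports": [1, 2, 1, 2, 1, 2, 1, 120],
    },
    index=[f"10.0.0.{i}" for i in range(1, 9)],
)
outliers = outlier_ips(batch_features, ["num_conns", "num_dst_ports"])
```

Because an IP is flagged when any single feature exceeds its fence, adding features to the list can only grow the alert set in this variant; requiring all features to agree would shrink it instead.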

Almost certain that 192.168.100.96 had the worst behavior.
Multiple IPs had too much activity on non-mainstream ports
15 dst_ips used more than 100 unique ports each.
Final simple implementation based on statistical outlier detection:
only flow agg IQR with 2 features [num_dst_ports, num_conns]: Total Number of Alerts: 65752
only flow agg IQR with num_dst_ports: Total Number of Alerts: 14366
only flow agg IQR with num_conns: Total Number of Alerts: 60617
only flow agg IQR with all features (including total bytes_up and total bytes_dw size): Total Number of Alerts: 6492
All detectors in analyzer.py (basic, blacklist, port agg, ip_addr agg, flow agg): Total Number of Alerts: 92656
PCA showed good results in covering the variance of extracted features
Clustering worked better for features aggregated by flow tuples, not those aggregated by dst_ip, src_ip
Not yet tried machine learning while batch processing
Not yet tried unsupervised neural networks for this dataset due to lack of labels and guaranteed good data
Last step should be ensemble based trust - instead of adding alerts at each detector, calculate the probability from all filters
The IQR test is statistically sound, but there are much better algorithms (e.g., HBOS) that can replace it for histogram-based outlier detection. Additionally, batch processing of flows was essentially stateless, but the number of new_ports is an important feature for detecting port scans, and it requires previous state. This is currently missing and should be added.
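
One possible way to carry the missing state between batches is sketched below: a small tracker that remembers which destination ports each source has already contacted and reports how many are new in the current batch. PortHistory and its update method are hypothetical names, and the batch is assumed to be an iterable of (src_ip, dst_port) pairs.

```python
from collections import defaultdict

class PortHistory:
    """Remembers the destination ports seen per source IP across batches."""

    def __init__(self):
        self._seen = defaultdict(set)

    def update(self, batch) -> dict:
        """Return {src_ip: number of previously unseen ports} for this batch,
        then record those ports so later batches no longer count them."""
        new_ports = defaultdict(set)
        for src_ip, dst_port in batch:
            if dst_port not in self._seen[src_ip]:
                new_ports[src_ip].add(dst_port)
        for src_ip, ports in new_ports.items():
            self._seen[src_ip] |= ports
        return {src_ip: len(ports) for src_ip, ports in new_ports.items()}

history = PortHistory()
first = history.update([("10.0.0.1", 80), ("10.0.0.1", 443), ("10.0.0.2", 22)])
second = history.update([("10.0.0.1", 80), ("10.0.0.1", 8080), ("10.0.0.1", 8081)])
```

The per-source count could then be fed to the outlier test as another feature. The history grows without bound in this sketch, so a long-running analyzer would need some form of expiry.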

Network analysis researcher and security expert