Overview
My general interests lie in computer networks and network security.
Among other things, I am interested in the design and development of
algorithms and tools that enable novel analysis and understanding
of network traffic. Where possible, I draw on techniques from
programming languages and other disciplines to aid this work.
As part of my dissertation work, I developed
the Extended Finite Automata (XFA) model and technique for compact,
memory-efficient regular-expression-based matching of network payloads.
This technique can be used, for example, by network intrusion detection
systems to perform signature matching at line-speeds while avoiding
memory blowup. This work is still ongoing, but you can read about XFAs
in my dissertation as well as in my
Sigcomm and
Oakland papers.
Publications
Refereed Conference Publications
- Improving NFA-based Signature Matching using Ordered Binary Decision Diagrams,
Liu Yang, Rezwana Karim, Vinod Ganapathy, Randy Smith
Recent Advances in Intrusion Detection (RAID) 2010,
(pdf)
(slides)
(abstract)
Network intrusion detection systems (NIDS) make extensive use of
regular expressions as attack signatures. Internally, NIDS represent
and operate these signatures using finite automata. Existing
representations of finite automata present a well-known time-space
tradeoff: Deterministic automata (DFAs) provide fast matching but are
memory intensive, while non-deterministic automata (NFAs) are
space-efficient but are several orders of magnitude slower than
DFAs. This time/space tradeoff has motivated much recent research,
primarily with a focus on improving the space-efficiency of DFAs,
often at the cost of reducing their performance.
This paper presents NFA-OBDDs, a new representation of regular
expressions that uses ordered binary decision diagrams (OBDDs)
to retain the space-efficiency of NFAs while drastically
improving their time-efficiency. Experiments using Snort HTTP
and FTP signature sets show that an NFA-OBDD-based
representation of regular expressions can outperform
traditional NFAs by up to three orders of magnitude and is
competitive with a variant of DFAs, while still retaining the
space-efficiency of NFAs.
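To make the time/space issue in the abstract concrete, the sketch below shows the plain frontier-set simulation that makes NFAs slow: every input byte requires visiting every currently active state. The NFA-OBDD technique replaces this per-state loop with a single OBDD image computation over a Boolean encoding of the frontier; the code only illustrates the baseline behavior, and the toy NFA and names are my own, not from the paper.

```python
# Minimal sketch of plain NFA frontier simulation -- the per-symbol set
# computation that the NFA-OBDD technique replaces with a single OBDD
# "image" operation.  All names here are illustrative, not from the paper.

def run_nfa(transitions, start, accepting, data):
    """transitions: dict mapping (state, symbol) -> set of next states."""
    frontier = {start}                      # set of currently active states
    for symbol in data:
        nxt = set()
        for state in frontier:              # cost grows with the frontier size
            nxt |= transitions.get((state, symbol), set())
        frontier = nxt
        if frontier & accepting:            # report as soon as any accepting
            return True                     # state becomes active
    return False

# Tiny NFA for ".*ab": state 0 loops, 'a' also moves to 1, 'b' from 1
# reaches the accepting state 2.
trans = {(0, 'a'): {0, 1}, (0, 'b'): {0}, (1, 'b'): {2}}
print(run_nfa(trans, start=0, accepting={2}, data="aab"))   # True
print(run_nfa(trans, start=0, accepting={2}, data="aaa"))   # False
```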
- Multi-Byte Regular Expression Matching with Speculation,
Daniel Luchaup, Randy Smith, Cristian Estan, and Somesh Jha,
Recent Advances in Intrusion Detection (RAID) 2009,
(pdf)
(abstract)
Intrusion prevention systems (IPSs) determine whether incoming
traffic matches a database of signatures, where each signature in
the database represents an attack or a vulnerability. IPSs need
to keep up with ever-increasing line speeds, which leads to the
use of custom hardware and specialized programmable
architectures. A major bottleneck that IPSs face is that they
match incoming packets one byte at a time, which limits their
throughput and latency. In this paper, we present a method for
matching which processes multiple bytes in parallel using
speculation. We break the packet into several chunks,
opportunistically match them in parallel and if the speculation
is wrong, correct it later. We present algorithms that apply
speculation in single-threaded software running on commodity
processors as well as algorithms for parallel hardware. Experimental
results show that speculation leads to improvements in latency
and throughput in both cases.
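A minimal, single-threaded sketch of the chunk-and-correct idea described above: every chunk is scanned speculatively from the DFA's start state, and only chunks whose guessed start state turns out wrong are rescanned. The papers run the chunk scans in parallel and rely on properties of typical DFAs that make the guesses usually correct; the names and the toy DFA below are assumptions for illustration.

```python
# Illustrative sketch of speculative chunked DFA scanning: each chunk is
# scanned from a guessed start state (here, the DFA's initial state); a
# validation pass rescans only chunks whose guess was wrong.

def scan(dfa, state, chunk):
    """Run the DFA over one chunk and return the final state."""
    for ch in chunk:
        state = dfa[state][ch]
    return state

def speculative_match(dfa, start, chunks):
    # Phase 1: speculate that every chunk begins in the start state.
    guesses = [start] * len(chunks)
    ends = [scan(dfa, g, c) for g, c in zip(guesses, chunks)]
    # Phase 2: validate left to right; rescan a chunk only if its true
    # start state differs from the speculated one.
    state = start
    for i, chunk in enumerate(chunks):
        if state != guesses[i]:
            ends[i] = scan(dfa, state, chunk)   # mis-speculation: redo
        state = ends[i]
    return state

# Toy DFA: state 1 means "last byte seen was 'a'", state 0 otherwise.
dfa = {0: {'a': 1, 'b': 0}, 1: {'a': 1, 'b': 0}}
chunks = ["abb", "aab", "aba"]
print(speculative_match(dfa, 0, chunks))   # 1
print(scan(dfa, 0, "".join(chunks)))       # 1, same as a sequential scan
```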
- Protocol Normalization using Attribute Grammars,
Drew Davidson, Randy Smith, Nic Doyle, and Somesh Jha,
14th European Symposium on Research in Computer Security
(ESORICS) 2009
(pdf)
(abstract)
Protocol parsing is an essential step in several
networking-related tasks. For instance, parsing network traffic
is an essential step for Intrusion Prevention Systems
(IPSs). The task of developing parsers for protocols is
challenging because network protocols often have features that
cannot be expressed in a context-free grammar. We address the
problem of parsing protocols by using attribute grammars (AGs),
which allows us to factor features that are not context-free as
attributes. We investigate this approach in the context of
protocol normalization, which is an essential task in
IPSs. Normalizers generated using systematic techniques, such
as ours, are more robust and resilient to attacks. Our
experience is that such normalizers incur an acceptable level
of overhead (approximately 15% in the worst case) and are
straightforward to implement.
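As a toy illustration of why attributes help with protocol parsing, the sketch below parses a made-up length-prefixed message: the length value, carried along as an attribute, decides how many body bytes to consume, which a plain context-free grammar cannot express. The message format, names, and trivial normalization step are hypothetical and are not the paper's generated normalizers.

```python
# Minimal illustration of the attribute-grammar idea: a length field (a
# synthesized attribute) controls how the rest of the message is parsed,
# something a pure context-free grammar cannot capture.

def parse_message(data: bytes):
    """Parse 'LEN:<n> <n bytes of body>' and return the normalized body."""
    if not data.startswith(b"LEN:"):
        raise ValueError("missing length field")
    sep = data.index(b" ", 4)
    length = int(data[4:sep])           # attribute computed while parsing
    body = data[sep + 1: sep + 1 + length]
    if len(body) != length:             # attribute constraint check
        raise ValueError("truncated body")
    return body.lower()                 # trivial 'normalization' step

print(parse_message(b"LEN:5 HeLLo trailing bytes ignored"))   # b'hello'
```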
- Evaluating GPUs for Network Packet Signature Matching,
Randy Smith, Neelam Goyal, Justin Ormont, Cristian Estan, and Karthikeyan Sankaralingam,
International Symposium on Performance Analysis of Systems and Software (ISPASS) 2009
(pdf)
(slides)
(abstract)
Modern network devices employ deep packet inspection to
enable sophisticated services such as intrusion detection, traffic
shaping, and load balancing. At the heart of such services
is a signature matching engine that must match packet payloads
to multiple signatures at line rates. However, the recent
transition to complex regular-expression based signatures
coupled with ever-increasing network speeds has rapidly increased
the performance requirements of signature matching.
Solutions to meet these requirements range from hardware-centric
ASIC/FPGA implementations to software implementations
using high-performance microprocessors.
In this paper, we propose a programmable signature matching
system prototyped on an Nvidia G80 GPU. We first present
a detailed architectural and microarchitectural analysis, showing
that signature matching is well suited for SIMD processing
because of regular control flow and parallelism available at
the packet level. Next, we examine two approaches for matching
signatures: standard deterministic finite automata (DFAs)
and extended finite automata (XFAs), which use far less memory
than DFAs but require specialized auxiliary memory and
small amounts of computation in most states. We implement a
fully functional prototype on the SIMD-based G80 GPU. This
system outperforms a Pentium 4 by up to 9X and a Niagara-based
32-threaded system by up to 2.3X and shows that GPUs
are a promising candidate for signature matching.
- Efficient Signature Matching with Multiple Alphabet Compression Tables,
Shijin Kong, Randy Smith, Cristian Estan,
Securecomm 2008
(pdf)
(slides)
(abstract)
Signature matching is a performance critical operation in
intrusion prevention systems. Modern systems express
signatures as regular expressions and use Deterministic Finite
Automata (DFAs) to efficiently match them against the input.
In principle, DFAs can be combined so that all signatures can
be examined in a single pass over the input. In practice,
however, combining DFAs corresponding to intrusion prevention
signatures results in memory requirements that far exceed
feasible sizes. We observe for such signatures that distinct
input symbols often have identical behavior in the DFA. In
these cases, an Alphabet Compression Table (ACT) can be used
to map such groups of symbols to a single symbol to reduce the
memory requirements.
In this paper, we explore the use of multiple alphabet
compression tables as a lightweight method for reducing the
memory requirements of DFAs. We evaluate this method on
signature sets used in Cisco IPS and Snort. Compared to
uncompressed DFAs, multiple ACTs achieve memory savings
between a factor of 4 and a factor of 70 at the cost of an
increase in run time that is typically between 35% and 85%.
Compared to another recent compression technique, D2FAs, ACTs
are between 2 and 3.5 times faster in software, and in some
cases use less than one tenth of the memory used by
D2FAs. Overall, for all signature sets and compression methods
evaluated, multiple ACTs offer the best memory versus run-time
trade-offs.
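The sketch below illustrates the core observation in the abstract: symbols whose transition columns are identical across all DFA states can share one entry in an alphabet compression table. It builds a single ACT for one toy DFA; the paper's contribution is using multiple ACTs across groups of DFAs, which this sketch does not attempt, and all names here are illustrative.

```python
# Sketch of building a single alphabet compression table (ACT) for a DFA:
# symbols whose transition columns are identical across all states are
# mapped to one symbol class, shrinking every state's transition row.

def build_act(dfa, alphabet):
    """dfa: dict state -> {symbol: next_state}. Returns (act, class count)."""
    column_to_class, act = {}, {}
    for sym in alphabet:
        column = tuple(dfa[s][sym] for s in sorted(dfa))   # behavior of sym
        cls = column_to_class.setdefault(column, len(column_to_class))
        act[sym] = cls                                     # symbol -> class
    return act, len(column_to_class)

# DFA for ".*ab" over a 4-symbol alphabet: 'c' and 'd' behave identically.
dfa = {0: {'a': 1, 'b': 0, 'c': 0, 'd': 0},
       1: {'a': 1, 'b': 2, 'c': 0, 'd': 0},
       2: {'a': 2, 'b': 2, 'c': 2, 'd': 2}}
act, classes = build_act(dfa, "abcd")
print(act, classes)   # {'a': 0, 'b': 1, 'c': 2, 'd': 2} 3
```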
- Deflating the Big Bang: Fast and Scalable Deep Packet Inspection with Extended Finite Automata,
Randy Smith, Cristian Estan, Somesh Jha, Shijin Kong,
Sigcomm 2008
(pdf)
(slides)
(abstract)
Deep packet inspection is playing an increasingly important
role in the design of novel network services. Regular
expressions are the language of choice for writing signatures,
but standard DFA or NFA representations are unsuitable for
high-speed environments, requiring too much memory, too much
time, or too much per-flow state. DFAs are fast and can be
readily combined, but doing so often leads to state-space
explosion. NFAs, while small, require large per-flow state and
are slow.
We propose a solution that simultaneously addresses all
these problems. We start with a first-principles
characterization of state-space explosion and give conditions
that eliminate it when satisfied. We show how auxiliary
variables can be used to transform automata so that they
satisfy these conditions, which we codify in a formal model
that augments DFAs with auxiliary variables and simple
instructions for manipulating them. Building on this model, we
present techniques, inspired by principles used in compiler
optimization, that systematically reduce runtime and per-flow
state. In our experiments, signature sets from Snort and
Cisco Systems achieve state-space reductions of over four
orders of magnitude, per-flow state reductions of up to a
factor of six, and runtimes that approach DFAs.
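The toy below illustrates the auxiliary-variable idea in this abstract (and in the Oakland paper that follows): combining ".*a.*b" and ".*c.*d" into one DFA forces product states that remember which prefixes have been seen, while two scratch bits plus tiny per-symbol instructions avoid that blowup. The sketch collapses everything into a single implicit state purely for illustration and is not the papers' actual construction.

```python
# Minimal illustration of the XFA idea: instead of product-DFA states that
# remember "seen 'a' yet?" and "seen 'c' yet?" (the source of state-space
# explosion when signatures are combined), keep two scratch bits and attach
# tiny instructions to symbols.  Hypothetical toy only.

def match_xfa(payload):
    bits = {"seen_a": False, "seen_c": False}   # auxiliary scratch memory
    matches = set()
    for ch in payload:
        if ch == 'a':
            bits["seen_a"] = True               # instruction: set bit
        elif ch == 'c':
            bits["seen_c"] = True
        elif ch == 'b' and bits["seen_a"]:
            matches.add(".*a.*b")               # instruction: test bit, report
        elif ch == 'd' and bits["seen_c"]:
            matches.add(".*c.*d")
    return matches

print(match_xfa("xxaxcxdxb"))   # both signatures match
```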
- XFA: Faster Signature Matching with Extended Automata,
Randy Smith, Cristian Estan, Somesh Jha,
2008 IEEE Symposium on Security and Privacy (Oakland 2008)
(pdf)
(slides)
(abstract)
Automata-based representations and related algorithms have
been applied to address several problems in information
security, and often the automata had to be augmented with
additional information. For example, extended finite-state
automata (EFSA) augment finite-state automata (FSA) with
variables to track dependencies between arguments of system
calls. In this paper, we introduce extended finite automata
(XFAs) which augment FSAs with finite scratch memory and
instructions to manipulate this memory. Our primary motivation
for introducing XFAs is signature matching in Network
Intrusion Detection Systems (NIDS). Representing NIDS
signatures as deterministic finite-state automata (DFAs)
results in very fast signature matching but for several
classes of signatures DFAs can blow up in space. Using
nondeterministic finite-state automata (NFA) to represent NIDS
signatures results in a succinct representation but at the
expense of higher time complexity for signature matching. In
other words, DFAs are time-efficient but space-inefficient,
and NFAs are space-efficient but time-inefficient. In our
experiments we have noticed that for a large class of NIDS
signatures XFAs have time complexity similar to DFAs and space
complexity similar to NFAs. For our test set, XFAs use 10
times less memory than a DFA-based solution, yet achieve 20
times higher matching speeds.
- Backtracking Algorithmic Complexity Attacks Against a NIDS,
Randy Smith, Cristian Estan, and Somesh Jha,
Twenty-Second Annual Computer Security Applications Conference
(ACSAC 2006)
(pdf)
(slides)
(abstract)
Best Paper award winner!
Network Intrusion Detection Systems (NIDS) have become
crucial to securing modern networks. To be effective, a NIDS
must be able to counter evasion attempts and operate at or
near wire-speed. Failure to do so allows malicious packets to
slip through a NIDS undetected. In this paper, we explore NIDS
evasion through algorithmic complexity attacks. We present a
highly effective attack against the Snort NIDS, and we provide
a practical algorithmic solution that successfully thwarts the
attack. This attack exploits the behavior of rule matching,
yielding inspection times that are up to 1.5 million times
slower than those of benign packets. Our analysis shows that
this attack is applicable to many rules in Snort's ruleset,
rendering vulnerable the thousands of networks protected by
it. Our countermeasure confines the inspection time to within
one order of magnitude of benign packets. Experimental results
using a live system show that an attacker needs only 4.0 kbps
of bandwidth to perpetually disable an unmodified NIDS,
whereas all intrusions are detected when our countermeasure is
used.
This work was highlighted in an article on
crn.com (DDJ also carried the article). Bugtraq information is available.
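The sketch below illustrates the general backtracking phenomenon behind this class of attack: a rule with several content strings, evaluated by naive backtracking over all occurrences, can be driven to astronomical work by a crafted payload, while memoizing (rule position, payload offset) pairs bounds it. This is a generic illustration with hypothetical names, not Snort's rule evaluator or the paper's exact countermeasure.

```python
from functools import lru_cache

def make_matcher(contents, payload, memoize):
    """Check whether the content strings occur in order in the payload.
    Naive backtracking revisits the same (rule position, offset) pairs
    exponentially often on crafted payloads; memoizing them bounds the
    work.  Generic illustration only."""
    def search(i, offset):
        if i == len(contents):
            return True                      # every content was found in order
        pos = payload.find(contents[i], offset)
        while pos != -1:                     # backtrack over every occurrence
            if search(i + 1, pos + 1):
                return True
            pos = payload.find(contents[i], pos + 1)
        return False
    if memoize:
        search = lru_cache(maxsize=None)(search)   # memoization countermeasure
    return search

rules = ["a"] * 12 + ["Z"]     # crafted rule: many repeated contents, then a miss
payload = "a" * 40             # crafted payload: every position matches "a"
print(make_matcher(rules, payload, memoize=True)(0, 0))   # False, quickly
# With memoize=False the same call explores billions of orderings.
```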
- Copy Detection Systems for Digital Documents,
Douglas M. Campbell, Wendy R. Chen, Randy D. Smith,
Advances in Digital Libraries 2000: 78-88.
(pdf)
(abstract)
Partial or total duplication of document content is common
to large digital libraries. In this paper, we present a copy
detection system to automate the detection of duplication in
digital documents. The system we present is sentence-based
and makes three contributions: it proposes an intuitive
definition of similarity between documents; it produces the
distribution of overlap that exists between overlapping
documents; it is resistant to inaccuracy due to large
variations in document size. We report the results of several
experiments that illustrate the behavior and functionality of
the system.
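A heavily simplified sketch of the sentence-based idea in the abstract: hash each sentence and report the fraction of one document's sentences found in another. The sentence splitting, hashing, and overlap measure here are assumptions for illustration, not the system described in the paper.

```python
# Sketch of sentence-based overlap detection: hash each sentence and report
# what fraction of one document's sentences also appear in another.
import hashlib
import re

def sentence_hashes(text):
    sentences = re.split(r'[.!?]+\s*', text.lower())
    return {hashlib.sha1(s.strip().encode()).hexdigest()
            for s in sentences if s.strip()}

def overlap(doc_a, doc_b):
    a, b = sentence_hashes(doc_a), sentence_hashes(doc_b)
    return len(a & b) / len(a) if a else 0.0

original = "Copy detection matters. Digital libraries grow quickly. The end."
revised  = "Digital libraries grow quickly. Plagiarism is common. The end."
print(overlap(revised, original))   # 2 of 3 sentences shared -> about 0.67
```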
- Ontology-Based Extraction and Structuring of Information from Data-Rich
Unstructured Documents,
David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle,
CIKM 1998: 52-59.
(ps) (pdf)
(abstract)
We present a new approach to extracting information from
unstructured documents based on an application ontology that
describes a domain of interest. Starting with such an
ontology, we formulate rules to extract constants and context
keywords from unstructured documents. For each unstructured
document of interest, we extract its constants and keywords
and apply a recognizer to organize extracted constants as
attribute values of tuples in a generated database schema. To
make our approach general, we fix all the processes and change
only the ontological description for a different application
domain. In experiments we conducted on two different types of
unstructured documents taken from the Web, our approach
attained recall ratios in the 80% and 90% range and precision
ratios near 98%.
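A minimal sketch of the ontology-driven extraction idea: here the "ontology" is reduced to a mapping from attribute names to recognizer patterns, and retargeting to a new domain means swapping that mapping. The attributes, patterns, and the car-ad example are illustrative assumptions, not the paper's ontology language or recognizers.

```python
# Hedged sketch of ontology-driven constant extraction: the ontology below
# maps attribute names to recognizer patterns; only this mapping changes
# when the extractor is pointed at a different application domain.
import re

ontology = {
    "Year":  r"\b(19|20)\d{2}\b",
    "Price": r"\$\d[\d,]*",
    "Phone": r"\b\d{3}-\d{4}\b",
}

def extract_record(text, ontology):
    record = {}
    for attribute, pattern in ontology.items():
        match = re.search(pattern, text)
        record[attribute] = match.group(0) if match else None
    return record

ad = "1994 Honda Accord, excellent condition, $3,500, call 555-1234."
print(extract_record(ad, ontology))
# {'Year': '1994', 'Price': '$3,500', 'Phone': '555-1234'}
```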
- A Conceptual-Modeling Approach to Extracting Data from the Web,
David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle,
Yiu-Kai Ng, Dallan Quass, Randy D. Smith,
ER 1998: 78-91.
(ps) (pdf)
(abstract)
Electronically available data on the Web is exploding at an
ever increasing pace. Much of this data is unstructured,
which makes searching hard and traditional database querying
impossible. Many Web documents, however, contain an abundance
of recognizable constants that together describe the essence
of a document's content. For these kinds of data-rich
documents (e.g., advertisements, movie reviews, weather
reports, travel information, sports summaries, financial
statements, obituaries, and many others) we can apply a
conceptual-modeling approach to extract and structure
data. The approach is based on an ontology (a conceptual model
instance) that describes the data of interest, including
relationships, lexical appearance, and context keywords. By
parsing the ontology, we can automatically produce a database
scheme and recognizers for constants and keywords, and then
invoke routines to recognize and extract data from
unstructured documents and structure it according to the
generated database scheme. Experiments show that it is
possible to achieve good recall and precision ratios for
documents that are rich in recognizable constants and narrow
in ontological breadth.
Refereed Workshop Publications
- Detecting and Measuring Similarity in Code Clones,
Randy Smith and Susan Horwitz,
in the Third International Workshop on Detection of Software Clones (IWSC 2009),
(pdf)
(slides)
(abstract)
Most previous work on code-clone detection has focused on
finding identical clones, or clones that are identical up to
identifiers and literal values. However, it is often important to
find similar clones, too. One challenge is that the definition of
similarity depends on the context in which clones are being
found. Therefore, we propose new techniques for finding similar
code blocks and for quantifying their similarity. Our techniques
can be used to find clone clusters, sets of code blocks all
within a user-supplied similarity threshold of each
other. Also, given one code block, we can find all similar blocks
and present them rank-ordered by similarity. Our techniques have
been used in a clone-detection tool for C programs. The ideas
could also be incorporated in many existing clone-detection tools
to provide more flexibility in their definitions of similar
clones.
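A rough sketch of quantifying similarity between code blocks: tokenize each block and compare token sets with Jaccard similarity, keeping pairs above a user-supplied threshold. The paper's actual similarity measure and clustering differ; this only illustrates the "similar rather than identical clones" idea, and the blocks below are made up.

```python
# Hedged sketch of clone similarity: token-set Jaccard similarity with a
# user-supplied threshold.  Not the paper's metric.
import re
from itertools import combinations

def tokens(code):
    return set(re.findall(r"[A-Za-z_]\w*|\S", code))

def similarity(block_a, block_b):
    a, b = tokens(block_a), tokens(block_b)
    return len(a & b) / len(a | b) if a | b else 1.0

def similar_pairs(blocks, threshold):
    return [(i, j, round(similarity(blocks[i], blocks[j]), 2))
            for i, j in combinations(range(len(blocks)), 2)
            if similarity(blocks[i], blocks[j]) >= threshold]

blocks = ["for (i = 0; i < n; i++) sum += a[i];",
          "for (j = 0; j < n; j++) total += b[j];",
          "return strcmp(x, y) == 0;"]
print(similar_pairs(blocks, threshold=0.5))   # the two loops pair up
```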
Journal Publications
- Fast, memory-efficient regular expression matching with NFA-OBDDs,
Liu Yang, Rezwana Karim, Vinod Ganapathy, Randy Smith
Computer Networks 55: 3376-3393 (2011).
(pdf)
(abstract)
Modern network intrusion detection systems (NIDS) employ
regular expressions as attack signatures. Internally, NIDS
represent and operate these regular expressions as finite
automata. However, finite automata present a well-known
time/space tradeoff. Deterministic automata (DFAs) provide
fast matching, but DFAs for large signature sets often
consume gigabytes of memory because DFA combination
results in a multiplicative increase in the number of
states. Non-deterministic automata (NFAs) address this
problem and are space-efficient, but are several orders of
magnitude slower than DFAs. This time/space tradeoff has
motivated much recent research, primarily with a focus on
improving the space-efficiency of DFAs, often at the cost
of reducing their performance. We consider an alternative
approach that focuses instead on improving the
time-efficiency of NFA-based signature matching. NFAs are
inefficient because they maintain a frontier of multiple
states at any instant during their operation, each of
which must be processed for every input symbol. We
introduce NFA-OBDDs, which use ordered binary decision
diagrams (OBDDs) to efficiently process sets of NFA
frontier states. Experiments using HTTP and FTP signature
sets from Snort show that NFA-OBDDs can outperform
traditional NFAs by up to three orders of
magnitude, thereby making them competitive with a variant
of DFAs, while still retaining the space-efficiency of
NFAs.
- Speculative Parallel Pattern Matching,
Daniel Luchaup, Randy Smith, Cristian Estan, Somesh Jha
IEEE Transactions on Information Forensics and Security 6(2):438-451 (2011).
(pdf)
(abstract)
Intrusion prevention systems (IPSs) determine whether incoming
traffic matches a database of signatures, where each signature
is a regular expression and represents an attack or a
vulnerability. IPSs need to keep up with ever-increasing line
speeds, which has led to the use of custom hardware. A major
bottleneck that IPSs face is that they scan incoming packets
one byte at a time, which limits their throughput and
latency. In this paper, we present a method to search for
arbitrary regular expressions by scanning multiple bytes in
parallel using speculation. We break the packet into several
chunks, opportunistically scan them in parallel, and if the
speculation is wrong, correct it later. We present algorithms
that apply speculation in single-threaded software running on
commodity processors as well as algorithms for parallel
hardware. Experimental results show that speculation leads to
improvements in latency and throughput in both cases.
- Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages,
David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle,
Yiu-Kai Ng, Dallan Quass, Randy D. Smith,
Data Knowl. Eng. 31(3): 227-251 (1999).
(ps) (pdf)
(abstract)
Electronically available data on the Web is exploding at an
ever increasing pace. Much of this data is unstructured,
which makes searching hard and traditional database querying
impossible. Many Web documents, however, contain an abundance
of recognizable constants that together describe the essence
of a document's content. For these kinds of data-rich,
multiple-record documents (e.g. advertisements, movie reviews,
weather reports, travel information, sports summaries,
financial statements, obituaries, and many others) we can
apply a conceptual-modeling approach to extract and structure
data automatically. The approach is based on an ontology (a
conceptual model instance) that describes the data of interest,
including relationships, lexical appearance, and context
keywords. By parsing the ontology, we can automatically
produce a database scheme and recognizers for constants and
keywords, and then invoke routines to recognize and extract
data from unstructured documents and structure it according to
the generated database scheme. Experiments show that it is
possible to achieve good recall and precision ratios for
documents that are rich in recognizable constants and narrow
in ontological breadth. Our approach is less labor-intensive
than other approaches that manually or semiautomatically
generate wrappers, and it is generally insensitive to changes
in Web-page format.
Invited Papers
- Fast Signature Matching Using Extended Finite Automata (XFA),
Randy Smith, Cristian Estan, Somesh Jha, Ida Siahaan,
Fourth International Conference on Information Systems Security (ICISS), 2008.
Technical Reports
- To CMP or not to CMP: Analyzing Packet Classification on Modern and Traditional Parallel Architectures,
Randy Smith, Dan Gibson, Shijin Kong,
UW-CS Technical Report TR1652, 2009.
(pdf)
(abstract)
Packet classification is a central component of modern network functionality, yet satisfactory memory
usage and overall performance remain an elusive challenge at the highest speeds. The recent
emergence of chip multiprocessors and other low-cost, highly parallel processing hardware provides
a promising platform on which to realize increased classification performance. In this paper we analyze
the performance of packet classification in the context of parallel, shared-memory architectures. We begin
with two classic algorithms, Aggregated Bit Vector and HiCuts, and parallelize each of them multiple
ways. We discuss the tradeoffs of different architectures in the context of these algorithms, and we evaluate
the schemes on both chip multiprocessor (CMP) and symmetric multiprocessor (SMP) hardware.
Our experiments show that for CMPs, resource-sharing replaces synchronization scaling as the primary
speedup-limiting bottleneck. Further, while SMPs provide more processing power core-for-core, CMPs
nevertheless provide the best overall performance when all available execution contexts are employed.
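For context, the sketch below shows the basic bit-vector classification idea that ABV builds on: per-field sets of matching rules are intersected to classify a packet. It is sequential and unaggregated, so it reflects neither the report's ABV/HiCuts implementations nor their parallelizations; the rules and packets are made up.

```python
# Sketch of the bit-vector classification idea underlying ABV: for each
# header field, compute the set of rules whose predicate on that field
# matches, then intersect the per-field sets (real implementations keep
# these as precomputed bit vectors).

rules = [                          # (src_prefix, dst_port); None = wildcard
    ("10.0.", 80),
    ("10.0.", None),
    ("192.168.", 22),
]

def classify(packet):
    src, port = packet
    src_rules  = {i for i, (p, _) in enumerate(rules) if src.startswith(p)}
    port_rules = {i for i, (_, dp) in enumerate(rules) if dp is None or dp == port}
    matching = src_rules & port_rules            # AND of per-field "bit vectors"
    return min(matching) if matching else None   # lowest index = highest priority

print(classify(("10.0.3.7", 80)))     # 0
print(classify(("10.0.3.7", 443)))    # 1
print(classify(("172.16.0.1", 22)))   # None
```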
Posters
- To CMP or not to CMP: Analyzing
Packet Classification on Modern and Traditional Parallel Architectures (invited poster),
Randy Smith, Dan Gibson, Shijin Kong,
Third ACM/IEEE Symposium on Architectures for Networking and
Communications Systems
(ANCS 2007)
(pdf)
(abstract)
Packet classification is central to modern network
functionality, yet satisfactory memory usage and performance
remain elusive at the highest speeds. The recent emergence
of low-cost, highly parallel architectures provides a promising
platform on which to realize increased classification performance.
We analyze two classic algorithms (ABV and HiCuts)
in multiple parallel contexts. Our results show that
performance depends strongly on many factors, including
algorithm choice, hardware platform, and parallelization
scheme. We find that there is no clear "best solution," but in
the best cases hardware constraints are mitigated by the
parallelization scheme and vice versa, yielding near-linear
speedups as the degree of parallelization increases.
Theses
- Copy Detection Systems for Digital Documents,
Randy Smith,
Master's Thesis, Brigham Young University, 1999.
(pdf)
(abstract)
Partial or total duplication of document content is common
to large digital libraries. Duplication of content may arise
from mirrored data stored in multiple locations, multiple
submissions of a single document, revisions to existing
documents, and plagiarism of documents. In this thesis, we
develop a copy detection system to automate the detection of
duplication in digital documents.
The copy detection system we develop is sentence-based and
makes three contributions. It proposes an intuitive definition
of similarity between documents, it produces the distribution
of overlap that exists between overlapping documents, and it
is resistant to inaccuracy due to large variations in document
size.
We report the results of several experiments that
illustrate the behavior and functionality of the system.
As a part of the copy detection system, we also present a
technique for hashing English sentences that gives
near-perfect collision performance. Whereas perfect hashing
functions can be algorithmically created, they require that
the domain be static. Our method does not require static word
sets, and we have applied our function successfully to domain
sizes ranging into millions of sentences.
The hash values created are the concatenation of two
separate hashing functions. The purpose of the first function
is to capture a representative fingerprint of the entire
sentence, separating the collection of sentences by
similarity. This is accomplished using n-grams to approximate
the word usage, forming the fingerprint out of those that
occur least frequently. The second hash function's purpose is
to separate into distinct buckets those sentences whose
fingerprints coincide.
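A hedged sketch of the two-part sentence hash described above: the first part fingerprints a sentence by its rarest character n-grams (rarity judged against corpus counts), and the second part separates sentences whose fingerprints coincide. The n-gram length, fingerprint size, and hash functions here are assumptions for illustration, not the thesis's actual construction.

```python
# Illustrative two-part sentence hash: rare-n-gram fingerprint plus a
# tie-breaking hash of the whole sentence.
import hashlib
from collections import Counter

def ngrams(sentence, n=3):
    s = sentence.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def sentence_hash(sentence, corpus_counts, k=4):
    grams = ngrams(sentence)
    rarest = sorted(grams, key=lambda g: (corpus_counts[g], g))[:k]
    fingerprint = hashlib.md5("".join(rarest).encode()).hexdigest()[:8]
    tiebreak = hashlib.md5(sentence.lower().encode()).hexdigest()[:8]
    return fingerprint + tiebreak

corpus = ["copy detection systems find duplicated text",
          "digital libraries hold many documents"]
counts = Counter(g for s in corpus for g in ngrams(s))
print(sentence_hash("Copy detection systems find duplicated text", counts))
```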
- Toward Robust Network Payload Inspection
Randy Smith,
Ph.D. Dissertation, University of Wisconsin-Madison, 2009.
(pdf)
(abstract)
A Network Intrusion Detection System (NIDS) resides at
the edge of a network and is tasked with identifying and
removing all occurrences of malicious traffic traversing
the network. At its heart is a signature matching engine
that compares each byte of incoming and outgoing traffic
to signatures representing known vulnerabilities or
exploits and flags or drops traffic that matches a
signature.
Signature matching is fundamentally a language
recognition problem. Signatures are commonly represented
as regular expressions, which can be matched
simultaneously by finite automata. Unfortunately,
standard nondeterministic finite automata (NFAs) and
deterministic finite automata (DFAs) induce a space-time
tradeoff when combined and are unsuitable for NIDS
use. NFAs are small and compact but too slow, whereas DFAs
are fast but consume too much memory. Other alternatives
such as filtering induce a tradeoff between accuracy and
execution time.
We argue that the tradeoffs associated with signature
matching are not fundamental obstacles, but instead result
from limitations in existing models. We posit that with
richer, more complex matching models, we can devise
mechanisms that obviate these tradeoffs and have
acceptable memory and performance profiles.
In an analysis of an existing NIDS, we show that the use
of filters induces worst-case behavior that is six orders
of magnitude slower than the average case and can be
invoked by an adversary to enable perpetual evasion. We
present an enhanced model, with an acceptably small
increase in memory, that brings the worst case to within
one order of magnitude of the average case.
Next, for DFA-based matching we present a
first-principles characterization of the state-space
explosion and subsequent memory exhaustion that occurs
when DFAs are combined, and we give conditions that
eliminate it when satisfied. We then show that through the
careful inclusion of auxiliary state variables, we can
transform automata so that they satisfy these conditions
and do not lead to memory exhaustion. With this enriched
model, we achieve matching speeds approaching DFAs with
memory footprints similar to NFAs.
A NIDS operates in a constrained, adversarial
environment. This work makes contributions towards
enabling robust signature matching in such
environments.