|
|
Overview
My general interests lie in computer networks and network security.
Among other things, I am interested in the design and development of
algorithms and tools for enabling novel analysis and understanding
of network traffic. I employ techniques from Programming Languages
and other disciplines where possible to aid in my work.
As part of my dissertation work, I developed
the Extended Finite Automata (XFA) model and technique for compact,
memory-efficient regular-expression-based matching of network payloads.
This technique can be used, for example, by network intrusion detection
systems to perform signature matching at line-speeds while avoiding
memory blowup. This work is still ongoing, but you can read about XFAs
in my recent Sigcomm and
Oakland papers.
Publications
Refereed Conference Publications
- Multi-Byte Regular Expression Matching with Speculation,
Daniel Luchaup, Randy Smith, Cristian Estan, and Somesh Jha,
Recent Advances in Intrusion Detection (RAID) 2009>,
(to appear)
(abstract ∨∧)
Intrusion prevention systems (IPSs) determine whether incoming
traffic matches a database of signatures, where each signature in
the database represents an attack or a vulnerability. IPSs need
to keep up with ever-increasing line speeds, which leads to the
use of custom hardware and specialized programmable
architectures. A major bottleneck that IPSs face is that they
match incoming packets one byte at a time, which limits their
throughput and latency. In this paper, we present a method for
matching which processes multiple bytes in parallel using
speculation. We break the packet in several chunks,
opportunistically match them in parallel and if the speculation
is wrong, correct it later. We present algorithms that apply
speculation in single-threaded software running on commodity
processors as well algorithms for parallel hardware. Experimental
results show that speculation leads to improvements in latency
and throughput in both cases.
- Protocol Normalization using Attribute Grammars,
Drew Davidson, Randy Smith, Nic Doyle, and Somesh Jha,
14th European Symposium on Research in Computer Security
(ESORICS 2009)
(to appear)
- Evaluating GPUs for Network Packet Signature Matching,
Randy Smith, Neelam Goyal, Justin Ormont, Cristian Estan, and Karthikeyan Sankaralingam,
International Symposium on Performance Analysis of systems and Software (ISPASS)
(pdf)
(slides)
(abstract ∨∧)
Modern network devices employ deep packet inspection to
enable sophisticated services such as intrusion detection, traffic
shaping, and load balancing. At the heart of such services
is a signature matching engine that must match packet payloads
to multiple signatures at line rates. However, the recent
transition to complex regular-expression based signatures
coupled with ever-increasing network speeds has rapidly increased
the performance requirements of signature matching.
Solutions to meet these requirements range from hardwarecentric
ASIC/FPGA implementations to software implementations
using high-performance microprocessors.
In this paper, we propose a programmable signature matching
system prototyped on an Nvidia G80 GPU. We first present
a detailed architectural and microarchitectural analysis, showing
that signature matching is well suited for SIMD processing
because of regular control flow and parallelism available at
the packet level. Next, we examine two approaches for matching
signatures: standard deterministic finite automata (DFAs)
and extended finite automata (XFAs), which use far less memory
than DFAs but require specialized auxiliary memory and
small amounts of computation in most states. We implement a
fully functional prototype on the SIMD-based G80 GPU. This
system out-performs a Pentium4 by up to 9X and a Niagarabased
32-threaded system by up to 2.3X and shows that GPUs
are a promising candidate for signature matching.
- Efficient Signature Matching with Multiple Alphabet Compression Tables,
Shijin Kong, Randy Smith, Cristian Estan,
Securecomm 2008
(pdf)
(slides)
(abstract ∨∧)
Signature matching is a performance critical operation in
intrusion prevention systems. Modern systems express
signatures as regular expressions and use Deterministic Finite
Automata (DFAs) to efficiently match them against the input.
In principle, DFAs can be combined so that all signatures can
be examined in a single pass over the input. In practice,
however, combining DFAs corresponding to intrusion prevention
signatures results in memory requirements that far exceed
feasible sizes. We observe for such signatures that distinct
input symbols often have identical behavior in the DFA. In
these cases, an Alphabet Compression Table (ACT) can be used
to map such groups of symbols to a single symbol to reduce the
memory requirements.
In this paper, we explore the use of multiple alphabet
compression tables as a lightweight method for reducing the
memory requirements of DFAs. We evaluate this method on
signature sets used in Cisco IPS and Snort. Compared to
uncompressed DFAs, multiple ACTs achieve memory savings
between a factor of 4 and a factor of 70 at the cost of an
increase in run time that is typically between 35% and 85%.
Compared to another recent compression technique, D2FAs, ACTs
are between 2 and 3.5 times faster in software, and in some
cases use less than one tenth of the memory used by
D2FAs. Overall, for all signature sets and compression methods
evaluated, multiple ACTs offer the best memory versus run-time
trade-offs.
- Deflating the Big Bang: Fast and Scalable Deep Packet Inspection with Extended Finite Automata,
Randy Smith, Cristian Estan, Somesh Jha, Shijin Kong,
Sigcomm 2008
(pdf)
(slides)
(abstract ∨∧)
Deep packet inspection is playing an increasingly important
role in the design of novel network services. Regular
expressions are the language of choice for writing signatures,
but standard DFA or NFA representations are unsuitable for
high-speed environments, requiring too much memory, too much
time, or too much per-flow state. DFAs are fast and can be
readily combined, but doing so often leads to statespace
explosion. NFAs, while small, require large per-flow state and
are slow.
We propose a solution that simultaneously addresses all
these problems. We start with a first-principles
characterization of state-space explosion and give conditions
that eliminate it when satisfied. We show how auxiliary
variables can be used to transform automata so that they
satisfy these conditions, which we codify in a formal model
that augments DFAs with auxiliary variables and simple
instructions for manipulating them. Building on this model, we
present techniques, inspired by principles used in compiler
optimization, that systematically reduce runtime and per-flow
state. In our experiments, signature sets from Snort and
Cisco Systems achieve state-space reductions of over four
orders of magnitude, per-flow state reductions of up to a
factor of six, and runtimes that approach DFAs.
- XFA: Faster Signature Matching with Extended Automata,
Randy Smith, Cristian Estan, Somesh Jha,
2008 IEEE Symposium on Security and Privacy (Oakland 2008)
(pdf)
(slides)
(abstract ∨∧)
Automata-based representations and related algorithms have
been applied to address several problems in information
security, and often the automata had to be augmented with
additional information. For example, extended finite-state
automata (EFSA) augment finitestate automata (FSA) with
variables to track dependencies between arguments of system
calls. In this paper, we introduce extended finite automata
(XFAs) which augment FSAs with finite scratch memory and
instructions to manipulate this memory. Our primary motivation
for introducing XFAs is signature matching in Network
Intrusion Detection Systems (NIDS). Representing NIDS
signatures as deterministic finite-state automata (DFAs)
results in very fast signature matching but for several
classes of signatures DFAs can blowup in space. Using
nondeterministic finite-state automata (NFA) to represent NIDS
signatures results in a succinct representation but at the
expense of higher time complexity for signature matching. In
other words, DFAs are time-efficient but space-inefficient,
and NFAs are spaceefficient but time-inefficient. In our
experiments we have noticed that for a large class of NIDS
signatures XFAs have time complexity similar to DFAs and space
complexity similar to NFAs. For our test set, XFAs use 10
times less memory than a DFA-based solution, yet achieve 20
times higher matching speeds.
- Backtracking Algorithmic Complexity Attacks Against a NIDS,
Randy Smith, Cristian Estan, and Somesh Jha,
Twenty-Second Annual Computer Security Applications Conference
(ACSAC 2006)
(pdf)
(slides)
(abstract ∨∧)
Best Paper award winner!
Network Intrusion Detection Systems (NIDS) have become
crucial to securing modern networks. To be effective, a NIDS
must be able to counter evasion attempts and operate at or
near wire-speed. Failure to do so allows malicious packets to
slip through a NIDS undetected. In this paper, we explore NIDS
evasion through algorithmic complexity attacks. We present a
highly effective attack against the Snort NIDS, and we provide
a practical algorithmic solution that successfully thwarts the
attack. This attack exploits the behavior of rule matching,
yielding inspection times that are up to 1.5 million times
slower than that of benign packets. Our analysis shows that
this attack is applicable to many rules in Snortâs ruleset,
rendering vulnerable the thousands of networks protected by
it. Our countermeasure confines the inspection time to within
one order of magnitude of benign packets. Experimental results
using a live system show that an attacker needs only 4.0 kbps
of bandwidth to perpetually disable an unmodified NIDS,
whereas all intrusions are detected when our countermeasure is
used.
This work was highlighted in an article on
crn.com (DDJ also carried the article). Bugtraq information is available.
-
Copy Detection Systems for Digital Documents,
Douglas M. Campbell, Wendy R. Chen, Randy D. Smith,
Advances in Digital Libraries 2000: 78-88.
(pdf)
(abstract ∨∧)
Partial or total duplication of document content is common
to large digital libraries. In this paper, we present a copy
detection system to automate the detection of duplication in
digital documents. The system we present is sentence-based
and makes three contributions: it proposes an intuitive
definition of similarity between documents; it produces the
distribution of overlap that exists between overlapping
documents; it is resistant to inaccuracy due to large
variations in document size. We report the results of several
experiments that illustrate the behavior and functionality of
the system.
- Ontology-Based Extraction and Structuring of Information from Data-Rich
Unstructured Documents,
David W. Embley, Douglas M. Campbell, Randy D. Smith, Stephen W. Liddle,
CIKM 1998: 52-59.
(ps) (pdf)
(abstract ∨∧)
We present a new approach to extracting information from
unstructured documents based on an application ontology that
describes a domain of interest. Starting with such an
ontology, we formulate rules to extract constants and context
keywords from unstructured documents. For each unstructured
document of interest, we extract its constants and keywords
and apply a recognizer to organize extracted constants as
attribute values of tuples in a generated database schema. To
make our approach general, we fix all the processes and change
only the ontological description for a different application
domain. In experiments we conducted on two different types of
unstructured documents taken from the Web, our approach
attained recall ratios in the 80% and 90% range and precision
ratios near 98%.
-
A Conceptual-Modeling Approach to Extracting Data from the Web,
David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle,
Yiu-Kai Ng, Dallan Quass, Randy D. Smith,
ER 1998: 78-91.
(ps) (pdf)
(abstract ∨∧)
Electronically available data on the Web is exploding at an
ever increasing pace. Much of this data is unstructured,
which makes searching hard and traditional database querying
impossible. Many Web documents, however, contain an abundance
of recognizable constants that together describe the essence
of a documentâs content. For these kinds of data-rich
documents (e.g., advertisements, movie reviews, weather
reports, travel information, sports summaries, financial
statements, obituaries, and many others) we can apply a
conceptual-modeling approach to extract and structure
data. The approach is based on an ontologyâa conceptual model
instanceâthat describes the data of interest, including
relationships, lexical appearance, and context keywords. By
parsing the ontology, we can automatically produce a database
scheme and recognizers for constants and keywords, and then
invoke routines to recognize and extract data from
unstructured documents and structure it according to the
generated database scheme. Experiments show that it is
possible to achieve good recall and precision ratios for
documents that are rich in recognizable constants and narrow
in ontological breadth.
Refereed Workshop Publications
-
Detecting and Measuring Similarity in Code Clones,
Randy Smith and Susan Horwitz,
in the Third International Workshop on Detection of Software Clones (IWSC 2009),
(pdf)
(slides)
(abstract ∨∧)
Most previous work on code-clone detection has focused on
finding identical clones, or clones that are identical up to
identifiers and literal values. However, it is often important to
find similar clones, too. One challenge is that the definition of
similarity depends on the context in which clones are being
found. Therefore, we propose new techniques for finding similar
code blocks and for quantifying their similarity. Our techniques
can be used to find clone clusters, sets of code blocks all
within a user-supplied similarity threshold of each
other. Also, given one code block, we can find all similar blocks
and present them rank-ordered by similarity. Our techniques have
been used in a clone- detection tool for C programs. The ideas
could also be incorporated in many existing clone-detection tools
to provide more flexibility in their definitions of similar
clones.
Journal Publications
-
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages,
David W. Embley, Douglas M. Campbell, Y. S. Jiang, Stephen W. Liddle,
Yiu-Kai Ng, Dallan Quass, Randy D. Smith,
Data Knowl. Eng. 31(3): 227-251 (1999).
(ps) (pdf)
(abstract ∨∧)
Electronically available data on the Web is exploding at an
ever increasing pace. Much of this data is unstructured,
which makes searching hard and traditional database querying
impossible. Many Web documents, however, contain an abundance
of recognizable constants that together describe the essence
of a documentâs content. For these kinds of data-rich,
multiple-record documents (e.g. advertisements, movie reviews,
weather reports, travel information, sports summaries,
financial statements, obituaries, and many others) we can
apply a conceptual-modeling approach to extract and structure
data automatically. The approach is based on an ontologyâa
conceptual model instanceâthat describes the data of interest,
including relationships, lexical appearance, and context
keywords. By parsing the ontology, we can automatically
produce a database scheme and recognizers for constants and
keywords, and then invoke routines to recognize and extract
data from unstructured documents and structure it according to
the generated database scheme. Experiments show that it is
possible to achieve good recall and precision ratios for
documents that are rich in recognizable constants and narrow
in ontological breadth. Our approach is less labor-intensive
than other approaches that manually or semiautomatically
generate wrappers, and it is generally insensitive to changes
in Web-page format.
Invited Papers
- Fast Signature Matching Using Extended Finite Automata (XFA),
Randy Smith, Cristian Estan, Somesh Jha, Ida Siahaan,
Fourth International Conference on Information System Security (ICISS), 2008.
Technical Reports
- To CMP or not to CMP: Analyzing Packet Classification on Modern and Traditional Parallel Architectures,
Randy Smith, Dan Gibson, Shijin Kong,
UW-CS Technical Report TR1652, 2009.
(pdf)
(abstract ∨∧)
Packet classification is a central component of modern network functionality, yet satisfactory memory
usage and overall performance remains an elusive challenge at the highest speeds. The recent
emergence of chip multiprocessors and other low-cost, highly parallel processing hardware provides
a promising platform on which to realize increased classification performance. In this paper we analyze
the performance of packet classification in the context of parallel, shared-memory architectures. We begin
with two classic algorithmsâAggregated Bit Vector and HiCutsâand parallelize each of them multiple
ways. We discuss the tradeoffs of different architectures in the context of these algorithms, and we evaluate
the schemes on both chip multiprocessor (CMP) and symmetric multiprocessor (SMP) hardware.
Our experiments show that for CMPs, resource-sharing replaces synchronization scaling as the primary
speedup-limiting bottleneck. Further, while SMPs provide more processing power core-for-core, CMPs
nevertheless provide the best overall performance when all available execution contexts are employed.
Posters
- To CMP or not to CMP: Analyzing
Packet Classification on Modern and Traditional Parallel Architectures (invited poster),
Randy Smith, Dan Gibson, Shijin Kong,
Third ACM/IEEE Symposium on Architectures for Networking and
Communications Systems
(ANCS 2007)
(pdf)
(abstract ∨∧)
Packet classification is central to modern network
function- ality, yet satisfactory memory usage and performance
re- mains elusive at the highest speeds. The recent emergence
of low-cost, highly parallel architectures provides a promis-
ing platform on which to realize increased classification per-
formance. We analyze two classic algorithms (ABV and HiCuts)
in multiple parallel contexts. Our results show that
performance depends strongly on many factors, includ- ing
algorithm choice, hardware platform, and parallelization
scheme. We find that there is no clear âbest solution,â but in
the best cases hardware constraints are mitigated by the
parallelization scheme and vice versa, yielding near-linear
speedups as the degree of parallelization increases.
Theses
- Copy Detection Systems for Digital Documents,
Randy Smith,
Master's Thesis, Brigham Young University, 1999.
(pdf)
(abstract ∨∧)
Partial or total duplication of document content is common
to large digital libraries. Duplication of content may arise
from mirrored data stored in multiple locations, multiple
submissions of a single document, revisions to existing
documents, and plagiarism of documents. In this thesis, we
develop a copy detection system to automate the detection of
duplication in digital documents.
The copy detection system we develop is sentence-based and
makes three contributions. It proposes an intuitive definition
of similarity between documents, it produces the distribution
of overlap that exists between overlapping documents, and it
is resistant to inaccuracy due to large variations in document
size.
We report the results of several experiments that
illustrate the behavior and functionality of the system.
As a part of the copy detection system, we also present a
technique for hashing English sentences that gives
near-perfect collision performance. Whereas perfect hashing
functions can be algorithmically created, they require that
the domain be static. Our method does not require static word
sets, and we have applied our function successfully to domain
sizes ranging into millions of sentences.
The hash values created are the concatenation of two
separate hashing functions. The purpose of the first function
is to capture a representative fingerprint of the entire
sentence, separating the collection of sentences by
similarity. This is accomplished using ngrams to approximate
the word usage, forming the fingerprint out of those that
occur least frequently. The second hash function's purpose is
to separate into distinct buckets those sentences whose
fingerprints coincide.
|
|
|