Commit 2188f049 authored by Xi Yang

Check in the draft of the introduction chapter.

parent 123a4b48
......@@ -2,40 +2,38 @@
\addcontentsline{toc}{chapter}{Abstract}
\vspace{-1em}
%context: connection between profiling and optimizations.
To understand and improve application performance, developers spend a lot of
time on observing problematic runtime behaviors of computer
applications. Understanding the root causes of the performance problems is just
To understand and improve software performance, developers spend a lot of
time on observing problematic run-time behaviors of computer
applications. Understanding root causes of performance problems is
the first step to improving performance. The state of the art is continuous
sample-based profiling. These systems periodically sample software and hardware
events and analyze samples to help developers discover opportunities for
improvements. As for any sampling process, its fidelity depends on the
sample rate. The higher sample rates profilers can achieve, then the finer-grain
behaviors they can observe. The sampling rate gates software and hardware
sample rate. The higher the sample rates profilers can achieve, the
finer-grained the behaviors they can observe. Thus, sample rates gate software and hardware
innovations. If you cannot observe it, you cannot improve it.
%problem:
Unfortunately, the sampling rate of current profilers is too low. Despite the GHz
Unfortunately, the sampling rate of current profilers is too low. Despite the gigahertz
speeds of modern processors, the sampling frequency of
current continuous profilers has been at a standstill --- between 1 KHz and 100
KHz. The Million cycles gap between two sequential samples prevents the
current continuous profilers has been at a standstill, between 1\,KHz and
100\,KHz. The million-cycle gap between two sequential samples prevents the
profilers from observing fine-grain behaviors, so they miss many optimization
opportunities.
opportunities. The history of science shows that an order of magnitude or more improvement in
measurement fidelity has always led to fundamental new discoveries.
The history of science shows that an order of magnitude or more improvement in
measurement fidelity always leads to fundamental new
discoveries. My thesis is that increasing the sample rates of continuous
hardware and software profilers by orders of magnitude will lead to fundamental discoveries of new
behaviors, accurate root causes, and sound optimizations.
%Block lock
\textbf{My thesis is that sample rates of continuous hardware and software profilers can be increased by orders of magnitude, leading
to fundamental discoveries of new behaviors, accurate root causes, and sound optimizations.}
%contributions
The key contributions of this thesis are: 1) \shim, a continuous profiler that samples
at resolutions as fine as 15 cycles, three to five orders of magnitude finer
than current continuous profilers. 2) Tailer, a tail latency analyzer that
than current continuous profilers. 2) Tailor, a tail latency analyzer that
profiles web services using the SHIM profiler and analyzes the root causes of long
tail latency. 3) Elfen, a job scheduler based on the sampling ideas in SHIM, that
borrows idle cycles from underutilized SMT cores for batch workloads without
......@@ -43,10 +41,9 @@ interfering with latency-critical requests.
%significance
My thesis fundamentally alters what software and hardware signals are observable on existing systems, develops
My thesis fundamentally alters which software and hardware signals are observable on existing systems, develops
new approaches to analyzing large volumes of profiling data, and
demonstrates that high frequency profiling has the potential to stimulate new software and
hardware optimizations.
demonstrates that high-frequency profiling stimulates new software and hardware optimizations.
%%% Local Variables:
%%% mode: latex
......
\chapter{Elfen}
\label{cha:elfen}
\section{Summary}
\chapter{Introduction}
\label{cha:intro}
%point out the problem
This thesis addresses the challenge of
significantly increasing sample rates of continuous hardware and software
profilers and the opportunities of using the new profiling technique to discover
new optimizations.
\section{Problem Statement}
\label{sec:problemstatement}
\section{Thesis Statement}
\label{sec:thesisstatement}
I believe A is better than B.
%observing and optimizations
Optimizing software is challenging for developers but also extremely rewarding. To design
sound optimizations, developers have to understand the root causes of performance
problems; otherwise, imagined optimizations rarely work. Most of the time, the
root causes are not obvious. The better developers can
observe the behavior of software and hardware, the more easily they can discover hidden root
causes.
\section{Introduction}
\label{sec:problemstatement}
Put your introduction here. You could use \textbackslash fix\{ABCDEFG.\} to
leave your comments, see the box at the left side. \fix{You have to rewrite your
thesis!!!}
%higher sampling rate -> observing better
Among observing techniques, sample-based continuous profiling that periodically
samples hardware and software events is the most popular
and convenient one. Developers pose hypotheses and configure these profilers to sample interesting events
expecting that sequential samples reveal signals of problematic run-time
behaviors, and that correlations between the events provide information about
where root causes hide. By the Nyquist-Shannon sampling theorem, to
recover a continuous signal whose highest frequency component is $B$, the
signal has to be sampled at a rate greater than $2B$. Many interesting software and hardware behaviors occur at extremely
high frequencies; for example, the currently executing function may change every few
hundred cycles. To observe these behaviors directly, developers need a profiler with a correspondingly high sample rate.
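As a rough worked example of this bound (the 300-cycle figure is an
illustrative assumption, not a measurement): suppose the executing function
changes about once every $T = 300$ cycles, and treat $B = 1/T$ as the highest
frequency of interest. The sampling frequency $f_s$ and sampling period $T_s$
then have to satisfy
\[
f_s > 2B = \frac{2}{T}
\quad\Longrightarrow\quad
T_s = \frac{1}{f_s} < \frac{T}{2} = 150~\mathrm{cycles},
\]
so consecutive samples would have to be fewer than 150 cycles apart.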
%limitation of current profilers
In the last four decades, following Moore's law, processor frequencies have increased
from megahertz to gigahertz, and software and hardware have become
increasingly complex. However, Moore's law has had little impact on
continuous profiling. Today, on gigahertz processors, the sampling frequency
of state-of-the-art continuous profiling tools, such as Intel VTune and Linux
perf, is 1\,KHz by default and can be tuned up to at most 100\,KHz. Such low sampling frequencies
give developers a low-fidelity view that blinds them to root causes. We
use an example to demonstrate that improving sample rates by orders of magnitude gives developers
a high-fidelity view that is fundamentally better than the view at low sampling rates.
Variation in instructions per cycle (IPC) is an important behavior
for developers to understand how software utilizes CPU resources. Figure~\ref{fig:ipcTimeline}
shows IPC timelines for the same short period of the \lusearch benchmark at three
sampling frequencies: 1\,KHz, 100\,KHz, and 10\,MHz, on a 4-way 3.4\,GHz Intel
i7-4700K processor. For each sample, two hardware performance counters, the number of cycles
and the number of retired instructions, are read, and the average IPC of each sampling period is calculated from two consecutive samples. Sampling at 100\,KHz shows IPC varying slowly over time,
whereas sampling at 10\,MHz reveals substantial high-frequency variations in IPC.
\begin{figure}
\includegraphics[width=\columnwidth]{./figs/intro-IPC-timeline.pdf}
\caption{IPC timeline for Lusearch. Sampling at 10\,MHz exposes behavior unseen by existing profilers (red, blue).}
\label{fig:ipcTimeline}
\end{figure}
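The IPC calculation described above can be sketched as follows. This is a
minimal illustration in Python, not the measurement code used for the figure:
the function names and the counter readings are hypothetical, and each reading
is assumed to be a cumulative (cycles, retired instructions) pair.

```python
def ipc_timeline(samples):
    """Average IPC of each sampling period, computed from consecutive
    cumulative (cycles, retired_instructions) counter readings."""
    timeline = []
    for (c0, i0), (c1, i1) in zip(samples, samples[1:]):
        timeline.append((i1 - i0) / (c1 - c0))
    return timeline


def downsample(samples, factor):
    """Keep every `factor`-th reading, emulating a lower sampling rate."""
    return samples[::factor]


def cycles_per_sample(clock_hz, sample_hz):
    """Number of processor cycles that elapse between two samples."""
    return clock_hz // sample_hz


# Hypothetical readings taken every 100 cycles.
readings = [(0, 0), (100, 200), (200, 250), (300, 450), (400, 500)]

fine = ipc_timeline(readings)                   # one IPC value per 100 cycles
coarse = ipc_timeline(downsample(readings, 4))  # one IPC value per 400 cycles
```

In this toy input the fine-grained timeline alternates between an IPC of 2.0
and 0.5, while the coarse timeline reports a single average of 1.25, which is
the kind of detail that a low sampling rate averages away. On a 3.4\,GHz clock,
the cycle-gap helper gives 3.4 million cycles between samples at 1\,KHz and 340
cycles at 10\,MHz.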
\section{Opportunities and Contributions}
From the discovery of bacteria to the identification of the structure of DNA, countless examples in
the history of science show that improvements in measurement techniques always unlock
fundamental new discoveries. The aim of this thesis is to demonstrate that this is true for
continuous profiling.
This thesis presents \shim, a continuous profiler that samples at resolutions orders
of magnitude finer than existing profilers, and two optimizations based on
\shim: (1) \elfen, a job scheduler that significantly improves datacenter
utilization, and (2) \tailor, a tail latency analyzer that helps developers diagnose and eliminate long tail latency problems in web services.
\subsection{SHIM}
%problem
The sample rates of existing profiling tools are low because they take
an interrupt and then sample software and hardware events~\citep{vtune:intel, perf:wiki}. The
possibility of overwhelming the kernel's capacity to service interrupts places
practical limits on their maximum resolution~\citep{perfwarn:source}. This interrupt-based
sampling technique is deeply rooted in single-core processors. Today,
multicore processors are everywhere. They provide an opportunity to remove this
limitation by freeing sampling from interrupts. On multicore processors, an observer thread on one
core can observe another core's software and hardware events directly without
sending an interrupt to the observed core; thus, the sampling
frequency is limited only by how fast the observer thread reads the events.
In this thesis, we present \shim, a continuous profiler that samples at resolutions
as fine as 15 cycles, orders of magnitude finer than existing profilers. A \shim
observer thread executes simultaneously with the application thread it
observes, but on a separate hardware context, exploiting unutilized hardware on a different
core or on the same core with Simultaneous Multithreading (SMT).
Instead of using interrupts or inserting instrumentation, which
substantially perturb applications, \shim efficiently observes
events in a separate thread by simply reading hardware counters and
memory locations. \shim views time-varying software and hardware events as
\emph{signals}. A \shim observer thread reads hardware
performance counter signals and memory locations that store software signals
(e.g., method and loop identifiers). \shim treats software and hardware
data uniformly by reading (sampling) memory locations and
performance counters together at very high frequencies, e.g., 10s to 1000s of
cycles. The observer thread logs and/or aggregates
signals. Further online or offline analysis acts on this
data. The sampling frequency of \shim is not limited by interrupt handling, but
only by reading performance counters and memory locations.
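The structure of an observer thread can be sketched as below. This is a
hypothetical illustration in Python, not the \shim implementation: the class
and function names are invented for the example, only a software signal (a
method identifier stored in memory) is sampled, and Python threads share an
interpreter lock, so the two threads do not truly run simultaneously on
separate hardware contexts as \shim's do.

```python
import threading
import time
from collections import Counter


class SharedSignal:
    """A memory location the application writes and the observer reads."""

    def __init__(self):
        self.method_id = 0


def application(signal, done):
    # Hypothetical workload: "enter" methods 1, 2 and 3 in turn.
    for method_id in (1, 2, 3):
        signal.method_id = method_id
        time.sleep(0.001)
    done.set()


def observer(signal, done, histogram):
    # Poll the shared location in a tight loop and aggregate what is seen.
    # The observer sends no interrupt or signal to the application thread.
    while True:
        histogram[signal.method_id] += 1
        if done.is_set():
            break


def profile():
    signal, done, histogram = SharedSignal(), threading.Event(), Counter()
    obs = threading.Thread(target=observer, args=(signal, done, histogram))
    app = threading.Thread(target=application, args=(signal, done))
    obs.start()
    app.start()
    app.join()
    obs.join()
    return histogram
```

Calling the profile function returns a histogram of how often each method
identifier was observed. The sample count depends only on how fast the observer
loop runs, which mirrors the point that the sampling frequency is bounded by
read speed rather than by interrupt handling.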
% Just as in signal processing, high frequency sampling of rates, such as IPC,
% is subject to noise. For instance, slow reads or interrupts may disturb
% the sampling measurement code and thus skew results at
% extremely high frequencies,
% To improve the fidelity of rate
% metrics, we introduce \emph{double-time error correction}
% (\dte), which automatically identifies and discards noisy samples by
% taking redundant timing measurements. \dte separately measures the
% period between the start of two consecutive samples and the period
% between the end of the samples. It uses the global clock as the ground
% truth. Both measurements observe the same application period and one of the
% two sampling periods. Since each sampling period is a fixed number of instructions,
% code, all sampling periods should take the same amount of time. Thus if the periods differ, the measurement was perturbed and \dte
% discards it.
\subsection{Elfen}
%problem
Web services from search to games to stock trading impose strict
Service Level Objectives (SLOs) on tail latency. Meeting these
objectives is challenging because the computational demand of each request is
highly variable and load is bursty. % and diurnal.
Consequently, many servers run at
low utilization (10 to 45\%); turn off simultaneous
multithreading (SMT); and execute only a single service --- wasting
hardware, energy, and money. Although co-running batch jobs with
latency critical % web service
\emph{requests} to utilize multiple SMT hardware contexts (lanes) is
appealing, unmitigated sharing of core resources induces
non-linear effects on tail latency and SLO violations.
Since interactive services are widely deployed in many
datacenters, their poor utilization incurs enormous commensurate
capital and operating costs. Even small improvements
substantially improve profitability.
%elfen
In this thesis, we introduce principled borrowing, a scheduling technique based
on \shim, that dynamically identifies idle cycles and
runs batch workloads by borrowing hardware resources from latency-critical
workloads without violating SLOs. The
\emph{\Elfen} scheduler executes secondary batch threads
in a reserved SMT \emph{batch lane} mutually exclusively with latency-critical primary
requests which execute in a distinct SMT \emph{request lane}. We instrument
batch threads with \shim sampling instructions that continuously monitor
paired request lanes. Batch threads start to execute only when their paired request lane is idle, quickly
stepping out of the way when the budget is exhausted.
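The mutual exclusion between the two lanes can be illustrated with a small
simulation. This is an idealized sketch in Python under stated assumptions, not
the \Elfen scheduler: time is divided into discrete slots, the batch lane is
assumed to detect idleness instantly, budgets are not modeled, and the
function names are hypothetical.

```python
def schedule_batch(request_busy):
    """For each time slot, the batch lane runs only if the paired
    request lane is idle in that slot."""
    return [not busy for busy in request_busy]


def core_utilization(request_busy, batch_busy):
    """Fraction of time slots in which at least one lane does work."""
    used = sum(1 for r, b in zip(request_busy, batch_busy) if r or b)
    return used / len(request_busy)


# Hypothetical request-lane occupancy over four time slots.
request_busy = [True, False, False, True]
batch_busy = schedule_batch(request_busy)

idle_only = core_utilization(request_busy, [False] * len(request_busy))
with_batch = core_utilization(request_busy, batch_busy)
```

In this toy trace the batch lane fills the two idle slots, so the two lanes
never run in the same slot and core utilization rises from 0.5 to 1.0.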
\subsection{Tailor}
%problem
\section{Thesis Outline}
\section{Thesis Structure}
\label{sec:outline}
How many chapters you have? You may have Chapter~\ref{cha:background},
Chapter~\ref{cha:design}, Chapter~\ref{cha:methodology},
Chapter~\ref{cha:result}, and Chapter~\ref{cha:conc}.
The body of this thesis is structured around the three key contributions
outlined above. Chapter~\ref{cha:shim} explains the \shim
profiler, Chapter~\ref{cha:elfen} discusses the \elfen scheduler, and
Chapter~\ref{cha:tailor} introduces the Tailor analyzer.
Finally, Chapter~\ref{cha:conc} concludes the thesis, summarizing how my
contributions address the challenge of substantially improving continuous
profiling and the opportunities of discovering new optimizations, and outlining
future work.
......@@ -19,16 +19,57 @@
% Your macros %
\newcommand{\ttool}{\textsc{Shim}\xspace}
\newcommand{\tool}{\textsc{Shim}\xspace}
\newcommand{\Nnap}{Nanonap\xspace}
\newcommand{\nnap}{\textcode{nanonap}\xspace}
\newcommand{\nnaps}{\textcode{nanonaps}\xspace}
\newcommand{\shim}{\textsc{Shim}\xspace}
\newcommand{\elfen}{\textsc{Elfen}\xspace}
\newcommand{\Elfen}{\textsc{Elfen}\xspace}
\newcommand{\relay}{\textsc{Elfen}\xspace}
\newcommand{\Relay}{\textsc{Elfen}\xspace}
\newcommand{\oursystem}{\textsc{Elfen}\xspace}
\newcommand{\shimbatch}{\textsc{Elfen}\xspace}
\newcommand{\dte}{\textsc{DTE}\xspace} % DTR: double-time
% noise reduction
\newcommand{\anr}{\textsc{DTE}\xspace} %: automatic noise reduction (until we think of something better).
\newcommand{\anr}{\textsc{DTE}\xspace} %: automatic noise reduction (until we
%think of something better).
\newcommand{\tailor}{\textsc{Tailor}\xspace}
\newcommand*{\textbm}[1]{\textsf{#1}}
\newcommand{\specjvm}{\textbm{SPECjvm}\xspace}
\newcommand{\jess}{\textbm{jess}\xspace}
\newcommand{\raytrace}{\textbm{raytrace}\xspace}
\newcommand{\db}{\textbm{db}\xspace}
\newcommand{\javac}{\textbm{javac}\xspace}
\newcommand{\jack}{\textbm{jack}\xspace}
\newcommand{\compress}{\textbm{compress}\xspace}
\newcommand{\mpegaudio}{\textbm{mpegaudio}\xspace}
\newcommand{\mtrt}{\textbm{mtrt}\xspace}
\newcommand{\jbb}{\textbm{jbb2000}\xspace}
\newcommand{\specjbb}{\textbm{SPECjbb2000}\xspace}
\newcommand{\newspecjbb}{\textbm{SPECjbb2005}\xspace}
\newcommand{\psjbb}{\textbm{pseudojbb}\xspace}
\newcommand{\pjbb}{\textbm{pjbb2005}\xspace}
\newcommand{\dacapo}{\textbm{DaCapo}\xspace}
\newcommand{\dacapover}{\textbm{DaCapo v.06-10-MR2}\xspace}
\newcommand{\antlr}{\textbm{antlr}\xspace}
\newcommand{\bloat}{\textbm{bloat}\xspace}
\newcommand{\eclipse}{\textbm{eclipse}\xspace}
\newcommand{\fop}{\textbm{fop}\xspace}
\newcommand{\hsqldb}{\textbm{hsqldb}\xspace}
\newcommand{\jython}{\textbm{jython}\xspace}
\newcommand{\pmd}{\textbm{pmd}\xspace}
\newcommand{\xalan}{\textbm{xalan}\xspace}
\newcommand{\chart}{\textbm{chart}\xspace}
\newcommand{\luindex}{\textbm{luindex}\xspace}
\newcommand{\lusearch}{\textbm{lusearch}\xspace}
\newcommand{\lusearchfix}{\textbm{lusearch-fix}\xspace}
\newcommand{\avrora}{\textbm{avrora}\xspace}
\newcommand{\sunflow}{\textbm{sunflow}\xspace}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Here's an example abbreviation macro. Notice the \xspace at the end, to ensure
% it gets the right spacing after it.
\newcommand{\jikesrvm}{Jikes RVM\xspace}
% We can also make things sans-serif, which might be better style.
\newcommand{\avrora}{\textsf{avrora}\xspace}
\chapter{SHIM}
\label{cha:shim}
\section{Summary}
\chapter{Tailor}
\label{cha:tailor}
\section{Summary}
......@@ -63,10 +63,9 @@
\input{intro}
%% Chapters
\input{background}
\input{design}
\input{methodology}
\input{results}
\input{shim}
\input{elfen}
\input{tailor}
\input{conclusion}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
......