I think I’ve finally figured out what bugs me about Agile software
development, or whatever it’s called when it’s institutionalized at a
company: the fact that it says nothing about the actual design of
software systems, and that because of that, companies ignore design,
and because of that, teams get better and better at adding less and
less value and companies get better and better at expecting less and
less of their development teams.
Agile is a tactical approach to managing the day-to-day work of a
development team: breaking down a project into user/customer centric
tasks (which deliver customer value, let’s say), figuring out how much
those tasks cost, prioritizing, perhaps working out dependencies, and
then making a commitment for a given iteration. It’s also about
managing expectations about what a team can actually accomplish.
Software development is iterative, but businesses are waterfall in the
sense that they conceive, plan, develop and then deploy a product in a
nice straight line, then do it all over again depending on how the
market responds. (A series of waterfalls, which is kind of iterative,
isn’t it? I tend to think “waterfall” is a myth at worst, a metaphor
at best for something that doesn’t exist, but is useful for a
fear-and-doubt kind of argument.)
Anyway, an Agile process allows a given team to quantify just how
difficult or time consuming a software project is, and to track in
objective-seeming metrics how scope creep affects progress, and so
on. Stuff like that. I’m not so interested in the details of anyone’s
specific implementation of the Agile way.
What’s not part of Agile is design.
The design of software systems.
I’m not talking about colors, whitespace, borders, gradients, flat or
skeuomorphic or even the user experience.
I’m talking about how a complex software system actually works: its
implementation. Is it a monolithic app with many separately built
components? Is it a message-based distributed system? Does it use a
central data store, or small, domain-specific, distributed data
caches? Can it adapt to change without any one person understanding
the whole? Does it account for what you don’t know you don’t know
about the problem to be solved? Can lots of developers work in
parallel, or is any given feature tied down by a huge dependency chain?
Agile doesn’t speak to this part of the problem, and I don’t think
anyone who understands Agile thinks it does.
How does having a daily stand-up prevent you from building technical debt?
How does a sprint review facilitate a design that mere mortals can understand?
How does sprint planning make a project more maintainable?
Absurd questions, you might think, because Agile is about
customer-focussed stories, not about implementation details.
But if Agile says nothing about implementation details, what does an
agile team do about, well, the implementation details they have to
have in order to realize the stories?
In my experience, there are two problems to be solved for a software
system: the implementation design and architecture, focussed on the
ultimate business goals and on enabling humans to continue to work on
it (i.e., long term strategy), and team management for working on that
design within a larger organization (short term tactics).
The problem I have is that people think Agile addresses the first of
these problems, the architecture, when it does nothing of the
sort. Or, perhaps more accurately, folks think that Agile addresses
all their software problems, so they “bottom line” Agile as their
solution and stop there.
Thus, sane implementation falls on the floor. Any long term view is
trumped by short-term, sprint-planned tactics.
Because of this, the short-term thinking of the team-management part
(daily stand-ups, a backlog of stories with no technical aspects,
tasks each treated like an item on a grocery list that doesn’t
require the buyer to know what’s being cooked, etc.) allows a team to
accomplish every task set out for and agreed to by it and still end
up with a giant, unmaintainable mess. Such a team will end up going
slower and slower.
When I’ve participated in such teams, I’ve been frustrated. I now
understand that personal frustration: without a long-term architecture
and design, all the process seems pointless, and pointless process
then seems to get in the way of coming up with an appropriate design
that makes delivering software possible, which is the whole reason
for having a process. Tactics without strategic understanding are
really hard, at least for me. Such a milieu tells me over and over
again to
be and think like a contractor. Like a grunt. To have no real stake in
the project at all. To do things that are not only counter to any
recognizable strategic goals, but prevent future strategic
opportunities. That’s what I find frustrating: a process that
prevents me from doing good work.
So, the task for the Agile folks out there is to figure out a way to
raise implementation strategies to the same level as user
stories. Don’t mix the two. “Technical” stories aren’t
strategy. Instead, find a parallel mode for such things and emphasize
the hell out of it. Please. For all our sakes.
When you want to put something in front of a customer or stakeholder,
you make a mock-up or a functioning prototype. The prototype teaches
you, the engineer, about a lot of things you’ll need to consider, both
hidden, in terms of the engineering challenges and visible, in terms
of the customers’ expectations and the users’ experience of the “flow”
of the application.
No duh, right?
The rightness is that you’re working on something the world has never
seen before and you want it in front of the customer as early and as
cheaply as possible. Prototyping is the process of discovery, of
charting a route from where you are to where you want to be. At its
best, prototyping leads you to a new and better end-point you wouldn’t
have been able to think of without taking those first few steps. You
want your customer to end up there on the West Coast where you both
knew you were heading, rather than in South Florida, swatting
mosquitos, wondering how your best intentions led you so far astray.
What I want to suggest, though, is that prototyping shouldn’t stop
with what it’s possible to put in front of the customer. Any user
experience is just the tip of the iceberg, after all,
especially in the world of network based services and applications.
Before any new software project, we ought to prototype every part of
the process that’s going to go into building and maintaining the
project over the long haul. Projects aren’t just about the customer,
they’re about the team that needs to fix problems, support the
customer, build enhancements and new features, bring new employees up
to speed, inform the business of the impact of those changes and so on
and so forth.
Sure, the initial customer experience is great, but customers are
fickle, competitors don’t stand still, and the first version is never
the complete version, often by design. Change is inevitable and if you
don’t anticipate and embrace it, you’ll find yourself in an even worse
situation: unable to change.
So, why not prototype the whole production experience as well as the
user experience? Why not consider the engine getting you where you
want to go as worthy of as much attention as the destination itself?
You can’t have one without the other.
Let’s take the idea for a spin across a few of the hidden concerns
making up the production of software.
version control prototype: Why not prototype a workflow using
different kinds of version control systems? How are you going to
manage branches and
releases? How well does your choice integrate with build servers
(if you want them) or shell scripts or projects that want to use
the code as code (such as package deployers or automated
integration testers)? What about developer tools? You could use
SVN, for instance, which keeps a centralized server model, or you
could use BZR which tracks branches as separate directories on a
file system, or GIT, which is very well supported, but has a
higher initial learning curve. What are the long term
implications? How easy is it to export change history from one
system to another? Do you want a distributed work flow with
patches flowing to a designated “owner” of a module, or do you
want a central source-of-truth everyone pushes to and pulls from?
Do these tools work well on the kinds of platforms the developers
are likely to want to use for hacking? It’s always easy to go with
the dominant choice of the day (CVS back when, GIT or SVN now), or
with whatever the corporate mandate is. With a prototyped
workflow, you’ll be informed enough to make a compelling case if
you feel you need to.
source code organization prototype: Personally, I think it’s worth
prototyping different ways of organizing source code, especially if
your project is going to be large enough to require a sizable
team. Do you want to use a
library model in which sub-components reside in separate code
repositories to be brought together via a dependency management
system? Maybe a distributed architecture is the way to go such
that each “concern” (or service) of the application is at the end
of a network pipe. In that kind of system, you can get away with
no shared code at all. (If you’re tempted to share code, it means
you’re sharing a concern, which means, well, refactor until the
need goes away.) Or might it make sense to bundle all the code
into one big repository organized by subdirectory? Easier to
change multiple concerns if (say) the data format shared between
them changes, but a lot of files are a lot of files, no matter how
you organize them. Try these things out. Distributed source
code (one project, one service) might be frustrating at the
beginning, but it might pay off big time as the project matures
and more and more people get involved. A new person can master a
small code base much more quickly and confidently (and thus feel
good about being on the team) than mastering a small subset of a
huge set of files, no matter how well organized. Doubt it? Try out
a prototype. Prototype something that’s the exact opposite of what
you think best and see if your misgivings are confirmed, or if
(you’ll hate to admit), the paradigm, tried in anger, is
compelling. Think of other developers as your customers. What’s
going to be easiest for them? Small, single purpose code bases, or
a large, organized, comprehensive code base? Which would you
prefer on the crankiest day of your life?
build system prototype: Are you going to use a simple build system
that only knows how to actually build the application, or do you
want one that can package it for deployment, run integration
tests, generate merge, test, contributor and dependency reports?
What are the strengths and weaknesses of the approach? Do you want
the build system to be dependent on multiple source code repos? Or
do you want component “integration” at some other stage, such as
deployment? Maybe try using a simple build, then using other
prototypes to see how easy (or problematic) it is to leverage that
system from afar.
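As a toy illustration of a build system that only knows how to build: in Python terms, the entire build step might be byte-compilation and nothing else. (This is a sketch; the throwaway source tree is invented, and packaging, testing, and reporting would live in separate prototypes.)

```python
import compileall
import pathlib
import tempfile

def minimal_build(src_dir):
    """The whole build: byte-compile the sources. Nothing else."""
    return bool(compileall.compile_dir(src_dir, quiet=1))

# Try it against a throwaway source tree.
with tempfile.TemporaryDirectory() as d:
    pathlib.Path(d, "hello.py").write_text("print('hello world')\n")
    print("build ok" if minimal_build(d) else "build failed")  # prints: build ok
```

If that deliberately dumb build turns out to be enough, you’ve learned something cheap: every other “build-time” task can be layered on from outside.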
deployment prototype: Given a “hello world” of some sort,
prototype a way to deploy it. Is it an application running in an
app server? Does it run as a process hosted on a Unix OS? Windows?
Both? Maybe your prototype knows how to pull down the source code,
invoke the simplified build script, then generate an installable
package for your target OS. If you deploy to multiple platforms,
do you make a single project that can generate all the installers
or do you have separate, small platform-specific deployment
projects? Do you want to use this system as a precondition of your
testing apparatus? Is it easier for developers to come up to speed
on a monolithic “create an installer” project, or a sequence of
small, platform-specific ones?
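A deployment prototype for a “hello world” can be just a few lines. This sketch uses Python’s stdlib zipapp as a stand-in for a real installer-building step (the app name and its one-line contents are made up):

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp

def build_and_run():
    """Package a hello-world app into a single artifact, then run the artifact."""
    with tempfile.TemporaryDirectory() as d:
        src = pathlib.Path(d, "app")
        src.mkdir()
        (src / "__main__.py").write_text("print('hello from the deployed app')\n")
        target = pathlib.Path(d, "app.pyz")
        zipapp.create_archive(src, target)  # the "installable package"
        result = subprocess.run([sys.executable, str(target)],
                                capture_output=True, text=True)
        return result.stdout.strip()

print(build_and_run())  # prints: hello from the deployed app
```

The point isn’t zipapp; it’s proving end to end, before the real code exists, that “pull source, build, package, run” can be one automated motion.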
proxy server vs app-server prototype: Suppose you’re
working on a system made up of separate but related components
making up a web-based service of some sort. One component might be
the UI, another a directory-server, yet another a background
data-import service. One way to put these together is to deploy
them to a single application server. Another way is to stand each
service up as an OS process and use a proxy server to route
requests. In effect, you’re turning a normal OS into a kind of app
server. You’ve decided you like this approach because it enables
you to spread the load over multiple host machines if you find you
need to scale in certain directions. No one believes you? Push
back from folks? Write a prototype! Stand up a few hello-world
apps and put them behind a proxy server and show how they appear
to be different aspects of the same app. What are the UI
implications? Make a single installer that can deploy the proxy
server configuration and the individual services all via a one
button click or shell command. Prototype the use of OS tools to
track the performance of the various services and the stats on the
proxy server itself. Make one of the services blow up and tank due
to an out-of-memory condition and show how the rest of the
services keep going. Show how you can replace one of the
components and introduce a new one without having to change the
existing components. Add a whole new host and integrate it via a
simple proxy-server update (automated, of course). Prototype this
with an emphasis on ease of maintenance. You might decide you
don’t like it, but you’ll make that decision based on actual
experience. Try and take the long view. What options does this
give in the far future? What options are precluded by NOT doing
it? It only took you a couple of days to get the whole thing
working. Think of the other developers on your team. Is this kind
of thing too complicated for them? A nice separation of concerns?
Lets them work on services without worrying about their impact on
peers or various “security” concerns at the proxy layer? Easy to
fold into an “integration” stack?
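The routing half of that prototype can be a handful of lines of proxy configuration. Here is an nginx-style sketch (service names and ports are invented for illustration) that makes three separate OS processes look like one app:

```nginx
# one host, three OS processes, a single public face
server {
    listen 80;

    location /ui/ {
        proxy_pass http://127.0.0.1:8001/;   # the UI service
    }
    location /directory/ {
        proxy_pass http://127.0.0.1:8002/;   # the directory server
    }
    location /import/ {
        proxy_pass http://127.0.0.1:8003/;   # the background data importer
    }
}
```

Kill the process on port 8003 and the other two locations keep answering; adding a fourth service is adding a fourth `location` block.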
black box testing prototype: What might it be like for a
QA team to test your application? Can you get a project going
independent of your hello-world prototype’s source code, yet can
test it when it’s hooked up to the real resources it’s going to
need? Better to find out now. You can get away with manual testing
for quite a while but why not prototype an automated testing
system and see what it would look like before you get too far down
the implementation road? Unit tests are one thing, but external,
black-box testing is what you need for multi-process systems. If
you have a database, you have more than one process. At the very
least, you’ll keep that black-box testing prototype in mind when
designing the real product and that can’t be anything but good.
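In miniature, a black-box test treats the running system as an opaque process: start it, poke it over the network, assert on what comes back. This sketch uses Python’s built-in http.server as a stand-in for a real deployed service:

```python
import http.server
import threading
import urllib.request

def blackbox_check():
    """Start a throwaway HTTP service and probe it purely from the outside."""
    server = http.server.HTTPServer(("127.0.0.1", 0),
                                    http.server.SimpleHTTPRequestHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # The test knows nothing about internals, only the network interface.
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/", timeout=5) as resp:
            return resp.status
    finally:
        server.shutdown()

print(blackbox_check())  # prints 200 if the service answers
```

Swap the stand-in for your real deployment artifact and the test doesn’t change: that independence is the whole point.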
system updates / change over time / operations: You’ve
prototyped your code management organization, your deployment
system, your black-box testing system. The real question is how
this helps you introduce change over time. How do you deploy new
code over the top of old code? How do you migrate data formats
from one version to the next? If you want to update one thing do
you have to update everything? How hard is it to introduce a
spelling fix? Can an operations team understand the risks of every
move you make? If you move from MySQL to MongoDB, how much of an
impact is that on your entire stack? How hard is that to test? How
hard is it to back out if it all goes wrong? If you prototype the
full “production line” (so to speak) you can answer a lot
of these questions while you still have time to make
adjustments. Technical debt consists of the problems you require a
“business decision” about “priorities” to fix rather than problems
you can just get done in the course of producing new features and
enhancements. If you build a system that allows change over time,
that minimizes the effects of any change on the whole system (from
source code all the way up to customer experience), then you’ll
have produced an engine that’ll enable an even better customer
experience than you imagined at the start because you’ll be able
to act and act quickly. Sure, there will be surprises, but you
want those surprises to be true surprises, not “we’ll figure
that out later” kinds of surprises.
These prototypes don’t have to be perfect implementations, they just
have to teach you enough to know if a given approach is going to lead
to road-blocks or unintended complexity in the future. For instance,
if you know that a simple build system that only builds (doesn’t
make deployable packages, or do integration testing, etc) is going to
work for you, you can always use a different (but equally simple)
build system in the future, or fix up your sloppy build file, or even
be in a position to use an entirely different technology stack. But the
“idea” of the build system (that it’s simple and that other
traditional build-time tasks are handled elsewhere) is what you’ve
really worked out. The internal details can ebb and flow.
I’m really talking about interfaces here, I think, from the source
code on up. All these prototypes lead to well-defined interfaces. And
interfaces define the system, not their implementations.
Extending the idea of prototyping to the production of software as
well as to the software itself is really just prototyping the
customer experience for a whole other set of customers: the
engineers who are
going to have to work on the project for (hopefully) years to come,
including yourself. Sure, the bottom line is always going to be the
customer experience and the actual product. The application needs to
do what you say it’s going to do and what the stakeholder needs for
it to do. Who can argue with that? But you want to be able to grow and
change the product and that’s going to be a lot easier to do when your
tools and processes are as thoughtfully designed as the end result.
Sorry for the cliché, but it’s so apt. Examples abound:
Twitter, Facebook, Google Search, Amazon, Apple’s iTunes and App
Stores, but even small e-commerce sites for mom-n-pop businesses. ↩
Library science and taxonomy standards exist for a reason, I
imagine. This stuff is hard. The part of you that files things into
categories doesn’t seem to be the same part of you that looks for
things. Once you get past a certain point, all you can really do is
search à la Google. ↩
I don’t ever want to suggest that software engineering is
anything like manufacturing. I just don’t believe it and a lot of the
frustration I and others have experienced is due to the imposition of
such methods on what’s more akin to movie making than gadget
manufacturing. ↩
This is the story of how I became interested in basing all distributed
systems on asynchronous messaging.
Some time ago I worked on a project to bootstrap a video conferencing
system. The hardware for the system was largely done, but the software
wasn’t much more than a few Perl scripts the hardware folks threw
together to show off their gear.
After some analysis, a colleague and I determined that the work to be
done broke down into two concerns:
A user interface allowing video conference room participants to
turn on a document camera, connect to a room in another city, view
a catalog of available rooms. (I also added an administrative user
interface.)
A library for interacting with the rooms themselves. Each room had
several machines with network interfaces. To connect a room in one
city to a room in another city, you had to send signals to both
rooms telling them to reach out and touch each other.
A colleague and I split up the work. I took the “UI” portion (which
involved a lot of data modeling and usability guesses). He took the
library concern. He loved the problem of brokering socket connections
to RS-232 bridges. Me? Not so much.
We decided to write the whole thing in Python rather than the de
rigueur Java or .NET. He’d work on his code separately, I’d work on
mine, and then we’d hook them up to get this sort of functionality:
When a user pressed a remote-city’s button on the in-room GUI, I’d
record that connection in the database, then call my colleague’s
library to make the hardware perform the actual sound / video
connection.
When a user pressed a document camera button, I’d record that it
was on (so that UIs would reflect this on both sides), then call
his library to do the work.
My colleague’s library was pretty cool. He had configuration files for
each room such that we could connect them appropriately. We talked a
bit about the interface, how to import one side into the other, etc.
So, what we developed and tested separately was:
An admin UI and video conference room UI written in Python, backed
by a Postgres database.
A system interface library, also written in Python, with public
methods for connecting and disconnecting rooms, turning on lights
and cameras, and so on.
The thing is, we used threads, a felix culpa if there ever was one.
The web server used threads, of course, to handle simultaneous HTTP
requests.
The system interface library used threads to manage communication
channels to all the devices at the core of each room participating in
a conference.
When we merged the code together (the web app pulling in the system
interface library), nothing worked. I suspect this was the result of
Python’s Global Interpreter Lock.
To tell you the truth, I can’t remember the exact details of the
problem because the solution was so much more entertaining and (for me
at least) paradigm shifting.
I’d been interested for a long time in the idea of applications
“chatting” to each other in the same way that bots on IRC or Jabber
chatted with users, or each other, or, well, in the way that viruses
installed IRC bots on individual machines and had them communicate
back to an IRC room to await nefarious instructions. I just liked the
whole idea of “chatty” apps, how easy they were to write, and how a
single person could write a bot, yet participate in a rich,
multi-user environment.
So, to solve our concurrency problem, I proposed we use a message bus
so that our two layers could communicate without having to run in the
same process space.
My colleague was all for this because he wanted to have complete
freedom to change his code in the same way I did. As long as we agreed
on a simple, high-level message format, we could have complete freedom
on either side of it.
This kind of thing makes for very happy developers. You know it’s
working when that happens.
Why not have both sides implement a REST style web service and just
have calls back and forth?
REST was very new at that time and not quite yet the stultifying
silver-bullet it has become today.
REST interfaces are good for leaf nodes, but for inter-system
communication, a message bus in which publishers don’t care who
consumes, and consumers don’t care who produces, seems far, far
more flexible and far less prone to technical debt.
In a distributed world, managing the configuration and tracking of
all the end points across multiple possible deployment scenarios
wasn’t really the best use of our time.
A messaging system facilitates the “semantics” of event-driven
software, which seemed just right for a video conferencing
system. Sure, you can do that with Web Services, but it’s not a
natural fit and is thus easy to break discipline and incur technical
debt.
Fire and Forget 1: I wanted my UI to tell the system interface
library to “connect room-a room-b” and then assume that it
happened. I don’t need a response back, because the connection
would be obvious to the users of the room.
Fire and Forget 2: If the system interface lib needed to, it could
send me a message back, something like, “disconnect room-a room-b”
and I’d be able to adjust the state of the conferencing system
accordingly. I could even make a request for “give me everything”
and then adjust my state accordingly.
I wrote a message queue server in about 150 lines of Python and the
Twisted Framework, implementing a line-oriented protocol on a TCP
socket. The protocol was so simple you could log in via Telnet and
test things out.
and so on. Truthfully, I don’t remember much of the protocol anymore,
but it was about as simple as the above, or maybe simpler. (If I can
make a message protocol simpler, I will.) Any connection joining a
topic would receive all lines sent to that topic by other clients.
Sending messages was as simple as writing a line to a socket:
send syslib connect room1 room2
If you’re thinking this is just like IRC, then you get it.
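The original server is long gone, but the shape of the thing is easy to sketch. Here is a minimal stdlib asyncio version of such a topic-based, line-oriented bus; the original used Twisted, and the join/send verbs here are my reconstruction, not the real protocol:

```python
import asyncio

subscriptions = {}  # topic name -> set of StreamWriters subscribed to it

async def handle(reader, writer):
    """Serve one client speaking a line-oriented join/send protocol."""
    joined = set()
    while line := await reader.readline():
        parts = line.decode().split()
        if parts[:1] == ["join"] and len(parts) == 2:
            subscriptions.setdefault(parts[1], set()).add(writer)
            joined.add(parts[1])
        elif parts[:1] == ["send"] and len(parts) >= 3:
            topic, payload = parts[1], " ".join(parts[2:])
            # Fan the line out to every other live subscriber of the topic.
            for w in list(subscriptions.get(topic, ())):
                if w is not writer and not w.is_closing():
                    w.write(f"{topic} {payload}\n".encode())
                    await w.drain()
    for t in joined:
        subscriptions[t].discard(writer)
    writer.close()

async def demo():
    subscriptions.clear()
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    # The UI process joins the topic the system library publishes on...
    ui_reader, ui_writer = await asyncio.open_connection("127.0.0.1", port)
    ui_writer.write(b"join syslib\n")
    await ui_writer.drain()
    await asyncio.sleep(0.1)  # let the join land before publishing
    # ...and the system interface fires and forgets a status line.
    _, sys_writer = await asyncio.open_connection("127.0.0.1", port)
    sys_writer.write(b"send syslib disconnect room-a room-b\n")
    await sys_writer.drain()
    line = await asyncio.wait_for(ui_reader.readline(), timeout=2)
    ui_writer.close()
    sys_writer.close()
    server.close()
    return line.decode().strip()

print(asyncio.run(demo()))  # prints: syslib disconnect room-a room-b
```

Publishers don’t know who is listening, and subscribers don’t know who sent what: that decoupling is what let the two of us work in separate process spaces.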
This worked really well for lots of reasons:
My colleague used the message bus to “backup” his data sets by
sending messages to a separate listening process, making up his
own protocol messages as needed.
We didn’t have to deploy the UI concerns on the same host as the
system interface library, each of which had (potentially)
different performance needs.
Not sharing code made each of our code bases much simpler and
cleaner.
We could fire up telnet and verify that my code was actually
sending the messages it should.
We could fire up telnet and send messages to the UI to make sure
it did what we thought it would do.
When running in production, we could log every message and have a
pretty good idea of what went on.
The code was small, easy to understand, and the line-oriented
protocol resisted the propensity for developers to over-engineer
messages. The lack of persistent queues reinforced the notion
that message busses are for moving data, not storing it.
Text-based messages were easy to inspect and required nothing more
than basic Unix utils.
But the main thing was that we could work together without any
personal or technical conflicts and with a sense of efficacy. The
architecture itself reflected the nature of how we work as human
beings, especially experienced, opinionated human beings.
This asynchronous message-based architecture worked so well I started
to wonder where else the pattern could be used.
For instance, what about shared databases? These are always a problem.
If the developer of one application needs to make a change to the
database schema, he’ll have to negotiate with all the other developers
of the other applications using that same database. If those
developers report to different organizations, things get even more
complicated and slow — and political.
A good way to deal with this is to put some sort of service interface
in front of the database. A REST web service, for instance.
But what if you used some form of messaging to transmit copies of
data from one system, or “concern,” to another? What if every
change to a database was broadcast such that any interested parties
could slurp up the change, ignore what it didn’t care about, and write
down the rest in a way that made sense for that particular concern?
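In miniature, the idea looks like this. The event shape and the two consumers (billing, search) are invented for illustration; in real life the broadcast would ride a message bus rather than an in-memory list:

```python
# Each "concern" subscribes to a broadcast stream of database changes and
# writes down only what it cares about, in its own shape.

subscribers = []

def publish(event):
    """Broadcast one change event to every interested party."""
    for handler in subscribers:
        handler(event)

# The billing concern only cares about order totals.
billing_view = {}
def billing(event):
    if event["table"] == "orders":
        billing_view[event["id"]] = event["row"]["total"]
subscribers.append(billing)

# The search concern only cares about product names.
search_index = set()
def search(event):
    if event["table"] == "orders":
        search_index.add(event["row"]["product"])
subscribers.append(search)

publish({"table": "orders", "id": 1, "row": {"product": "widget", "total": 9.99}})
print(billing_view)   # prints: {1: 9.99}
print(search_index)   # prints: {'widget'}
```

Each concern ends up with a “copy,” but a copy shaped for its own needs, and a schema change in one concern no longer ripples through all the others.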
To typical software developers, the idea of having “copies” of
something smells a lot like cutting and pasting. The thing is, the
needs of big distributed systems are not the same as those of a small
application in which you implement algorithms once and then
re-use them. Right? And we’re talking about data replication, not
code duplication.
Anyway, all of this led me to investigate JMS in the Java
ecosystem, but once I got into Erlang (in which there is no
shared state or mutable data), I pretty much moved into the mode in
which I could no longer be satisfied with typical server-side
Corporate Development and the Best Practices it tends to s[t]olidify.
That’s when I became the guy for whom bozo bits are flipped.
This was in the days before async socket based web servers
were all the hype. I suspect it wouldn’t have made much difference. ↩
I’ve gotten so much value from conceiving of a system of
applications as agents communicating via asynchronous messaging that I
think the burden of proof is on the designer who wants to do
something else. By which I mean, it’s difficult for me to recover the
persuasive arguments in the face of such self-evident goodness. ↩
Since trumped by the various AMQP-like implementations. ↩
I was sick yesterday. Too sick to stay at work. Too sick to do
anything other than sleep or occasionally think about listening to a
podcast. I’ll spare you the symptoms except to say they involved a
headache and the knowledge that eating something was probably not a
good idea.
What’s interesting to me, other than the utter relief of not being
sick now, is how being that kind of sick changes my senses.
I went to work, which was fine. I’d hoped that some coffee, and
something sweet to eat would fix me up. No go. Had a meeting. Typed an
email. After some quality time in the surprisingly cramped corporate
comfort station, I decided it was time to go home.
But the bathroom is where it started. Each station has a can of
something-or-other you can spray if you feel the need to be a little
kinder to those who come after you. Someone must have been kind,
because the vaguely floral scent was overpowering. It’s as if, when my
cognitive functions are impaired, my senses fill in the gap.
When I made my way on to the bus to head home, someone sat behind me
eating a Burger King sandwich. I could smell the wrapping paper. A
distinctive smell. Perhaps vegetable oil? The grease for whatever kind
of meat used in those sorts of sandwiches?
Thing is, I’m not all that sure she was eating something from Burger
King. I saw her sipping a soft drink before she boarded and heard her
shifting around a paper bag for what seemed thirty minutes and then I
caught the scent. Strongly. I might have imagined it, but I don’t
normally imagine smells. I probably did imagine it. I’m almost sure of
it. Mild headache, nausea: and boom: I can create a world of scents.
Speaking of that thirty minutes: time distorts tremendously when I’m
sick. I must have listened to her shift around that paper bag for 30
seconds but it seemed to me that the bag was three miles long. She had
to crawl carefully through, pressing out the sides with dry hands to
widen the passage until she finally found what she was looking for. An
epic struggle, that paper bag.
Podcasts, though, are the opposite. While they take forever to run
their course, the hosts talk in a bizarre form of English in which a
sentence has twice as many syllables as it needs.
At home when I couldn’t sleep, I could hear, through the general
white-noise of everyone’s climate control systems, an occasional bird
cry, wings fluttering against a window, a child calling from one room
to another in the house across the way, dogs grunting as they chased
each other across a muddy yard, the porch smokers mumbling around
their cigarettes, the occasional horn on the Fremont Bridge.
The world was alive and I was sick, present in a way it never really
is when I’m well.
You’re a software developer, an individual contributor at a company
whose business is the selling of software and related services.
You have an idea about something neat that you think meets a business
need. You write it up, explore it a bit, maybe write a prototype, then
show your manager. He likes it and shows the director. The director
likes it and shows the vice president. The vice president says no, and
that’s the end of that.
The idea doesn’t really meet a business need, isn’t aligned with the
direction of the business, or, perhaps, he just doesn’t get it or
doesn’t buy the underlying premise and says no out of ignorance.
Regardless, he comes up with reasons for saying no and tells the
director, who adds a few more explanations of his own when he talks to
the manager, who expands on the context of the negative result to you,
the individual contributor.
A month later, you think up another idea. This one seems different
enough that the previous objections don’t really apply. So you suggest
it to your manager, who suggests it to the director. This time, the
director says no. He knows how the Veep thinks and it just isn’t worth
his (and, he assumes, the Veep’s) time to move the idea forward.
Another month another idea. Your manager asks you lots of tough
questions which add up to no.
Eventually, things work like this: You have an idea, you sketch it out
on paper. You think it meets a business need, or reduces costs, or
might be the start of a new revenue stream: but these sorts of
“business” concepts aren’t your main thing, you realize, so you’re
probably wrong about it anyway. You shrug, set the idea aside, let it
go.
the self as will and representation
What you’ve done is internalize a representation of the vice
president, the director and your manager, you’ve consulted these
chimeras, heard their likely answers (always no), and acted
accordingly.
You’ve internalized a power structure such that you no longer have to
engage with it directly. Instead, you just know what will
happen and can thus avoid the pain of possible, perhaps likely,
rejection.
But the pain is still there. You have the idea, you run it through
the internalized script of the above scenario, you experience the
rejection all the same, even if more quickly and less personally than
before.
It’s not an especially complicated idea, that we tend to internalize
the authoritarian systems in which we find ourselves such that we
apply the dictates of that authority even when the “boss” is not
around.
I’m not talking about basic morality — the capacity we all have to
see ourselves in others, and others in ourselves, and act accordingly
— but about the sorts of power structures we find ourselves in as an
often unexamined part of our culture, both at work and at large.
Sure, we’re all independent thinkers and masters of our own destiny
— or so we’ve been told enough times to internalize — but how many
of us, when on foot, wait at a street corner for the light to turn
green even though there are no cars coming? Walking across an empty
street, or not, is neither good nor evil; it’s not something that does
harm to oneself or another. Though it makes sense to adopt conventions
about how we cross roads, most of us don’t cross against a red light
because we’ve internalized what’s basically an arbitrary (if
reasonable) rule. Even though there’s no one there to ticket us, there
might be, and, well, it’s just less complicated not to have to do a
risk assessment every time we step into the street.
There are lots of wonderful metaphors for this notion of internalized
authority.
There’s Bentham’s Panopticon prison in which all the prisoners are
exposed to the guards’ viewing station, but can’t tell if the guards
are watching. All they can do is assume an omnipresence and act
accordingly. Imagine cameras, and you’ve got Britain’s CCTV.
The idea is all over Orwell’s 1984 and is de rigueur for any sort
of depiction of the modern totalitarian regime.
Michel Foucault turned the panopticon into a famous metaphor for
modernity revealing the nature of social behavior in the context of
often unrealized cultural power structures from the classroom on up.
I imagine Freud’s Civilization and Its Discontents gropes toward the
same idea when he suggests psychological discontent is the tension
between what we might normally do versus what we give up in order to
retain the benefits of civilization. This internalized authority is
like the buzzing of a computer fan in the back room, or the white
noise of an air conditioner: you only notice how irritating it is when
it stops.
I wonder if Kafka’s The Trial is part of this pattern. Josef K
spends the whole novel trying to figure out what he did wrong — he
never does — until the guilt becomes so internalized that he not
only accepts his fate but believes in its rightness.
the workplace as will and representation
Let’s get back to you, the individual contributor with lots of ideas.
In your mind, the vice president has rejected a hundred of your
ideas. In actuality, he’s rejected ten. Intentionally or not, he’s
created a situation in which it’s clear to you that the answer will
always be no, and, again, intentionally or not, he has communicated
(non-verbally) that he considers these ideas floating up from the
staff to be a bother and, well, he does have that hire-and-fire
authority.
As an “individual contributor” (aka, grunt), especially in the
software profession, you have a tenuous relationship with the Powers
That Be, otherwise known to you and your colleagues as “The Business”.
You want to write neat-o software, but you internalize the notion that
neat-o is a waste of time if the software isn’t right for “The
Business” or doesn’t solve a “Business Need”. You and your colleagues
encourage “agile” methodologies that don’t say anything about how to
actually write high quality software but instead help you “align”
with “The Business” and manage time-and-effort expectations against
“business needs” as well as deliverable goals or “stories” about
customer value.
Snarky scare-quotes aside, I think this is largely all to the
good. Keep doing it. Seriously. Good software is good because it meets
a need. If you’re writing it in a business context, then that’s the
need you have to address.
But the scare-quotes aren’t just snarky, they illustrate how the
internalized-authority exists within and is reinforced by your
colleagues as well. As an individual contributor, you’ve no doubt had
these phrases quoted to you whenever you make a suggestion that varies
from the status quo. Your colleagues will say these sorts of things to
you before they even ask questions about your idea to see if they
understand what you’re up to. They’re reinforcing the internalized
power structure, so to speak, when they’re not just brushing you off,
having been down the idea road before you.
They’ve internalized what you have, that the Powers That Be really
aren’t interested, don’t want to be bothered, and have a much better
idea about what’s going on with the business. All of which may be
true. Or not.
But here’s the thing: as a mere grunt, or, uh, individual contributor,
how can you tell if any one of your ideas is contributing to a
business need, or not? You’re hired to produce software and part of
producing software — a creative pursuit regardless of the
engineering scaffold — is to produce ideas. You’re not hired to make
business decisions. Someone else is. By silencing your ideas, you’re
preventing the opportunity for business leaders to make those
decisions.
how to tilt
Don’t assume that all the rejections and the quoting of “business
need” bromides at you in a kind of brush-off way mean that you’re
clueless about the business needs, or customer needs, or don’t come up
with good ideas. Don’t, like Josef K, assume you’re too ignorant to
have good ideas, and that the ideas you do have must be, by
definition, bad, and that their rejection is justified. Don’t thrust the
knife into your own belly. That’s what the veep is for.
The vice president is, ultimately, the one who makes the decisions, so
you should give him decisions to make. He should have to make those
decisions and live with the consequences, good or bad. If he tells you
to never present another idea, then that’s that, but let him tell
you that, don’t do that on his behalf, or let the intermediaries and
mentors tell you that. Get it out in the open, if you can.
Your job is to get better at presenting your ideas in terms the vice
president is likely to understand. You don’t want to fake him out or
hard-sell him (he’s too smart for that anyway), but you want to make
sure that if he rejects the idea, it’s because he fully understands it
but has decided it’s not where he wants to place his bets. You want
your real idea to be decided, not a misunderstanding.
So, keep the ideas flowing for as long as you can and learn from how
well they fare with the Powers That Be.
If you can bypass the internalized culture of rejection shared by your
colleagues, your manager, his director, and so on, so much the
better. Send email. Schedule a meeting if the chain of command isn’t
working for you. At best, you get that meeting and you’ll get better
at presentation. At worst, you won’t get the meeting because the real
veep declined it, not the imaginary veep you’ve internalized.
the re-presentation itself
If constant rejection of your ideas is too emotionally taxing (and how
could it not be?), invest your sense of success in the presentation
itself. If you’re sure that the stakeholder fully understands the
business implications of the idea (not the techno-babble
implementation details), then you’ve succeeded, whether he moves it
forward or not.
The successful presentation of an idea is when your audience
understands it, not when they accept it.
If you keep focused on that, you’ll find the energy to try again and
again, because the attempt itself becomes the success.
I like to use lots of programming editors and never end up sticking
with any particular one for long. All of them are interesting in one
way or another and my curiosity gets the better of me.
Thing is, people who do stick with a given editor get really good at
them, doing all kinds of automated things that I do by hand, playing
their editors like accomplished musicians. Take Emacs Rocks for
example. Geek performance art.
When I have a choice, I write software in Clojure. Emacs coupled
with Slime/Swank or nREPL/nrepl just can’t be
beat. (There seems to be a lot of activity in the Vim universe along
these lines. I might give that a tourist try someday.)
Part of adopting a programming editor is to learn about its
extensions, and Emacs is extensible by its very definition (Editor
MacroS). So far, here’s what I’ve found useful:
This does exactly what it says it does. When you start typing a
symbol, Emacs suggests how to complete it. Press return, Emacs does
the rest. Sometimes this is quite annoying because it breaks your
flow, especially if you slow down enough that the pop up breaks in,
and then (at least I) spend too much time trying to make the widget
go away. On the whole, though, it’s pretty neat, and it’s certainly no
worse than your average IDE. As should be obvious, the completions are
based on pattern matching, not a semantic understanding of your
code.
One of the downsides of using Emacs (and, I imagine, Vim) on the Mac
is that it doesn’t participate in the generic spell-checking that
every Cocoa app gets for free. Enter flyspell. You can check spelling
“on the fly” as you type or you can run it once to find all the
misspellings in your buffer. Downside: it’s just not as automated or
fast as the native OSX stuff.
Emacs seems really focused on the one-visible-file-at-a-time sort of
editing. You can have multiple panes, sure, but tabbed panes (like in
web browsers) are (ahem) painful to manage if you have a lot of them
and the speed-bar (and other file listing options) are also
sub-standard (even if kind of neat in their own way). Enter
Interactive Do. This feature allows you to “fuzzy” search for files in
the mini-buffer, a kind of command-line completion sort of thing. It’s
smart enough to remember files you’ve edited before. If I’m deep into
a directory structure for some project or other, I can still C-x C-f to
the mini-buffer, type init.el, wait a moment, and Ido figures out
where init.el is and lets me load it up. The end result is that I
start to see this as a solution for the problem of giant trees of
files in a side bar, or way-too-many tabs across the top of a
window. Just punt and use search instead. (I also use ido-ubiquitous
with this stuff.)
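Ido’s “fuzzy” matching is, at heart, a subsequence test: the characters you type must appear, in order, somewhere in the candidate name. A toy Python sketch of that idea (not Ido’s actual implementation; the file names are made up):

```python
def fuzzy_match(pattern, candidate):
    # True if pattern's characters appear, in order, within candidate.
    chars = iter(candidate.lower())
    return all(ch in chars for ch in pattern.lower())

# Hypothetical file history; Ido matches your keystrokes against names like these.
files = ["~/.emacs.d/init.el", "~/projects/snv/build.xml", "~/notes/ideas.txt"]
matches = [f for f in files if fuzzy_match("initel", f)]
```

Typing `initel` narrows the list to `init.el` even though you never typed the dot or the path, which is why search beats navigating a file tree.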
Git. Who’s not using it these days? I’m one of those types who likes
to keep text editing inside text editors, and command-line stuff at
the terminal. But this mode is pretty nice. All I ever really do with
git, or other VCSs, is push, pull, commit, browse logs. Sometimes I
look at diffs. This app-within-emacs works really well for that sort
of thing. Here’s a nice little tutorial in two parts: part 1 and part 2. Could use more colorful diff output.
IntelliJ and Eclipse have these modes where when you highlight a
symbol in an editor, the IDE selects all similar words. When you
change the first word, all the rest of them are changed, too. A bit
safer than “search and replace”. This minor mode lets you do that as
well. Worth checking out!
The package extension (default in Emacs 24+) allows you to install all
of the rest of these extensions automatically. It’s about time! Very
nicely done. Downsides: There doesn’t seem to be a nice automated way
for Emacs to tell you that there are new versions of the extensions
available. At least not that I’ve noticed. If you use this, you should
add in the Marmalade repo.
A specialized mode I use for Clojure, mainly. Seems to work reasonably
well. Clojure is a lisp, which means it’s just symbols and parentheses (and brackets and
braces). Paredit keeps track of all that plus has a lot of handy stuff
for moving Lisp expressions around. I’ve barely scratched the surface,
frankly, and I still find it terribly useful.
Fuzzy-matching for Emacs commands. When you type “M-x” you’re put into
the mini-buffer where you can enter a command (such as
“replace-string”) or countless others. With smex enabled, you can type
in characters that are somehow part of a given command and be
presented with a list of matches. This is worth the price of
admission. For some reason, I find this much easier than
TextMate’s or Sublime Text 2’s versions of the same
thing. Not sure why.
Discovered this while working on this very overlong blog post. I
installed it via the Emacs package management system, made sure it was
enabled for all modes. Now I can highlight a word or phrase, type a
brace or paren or quote and have the selection wrapped in that
character.
A lot of the above brings parity between Emacs and other editors, such
as the venerable (alas) TextMate. Some things I’m never likely to use,
such as snippets.
Sometime in early 2004 an awful project I was on got cancelled. It was
awful for lots of reasons, none of them especially technical. I was
almost the only developer on the project. Normally, this is great for
me. I get to write a lot of code and see the whole thing out the door
from top to bottom. But not in this case. One of the main issues I had
to contend with was constantly shifting requirements, due, in large
part, to the fact that the project involved many stakeholders, each
one in a bid to see their particular vision as the right one.
To get the job done (a web application), I used a company customized
version of Struts to develop an application running on a cluster of
WebLogic servers. Struts is not the most agile of technologies,
obviously enough, but it’s not the worst, either. In fact, I got quite
good at getting things done with it.
Nevertheless, the constant stream of requirements changes, some of
them sprung on me after I’d finished the application, some of them due
to my own misinterpretation of what was said in meetings, meant that I
could not keep up with the code. We brought on another developer, who
didn’t help much because he was completely unfamiliar with Struts, web
applications in general, and all the discussions that had happened.
(I won’t even bring up the nightmare that was integration via
SOAP when you don’t control both sides of the conversation.)
After the project was cancelled, I had a few months in which I didn’t
work on any project at all. In that time, I studied Lisp, Scheme,
Python and asynchronous messaging systems and key/value store
databases. When I finally got another project to work on, it was a
video conferencing project.
I and another developer used Python. I wrote all the front end bits
(controlling scheduling and the user interface), and he wrote the
backend bits (controlling hardware).
At first we tried to bind our code together, his as a library to my
application, but there were just too many issues, mostly involving the
whacky Python threading model (at that time), so we used a message
bus, inspired by IRC: separate processes communicating through a central
hub.
Our collaboration using this technology was extremely successful: we
implemented a massive amount of functionality in parallel in a very
short time using a surprisingly small amount of code. (It took the
subsequent team about a year to rebuild the functionality we had. They
ported the code to Java, using a distributed object model, rather than
messaging. And a much bigger team. They even had a full time build
engineer.)
Comparing the success of the new project, and the failure of the old
one, I came up with a few notes (around late 2004) I’d like to share
now. I recently found these notes, and what I noticed about them is
that I’ve been trying to work these ideas into my daily
practice ever since. I’m pleased to say that when I did manage to get
my team members to use these sorts of things, the projects were on
time, required fewer people, and were generally easy to change and
always adjusted well to constantly changing requirements.
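The IRC-inspired bus described above can be sketched in miniature as an in-process hub. (The real system ran as separate processes talking to a central hub over sockets; the topic name and message shape here are invented for illustration.)

```python
class MessageBus:
    """Toy central hub: publishers and subscribers share only topic
    names, never code, so front end and back end stay decoupled."""

    def __init__(self):
        self.topics = {}  # topic name -> list of subscriber callbacks

    def subscribe(self, topic, callback):
        self.topics.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Deliver to every subscriber; nobody knows who sent it.
        for callback in self.topics.get(topic, []):
            callback(message)

bus = MessageBus()
received = []

# "Backend" process: listens for scheduling commands.
bus.subscribe("conference.schedule", received.append)

# "Frontend" process: publishes a command without importing backend code.
bus.publish("conference.schedule", {"room": 42, "action": "start"})
```

The point is the seam: either side can be rewritten, restarted, or moved to another host, as long as the topics and message shapes stay stable.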
Anyway, here are the notes which were a sketch of a presentation I
gave at a Developer Days conference (sponsored by our group at that
company).
(Italics denote my current commentary on my ancient notes.)
As we all know, requirements help us figure out how to get done what
we need to get done and when. Here’s a list of frustrations you’re
likely to encounter even if things are going wonderfully well:
“That’s not what we meant.”
“That’s cool, but now, let’s change it.”
“We appreciate the work you’ve put in, but we’ve been discussing
this among ourselves. Let me tell you what we’ve concluded while
you were working on that.”
Partner 1: Basically, it all comes down to A.
Partner 2: Basically, it all comes down to B.
A completely contradicts B.
(Basically, all vision is Parallax vision.)
“Well, in a few months, we’re going to hire some developers.”
“Let’s use our differing assumptions about what the customers want
to justify our disagreements, given that we can’t talk to the
customers.”
with buzzword Y because:
We’ve heard everyone’s going that way,
We’ve heard it’s Best Practices,
The company is standardizing on that technology,
Your choice is not company approved."
I think I’ve heard each of those at status meetings and demos when I
thought the work was essentially done.
As a developer, you need to make engineering choices that help you
deal with the following problems:
No requirements, or not enough to justify one technology choice
or design over another.
Frequent “sea” changes, requiring (or that ought to require) a
rethink of the basic software architecture, or at least changes
from the database on up through to the UI.
One day your “model” (code organization, problem breakdown) is
what you need, the next day it seems like it was a bad choice:
a recipe for spaghetti because it’s too hard to start over.
The following was my assessment, in 2004, about what features a
solution to the above problem should have in order to have a chance at
success:
requirements doc, a slide, or a diagramming tool.
Discover requirements and be able to adjust to them.
Discover constraints, and be able to live within them.
Note: Most folks think that you need to have all the requirements
down and locked in stone at least for an initial implementation. I’ve
never experienced this, and given that projects live or die based on
how fast you can bootstrap them, I’ve come to believe that one gathers
requirements via implementing ideas. The goal, then, is to figure out
how to embrace that and stop doing the things that make it difficult.
If the basic precondition of much software methodology and technology
choices is a good set of requirements, then we need a different set of
methodologies and technologies when that precondition cannot be met.
What we need are tools, philosophies, and techniques encouraging:
Ability to get stuff done quickly based on little information,
guesses, proposals, etc.
Guessing wrong should be cheap.
Ability to change practically everything without too much cost.
Writing executable requirements.
keeping it simple
Python, Ruby, Lisp, Scheme, Groovy, PHP, Erlang
Avoid complex build systems (e.g., ant > maven, or projects
depending on other projects in an elaborate tree of dependencies).
Avoid complex data / business models (data pipeline and
transformation over elaborate relational state, if possible).
Ability to change things while the app is running.
No need to re-compile, re-deploy.
Extreme-programming techniques, as much as makes sense, but unit
tests if nothing else (esp. with late-bound, dynamic languages).
Super decoupled application architecture: talking separate
processes for layers, not just object interfaces. Write code to
network interface specs, not giant libraries which hide the
details. (Think HTTP/Rest vs SOAP.)
Prefer asynchronous to synchronous everywhere possible.
In-memory DB, then flat-file object persistence, then RDBMS. (An
RDBMS should always be your last choice, not your first. Do you
really, really need one for your app?).
Prefer computed HTML over templates.
If using PHP, embed everything in each page: no fancy MVC
framework. Too hard to debug, too hard to maintain, and really
difficult for future maintainers to unravel.
Generate pages in code rather than in HTML or template languages
of any sort. Keeps the backend very clean.
Fear frameworks: don’t trade the possible complexity of the
problem for the definite complexity of the framework.
Prefer straight SQL to object/relational mapping frameworks.
Recognize that frameworks often exist to overcome the shortcomings
of less dynamic languages.
Generated from code if possible.
The code, and what it does, is the documentation.
We can help best by documenting the problems we’re solving rather
than the implementation. Rather than an SQL E-R diagram (for
instance), a description of the type of information and the basic
domain entities is more important.
Always document for future re-implementors and re-writers, not for
users.
and so it goes…
A lot of what I wrote up back in 2004 still seems controversial among
a lot of people I work with. I find, though, that the controversy
usually breaks down to a single point: requirements.
When you work in an environment where you start with vague ideas and
write code in order to solidify those ideas, to discover the
requirements as you evolve a system, the above makes a lot of sense
(or would, if I fleshed them out a bit more). This kind of environment
is much more on the artistic, intuitive side of software development,
the side that acknowledges that every new project is a “first time”
situation. If the solution already existed, we’d just buy it, so we
might as well embrace the uncertainty and develop techniques to
minimize bad choices.
When you work in an environment where the outcome of a given project
is absolutely clear, then I think most of the above is not
necessary. It’s easy enough to go with the waterfall method, or at
least to start there, by gathering all the requirements, making sure
they’re written down, and then using those requirements to schedule
and scope. In such a case, you can use any technology you want because
you can know up front if you’re going down the wrong path simply by
looking at your requirements document.
A comfortable world, if you can get it, and one that, I’m convinced,
no longer exists. In fact, I bet it never existed.
This post concludes a three-part series:
Part 1, about the problem we had to solve (validating
product serial numbers), and the resources available to us to
solve the problem, and
Part 2, about our solution: using an asynchronous web
service as an external interface to our application, and
asynchronous messaging as the backbone of the internal architecture.
This third and final part is a catch all covering some of the
operational details of the service, including build, deployment, and
monitoring, connecting to the outside world, and testing.
digression on the evils of the “software factory”
Why would I, a developer of this distributed, asynchronous
architecture, have much at all to say about operational details? Let
me begin with a digression:
A lot of my fellow colleagues — developers, operational staff,
and quality-assurance folks — tend to think that software can be
done in an assembly line fashion. The developers write the code,
someone else builds it, yet another team tests it, and a final team
deploys it. They see this as a sign of organizational maturity, or
even as part of the maturation of the software industry at large.
Alas, I don’t believe the above for a minute. Yes, it can work, but,
in my own experience, it turns a going concern into a slow moving,
classical IT shop, one that says “no” to product, marketing
or sales groups, bogging them down in endless progress and process
details. (And I’m not even talking about what it does to developers.)
In fact, I’d say that projects running in Software Factory mode are,
essentially, dead projects. No growth, no change, no evolution, and no
radical discoveries that open up whole new possibilities.
I’ve found that the more detached a given development team is from
issues of testing and deployment, the more mistakes they make, and the
more mandated policy and management is required, thus causing even
more mistakes. At best, moving things along is slow. At worst,
developers make fundamentally bad designs not because the designs
don’t work, but because they’re too hard to operate. What makes sense
in a single binary doesn’t make sense when an application is spread
over several binaries, and what makes sense on a single workstation
doesn’t make sense running on multiple hosts in a data center.
But let me leave all this for another rant. What I’d like to talk
about is how the asynchronous messaging architecture facilitated
operations.
build & deploy
If you read Part 2,
you’ll remember that we created five services making up the serial
number validation application:
Submitter: Accepted jobs for validation.
Publisher: Published results of validation.
Oracle Querier: Queried a remote Oracle Database for serial
number information.
Web Service Querier: Queried a remote Web Service for serial
number information.
Refiner: Delegated validation requests to the above query
services, and assembled results for publication, including
“fuzzy logic” for “almost” matches.
To build for deployment, we decided on the following principles:
Developers should be able to check out each project, compile and
run it with no extra environment setup on their development
machines. In other words, projects, as organized in a revision
control system, should be optimized for developer
productivity. And by optimize we meant quick edit-compile-test
cycles, and minimal (or no) documentation about how to set up your
environment.
Production deployment issues should be captured in their own
project, which knows how to check out the services, build them,
apply operational details such as configuration,
production-oriented log4j.properties (say), file locations, init.d
start/stop scripts, etc.
The guiding principle for all of the above was to separate the issue
of developing the code from the issue of deploying it and then solving
each of the problems according to the problem’s specific
requirements. (Using a single build process for both issues makes for
something far more complicated than keeping the concerns separate.)
Each of these services existed in a separate directory in a Subversion
Repository. Each service was buildable on the command line using
ant, which created a “target” subdirectory, moved all the
third-party jars, log4j.properties configuration and application
classes into that directory, and included a run.sh script which
could start the application for testing as you developed code.
Edit, compile, run, test wasn’t much more than the following command:
target> `cd .. ; ant ; cd target ; ./run.sh`
After changing the code, you could just hit Control-C, up-arrow (to
get the above line), and return. Experts could refashion the above
command-line to terminate if the compile was unsuccessful, rather than
run the code regardless. IDE lovers could configure
their software to do the above, but why bother? Using the command-line
guaranteed that other software (such as the packager or tester) could
also check out and build your app without involving an IDE.
We created a sixth project directory called the packager, which was
responsible for building the code for deployment. The packager created
RPM packages (our target was a RedHat Linux VMWare
instance). The project contained the production oriented
log4j.properties files, RPM spec files for the
post/pre-install steps, and so on.
On installation or update, the RPM packages:
created non-shell users for each service,
installed config files in /etc/,
installed RedHat style start/stop scripts in /etc/init.d,
deployed the binaries in /opt/apps/,
created a data partition for storing published files in /data,
configured Apache to redirect all port 80 traffic to port 443,
configured Apache to use mod_jk for proxying to the submitter and publisher,
managed and rotated the SSL certificate for Apache,
set up HTTP Basic Authentication,
and so on and so forth. In other words, installing the RPMs turned a commodity, standard-ops RedHat Linux machine into a Serial Number Validation machine without any user intervention.
The slightly-modified RedHat installed by the operations group had apt-get installed and pointed to a corporate repository for Linux, and so all we had to do in terms of “manual” configuration was add a line to the apt-get config file to point to our own repository.
From then on, deploying code for the first time was a simple:
apt-get install snv
with snv being a meta package which depended on the Apache config package, Apache itself, Java, our services, and so on. The dependencies were arranged such that everything was installed in the correct order.
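The “installed in the correct order” guarantee is, at heart, a topological sort of the dependency graph: every package comes after the packages it depends on. A rough Python sketch of the idea, with a made-up graph loosely modeled on the snv meta package (real apt does far more than this, of course):

```python
def install_order(packages, deps):
    """Return packages ordered so each comes after its dependencies.
    `deps` maps a package to the packages it depends on.
    (Cycle detection omitted for brevity.)"""
    order, seen = [], set()

    def visit(pkg):
        if pkg in seen:
            return
        seen.add(pkg)
        for dep in deps.get(pkg, []):
            visit(dep)       # install dependencies first
        order.append(pkg)    # then the package itself

    for pkg in packages:
        visit(pkg)
    return order

# Hypothetical dependency graph for the meta package.
deps = {
    "snv": ["apache-config", "services"],
    "apache-config": ["apache"],
    "services": ["java"],
}
print(install_order(["snv"], deps))  # dependencies print before dependents
```

Asking for the one meta package pulls in everything beneath it, which is exactly why `apt-get install snv` was the whole deployment story.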
To upgrade to new versions of the service:
apt-get upgrade
and that was all there was to it. This worked for test environments, QA environments, and so on.
Because of apt-get, we were assured that all dependencies we needed were downloaded and installed, even if we introduced new ones with new versions of the application. It was impossible to install our code if a dependency couldn’t be met, and that’s exactly what we wanted.
The Ops Staff, overworked, underpaid, and under constant threat of being “right-shored,” were very happy about this situation. We developers were happy because our documentation for setting up and maintaining the service wasn’t more than a single page, most of which was letter-head, introductory remarks, contact information, and so on.
connecting to the outside world
The serial number validation service was in no way a public service,
and was meant, at least initially, to serve only a single client. (We
accounted for the possibility of other clients inside the batch
submission format, the publication URL construction
formula, and other authentication schemes). As such, The Company
insisted on a two-way SSL certificate authentication scheme.
What we ended up with was something like the following:
The client used a certificate to communicate with a load balanced web
proxy farm running in the data center. The web proxy redirected
traffic over an SSL encrypted socket connection to an
Apache server running as part of our service. The Apache server only
allowed connections via port 443, using HTTPS, and
redirected all other traffic to an error page. Also, the Apache server
was configured with basic HTTP authentication so as to
protect it from other services also running on the internal network,
of which it was a part.
This is a fairly traditional set up for web services, so I don’t
really need to go into it. The set up was also out of our hands as a
development team. The one thing to note is that we deployed the Apache
set up, including the locally generated certs it needed, as part of
our installation, so it needed no intervention by an Ops staff.
In fact, the Ops staff took a cue from us and began to deploy
certificates via RPMs on most of their other
machines. This made things very easy for them when it came time to
update or renew them.
monitoring, observing, etc
The one thing we needed to instrument for the first pass of the
application was whether or not the external web interfaces to the
application (Submitter, Publisher) were up or down. The idea was that
the load balancer between the web proxy farm and our application would
detect if the service was down and alert the appropriate support
staff.
Rather than affix this concern to either of the web services, we
decided to apply the idea of ruthlessly separating concerns by
creating another service, called the health monitor, which would
monitor all the other services, and publish a static Apache page
containing the status of the given services.
What this required was that each service implement a module which
subscribed to a ping topic, and published to a pong topic. A
message on the ping topic would produce an event that led to a
message on the pong topic. That message contained the name of the
component, its location, and any other details we cared about. For the
first pass, all we sent was the name of the component, which was good
enough.
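The ping/pong module can be sketched as follows. This is a minimal,
illustrative version: the in-memory `TopicBus` stands in for the real
JMS broker (ActiveMQ in our case), whose API differs, and the function
names are my own invention.

```python
from collections import defaultdict

class TopicBus:
    """Tiny in-memory stand-in for a topic-based message broker."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        for callback in self.subscribers[topic]:
            callback(message)

def install_health_module(bus, component_name):
    """Subscribe to the ping topic and answer each ping on the pong topic."""
    def on_ping(_message):
        # First pass: the pong carries only the component's name.
        bus.publish("pong", {"component": component_name})
    bus.subscribe("ping", on_ping)
```

Each service would call something like `install_health_module` once at
startup; the service's main logic never needs to know the module exists.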
Here’s an illustration of the anatomy of a given service running in
the application:
(The above shows how easy it is to write event driven services which
are largely ignorant of the applications feeding them data, and are
also largely free of complicated data-flow logic.)
The monitor service subscribed to all the pong topics, kept track
of the last time it saw a pong for a given service, and displayed an
error message on a web page if it had not seen a message in over a
minute. (In honor of the national security ‘color alert’ system going
on at the time, we added a colored square next to the name of each
component: yellow, red, or green, depending on just how ‘late’ a pong
was.)
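The staleness-to-color logic amounts to a few lines. A sketch, with one
assumption flagged: the one-minute error threshold comes from the text,
but the intermediate yellow cutoff is an invented value.

```python
import time

def pong_status(last_pong_time, now=None, yellow_after=30.0, red_after=60.0):
    """Map the age of the last pong (in seconds) to a status color.

    The 60-second 'red' threshold matches the original design; the
    30-second 'yellow' cutoff is an assumed intermediate value.
    """
    now = time.time() if now is None else now
    age = now - last_pong_time
    if age >= red_after:
        return "red"
    if age >= yellow_after:
        return "yellow"
    return "green"
```

The monitor would run this for each component it tracks and render the
result next to the component's name on the static status page.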
We never went any further than this, but it was pretty clear to us
that we could leverage that ping topic for all kinds of status
messages, and that we could use a similar set of topics for adjusting
service parameters on the fly. I worked on subsequent services where
we did this, but that story’s for another day.
We created a Python test
script, similar to JUnit, but suitable for asynchronous testing. It
could spawn a process to send a serial number to the service, then
wait a bit, then poll for the result, test it, eventually timing out
if something went wrong. A black-box tester. After I left the group,
another developer rewrote the whole thing in Java because he was more
comfortable with the language and with the threading tools
available. The fact that the testing module was separate from all the
others, that it was just another project within the source code vault,
made it easy to do just this sort of thing. No need to touch all the
other code: just create a new testing module, a better one, ditch the
old one, and there you go.
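The core of such a black-box tester is a submit-then-poll loop with a
timeout. Here is a minimal sketch; `fetch_result` is a hypothetical
callable standing in for however the test harness retrieves a published
result.

```python
import time

def poll_until(fetch_result, serial_number, timeout=10.0, interval=0.5):
    """Poll for an asynchronous result until it appears or time runs out.

    `fetch_result` is assumed to return the published result for a
    serial number, or None if the result hasn't appeared yet.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_result(serial_number)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"no result for {serial_number} within {timeout}s")
```

A test case then submits a serial number to the service, calls
`poll_until`, and asserts on the returned result, failing cleanly on
timeout rather than hanging.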
These three long semi-essays are really all the conclusion I need: I
wouldn’t have written this up if I didn’t think that designing a
service in just this way, using the underlying principles, anyway, was
just about always the right way to go.
For me, the big win was using topic-based, asynchronous messaging as
the way to do interprocess communication between the components of the
application.
Using topics disassociated consumers from producers, simulating the
adaptability and conceptual simplicity of the stdin, stdout
filters making up most of the tools we all know and love on the
UNIX command line.
Using an asynchronous mode encourages event-based programming, which
tends to make each component much easier to write and far more fault
tolerant. Actually, I should amend that: asynchronous messaging forces
you to deal with fault tolerance as a design issue rather than an
afterthought when you go about making your code production worthy. For
instance, if you ship data off to a topic, and don’t know when you’re
going to get the results back, the solution of persisting the
intermediate state to disk (say), and then re-loading it when a
message comes back, is both the solution to an asynchronous
request/response pairing, and services the needs of a fault-tolerant
system that might crash (or get re-installed) at any time.
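The persist-then-reload pattern can be sketched like so. This is an
illustrative layout (one JSON file per job id), not the actual storage
scheme we used.

```python
import json
from pathlib import Path

class PendingStore:
    """Persist in-flight request state to disk so that a crash (or a
    re-install) between request and response loses nothing."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, job_id, state):
        # Called when we ship a request off to a topic.
        (self.directory / f"{job_id}.json").write_text(json.dumps(state))

    def load(self, job_id):
        # Called when the matching response message finally arrives,
        # possibly after a restart.
        path = self.directory / f"{job_id}.json"
        return json.loads(path.read_text()) if path.exists() else None

    def complete(self, job_id):
        # Called once the response has been fully processed.
        (self.directory / f"{job_id}.json").unlink(missing_ok=True)
```

The same store that bridges the request/response gap is what makes the
service restartable at any moment, which is the point of the paragraph
above: fault tolerance falls out of the asynchronous design.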
Asynchronous modes are so usable, I think, that they should be the
default for how you design services rather than an exception. You
should only use synchronous calls when there’s no way you can get
around it (such as we did for the submitter). And even then, you can
often simulate asynchronicity.
Finally, a big win all the way around is the use of packages (or
installers) native to the Operating System on which you’re going to
deploy the distributed application. This encourages automation, gives
you dependency checking for free, reduces the amount of documentation
you have to write, and builds trust between the development and
operations sides of the house. (Any Ops person who’s had to read a
five page “cookbook” for installing updates while in a cube being
interrupted by marketing and sales folks will very much appreciate
this.)
I went on to use these techniques in a couple of later, much bigger
projects, and the developers I worked with, once they gave up thinking
I was crazy or self-serving, ended up really liking these
techniques. We were always done way ahead of schedule, never had to
work weekends (at least not because of our own code), and were
generally insufferable in our glee at being ahead of the game.
And what’s not to like? You write very simple, single-purpose
applications, and, somehow, as a side effect, you end up with a rich
and complex distributed system.
Complexity is, after all, an emergent property of systems of simple
components. Make those components OS processes, and you’ve got a
distributed application that works and is easy to evolve. Ideas worth
keeping in mind.
This is part 2 of a 3 part series about an asynchronous web service I
worked on a few years ago which led to a lot of the ideas I now hold
about how to design distributed systems. In part 1, I
talked about the problem we had to solve, which was:
create a web service to validate serial numbers, and
figure out how to negotiate numerous internal resources, not all
of which are available all of the time, and most of which were
expected to change over time.
We ended up deciding to solve the problem by creating an asynchronous
web service with the following interaction (from the client’s point of
view):
A client submits a job containing one or more serial numbers for
validation.
At a later point in time, the client retrieves the results by using
the individual serial numbers to compute a URL.
In other words, the client polls for the results and may resubmit
numbers if it feels that the result has not appeared in a reasonable
amount of time.
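From the client’s side, the whole protocol reduces to “submit, then
compute a URL per serial number and poll it.” A sketch, with the URL
formula invented for illustration (the real construction formula was
part of the interface spec and isn’t given here):

```python
def result_url(base_url, serial_number):
    """Compute the polling URL for a serial number.

    The real service defined an exact URL construction formula; this
    path layout is a made-up stand-in.
    """
    return f"{base_url}/results/{serial_number}"

def poll_plan(serial_numbers, base_url):
    """One polling URL per submitted serial number. The client checks
    each URL and re-submits any number whose result never appears."""
    return {sn: result_url(base_url, sn) for sn in serial_numbers}
```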
areas of concern
The first thing we did in figuring out how to build the application
was to figure out what problems we had to solve, or what areas of
concern we had as far as solving the problem. Here’s what we came up
with:
Accepting serial numbers from external clients for validation.
Publishing validation results.
Querying the Web Service resource for serial number data.
Querying the Oracle Database resource for serial number data.
Given all the results, refining an appropriate answer.
What we ended up with was a rough pre-design as follows:
As you can see, these areas of concern line up pretty obviously along
the lines of which external resources or clients they interact with.
The dashed lines represent the division between the solution space in
which we implement pieces, and the partners with whom we need to
integrate.
The submitter and publisher line up with the client: they’re client
interfaces and most of their concern is made up of how best to
interface with a remote client, the Rebate Processing company.
The Oracle and Web Service Query concerns line up with the services
they consult. Most of their concern is with how to contact,
authenticate, query for and process the results of the data.
It doesn’t take much of a leap of the imagination to see the above
five areas of concerns as five separate services within a distributed
application. You could also see them as five separate modules in a
single monolithic application. (Or even, say, five different Web
Applications in a Web Container.)
why separate services?
In my own experience, writing monolithic apps in the presence of
ever-changing requirements, integration touch points, and
implementation technologies was quite painful: managing separate
concerns in the same code base slowed me down considerably.
Therefore, I advocated strongly for maintaining separate applications
for each area of concern.
It’s not that monolithic applications are all that bad (oh, all right,
they are), it’s just that by merging all five concerns into a single
application, one ends up introducing all kinds of additional
abstractions in order to manage substantially different tasks.
For instance, just about everyone is tempted to treat the results of
the Web Service in a way similar to the results of the Oracle SQL
result set in some data abstraction layer that’s really, really cool
to implement, but becomes the absolutely wrong thing to do when you
need to adjust to a new requirement. And don’t get me started on
elaborate XML meta/domain-specific languages meant to bind disparate
concerns together into a single binary in hopes of creating something
easy to refactor.
Finally, if the implementation of one of your areas of concern is a
bit wonky, or uses memory-leaking libraries, it’ll take down the whole
app and you’ll never figure out why. Is it due to the implementation
of one of the concerns, or due to the impact of one concern’s
implementation on another’s when they’re running in the
same address space?
By splitting the app into five separate services, you can at least
rule out the other four areas if something goes wrong.
pass 1: a vague architecture
Okay, so we decided to write a bunch of stand-alone services rather
than a single application.
Here’s what we ended up with:
Each area of concern became a new service in the distributed
architecture. Each service can be optimized and designed according to
its specific concern. For instance, the Submitter Service is good at
accepting client connections and validating data without having any
part of its code base having to deal with Oracle JDBC
drivers. It can implement caching schemes, thread pools, and so on,
depending on how the service needs to work under load.
The biggest win for this kind of separation is how much easier it
makes any given developer’s life. If the author of the Submit Service
wrote a lot of ad hoc, not-well-planned, first-draft code, well, no
big deal for any subsequent maintainer. Because the code only does one
thing, even the worst code ends up being easier to figure out.
The question the above illustration brought up next was how these
applications were going to communicate with each other. Back then,
SOAP was on the way out. Even when you own all sides of
a given distributed application, SOAP proves to be just
too much bookkeeping all the way around.
That left “simple” sockets and a custom protocol, or
HTTP, or something asynchronous, like a
JMS provider, which is what we went with.
pass 2: asynchronous messaging
The things we liked about the JMS /
asynchronous-messaging approach were:
emphasis on interfaces: In a message-based system, the messages
are the architecture. As long as the messages are
self-describing, complete, autonomous data qua data, nouns
instead of verbs, the application becomes easy to document and
easy to understand for the people who have to maintain it. In
other words, given a certain message going in, and another message
coming out, you can pretty much deduce what the service does
without any documentation at all. This is a good thing. Burying
interface decisions in shared libraries (say) of your own
composing, or in app stacks such as
J2EE or .Net, often hides how things are done and
thus makes debugging and integration difficult.
decoupled concerns: With asynchronous messaging (using topics
rather than queues), your interfaces are decoupled even from the
other services making up your application. A given service
publishes data to a topic and doesn’t need to be concerned if, or
where, a given consumer of those messages resides, or what its
purpose is. (With HTTP, you have to know the URL to
post to, and that URL has to be up. If it’s not,
you have to manage fault tolerance yourself.) The service consumes
messages in the same way. It’s as close as you can get to a
standard-in, standard-out kind of UNIX filter.
event driven: With such a system, individual services can be
event driven. A message comes in, it gets processed, and then it
gets posted to another topic. Very clean, especially given that
messaging infrastructure, in our case, ActiveMQ, provides all
fault-tolerant communication for you. Writing the individual
services in such a distributed application becomes similar to the
callback methodologies in GUI programming (though
I’d not want to press on that analogy too much).
easy to evolve: If all interprocess communication (except at leaf
nodes, e.g., the interfaces to the outside world) uses topics (the
JMS version of the blackboard metaphor), you can
hook additional clients to those topics to expand functionality
without having to change any of the existing components. This
comes in handy for monitoring and metrics, especially during
development. Given that we weren’t sure what additional internal
resources we might need to consult as the project matured, being
easy to evolve with as little code change as possible was a very
good thing for us.
hot upgrades: If you’re okay with the external interfaces going
down for brief moments, you can re-install all the components
making up a message based system without taking special care to be
“down for maintenance.” As one service shuts down, the message
broker keeps the messages in its local store. When the message
broker itself goes down, each producer client blocks until it
comes up again.
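The “messages are the architecture” point above is easier to see with a
concrete message. A sketch of a self-describing, autonomous message
(the field names are illustrative, not the actual wire format we used):

```python
import json

# Everything a consumer needs travels with the data itself: what kind
# of message this is, which job it belongs to, and the data proper.
serial_check_message = {
    "type": "serial-number-check",
    "version": 1,
    "job_id": "job-0001",
    "serial_number": "AB12-CD34",
    "submitted_by": "rebate-processor",
}

def encode(message):
    """Serialize a message for the wire; sorted keys keep it diffable."""
    return json.dumps(message, sort_keys=True)

def decode(payload):
    return json.loads(payload)
```

A maintainer who sees this message go into a topic and another message
come out can deduce what the consuming service does, which is exactly
the documentation-for-free property claimed above.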
Another aspect of the technical choice we made was that a
message-based system was new to most of us, and it’s good to gain a
much broader perspective on the types of architecture one can use to
solve problems. The appeal of trying something new rather than
suffering from the same old problems is not something to be shrugged
off. We’re all human and software is an art and a craft. Sure, it
makes use of some engineering principles, some science, some
mathematics, and even rules of thumb, but so does any fine art. We
wanted to recognize the need to explore alternatives, and to embrace
that need rather than deny it under the rubric of traditional “best practices.”
If the above seems kind of sketchy for justification, chalk it up
partly to my faulty memory, and also to the fact that an architecture
that embraces the asynchronous style everywhere it can is best
justified by how easy it is to maintain, how little code you have to
write to support it, and how simple it is to understand and
troubleshoot. These are experiential justifications which are hard to
convey via diagrams or the simple three-tier design that so many
managers, stakeholders, and operations staffs are familiar with.
pass 3: how it turned out
Given the above diagram as a starting point, and the notion that we
wanted to use asynchronous, topic-based messaging as our data
pipeline, all we had to do was place a topic between each stand-alone
service along all the internal interfaces:
The above illustrates the one big drawback to messaging systems:
they’re hard to draw in such a way that managers or architects don’t
rub their eyes and mutter, “too complicated,” or, as I translate it,
“too many notes.”
The most complicated part of all of the above is the Refiner Service
which has a lot of inputs and outputs.
Here’s a narrative of how the Refiner worked, which should give you a
flavor of how easy it is to think about something rather complicated:
The Refiner receives a message from the job-submit topic,
unpacks it, crafts up a serial-number message, posts it to the
web-service-query topic, and then it’s ready for the next
message.
The Refiner receives a message from the query-service-result
topic, unpacks the message, and examines it. If the serial number is
validated, it posts the message to the job-complete topic. If it
is NOT validated, it posts the message to the database-query
topic. And then it’s done (or, rather, is ready to process the next
message).
The refiner receives a message from the database-query-result
topic, examines all the results available, figures out how to
describe the result (good, shaky, invalid), appends that data to the
message, and writes it to the job-complete topic, and we’re
done.
With not much imagination, you can see how each of these flows can be
organized as a “plugin” floating in the Refiner Service, with an
outbox publisher and an inbox subscriber object with appropriate
callbacks. Need to query additional resources? Just add more plugins
and adjust the existing inboxes or outboxes as necessary. (And
remember, changing topic names in your code is a lot easier than
changing XML bindings in three config files.)
Don’t like how things are going? You might choose to rewrite the
Refiner Service, or split it into three services. Regardless, the
Submitter, Publisher, and both Resource Query services all remain
untouched.
That is, unless you change the message format. But, again, changing
the message format is changing the architecture, and even that’s
pretty easy (if you use, say, XML and XPath, or even
JSON, in which case adding new elements does not
require immediate changes if the consuming services don’t need the
new elements).
The upshot of all this is not that there won’t be change over time, or
that a particular change might not have to be done in multiple places,
but that it’s always clear what the impact of any change will be,
and, because each service in the application is small and
single-focussed, it’s easy to assess the impact of the change on any
given service.
I cannot over-emphasize how important this kind of architecture turned
out to be for managing change over time with only one or two
developers and an extremely over-worked operations staff.
In the next part of this long, long essay, I’d like to
discuss some of the operational details that the messaging backbone
afforded us, and how we deployed and maintained the application as a
series of services running on a linux VMWare instance, and, finally,
what happened to the service when the maintainers were forced to move
it to a J2EE WebLogic cluster solution.
A company I worked for (let’s just call it The Company) sold a lot of
products and offered a lot of rebates. The rebates were processed by a
Rebate Processor company which took in the numbers and other rebate
information from customers (such as product descriptions), did all the
paperwork, sent out the cash, then billed The Company for its efforts.
In other words, The Company outsourced the handling of rebates, as I
imagine many companies do.
The problem, though, was that it was possible for miscreants to
introduce fraud into the system by submitting properly formatted
serial numbers which were, nevertheless, fake.
What The Company wanted to do was offer a service such that the Rebate
Processor could ask us if a given serial number was not only valid
(proper numbers and letters in the right order), but had actually been
issued against a product instance.
(My apologies for the vague language, but hopefully you understand the
legal implications of mentioning anything to anyone. Oy.)
You’d think that the solution should be pretty easy. Just offer a web
service that, when you post a serial number, responds with a “true” or
“false,” depending on whether or not the serial number was ever used.
The catch: The Company did not have a single data source with all the
serial numbers ever used for all the products it sold, or had ever sold.
The Company had acquired many other companies, each of which had
their own methods and data stores dedicated to issuing, managing
and tracking serial numbers.
The Company itself had, for a long time, developed a federated
culture in which each division was locally managed, with only
minimal oversight from the corporate leadership. Each of those
divisions represented quite varying products and product families,
and each one had its own way of managing serial numbers.
Over the years, there were efforts to consolidate this information,
and those efforts were largely successful in that there were just two
sources to consult about the validity of serial numbers:
A web service, with a complicated XML/HTTP
interface. Not SOAP, not REST, but
just XML posted to, and retrieved from, an HTTP endpoint.
An Oracle database. A BIG Oracle database, with lots of views, and
many tables containing many serial numbers defined in
not-readily-discoverable, and potentially ever-changing, ways.
These two data sources were internal, and did not have
particularly stringent service level agreements. If the Oracle
Database needed to go down for maintenance, it went down for
maintenance, users beware. Same with the web service. The potentially
lackadaisical uptime for these services was reasonable, given what
they were normally used for, and given their role in the normal
business operations of the company.
Finally, there was a good chance that a perfectly valid serial number
on an actual, physical product was not in either data store.
summary of complicating factors
Let’s summarize the situation:
More than one internal service.
Unreliable internal services.
Both internal services (potentially) must be consulted to resolve
a question about each submitted serial number.
Valid serial numbers might not be found in any internal data source.
With new acquisitions, there might be additional data sources to
consult.
The two existing internal services might merge, or morph into a
third, grand-vision, data-warehouse-like thing (which is always
the threat in a corporation as mind-bogglingly borg-imitating
as The Company).
The bottom line is that any design, we thought, would have to
accommodate, maybe even, dare we say it, make it easy to make changes
over time.
In attempting to work out what to do about the above, we contemplated
several options, which boil down to the following three approaches:
Synchronous with Synchronized Cache: Periodically import all
serial number data from all available systems into a local service
database, and serve out answers synchronously.
Synchronous, Luck of the Draw: Each incoming web request
should consult each internal service in turn, and respond with the
results, as best it can, even if one or more of them are down.
Asynchronous: Submit a request asynchronously, and look for
the completed request at a later time. Internally, we move the job
around, consulting each source, make our best guess about the validity
of the number, and “publish” the result for later pickup.
We chose the last option (thankfully, or there’d be no reason to write
this, at least as far as my interests are concerned).
We couldn’t use Option 1, in which we’d import all available data,
for several reasons: even if we imported only the serial numbers with no
associated metadata, we’d have more data than our little effort could
sustain, and some Architect who didn’t understand that shared state is
bad would see the copy as duplication, rather than caching, and nix
the project. Finally, we had to also import additional metadata so
that we could guess if a given serial number we didn’t have is at
least likely to be legitimate. (We’d publish a “confidence factor” if
we couldn’t find an exact match.) And, of course, we only had about
two months to develop the entire solution and even if importing data
was fast and easy, procuring enough infrastructure to make it happen
most definitely wasn’t.
Option 2 seems, on the surface, the most reasonable, except that one
of the resources we needed to consult was a database with millions of
rows of data. It was unclear that the queries we’d have to run to make
it work could complete before an HTTP request could
complete. Timeouts are unpleasant on either side of a remote procedure
call. Also, the resulting monolithic webapp code would be further
complicated each time we added a new resource to consult, or had to
change our strategy. How would we know if fixing one part of the app
would break the other, seemingly unrelated part? What we needed was a
way to handle a potentially long-running request.
And so, finally, we settled on Option 3, an asynchronous web request
style service, which is maybe another way of saying “batch
processing.”
The client would submit a batch of serial numbers and any associated
metadata (such as the model of the product). We’d return with an OK
if the batch job was valid, and at some later point in time, the
client would use the numbers in that batch to poll for any results. If
the client could find no results for a specific number, they were free
to re-submit it in another job.
Using the above strategy also allowed us to carry the idea of
asynchronous services behind the scenes and make it the underlying
methodology of the entire supporting architecture.
That architecture, with all the unintended benefits it provided, is
the whole point of this long exposition, and will be the main subject
matter in the next article.
Tim Bray’s article got me thinking about REST and about
synchronous vs asynchronous interfaces.
What really got me interested in asynchronous services, especially
message-based services of the fire-and-forget kind, was how helpful
such things were as you develop and maintain services over time, and
across organizations, or even across the “divide” between one
head-strong developer and another, or between two tasks that are
completely different, but share data.
But that’s for another note, another time.
One of the issues I’ve had with REST is not the style itself, but its
synchronous nature, or at least how it’s used in common web-service
style architectures as not much more than a function call. Cleaner
than SOAP, certainly, more maintainable and understandable, but,
basically, a function call.
Nevertheless, a step in the right direction, at least for me, is to be
able to use a REST style HTTP interface in a fire-and-forget kind of
way.
Tim’s article is mostly about making HTTP requests which initiate
actions that take longer than a traditional HTTP request should
last. How do you work around connection timeouts?
I’m interested in a slightly different but related idea: how do you
make a REST request without getting any answer, but then set things up
so that you can get the results of that request at a later time?
The idea of polling is that you submit a request to a specific
resource, which returns an in-progress result code, and a payload with
a URL providing you with a resource you can consult about the status
of the job.
You can periodically issue a GET on that resource to find out the
status of the job. Presumably, when the job is complete, the poll
request will provide a link to the finished results (if there are
any).
I’d guess that this solution is pretty hard to scale, though that
might be done by returning not only how complete the job is as a
percentage, but an allowance for how many times in a given time frame
a client is allowed to check back. For instance, a client might be
allowed to check back 3 times a minute, or each check might provide a
suggested time for when to check back next (and refuse any checks
earlier than that).
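The rate-limited variant of the poll response can be sketched in a few
lines. Field names and the 20-second interval are illustrative, not
from any particular spec.

```python
def poll_response(job, now, min_interval=20.0):
    """Build a poll response that tells the client when to check back.

    `job` is assumed to be a dict with 'done', and either 'result_url'
    (when finished) or 'percent' (while in progress).
    """
    if job["done"]:
        return {"status": "complete", "result_url": job["result_url"]}
    return {
        "status": "in-progress",
        "percent_complete": job["percent"],
        # The server refuses checks earlier than this timestamp.
        "next_check_after": now + min_interval,
    }
```

Over HTTP, `next_check_after` maps naturally onto the standard
`Retry-After` response header.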
I actually implemented a polling-style service like this a few
years ago. The client would submit a request to the service with a
payload containing a unique ID. At a later time, the client was
supposed to use that ID to construct a URL to look for the result. It
was considered okay if, after a sufficiently long time without a
result, the client could re-submit the request. As far as I know, the
service is still in production.
My favorite method is the callback.
When you submit a job, you include in that job a URL to which the
results should be POSTed. Your client can then consider its task of
submitting the job as complete and go on doing other things. Sometime
later, another part of your app, the “server” part, gets a request,
which is the result of the job.
Event driven logic.
Very clean. And works well, if you control both sides of the network,
as in, both sides reside within your data center. Not so good if you
want to make a request from your data center to the external world,
given firewall issues. (Of course, given that the call back is HTTP,
it’s probably not as controversial as your average Ops manager might
make you think.)
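The callback shape is simple enough to sketch. In real life both sides
would speak HTTP (the client POSTs the job, the server POSTs the
result); here a plain callable stands in for the transport so the shape
of the exchange is visible.

```python
def submit_job(job_payload, callback_url, transport):
    """Fire-and-forget submission: the job carries the URL to which the
    server should POST the result. `transport` is a stand-in for a real
    HTTP client call."""
    envelope = {"callback_url": callback_url, "payload": job_payload}
    transport(envelope)
    # The client's part is done; it can go do other things.
    return "submitted"

def server_side_complete(envelope, result, post):
    """Server side: when the job finishes, POST the result back to the
    URL the client supplied in the job itself."""
    post(envelope["callback_url"], result)
```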
The really important part of this, though, is to allow for
asynchronous behavior on top of a strategy that (in the pop culture
that rules the tech world) is mostly conceived of as synchronous.
If you submit a request without requiring an immediate response, that
job can be shipped off to be handled by many internal (and invisible
to you) services, any one of which may fault, or timeout, or throw
exceptions. By requiring an asynchronous methodology, the immediate
transaction, that of connecting to the job-submission service, is very
simple. Either the job was submitted, or it was not.
It’s also straightforward to find out if the job succeeded or failed
via the polling or callback methods, at which point you re-submit the
whole thing, or simply log it for later human intervention.
Large grained simplicity over fine-grained complexity.
People in my line of work (writing distributed, network applications)
seem to be afraid of Erlang.
Several reasons, I think:
Something new to learn, and thus they feel that they’re at a
disadvantage, won’t be able to contribute, or will acquire a
skill with no value on the job market.
Worried that they won’t be able to find developers to work on the
code base once the original authors leave, and fear that any new
technology is just as bad as the last few “hacks” they graciously
allowed to invade the architecture (except Java, which is somehow
always the right thing).
The strange thing is, I see people all over the place who want to work
with Erlang (or Python or Ruby or Scheme or Lisp or Smalltalk), not so
much because Erlang is a cool language, but because it solves so many
of the problems they have to deal with day in and day out. And it’s a
cool language, to boot.
The whole thing strikes me as a destructive, self-fulfilling
prophecy. We won’t invest in the small amount of time it takes to
learn the language because we don’t see that there are potential
employees out there to be hired if we need them. And there aren’t any
potential employees because no one will let anyone use Erlang for
anything.
A Catch-22, to be sure.
What distresses me is that, deep down, such decisions aren’t
rational. They’re based on fear. Fear of change, and fear that one’s
employees might be necessary to one’s success, rather than being
discardable, fungible assets. One could claim that choosing Erlang in
itself is emotional, and I’d tend to agree, but it’s a positive
emotion: pleasure at being able to solve problems more easily,
pleasure learning something new, pleasure at opening up the possible
range of solutions for any given problem. If one makes a decision on
emotional grounds, these are the right emotions.
What also distresses me is that I don’t believe hiring managers know
anything at all about who they need to hire, and the skills they need,
and thus fall back on the notion that expertise in a set of platform
tools means anything at all. (A man who can wield a hammer with the
best of them may or may not be good at building cabinets.)
Right now, when I see job positions asking for candidates well-versed
in Java, J2EE, Spring, Hibernate, Inversion-of-Control, and so on, I
see a shop in which very little gets done over a long period of time,
a process-bound, overly hierarchical, overly large team: the
mythical man-month incarnate. I see a recipe for failure.
Like the lumbering empires of old, the appearance of strength and
stability hides a vast and empty core.