Java v C# – Pass by reference

Java doesn’t support pass-by-reference, whereas C# does. Generally I find pass-by-reference to be of minimal importance, but there are a few patterns where I’ve become a big fan of the feature. For instance, it allows you to combine the “verify my object exists in the collection” and “return the object” operations into a single step, without having to deal with magic values:

MyClass obj = null;
bool success = some_collection.TryGetValue(key, out obj);
if (!success)
{
    return failure;
}

// other logic here, confident that obj was initialized.

Java doesn’t have this feature, and I miss it.

 

Aside: while it initially irked me, I think the way C# handles ref and out parameters is nice – they have to be annotated both at the method definition and at the method invocation, which eliminates any confusion about whether you’re passing by reference or value.  Kudos to Anders.
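As a minimal, self-contained sketch of the pattern above: the class and method names below (OutParameterDemo, TryParsePort, Describe) are hypothetical and only there for illustration, while Dictionary.TryGetValue and int.TryParse are the standard .NET calls. Note that out appears both in the method definition and at every call site.

```csharp
using System;
using System.Collections.Generic;

public static class OutParameterDemo
{
    // Hypothetical helper: 'out' is required here, at the definition...
    public static bool TryParsePort(string text, out int port)
    {
        // int.TryParse always assigns 'port' (0 on failure), so the out
        // parameter is definitely assigned on every path.
        return int.TryParse(text, out port) && port > 0 && port <= 65535;
    }

    public static string Describe(Dictionary<string, string> settings, string key)
    {
        string raw;
        // ...and 'out' is required again here, at the invocation.
        if (!settings.TryGetValue(key, out raw))
        {
            return "missing";
        }

        int port;
        if (!TryParsePort(raw, out port))
        {
            return "invalid";
        }

        // From here on, both raw and port are known to be initialized.
        return "port " + port;
    }
}
```

The existence check and the retrieval happen in one call, and no sentinel value is needed to signal "not found".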





						

Comparison of Java and C#

After doing little other than C# (and that “little” has been PHP) for three years, I just landed a job as a forward-deployed engineer (think client-facing software engineer) at a company that does entirely Java development.  So now I’m learning Java, and thought I’d post my thoughts on the process.

One-class-per-file: It really irks me that Java enforces the one-class-per-file paradigm (strictly speaking, at most one public top-level class per file, and a class can never be split across files). In particular, it seems like this really hampers classes where you want to mix generated code with user-created code (cf. partial classes in C#). I suppose one could argue that this enforces responsibility segregation by making separation of machine- and user-generated code the path of least resistance. I’ll keep this in mind as I go forward – maybe I’ll be a convert by the time this is over.
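For readers who haven't used them, here is a small sketch of what C# partial classes look like. The Customer class and the file names in the comments are made up for illustration; in a real project the two halves would live in separate files.

```csharp
using System;

// Imagine this half lives in Customer.generated.cs and is overwritten
// by a code generator whenever the schema changes.
public partial class Customer
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

// ...and this half lives in Customer.cs and is maintained by hand.
// The compiler merges both declarations into a single class.
public partial class Customer
{
    public string FullName
    {
        get { return FirstName + " " + LastName; }
    }
}
```

Regenerating the first half never touches the hand-written members in the second.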

The Benefits of Blogging

I had run across Roy Osherove’s poll back in September regarding readers’ favorite “isolation frameworks,” and took exception to the first two comments, which asserted basically that the frameworks listed are mocking frameworks, and that you don’t isolate with a mocking framework, you isolate with good design. In fact, not an hour ago I started writing a rebuttal to that assertion, describing how the natural product of applying the Single Responsibility Principle in conjunction with the Dependency Inversion Principle is decoupled, stable, highly cohesive code, but that isolation is a local operation accomplished within tests, frequently with the aid of a mocking framework. Thus, because code isolation is a per-scope property, not an intrinsic property of the code (i.e., it’s contextual), any framework that allows you to achieve this local isolation is, in fact, an isolation framework.

Then I realized that the same argument applies to code design. This local isolation is only achievable through the creation of decoupled, abstract, highly cohesive and minimalist code. Those are necessary and sufficient conditions for code isolation; therefore, they are effectively equivalent to isolation. The end result is that the two commenters, to whose comments I took exception, are absolutely right: isolation is a design feature, not a framework.

The upshot? By taking the time to write up a blog post on it, I was forced to think about the issue in enough depth to realize I was wrong. Blogging has made me a better programmer. All you regular bloggers already knew that.

Tuxedo: Groups

The Institute for the Study of Violent Groups maintains arguably the world’s largest database of open-source, group-based violent incident data. Our model of group dynamics is rich, and constantly evolving. On the bright side, that means our ability to capture data is consistently improving; on the down side, it means that our IT team has to regularly develop and validate rules and processes for recoding old data to fit the new model. Over the course of several iterations, we’ve moved from a completely ad-hoc approach to this recoding to a more structured and formalized one. That’s fodder for another post, but Kyle Baley has a good article on dealing with a similar problem using AppEngine; we’ve felt his pain. (The defer-updates-until-the-data-are-used approach was new to me, though – luckily, we can afford the downtime required for the “Big Bang” approach most of the time, so we don’t need to deal with the additional complexity of lazy upgrading.) But that’s not what I came here to talk to you about.

What I actually want to talk about is how to represent group dynamics with accuracy, flexibility, and fidelity over time.  To give you an idea about the lessons we’ve learned, I want to walk you through the evolution of the ISVG group model. Consider the basic case:

A terrorist group bombs a building.

Short and sweet, right? How do we go about modeling this?  In the first iteration of the ISVG model, the model was very simple:

[Figure: The original ISVG group-to-event model. One class for Group, one class for Event, and one class to provide many-to-many linking between them.]

Done.  And for this basic case, we still have this exact model.  There are quite a few supporting classes (target, motive, weapons, tactics, etc.), and we classify group-to-event relationships in multiple ways (perpetrator, victim, etc.) but at heart it’s still just as you see it above.
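A rough sketch of that three-class shape in C# might look like the following. All names here (Group, Event, GroupEventLink, Involvement, GroupEventQueries) are illustrative assumptions, not the actual ISVG classes, and the supporting classes (target, motive, weapons, and so on) are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// How a group relates to an event; the real model has more classifications.
public enum Involvement { Perpetrator, Victim }

public sealed class Group
{
    public string Name { get; private set; }
    public Group(string name) { Name = name; }
}

public sealed class Event
{
    public string Description { get; private set; }
    public Event(string description) { Description = description; }
}

// The linking class: one instance per (group, event) pair, which is what
// makes the relationship many-to-many.
public sealed class GroupEventLink
{
    public Group Group { get; private set; }
    public Event Event { get; private set; }
    public Involvement Involvement { get; private set; }

    public GroupEventLink(Group group, Event evt, Involvement involvement)
    {
        Group = group;
        Event = evt;
        Involvement = involvement;
    }
}

public static class GroupEventQueries
{
    // All events in which the given group played the given role.
    public static IEnumerable<Event> EventsFor(
        IEnumerable<GroupEventLink> links, Group group, Involvement involvement)
    {
        return links
            .Where(l => l.Group == group && l.Involvement == involvement)
            .Select(l => l.Event);
    }
}
```

"A terrorist group bombs a building" then becomes one Group, one Event, and one GroupEventLink with Involvement.Perpetrator.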

OK, we seem to have done pretty well. Let’s keep going. We get a report that indicates that the Abu Sayyaf extremist group in the Philippines actually calls itself al-Harakat al-Islamiyya. That’s information we want to track! So clearly the database needs aliases, right? Let’s add that.

[Figure: One class for Group, with one or more GroupAlias instances.]

Alright, that’s a straightforward implementation. So we did that and ran with it, never thinking that we were missing something. But then one day one of our researchers told us that Group X was now known entirely as Group Y; it wasn’t an alias, it was a complete name change. There had been a change of power, and the group had a new identity. So we needed to change the name of the group to Y, but we didn’t want to lose the fact that it used to be X. What to do? We decided that the only sensible thing was to add a date range to the Alias table indicating when an alias was valid. So our model now looks like this:

[Figure: One Group, with one or more Aliases, each of which has a start and end date.]

OK, that works. We just added the requirement that whenever a group is named, we do some research to find out when the group was founded, and enter its primary name as an alias too, with StartDate set to its founding date. And that’s about how things work to this day.
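To make the date-ranged alias idea concrete, here is a small sketch. The class names (AliasedGroup, GroupAlias), the nullable EndDate meaning "still in use", and the NamesOn query are all assumptions for illustration; only the idea of a start and end date per alias comes from the model described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class GroupAlias
{
    public string Name { get; private set; }
    public DateTime StartDate { get; private set; }
    public DateTime? EndDate { get; private set; }   // null: alias is still in use

    public GroupAlias(string name, DateTime startDate, DateTime? endDate)
    {
        Name = name;
        StartDate = startDate;
        EndDate = endDate;
    }

    // True if this alias was in use on the given date (both ends inclusive).
    public bool IsValidOn(DateTime date)
    {
        return date >= StartDate && (!EndDate.HasValue || date <= EndDate.Value);
    }
}

public sealed class AliasedGroup
{
    private readonly List<GroupAlias> m_aliases = new List<GroupAlias>();

    // The primary name is entered as an alias too, starting at the founding date.
    public void AddAlias(string name, DateTime startDate, DateTime? endDate = null)
    {
        m_aliases.Add(new GroupAlias(name, startDate, endDate));
    }

    // Every name the group was known by on the given date.
    public IEnumerable<string> NamesOn(DateTime date)
    {
        return m_aliases.Where(a => a.IsValidOn(date)).Select(a => a.Name);
    }
}
```

With this shape, the rename from X to Y is just two aliases with adjacent date ranges, and "what was this group called when the event happened?" becomes a lookup by event date.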

Alright, done.  That was easy!  But next up we’ll talk about some of the issues involved in modeling group-to-group relationships and pedigrees – that’s where it starts to get interesting.

Tuxedo: The Model-View

As originally implemented, our domain model was all-encompassing and very kludgy. In particular, entity instances with fields whose values were constrained by database enumerations were automatically populated with all allowable values for their enumerated fields (for providing to a user via the UI). At the time, I didn’t see the problem with this approach – it made sense to me, because I thought of the constraints we placed on entities’ enumerated fields as (albeit implicit) behaviors of the entities themselves.

I started to smell something when I realized that this arrangement meant we couldn’t create new entity instances at runtime without going to the database to get those lists.  A brand new entity without links to anything else yet should be able to be created in a vacuum – but in our case, if we allowed that, the entity couldn’t be used for anything, because it would only accept enumerated values that existed in its lists.  That (along with some helpful advice from people on the nhusers (NHibernate Users) Google Group) got me thinking of the idea of a model-view – a decorator that would wrap instances of our domain entities and supply consumers (e.g., the UI) with additional contextual and validation data.  So I created a new assembly and namespace for my views (we’ve been wrestling with an appropriate name for these objects for a while now – what’s the right name?), and switched the UI logic so our WinForms data binding was performed against the views instead of entity instances.  At this point, the domain model has no dependencies except on framework classes, the IO layer only depends on the model, and the model-view layer has dependencies on the Model and the IO layer, which just feels right.  And it’s easier to use, too.

But I’ve recently discovered a code smell with this implementation, too: the way the decorator works, all data are actually stored on the entity, and all getters and setters on the decorator that mimic properties exposed by the entity are simply forwarded to the entity – so the entity gets potentially bad data, with no opportunity to reject it until validation is performed. My solution so far has been a kludgy state-restore facility on my entities (the entities keep track of their “original” and “current” states, and a call to Reset() copies original over current), but I don’t like it, because that’s usage-related logic unrelated to business logic, and as such it clutters up the domain model.

An alternative approach I’ve been considering is to switch from a decorator pattern to a disconnected proxy. When a UI to interact with a given entity is spawned, an appropriate proxy is spawned as well. The proxy performs reads on demand (lazy retrieval) from the entity, but caches writes internally until a call to Proxy.Commit() is made. At that point, the proxy performs all cached writes against the entity in one go. That way, a validating proxy implementation can refuse to write ANY data to the entity until all writes have been validated, obviating the need for the complicated and kludgy state-restore code on my entities. And the nice part about it is that it should be doable quite easily with AutoMapper and/or Castle.DynamicProxy.
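A hand-rolled sketch of that disconnected proxy, without AutoMapper or Castle.DynamicProxy, could look roughly like this. Everything here is hypothetical: the Individual entity, its two properties, the validation rules, and the choice to have reads return pending writes (so a bound UI shows what the user typed) are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical domain entity: plain data, no state-restore logic.
public sealed class Individual
{
    public string Name { get; set; }
    public int BirthYear { get; set; }
}

public sealed class IndividualProxy
{
    private readonly Individual m_entity;
    // Writes are cached here, keyed by property name, until Commit().
    private readonly Dictionary<string, object> m_pending = new Dictionary<string, object>();

    public IndividualProxy(Individual entity) { m_entity = entity; }

    public string Name
    {
        get { return Read("Name", () => m_entity.Name); }
        set { m_pending["Name"] = value; }
    }

    public int BirthYear
    {
        get { return Read("BirthYear", () => m_entity.BirthYear); }
        set { m_pending["BirthYear"] = value; }
    }

    // Validates the values the entity WOULD have after a commit.
    public IList<string> Validate()
    {
        var errors = new List<string>();
        if (string.IsNullOrWhiteSpace(Name)) errors.Add("Name is required.");
        if (BirthYear < 1800 || BirthYear > DateTime.Today.Year) errors.Add("BirthYear is out of range.");
        return errors;
    }

    // All-or-nothing: the entity is untouched unless every write is valid.
    public bool Commit()
    {
        if (Validate().Count > 0) return false;
        m_entity.Name = Name;
        m_entity.BirthYear = BirthYear;
        m_pending.Clear();
        return true;
    }

    // Throwing away the cache replaces the entity-side Reset().
    public void Discard() { m_pending.Clear(); }

    // Pending write if there is one; otherwise read lazily from the entity.
    private T Read<T>(string property, Func<T> readFromEntity)
    {
        object cached;
        return m_pending.TryGetValue(property, out cached) ? (T)cached : readFromEntity();
    }
}
```

The point of the design is that rejecting bad input becomes a matter of not committing (or calling Discard()), so the entity never has to remember its "original" state.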

Thoughts?  What’s the best practice here?

Tuxedo: The Database

Before diving into the rewrites that I’m working on, I thought it would make a good post to talk about one of the rewrites I’ve already done, particularly since it heavily impacts the codebase at the DAL level.

Some background: the data model we use has a number (though I’m not certain as to the precise number – to come in a later post, if I’m allowed to blab that info) of object types. They fall into essentially two layers, which I refer to as primary and secondary objects – or entities and children, respectively (though not entities from a DDD standpoint – most of our objects fit the bill there). The primaries are of five types: Individuals, Groups, Events, Incidents, and Sources. Primaries are related to secondaries via one-to-many relationships, and to each other via many-to-many relationships. There are a few outliers – tertiaries, I suppose we could call them – related to secondaries with many-to-one relationships, but I just lump those in with the secondaries, mostly. The last piece is what we call the enums – these are simply lists of values used for constraining input fields.

Since its inception, the database has been using these enums, and each enum type had its own table. The first significant modification I made to the database (aside from reverse-engineering and enforcing foreign key relationships) was to consolidate these enums into only two tables – the CategoryTbl table, and the ValuesTbl table. In retrospect, I wish I’d named the latter EnumsTbl, but ValuesTbl is a decent enough name that we’re not going to change it based solely on personal taste at this juncture. The CategoryTbl essentially holds a single row per previously-extant enum table, and the ValuesTbl table holds one row per enum value. While consolidating all the values into a single table significantly slows deletions of obsolete values (there are nearly 1000 foreign keys targeting ValuesTbl.ValueID), it has made development much easier, because logic for all constrained fields now has a common implementation.
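The "common implementation" payoff can be sketched with an in-memory stand-in for the two tables. This is only an illustration: apart from ValueID, the column names (CategoryId, Name, Text), the EnumCatalog class, and the sample category names in the usage note are assumptions, and a real implementation would query the database instead of two lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One row per former enum table (stand-in for CategoryTbl).
public sealed class Category
{
    public int CategoryId { get; set; }
    public string Name { get; set; }
}

// One row per enum value (stand-in for ValuesTbl).
public sealed class Value
{
    public int ValueId { get; set; }
    public int CategoryId { get; set; }
    public string Text { get; set; }
}

public sealed class EnumCatalog
{
    private readonly List<Category> m_categories = new List<Category>();
    private readonly List<Value> m_values = new List<Value>();

    public Value Add(string categoryName, string text)
    {
        Category category = m_categories.FirstOrDefault(c => c.Name == categoryName);
        if (category == null)
        {
            category = new Category { CategoryId = m_categories.Count + 1, Name = categoryName };
            m_categories.Add(category);
        }

        var value = new Value { ValueId = m_values.Count + 1, CategoryId = category.CategoryId, Text = text };
        m_values.Add(value);
        return value;
    }

    // One lookup routine serves every constrained field.
    public IList<Value> AllowedValues(string categoryName)
    {
        Category category = m_categories.FirstOrDefault(c => c.Name == categoryName);
        if (category == null) return new List<Value>();
        return m_values.Where(v => v.CategoryId == category.CategoryId).ToList();
    }

    // One validation routine serves every constrained field.
    public bool IsAllowed(string categoryName, int valueId)
    {
        return AllowedValues(categoryName).Any(v => v.ValueId == valueId);
    }
}
```

With one table per enum, each of those two methods would need a variant per table; here a hypothetical "WeaponType" field and a hypothetical "Motive" field share the same code path.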

The other modification I made was to create a common base class (IsvgObject) for all our primary, secondary, and enum instances, and to add it to the database. These rows are stored in the ObjectTbl table. By having all primary keys for each instance of a model class predicated upon a row in this table, we now have globally unique object IDs. To remove ambiguity, I also modified the primary key columns for all tables to be foreign keys to ObjectTbl.ObjectID. This applies even in multi-tier object hierarchies; e.g., AAssaultInfoTbl.AAssaultInfoID is the primary key for ArmedAssaults, and is a foreign key to TypeTbl.TypeID, the table and column that represent instances of EventBase, which is in turn a foreign key to ObjectTbl.ObjectID.
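On the class side, the effect of that arrangement can be sketched as follows. IsvgObject, EventBase, and ArmedAssault are names from the description above; the static counter is only a stand-in for the identity column of ObjectTbl, and GroupRecord is a hypothetical second type added to show that IDs are unique across types, not just within one.

```csharp
using System;
using System.Threading;

public abstract class IsvgObject
{
    // Stand-in for ObjectTbl: every instance of every type draws its ID
    // from the same sequence, so IDs are globally unique.
    private static int s_lastObjectId = 0;

    public int ObjectId { get; private set; }

    protected IsvgObject()
    {
        ObjectId = Interlocked.Increment(ref s_lastObjectId);
    }
}

// Middle tier of the hierarchy (the TypeTbl row in the database).
public abstract class EventBase : IsvgObject { }

// Leaf tier (the AAssaultInfoTbl row); shares its key with both tiers above.
public sealed class ArmedAssault : EventBase { }

// Hypothetical unrelated type, to show cross-type uniqueness.
public sealed class GroupRecord : IsvgObject { }
```

Because the leaf, the middle tier, and the base all share one key, an ObjectId alone is enough to identify any object in the model.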

The ObjectTbl is good for a variety of things – raw auditing of user productivity especially, since we keep timestamps on the records – but the ValuesTbl has really paid off.  Does anyone have a better approach for either?

Tuxedo: The Beginning

At the Institute for the Study of Violent Groups, we maintain the world’s largest open-source (meaning, drawn from publicly-available sources; not FOSS or CC, sadly) terrorism database. I’ve been the lead developer here at ISVG since February 2008, and our IT and software infrastructures have seen a LOT of change. One of the greatest parts about the job is that I have essentially carte blanche to guide and refactor the codebase as I see fit, which has been both a blessing and a curse. When I arrived on scene, I was a hot-shot, full-of-myself programmer a couple years out of school (still working on my MS, actually). I knew I was a pretty decent programmer, but I didn’t realize that I was a very poor software engineer. Now, two and a half years later, I’m both significantly better at all things development-related, and significantly more humble! I’ve gotten the opportunity at work to redo our “legacy” software (“legacy” meaning I hacked it together before I knew any better), applying all the lessons I’ve learned. The next few months should be interesting!

I’m going to post several blog entries in the coming months about the various refactoring steps we (a single co-worker and I) take in the process of improving the codebase and our database, so it’ll probably help to describe the current state of affairs, and the modifications I’ve applied already since I started at ISVG.

Since 2001, ISVG has produced a variety of reports and whitepapers, but our primary product has always been our database.  I’m not quite clear on the timeline and technical details before 2008, but it started off as an Access database.  At some point they experienced significant growing pains while having to support both a growing model and an increasing data entry staff, and the decision was made to migrate to a SQL Server 2000 database.  The Microsoft Access-based UI was still used as the entry point, though.  Over the course of several years, the Access file grew to the point of being unstable and virtually unusable (again, due to a growing domain model, but primarily due to the number of not-used-but-not-deleted artifacts the developers left behind).  It was at this point that I came on board, tasked with building a custom UI for the database that wouldn’t suffer the same woes as the previous incarnation.

Before getting to start on this project, though, I was sidetracked to work on integrating our database with an unstructured knowledgebase we had contracted to create.  Between data integration with the knowledgebase and working on integrating our data with various analytical tools, it was November before I really got to start on the new UI.  At long last, though, I started digging into the Access module, and my struggles truly began, as the database schema had no documentation, and I was forced to reverse-engineer the entire schema based on bindings within the Access forms.  Eventually, though, I forged through, and our database schema (despite still having many legacy artifacts) enjoys complete documentation.

In an effort to transition away from the dying beast that was our Access application as quickly as possible, I originally targeted SQL databinding in WinForms as the technology of choice.  After mapping out a fifth of the schema and getting nothing more out of it than a generated data set large enough to consistently crash the Visual Studio 2005 designer, though, I decided we couldn’t take the short route, and set about implementing an OO domain model.  Tuxedo (the internal name for our application) was born.

It was at this point that I experienced my greatest professional shame so far: after wrestling with NHibernate, with its then-unfamiliar terminology and (for a newbie like me) unforgiving documentation, for several weeks, I decided it would be easier to roll my own DAL/ORM, and set off down that rabbit hole.  I’ve made many mistakes (some of which will be the topic of future posts), but eventually arrived at a working implementation that’s a daunting and schizophrenic combination of good design and ugly hacks, and all told about 120,000 SLOC (including designer-generated code).

At this point, our domain model consists of ~200 objects, our UI uses pretty complicated complex databinding, and our support for entity deletion is buggy at best. The codebase contains almost no abstraction (coming from a C++ background, I overused base classes and dramatically underused interfaces), and it’s very difficult to add new features at any point in the system. These are the problems I’m setting out to solve moving forward, and I hope you’ll come along with me for the ride.

File List class

I’m in the process of working on a project that, among other things, requires taking action on a large number of files accessed by recursively descending a directory tree.  My first approach was simply to populate a List<string> with the full path of each file in the tree, and then iterate through the list.  That ended up taking a while, and I didn’t like the initial delay, so I wrote a new list class (actually, that’s semantically incorrect — technically it’s not a list class because it doesn’t implement IList, just IEnumerable and IEnumerator) that will keep track of its position in a directory tree and not enumerate a list of files until they’re needed.  Hopefully it’ll help someone — I’m hereby releasing it into the public domain.  Also, I appreciate any and all comments.

using System;
using System.Collections.Generic;
using System.Text;
using System.Collections;
using System.IO;

/// <summary>
/// A class that supports enumerating through the files in a directory tree using .NET foreach semantics by implementing an enumerator over the files.
/// </summary>
/// <remarks>
/// TODO: The FileList class should support lazy access, so it’ll only get a list of files from a directory when that directory is accessed. This will require
/// a major redesign.
/// </remarks>
public sealed class FileList : IEnumerable, IEnumerator
{
    private FileInfo[] m_files;
    private string m_root_path;
    private int m_file_index;
    private DirectoryInfo m_cur_dir;
    // For each level above the current directory: the array of sibling directories at that level...
    private Stack<DirectoryInfo[]> m_dir_stack;
    // ...and the index within that array of the directory we descended into.
    private Stack<int> m_index_stack;

    /// <summary>
    /// Initializes the FileList.
    /// </summary>
    /// <param name="path">The path to the directory to open.</param>
    public FileList(string path)
    {
        this.m_root_path = path;

        this.Reset();
    }

    /// <summary>
    /// Gets an enumerator for the list.
    /// Necessary to implement IEnumerable
    /// </summary>
    /// <returns>A reference to the IEnumerator interface on this object.</returns>
    public IEnumerator GetEnumerator()
    {
        // The list is its own enumerator, so two foreach loops over the same instance share state.
        return this;
    }

    /// <summary>
    /// Move the current pointer to the next file.
    /// </summary>
    /// <returns>
    /// true on success;
    /// false if no more files are available
    /// </returns>
    public bool MoveNext()
    {
        // Loop rather than recurse, so a long run of empty directories can't overflow the call stack.
        while (true)
        {
            if ((this.m_files != null) && (this.m_file_index + 1 < this.m_files.Length))
            {
                ++this.m_file_index;
                return true;
            }

            if (!this.gotoNextDir())
            {
                return false;
            }
        }
    }

    /// <summary>
    /// Used internally to advance the internal state to the next directory when the list of files in the previous directory has run out.
    /// </summary>
    /// <returns>
    /// true on success
    /// false on failure (usually because there aren’t any more directories to be had)
    /// </returns>
    private bool gotoNextDir()
    {
        // attempt to get a list of subdirectories
        DirectoryInfo[] subdirs = this.m_cur_dir.GetDirectories();
        if (subdirs.Length > 0)
        {
            // descend into the first subdirectory, remembering its siblings
            this.m_dir_stack.Push(subdirs);
            this.m_index_stack.Push(0);
            this.m_cur_dir = subdirs[0];
            this.m_files = this.m_cur_dir.GetFiles();
            this.m_file_index = -1;
            return true;
        }
        else
        {
            // no subdirectories, so go back up the stack
            if (this.m_dir_stack.Count >= 1)
            {
                int index;
                do
                {
                    // pop levels until one still has an unvisited sibling
                    subdirs = this.m_dir_stack.Pop();
                    index = this.m_index_stack.Pop();
                    ++index;
                } while ((index >= subdirs.Length) && (this.m_dir_stack.Count >= 1));

                if (index < subdirs.Length)
                {
                    // legal, so move to that directory
                    this.m_cur_dir = subdirs[index];
                    this.m_dir_stack.Push(subdirs);
                    this.m_index_stack.Push(index);
                    this.m_files = this.m_cur_dir.GetFiles();
                    this.m_file_index = -1;
                    return true;
                }
                else
                {
                    // implies we’ve made it back up to the top of the stack
                    return false;
                }
            }
            else
            {
                // stack’s already empty, so can’t go to the next directory.
                return false;
            }
        }
    }

    /// <summary>
    /// Clear the internal structures; effectively move the pointer back to the element just before the first file.
    /// Necessary to implement IEnumerator
    /// </summary>
    /// <exception cref="System.IO.DirectoryNotFoundException">
    /// May be thrown if the root path does not exist.
    /// </exception>
    public void Reset()
    {
        this.m_cur_dir = new DirectoryInfo(this.m_root_path);
        this.m_file_index = -1;
        this.m_files = this.m_cur_dir.GetFiles();
        this.m_dir_stack = new Stack<DirectoryInfo[]>();
        this.m_index_stack = new Stack<int>();
    }

    /// <summary>
    /// Retrieves the current FileInfo object
    /// </summary>
    public object Current
    {
        get
        {
            return this.m_files[this.m_file_index];
        }
    }
}
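
For comparison, the same lazy behavior can be had with far less code by using a C# iterator (yield return) instead of a hand-written enumerator. This is an alternative technique, not a drop-in copy of the class above: LazyFileWalker and EnumerateTree are hypothetical names, and the order in which directories are visited may differ from FileList’s.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LazyFileWalker
{
    // Yields every file under rootPath. A directory's contents are only
    // listed when the caller's foreach actually reaches that directory.
    public static IEnumerable<FileInfo> EnumerateTree(string rootPath)
    {
        // Explicit stack of directories still to visit (no recursion).
        var pending = new Stack<DirectoryInfo>();
        pending.Push(new DirectoryInfo(rootPath));

        while (pending.Count > 0)
        {
            DirectoryInfo dir = pending.Pop();

            foreach (FileInfo file in dir.GetFiles())
            {
                yield return file;
            }

            foreach (DirectoryInfo sub in dir.GetDirectories())
            {
                pending.Push(sub);
            }
        }
    }
}
```

Usage is a plain foreach (FileInfo file in LazyFileWalker.EnumerateTree(path)). Because each call creates a fresh iterator, nested or repeated loops don’t interfere with one another. On .NET 4 and later, Directory.EnumerateFiles with SearchOption.AllDirectories offers similar lazy enumeration out of the box.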