JSoup: HTML Parsing Made Easy!

Do you ever get excited by a software library? I do. Good software is like beautiful art to me. I like to look at it, marvel in its power and revel in its simplicity. Good libraries are fun and easy to use and they just feel right. Google Guava felt right from the first time I used it and I just had a similar experience with another Java library - JSoup. Note that I have no affiliation with JSoup whatsoever - I just think that it’s great software that ought to be shared.

JSoup is an HTML parser (among other things), but it is atypical of most HTML parsers:

  • Its selection syntax is incredibly powerful, yet easy to understand.
  • It works with “real-world” HTML (in other words, HTML that isn’t well-formed).
  • It has no dependencies.

As an example, suppose I need to parse the goals scored in an Ontario Hockey League (OHL) game, such as the one today where Kitchener beat Erie 11-4.

In an OHL game summary, goals scored are rows in an HTML table. But the game summary contains many HTML tables, so how do I specify the table I want? Well, the first row of the goals scored table is a title row that appears in HTML as follows:

<tr class="content-w">
   <td class="dark" align="center" colspan="2">
      <b class="content-w">Scoring</b>
   </td>
</tr>

Each goal follows the title row and appears in HTML as follows:

<tr class="light">
   <td><i>1</i>. ER
      <a href='player.php?id=6330'>H. Hodgson</a>, (5) (
      <a href='player.php?id=5572'>L. Cairns</a>), 2:51
   </td>
   <!-- Stuff removed for brevity. -->
</tr>

Therefore, the rows for each goal are tr elements that have a class of light that follow an earlier sibling tr element that contains the text Scoring somewhere in its content. Here’s how easy that query is in JSoup:

final Elements elements = document.select("tr:matches(Scoring) ~ tr.light");

Amazing! Look at how powerful this one line of code is:

  • JSoup’s matches pseduo-selector means “elements whose text (or the text in any of its descendants) matches the specified regular expression”. (In this case, we are using the simplest type of regular expression to perform exact text matching.)
  • JSoup’s dot (‘.’) operator means “element with a specified class”.
  • JSoup’s tilde (‘~’) operator means “element preceded by sibling”.

Taken together, we get exactly the HTML elements we want - one element for each goal. I don’t even want to think about doing this type of selection with other HTML parsers! You can see the full power of selection statements in the Javadoc of JSoup’s Selector class.

Another beautiful element of JSoup’s design is that it has its own collection classes, but they implement interfaces in the Java Collections Framework. For example, the Elements class is org.jsoup.select.Elements but it implements Iterable<Element>, Collection<Element>, and List<Element>. Therefore, you can use the convenience methods provided by JSoup (such as elements.first() to get the first element in a list) but you can also treat the collection as an Iterable, Collection, or List. Brilliant!

This article just scratches the surface of JSoup. I invite you to check it out - you won’t be disappointed!

Well-named methods are better than comments

Today I’m going to discuss a benefit of one of the greatest inventions in the history of computer science - the method. One reason I love writing software so much is that the field is so young. We haven’t even had a century to consider the best alternatives for writing robust software and methods are only 50 years old or so, depending on your definition. Compare that with other fields like construction. Imagine how much we had to learn about construction 50 years after the first building was ever made. I bet there was ample room for improvement! That’s the state we’re in today in software.

Methods are normally taught as a way to remove duplication and that is undoubtedly one of their benefits. However, I want to discuss another benefit of methods that is often overlooked - their ability to improve program readability.

Consider a program I’ve written that converts from National Hockey League (NHL) depth charts on RotoWorld into a comma-separated values (CSV) player file suitable for use in another program I’ve written than runs the draft for an NHL pool.

At its heart, the depth chart parser reads from an input file (the depth charts saved as a text file) and writes to an output file (the CSV file). Therefore, the first time I wrote the program’s main method, it looked like this:

public static void main(final String[] cmdLineArgs) throws IOException
{
  final String inputFileSpec = "C:/data/hockeypool/rotoWorldDepthCharts.txt";
  System.out.println("Reading players from " + inputFileSpec + "...");
  final List<Player> players = new PlayersSupplier(
        new FileReader(inputFileSpec)).get();

  final String outputFileSpec = "C:/data/hockeypool/regularSeasonPlayers.csv.txt";
  System.out.println("Writing players to " + outputFileSpec + "...");
  new PlayersWriter(new FileWriter(outputFileSpec)).write(players);
}

(Ignore the hard-coded paths for now. I haven’t yet made the program flexible enough to accept configuration as to where to read and write data.)

At first glance, there’s nothing wrong with this method. It’s pretty short. There is only one path for readers to comprehend (in other words, there are no conditional statements). However, I still think it can be improved because the lines that read from and write to files are hard to digest. There’s a lot going on there that obscures the real purpose of those lines.

Some literature encourages the use of comments in a situation like this:

public static void main(final String[] cmdLineArgs) throws IOException
{
  // Read players from the input file.
  //
  final String inputFileSpec = "C:/data/hockeypool/rotoWorldDepthCharts.txt";
  System.out.println("Reading players from " + inputFileSpec + "...");
  final List<Player> players = new PlayersSupplier(
        new FileReader(inputFileSpec)).get();

  // Write players to the output file.
  //
  final String outputFileSpec = "C:/data/hockeypool/regularSeasonPlayers.csv.txt";
  System.out.println("Writing players to " + outputFileSpec + "...");
  new PlayersWriter(new FileWriter(outputFileSpec)).write(players);
}

Don’t do this! As Robert Martin so eloquently puts it in Clean Code, comments are failures; they indicate that we have failed to express ourselves in code. We should not be proud when we write a comment. Rather, we should use comments only as a last resort when all other mechanisms for expression have failed.

As Martin recommends, let’s try to express ourselves in code by using well-named methods:

public static void main(final String[] cmdLineArgs) throws IOException
{
  final List<Player> players = readPlayers(
        "C:/data/hockeypool/rotoWorldDepthCharts.txt");
  writePlayers(players, "C:/data/hockeypool/regularSeasonPlayers.csv.txt");
}

private static List<Player> readPlayers(final String fileSpec) throws IOException
{
  System.out.println("Reading players from " + fileSpec + "...");
  return new PlayersSupplier(new FileReader(fileSpec)).get();
}

private static void writePlayers(final List<Player> players, final String fileSpec)
      throws IOException
{
  System.out.println("Writing players to " + fileSpec + "...");
  new PlayersWriter(new FileWriter(fileSpec)).write(players);
}

Doesn’t this make the main method easier to understand? The method now clearly does two things - read players and write players. If you want to know the implementation details of those operations, you can look at the private methods, but if you don’t care about those details, you can ignore them. For example, if your team took over responsibility for maintaining this class, the first time you looked at the code you should not have to burden your mind with the minutiae of how the files are read and written. Note that by introducing private methods we were also able to forego the use of the inputFileSpec and outputFileSpec variables, further condensing the main method and making it even easier to understand. Additionally, the implementation details of reading and writing players can now more easily be refactored into their own classes, should we choose to do so.

In summary, use well-named methods liberally, even if they’re only called once. They make it easier to understand your code and they’re so much better than comments!

Meet Google Guava's Optional class

Google Guava is a great library that should be part of every Java developer’s arsenal. As its user guide states, it includes classes for “collections, caching, primitives support, concurrency libraries, common annotations, string processing, I/O, and so forth.” I often peruse Google Guava’s Javadoc. There’s just so much good stuff in there! I can learn about API design, discover ways to improve my software, and seeing such beautiful code just makes me feel good. (Yeah, I know. That’s about as cool as driving a moped on the sidewalk while wearing khaki flood pants with a ketchup stain from a street meat hot dog but whatever… it’s what I do!)

I stumbled upon another gem in Google Guava last week while improving my program that creates gamesheets for Ontario Hockey League games. When I attend an OHL game (and I often do… go Bulls!) I take gamesheets that include details about the players on both teams. The details include player name, player type (rookie or veteran), season statistics and streaks, biographical details, and sweater number. It’s the last detail that introduced me to Google Guava’s Optional class.

Optional is a simple immutable generic class that stores an optional non-null value. As its Javadoc states, Optional allows you to represent “a T that must be present” and a “a T that might be absent” as two distinct types in your program, which can aid clarity.” For hockey players, sweater numbers are sometimes not known. For example, when a player is traded to a new team, he can’t always use the sweater number of his old team because a more veteran player on his new team might already use it. Until he plays his first game for his new team, his sweater number is often unknown. So, what Java type should I use to represent a player’s sweater number?

Well, sweater numbers in hockey are positive integers ranging from 1-99 so I started out with the obvious - int. For example, the Player class contained the following method:

class Player
{
  int getSweaterNum()
  // Other methods elided for brevity.
}

This poses a problem for players with an unknown sweater number. What should getSweaterNum return for such a player? I initially took the lazy approach and used a magic number (0). However, I forgot to check for that magic number in my code that uses the Player class. It blindly called getSweaterNum and put 0 in the sweater number column of my gamesheets. How ugly! Rather than print a 0, the sweater number column should remain blank so that I can fill it in at the game when I see what number the player wears during the warmup.

There are a couple of approaches I could use to solve this problem:

  • Define a constant (say Player.UNKNOWN_SWEATER_NUM = 0) for an unknown sweater number.
  • Use an Integer instead of an int for sweater number and use null to indicate that the sweater number is unknown.
  • Create a SweaterNum class to encapsulate the concept of a sweater number.

I don’t really like any of these approaches. The first approach leads to ugly code at the calling site and doesn’t make it any more likely that the caller will remember to check for the magic value. The second approach violates the best practice of not returning null (see Clean Code: A Handbook of Agile Software Craftsmanship by Robert Martin - it’s a fantastic book!). The third approach seems a bit like overkill.

Instead of these approaches, I chose to use Optional<Integer> for the return value of Player.getSweaterNum:

class Player
{
  Optional<Integer> getSweaterNum()
  // Other methods elided for brevity.
}

This increases clarity; human readers know by the method’s return value that a player’s sweater number is optional. This makes it less likely that calling classes will forget to check whether a sweater number is known. Calling classes do the following instead:

private String getSweaterNumCellValue(final Optional<Integer> sweaterNum)
{
  if (sweaterNum.isPresent())
  {
    return Integer.toString(sweaterNum.get());
  }
  else
  {
    return EMPTY_TABLE_CELL;
  }
}

The method that determines a player’s sweater number does the following:

try
{
  final String sweaterNumString = // get from OHL website
  return Optional.of(Integer.parseInt(sweaterNumString));
}
catch (final NumberFormatException e)
{
  return Optional.absent();
}

This is by no means a perfect solution as it reveals some of Java’s clunkiness. For example, it would be nicer if we could use Optional<int> but, alas, Java does not allow that syntax. However, I think it’s a good compromise. What do you think?

The False Economies of Software

I’ve been thinking a lot lately about how software companies can fall victim to false economies. Companies can take actions that save money in the short term, but have detrimental long-term impacts. The most telling examples are:

  • Use of low-cost offshore contractors
  • Failure to provide employees with the tools they need to get their job done
  • Inadequate hiring practices

Offshore Contractors

Offshore contractors seem like an attractive solution to a cost-conscious business. Their salaries are much lower than North American employees with the same job title, so why not use them? I’ll tell you why not! The true long-term cost of using offshore contractors is difficult to measure, but is typically very high because of:

  • Location difficulties
  • Language barriers
  • Skill level differences

It’s difficult enough to work on a team split across multiple locations. It’s even more challenging when some members are separated from others by many time zones. Given the location of prevalent contracting companies, it’s even possible that contractors never end up working at the same time as their full-time counterparts. This causes untold delays. Instead of discussing a problem right away in person with a colleague, an employee sends an email to a faraway land and it takes an entire day for it be read and acted upon.

Delays are further compounded by language barriers. It is often the case that contractors don’t have the same primary language as their full-time counterparts, so more clarity than normal is needed in communication, which costs time and money. Even more formality is required when there is a disparity in skill level between cheap contractors and their full-time counterparts.

Inadequate Tools

It’s easy for companies to think they are saving money by restricting spending on tools. However, the true cost of inadequate tools is wasted time and low employee morale. For example, if a company doesn’t buy its developers modern computers, builds take longer and developers spend more time waiting (and moving on to other activities like browsing the Internet and looking for a new place to work!). When developers go to lunch together, they share horror stories about their poor equipment and this further feeds the downward spiral of employee morale.

Inadequate Hiring Practices

The hiring process should be a great experience for applicants. Don’t cheap out and expect an interviewee to pay for their own transportation or lunch. Even if you wouldn’t hire a candidate to pick the weeds out of your Aunt Ida’s carrot garden, treat them like royalty. You never know who their friends are! You want them to tell everyone they know how great an experience they had and how amazing it would be if they were hired to work at your company.

What Makes a Great Software Developer?

Let’s face it. Some software developers are better than others. According to Robert Glass in his great book Facts and Fallacies of Software Engineering, the best developers are up to 28 times more effective than the worst developers. My experience has been that the effectiveness range is even wider than that. Some developers are a net negative for their employer. Their work is so shoddy and to get anything done they have to spend so much time with their colleagues that they are an overall drain.

On the other hand are the great developers. They are a joy to work with and they make everyone around them better. I’ve been thinking lately about the traits shared by the developers I most like to work with. I always consider Joel Spolsky’s advice from Smart and Gets Things Done: Joel Spolsky’s Concise Guide to Finding the Best Technical Talent when interviewing candidates. In particular, I consider the two most important questions:

  • Is the candidate smart?
  • Does the candidate get things done?

In this article I want to go beyond those two high-level questions. What behaviours are shared by the best developers? In no particular order, here’s what I’ve come up with so far:

Prefers Collaboration to Individualism

Poor developers love working alone. They prefer to be off in a corner by themselves for weeks or months at a time while they slave away and create incomprehensible code… that’s how they prove how smart they are! By working alone, they become indispensable. Only they can decipher the hieroglyphics they’ve created. The best developers are the exact opposite; they love working on a team. They get a thrill when their idea is added to by a colleague to come up with a better idea. They revel in creating a code base that anyone can read, where nobody is indispensable.

Attention to Detail

The best developers have tremendous attention to detail. They care about naming and formatting. They’re like a perfect camper in a national park; things are always cleaner after they leave than when they came. They don’t leave messes like commented-out code, unused private methods, or unnecessary classes. When they see a small error, such as a typo in a code comment, it bothers them enough that they fix it. Poor developers don’t care about such mundane things; their attitude is much more blasé.

Plans Ahead; Doesn’t Rush

The best developers have the right mix of slowness and speed. They don’t rush ahead and hammer away at the keyboard with the first idea that pops into their head, nor do they spend weeks carefully mulling over every possible alternative. They fit nicely down the middle of the rushing spectrum. Good developers seem to have stretches of thinking time followed by rapid bursts of code. Lesser developers seem to spend most of their time coding, a lot of which eventually gets rewritten by the better developers.

When coding, the best developers spend a few moments considering every detail. What should this class be called? What package should it go in? Has this method been written before in a reusable library? What should the parameters be?

Embraces Change

The best developers are not only open to change, they embrace it! Our industry changes so rapidly that there is always a better way and a person who knows better. A best practice from two years ago might be a hindrance today. Poor developers resist change; they cling to their old ways and refuse to accept that a better alternative exists. Good developers are keen when shown a better way. They spread the gospel of this new technique to their colleagues. They shout it from the rooftops… “I’ve been doing it wrong!”

Willing to Ask for Help

The best developers aren’t afraid to ask for help and know when it’s appropriate to do so. Rather than struggle for days to cobble something together, great developers are willing to admit that they need help. It’s all part of being a humble programmer, as Edward Dijkstra called it. We must be willing to put our egos aside and get the job done!

Clear Communicator

The best developers communicate clearly, both in their spoken words and their software. When debugging a problem, they can provide a clear explanation of what they’ve tried so far and what they’ll try next. They can provide a crisp status on their project and their daily work.

Able to Admit Mistakes

Everyone occasionally makes mistakes. It’s how you react to those mistakes that separates great developers from the rest. Great developers readily admit their errors; they’re human and they can even laugh about it. Poor developers think they’re infallible. They refuse to admit they were wrong, even when all evidence points in that direction.

The Importance of Formatting

I've said it before and I'll say it again. The book that most changed my view on software is Robert Martin's Clean Code: A Handbook of Agile Software Craftsmanship. There's an entire chapter of the book devoted to formatting. In it, Martin says the following:
When people look under the hood, we want them to be impressed with the neatness, consistency, and attention to detail they perceive. We want them to be struck by the orderliness. We want their eyebrows to rise as they scroll through the modules. We want them to perceive that professionals have been at work. If instead they see a scrambled mass of code that looks like it was written by a bevy of drunken sailors, then they are likely to conclude that the same inattention to detail pervades every other aspect of the project. You should take care that your code is nicely formatted. You should choose a set of simple rules that govern the format of your code, and then you should consistently apply those rules. If you are working on a team, then the team should agree to a single set of formatting rules and all members should comply. It helps to have an automated tool that can apply those formatting rules for you.
In The Purpose of Formatting section Martin writes:
First of all, let's be clear. Code formatting is important. It is too important to ignore and it is too important to treat religiously. Code formatting is about communication, and communication is the professional developer’s first order of business. Perhaps you thought that “getting it working” was the first order of business for a professional developer. I hope by now, however, that this book has disabused you of that idea. The functionality that you create today has a good chance of changing in the next release, but the readability of your code will have a profound effect on all the changes that will ever be made. The coding style and readability set precedents that continue to affect maintainability and extensibility long after the original code has been changed beyond recognition. Your style and discipline survives, even though your code does not.

Benefits of post-commit code reviews

Many teams debate whether code reviews should occur before a submission or after a submission. I've worked on teams that have used both approaches and am convinced that post-commit code reviews are more beneficial. Post-commit code reviews:

  • allow reviewers to easily try out the real code they are reviewing.
  • are easier to create.
  • promote a culture of many small submissions.
  • work better for submissions that require more than one review.
  • prove to developers that they are trusted and therefore leads to a happier development community.

Each benefit is explained below.

Reviewers can experiment with code under review

With post-commit code reviews, a reviewer can experiment with the code under review just by checking out the relevant code base from source control. If a reviewer works on the code base frequently, he probably already has a checkout of the code base and just needs to update it. To experiment with code in a pre-commit code review, a reviewer must typically download a patch that is attached to the code review and apply it to his checkout, making sure to apply it to the correct root folder.

Ease of code review creation

Post-commit code reviews are easier to create than pre-commit code reviews. Code review tools, such as Atlassian Crucible, allow you to create a code review simply by selecting a submission. Pre-commit code reviews are harder to create because they typically require the author to create a patch, upload the patch to the code review tool, and ensure that the patch is anchored to the correct folder in source control.

Culture of small submissions

Post-commit code reviews promote a culture of many small submissions - get a little piece of functionality working and then submit it. Repeat this frequently and trust developers to decide when a particular submission requires a review.

Submissions that require more than one review

When you do a pre-commit code review based on a patch, it's not clear how to review the changes made based on review comments (because you'd have to make a patch of a patch to review only the changes). When you do post-commit code reviews, you just make a second submission and, if necessary, call a review for that submission.

Trust in developers

Just like any other group of people, developers like to be trusted and are happier and work more effectively in an environment where they are trusted. Post-commit code reviews promote such trust in developers. Pre-commit code reviews promote the opposite - that developers must be scrutinized. Sure, junior developers may need to be scrutinized but that should not be the default team behavior.

Potential drawback and workarounds

One potential drawback of post-commit code reviews is that an unreviewed submission can break everything. However, modern development tools provide easy workarounds:

  • A continuous integration system, such as Bamboo, will quickly tell you that everything is broken.
  • A modern revision control system, such as Subversion, allows you to easily revert a single submission. Therefore, in a worst-case scenario when the continuous integration system indicates that the build is broken, you can revert the offending submission and trigger another build.

Scrum team member code of conduct

This post defines a code of conduct that I try to adhere to on a Scrum team.

Team participation

  • Fixing my team's broken builds is my highest priority.
  • Participating in my team's code reviews is my second highest priority.
  • Improving my team's velocity is more important than improving my individual velocity.
  • I will not be late for team meetings.
  • I will ask for help when I need it.
  • I will provide help when asked.
  • I will be honest at all times.
  • If I don't believe something is possible I will say so without hesitation.

Peer reviews

  • I will be professional and tactful when providing review comments.
  • I will be open to receiving feedback when others review my work.

Writing code

  • I will express myself through code.
  • I will add comments to code only if I have failed to express myself through code.
  • I will write code that adheres to my team's formatting and style conventions.
  • I will not submit production code without thorough in-memory unit tests.
  • I will not submit code that causes any tool to emit a warning.

Comments are a failure

The book that most changed my view on comments in code is Clean Code: A Handbook of Agile Software Craftsmanship by Robert Martin.  There's an entire chapter of the book devoted to comments.  In it, Martin says the following:

Nothing can be quite so helpful as a well-placed comment.  Nothing can clutter up a module more than frivolous dogmatic comments.  Nothing can be quite so damaging as an old crufty comment that propagates lies and misinformation.

Comments are not like Schindler's List.  They are not "pure good."  Indeed, comments are, at best, a necessary evil.  If our programming languages were expressive enough, or if we had the talent to subtly wield those languages to express our intent, we would not need comments very much -- perhaps not at all.

The proper use of comments is to compensate for our failure to express ourself in code.  Note that I used the word failure.  I meant it.  Comments are always failures.  We must have them because we cannot always figure out how to express ourselves without them, but their use is not a cause for celebration.

So when you find yourself in a position where you need to write a comment, think it through and see whether there isn't some way to turn the tables and express yourself in code.  Every time you express yourself in code, you should pat yourself on the back.  Every time you write a comment, you should grimace and feel the failure of your ability of expression.

 

Removing duplication on null checks

This article describes two ways to remove duplication when classes check that methods of an interface do not return null. I'm not sure which way is better, so let me know what you think!

Suppose we have an interface with two methods that return objects:

public interface Athlete
{
  Height getHeight();
  Weight getWeight();
}

Further suppose that Athlete instances are passed as parameters to many different classes in a system. The classes that accept Athlete instances conform to the fail-fast design philosophy, so they not only check if the instance is null, they also check if the athlete's height and weight are null:

public class AthleteUser
{
  public AthleteUser(final Athlete a)
  {
    if (a == null) throw new IllegalArgumentException();
    if (a.getHeight() == null) throw new IllegalArgumentException();
    if (a.getWeight() == null) throw new IllegalArgumentException();
    // Use athlete instance
  }
}

When more than one class performs this same validation, we have duplication. We don't want to repeat ourselves, so how do we remove this duplication?

One option is to introduce a helper class for dealing with Athlete instances. If we follow the naming conventions of Google Guava, we name the helper class Athletes:

public final class Athletes
{
  private Athletes(){}
  public static void assertNoNullReturnValues(final Athlete a)
  {
    if (a == null) throw new IllegalArgumentException();
    if (a.getHeight() == null) throw new IllegalArgumentException();
    if (a.getWeight() == null) throw new IllegalArgumentException();
  }
}

An alternative approach, one I have seldom seen in production, is to put the helper class inside the interface:

public interface Athlete
{
  Height getHeight();
  Weight getWeight();
  public final class Helper
  {
    private Helper(){}

    public void assertNoNullReturnValues(final Athlete a)
    {
      if (a == null) throw new IllegalArgumentException();
      if (a.getHeight() == null) throw new IllegalArgumentException();
      if (a.getWeight() == null) throw new IllegalArgumentException();
    }
  }
}

Which method do you prefer?

The separate class is used more frequently, but the class-inside-the-interface approach has its advantages:

  • It makes it much easier to find the helper class.
  • It reduces the number of files.

Organizational Leakage

I had a bad experience this weekend at Tim Horton's. My girlfriend and I ordered the sandwich combos which include a doughnut and drink. The nice lady behind the counter quickly got us our doughnuts and drink, put them on the counter, then proceeded to help the next customer in line. I waited at the counter for our sandwiches, growing increasingly impatient. After many minutes, the same lady that took the order originally asked what I was waiting for. I told her that I was waiting for our two sandwiches and her reply was "oh, okay, that's not something I have to worry about". This annoyed me but what does it have to do with software? I'll get to that.

As a Tim Horton's customer, I don't care that they have decided to separate jobs so that one employee is a counter/drink/doughnut person while another is a sandwich maker. My transaction is with Tim Horton's and their internal organizational structure should be irrelevant to me. Just get me my ham and cheese, please!

The same "organizational leakage" occurs in software when error messages indicate the module at fault. Customers don't care about modules... they bought a product not a bunch of modules!

Code Smell: Classes named Impl

A colleague and I have come to the conclusion that public classes named Impl are a code smell. Such classes are often a sign that not enough thought was put into naming the class. Let's look at an example.

Suppose I have a system that stores information about people in a MySQL database. I need a data access object (DAO) for the people, so I start by creating a PersonDao interface. Even though my system only has one way to store people, an interface is a good idea for the following reasons:

  • I might want to introduce another implementation (say an LDAP directory) in the future.
  • I want to write unit tests for classes that depend on a PersonDao and an interface makes it easier to mock the dependency. You do write unit tests, right?

So what do I name the class that implements the PersonDao interface and uses a MySQL database? I only have one implementation class, so why not name it PersonDaoImpl?

That's a bad choice for a few reasons:

  • It doesn't accurately describe what the class is.
  • It leads to inconsistencies when further implementation classes are added in the future. For example, suppose a few years after release we discover that some of our customers would rather have their people stored in an LDAP directory. We can't name the LDAP implementation class PersonDaoImpl because that name is already chosen so we use LdapPersonDao. Now we're left with similar classes with dissimilar names.

A better choice is to name the original MySQL implementation MySqlPersonDao and avoid these problems.

Spend a little time to name your classes accurately. If you're having trouble with that, your classes are probably doing too much and you need to refactor.

The Importance of Overridding Equals

I'm trying to write JUnit tests for some code that uses classes defined in Netscape's Lightweight Directory Access Protocol (LDAP) Software Development Kit (SDK) and I am having difficulties because the authors of several classes did not override equals.

For example, suppose I want to write a JUnit test for a method that returns an LDAPModificationSet. My first attempt is as follows:

public void testGetModifications()
{
  final LDAPModificationSet expected = // some expected modifications
  final LDAPModificationSet actual = classUnderTest.methodUnderTest();
  assertEquals(expected, actual);
}

No matter how hard I try, this test will always fail. LDAPModificationSet does not override equals, so its equals method only returns true when passed the exact same instance.

Consider the alternatives for my test:

  • Assert something weaker. For example, instead of assertEquals(expected, actual), I could use assertEquals(expected.size(), actual.size()).
  • Write my own method that compares LDAPModificationSet instances. For example, I could create an LdapModificationSets class with a method boolean equals(LDAPModificationSet firstSet, LDAPModificationSet secondSet).
  • Have my test manually compare each element in LDAPModificationSet. (Although that's not even possible in this case, because each element is an LDAPModification, and that class also doesn't override equals.)

None of these alternatives are particularly fun.

The moral of this story: Override equals in value objects! The second moral of this story - buy "Effective Java: Second Edition" because Joshua Bloch already made this point in Item 8 when he wrote the book!

How to write the best exception constructors

This article describes how to write the best exception constructors. It is related to Item 63 (Include failure-capture information in detail messages) of Joshua Bloch's tremendous book - "Effective Java: Second Edition". If you haven't got it yet, what are you waiting for?

Let's start by discussing the importance of the string returned from an exception's toString method. When an exception is logged, it is usually accompanied either by its stack trace or by the string returned from its toString method. However, according to the Javadoc of the Throwable.printStackTrace method, the first line of a stack trace "contains the result of the toString() method". Therefore, the most useful debugging aid an exception can offer is a toString method that returns details of anything that contributed to the exception. This article is going to show how to write exception constructors that make it easy to have a great toString method.

There are three main types of exceptions:

  • those for general use
  • those for a specific condition
  • those used as the abstract base class of an exception hierarchy

The guidelines for constructor creation depends on the type.

General-use exceptions

General-use exceptions are intended for use in a variety of situations. Examples of such exceptions are java.lang.IllegalStateException and java.text.ParseException. When an exception has such a high potential for reuse, it's best to mimic the four constructors offered by java.lang.Exception. In particular, for FooException, define the following constructors:

  • FooException()
  • FooException(String message)
  • FooException(Throwable cause)
  • FooException(String message, Throwable cause)

Specific exceptions

Exceptions such as java.io.FileNotFoundException are only intended for use in a specific situation. The best approach when writing specific exceptions is to provide constructors that force a useful toString method. For example, write FileNotFoundException as follows:

/**
 * A checked exception that indicates that a file is not found.
 */
public final class FileNotFoundException extends IOException
{
  /**
   * Constructs a new instance of FileNotFoundException.
   *
   * @param file the file that was not found
   */
  public FileNotFoundException(final File file)
  {
    super(getMsg(file));
  }

   private static String getMsg(final File file)
  {
    if (file == null)
    {
      return "null";
    }
    else
    {
      return file.getAbsolutePath();
    }
  }
}

When the constructor is defined to accept a File, the class is easy to use and hard to misuse. All instances return a useful string from their toString method. For example, if file is not null, the return value of toString is com.kblaney.FileNotFoundException: D:\temp\kyle.txt
How's that for a string that includes everything I need to know to debug a problem? The exception's class name tells me what went wrong (a file was not found) and the exception's message provides the absolute path of the file that was not found.

Note that FileNotFoundException has already been defined in Java and its existing constructors can't be removed for backwards-compatibility reasons. However, in this case Java is not something to be emulated. We can do better!

Abstract base class of exception hierarchy

An abstract base class of an exception hierarchy should provide constructors that help its derived classes follow the above rules. If there are many derived classes, it is probably best to mimic the four constructors offered by java.lang.Exception. If there are few derived classes, the abstract base class should mimic the constructors required for any of the derived classes.

Using Package-private constructor to Test With Static Objects

In my previous post, I explained the difficulty of testing with static objects. This post shows how to use a package-private constructor to deal with that problem.

Here's the class we need to fix:

final class Foo
{
  private static final String KEY = "SOME_KEY";
  private final Preferences preferences;

  public Foo()
  {
    preferences = Preferences.systemNodeForPackage(SomeOtherClass.class);
  }

  public void disable()
  {
    preferences.putBoolean(KEY, false);
  }
  
  public void enable()
  {
    preferences.putBoolean(KEY, true);
  }
 
  public boolean isEnabled()
  {
    final boolean defaultValue = false;
    return this.preferences.getBoolean(KEY, defaultValue);
  }
}
This class needs fixing because it's hard to write independent tests because each Foo instance uses the same Preferences object. We fix the problem by introducing a package-private constructor:
final class Foo
{
  private static final String KEY = "SOME_KEY";
  private final Preferences preferences;

  public Foo()
  {
    this(Preferences.systemNodeForPackage(Foo.class));
  }

  Foo(final Preferences preferences)
  {
    this.preferences = preferences;
  }

  public void enable()
  {
    preferences.putBoolean(KEY, true);
  }

  public void disable()
  {
    preferences.putBoolean(KEY, false);
  }

  public boolean isEnabled()
  {
    final boolean defaultValue = false;
    return this.preferences.getBoolean(KEY, defaultValue);
  }
}
Note that the package-private constructor does not impact normal consumers of Foo in any way. Consumers continue to use the default constructor that (unbeknown to them) delegates to the package-private constructor.

Rather than use Foo's default constructor, our tests use the package-private constructor and Mockito to inject a mock Preferences object that each Foo instance uses:

public class FooTest extends TestCase
{
private Foo foo;

public void setUp() throws Exception
{
final Preferences preferences = mock(Preferences.class);
this.foo = new Foo(preferences);
}
public void testDisable() throws Exception
{
this.foo.disable();
assertFalse(this.foo.isEnabled());
}
public void testEnable() throws Exception
{
this.foo.enable();
assertTrue(this.foo.isEnabled());
}
public void testConstructor() throws Exception
{
assertFalse(this.foo.isEnabled());
}
}

Now testEnable fails because its Foo instance calls Preferences.putBoolean on a mock. By default, when a void method (such as Preferences.putBoolean) is called on a mock created by Mockito, the mock does nothing. Therefore, we need to change our testing approach.

To test a Foo method, we assert that the Foo instance delegates to its Preferences object. To test the Foo constructor, we assert that no delegation to its Preferences object has occurred:

public class FooTest extends TestCase
{
private Preferences preferences;
private Foo foo;

public void setUp() throws Exception
{
this.preferences = mock(Preferences.class);
this.foo = new Foo(preferences);
}
public void testDisable() throws Exception
{
this.foo.disable();
verify(this.preferences).putBoolean(Foo.KEY, false);
}
public void testEnable() throws Exception
{
this.foo.enable();
verify(this.preferences).putBoolean(Foo.KEY, true);
}
public void testConstructor() throws Exception
{
verifyNoMoreInteractions(this.preferences);
}
}

Note that to test delegation, we changed Foo.KEY from a private member to a package-private member. That's a perfectly natural event that occurs during testing. Treat the test class as a first-class consumer of the production class and let the test class dictate the API of the production class.

Difficulty Testing With Static Objects

I was going through some old code the other day improving test coverage when I came across an untested class that looked like this:

public class Foo
{
private static final String KEY = "SOME_KEY";
private final Preferences preferences;

public Foo()
{
this.preferences = Preferences.systemNodeForPackage(
SomeOtherClass.class);
}
public void disable()
{
this.preferences.putBoolean(KEY, false);
}
public void enable()
{
this.preferences.putBoolean(KEY, true);
}
public boolean isEnabled()
{
final boolean defaultValue = false;
return this.preferences.getBoolean(KEY, defaultValue);
}
}

Do you see anything wrong yet?

Nothing looks terribly wrong, but there's a lurking problem - it's hard to test the class because it calls a method that returns a static object. Let me explain by going through the process of adding tests of this class.

My first test method confirms that isEnabled returns false after disable is called:

public class FooTest extends TestCase
{
public void testDisable() throws Exception
{
final Foo foo = new Foo();
foo.disable();
assertFalse(foo.isEnabled());
}
}
There's still apparently nothing wrong. As expected, this test passes, so let's add a test of the enable method:
public class FooTest extends TestCase
{
public void testDisable() throws Exception
{
final Foo foo = new Foo();
foo.disable();
assertFalse(foo.isEnabled());
}
public void testEnable() throws Exception
{
final Foo foo = new Foo();
foo.enable();
assertTrue(foo.isEnabled());
}
}
Everything still works fine. Both tests pass, so let's add a test that confirms that the property defaults to false immediately after construction:
public class FooTest extends TestCase
{
public void testDisable() throws Exception
{
final Foo foo = new Foo();
foo.disable();
assertFalse(foo.isEnabled());
}
public void testEnable() throws Exception
{
final Foo foo = new Foo();
foo.enable();
assertTrue(foo.isEnabled());
}
public void testConstructor() throws Exception
{
final Foo foo = new Foo();
assertFalse(foo.isEnabled());
}
}
Woah - the assertion in testConstructor fails! Even though each Foo instance has its own Preferences member, each member refers to the same Preferences object. Each member gets its value from a static method (Preferences.systemNodeForPackage) that returns the same instance every time it's called by the Foo constructor. This makes our tests lack independence - the call to foo.enable in testEnable impacts the Foo instance in testConstructor.

In my next post, I'll explain how to use a package-private constructor to handle this problem.

Using Java's CountDownLatch for multi-threading

CountDownLatch is a class that was added to the java.util.concurrent package in Java 1.5. As its Javadoc states it "allows one or more threads to wait until a set of operations being performed in other threads completes". I never had the chance to use it until this week, but now that I have, I fully appreciate its power!

A CountDownLatch is instantiated with a positive count and provides a countDown method that decrements the count. The await method "causes the current thread to wait until the latch has counted down to zero". (The maximum amount of time that await actually waits can be configured, although I didn't use that feature.)

I use a CountDownLatch when I am gathering results on a number of hosts and aggregating the results from each host into a single object that is returned to the caller. This is a perfect place to use a CountDownLatch because the number of hosts can be used as the count that is passed to its constructor.

Here's my code (with error-handling removed for brevity):

  public Results getResultsOnAllHosts()
{
final List<String> hosts = getHosts();
final CountDownLatch countDownLatch = new CountDownLatch(hosts.size());
final List<GetResultsOnHostThread> threads =
getThreadsThatAreStarted(hosts, countDownLatch);
waitForAllThreadsToFinish(countDownLatch);
return aggregateResults(threads);
}


private List<GetResultsOnHostThread> getThreadsThatAreStarted(
final List<String> hosts, final CountDownLatch countDownLatch)
{
final List<GetResultsOnHostThread> threads =
Lists.newArrayListWithCapacity(hosts.size());
for (final String host : hosts)
{
final GetResultsOnHostThread thread =
new GetResultsOnHostThread(host, countDownLatch);
thread.start();
threads.add(thread);
}
return threads;
}

private void waitForAllThreadsToFinish(final CountDownLatch countDownLatch)
{
try
{
countDownLatch.await();
}
catch (final InterruptedException e)
{
Thread.currentThread().interrupt();
}
}

class GetResultsOnHostThread extends Thread
{
private final String host;
private final CountDownLatch countDownLatch;
private ResultsOnOneHost resultsOnThisHost;

GetResultsOnHostThread(final String host, final CountDownLatch countDownLatch)
{
this.host = host;
this.countDownLatch = countDownLatch;
}

@Override
public void run()
{
try
{
results = getResultsOnHost(this.host);
}
finally
{
countDownLatch.countDown();
}
}
}