Sunday, June 21, 2009

Computer Architecture


A simplified picture with the major features of a computer identified. The CPU is where
all the work gets done; the memory is where all the code and data is stored. The path
connecting the two is known as the "bus."


CPU
While memory stores the program and the data, the Central Processing Unit does all the
work. The CPU has two parts— registers and an Arithmetic Logic Unit (ALU). A register
is like the temporary memory in a calculator— each register can store a value that can be
used in subsequent computations. Each register can usually hold one word. Sometimes
there are separate specialized registers for holding specific types of data— floating point
numbers, addresses, etc. The registers can be accessed much more quickly than memory,
but the number of registers available is quite small compared to the size of memory. One
of the specialized registers is the Program Counter, or PC, which holds the address of
the instruction currently being executed.


The ALU is the part of the CPU that performs the actual computations such as addition
and multiplication along with comparison and other logical operations. Most modern,
high-performance CPUs actually contain several specialized logic units in addition to the
ALU which allow them to work on several computations in parallel. However, that
complexity is kept (literally) within the CPU. The abstraction of the CPU is that it
executes its instructions one at a time, in the order presented in memory.
For the balance of this handout, we will concentrate on the instruction set of a typical
Reduced Instruction Set Computer (RISC) processor. RISC processors are distinguished
by a relatively lean instruction set. It's turned out that you can get the best overall
performance by giving your processor a simple instruction set, and then concentrating
on making the processor's instructions/second performance as high as possible. So RISC
processors don't have some of the fancier instructions or addressing modes featured by
older Complex Instruction Set (CISC) designs. Thankfully, RISC has the added benefit
that it's easy to study since the instruction set is, well, reduced.
Our fictitious processor has 32 registers, each of which can hold a 4-byte word. We will
support three types of instructions: Load/Store instructions which move bytes back and
forth between registers and memory, ALU instructions which operate on the registers,
and Branch/Jump instructions that alter which instruction is executed next.


Load
Load instructions read bytes into a register. The source may be a constant value, another
register, or a location in memory. In our simple language, a memory location is
expressed Mem[address] where address may be a constant number, a register, or a register
plus a constant offset. Load and Store normally move a whole word at a time starting at
the given address. To move less than a whole word at a time, use the
variants “=.1” (1 byte) and “=.2” (2 bytes).
Load the constant 23 into register 4
R4 = 23
Copy the contents of register 2 into register 3
R3 = R2
Load char (one byte) starting at memory address 244 into register 6
R6 =.1 Mem[244]
Load R5 with the word whose memory address is in R1
R5 = Mem[R1]
Load the word that begins 8 bytes after the address in R1.
This is known as "constant offset" mode and is about the fanciest
addressing mode a RISC processor will support.
R4 = Mem[R1+8]
Just to give a sense of how this relates to the "real world", the load instruction in
Motorola 68000 assembly looks like this:

Move long (4 bytes) constant 15 to register d2
movel #15, d2
Move long at mem address 0x40c to register a0
movel @#0x40c, a0
Or in Sparc assembly, it looks like this:
Load from mem address at register o0 + constant offset 20 into register o1
ld [ %o0 + 20 ], %o1
The syntax of our assembly language is designed to make it easier to read and learn
quickly, but the basic functionality is quite similar to any current RISC instruction set.


Store
Store instructions are basically the reverse of load instructions— they move values from
registers back out to memory. There is no path in a RISC architecture to move bytes
directly from one place in memory to somewhere else in memory. Instead, you need to
use loads to get bytes into registers, and then stores to move them back to memory.
Store the constant number 37 into the word beginning at 400
Mem[400] = 37
Store the value in R6 into the word whose address is in R1
Mem[R1] = R6
Store lower half-word from R2 into 2 bytes starting at address 1024
Mem[1024] =.2 R2
Store R7 into the word whose address is 12 more than the address in R1
Mem[R1+12] = R7


ALU
Arithmetic Logic Unit (ALU) instructions are much like the operation keys of a
calculator. ALU operations only work with the registers or constants. Some processors
don't even allow constants (i.e. you would need to load the constant into a register first).
Add 6 to R3 and store the result in R1
R1 = 6 + R3
Subtract R3 from R2 and store the result in R1
R1 = R2 - R3
Although we will use '+' for both indiscriminately, the processor usually has two
different versions of the arithmetic operations, one for integers and one for floating point
numbers, invoked by two different instructions, for example, the Sparc has add and
fadd. Integer arithmetic is often much more efficient than floating point since the
operations are simpler (e.g. require no normalization). Division is by far the most
expensive of the arithmetic operations on either type and often is not a single instruction,
but a small "micro-coded" routine (think of it as a very fast, hand-tuned function).


Branching
By default, the CPU fetches and executes instructions from memory in order, working
from low memory to high. Branch instructions alter this default order. Branch
instructions test a condition and possibly change which instruction should be executed
next by changing the value of the PC register. One condition that all processors make
heavy use of is testing whether two values are equal or if a value is less (or greater) than
some other value. The operands in the test of a branch statement must be in registers or
constant values. Branches are used to implement control structures like if and switch as
well as loops like for and while.
Begin executing at address 344 if R1 equals 0
BEQ R1, 0, 344 "branch if equal"
Begin executing at addr 8 past current instruction if R2 less than R3
BLT R2, R3, PC+8 "branch if less than"
The full set of branch variants:
BLT branch if first argument is less than second
BLE less than or equal
BGT greater than
BGE greater than or equal
BEQ equal
BNE not equal
Any branch instruction compares its first two arguments (which both must be registers
or constants) and then potentially branches to the address given as the third argument.
The destination address can be specified as an absolute address, such as 356, or a
PC-relative address, such as PC-8 or PC+12. The latter allows you to skip over a few
instructions or jump back to a previous instruction, which are the most common patterns
for loops and conditionals.
In addition, there is an unconditional jump that has no test, but just immediately diverts
execution to a new address. Like branch, the address can be specified as absolute or
PC-relative.
Begin executing at address 2000 unconditionally, like a goto
Jmp 2000
Begin executing at address 12 before current instruction
Jmp PC-12
Data conversion
Two additional instructions are utilities that convert values between integer and floating
point formats. Remember a floating point 1.0 has a completely different arrangement of
bits than the integer 1 and instructions are required to do those conversions. These
instructions are also used to move values for computers that store floating point and
integer values in different sets of registers.
For our purposes, we will have instructions that convert between the 4-byte integer and
the 4-byte float value. The destination and source locations for the conversion operations
must both be registers.
Take bits in R3 that represent integer, convert to float, store in R2
R2 = ItoF R3
Take bits in R4, convert from float to int, and store back in the same
register. Note that converting in this direction loses information: the
fractional component is truncated and lost
R4 = FtoI R4


Summary
Although there are a few things we bypassed (the logical and/or/not and some of the
operations that support function call/return), this simple set of instructions gives you a
pretty good idea of exactly what our average CPU can do in terms of instructions. The
richness and complexity of a programming language like C is provided by the compiler
which takes something complex like a for-loop, array reference, or function call and
translates it into an appropriate sequence of the above simple instructions.
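To make this concrete, here is a sketch of how a compiler might translate the C loop
"for (i = 0; i < 10; i++) sum = sum + i;" into the instruction set above, keeping sum in
R1 and i in R2 (and assuming, as in the earlier examples, that each instruction occupies
one 4-byte word):

Initialize sum and i to 0
R1 = 0
R2 = 0
Exit the loop once i reaches 10 (the instruction after the Jmp is 4 instructions ahead)
BGE R2, 10, PC+16
Add i into sum, then increment i
R1 = R1 + R2
R2 = R2 + 1
Jump back 3 instructions to the test
Jmp PC-12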


Arrays - The full story







Pointers in C








Monday, June 15, 2009

The Role of Caching in Large Scale Architecture

Thanks to the original author and source

Pre-Internet, lots of systems were built without caches. The need to scale has led to the widespread deployment of caching. Most of the open source caching projects grew out of internal efforts to solve performance problems for specific web sites. Ehcache, the most widely deployed Java cache, was originally developed at Wotif.com. And Memcached was developed for LiveJournal.com.

In this article we look at how to go about caching, and the different types of caches. We look at the different types of caching problems and recipes for each. We will also see how to use Ehcache and Ehcache Server to solve these problems.

Why does Caching Work?

Requests for data are not randomly distributed.

If requests for data were entirely random, it would be hard to cache a subset of them. Caching works in computer systems because of the phenomenon of Locality of Reference. More generally it works because many natural systems follow a power law probability distribution, often called a Pareto Distribution, which is shown in figure 1. This is sometimes called the long tail. But the flip side of the long tail is the fat head. The fat head is what we cache.

Figure 1: Pareto Distribution
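For reference, a power-law popularity distribution can be written (in one common Pareto parameterization) via its survival function:

\Pr(X > x) = \left(\frac{x_m}{x}\right)^{\alpha}, \qquad x \ge x_m

where X is the request frequency of an item, x_m is its minimum value, and the shape parameter \alpha controls how strongly requests concentrate in the fat head of most-requested items.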

If you are in doubt, take a look at your own systems and chart the access frequency of each item of a given data type.

These observations allow us to create hierarchical approaches where we try to match frequency of use to the speed of access of the cache and the capacity of the cache.

It is useful to use the example of computer hardware. Computer memory is structured in a hierarchy consisting of registers, L1, L2 and L3 caches and then main memory. With each step down the hierarchy, access is several times slower but capacity goes up.

Data is often written once and read many times

This is known as the read-write ratio. Almost all systems have data with read-write ratios > 1. Lots of systems have data with read-write ratios > 20. These are good candidates for caching. When a read can be satisfied by the cache, it is called a cache hit; the proportion of reads that are hits is the cache hit ratio.

The cache hit ratio is improved by holding more data in the cache, and by holding on to the data for longer periods.
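As a back-of-the-envelope illustration of why the hit ratio matters (the numbers here are assumptions for the example, not measurements): with hit ratio h, cache latency t_c and system-of-record latency t_s, the expected read latency is

t_{\text{read}} = h \cdot t_c + (1 - h) \cdot t_s

so with h = 0.9, t_c = 1ms and t_s = 50ms, reads average 0.9 x 1ms + 0.1 x 50ms = 5.9ms, roughly 8 times better than always going to the system of record.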

Stale Data is often acceptable

If the data is changed in the system of record, but that change is not reflected in the cache, then the data is stale. Sometimes this can be avoided through a cache invalidation protocol. But in many cases it is perfectly acceptable for data to be stale for a time – the time to live.

Take Google search as an example. Crawlers index periodically, commonly once per day. As a result the entire cache is always stale. Yet no one complains, because the staleness does not affect the usage.

Reasons to Cache

The first reason to cache is performance. A fetch served from the cache can be orders of magnitude faster than one served from the system of record.

The second reason to cache is to reduce costs by reducing resource consumption. Resources whose usage can be reduced include CPU, database load, external API usage, S3 fees, and datastore fees. Lastly, caching helps systems scale.

What and where to Cache? Amdahl's Law

Most people think they know where the bottlenecks are in their systems. Many times they turn out to be wrong. Without realising it, everyone has a natural bias to solve problems using the toolset they already have - if all you have is a hammer, then every problem looks like a nail.

I thoroughly recommend taking a formal empirical approach to break out of the bias. Take timing observations across the entire system.

Then use Amdahl's Law to tell you which part of the system to start optimising performance in. The benefit of a performance optimisation has an upper bound: the time taken by the part of the system being optimised. Take an example: a web page takes 21 seconds to render in a browser. The server time was 5 seconds, load across the Internet was 13 seconds, and browser rendering was 3 seconds. Applying Amdahl's law, we should fix the Internet load time first, perhaps with an edge cache, then the server side, and finally page rendering.
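Stated as a formula, in its standard form: if the part of the system being optimised accounts for a fraction p of total time and is sped up by a factor s, the overall speedup is

S = \frac{1}{(1 - p) + p/s}

In the example above, the Internet load is p = 13/21 of the 21 seconds. Even eliminating it entirely (s \to \infty) caps the overall speedup at 1/(1 - 13/21) = 21/8, about 2.6x, still a far larger bound than eliminating the server time (21/16, about 1.3x) or the rendering time (21/18, about 1.2x) could give.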

The process is iterative. Eventually you always get to the server side. Repeat the process. What typically comes out as expensive is:

  • creating the page/ajax response
  • creating page fragments
  • external API calls
  • assembling collections of data/processing of data
  • database calls

Apply Amdahl's law again. If you are using an Object-Relational framework, the best candidate to start with is caching database calls.

Once you identify an area where caching may be applied, start data analysis. Analyse your production data to understand request distributions, data sizes and read-write ratios. Determine the acceptable staleness or identify a cache invalidation protocol. With just the analysis you can determine the effectiveness of adding a cache.

Cache Hierarchies

Caches are available with varying latencies and capacities. We need to match our requirements to the right kind of cache.

Local Caches

In-process Caches

Here the cache is in the JVM. Access is call-by-reference, so it is really fast, measured in microseconds. That is an awful lot of speed.

O-R frameworks in particular need this speed. Hibernate may do hundreds or thousands of calls to the database for a single page. It is designed to work with a high performance in-memory cache.

Some examples of local caches are Ehcache core, OSCache, and Oracle Coherence Near Cache. For web users, edge caches such as Akamai also qualify. Ehcache is available for Hibernate and OpenJPA. It has a general purpose caching API for in-process use. It also has a Servlet caching filter for pages/ajax responses and page fragments. Finally, there is a JRuby gem for JRuby in-process caching.

These types of caches are limited to the maximum memory size of your process. 32 bit architectures were limited to 4GB of address space. Thankfully, with 64 bit architectures, we have moved past that limitation. In Java's case, the heap size is practically limited by garbage collector performance. As garbage collection has improved, so have these cache sizes. The largest Ehcache local caches top out around 20GB. Based on production experience, 6GB heap sizes should work well, with the right garbage collector settings.

In-process caches offer a unique benefit over all other types – data is represented exactly as it needs to be for the programming language. There is no unmarshalling required.

Disk Caches

To move beyond the constraints of memory, some local caches offer storage to disk. The disk store is local to a machine but can theoretically be as large as the locally mounted file systems.

Ehcache offers a disk store set up in a natural hierarchy to the memory store. Once the memory store fills up, data overflows to disk. Ehcache also has disk persistence so that the entire cache can be persisted to disk on demand or at shutdown. It is there when you next restart the JVM saving the cost of repopulating the cache.
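For instance, here is a sketch of an ehcache.xml cache declaration with disk overflow and persistence turned on (the cache name and values are made up for the example; the attribute names follow Ehcache's documented configuration):

<cache name="exampleDiskBackedCache"
       maxElementsInMemory="10000"
       eternal="false"
       timeToLiveSeconds="3600"
       overflowToDisk="true"
       diskPersistent="true"/>

Once more than 10000 elements are held in memory, further elements overflow to the disk store, and with diskPersistent the cache survives a JVM restart.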

Clusters of Local Caches

There is a problem with local caches when you have multiple machines running an application. A request could hit one machine, which then caches the data, but the next request could hit another machine which does not have the data. If nothing is done, the cache hit ratio drops to roughly the local cache hit ratio divided by the number of machines in the cluster: for example, an 80% local hit ratio across four machines gives an effective hit ratio of about 20% for data cached on only one of them.

Cache replication solves this problem. Cache operations are replicated to other nodes in the cluster. Ehcache comes with peer to peer replication using either RMI or JGroups. It also provides centralised replication via JMS. Ehcache RMI based replication architecture is illustrated in figure 2.



Figure 2: Ehcache RMI based replication

In Ehcache, turning on replication is easy. For example, to add RMIReplication with defaults to a cache declaration, add the following:

<cache name="sampleDistributedCache"
       ...>
    <cacheEventListenerFactory
        class="net.sf.ehcache.distribution.RMICacheReplicatorFactory"/>
</cache>

Client-Server Caches

Here the cache is somewhere over the network running on a cache server. Some examples are Ehcache Server, memcached, Apache, Squid and in-memory databases such as TimesTen and MySQL (using in-memory tables).

Access times are measured in milliseconds, compared with microseconds for local caches.

As an example here is the time reported by cloudstatus.com for Google's AppEngine memcached.


Cache System                   Response Time   Get Time (1 MB)
Google AppEngine memcached     6.5ms           62ms

These speeds are not suited to usages requiring rapid lookup like O/R frameworks. In the computer hardware analogy, why would we need L1-L3 caches if memory was fast enough?

But client-server caches have a key benefit – they effectively have infinite capacity. While the resources of a single server are limited, caches can be partitioned or sharded across any number of cache servers. Partitioning works by using consistent hashing, where the same key always resolves to the same partition or shard.
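To illustrate the idea, here is a minimal consistent hashing sketch in Java (this is not code from Ehcache or memcached; the hash spreading and replica count are arbitrary choices for the example):

import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent hash ring: keys map to the first server clockwise
// from their hash position, so the same key always resolves to the same shard.
class ConsistentHash {
    private static final int REPLICAS = 100; // virtual nodes per server
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

    void addServer(String server) {
        for (int i = 0; i < REPLICAS; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    String serverFor(String key) {
        int h = hash(key);
        // First ring position at or after h; wrap around to the start if none.
        SortedMap<Integer, String> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        int h = s.hashCode();
        h ^= (h >>> 16); // spread the bits a little; illustrative only
        return h & 0x7fffffff; // keep it non-negative for the ring
    }
}

The useful property is that adding or removing a server remaps only the keys on the affected arcs of the ring, rather than reshuffling every key across all servers.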

As a sidenote, it is worth noting that there are large scale systems out there that apply partitioning at a higher level. In these approaches a whole system is broken into workgroups. Each workgroup has an instance of the system with its users' data. A fronting load balancer directs user requests to their workgroup.

Ehcache Server (http://ehcache.sourceforge.net/documentation/cache_server.html) is a REST based client-server cache. It has a simple API, so simple you do not need a client – interacting with it is just a few lines of code.

For example here is a Ruby client which reads data out of the cache:

require 'rubygems'
require 'open-uri'
require 'rexml/document'

response = open('http://localhost:8080/ehcache/rest/sampleCache2/2')
xml = response.read
puts xml

Ehcache Server is typically deployed behind a load balancer for redundancy. Because the keys are encoded in the path of the URL, consistent hashing can either be applied in the client or on the load balancer via URI routing. It offers redundant configuration for each partition via master-master replication using one of Ehcache's replication methods.

Scaling is simple. If load increases, simply add more servers to each partition. If data increases, simply create more partitions. Ehcache Server can scale to any size architecture. An example topology with load balancer based consistent hashing and partition redundancy is illustrated in figure 3.



Figure 3: Ehcache Topology

Memcached uses its own custom protocol, not HTTP. It relies on client-side consistent hashing. Redundancy is not provided. This is typically worked around by creating two sets of memcached servers. The client writes twice.
The Memcached topology is illustrated in figure 4.



Figure 4: Memcached Topology

Recipes for Caching - Bringing it all together

The following table shows recipes for various caching scenarios.

Data Characteristics and Caching Model

  • Small Data Size, Inexpensive Fetching: use a local cache with no replication.
  • Small Data Size, Expensive Fetching: use a local cache with replication.
  • Large Data Size, Inexpensive Fetching: use a client-server cache with no redundancy. If you have a cache failure, you regenerate the cache.
  • Large Data Size, Expensive Fetching: use a client-server cache with redundancy.

Finally, if you have a large data size and need fast cache performance, use a hybrid approach with a local cache backed by a client-server cache. The client-server cache becomes a backing cache. This is very easily done in Ehcache with CacheEventListeners to put changes into the backing cache and CacheLoaders to automatically load from the backing cache.

The different types of caching, and how they relate are well illustrated by the Ehcache Architecture diagram in figure 5.


Saturday, June 6, 2009

Machine Structures2: Assembler and the Activation Record


An Assembler is a program which does the following:

· Converts mnemonics [mov] to opcodes [21]

· Does macro expansion

· Does constant expression evaluation

· Maintains a symbol table in memory with information on the symbols used and their values, as shown below









Note:

Variable declarations live in big memory; a value is loaded into a register using a load operation, and similarly a store puts the value back from the register into big memory.
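For example, using the instruction notation from the Computer Architecture post above (the address 1024 for a variable x is made up), incrementing x takes a load, an ALU operation, and a store:

Load the word for x (stored at address 1024) into R1
R1 = Mem[1024]
Add 1 to it
R1 = R1 + 1
Store the result back to x
Mem[1024] = R1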


Monday, June 1, 2009

What should one look for in a Code Review

I recently put this bullet point list together for the team I’m currently working with.

General Principles

  • The core imperative is to organise complexity.
  • Clarity and readability is central. “Intention Revealing”
  • Do not prematurely optimise for performance.
  • Do not repeat yourself. Never copy-and-paste code.
  • Decouple.
  • Always try to leave the code you work on in a better state than before you started (the ‘boy scout’ principle)

Keep the source clean

  • Always delete unused code, including variables and using statements.
  • Don’t comment out code, delete it. We have source control to manage change.

Naming things

  • The name should accurately describe what the thing does.
  • Do not use shortenings, only use well understood abbreviations.
  • If the name looks awkward, the code is probably awkward.

Namespaces

  • Namespaces should match the project name + path inside the project. This is what VS will give you by default.
  • Classes that together provide similar functions should be grouped in a single namespace.
  • Avoid namespace dependency cycles.

Variables

  • Use constants where possible. Avoid magic strings.
  • Use readonly where possible
  • Avoid many temporary variables.
  • Never use a single variable for two different purposes.
  • Keep scope as narrow as possible. (declaration close to use)

Methods

  • The name should accurately describe what the method does.
  • It should only do one thing.
  • It should be small (more than 10 lines of code is questionable).
  • The number of parameters should be small.
  • Public methods should validate all parameters.
  • Assert expectations and throw an appropriate error if invalid (see the sketch after this list).
  • Avoid deep nesting of loops and conditionals. (Cyclomatic complexity).
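A minimal sketch of fail-fast parameter validation (Java used for illustration; the class and message are made up):

public final class Greeter {
    // Validate at the public boundary and fail fast with a clear message.
    public String greet(String name) {
        if (name == null || name.trim().isEmpty()) {
            throw new IllegalArgumentException("name must be non-empty");
        }
        return "Hello, " + name;
    }
}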

Classes

  • The name should accurately describe what the class does.
  • Classes typically represent data or services, be clear which your class is.
  • Design your object oriented schema deliberately.
  • A class should be small.
  • A class should have one responsibility only.
  • A class should have a clear contract.
  • A class should be decoupled from its dependencies.
  • Favour composition over inheritance.
  • Avoid static classes and methods.
  • Make the class immutable if possible.

Interfaces

  • Rely on interfaces rather than concrete classes wherever possible.
  • An interface is a contract for interaction.
  • An interface should have a single purpose (ISP)

Tests

  • All code should have unit tests if possible.
  • Test code should have the same quality as production code.
  • Write code test-first wherever possible.

Error Handling

  • Only wrap code with a try..catch statement if you are expecting it to throw a specific exception (see the sketch below).
  • Unexpected errors should only be handled at process boundaries.
  • Never ‘bury’ exceptions.
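A minimal sketch of catching only a specific, expected exception and not burying it (Java used for illustration; the file name is made up):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ConfigReader {
    public String readConfig() {
        try {
            // Expected, specific failure: the config file may be missing or unreadable.
            return new String(Files.readAllBytes(Paths.get("app.conf")));
        } catch (IOException e) {
            // Don't bury it: rethrow with context rather than swallowing it
            // or catching a blanket Exception.
            throw new IllegalStateException("Could not read app.conf", e);
        }
    }
}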
Thanks to the original author and the source

Why do we use Nested Classes in Java?

Nested Classes are classes defined inside the body of another enclosing class. They are of two types - static and non-static. Read more about them in this article - Nested Class & Inner Classes in Java >>

Why do we use Nested Classes?

Well... there may be a variety of reasons, but the main reason to use them is probably better grouping of classes and hence improved readability of the code. This in turn makes the code better from the maintainability point of view as well, as the two are closely related. By better grouping we mean that the code will be right there where it'll be used. Nesting of classes provides better packaging convenience as well.

If a class is of use to only one other class, it'll never be a good idea to make that class a top-level class. Putting the definition of the class inside the body of the top-level class (which will use it) makes the code available right at the place where it'll be used. If the code is used by only one class, why have a separate top-level class at all? Well... this is a design issue, and it's normally done to logically group similar information inside a class: a nested class collects related data in one place that would otherwise be scattered through the body of the enclosing class.

Nested classes are treated as members of the enclosing class, and hence we can give them any of the four access levels - private, package-private (the default), protected, or public. We don't have this luxury with top-level classes, which can only be declared public or package-private.
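A minimal sketch of the grouping idea (the class names are made up for illustration): a node type that only a list class uses is nested, private, inside that class, so the code sits exactly where it is used:

public class IntList {

    // Static nested class: only IntList uses Node, so it is grouped here
    // and hidden from the rest of the package by the private modifier.
    private static class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private Node head;

    public void addFirst(int value) {
        Node n = new Node(value);
        n.next = head;
        head = n;
    }
}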

Potential Disadvantages of Nested Classes in Java

There are no serious disadvantages, but one can certainly figure out at least a few, including:
  • Difficult to understand - especially for inexperienced programmers, who may find such code harder to write, enhance, and maintain.
  • More classes - nested classes certainly increase the total number of classes used by the application. For every class loaded into memory, the JVM creates an object of type Class for it, and there may be other routine tasks the JVM must do for every extra class. This may result in slightly slower performance if the application uses many nested/inner classes (possibly a sign of poor design).
  • Limited support by tools/IDEs - nested classes don't enjoy the same support as top-level classes in most tools and IDEs. This may irritate the developer at times.

Thanks to the original author and the source

How to Write an Equality Method in Java

Summary
This article describes a technique for overriding the equals method that preserves the contract of equals even when subclasses of concrete classes add new fields.

In Item 8 of Effective Java1, Josh Bloch describes the difficulty of preserving the equals contract when subclassing as a “fundamental problem of equivalence relations in object-oriented languages.” Bloch writes:

There is no way to extend an instantiable class and add a value component while preserving the equals contract, unless you are willing to forgo the benefits of object-oriented abstraction.

Chapter 28 of Programming in Scala shows an approach that allows subclasses to extend an instantiable class, add a value component, and nevertheless preserve the equals contract. Although the technique is described in the book in the context of defining Scala classes, it is also applicable to classes defined in Java. In this article, we present the technique using text adapted from the relevant section of Programming in Scala, but with the code examples translated from Scala into Java.

Common equality pitfalls

Class java.lang.Object defines an equals method, which subclasses may override. Unfortunately, it turns out that writing a correct equality method is surprisingly difficult in object-oriented languages. In fact, after studying a large body of Java code, the authors of a 2007 paper concluded that almost all implementations of equals methods are faulty.2

This is problematic, because equality is at the basis of many other things. For one, a faulty equality method for a type C might mean that you cannot reliably put an object of type C in a collection. You might have two elements elem1, elem2 of type C that are equal, i.e., “elem1.equals(elem2)” yields true. Nevertheless, with commonly occurring faulty implementations of the equals method, you could still see behavior like the following:
Set hashSet = new java.util.HashSet();
hashSet.add(elem1);
hashSet.contains(elem2); // returns false!

Here are four common pitfalls3 that can cause inconsistent behavior when overriding equals:

  1. Defining equals with the wrong signature.
  2. Changing equals without also changing hashCode.
  3. Defining equals in terms of mutable fields.
  4. Failing to define equals as an equivalence relation.

These four pitfalls are discussed in the remainder of this section.

Pitfall #1: Defining equals with the wrong signature.

Consider adding an equality method to the following class of simple points:

public class Point {

    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    // ...
}

A seemingly obvious, but wrong way would be to define it like this:

// An utterly wrong definition of equals
public boolean equals(Point other) {
    return (this.getX() == other.getX() && this.getY() == other.getY());
}

What's wrong with this method? At first glance, it seems to work OK:

Point p1 = new Point(1, 2);
Point p2 = new Point(1, 2);

Point q = new Point(2, 3);

System.out.println(p1.equals(p2)); // prints true

System.out.println(p1.equals(q)); // prints false

However, trouble starts once you start putting points into a collection:

import java.util.HashSet;

HashSet coll = new HashSet();
coll.add(p1);

System.out.println(coll.contains(p2)); // prints false

How can it be that coll does not contain p2, even though p1 was added to it, and p1 and p2 are equal objects? The reason becomes clear in the following interaction, where the precise type of one of the compared points is masked. Define p2a as an alias of p2, but with type Object instead of Point:

Object p2a = p2;

Now, were you to repeat the first comparison, but with the alias p2a instead of p2, you would get:

System.out.println(p1.equals(p2a)); // prints false

What went wrong? In fact, the version of equals given previously does not override the standard method equals, because its type is different. Here is the type of the equals method as it is defined in the root class Object:

public boolean equals(Object other)

Because the equals method in Point takes a Point instead of an Object as an argument, it does not override equals in Object. Instead, it is just an overloaded alternative. Overloading in Java is resolved by the static type of the argument, not the run-time type. So as long as the static type of the argument is Point, the equals method in Point is called. However, once the static argument is of type Object, the equals method in Object is called instead. This method has not been overridden, so it is still implemented by comparing object identity. That's why the comparison “p1.equals(p2a)” yields false even though points p1 and p2a have the same x and y values. That's also why the contains method in HashSet returned false. Since that method operates on generic sets, it calls the generic equals method in Object instead of the overloaded variant in Point.

A better equals method is the following:

// A better definition, but still not perfect
@Override public boolean equals(Object other) {
    boolean result = false;
    if (other instanceof Point) {
        Point that = (Point) other;
        result = (this.getX() == that.getX() && this.getY() == that.getY());
    }
    return result;
}

Now equals has the correct type. It takes a value of type Object as parameter and it yields a boolean result. The implementation of this method uses instanceof and a cast. It first tests whether the other object is also of type Point. If it is, it compares the coordinates of the two points and returns the result. Otherwise the result is false.

Pitfall #2: Changing equals without also changing hashCode

If you repeat the comparison of p1 and p2a with the latest definition of Point defined previously, you will get true, as expected. However, if you repeat the HashSet.contains test, you will probably still get false:

Point p1 = new Point(1, 2);
Point p2 = new Point(1, 2);

HashSet coll = new HashSet();
coll.add(p1);

System.out.println(coll.contains(p2)); // prints false (probably)

In fact, this outcome is not 100% certain. You might also get true from the experiment. If you do, you can try with some other points with coordinates 1 and 2. Eventually, you'll get one which is not contained in the set. What goes wrong here is that Point redefined equals without also redefining hashCode.

Note that the collection in the example above is a HashSet. This means elements of the collection are put in “hash buckets” determined by their hash code. The contains test first determines a hash bucket to look in and then compares the given elements with all elements in that bucket. Now, the last version of class Point did redefine equals, but it did not at the same time redefine hashCode. So hashCode is still what it was in its version in class Object: some transformation of the address of the allocated object. The hash codes of p1 and p2 are almost certainly different, even though the fields of both points are the same. Different hash codes mean with high probability different hash buckets in the set. The contains test will look for a matching element in the bucket which corresponds to p2's hash code. In most cases, point p1 will be in another bucket, so it will never be found. p1 and p2 might also end up by chance in the same hash bucket. In that case the test would return true.

The problem was that the last implementation of Point violated the contract on hashCode as defined for class Object:

If two objects are equal according to the equals(Object) method, then calling the hashCode method on each of the two objects must produce the same integer result.

In fact, it's well known in Java that hashCode and equals should always be redefined together. Furthermore, hashCode may only depend on fields that equals depends on. For the Point class, the following would be a suitable definition of hashCode:

public class Point {

    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof Point) {
            Point that = (Point) other;
            result = (this.getX() == that.getX() && this.getY() == that.getY());
        }
        return result;
    }

    @Override public int hashCode() {
        return (41 * (41 + getX()) + getY());
    }
}

This is just one of many possible implementations of hashCode. Adding the constant 41 to one integer field x, multiplying the result with the prime number 41, and adding to that result the other integer field y gives a reasonable distribution of hash codes at a low cost in running time and code size.

Adding hashCode fixes the problems of equality when defining classes like Point. However, there are still other trouble spots to watch out for.

Pitfall #3: Defining equals in terms of mutable fields

Consider the following slight variation of class Point:

public class Point {

    private int x;
    private int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    public void setX(int x) { // Problematic
        this.x = x;
    }

    public void setY(int y) {
        this.y = y;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof Point) {
            Point that = (Point) other;
            result = (this.getX() == that.getX() && this.getY() == that.getY());
        }
        return result;
    }

    @Override public int hashCode() {
        return (41 * (41 + getX()) + getY());
    }
}

The only difference is that the fields x and y are no longer final, and two set methods have been added that allow clients to change the x and y values. The equals and hashCode methods are now defined in terms of these mutable fields, so their results change when the fields change. This can have strange effects once you put points in collections:

Point p = new Point(1, 2);

HashSet coll = new HashSet();
coll.add(p);

System.out.println(coll.contains(p)); // prints true

Now, if you change a field in point p, does the collection still contain the point? We'll try it:

p.setX(p.getX() + 1);

System.out.println(coll.contains(p)); // prints false (probably)

This looks strange. Where did p go? More strangeness results if you check whether the iterator of the set contains p:

Iterator it = coll.iterator();
boolean containedP = false;
while (it.hasNext()) {
    Point nextP = (Point) it.next(); // cast needed: the raw Iterator yields Object
    if (nextP.equals(p)) {
        containedP = true;
        break;
    }
}

System.out.println(containedP); // prints true

So here's a set that does not contain p, yet p is among the elements of the set! What happened, of course, is that after the change to the x field, the point p ended up in the wrong hash bucket of the set coll. That is, its original hash bucket no longer corresponded to the new value of its hash code. In a manner of speaking, the point p “dropped out of sight” in the set coll even though it still belonged to its elements.

The lesson to be drawn from this example is that when equals and hashCode depend on mutable state, it may cause problems for users. If they put such objects into collections, they have to be careful never to modify the depended-on state, and this is tricky. If you need a comparison that takes the current state of an object into account, you should usually name it something else, not equals. Considering the last definition of Point, it would have been preferable to omit a redefinition of hashCode and to name the comparison method equalContents, or some other name different from equals. Point would then have inherited the default implementation of equals and hashCode. So p would have stayed locatable in coll even after the modification to its x field.

Pitfall #4: Failing to define equals as an equivalence relation

The contract of the equals method in Object specifies that equals must implement an equivalence relation on non-null objects:

  • It is reflexive: for any non-null value x, the expression x.equals(x) should return true.
  • It is symmetric: for any non-null values x and y, x.equals(y) should return true if and only if y.equals(x) returns true.
  • It is transitive: for any non-null values x, y, and z, if x.equals(y) returns true and y.equals(z) returns true, then x.equals(z) should return true.
  • It is consistent: for any non-null values x and y, multiple invocations of x.equals(y) should consistently return true or consistently return false, provided no information used in equals comparisons on the objects is modified.
  • For any non-null value x, x.equals(null) should return false.

The definition of equals developed so far for class Point satisfies the contract for equals. However, things become more complicated once subclasses are considered. Say there is a subclass ColoredPoint of Point that adds a field color of type Color. Assume Color is defined as an enum:

public enum Color {
    RED, ORANGE, YELLOW, GREEN, BLUE, INDIGO, VIOLET;
}

ColoredPoint overrides equals to take the new color field into account:

public class ColoredPoint extends Point { // Problem: equals not symmetric

    private final Color color;

    public ColoredPoint(int x, int y, Color color) {
        super(x, y);
        this.color = color;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof ColoredPoint) {
            ColoredPoint that = (ColoredPoint) other;
            result = (this.color.equals(that.color) && super.equals(that));
        }
        return result;
    }
}

This is what many programmers would likely write. Note that in this case, class ColoredPoint need not override hashCode. Because the new definition of equals on ColoredPoint is stricter than the overridden definition in Point (meaning it equates fewer pairs of objects), the contract for hashCode stays valid. If two colored points are equal, they must have the same coordinates, so their hash codes are guaranteed to be equal as well.

Taking the class ColoredPoint by itself, its definition of equals looks OK. However, the contract for equals is broken once points and colored points are mixed. Consider:

Point p = new Point(1, 2);

ColoredPoint cp = new ColoredPoint(1, 2, Color.RED);

System.out.println(p.equals(cp)); // prints true

System.out.println(cp.equals(p)); // prints false

The comparison “p equals cp” invokes p's equals method, which is defined in class Point. This method only takes into account the coordinates of the two points. Consequently, the comparison yields true. On the other hand, the comparison “cp equals p” invokes cp's equals method, which is defined in class ColoredPoint. This method returns false, because p is not a ColoredPoint. So the relation defined by equals is not symmetric.

The loss in symmetry can have unexpected consequences for collections. Here's an example:

Set hashSet1 = new java.util.HashSet();
hashSet1.add(p);
System.out.println(hashSet1.contains(cp)); // prints false

Set hashSet2 = new java.util.HashSet();
hashSet2.add(cp);
System.out.println(hashSet2.contains(p)); // prints true

So even though p and cp are equal, one contains test succeeds whereas the other one fails.

How can you change the definition of equals so that it becomes symmetric? Essentially there are two ways. You can either make the relation more general or more strict. Making it more general means that a pair of two objects, a and b, is taken to be equal if either comparing a with b or comparing b with a yields true. Here's code that does this:

public class ColoredPoint extends Point { // Problem: equals not transitive

    private final Color color;

    public ColoredPoint(int x, int y, Color color) {
        super(x, y);
        this.color = color;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof ColoredPoint) {
            ColoredPoint that = (ColoredPoint) other;
            result = (this.color.equals(that.color) && super.equals(that));
        } else if (other instanceof Point) {
            Point that = (Point) other;
            result = that.equals(this);
        }
        return result;
    }
}

The new definition of equals in ColoredPoint checks one more case than the old one: If the other object is a Point but not a ColoredPoint, the method forwards to the equals method of Point. This has the desired effect of making equals symmetric. Now, both “cp.equals(p)” and “p.equals(cp)” result in true. However, the contract for equals is still broken. Now the problem is that the new relation is no longer transitive! Here's a sequence of statements that demonstrates this. Define a point and two colored points of different colors, all at the same position:

ColoredPoint redP = new ColoredPoint(1, 2, Color.RED);
ColoredPoint blueP = new ColoredPoint(1, 2, Color.BLUE);

Taken individually, redP is equal to p and p is equal to blueP:

System.out.println(redP.equals(p)); // prints true

System.out.println(p.equals(blueP)); // prints true

However, comparing redP and blueP yields false:

System.out.println(redP.equals(blueP)); // prints false

Hence, the transitivity clause of equals's contract is violated.

Making the equals relation more general seems to lead to a dead end. We'll try to make it stricter instead. One way to make equals stricter is to always treat objects of different classes as different. That could be achieved by modifying the equals methods in classes Point and ColoredPoint. In class Point, you could add an extra comparison that checks whether the run-time class of the other Point is exactly the same as this Point's class, as follows:

// A technically valid, but unsatisfying, equals method
public class Point {

    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof Point) {
            Point that = (Point) other;
            result = (this.getX() == that.getX() && this.getY() == that.getY()
                      && this.getClass().equals(that.getClass()));
        }
        return result;
    }

    @Override public int hashCode() {
        return (41 * (41 + getX()) + getY());
    }
}

You can then revert class ColoredPoint's implementation back to the version that previously had violated the symmetry requirement:4

public class ColoredPoint extends Point { // No longer violates symmetry requirement

    private final Color color;

    public ColoredPoint(int x, int y, Color color) {
        super(x, y);
        this.color = color;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof ColoredPoint) {
            ColoredPoint that = (ColoredPoint) other;
            result = (this.color.equals(that.color) && super.equals(that));
        }
        return result;
    }
}

Here, an instance of class Point is considered to be equal to some other instance of the same class only if the objects have the same coordinates and they have the same run-time class, meaning .getClass() on either object returns the same value. The new definitions satisfy symmetry and transitivity because now every comparison between objects of different classes yields false. So a colored point can never be equal to a point. This convention looks reasonable, but one could argue that the new definition is too strict.

Consider the following slightly roundabout way to define a point at coordinates (1, 2):

Point pAnon = new Point(1, 1) {
    @Override public int getY() {
        return 2;
    }
};

Is pAnon equal to p? The answer is no because the java.lang.Class objects associated with p and pAnon are different. For p it is Point, whereas for pAnon it is an anonymous subclass of Point. But clearly, pAnon is just another point at coordinates (1, 2). It does not seem reasonable to treat it as being different from p.

The canEqual method

So it seems we are stuck. Is there a sane way to redefine equality on several levels of the class hierarchy while keeping its contract? In fact, there is such a way, but it requires one more method to redefine together with equals and hashCode. The idea is that as soon as a class redefines equals (and hashCode), it should also explicitly state that objects of this class are never equal to objects of some superclass that implement a different equality method. This is achieved by adding a method canEqual to every class that redefines equals. Here's the method's signature:

public boolean canEqual(Object other)

The method should return true if the other object is an instance of the class in which canEqual is (re)defined, false otherwise. It is called from equals to make sure that the objects are comparable both ways. Here is a new (and final) implementation of class Point along these lines:

public class Point {

    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    public int getX() {
        return x;
    }

    public int getY() {
        return y;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof Point) {
            Point that = (Point) other;
            result = (that.canEqual(this) &&
                      this.getX() == that.getX() &&
                      this.getY() == that.getY());
        }
        return result;
    }

    @Override public int hashCode() {
        return (41 * (41 + getX()) + getY());
    }

    public boolean canEqual(Object other) {
        return (other instanceof Point);
    }
}

The equals method in this version of class Point contains the additional requirement that the other object can equal this one, as determined by the canEqual method. The implementation of canEqual in Point states that all instances of Point can be equal.

Here's the corresponding implementation of ColoredPoint:

public class ColoredPoint extends Point { // No longer violates symmetry requirement

    private final Color color;

    public ColoredPoint(int x, int y, Color color) {
        super(x, y);
        this.color = color;
    }

    @Override public boolean equals(Object other) {
        boolean result = false;
        if (other instanceof ColoredPoint) {
            ColoredPoint that = (ColoredPoint) other;
            result = (that.canEqual(this) &&
                      this.color.equals(that.color) &&
                      super.equals(that));
        }
        return result;
    }

    @Override public int hashCode() {
        return (41 * super.hashCode() + color.hashCode());
    }

    @Override public boolean canEqual(Object other) {
        return (other instanceof ColoredPoint);
    }
}

It can be shown that the new definition of Point and ColoredPoint keeps the contract of equals. Equality is symmetric and transitive. Comparing a Point to a ColoredPoint always yields false. Indeed, for any point p and colored point cp, “p.equals(cp)” will return false because “cp.canEqual(p)” will return false. The reverse comparison, “cp.equals(p)”, will also return false, because p is not a ColoredPoint, so the first instanceof check in the body of equals in ColoredPoint will fail.

On the other hand, instances of different subclasses of Point can be equal, as long as none of the classes redefines the equality method. For instance, with the new class definitions, the comparison of p and pAnon would yield true. Here are some examples:

Point p = new Point(1, 2);

ColoredPoint cp = new ColoredPoint(1, 2, Color.INDIGO);

Point pAnon = new Point(1, 1) {
    @Override public int getY() {
        return 2;
    }
};

Set coll = new java.util.HashSet();
coll.add(p);

System.out.println(coll.contains(p)); // prints true

System.out.println(coll.contains(cp)); // prints false

System.out.println(coll.contains(pAnon)); // prints true

These examples demonstrate that if a superclass equals implementation defines and calls canEqual, then programmers who implement subclasses can decide whether or not their subclasses may be equal to instances of the superclass. Because ColoredPoint overrides canEqual, for example, a colored point may never be equal to a plain-old point. But because the anonymous subclass referenced from pAnon does not override canEqual, its instance can be equal to a Point instance.

One potential criticism of the canEqual approach is that it violates the Liskov Substitution Principle (LSP). For example, the technique of implementing equals by comparing run-time classes, which led to the inability to define a subclass whose instances can equal instances of the superclass, has been described as a violation of the LSP.5 The reasoning is that the LSP states you should be able to use (substitute) a subclass instance where a superclass instance is required. In the previous example, however, “coll.contains(cp)” returned false even though cp's x and y values matched those of the point in the collection. Thus it may seem like a violation of the LSP, because you can't use a ColoredPoint here where a Point is expected. We believe this is the wrong interpretation, though, because the LSP doesn't require that a subclass behaves identically to its superclass, just that it behaves in a way that fulfills the contract of its superclass.

The problem with writing an equals method that compares run-time classes is not that it violates the LSP, but that it doesn't give you a way to create a subclass whose instances can equal superclass instances. For example, had we used the run-time class technique in the previous example, “coll.contains(pAnon)” would have returned false, and that's not what we wanted. By contrast, we really did want “coll.contains(cp)” to return false, because by overriding equals in ColoredPoint, we were basically saying that an indigo-colored point at coordinates (1, 2) is not the same thing as an uncolored point at (1, 2). Thus, in the previous example we were able to pass two different Point subclass instances to the collection's contains method, and we got back two different answers, both correct.