Sunday, August 30, 2009

JPA Performance: Dont Ignore the Database

Database Schema

Good Database schema design is important for performance. One of the most basic optimizations is to design your tables to take as little space on the disk as possible , this makes disk reads faster and uses less memory for query processing.

Data Types

You should use the smallest data types possible, especially for indexed fields. The smaller your data types, the more indexes (and data) can fit into a block of memory, the faster your queries will be.

Normalization

Database Normalization eliminates redundant data, which usually makes updates faster since there is less data to change. However a Normalized schema causes joins for queries, which makes queries slower, denormalization speeds retrieval. More normalized schemas are better for applications involving many transactions, less normalized are better for reporting types of applications. You should normalize your schema first, then de-normalize later. Applications often need to mix the approaches, for example use a partially normalized schema, and duplicate, or cache, selected columns from one table in another table. With JPA O/R mapping you can use the @Embedded annotation for denormalized columns to specify a persistent field whose @Embeddable type can be stored as an intrinsic part of the owning entity and share the identity of the entity.


Database Normalization and Mapping Inheritance Hiearchies

The Class Inheritance hierarchy shown below will be used as an example of JPA O/R mapping.


In the Single table per class mapping shown below, all classes in the hierarchy are mapped to a single table in the database. This table has a discriminator column (mapped by @DiscriminatorColumn), which identifies the subclass. Advantages: This is fast for querying, no joins are required. Disadvantages: wastage of space since all inherited fields are in every row, a deep inheritance hierarchy will result in wide tables with many, some empty columns.



In the Joined Subclass mapping shown below, the root of the class hierarchy is represented by a single table, and each subclass has a separate table that only contains those fields specific to that subclass. This is normalized (eliminates redundant data) which is better for storage and updates. However queries cause joins which makes queries slower especially for deep hierachies, polymorphic queries and relationships.


In the Table per Class mapping (in JPA 2.0, optional in JPA 1.0), every concrete class is mapped to a table in the database and all the inherited state is repeated in that table. This is not normlalized, inherited data is repeated which wastes space. Queries for Entities of the same type are fast, however polymorphic queries cause unions which are slower.



Know what SQL is executed

You need to understand the SQL queries your application makes and evaluate their performance. Its a good idea to enable SQL logging, then go through a use case scenario to check the executed SQL. Logging is not part of the JPA specification, With EclipseLink you can enable logging of SQL by setting the following property in the persistence.xml file:



<properties>
<property name="eclipselink.logging.level" value="FINE"/>
</properties>

With Hibernate you set the following property in the persistence.xml file:




<properties>
<property name="hibernate.show_sql" value="true" />
</properties>

Basically you want to make your queries access less data, is your application retrieving more data than it needs, are queries accessing too many rows or columns? Is the database query analyzing more rows than it needs? Watch out for the following:
  • queries which execute too often to retrieve needed data
  • retrieving more data than needed
  • queries which are too slow
    • you can use EXPLAIN to see where you should add indexes

With MySQL you can use the slow query log to see which queries are executing slowly, or you can use the MySQL query analyzer to see slow queries, query execution counts, and results of EXPLAIN statements.

Understanding EXPLAIN

For slow queries, you can precede a SELECT statement with the keyword EXPLAIN to get information about the query execution plan, which explains how it would process the SELECT, including information about how tables are joined and in which order. This helps find missing indexes early in the development process.




You should index columns that are frequently used in Query WHERE, GROUP BY clauses, and columns frequently used in joins, but be aware that indexes can slow down inserts and updates.

Lazy Loading and JPA

With JPA many-to-one and many-to-many relationships lazy load by default, meaning they will be loaded when the entity in the relationship is accessed. Lazy loading is usually good, but if you need to access all of the "many" objects in a relationship, it will cause n+1 selects where n is the number of "many" objects.




You can change the relationship to be loaded eagerly as follows :




However you should be careful with eager loading which could cause SELECT statements that fetch too much data. It can cause a Cartesian product if you eagerly load entities with several related collections.


If you want to override the LAZY fetch type for specific use cases, you can use Fetch Join. For example this query would eagerly load the employee addresses:

In General you should lazily load relationships, test your use case scenarios, check the SQL log, and use @NameQueries with JOIN FETCH to eagerly load when needed.

Partitioning

the main goal of partitioning is to reduce the amount of data read for particular SQL operations so that the overall response time is reduced

Vertical Partitioning splits tables with many columns into multiple tables with fewer columns, so that only certain columns are included in a particular dataset, with each partition including all rows.

Horizontal Partitioning segments table rows so that distinct groups of physical row-based datasets are formed. All columns defined to a table are found in each set of partitions. An example of horizontal partitioning might be a table that contains historical data being partitioned by date.

Vertical Partitioning


In the example of vertical partitioning below a table that contains a number of very wide text or BLOB columns that aren't referenced often is split into two tables with the most referenced columns in one table and the seldom-referenced text or BLOB columns in another.

By removing the large data columns from the table, you get a faster query response time for the more frequently accessed Customer data. Wide tables can slow down queries, so you should always ensure that all columns defined to a table are actually needed.

The example below shows the JPA mapping for the tables above. The Customer data table with the more frequently accessed and smaller data types is mapped to the Customer Entity, the CustomerInfo table with the less frequently accessed and larger data types is mapped to the CustomerInfo Entity with a lazily loaded one to one relationship to the Customer.


Horizontal Partitioning

The major forms of horizontal partitioning are by Range, Hash, Hash Key, List, and Composite.

Horizontal partitioning can make queries faster because the query optimizer knows what partitions contain the data that will satisfy a particular query and will access only those necessary partitions during query execution. Horizontal Partitioning works best for large database Applications that contain a lot of query activity that targets specific ranges of database tables.


Hibernate Shards

Partitioning data horizontally into "Shards" is used by google, linkedin, and others to give extreme scalability for very large amounts of data. eBay "shards" data horizontally along its primary access path.

Hibernate Shards is a framework that is designed to encapsulate support for horizontal partitioning into the Hibernate Core.


Caching

JPA Level 2 caching avoids database access for already loaded entities, this make reading reading frequently accessed unmodified entities faster, however it can give bad scalability for frequent or concurrently updated entities.

You should configure L2 caching for entities that are:
  • read often
  • modified infrequently
  • Not critical if stale
You should also configure L2 (vendor specific) caching for maxElements, time to expire, refresh...

References and More Information:

JPA Best Practices presentation
MySQL for Developers Article
MySQL for developers presentation
MySQL for developers screencast
Keeping a Relational Perspective for Optimizing Java Persistence
Java Persistence with Hibernate
Pro EJB 3: Java Persistence API
Java Persistence API 2.0: What's New ?
High Performance MySQL book
Pro MySQL, Chapter 6: Benchmarking and Profiling
EJB 3 in Action
sharding the hibernate way
JPA Caching
Best Practices for Large-Scale Web Sites: Lessons from eBay

Original Source

No comments:

Post a Comment