The size and complexity of reservoir simulation models are growing continuously, both to better capture the heterogeneities of geological models and to simulate advanced physics. Simulator speed and efficiency nevertheless remain major bottlenecks, and in many cases users must simplify their models. We present recent insights from High Performance Computing (HPC) prototyping in an In-House Research Reservoir Simulator (IHRRS). First, we explain how we build a cache-efficient Jacobian matrix for a general multi-phase, multi-component simulator using numerical discretizations based on Two-Point Flux Approximation (TPFA) or Multi-Point Flux Approximation (MPFA). Second, we describe the implementation of two levels of distributed-memory parallelism (MPI) combined with one level of shared-memory parallelism (OpenMP). Two levels of distributed-memory parallelism are used because some algorithms, such as AMG coarsening or parallel I/O of grid data, scale poorly with a large number of distributed-memory processes, while other parts of the simulator continue to scale as processes are added. Shared-memory parallelism is then used within each distributed-memory process. Numerical experiments are presented to investigate the optimal ratio of MPI processes to OpenMP threads. In addition, we test our algorithms on the simulation of a coreflood experiment at an adverse mobility ratio using about 30 million gridblocks in 3D, and on a large-scale model of nearly half a billion cells.