DSpace Recovery & Troubleshooting Runbook

Step-by-step guide to recover from 500 errors in DSpace (Solr + Tomcat10 + Apache).

1 - Quick triage (1 minute)

curl -sS -I http://localhost:8080/server/api
ss -ltnp | grep ':8983'
sudo systemctl status tomcat10 --no-pager
sudo tail -n 40 /var/log/apache2/error.log

2 - Full recovery procedure

A. Stop leftover Solr instances

sudo -u solr /opt/solr/bin/solr stop -all

B. Start Solr

sudo systemctl start solr

C. Restart Tomcat10

sudo systemctl restart tomcat10

Expanded Recovery & Runbook - DSpace (Apache → Tomcat10 → Solr → Postgres)

Below is a compact, copy-pasteable runbook you can follow the next time the site shows 500s or Apache logs "error reading status line from remote server localhost:8080". It contains a one-minute triage, the full step-by-step recovery, diagnostics to capture, common failure patterns, a safe restart script you can install, and forward-looking prevention steps.

All commands assume you have sudo access (run as root or prefix with sudo).


1 - Quick triage (1 minute)

Run these four commands and share the failures if anything looks wrong:

# 1) Does the frontend/API respond?
curl -sS -I http://localhost:8080/server/api || echo "no response on :8080"

# 2) Is Solr listening?
ss -ltnp | grep ':8983' || sudo lsof -i :8983

# 3) Tomcat service status
sudo systemctl status tomcat10 --no-pager

# 4) Recent Apache error messages
sudo tail -n 40 /var/log/apache2/error.log

If curl fails, Tomcat is inactive, or Solr is not listening → follow the full recovery below.


2 - Full step-by-step recovery (safe order)

High-level rule: start services in this order: Solr → Tomcat → Apache (Apache may stay running; the backend needs to be healthy before Apache proxies to it).

A. Stop risky/manual leftovers

Avoid mixed methods (systemctl + manual /opt/solr/bin/solr start): stop stray manual instances first.

# Stop any manual Solr run (safe, by solr user)
sudo -u solr /opt/solr/bin/solr stop -all 2>/dev/null || true

# Ensure no java/solr processes are left for port 8983
sudo lsof -i :8983 || ss -ltnp | grep 8983 || true

B. Start Solr (use systemd if available)

Prefer systemctl for reliability:

sudo systemctl start solr
sudo systemctl status solr --no-pager
# Wait & check health
for i in {1..12}; do curl -sSf http://localhost:8983/solr/ >/dev/null && break || sleep 5; done

If systemctl start solr fails with port 8983 already in use, run:

sudo lsof -i :8983 -Pn
# note pid -> if it's a stale solr process owned by solr, stop it:
sudo -u solr /opt/solr/bin/solr stop -all
# If an unrelated process is on 8983, investigate that PID and free the port
ps -fp <pid>

C. Start Tomcat 10 (only after Solr healthy)

sudo systemctl restart tomcat10
sudo systemctl status tomcat10 --no-pager
# tail Tomcat logs while it starts
sudo tail -n 200 /var/log/tomcat10/catalina.out

D. Verify DSpace REST API

Wait for /server/api to answer:

for i in {1..18}; do
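  # curl prints 000 itself when the connection fails, so no fallback echo is needed
  http_status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/server/api)
  echo "$(date '+%F %T') status=$http_status"
  # stop only on success (2xx/3xx); a 5xx means the backend is still unhealthy
  [ "$http_status" -ge 200 ] && [ "$http_status" -lt 400 ] && break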
  sleep 5
done

Expect 200 (or JSON body). If you get 500 or 000, check Tomcat/DSpace logs next.
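
The same wait can be wrapped in a reusable function that reports success or failure through its exit code, which is handy when chaining steps. This is a minimal sketch, not part of DSpace: the names http_code and wait_for_http are hypothetical, the defaults (18 tries, 5 seconds apart) mirror the loop above, and the 10-second per-request timeout is an assumption.

```shell
# Probe: print the HTTP status code for a URL ("000" when nothing answers).
# curl writes 000 itself on connection failure, so only its exit code is masked.
http_code() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1" || true
}

# Poll a URL until it answers 2xx/3xx. Args: URL [tries] [delay_seconds].
# Returns 0 on success, 1 if every attempt failed.
wait_for_http() {
  wfh_url=$1 wfh_tries=${2:-18} wfh_delay=${3:-5} wfh_i=1
  while [ "$wfh_i" -le "$wfh_tries" ]; do
    wfh_code=$(http_code "$wfh_url")
    echo "$(date '+%F %T') attempt=$wfh_i status=$wfh_code"
    case $wfh_code in
      2??|3??) return 0 ;;
    esac
    wfh_i=$((wfh_i + 1))
    # no point sleeping after the final attempt
    [ "$wfh_i" -le "$wfh_tries" ] && sleep "$wfh_delay"
  done
  return 1
}
```

Possible use: wait_for_http http://localhost:8983/solr/ 12 && sudo systemctl restart tomcat10 && wait_for_http http://localhost:8080/server/api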

E. Check Apache proxy (if backend healthy but browser still shows 500)

Validate Apache config and reload:

sudo apachectl -t
sudo systemctl reload apache2
# tail Apache error log
sudo tail -n 80 /var/log/apache2/error.log
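
To decide whether Apache or the backend is at fault, it can help to compare the status code seen directly on Tomcat with the one seen through the proxy. The sketch below assumes you have already captured both codes with curl; the function name locate_fault is hypothetical, and the public URL used for the proxy probe depends on your virtual host.

```shell
# Compare the status seen directly on Tomcat with the status seen through Apache.
# Args: backend_code proxy_code. Prints which layer to look at.
locate_fault() {
  case "$1:$2" in
    2??:2??) echo "both healthy" ;;
    2??:*)   echo "apache: backend answers but the proxy does not" ;;
    *)       echo "backend: fix Tomcat/Solr before touching Apache" ;;
  esac
}
```

Possible use: locate_fault "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/server/api)" "$(curl -s -o /dev/null -w '%{http_code}' http://localhost/server/api)", where http://localhost/server/api is an assumed proxy URL to replace with your site's address.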

3 - Diagnostic commands & log locations (collect these when reporting)

Collect and save these snippets whenever you open an incident; they show the most useful info.

# Processes & ports
ps -ef | grep -E 'tomcat|java|solr'
ss -ltnp | grep -E ':8080|:8983|:80|:443'

# Systemd status
sudo systemctl status solr tomcat10 apache2 --no-pager

# Key logs (tail last 200 lines)
sudo tail -n 200 /dspace/log/dspace.log
sudo tail -n 200 /var/log/tomcat10/catalina.out
sudo journalctl -u solr -n 200 --no-pager
sudo tail -n 200 /var/log/apache2/error.log

# Check disk, memory & OOM
df -h
free -m
sudo dmesg | grep -Ei 'killed process|oom|out of memory' || true

# Search for Exceptions quickly
sudo grep -iR "Exception" /dspace/log /var/log/tomcat10 2>/dev/null | tail -n 50

Save these outputs to a file and attach to a ticket if needed.
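
One way to save everything in a single pass is a small collector that writes each command's output under a labelled header. This is a sketch under assumptions: the function names capture and collect_diagnostics and the default /tmp output path are hypothetical, while the log paths are the ones listed above. It needs to run as root so the logs are readable.

```shell
# Run one command and append its output, under a labelled header, to a report.
# Args: report_file label command [args...]. A failing command is recorded,
# not fatal, so one missing log does not stop the rest of the collection.
capture() {
  cap_file=$1 cap_label=$2
  shift 2
  {
    echo "===== $cap_label ====="
    "$@" 2>&1 || echo "(command exited with status $?)"
    echo
  } >>"$cap_file"
}

# Collect the runbook's diagnostics into a single timestamped file
# and print the file name. Arg: [report_file].
collect_diagnostics() {
  cd_out=${1:-/tmp/dspace-diag-$(date +%F_%H%M%S).txt}
  capture "$cd_out" "listening ports" ss -ltnp
  capture "$cd_out" "service status"  systemctl status solr tomcat10 apache2 --no-pager
  capture "$cd_out" "dspace.log"      tail -n 200 /dspace/log/dspace.log
  capture "$cd_out" "catalina.out"    tail -n 200 /var/log/tomcat10/catalina.out
  capture "$cd_out" "solr journal"    journalctl -u solr -n 200 --no-pager
  capture "$cd_out" "apache error"    tail -n 200 /var/log/apache2/error.log
  capture "$cd_out" "disk"            df -h
  capture "$cd_out" "memory"          free -m
  echo "$cd_out"
}
```

Calling collect_diagnostics from a root shell prints the path of the report, which can then be attached to the ticket.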


4 - Common failure patterns & fixes

A. Solr fails to start: port already in use

Cause: manual Solr process running or another app occupying 8983. Fix:

  • Identify PID (sudo lsof -i :8983)
  • If it's Solr started manually, stop it via /opt/solr/bin/solr stop -all
  • Use systemctl start solr after cleaning up

B. Tomcat crashes / DSpace logs show startup exceptions

Cause: misconfiguration, missing DB, OOM, or incomplete deployment. Fix:

  • Inspect /var/log/tomcat10/catalina.out and /dspace/log/dspace.log for stacktrace

  • Common quick checks:

    • DB reachable (psql connectivity from server)
    • Disk full (df -h)
    • JVM OOM (check dmesg)
  • Redeploy war if missing:

    ls -l /var/lib/tomcat10/webapps/dspace.war
    # if missing, copy war and restart tomcat:
    sudo cp /path/to/dspace.war /var/lib/tomcat10/webapps/
    sudo systemctl restart tomcat10
    

C. Apache reverse proxy errors: "error reading status line from remote server"

Cause: backend closed connection (Tomcat crashed or refused). Fix:

  • Confirm Tomcat is alive and responding on 8080 using curl locally
  • Check Apache Proxy settings and ProxyTimeout, then reload Apache

D. Sudden outages after editing submission-forms.xml

If you edited DSpace config files, always restart Solr (if the change affects indexing) and then Tomcat, in that order, and check the logs for parsing errors. If malformed XML causes the app to throw at startup, revert to the backup.
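
The backup-and-revert habit can be sketched as three small helpers. This is an illustration rather than a DSpace tool: the function names are hypothetical, xmllint (from the libxml2-utils package) is assumed to be installed, and a well-formed file can still be rejected by DSpace for other reasons, so the logs remain the final check.

```shell
# Copy a config file to a timestamped backup next to it and print the backup path.
backup_config() {
  bc_src=$1
  bc_dst="$bc_src.bak.$(date +%F_%H%M%S)"
  cp -p -- "$bc_src" "$bc_dst" && echo "$bc_dst"
}

# Succeeds only if the file is well-formed XML (assumes xmllint is installed).
xml_ok() {
  xmllint --noout "$1"
}

# Restore a backup over the live file when the edited file fails to parse.
# Args: live_file backup_file
restore_if_broken() {
  if xml_ok "$1"; then
    echo "OK: $1 is well-formed"
  else
    echo "Broken XML in $1, restoring $2" >&2
    cp -p -- "$2" "$1"
  fi
}
```

A possible flow: bak=$(backup_config /path/to/submission-forms.xml), edit the file, then run restore_if_broken /path/to/submission-forms.xml "$bak" before restarting Tomcat.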


5 - Recommended safe restart script

Create /usr/local/bin/dspace-safe-restart.sh and make executable. This performs a safe stop/start and health checks.

#!/bin/bash
set -euo pipefail
LOG="/var/log/dspace-restart-$(date +%F_%H%M%S).log"
exec > >(tee -a "$LOG") 2>&1

echo "===== dspace-safe-restart started: $(date) ====="

echo "--- 1) Check current statuses ---"
systemctl is-active --quiet solr && echo "solr: active" || echo "solr: inactive"
systemctl is-active --quiet tomcat10 && echo "tomcat10: active" || echo "tomcat10: inactive"

echo "--- 2) Ensure no stray solr instance on 8983 ---"
if sudo lsof -i :8983 -Pn -sTCP:LISTEN >/dev/null 2>&1; then
  echo "Port 8983 in use by:"
  sudo lsof -i :8983 -Pn -sTCP:LISTEN
  echo "Attempting to stop solr via systemctl..."
  sudo systemctl stop solr || true
  sleep 3
  sudo -u solr /opt/solr/bin/solr stop -all || true
fi

echo "--- 3) Start Solr (systemd) ---"
sudo systemctl start solr
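# Solr can take longer than a few seconds to bind, so poll instead of checking once
solr_ok=0
for i in {1..12}; do
  if curl -sSf http://localhost:8983/solr/ >/dev/null 2>&1; then
    solr_ok=1
    break
  fi
  sleep 5
done
if [ "$solr_ok" -ne 1 ]; then
  echo "Solr not healthy after start; aborting."
  exit 1
fi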
echo "Solr healthy."

echo "--- 4) Restart tomcat10 ---"
sudo systemctl restart tomcat10
sleep 5

echo "--- 5) Wait for DSpace API ---"
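api_ok=0
for i in {1..18}; do
  # curl prints 000 itself when nothing answers; || true only keeps set -e from aborting
  status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8080/server/api || true)
  echo "$(date '+%F %T') HTTP $status"
  # match codes that start with 2 or 3 (the old pattern ^2|3 also matched 403 and 503)
  if [[ "$status" =~ ^[23] ]]; then
    echo "API OK."
    api_ok=1
    break
  fi
  sleep 5
done

if [ "$api_ok" -ne 1 ]; then
  echo "API did not come up; check catalina.out and dspace.log."
  exit 1
fi

echo "===== dspace-safe-restart completed: $(date) ====="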

Make executable:

sudo tee /usr/local/bin/dspace-safe-restart.sh >/dev/null <<'EOF'
# (paste script here)
EOF
sudo chmod +x /usr/local/bin/dspace-safe-restart.sh

Run with:

sudo /usr/local/bin/dspace-safe-restart.sh

6 - Forward-looking (prevent recurrence)

  1. Use only one method to manage Solr: either systemctl or /opt/solr/bin/solr. Prefer systemctl for production.

  2. Enable systemd auto-restart for solr/tomcat:

    # example override
    sudo mkdir -p /etc/systemd/system/solr.service.d
    sudo tee /etc/systemd/system/solr.service.d/override.conf >/dev/null <<'EOF'
    [Service]
    Restart=on-failure
    RestartSec=5
    EOF
    sudo systemctl daemon-reload
    sudo systemctl enable solr
    
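  3. Increase Solr ulimits (Solr prints a warning at startup when they are too low). Add limits in /etc/security/limits.conf:

    solr soft nproc 65000
    solr hard nproc 65000
    

    limits.conf is applied by PAM to login sessions, not to systemd services, so when Solr runs under systemd also set LimitNPROC=65000 in the [Service] section of the solr override.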

  4. Enable simple monitoring: a cron job or monitoring tool that calls http://localhost:8080/server/api/ and alerts if non-200.

  5. Avoid pkill -f solr: it's blunt and can leave locks. Use proper stop commands.

  6. Daily health check script (optional): curl checks for Solr and API, run from cron to notify on failure.

  7. Keep backups of edited config files (submission-forms.xml) before changing (e.g., cp submission-forms.xml submission-forms.xml.bak.$(date +%F_%T)).
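
The monitoring and daily health check items above can be sketched as one small script that probes Solr first and then the API, so the message points at the layer that failed. This is only a sketch: the function names are hypothetical, the 10-second timeout is an assumption, and how the alert is delivered (cron mail, a monitoring agent) is left to your setup.

```shell
# Map two HTTP status codes (Solr, DSpace API) to a one-line verdict.
# Returns 0 when both are healthy, 1 otherwise, so a cron wrapper can alert.
health_verdict() {
  case $1 in
    2??|3??) ;;
    *) echo "FAIL solr status=$1 (start Solr first, then Tomcat)"; return 1 ;;
  esac
  case $2 in
    2??) echo "OK solr=$1 api=$2"; return 0 ;;
    *)   echo "FAIL api status=$2 (Solr is up; check Tomcat and dspace.log)"; return 1 ;;
  esac
}

# Probe both endpoints and report; intended to be called from cron.
# curl prints 000 when nothing answers, which health_verdict treats as a failure.
dspace_health_check() {
  hc_solr=$(curl -s -o /dev/null -m 10 -w '%{http_code}' http://localhost:8983/solr/ || true)
  hc_api=$(curl -s -o /dev/null -m 10 -w '%{http_code}' http://localhost:8080/server/api || true)
  health_verdict "$hc_solr" "$hc_api"
}
```

Saved as a script that ends with a call to dspace_health_check, it exits non-zero on failure, which most schedulers and monitoring tools can turn into a notification.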


7 - Bash history (how to check & enable timestamps)

To view the current shell history:

history | tail -n 200

To view the root user's saved history:

sudo tail -n 200 /root/.bash_history

To enable timestamps for history going forward (add to /root/.bashrc and any dspace user shells):

# add these lines to ~/.bashrc
export HISTTIMEFORMAT="%F %T "
export HISTSIZE=10000
export HISTFILESIZE=20000
shopt -s histappend
PROMPT_COMMAND='history -a; history -n;'

This ensures future commands have timestamps and are appended in real time.


8 - Quick Apache proxy snippet (for reference)

Make sure your Apache proxy block for the API is correct:

ProxyPreserveHost On
ProxyRequests Off

ProxyPass        /server/api  http://localhost:8080/server/api
ProxyPassReverse /server/api  http://localhost:8080/server/api

# optional
ProxyTimeout 60

Validate with:

sudo apachectl -t
sudo systemctl reload apache2

TL;DR checklist you can paste on a sticky note

  1. curl -I http://localhost:8080/server/api
  2. sudo systemctl status solr && sudo systemctl status tomcat10
  3. If Solr down → sudo systemctl start solr → wait for http://localhost:8983/solr/
  4. Then sudo systemctl restart tomcat10 → wait for API.
  5. If ports in use: sudo lsof -i :8983 / sudo lsof -i :8080 → identify PID → stop correct process.
  6. Check logs: /dspace/log/dspace.log, /var/log/tomcat10/catalina.out, /var/log/apache2/error.log, journalctl -u solr
