Wrangler User Guide
- Wrangler users are required to have an XSEDE User Portal (XUP) account, even for projects not allocated by XSEDE.
- Request an XSEDE Startup Allocation. XSEDE Startup Allocations are for one year and are typically granted within a week. We encourage users who want to test Wrangler or demonstrate the need for a larger allocation on the system to request an XSEDE Startup Allocation. Please visit the XSEDE Allocations page for more information on the process.
- Once the allocation is granted, log in to the Wrangler Data Portal using either your TACC or XSEDE credentials to manage the group of users who can use the allocation and to create reservations of blocks of time and/or high-speed storage to process and analyze data.
- While working with data on Wrangler, users can share data and results with colleagues and collaborators.
- Once results have been published or reviewed, users can publish data for use by the general research community.
Please see the Architecture section below for technical details about the makeup of the computing and storage capabilities of the Wrangler system.
The Wrangler Data Analysis and Storage system is designed for the needs of modern data researchers. Wrangler's unique architecture handles the volume, velocity, and variety that can make digital data research difficult on standard high performance systems. The system is designed around a 0.5 PB high speed flash storage system that can handle data analysis and processing workflows that are not practical on other systems with slower spinning disks or significantly smaller internal SSD storage devices.
Wrangler is dynamically provisioned by users in different ways to handle different data workflows, including databases (both relational database systems and the newer NoSQL style databases), Hadoop/HDFS based workflows (including MapReduce and Spark), and more custom workflows leveraging a flash-based parallel file system. In addition to data analysis and processing, Wrangler supports the data preservation, sharing, and publication needs of many data projects.
With two 10 PB file systems at TACC and Indiana University, Wrangler presents users with both iRODS and Globus based data management systems that can be used separately or together to store data and results for the duration of a research project, share those results with collaborators and colleagues, and eventually publish the data into systems such as DataOne.
The Wrangler Data Analytics System provides the following environments to support different workflows:
Computations on Wrangler will be constrained by three different components:
- amount of Flash memory needed for a project
- amount of RAM available on an individual node for computations
- throughput of the CPUs in each node
We will allocate the analytics system based on a combination of all three of these. Each node allocated provides users with two Haswell CPUs, 128 GB of DDR4 memory, and 4 TB of flash storage. These will be allocated as a single Node Unit for an hour. Projects should calculate the number of Node Units needed by looking at these three components and request the maximum number of nodes needed to support their research. CPU cycles for shared dedicated services (e.g. databases) will not be charged, so projects relying on such services should justify their allocations based on storage needs only.
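For example, a project may need to work with its 40 TB dataset for 3 different months in the year with a code that needs 80 cores, each with 5 GB of memory, for adequate performance. Its need is then 4 nodes of processing (80 cores at 5 GB each is 400 GB of memory, more than three 128 GB nodes provide) and 10 nodes' worth of flash storage (40 TB at 4 TB per node). The larger of the two, 10 nodes, held for three 30-day months (2,160 hours) gives a request of 21,600 Node Hours.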
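The arithmetic in the example above can be sketched as a short shell calculation. This is an illustrative sketch rather than an official allocation tool: the function names are hypothetical, the per-node figures are the ones quoted in this guide (24 cores, 128 GB of memory, 4 TB of flash), and a month is assumed to be 30 days.

```shell
#!/bin/sh
# Per-node resources quoted in this guide.
FLASH_TB_PER_NODE=4
MEM_GB_PER_NODE=128
CORES_PER_NODE=24

# ceil_div A B: integer ceiling of A / B (hypothetical helper).
ceil_div() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# node_hours DATA_TB CORES MEM_GB_PER_CORE HOURS
# Prints the Node Hours to request: the largest of the three
# per-component node counts multiplied by the hours needed.
node_hours() (
    flash_nodes=$(ceil_div "$1" "$FLASH_TB_PER_NODE")
    mem_nodes=$(ceil_div $(( $2 * $3 )) "$MEM_GB_PER_NODE")
    cpu_nodes=$(ceil_div "$2" "$CORES_PER_NODE")
    nodes=$flash_nodes
    if [ "$mem_nodes" -gt "$nodes" ]; then nodes=$mem_nodes; fi
    if [ "$cpu_nodes" -gt "$nodes" ]; then nodes=$cpu_nodes; fi
    echo $(( $nodes * $4 ))
)

# 40 TB of data, 80 cores at 5 GB each, three 30-day months (assumed).
node_hours 40 80 5 $(( 3 * 30 * 24 ))
```

Flash storage is the limiting component here (10 nodes against 4 for memory and 4 for cores), so it alone determines the request.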
The second portion of the allocation on Wrangler is for the long term disk storage. This storage is for holding the interim data needed by a project between their flash storage computation campaigns, for computations that can be carried out on the disk based storage, and for the long term housing, sharing, and exploration of data and results from the research. Users should request this as they do any other storage allocation. Users will have full read/write access to this storage, be it housed in databases, iRODS collections, or files in the Lustre file system for the full duration of the allocation. TACC reserves the right to remove write access to the storage once the allocation is completed.
Once a Wrangler allocation has been awarded, the Project PI or delegate must log into the Wrangler Data Portal to manage users on the allocation and select the services needed, e.g., storage, database instances, iRODS collection partitions, Hadoop reservations.
Wrangler is the first XSEDE system hosted at multiple sites. This configuration leads to some unique authentication mechanisms. Users accessing Wrangler via the XSEDE User Portal GSISSH mechanism will use their XSEDE credentials to log in. Access to both the Indiana University systems and the TACC systems via ssh and other protocols will use the user's TACC credentials.
Use ssh to connect to Wrangler directly at either the TACC or XSEDE endpoints with the respective credentials:
login1$ ssh -l TACC-username wrangler.tacc.utexas.edu
login1$ ssh -l XSEDE-username wrangler.iu.xsede.org
Wrangler can also be accessed using the Grid Sign In mechanisms supported by XSEDE using your XSEDE credentials and GSI certificates. The following commands authenticate using the XSEDE myproxy server, then connect to the GSISSH port 2222 on Wrangler at TACC:
localhost$ myproxy-logon -s myproxy.teragrid.org
localhost$ gsissh -p 2222 XSEDEfirstname.lastname@example.org
Last login: Wed Jun 24 11:02:00 2015
Wrangler LosF managed host
Provisioned on 26-Jan-2015 at 17:43
login1.wrangler(1)$
Alternately, users can log in to the XSEDE Single Sign-On Hub via their XSEDE credentials and then issue the gsissh command from that host:
localhost$ ssh XSEDEemail@example.com
[userid@gw69 ~]$ gsissh -p 2222 XSEDEfirstname.lastname@example.org
Last login: Wed Jun 24 11:02:00 2015
Wrangler LosF managed host
Provisioned on 26-Jan-2015 at 17:43
login1.wrangler(1)$
Users must use their XSEDE credentials to log in to the myproxy or XSEDE User Portal sessions. Please consult NCSA's detailed documentation on installing and using myproxy and gsissh, as well as the GSI-OpenSSH User's Guide, for more information.
Structured Query Language (SQL) database services, and the many related technologies grouped under the rubric "NoSQL", may be used in many different contexts: as part of data analysis workflows, for data collection and warehousing activities, and as part of more complex software stacks used to support persistent web or other data services. Wrangler is intended to support most application areas where database technologies may be useful, through either persistent database services or "ad hoc" database services which can run as part of a compute node reservation and job workflows. This documentation will primarily cover the provisioning and usage of a persistent database, though many of the details will also be applicable to ad hoc databases.
Wrangler provisions persistent databases at both the Indiana and TACC sites. Users who intend to employ the flash storage component of Wrangler should select "TACC" in the Wrangler Data Portal.
Wrangler currently supports Postgres (with or without spatial extensions) and MySQL databases as persistent services which can be automatically provisioned through the Wrangler Data Portal. Other database technologies, including NoSQL software tools such as MongoDB, are also supported on a request basis; if you need a persistent database service which is not available through the Wrangler portal, please submit a help ticket and in most cases we will be able to assist you with provisioning such a service on Wrangler.
Once your database is provisioned, there are many mechanisms that will allow you to access this database, from simple command-line SQL utilities to ODBC plug-ins for graphical tools such as Excel and SAS, APIs for various compiled and interpreted programming languages, and web front-ends. We will not attempt to exhaustively document all the potential interfaces to these tools, but will provide the basic information required to configure most connection utilities along with examples for some common workflows including the use of command-line utilities, and encourage users to check the documentation for their intended client tools for more information on connecting to a remote database.
Wrangler provides MySQL support through the MariaDB variant of MySQL. MariaDB is binary-compatible with MySQL but has a richer feature set than traditional MySQL. See http://www.mariadb.org for more information on MariaDB-specific features.
Wrangler will host instances of MySQL, Postgres/PostGIS, MongoDB, and other database technologies at both TACC and Indiana University. Databases will not support active transactional replication between the two sites, but replication of the database files will allow for fast service migration from one site to the other. Users will be responsible for their own database and its design, optimization and backups. The name of the database or schema will be defined in the Wrangler portal, and the username and password will be the same as those used to access other services at TACC and/or IU. Because sensitive information may be communicated over the database connection, SSL must be configured for all connections to the database.
Users with existing databases may upload to Wrangler as follows:
- Login to the Wrangler Data Portal: http://portal.wrangler.tacc.utexas.edu.
- Set up a database instance, specify storage needs, and designate the Database Administrator.
- A URL is generated for the new instance. A confirmation email will be sent containing directions on how to connect to the database and configure the local client.
You will need a few connection parameters in order to connect to your persistent database, primarily the hostname, network port, and database or schema name. The hostname of the database server will be supplied to you via e-mail once your database is provisioned, and can also be viewed in the Wrangler portal. Persistent databases will always be configured to utilize the default ports for each of the supported technologies:
- MySQL/MariaDB: 3306
- Postgres/PostGIS: 5432
- MongoDB: 27017
Since you will use your TACC (or IU) username and password to access persistent database services on Wrangler, it is very important that you connect to the database using a Secure Socket Layer (SSL) network connection. All major database clients will have an option to enable and to configure SSL connections to the database server; you may need to consult the documentation for your client application, or if you are using the standard MariaDB/MySQL and Postgres clients, example connection commands are listed below.
mysql -v --ssl --ssl-cipher=AES256-SHA -u TACC-username -h db1 dbname -p
psql "sslmode=require host=db1.wrangler.tacc.utexas.edu user=TACC-username dbname=dbname"
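For repeated psql sessions, the same connection settings can be kept in the standard libpq environment variables instead of being retyped. The following is a minimal sketch that assumes the placeholder hostname, username, and database name from the example above; substitute the values from your confirmation email.

```shell
# Standard libpq environment variables read by psql and other libpq clients.
# The values are the placeholders used in this guide, not real credentials.
export PGHOST=db1.wrangler.tacc.utexas.edu
export PGUSER=TACC-username
export PGDATABASE=dbname
export PGSSLMODE=require    # refuse to connect without SSL

# With these set, a bare "psql" connects over SSL and prompts for the password.
# Inside the session, the \conninfo meta-command reports the connection in use.
echo "psql will connect to $PGDATABASE on $PGHOST as $PGUSER (sslmode=$PGSSLMODE)"
```

Keeping sslmode in the environment means a forgotten option cannot silently downgrade the connection, which matters because the TACC password travels over it.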
Each PI will designate a Database Administrator (DBA) who will have full administrative privileges to manage the database and grant permissions for other users. All other users within the project will receive read-only access to the database, unless the DBA user grants them additional privileges.
By default all users on a project will have full read access to all tables, but only DBAs will have create/insert/alter/delete privileges. DBAs will also be responsible for enforcing any additional security implementations (e.g. restricting connection to the database from specific IP addresses or domains). TACC staff will be available to help new administrators work with the system.
For PostgreSQL, both a "public" and "restricted" schema are created under the project's database. The public schema has full privileges for all users associated with the project. The "restricted schema" implements the permissions described above. For additional information on how to change permissions for users, please consult the PostgreSQL documentation.
For MySQL/MariaDB, a single database is created, with the above permissions. The concepts of "schema" and "database" are one and the same. For additional information on how to change permissions for users, please consult the MariaDB documentation.
There are many different clients for connecting to the databases supported on Wrangler. As a starting point for new users, we recommend familiarizing yourself with these standard clients for database administration.
PostGIS is an extension of PostgreSQL that provides support for the storage and analysis of spatial data (GIS - Geographic Information System). It provides two new data types (geometry and geography) which may represent points, lines or polygons; and a large number of functions that can be used to manipulate and query those spatial data types. For example, a characteristic query might ask whether or not a particular point falls within the confines of a given polygon or whether or not a line and a polygon intersect. Other manipulations might project a data set from one set of coordinates to another or calculate the area of a collection of polygons (e.g. counties). A tutorial is provided here: http://workshops.boundlessgeo.com/postgis-intro/
Any regular instance of PostgreSQL may be converted into a PostGIS instance by issuing the SQL command:
CREATE EXTENSION postgis;
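The characteristic spatial queries described above (point in polygon, line intersecting polygon) might look like the following once the extension is enabled. This is a sketch only: the coordinates, the output column names, and the file name are made up for illustration, and the psql command in the final comment reuses the placeholder connection values from this guide.

```shell
# Write two illustrative PostGIS queries to a file.
sql_file=${TMPDIR:-/tmp}/postgis_example.sql
cat > "$sql_file" <<'EOF'
-- Does the point (5, 5) fall inside a 10 x 10 square?
SELECT ST_Contains(
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))'),
    ST_GeomFromText('POINT(5 5)')
) AS point_in_polygon;

-- Does a horizontal line intersect the same square?
SELECT ST_Intersects(
    ST_GeomFromText('LINESTRING(-5 5, 15 5)'),
    ST_GeomFromText('POLYGON((0 0, 0 10, 10 10, 10 0, 0 0))')
) AS line_intersects_polygon;
EOF
echo "wrote $sql_file"

# Run it against a PostGIS-enabled database, for example:
#   psql "sslmode=require host=db1.wrangler.tacc.utexas.edu user=TACC-username dbname=dbname" -f "$sql_file"
```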
Since Wrangler is designed for data analysis, many researchers will be working with large and/or persistent datasets. Wrangler lets users make system reservations that reserve all or part of the machine, specifying the number of nodes needed and the accompanying data storage, for a specified date and duration. Users wishing to invoke a Hadoop cluster can request a Hadoop Queue reservation, and users wishing to run one or a series of jobs against a persistent data set can request a Normal Queue reservation. The advantage of making a system reservation is the ability to perform numerous operations or runs on the persistent data for a period of up to 30 days. This obviates the need for constant staging and re-staging of large datasets over the same period. Users are responsible for transferring their data off the reserved storage before the reservation ends. A reservation also allows multiple users to access the same persistent data.
We strongly encourage all users to make service reservations. A service reservation will allocate a fixed set of nodes and accompanying flash storage resources for an extended period of time up to 30 days. This will allow for staging of data into the high-performance storage tier and for this data to persist over the course of many jobs during the course of the reservation. Users can reserve nodes and flash storage for Hadoop jobs or other non-HDFS based workflows or applications.
Users wishing to analyze persistent data for an extended period can make a reservation in the normal queue. These reservations are synonymous with the standard batch jobs a user would submit to the normal queue on Stampede. Like Stampede, Wrangler employs the SLURM job scheduler.
A reservation sets aside both the compute nodes and their associated flash storage for a project's sole use. Once the reservation begins, the storage for all of the nodes will be made available to all of the nodes in the reservation.
During this reservation, users will need to submit jobs to use the reserved nodes by adding the reservation information to the SLURM job script or command line. For example, to use the reservation made for the project named "big_project253" the SLURM jobs would be submitted with the following additional command line argument:
login1$ sbatch --reservation=big_project253 myscript
sbatch: Submitted batch job 65540
Alternately, the "idev" command can be used with the same "-r" argument to get interactive access to a compute node. Jobs will run on the reserved nodes when they are available. If members of a project are using all of the reserved nodes, new jobs will queue until they can begin.
For workflows using the flash storage systems with a reservation, users will be responsible for importing the data from the persistent disk based /data file system. Typically this can be done in the first (or subsequent) scripts run against the reservation or using the "idev" command to get an interactive session and import the data by hand. Similarly, it is the responsibility of the users to copy all data to be preserved from the jobs prior to the end of the reservation period. As the flash storage system is a limited commodity on Wrangler, data are not preserved between reservations. Thus we encourage users to make reservations to maximize their productivity by minimizing the need for frequent data migrations to and from this storage system.
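A job script that follows this pattern (import from /data, compute on flash, copy results back) might look like the sketch below. The reservation name is the one used in the example above; the flash and /data paths, the job name, the resource sizes, and the my_analysis program are hypothetical placeholders to be replaced with your own values.

```shell
# Write a SLURM job script sketch to a file.
job_script=${TMPDIR:-/tmp}/myscript
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -J flash_job                    # job name (hypothetical)
#SBATCH -o flash_job.%j.out             # output file; %j expands to the job ID
#SBATCH -p normal                       # queue in which the reservation was made
#SBATCH -N 1                            # number of reserved nodes to use
#SBATCH -n 24                           # total number of tasks
#SBATCH -t 02:00:00                     # run time limit (hh:mm:ss)
#SBATCH --reservation=big_project253    # same effect as the sbatch command-line flag

FLASH_DIR=/path/to/flash/workdir        # placeholder: flash location for your reservation
DATA_DIR=/data/path/to/project          # placeholder: your directory on /data

cp -r "$DATA_DIR/input" "$FLASH_DIR/"   # import data from disk to flash
cd "$FLASH_DIR"
./my_analysis input output              # placeholder analysis program
cp -r "$FLASH_DIR/output" "$DATA_DIR/"  # copy results back to disk
EOF
echo "wrote $job_script; submit it from a login node with: sbatch $job_script"
```

For a campaign of many jobs, the import step would normally appear only in the first script and the copy-back step only in the last, since the flash contents persist for the length of the reservation.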
- estimate size of your job
- storage needs for data and analysis during your computations
- computing/cores needed for analysis
- memory needed for analysis
For each node reserved, users will have access to
- 24 cores on a node,
- the 128 GB of memory on that node,
- and up to 4 TB of storage on the DSSD storage system.
Users should employ the debug queue or other mechanisms to estimate the memory footprint and cores needed to run their computations on a single dataset and scale that to find the total size needed. For some projects the duration of a reservation will correspond to the period the project needs for analysis of a dataset, which may be significantly longer than the pure processing time. As transferring data to and from different storage devices can often be a significant time sink, we encourage users to stage data in the system for longer periods of time than the individual computations will take, especially for more interactive or iterative workflows. Once the scale and duration of a reservation are known, users will request a reservation via the Wrangler Data Portal and specify:
- number of nodes required
- starting and ending time period of the reservation.
During the reservation period, users can submit analysis jobs as SLURM batch scripts to the nodes under the reservations. SLURM is documented extensively in the Stampede User Guide.
- Multiple users may submit jobs to the same reservation.
- Multiple jobs will be queued until there are enough resources available in the reservation.
The reservation also includes access to the flash storage. The flash based file system will be available for users of the same project to access during the reservation period.
Prior to the end of reservation period, the user will be responsible for migrating all the essential data from the flash file system to the local file system. Data is subject to purge immediately after the end of the reservation.
Users wishing to do analysis with a Hadoop cluster must also make reservations via the Wrangler Data Portal.
Create a Hadoop cluster reservation through the Wrangler Data Portal.
- Specify at least one of the following: the number of nodes required, the total distributed data storage in HDFS, or the expected amount of data to be processed. This information will be used by the reservation system to determine the number of data nodes needed for the reservation.
- Specify the reservation start and end time.
- To enable additional users to access the Hadoop cluster, specify a project; all of its members will be given access to the reserved Hadoop cluster.
- An email notification will be sent to the requestor containing instructions on how to access the cluster once the reservation has started.
- During the Hadoop cluster reservation period, all users in the project specified at the time of the reservation will have access to the Hadoop cluster. A user needs to submit a SLURM job to the reservation in order to access the Hadoop cluster and submit MapReduce jobs. The SLURM job can be an interactive job, a batch job, or a VNC session job. For example, a user can start an interactive session to submit a Hadoop job to the cluster or to move data between HDFS and NFS. Once a Hadoop job has been submitted to the Hadoop cluster, it is managed and scheduled by the YARN resource manager, which is part of the Hadoop cluster. The SLURM job session can be terminated once the user no longer needs to interact with the Hadoop cluster. A Hadoop analysis job that has been submitted to the YARN scheduler will run to completion even if the SLURM job session expires.
- The default Hadoop cluster includes the Hadoop core, Spark, HDFS, MapReduce, Hadoop Streaming, and Mahout packages. Users can install and/or request additional packages to be available with their Hadoop reservation.
- Prior to the end of the reservation period, the user is responsible for migrating all essential data from HDFS to the local file system. The HDFS created during the reservation will not be preserved or extended beyond the reservation period.
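The data movement around a Hadoop job (into HDFS at the start, out of HDFS before the reservation ends) can be wrapped in two small helpers. This is a sketch under stated assumptions: the helper names, all paths, and the jar and class names in the comments are hypothetical; only the hdfs dfs and hadoop jar commands themselves are standard Hadoop tools.

```shell
# hdfs_stage_in LOCAL_PATH HDFS_DIR: copy LOCAL_PATH from the local file
# system into the HDFS directory HDFS_DIR, creating HDFS_DIR if needed.
hdfs_stage_in() {
    hdfs dfs -mkdir -p "$2" && hdfs dfs -put "$1" "$2"
}

# hdfs_stage_out HDFS_PATH LOCAL_DIR: copy results out of HDFS to the local
# file system; run this before the reservation ends.
hdfs_stage_out() {
    hdfs dfs -get "$1" "$2"
}

# Typical order inside a SLURM session on the Hadoop reservation:
#   hdfs_stage_in  /data/path/to/project/input  /user/$USER
#   hadoop jar my_job.jar MyJob /user/$USER/input /user/$USER/output
#   hdfs_stage_out /user/$USER/output /data/path/to/project/
```

Because YARN keeps running a submitted job after the SLURM session ends, the stage-out step may need to be run from a later session once the job has finished.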
Use the showres utility to view all current Wrangler reservations:
login1$ showres -h
Usage: /usr/local/bin/showres [OPTION]
list current reservations in a clean format:
  -a : list reservation(s) for all users on system
  -u : list reservation(s) for a specific user
  -h : prints this message
login1$ showres -a
Reservation Name          State   Queue   #Nodes  Start Time   End time     Duration
hadoop+TG-CCR150011+817   ACTIVE  hadoop  2       09-14T14:05  10-14T14:05  30-00:00
dssd+TG-ASC150021+838     ACTIVE  normal  1       09-18T17:05  10-18T17:05  30-00:00
hadoop+WranglerTeam+896   ACTIVE  hadoop  4       09-29T15:45  10-08T15:45  9-00:00
Most data management tasks in Wrangler will require that your data be stored in iRODS. To get started with iRODS, log in to the Wrangler Data Portal and create a collection in iRODS; once the collection is created, you can begin loading data using an iRODS client or the iDrop web interface. Data will be checksummed as it is ingested, and the checksums will be saved for later comparison against checksums generated prior to transfer or at a later date, to ensure the fixity of your data over time. Audit logs of all iRODS activity are also collected, allowing for tracking of all operations on data stored in iRODS, who initiated the operations, and when they were performed. Additional data management functionality will be added over time, and custom policies can also be added to the iRODS system. Users with advanced data management needs or specific policies they are interested in employing are encouraged to contact the Wrangler team by submitting a ticket in the portal.
To create your iRODS collection, log in to the Wrangler portal and select the project with which the collection will be associated. In the project details page, you will see an "iRODS collections" table in the lower right corner of the page.
Click the "Create iRODS collection" button to start the process of setting up your iRODS collection. You will then be asked to give the collection a name. An iRODS "collection" is like a special version of a directory in a normal Unix file system - the collection corresponds to a path within the iRODS hierarchy, and this path can also be used to apply policies for data ingest, post-processing, auditing and so forth. Since the collection name will also be part of a path name used for accessing your data, we suggest you choose a short, descriptive name without spaces and other special characters that could cause problems when scripting or working with the path from the command line.
You may also select whether this collection will require public web access when creating the collection in iRODS; if this checkbox is selected, the collection will be placed in a special location that can also be accessed by a public web server, meaning that all files and subdirectories you store in this collection will be immediately available on the open web. If you do create a public web collection, we suggest that you also create another collection without open web access enabled; this will allow you to upload data, verify that you wish to share it, and then move it from the private to the public collection. If you wish to share data with only a limited subset of collaborators, you do not need to select the "public web access" option - this option is only needed if you wish to share your data without any limitations on access.
The most performant and functional mechanism for accessing iRODS is through the command-line utilities, known as the "i-commands". There is some general information on using the i-commands to perform various tasks in the TACC "iRODS User Guide" at https://portal.tacc.utexas.edu/software/irods. You can also utilize the iDrop web interface at https://web.wrangler.tacc.utexas.edu/idrop-web/ - there is additional documentation on using the iDrop web interface at https://portal.tacc.utexas.edu/tutorials/web-based-irods.
You can get access to the iRODS command line utilities on the Wrangler login and compute nodes by typing "module load irods". You will need to configure your environment for the Wrangler iRODS zone before attempting to connect; the simplest way to do this is to copy the example file into your home directory:
% mkdir ~/.irods
% cp /work/irods/irods_environment.json ~/.irods/irods_environment.json
Then open the file "~/.irods/irods_environment.json" in your favorite text editor and change "USERNAME" to your Wrangler username everywhere that it appears. Once this step is completed you can run "iinit" to log in to iRODS and begin managing your data. Your password will be the same as your TACC user portal and Wrangler login password.
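If you prefer not to edit the file by hand, the USERNAME substitution can be scripted. The helper below is a hypothetical sketch, not a TACC-provided tool; it assumes your username contains no "/" or "&" characters, and the demonstration file is a stand-in whose two settings are illustrative rather than a copy of the real example file.

```shell
# set_irods_username FILE NAME: replace every USERNAME placeholder in FILE
# with NAME. Assumes NAME has no "/" or "&", which are special to sed.
set_irods_username() {
    sed "s/USERNAME/$2/g" "$1" > "$1.new" && mv "$1.new" "$1"
}

# Demonstration on a stand-in file (the real example file has more settings).
demo=${TMPDIR:-/tmp}/irods_environment_demo.json
printf '{ "irods_user_name": "USERNAME", "irods_home": "/wranglerZ/home/USERNAME" }\n' > "$demo"
set_irods_username "$demo" "${USER:-myuser}"
cat "$demo"

# On Wrangler, after copying the example file as described above:
#   set_irods_username ~/.irods/irods_environment.json "$USER"
```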
You can use the iRODS utilities to retrieve data into the Flash storage before running a job or from within a job, and you can also store your data in iRODS at the end of a job using the same utilities. For example, at the beginning of your job script, you could put the following commands to retrieve a full directory of data:
% cd FLASH_JOB_DIRECTORY
% iget -r /wranglerZ/home/USERNAME/JOB_DATA
At the end of a job or campaign, you can use a similar command to store data back to iRODS - the following command will copy a whole directory and all its contents from the Flash storage into the iRODS system.
% iput -r FLASH_JOB_DIRECTORY /wranglerZ/home/USERNAME/JOB_OUTPUT_DIRECTORY/
Checksum and auditing functions are automatically enforced by the iRODS rule engine, so you do not need to explicitly specify that you wish to store checksums, though it does no harm to do so.
Wrangler provides services and tools for users to perform general and science domain specific data management and curation tasks on their datasets across the lifecycle of their projects, from data ingest to analysis and publication stages.
Through the Wrangler Data Portal, users will be able to track the size, growth, authenticity and integrity, and file format composition of their datasets.
Curation on Wrangler/iRODS: Users that have data stored within iRODS can create collections and add collaborators to their data projects in the Wrangler Data Portal. Within this space, the authenticity and integrity of the data is checked on a regular basis.
Using icommands, the iRODS Web interface, or the iDROP client, users can add annotations and metadata to their files. Such metadata is registered in the iCAT iRODS database.
iRODS has a rule engine that allows customizing and automating data management functions. Users can create scripts to make use of the iRODS rule engine for scheduling data, applying metadata, reorganizing data and other automated functions. User developed rules have to be implemented and tested on Wrangler's iRODS instance by our system administrators. Please submit a helpdesk ticket to consult with our team on how to create and implement iRODS rules.
Users that are ready to publish data stored in Wrangler can request a DOI by submitting a helpdesk ticket. Publishing data through the Wrangler Data Portal is especially useful for very large datasets that can be easily reused in Wrangler or other large-scale computational resources available through XSEDE.
Note that in order to be published, data should be complete and well described. Data published with a DOI will include a full DataCite metadata record, but users can add other descriptions, readme files, and papers to the publication package. Once the data is published with a DOI, it cannot be changed, amended or deleted. If the user wants to make a change to their data, then the new version of the dataset can be published under a new DOI and the relationship to the first version will be recorded as metadata.
If users decide at some point that they want to move their published data to another repository, they can do so by submitting a ticket with the new location of their data and we will make changes in the system so that the same DOI points to the right target.
Wrangler will provide a number of data curation software packages allowing you to perform common data management and curation tasks. For an exhaustive list of all software supported by Wrangler, see the TACC Software page.
Users logging into Wrangler directly (SSH) are presented with a choice of login shells. The shell interprets the command line as well as statements in shell scripts. Wrangler's default shell is BASH. Users requiring a different login shell may submit a support ticket requesting this account modification. After your support ticket is closed, please allow several hours for the change to take effect.
Unix shells allow users to customize their environment via startup files containing scripts. Customizing your environment with startup scripts is not entirely trivial. Below are some simple instructions, as well as an explanation of the shell set up operations.
TACC Bash users should consult the Bash Users' Startup Files: Quick Start Guide document for instructions on how best to set up the user environment.
All UNIX systems set up a default environment that provides administrators and users with the ability to execute additional UNIX commands to alter that environment. These commands are sourced; that is, they are executed by the login shell, and the variables (both normal and environmental), as well as aliases and functions, are included in the present environment. The Xeon E5 hosts on Stampede support the Bourne shell and its variants (/bin/zsh) and the C shell and its variants (/bin/tcsh). The Linux operating system on the Xeon Phi coprocessors supports only the Bash (/bin/bash) and Bourne (/bin/sh) shells. Each shell's environment is controlled by system-wide and user startup files. TACC deploys system-specific startup files in the /etc/profile.d/ directory. User owned startup files are dot files (begin with a period and are viewed with the "ls -a" command) in the user's home directory.
Each UNIX shell may be invoked in three different ways: as a login shell, as an interactive shell or as a non-interactive shell. The differences between a login and interactive shell are rather arcane. For our purposes, just be aware that each type of shell runs different startup scripts at different times depending on how it's invoked. Both login and interactive shells are shells in which the user interacts with the operating system via a terminal window. A user issues standard command-line instructions interactively. A non-interactive shell is launched by a script and does not interact with the user, for example, when a queued job script runs.
Bash shell users should understand that login shells, for example, shells launched via ssh, source one and only one of the files ~/.bash_profile, ~/.bash_login, or ~/.profile (whichever the command finds first in file-list order), and will not automatically source ~/.bashrc. Interactive non-login shells, for example shells launched by typing "bash" on the command-line, will source ~/.bashrc and nothing else.
TACC staff recommends that Bash shell users use ~/.profile rather than .bash_login. Please see Bash Users' Startup Files: Quick Start Guide.
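Because a Bash login shell does not read ~/.bashrc on its own, a common idiom is to have ~/.profile source it. The lines below are a minimal sketch of that idiom, not the content of the TACC Quick Start Guide; the function name is hypothetical and is used only to keep the logic in one place.

```shell
# Lines for ~/.profile: when the shell is Bash and ~/.bashrc exists, source it,
# so login shells pick up the same settings as interactive non-login shells.
# The BASH_VERSION check keeps the file safe for other Bourne-type shells.
load_bashrc() {
    if [ -n "${BASH_VERSION:-}" ] && [ -f "$HOME/.bashrc" ]; then
        . "$HOME/.bashrc"
    fi
}
load_bashrc
```

With this in place, aliases and functions can live in ~/.bashrc only, and environment variables that must be set once per login can stay in ~/.profile.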
You may also want to restrict yourself to POSIX-compliant syntax so both shells correctly interpret your commands.
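A minimal sketch of a POSIX-compliant startup fragment follows. The helper name path_prepend and the directory $HOME/wrangler-tools/bin are assumptions chosen for illustration; the point is that a "case" test on ":$PATH:" works in any Bourne-type shell and keeps the same directory from being added twice when startup files are sourced more than once.

```shell
# Prepend a directory to PATH only if it is not already present (POSIX syntax).
# path_prepend and the directory below are hypothetical names for this sketch.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already in PATH: do nothing
        *) PATH="$1:$PATH" ;;        # otherwise put it at the front
    esac
}

path_prepend "$HOME/wrangler-tools/bin"
path_prepend "$HOME/wrangler-tools/bin"   # second call is a no-op
export PATH
```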
The system-wide startup scripts,
/etc/profile for Bash and
/etc/csh.cshrc for C type shells, set system-wide variables such as
umask, and environment variables such as
$HOSTNAME and the initial
$PATH. They also source command scripts in the
/etc/profile.d/ directory that site administrators may use to set up the environments for common user tools (e.g.,
less) and system utilities (e.g., Modules, Globus).
Another important component of a user's environment is the set of environment variables. Many of the UNIX commands and tools, such as the compilers, debuggers, profilers, editors, and just about all applications that have GUIs (Graphical User Interfaces), will inspect the user's environment for application-specific variables. To see the variables in your environment execute the command:
login1$ env
The variables are listed as keyword/value pairs separated by an equal (=) sign.
Notice that the
$PATH environment variable consists of a colon (:) separated list of directories. Variables set in the environment (with
setenv for C-type shells and export for Bourne-type shells) are carried to the environment of shell scripts and new shell invocations, while normal shell variables (created with the set command) are useful only in the present shell. Only environment variables are displayed by the "
env" (or "
printenv") command. Execute "
set" to see the (normal) shell variables.
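The difference between normal and environment variables can be checked with a short experiment. This sketch uses Bourne-type syntax; the variable names scratch_note and PROJECT_NOTE are invented for the example.

```shell
# A normal shell variable stays in the current shell...
scratch_note="local only"
# ...while an exported variable is copied into the environment of child processes.
export PROJECT_NOTE="visible to children"

# The child shell sees PROJECT_NOTE but not scratch_note.
sh -c 'echo "child sees scratch_note=[$scratch_note] PROJECT_NOTE=[$PROJECT_NOTE]"'

# Print each directory in the colon-separated $PATH on its own line.
printf '%s\n' "$PATH" | tr ':' '\n'
```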
TACC continually updates application packages, compilers, communications libraries, tools, and math libraries. To facilitate this task and to provide a uniform mechanism for accessing different revisions of software, TACC uses the modules utility.
At login, modules commands set up a basic environment for the default compilers, tools, and libraries. For example: the
$LIBPATH environment variables, directory locations (e.g.,
$HOME), aliases (e.g.,
cdh) and license paths are set by the login modules. Therefore, there is no need for you to set them or update them when updates are made to system and application software.
Users that require third party applications, special libraries, and tools for their projects can quickly tailor their environment with only the applications and tools they need. Using modules to define a specific application environment allows you to keep your environment free from the clutter of all the application environments you don't need.
The environment for executing each major TACC application can be set with a module command. The specifics are defined in a modulefile, which sets, unsets, appends to, or prepends to environment variables (e.g.,
$MANPATH) for the specific application. Each modulefile also sets functions or aliases for use with the application. You only need to invoke a single command to configure the application/programming environment properly. The general format of this command is:
module load modulename
modulename is the name of the module to load. If you often need a specific application, see Controlling Modules Loaded at Login below for details.
Most of the package directories are in
$APPS and are named after the package. In each package directory there are subdirectories that contain the specific versions of the package.
As an example, the fftw3 package requires several environment variables that point to its home, libraries, include files, and documentation. These can be set in your shell environment by loading the fftw3 module:
login1$ module load fftw3
To view a synopsis about using an application in the Modules environment (in this case, fftw3), or to see a list of currently loaded modules, execute the following commands:
login1$ module help fftw3
login1$ module list
TACC's module system is organized hierarchically to prevent users from loading software that will not function properly with the currently loaded compiler/MPI environment (configuration). Two methods exist for viewing the availability of modules: Looking at modules available with the currently loaded compiler/MPI, and looking at all of the modules installed on the system.
To see a list of modules available to the user with the current compiler/MPI configuration, users can execute the following command:
login1$ module avail
This will allow the user to see which software packages are available with the current compiler/MPI configuration (e.g., Intel 15 with MVAPICH2).
To see a list of modules available to the user with any compiler/MPI configuration, users can execute the following command:
login1$ module spider
This command will display all available packages on the system. To get specific information about a particular package, including the possible compiler/MPI configurations for that package, execute the following command:
login1$ module spider modulename
During upgrades, new module files are created to reflect the changes made to the environment variables. TACC will generally announce upgrades and module changes in advance.
Each user's computing environment is initially loaded with a default set of modules. This module set may be customized at any time. During login startup, the following command is run:
login1$ module restore
This command loads the user's personal set of modules (if it exists) or the system default. To create a personal collection, load the modules you want, unload those you don't, and then run:
login1$ module save
This marks the collection as the user's personal default, which is loaded at every login. It is also possible to have named collections; run "module help" for more details.
There is a second method for controlling the modules loaded at login. The ".modules" file is sourced by the startup scripts at TACC and is read after the "module restore" command. This file can contain any list of module commands required. You can also place module commands in shell scripts and batch scripts. We do not recommend putting module commands in personal startup files (.bashrc, .cshrc), however; doing so can cause subtle problems with your environment on compute nodes.
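The collection workflow described in this section can be sketched as a short script. It uses only the module subcommands named above and is guarded so that it does nothing harmful where the module command is unavailable; the variable modules_found is a name invented for the sketch. Note that "module save" replaces your current default collection, so treat this as an illustration rather than something to run blindly.

```shell
# Sketch of building a personal default collection; only acts where "module" exists.
if command -v module >/dev/null 2>&1; then
    module load fftw3      # add a package you use regularly (fftw3 is the example used earlier)
    module list            # confirm what is currently loaded
    module save            # store the loaded set as your personal default collection
    module restore         # what the login scripts run: reload that collection
    modules_found=yes
else
    echo "module command not found; run this on a TACC login node" >&2
    modules_found=no
fi
```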
TACC's HPC staff have recently implemented the "
idev" application on Wrangler. The
idev utility provides interactive access to a single node and then spawns the resulting interactive environment to as many terminal sessions as needed for debugging purposes.
idev is simple to use, bypassing the arcane syntax of the srun command. Further
idev documentation can be found here: https://portal.tacc.utexas.edu/software/idev
In the sample session below, a user requests interactive access to a single node for 15 minutes in order to debug the progindevelopment application.
idev returns a compute node login prompt:
login1$ idev -m 15
...
--> Sleeping for 7 seconds...OK
...
--> Creating interactive terminal session (login) on master node c557-704.
...
c557-704$ vim myprog.c
c557-704$ make myprog
Now the user may open another window to run the newly-compiled application, while continuing to debug in the original terminal session:
WINDOW2 c557-704$ ibrun -np 24 ./progindevelopment
WINDOW2 ...program output ...
WINDOW2 c557-704$
Use the "
-h" switch to see more options:
login1$ idev -h
Please consult the "idev User Guide".
Jobs on Wrangler are charged differently than on other HPC resources. Wrangler introduces the metric of a NODE HOUR: the use of a node and its storage for a period of time. The user is charged for reserved nodes, whether those nodes are used for computation or not. When a reservation is cancelled, the unused portion is refunded. Jobs submitted without reservations are charged for the duration of the job.
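As a worked example of node-hour charging, the arithmetic below assumes a hypothetical reservation of 4 nodes for 12 hours at a rate of 1 SU per node hour. The reservation size and length are invented for illustration.

```shell
nodes=4          # hypothetical reservation size
hours=12         # hypothetical reservation length
rate=1           # SUs per node hour (assumed rate for this example)

# Reserved nodes are charged whether or not they are busy.
echo "charge: $(( nodes * hours * rate )) SUs"
```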
The Wrangler production queues and their characteristics (wall-clock and processor limits; priority charge factor; and purpose) are listed in Table 1 below.
Currently the compute nodes of Wrangler can only be connected to 200 TB of Flash Storage at any time. With this limitation, we are currently operating Wrangler as two smaller clusters, both attached to the same 200 TB of storage. Thus, there are two separate queues for processing using the DSSD. Your reservation will be made on one of the two, and you will need to submit jobs to the queue on which the reservation resides.
In addition to the
hadoop queues, a smaller
debug queue exists with four nodes at TACC for on-demand access to a Wrangler node or nodes for code development or debugging. These nodes cannot be reserved via the Portal and have a 4 hour maximum run time.
|Queue Name||Max Runtime||# Nodes in Queue||SU Charge Rate||Purpose|
| ||48 hrs||48||1 SU per node hour||access the GPFS flash storage|
| ||48 hrs||48||1 SU per node hour||run hadoop jobs|
| ||4 hrs||10||1 SU per node hour||debugging code|
|Queue Name||Max Runtime||# Nodes in Queue||SU Charge Rate||Purpose|
| ||5 days||23||1 SU per node hour||production|
Batch facilities such as LoadLeveler, LSF, SGE and SLURM differ in their user interface as well as the implementation of the batch environment. Common to all, however, is the availability of tools and commands to perform the most important operations in batch processing: job submission, job monitoring, and job control (cancel, resource request modification, etc.).
The SLURM job scheduler is documented extensively in the Stampede User Guide. Please refer to the respective sections in that guide for information on SLURM commands to view queue status, submit and monitor jobs and more.
The Parametric Launcher is a simple way to encapsulate many small tasks within a single SLURM job. The system starts as many individual tasks as it has capacity for and then continues to run subsequent tasks as others complete. The benefit is that one can easily run multiple tasks on a single node or a set of nodes and use the full CPU, memory, or IO capabilities of each node in a single managed job, rather than having to orchestrate the distribution of tasks oneself.
Follow the steps below to submit a parametric job:
Load the "
launcher" module to set some default parameters:
login1$ module load launcher
This sets the $TACC_LAUNCHER_DIR environment variable, which points to the launcher files. Copy the "launcher.slurm" script into your home directory or wherever you keep your SLURM scripts.
login1$ ls $TACC_LAUNCHER_DIR
hello      init_launcher  launcher.sge    paramlist  paramrun.sge    phiforward    README
hello.f90  launcher       launcher.slurm  paramrun   paramrun.slurm  phiparamlist  tskserver
login1$ cp $TACC_LAUNCHER_DIR/launcher.slurm .
Edit the "
launcher.slurm" script and make the following changes to customize for Wrangler:
#------------------Scheduler Options--------------------
#SBATCH -J Parametric          # Job name
#SBATCH -N 1                   # Total number of nodes (24 cores/node)
#SBATCH -n 24                  # Total number of tasks
#SBATCH -p normal              # Queue name
#SBATCH -o Parametric.o%j      # Name of stdout output file (%j expands to jobid)
#SBATCH -t 01:00:00            # Run time (hh:mm:ss)
...
...
#------------------General Options---------------------
...
export TACC_LAUNCHER_PPN=24
- Change the "-p" option to the queue on which your reservation resides (e.g., "normal").
- Set the "
-t" option to the maximum time the job is allowed to run.
- Set the "
-N" and "
-n" options to the number of nodes needed, and how many jobs a node can support, respectively. For most single threaded jobs "
-n" should be set to 24 unless the jobs have other memory or IO constraints when running in parallel on a single node.
- Change the $TACC_LAUNCHER_PPN value to 24.
- Change the "
Create and edit a new file, "
paramlist", with each line being a single command line task to be run
myprogram -i data1 -o output1 >& run1.out
myprogram -i data2 -o output2 >& run2.out
myprogram -i data3 -o output3 >& run3.out
myprogram -i data4 -o output4 >& run4.out
myprogram -i data5 -o output5 >& run5.out
myprogram -i data6 -o output6 >& run6.out
Submit the job:
login1$ sbatch launcher.slurm
The job will then run until all tasks have completed or the time limit set by the "-t" parameter expires. If there are more tasks than the number of nodes times the number of tasks per node, the launcher runs as many as it can and, as individual tasks complete, starts the next task in the list until all have been run. It is therefore not necessary for the number of tasks in paramlist to equal the number of nodes times the number of tasks per node; for many short tasks that each last minutes rather than hours, you can simply run on a small number of nodes for long enough to complete all the tasks.
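For long task lists it is usually easier to generate "paramlist" with a loop than to type it. The sketch below writes 100 tasks in the same form as the example above ("myprogram" and the file names are placeholders) and estimates how many rounds of tasks a single 24-task job would need, assuming every task takes about the same time.

```shell
# Generate one task per line; "myprogram" and the file names are placeholders.
: > paramlist
i=1
while [ "$i" -le 100 ]; do
    echo "myprogram -i data$i -o output$i >& run$i.out" >> paramlist
    i=$(( i + 1 ))
done

tasks=$(wc -l < paramlist | tr -d ' ')
slots=24           # total number of simultaneous tasks, i.e. the value given to -n
# Ceiling division: rounds needed if all tasks run for roughly the same time.
rounds=$(( (tasks + slots - 1) / slots ))
echo "$tasks tasks on $slots slots: about $rounds rounds"
```

Multiplying the number of rounds by the typical run time of one task gives a starting point for the "-t" value; leave some margin, since the last round may finish unevenly.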
Once the Hadoop cluster has been started at the beginning of the project's reservation, each user who will interact with the Hadoop cluster needs to submit a separate job to access a node in the reservation. Typically this job will be a simple idev session (launched from the command line on the login node, it automatically logs the user onto a node), but it can be any other batch processing session (e.g., a SLURM batch job that does the data upload, computation, and/or data retrieval for the user in a non-interactive session). Each job only needs to request one node from the reservation pool for the purpose of interacting with the Hadoop cluster.
Typically the user can simply type "idev" to get a login. The user needs to specify on the command line the reservation that has been made for the Hadoop cluster. For example:
login1$ idev -A myBigAllocation -r hadoopReservation2
will start an ssh session charged against the myBigAllocation project using the hadoopReservation2 reservation. Note that each user will be tied to a node of the Hadoop cluster and that only as many users as there are nodes in the Hadoop reservation will be able to log in to the system at a time. We encourage users to log in to a node to do their data ingest, job submission, and data retrieval, but not to stay logged in during idle time, so that the session is freed up for others to use.
Once a Hadoop application has been submitted to the YARN scheduler of the Hadoop cluster, the user can log out of the idev session if no further interaction is needed. Applications already submitted to the Hadoop cluster will run to completion independently of the SLURM job session.
Here is an example slurm job script file for submitting a Hadoop job.
#!/bin/bash
#SBATCH -J MyHadoop                    # Job name
#SBATCH -o myjob.%j.out                # Name of stdout output file
#SBATCH -p hadoop                      # Queue name
#SBATCH --reservation=myHadoopRes      # Name of the reservation to be used
#SBATCH -N 1                           # Total number of nodes requested (24 cores/node)
#SBATCH -n 1                           # Total number of MPI tasks requested
#SBATCH -t 01:30:00                    # Run time (hh:mm:ss) - 1.5 hours
#SBATCH -A MyAllocation                # Allocation name to charge job against

hadoop jar MyJarFile.jar MainClass args
Users can invoke
hadoop commands directly from the node they are logged into.
hadoop command [generic_options] [command_options]
Run a Java application:
login1$ hadoop jar MyJarFile.jar MainClass [args]
Run a Hadoop streaming job:
login1$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -mapper mapper_cmd \
    -reducer reducer_cmd \
    -input input_path_HDFS \
    -output output_path_HDFS
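Before submitting a streaming job, it can help to dry-run the mapper and reducer locally on a small sample. The sketch below imitates the streaming data flow with an ordinary pipeline, where "sort" stands in for the shuffle between the map and reduce phases. The word-count mapper and reducer are hypothetical stand-ins written as shell functions; in a real job they would be executables passed to -mapper and -reducer.

```shell
# Hypothetical mapper and reducer, written as shell functions for a local dry run.
mapper_cmd() {
    # Emit one word per line (each word acts as a key).
    tr -s ' ' '\n'
}
reducer_cmd() {
    # Input arrives sorted by key; count runs of identical keys.
    awk 'NR > 1 && $0 != prev { print prev "\t" n; n = 0 }
         { prev = $0; n++ }
         END { if (NR > 0) print prev "\t" n }'
}

printf 'b a\nb c\n' > sample_input.txt
# sort stands in for the shuffle phase between map and reduce.
mapper_cmd < sample_input.txt | sort | reducer_cmd
```

If the pipeline produces the expected key and count pairs on the sample, the same commands are more likely to behave when run against input in HDFS.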
In addition to running a ".jar" file directly with a Hadoop command, a user can also submit a job specified by a job script with the "mapred" command:
mapred [--config confdir] command
To submit a job specified by MyMR.job, a user can run the following:
login1$ mapred job -submit MyMR.job
To list the jobs currently running in the Hadoop cluster:
login1$ mapred job -list
To check the status of the job:
login1$ mapred job -status jobid
To kill an active MapReduce job:
login1$ mapred job -kill jobid
The default working directory in HDFS for each user is also the user's HDFS home directory, "/user/username". Please contact us if the directory is not available.
A user can interact with HDFS using
hdfs commands. Here are a few common commands:
List the current working directory contents:
login1$ hdfs dfs -ls
Copy file "foo.test" from the local file system to HDFS and rename it to "foo_in_hdfs.test":
login1$ hdfs dfs -copyFromLocal foo.test foo_in_hdfs.test
Copy file "foo_in_hdfs.test" from HDFS to the local file system and rename it to "foo_from_hdfs.test":
login1$ hdfs dfs -copyToLocal foo_in_hdfs.test foo_from_hdfs.test
Remove file "foo.test" in the current HDFS working directory:
login1$ hdfs dfs -rm foo.test
Please see the Hadoop command line guide for further information.
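The commands above can be combined into a small round-trip check that uploads a file, downloads it again, and compares the two local copies. This is a sketch: the file names follow the examples above, the variable hdfs_found is invented here, and the script only acts where the hdfs command is available.

```shell
# Round-trip a small file through HDFS and verify it; only acts where "hdfs" exists.
if command -v hdfs >/dev/null 2>&1; then
    echo "sample data" > foo.test
    hdfs dfs -copyFromLocal foo.test foo_in_hdfs.test            # upload
    hdfs dfs -copyToLocal foo_in_hdfs.test foo_from_hdfs.test    # download
    # cmp exits 0 only if the two local files are byte-for-byte identical.
    if cmp -s foo.test foo_from_hdfs.test; then
        echo "round trip OK"
    fi
    hdfs dfs -rm foo_in_hdfs.test                                # clean up the HDFS copy
    hdfs_found=yes
else
    echo "hdfs command not found; run this on a node in the Hadoop reservation" >&2
    hdfs_found=no
fi
```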
The Wrangler Data and Analytics system is comprised of three primary subsystems to accommodate data research:
- A 10 PB disk based storage system
- A cluster of 96 Intel Haswell based analytics servers
- A 0.5 PB shared flash storage system able to support data I/O at unprecedented rates across the analytics system
Each analytics node has 24 cores and 128 GB of volatile memory with both InfiniBand FDR and 40 Gb/s Ethernet connectivity. The flash storage system, a product of DSSD, will be capable of supporting throughput to the analytics cluster at rates of 1 TB/s and is able to support transactions at the rate of more than 200 million IOPS. This primary system will reside in Austin at TACC. In parallel to these systems, a replicated copy of the disk based file system and a smaller 24 node analytics cluster reside at Indiana University. Both systems are connected to Internet2 via the TACC and IU 100 Gb/s links, thus giving Wrangler a maximum potential network throughput of 200 Gb/s for ingesting and accessing data.
Supporting research on this hardware will be a software stack providing support for the transfer, access, curation, processing, analysis, and cataloging of data. Data transfer and management via Globus will allow users to transfer information in and out of the system at higher rates than standard single threaded transfer mechanisms. Storage at both sites will be set aside for the delivery of data and for the extraction, loading, and transformation of data prior to processing and analysis. Data can then be stored either directly within the replicated Lustre file system allocated to house users' working data and results, within an iRODS instance for users wanting a higher level of data curation and sharing opportunities, or within the RDBMS and noSQL based systems hosted on Wrangler.
Two primary environments for analytics will be supported on Wrangler. Users will be able to submit jobs in a UNIX batch scripting environment familiar to many users of other HPC systems. Support optimizing many commonly used data processing and analysis packages will be provided, including optimized versions of R, Python, as well as domain specific applications and packages such as BLAST, astroPy, and spider. Support for common libraries providing parallelism in both computation and IO will also be provided. Users will be able to leverage either or both storage systems as well as databases in their workflows.
In addition to the UNIX environment, Wrangler also provides users with a Hadoop environment. Built on the DSSD storage system, this somewhat unique environment will take advantage of both its speed and the all-to-all capability. Users will be able to allocate data nodes scaled to their processing needs rather than their storage needs. Support will also be provided for Pig and Mahout based jobs.
Access to the flash storage will be presented to the user in three different modes:
- A single common HDFS file system supporting Hadoop jobs
- A POSIX file system supporting workflows using POSIX file access
- An object store based on the flood API from DSSD. Details about the usage of this API will be provided for application developers.
Migration of data to and from the DSSD storage will be a key component of working with Wrangler. As this migration has the potential to add significant overhead to workloads using the storage, we intend to schedule usage of the Flash memory system in more of a campaign mode than one supporting individual jobs. Projects will be given a storage quota for a given period of time. These longer term reservations will be scheduled by TACC staff to ensure proper allocation of this resource.
The Lustre file systems at TACC and Indiana are two identical systems. Each houses 10 PB of 6 TB disks hosted in 35 Object Storage Servers (OSS). Each OSS is a Dell MD3220 and MD1220 storage arrays housing 48 300GB 15K RPM 6Gbps SAS drives in RAID-10 configuration at each site providing enough capacity to support more than 3 billion inodes at both TACC and IU. The Lustre filesystems at each site will use 34 OSSes and will provide more than 90 GB/s of performance. The Lustre Metadata Servers (MDS) are hosted on two Dell R720 servers with dual Intel Xeon E5-2680 processors. The two MDS will be configured to act as an active/passive failover pair. All storage is connected to the analytics cluster via the 54 Gb/s Mellanox FDR fabric with 120 lanes providing full all-to-all connectivity.
The analytic systems are Dell R730 servers each with two Intel Haswell E5-2680-v3 CPUs each with 12 cores, 128 GB of 1600 MHz DDR4 memory, and one 146 GB local SAS hard drive for the local OS and software installation.
Each of the 96 nodes located at TACC will connect to three network fabrics: a switched PCI environment to connect to the high speed storage tier, an InfiniBand (IB) fabric to connect to the bulk storage tier, and a 40 Gbps Ethernet (40GigE) fabric to connect to public networks. This subsystem is sized to take maximum advantage of the available bandwidth in the high speed DSSD subsystem. Each node will use one of its PCI Gen3 x16 slots to hold a card supporting 12 GB/s of bandwidth and 10 PCI Gen3 x4 ports, one to connect to each DSSD chassis. The other PCI slot in these nodes will hold a dual-port Mellanox ConnectX-3 FDR IB and 40GigE card. The 24 nodes at IU will have similar IB and 40GigE connectivity.
The nodes in this subsystem will serve multiple roles. They will function as data movers to move data between the subsystems as well as to users and other systems. They will provide embedded analytics capabilities to run user jobs directly on these compute platforms without the need to migrate large datasets outside of Wrangler. Finally, they will act as data servers to external applications that make use of the datasets hosted within the system.
|Compute Nodes (TACC)||dual CPU 12 Core Xeon E5-2680-v3||96 nodes/ 2304 cores|
|Memory (TACC)||Distributed DDR4||12.2 TB|
|Flash Storage||DSSD PCI attached NAND Flash||0.5 PB (200 TB Shared POSIX, 200 TB HDFS, 100 TB local attached/Object store)|
|Compute Nodes (IU)||dual CPU 12 core Xeon E5-2680-v3||24 nodes/576 cores|
|Shared Disk||Lustre, parallel file system||10 PB replicated, 34 OSS, 153 OST|
|Interconnect||InfiniBand Mellanox Switch||FDR 54 Gbit/s InfiniBand, 40 Gb/s Ethernet|
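The core and memory totals in the table follow directly from the per-node figures given earlier (24 cores and 128 GB per node). A quick arithmetic check, with variable names chosen only for this sketch:

```shell
tacc_nodes=96; iu_nodes=24; cores_per_node=24; mem_per_node_gb=128

echo "TACC cores:  $(( tacc_nodes * cores_per_node ))"          # 96 nodes x 24 cores
echo "IU cores:    $(( iu_nodes * cores_per_node ))"            # 24 nodes x 24 cores
echo "TACC memory: $(( tacc_nodes * mem_per_node_gb )) GB"      # roughly 12 TB in total
```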
Wrangler differentiates between two types of storage, persistent and reserved. Persistent storage, familiar to users as the standard HOME, WORK and SCRATCH file systems, each with their own quotas, backup and replication policies, is available to the user to store data between reservations on the system. Data on these file systems will not be purged during a project's allocation and, in the case of the project storage area, may be available for data retrieval even after the allocation has finished. These areas all have quotas, giving users a reliable storage area that will not be filled up by other projects on Wrangler. Persistent storage is meant to hold files, applications, and other items during the entirety of a project allocation. These areas are not tied to reservations or job periods and leverage Wrangler's 10+ PB of disk based storage at TACC and IU.
Reserved storage is temporary storage lasting for the duration of a job or reservation on the system. Reserved storage is intended for users to ingest data into, process, and extract results from, over the course of an individual job or project reservation. Data left in reserved storage is subject to immediate purge once the reservation has completed.
Each Wrangler user is granted their own HOME directory in the
/home file system with a quota of 50 MB. The path to this directory is stored in the
$HOME environment variable. This storage is local to the TACC and IU environments and is not replicated between the two sites. HOME is backed up by TACC and IU periodically. Use the
HOME filesystem for configuration files, source code, etc.
At TACC the shared work file system, called Stockyard, is mounted on /work. The location of this directory is stored in the user's
$WORK environment variable.
Each TACC resource isolates its own WORK file system on Stockyard. To access your WORK area on other systems, please use the
$STOCKYARD variable followed by the name of the system. Wrangler's WORK area is
$STOCKYARD/wrangler while maverick's is
$WORK/maverick). Due to past environments, the WORK area for Stampede is not
$STOCKYARD/stampede but simply $STOCKYARD. This space is not backed up by TACC or IU staff.
Each TACC user has access to 1 TB of storage across all systems at TACC. If a user has accounts on multiple TACC resources, e.g. Stampede, Maverick, then each account's WORK space usage counts toward the 1 TB quota.
Each user has an individual area on /data pointed to by
$DATA. This space is a shared environment, and space is allocated by a per-project quota. It is intended for users to store their own data and/or interim results prior to integration with the overall project storage. This area is not backed up to archival media, but is replicated between the TACC and IU storage systems. To check both the individual and group quota usage on /data, please use Lustre's "lfs quota" command:
login1$ lfs quota /data
Disk quotas for user ngaffney (uid 817221):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /data 41943132       0       0       -       3       0       0       -
Disk quotas for group G-814305 (gid 814305):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /data 41943144       0       0       -       6       0       0       -
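The usage figures are reported in kilobytes, which are hard to read at this scale. As a sketch, the awk one-liner below converts the /data line to GiB, treating 1048576 kbytes as 1 GiB. The sample text is the user block from the listing above; on Wrangler you would pipe the real "lfs quota /data" output into awk instead.

```shell
# Sample "lfs quota" output for one user, copied from the listing above.
quota_output='Disk quotas for user ngaffney (uid 817221):
     Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
          /data 41943132       0       0       -       3       0       0       -'

# On the /data line, field 2 is the usage in kilobytes.
printf '%s\n' "$quota_output" |
    awk '$1 == "/data" { printf "%.2f GiB used\n", $2 / 1048576 }'
```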
This storage is intended for projects to use to support data for the entire project. Projects will find their storage area in
/data/projects/projectname. To find the name or number of your project or projects, use the "groups" command. Each group available to a user reflects a project they are in. To understand which project relates to which group, please consult the TACC user portal. This storage is also part of the /data quota, and users should use the same lfs quota /data command to check their usage. Note that groups are responsible for managing their overall quota across all users' $DATA areas and their /data/projects area. This area is not backed up to archival media, but is replicated between the TACC and IU storage systems.
iRODS provides a managed data storage environment with additional functionality to help in the sharing, publishing, and preservation of data. With Wrangler, iRODS can be used to store data and have file sizes and checksums (also known as FIXITY) cataloged and periodically verified to ensure data integrity. Because data are stored within the system, they are not directly available to standard application file interfaces (e.g., you cannot open a file with standard C or Python file open commands). Users can put files into their project's iRODS space, organize them in a familiar directory structure, search for files based on specific metadata cataloged for each file, and retrieve files from iRODS to a local file system. iRODS also provides users with a web-based interface for file ingest, management, and retrieval as well as collaborative sharing features to share data with specific people or the public at large. All files stored in iRODS are replicated between TACC and IU.
Reserved storage is storage on the DSSD Flash storage system at TACC. Because this storage is limited to 0.5 PB and is shared by projects, it is intended to be used only for the duration of a service reservation. Note that hosted databases can also use this Reserved storage if needed. All usage of Flash storage is billed according to the amount of space set aside for the reservation or database schema, not the amount used. The rate is 1 SU per hour per 4 TB of allocated storage or per compute node in use, whichever is greater.
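A worked example of the "whichever is greater" rule follows. The reservation figures are hypothetical, and rounding a partial 4 TB block up to a whole unit is an assumption made for this sketch rather than a stated policy.

```shell
storage_tb=20    # hypothetical flash allocation in TB
nodes=2          # hypothetical number of compute nodes
hours=10         # hypothetical reservation length

# 1 SU per hour per 4 TB; rounding partial blocks up is an assumption.
storage_units=$(( (storage_tb + 3) / 4 ))

# Charge per hour is whichever is greater: storage units or node count.
if [ "$storage_units" -gt "$nodes" ]; then
    hourly=$storage_units
else
    hourly=$nodes
fi
echo "charge: $(( hourly * hours )) SUs"
```

With these numbers the storage term (5 units) exceeds the node count (2), so the reservation is charged on storage.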
This storage is setup and removed at the start and end of a Hadoop Cluster reservation.
Project space will be allocated at the start and deallocated at the end of a reservation. Data stored on the system after the end or termination of a reservation is subject to purge based on the needs of other projects using the system.
|Filesystem||Type||Quota||Size||Replicated||Env var||Persistent|
|/home|| ||50MB||6.3TB||no|| || |
|/data/projects||Lustre||allocation dependent||10PB||yes|| || |
|/gpfs/flash||GPFS||-||reservation dependent||no|| || |
Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions. There are several transfer mechanism for data to Wrangler, some of which depend on where and how the data are to be stored. Please review the following transfer mechanisms. The Data transfer from any Linux system can be accomplished using the Consult the man pages for more information on The An entire directory can be transferred from source to destination by using For more When executing multiple instantiations of TACC staff recommends the open-source Cyberduck utility for both Mac and Windows users that do not already have a preferred tool. Click on the "Open Connection" button in the top right corner of the Cyberduck window to open a connection configuration window (as shown below) transfer mechanism, and type in the server name " Once connected, you can navigate through your remote file hierarchy using familiar graphical navigation techniques. You may also drag-and-drop files into and out of the Cyberduck window to transfer files to and from Corral. Globus Connect and the Globus command line utilities provide users mechanisms for transferring files using the globus transfer protocols. Users can create a Globus Connect account, download the Globus Connect clients to install on their own systems, interact with the Globus Connect system and learn about all of the features of Globus Connect at the Globus site. XSEDE users may also use Globus' This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. 
To obtain a proxy, use the " Each where each XSEDE URL will generally be formatted: Note that The following command copies " Additional command-line transfer utilities supporting standard SSH and grid authentication protocols are offered by the Globus GSI-OpenSSH implementation of OpenSSH. The Users who need to transfer large amounts of data to Wrangler may find it worthwhile to disable to your command-line invocation. Note that not all machines support these options. You must explicitly connect to port 2222 on Wrangler. The following command copies " Please consult Globus' GSI-OpenSSH User's Guide for further info. There are several mechanisms for transferring data to and from Wrangler using iRODS. To access the iRODS command line utilities, please load the iRODS module. Please refer to the TACC iRODS software page for more information. While Wrangler is connected to Internet2 at both TACC and IU with a capacity of 100 GB/s, some projects or users may find their data volume is too great, or their local connectivity too slow to effectively transfer data to Wrangler. TACC staff can work with users to leverage Wranglers Data Dock, allowing users to send physical media to TACC to ingest onto the Wrangler system. For more information about the Data Dock or to set up transfer services, please file a ticket at the TACC User Portal, or contact TACC data staff at email@example.com for more information. TACC, XSEDE, and Indiana University offers several means of user support for Wrangler TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities.
Users often wish to collaborate with fellow project members by sharing files and data with each other. Project managers or delegates can create shared workspaces, areas that are private and accessible only to other project members, using UNIX group permissions and commands. Shared workspaces may be created as read-only or read-write, functioning as data repositories and providing a common work area to all project members. Please see Sharing Project Files on TACC Systems for step-by-step instructions.

There are several mechanisms for transferring data to Wrangler, some of which depend on where and how the data are to be stored. Please review the following transfer mechanisms.

The scp and rsync commands are standard data transfer mechanisms used to transfer moderate-size files and data collections between systems. These applications use a single thread to transfer each file one at a time. The scp and rsync utilities are typically the best methods when transferring gigabytes of data. For larger data transfers, parallel data transfer mechanisms, e.g., Globus, can often improve total throughput and reliability.

Data transfer from any Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server by using the command:

    localhost% scp filename \

Consult the man pages for more information on scp:

    login1$ man scp

The rsync command is another way to keep your data up to date. In contrast to scp, rsync transfers only the changed parts of a file (instead of transferring the entire file). Hence, this selective method of data transfer can be much more efficient than scp. The following example demonstrates usage of the rsync command for transferring a file named "myfile.c" from its current location on Stampede to Wrangler:

    login1$ rsync myfile.c \

An entire directory can be transferred from source to destination by using rsync as well. For directory transfers the options "-avtr" will transfer the files recursively ("-r" option) along with the modification times ("-t" option) and in the archive mode ("-a" option) to preserve symbolic links, devices, attributes, permissions, ownerships, etc. The "-v" option (verbose) increases the amount of information displayed during any transfer. The following example demonstrates the usage of the "-avtr" options for transferring a directory named "gauss" from the present working directory on Stampede to a directory named "data" in the $WORK file system on Wrangler.

    login1$ rsync -avtr ./gauss \

For more rsync options and command details, run the command "rsync -h" or:

    login1$ man rsync

When executing multiple instantiations of scp or rsync, please limit your transfers to no more than 2-3 processes at a time.

TACC staff recommends the open-source Cyberduck utility for both Mac and Windows users who do not already have a preferred tool.

Click on the "Open Connection" button in the top right corner of the Cyberduck window to open a connection configuration window (as shown below), select the transfer mechanism, and type in the server name "wrangler.tacc.utexas.edu" or "wrangler.uits.iu.edu". Add your username and password in the spaces provided. If the "more options" area is not shown, click the small triangle or button to expand the window; this will allow you to enter the path to your project area so that when Cyberduck opens the connection you will immediately see your data. Then click the "Connect" button to open your connection.

Once connected, you can navigate through your remote file hierarchy using familiar graphical navigation techniques. You may also drag-and-drop files into and out of the Cyberduck window to transfer files to and from Wrangler.

Globus Connect and the Globus command-line utilities provide users with mechanisms for transferring files using the Globus transfer protocols. Users can create a Globus Connect account, download the Globus Connect clients to install on their own systems, interact with the Globus Connect system, and learn about all of the features of Globus Connect at the Globus site.

XSEDE users may also use Globus' globus-url-copy command-line utility to transfer data between XSEDE sites. globus-url-copy, like Globus Connect, is an implementation of the GridFTP protocol, providing high speed transport between GridFTP servers at XSEDE sites. The GridFTP servers mount the specific file systems of the target machine, thereby providing access to your files or directories. Users can also use their own personal Globus endpoints to transfer data to and from their own systems.

This command requires the use of an XSEDE certificate to create a proxy for passwordless transfers. To obtain a proxy certificate, use the "myproxy-logon" command with your XSEDE User Portal (XUP) username and password. The proxy is valid for 12 hours for all logins on the local machine. On Wrangler, the myproxy-logon command is located in the CTSSV4 module (not loaded by default).

    login1$ module load CTSSV4
    login1$ myproxy-logon -T -l XSEDE-username

Each globus-url-copy invocation must include the name of the server and a full path to the file. The general syntax looks like:

    globus-url-copy [options] source_url destination_url

where each XSEDE URL is generally a gsiftp:// URL that names the GridFTP server and gives the full path to the file.

Note that globus-url-copy supports multiple protocols, e.g., HTTP and FTP, in addition to the GridFTP protocol. Please consult the Globus documentation for more information.

The following command copies "directory1" from TACC's Wrangler to Indiana University's Mason system, renaming it to "directory2". Note that when transferring directories, the directory path must end with a slash ("/"):

    login1$ globus-url-copy -r -vb \

Additional command-line transfer utilities supporting standard SSH and grid authentication protocols are offered by the Globus GSI-OpenSSH implementation of OpenSSH. The gsissh, gsiscp and gsisftp commands are analogous to the OpenSSH ssh, scp and sftp commands. Grid authentication is provided to XSEDE users by first executing the myproxy-logon command (see above).

Users who need to transfer large amounts of data to Wrangler may find it worthwhile to disable gsiscp's default data stream encryption. To do so, add the following three options:

    -oTcpRcvBufPoll=yes -oNoneEnabled=yes -oNoneSwitch=yes

to your command-line invocation. Note that not all machines support these options. You must explicitly connect to port 2222 on Wrangler. The following command copies "file1" on your local machine to Wrangler, renaming it to "file2":

    localhost$ gsiscp -oTcpRcvBufPoll=yes -oNoneEnabled=yes -oNoneSwitch=yes \
        -P2222 file1 wrangler.tacc.xsede.org:file2

Please consult Globus' GSI-OpenSSH User's Guide for further info.

There are several mechanisms for transferring data to and from Wrangler using iRODS. To access the iRODS command line utilities, please load the iRODS module.

    login1$ module load irods

Please refer to the TACC iRODS software page for more information.
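The guidance earlier in this section, to run no more than 2-3 scp or rsync processes at a time, can be enforced by a small wrapper script rather than by hand. The following Python sketch shows one possible way to do this. The function names, the MAX_PARALLEL constant, and the injectable runner argument are hypothetical and are not part of any TACC tooling; the "-avtr" options are the ones described for rsync directory transfers.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Upper bound taken from the guide's "no more than 2-3 processes" advice.
MAX_PARALLEL = 3


def build_rsync_command(source, destination, options="-avtr"):
    """Return the argument list for one rsync directory transfer."""
    return ["rsync", options, source, destination]


def run_transfers(sources, destination, max_parallel=MAX_PARALLEL,
                  runner=subprocess.call):
    """Run one rsync per source, never more than max_parallel at once.

    Returns the exit codes in the same order as sources. The runner
    argument (hypothetical) exists so the logic can be tried without
    actually starting rsync.
    """
    if not 1 <= max_parallel <= MAX_PARALLEL:
        raise ValueError("max_parallel must be between 1 and %d" % MAX_PARALLEL)
    commands = [build_rsync_command(src, destination) for src in sources]
    # The pool size caps how many rsync processes exist at any moment.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))
```

A call such as run_transfers(["./gauss", "./runs"], "username@wrangler.tacc.utexas.edu:data") would queue both directories but keep the number of simultaneous rsync processes within the limit; the directory names and destination path here are placeholders.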
While Wrangler is connected to Internet2 at both TACC and IU with a capacity of 100 Gb/s, some projects or users may find that their data volume is too great, or their local connectivity too slow, to transfer data to Wrangler effectively. TACC staff can work with users to leverage Wrangler's Data Dock, which allows users to send physical media to TACC for ingestion onto the Wrangler system. For more information about the Data Dock or to set up transfer services, please file a ticket at the TACC User Portal or contact TACC data staff at firstname.lastname@example.org.
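Whether a network transfer is practical or physical media is the better choice comes down to simple arithmetic on data volume and sustained link speed. The sketch below is a rough estimator only. The function name and the default efficiency of 0.5 (the fraction of the nominal link rate actually sustained) are assumptions made for illustration, not measured Wrangler figures.

```python
def transfer_time_hours(data_terabytes, link_gigabits_per_s, efficiency=0.5):
    """Estimate the hours needed to move data over a network link.

    efficiency is the assumed fraction of the nominal link rate that a
    real transfer sustains (a guess, not a measured Wrangler figure).
    """
    if link_gigabits_per_s <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link rate must be positive and efficiency in (0, 1]")
    data_bits = data_terabytes * 1e12 * 8            # decimal terabytes to bits
    effective_bits_per_s = link_gigabits_per_s * 1e9 * efficiency
    return data_bits / effective_bits_per_s / 3600.0
```

Under these assumptions, moving 100 TB over a 1 Gb/s local connection takes roughly 444 hours (about 18.5 days), which is the kind of case where sending media through the Data Dock may be worth discussing with TACC staff.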
TACC, XSEDE, and Indiana University offer several means of user support for Wrangler.
TACC resources are deployed, configured, and operated to serve a large, diverse user community. It is important that all users are aware of and abide by TACC Usage Policies. Failure to do so may result in suspension or cancellation of the project and associated allocation and closure of all associated logins. Illegal transgressions will be addressed through UT and/or legal authorities.
*Wrangler is generously funded by the National Science Foundation (NSF) through award ACI-1447307, "Wrangler: A Transformational Data Intensive Resource for the Open Science Community".