QQ登录

只需一步,快速开始

 找回密码
 注册

QQ登录

只需一步,快速开始

查看: 1745|回复: 4

请教一个小问题:如何把html网页文件输出成txt文本格式?

[复制链接]
发表于 2005-8-18 08:56:46 | 显示全部楼层 |阅读模式
如何把html网页文件输出成txt文本格式?要求段落格式不变。望能人解答指点!
发表于 2005-8-18 10:57:29 | 显示全部楼层
[code:1]#! /bin/bash
# #############################################################################

       NAME_="html2txt"
    PURPOSE_="convert html file to ascii text; write the converted file to disk"
   SYNOPSIS_="$NAME_ [-vhmlr] <file> [file...]"
   REQUIRES_="standard GNU commands, lynx"
    VERSION_="1.1"
       DATE_="2004-04-18; last update: 2005-03-03"
     AUTHOR_="Dawid Michalczyk <[email protected]>"
        URL_="www.comp.eonworks.com"
   CATEGORY_="text"
   PLATFORM_="Linux"
      SHELL_="bash"
DISTRIBUTE_="yes"

# #############################################################################
# This program is distributed under the terms of the GNU General Public License

usage () {

echo >&2 "$NAME_ $VERSION_ - $PURPOSE_
Usage: $SYNOPSIS_
Requires: $REQUIRES_
Options:
     -r, remove input file after conversion
     -v, verbose
     -h, usage and options (help)
     -m, manual
     -l, see this script"
exit 1
}

manual () { echo >&2 "

NAME

    $NAME_ $VERSION_ - $PURPOSE_

SYNOPSIS

    $SYNOPSIS_

DESCRIPTION

    $NAME_ converts ascii files with html content to plain text. It replaces the
    previous suffix, if any, with a \"txt\" suffix. It skips the following files:

    - binary files
    - directories
    - files that already have the same name as <input_file>.txt

    Option -r, removes the input file after conversion.

EXAMPLES

    Use find with xargs to run the script recursively on multiple files. For
    example, to convert all html files to text recursively:

    find . -name \"*.html\" | xargs $NAME_

"; exit 1; }

# arg check
[ $# -eq 0 ] && { echo >&2 missing argument, type $NAME_ -h for help; exit 1; }

# var initializing
rm_html=
verbose=

# option and argument handling
while getopts vhmlr options; do

    case $options in
        r) rm_html=on ;;
        v) verbose=on ;;
        h) usage ;;
        m) manual | more; exit 1 ;;
        l) more $0 ;;
       \?) echo invalid or missing argument, type $NAME_ -h for help; exit 1 ;;
    esac

done

shift $(( $OPTIND - 1 ))

# check if required command is in $PATH variable
which lynx &> /dev/null
[[ $? != 0 ]] && { echo >&2 the required \"lynx\" command is not in your PATH variable; exit 1; }

# main
for a in "$@"; do

    file $a | grep -q text # testing if input file is an ascii file
    [ $? == 0 ] && is_text=0 || is_text=1

    if [ -f $a ] && [[ ${a#*.} != txt ]] && (( $is_text == 0 )); then

        if [ -f ${a%.*}.txt ]; then
            echo ${NAME_}: skipping: ${a%.*}.txt file already exist
            continue
        else
            [[ $verbose ]] && echo "${NAME_}: converting: $a -> ${a%.*}.txt"
            lynx -dump $a > ${a%.*}.txt
            [[ $? == 0 ]] && stat=0 || stat=1
            [[ $stat == 0 ]] && [[ $verbose ]] && [[ $rm_html ]] && echo ${NAME_}: removing: $a
            [[ $stat == 0 ]] && [[ $rm_html ]] && rm -f -- $a
        fi

    else

        [[ $is_text == 1 ]] && echo ${NAME_}: skipping: $a not an ascii file

    fi

done

[/code:1]
回复

使用道具 举报

发表于 2005-8-18 11:34:59 | 显示全部楼层
这shell“长”
回复

使用道具 举报

 楼主| 发表于 2005-8-18 15:52:37 | 显示全部楼层
多谢版主,我down下去好好研究。。。
回复

使用道具 举报

发表于 2005-8-24 01:18:09 | 显示全部楼层
curl?
回复

使用道具 举报

您需要登录后才可以回帖 登录 | 注册

本版积分规则

GMT+8, 2024-11-18 08:26 , Processed in 0.072362 second(s), 15 queries .

© 2021 Powered by Discuz! X3.5.

快速回复 返回顶部 返回列表